Volume No. 12, 2003/2004

ANIMAL MODELING

Evaluating the Use of Statistical Decision Trees for Modeling Avian Habitats and Regional Range Distributions in the Great Plains

Milda R. Vaitkus, Geoffrey M. Henebry, Brian C. Putz, and James W. Merchant

Center for Advanced Land Management Information Technologies (CALMIT), School of Natural Resources, University of Nebraska, Lincoln

Introduction

Attempts to regionalize species models by mosaicking range distributions produced by neighboring state Gap Analysis projects have been problematic. Variations in habitat modeling result in significant differences in predicted species distributions within and across state lines. Additionally, there is a decided knowledge gap between the spatial and temporal scales used by biogeographers and wildlife managers. Using national geospatial data to map surrogates of habitat, the Nebraska Gap Analysis Project (NE-GAP) examined whether the use of statistical decision trees might help solve these problems.

Methods

We generated regional distributions of 20 selected breeding birds in the six-state GAP Great Plains region (IA, KS, MN, ND, NE, SD) using three recursive partitioning algorithms: QUEST (Quick, Unbiased, & Efficient Statistical Trees; Loh and Shih 1997; Shih 2002); CART (Classification And Regression Trees; Breiman et al. 1984; De’ath and Fabricius 2000) as an implementation within QUEST; and CRUISE (Classification Rule with Unbiased Interaction Selection & Estimation; Kim and Loh 2000, 2001). Breeding Bird Survey (BBS) route-level summaries (Sauer et al. 2003) over two time periods (the last 10 and 30 years) were used for the occurrence data (presence/absence and abundance), while environmental variables were developed from National Land Cover Data (Vogelmann et al. 1998), Daymet daily climatic means and variances (Thornton and Running 1999), State Soil Geographic (STATSGO) soil texture, and National Elevation Data.

Models were developed on a hexagonal grid produced by the EPA’s Environmental Monitoring and Assessment Program (EMAP) with a cell resolution across the Great Plains of approximately 40 km². This coverage was intersected with each variable data set to create hexagonal coverages containing averaged values, area-weighted average values, or compositional vectors for each hexagon. These coverage variables were then intersected with the BBS occurrence data. Multiple statistical decision trees were generated for each target species to evaluate the relative strengths and weaknesses of the different algorithms. These statistical trees were then pruned to provide model generality and inverted across the study area to obtain predicted habitat distributions (Figure 1).
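
The area-weighted averaging step described above can be illustrated with a minimal Python sketch. The function name, the input layout, and the sample numbers are hypothetical and chosen only for illustration; the original analysis was carried out with GIS coverages, not with this code.

```python
def area_weighted_mean(pieces):
    """Return the area-weighted mean of one variable within one hexagon.

    `pieces` is a list of (overlap_area, value) pairs, one pair per source
    polygon that intersects the hexagon (a hypothetical input layout).
    """
    total_area = sum(area for area, _ in pieces)
    if total_area <= 0:
        raise ValueError("hexagon has no overlapping area")
    return sum(area * value for area, value in pieces) / total_area


# Hypothetical hexagon covered by three soil polygons:
# (overlap area in km^2, percent sand in that polygon).
hexagon_pieces = [(10.0, 20.0), (25.0, 40.0), (5.0, 60.0)]
sand_pct = area_weighted_mean(hexagon_pieces)
print(sand_pct)  # 37.5
```

A plain average of the three values would give 40.0; weighting by overlap area pulls the result toward the polygon that covers most of the hexagon.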

Predicted range of Yellow-billed Cuckoo in Great Plains

Figures 1a and b. Yellow-billed Cuckoo 10 yr (1a) and 30 yr (1b) CART statistical decision trees and the associated predicted range distributions, shown in gray.

Algorithms were compared on the basis of speed of tree identification, interpretability of the cross-validated tree, and plausibility of the range distribution predicted from the tree. Model performance was evaluated by (1) calculating the proportion of species occurrences explained at the first model branch; (2) examining visually how well each model corresponded to published species distributions; (3) assessing correspondence of the model to the spatial distribution of the BBS data; and (4) measuring the computational time required to generate a tree.

Results

CART’s exhaustive search of state space took much longer to generate a tree than CRUISE or QUEST (Table 1, Figure 2). All algorithms failed to generate model trees for species with large numbers of observations and/or relatively even distributions across the region (e.g., American Crow, n > 40,000); thus, only 12 of the original 20 species produced sufficient numbers of trees for comparative analysis. CRUISE used fewer observations (number of routes = 340) in its analysis than CART or QUEST (number of observations/route).
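
To illustrate why an exhaustive search is costly, the following Python sketch finds a single CART-style split by trying every variable and every candidate threshold and keeping the one with the lowest weighted Gini impurity. It is a generic textbook illustration, not the CART implementation within QUEST that was used in this study; the function names and the four made-up hexagons are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Exhaustively search all variables and thresholds for the split that
    minimizes the weighted Gini impurity of the two child nodes.

    Returns (variable_index, threshold, weighted_impurity).
    """
    n = len(rows)
    best = (None, None, gini(labels))
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Made-up hexagons: (mean spring vapor pressure, % evergreen forest).
rows = [(400.0, 5.0), (450.0, 1.0), (600.0, 4.0), (650.0, 2.0)]
labels = [0, 0, 1, 1]  # species absent (0) or present (1)
split = best_split(rows, labels)
print(split)  # (0, 525.0, 0.0)
```

The work grows with the number of variables times the number of distinct values per variable, and it is repeated at every node, which is in keeping with the longer CART processing times reported in Table 1.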

Table 1. Comparison of processing times (CPU-minutes) of the different algorithms. QUEST and CART use the number of observations (# Obs); CRUISE uses the number of routes (# Routes = 340).

| Common Name | # Obs 10yr | # Obs 30yr | QUEST 10yr | QUEST 30yr | CART 10yr | CART 30yr | CRUISE 10yr | CRUISE 30yr |
|---|---|---|---|---|---|---|---|---|
| Baltimore Oriole | 10,343 | 30,075 | 12.2 | 18.3 | 933.9 | 2643.5 | 0.13 | 0.12 |
| Black Tern | 5,713 | 9,797 | 9.3 | 20.5 | 428.3 | 627.6 | 0.12 | 0.14 |
| Brown Thrasher | 53,139 | 125,960 | 58.8 | 76.7 | 5343.1 | no tree | 0.16 | 0.12 |
| Gray Catbird | 5,085 | 11,755 | 4.4 | 9.5 | 1058.0 | 888.7 | 0.15 | 0.13 |
| Great-crested Flycatcher | 4,597 | 11,078 | 10.8 | 26.8 | 83.5 | 934.9 | 0.13 | 0.14 |
| Lark Sparrow | 5,025 | 10,069 | 7.6 | 18.3 | 737.0 | 782.6 | 0.13 | 0.11 |
| Northern Cardinal | 9,689 | 27,114 | 17.4 | 62.7 | 339.0 | 778.0 | 0.10 | 0.09 |
| Northern Harrier | 1,593 | 3,369 | 1.1 | 2.1 | 64.3 | 232.4 | 0.13 | 0.13 |
| Red-bellied Woodpecker | 2,498 | 5,162 | 3.4 | 8.0 | 174.2 | 254.2 | 0.12 | 0.11 |
| Tree Swallow | 6,628 | 6,628 | 11.5 | 23.5 | 598.6 | 954.4 | 0.14 | 0.13 |
| Upland Sandpiper | 12,537 | 12,537 | 15.5 | 29.3 | 189.5 | 4492.7 | 0.14 | 0.14 |
| Yellow-billed Cuckoo | 3,268 | 3,268 | 5.3 | 12.1 | 199.6 | 859.7 | 0.13 | 0.11 |
| Average | | | 8.6 | 21.4 | 382.6 | 1130.8 | 0.1 | 0.1 |

Example of a CART decision tree

Figure 2. Variation explained at first model branch as a function of computational cost for generation of entire tree.

Trees built from 30 yr data explained a higher percentage of observations at the first model branch than those built from 10 yr data: 30 yr models averaged 97%, 98%, and 67%, versus 94%, 95%, and 60% for 10 yr models, for QUEST, CART, and CRUISE, respectively. CRUISE model explanation (avg. 64%) was significantly lower at the first model branch than that of either QUEST (avg. 96%) or CART (avg. 97%), although the computational costs of the CRUISE models were significantly less (Figure 2). Fewer observations and a different configuration of data (routes) between the 10 yr and 30 yr data sets led to differences in inverted geographic ranges (Figure 1). Determining whether this result is due to population trends or to less data will require further analysis.

Of the 72 modeling attempts reported in Table 2, three produced no tree. In the first branch of the remaining 69 statistical trees, eight models used land cover, eight used soils or terrain, 15 used water vapor pressure or precipitation, 16 used insolation, and 22 used temperature or frost-free days. Thus, 77% (n=53) of the 69 models relied on climatological variables, and of these, 28% (n=15) used climatic variability to model species distribution at the first step of statistical partitioning. Sixty-two percent of the climatological models emphasized the transitional seasons of spring and fall (n=33) over summer (n=12) or winter (n=8). CART and CRUISE selected insolation variables more frequently than QUEST (Table 2), which in turn selected variables more readily interpretable in terms of ecophysiological constraints on bird populations, such as the interannual variability in frost-free days.
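
The percentages in the paragraph above can be checked with a short tally. The sketch below uses only the counts reported in the text; the variable names are hypothetical, and the denominator of 69 (72 attempts minus 3 with no tree) is inferred from the reported 77%.

```python
# First-branch counts as reported in the text (72 modeling attempts).
first_branch = {
    "no tree": 3,
    "land cover": 8,
    "soils or terrain": 8,
    "vapor pressure or precipitation": 15,
    "insolation": 16,
    "temperature or frost-free days": 22,
}
climate_keys = [
    "vapor pressure or precipitation",
    "insolation",
    "temperature or frost-free days",
]

attempts = sum(first_branch.values())                  # all attempts
trees = attempts - first_branch["no tree"]             # attempts that produced a tree
climate = sum(first_branch[k] for k in climate_keys)   # climatological models

cv_models = 15  # climatological models split on climatic variability (from the text)
seasons = {"spring and fall": 33, "summer": 12, "winter": 8}

pct_climate = round(100 * climate / trees)
pct_variability = round(100 * cv_models / climate)
pct_transitional = round(100 * seasons["spring and fall"] / sum(seasons.values()))
print(attempts, trees, climate)                        # 72 69 53
print(pct_climate, pct_variability, pct_transitional)  # 77 28 62
```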

Table 2. Synopsis of variable types selected in the first branch of the generated statistical trees (Sp = spring, Su = summer, Fa = fall, Wi = winter).

| Common Name | QUEST 10yr | QUEST 30yr | CART 10yr | CART 30yr | CRUISE 10yr | CRUISE 30yr |
|---|---|---|---|---|---|---|
| Baltimore Oriole | % Evergreen Forest | % Evergreen Forest | Mean Su Insolation | % Evergreen Forest | Terrain | Terrain |
| Black Tern | Mean Wi Vapor Pressure | CV¹ Wi FFD² | Mean Fa Insolation | CV Sp min Air Temp | % Emergent Herbaceous Wetlands | CV Sp avg Air Temp |
| Brown Thrasher | N/T³ | % Evergreen Forest | % Evergreen Forest | N/T³ | % Evergreen Forest | Land Covers |
| Gray Catbird | Terrain | Terrain | Terrain | Terrain | Mean Su Insolation | Mean Su Insolation |
| Great-crested Flycatcher | Mean Su Insolation | Terrain | Mean Su Insolation | Mean Su Insolation | Mean Fa max Air Temp | Mean Su Insolation |
| Lark Sparrow | Mean Fa max Air Temp | Soils | CV Wi avg Air Temp | Mean Su Insolation | CV Wi min Air Temp | CV Wi min Air Temp |
| Northern Cardinal | Mean Sp FFD | CV Su FFD | Mean Sp Vapor Pressure | Mean Sp Vapor Pressure | Mean Sp Vapor Pressure | Mean Sp Vapor Pressure |
| Northern Harrier | Mean Sp Vapor Pressure | Mean Sp Precipitation freq | Mean Sp Insolation | Mean Sp Precipitation freq | CV Fa max Air Temp | CV Sp Insolation |
| Red-bellied Woodpecker | Mean Wi Air Temp | CV Fa FFD | Mean Sp Vapor Pressure | Mean Sp Vapor Pressure | Mean Sp Vapor Pressure | Mean Sp Vapor Pressure |
| Tree Swallow | Mean Wi FFD | Mean Wi FFD | Mean Sp Insolation | Mean Sp Insolation | Mean Wi max Air Temp | N/T³ |
| Upland Sandpiper | Mean Su FFD | Mean Su FFD | Mean Su Insolation | Mean Sp min Air Temp | CV Fa Insolation | CV Fa Insolation |
| Yellow-billed Cuckoo | CV Fa FFD | CV Fa FFD | CV Fa min Air Temp | Mean Fa Vapor Pressure | Mean Fa Vapor Pressure | Mean Fa Vapor Pressure |

¹ Coefficient of variation

² Frost-free days

³ No tree was generated by the model

Conclusions

1. Unbiased variable selection in QUEST and CRUISE appeared to facilitate the rapid identification of parsimonious, robust models and plausible range distributions.

2. QUEST trees were generally preferable to CRUISE trees because the latter algorithm relied only upon presence/absence at the route level, while the former considered data on route-level abundance.

3. Developing habitat models using statistical trees generated from species occurrence data and environmental variables can lend a greater degree of objectivity to the modeling process, but there is still considerable subjectivity in the pruning stage that is needed for model generality (Henebry et al. 2001, Holland et al. 2002).

Acknowledgements

This work was supported in part through the GAP Research Project Evaluating the Use of Statistical Decision Trees for Modeling Avian Habitat and Regional Range Distribution from Occurrence Data and Environmental Variables.

Literature Cited

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. Classification and regression trees. Wadsworth and Brooks/Cole, Monterey, California. 358 pp.

De’ath, G., and K.E. Fabricius. 2000. Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81:3178-3192.

Henebry, G.M., B.C. Putz, and J.W. Merchant. 2001. Modeling reptile and amphibian range distributions from species occurrences and landscape variables. Gap Analysis Bulletin 10:22-24.

Holland, A.K., G.M. Henebry, B.C. Putz, M.R. Vaitkus, and J.W. Merchant. 2002. Modeling avian habitat from species occurrence data and environmental variables: Assessing the effects of land cover and landscape pattern. Gap Analysis Bulletin 11:25-27.

Kim, H., and W.-Y. Loh. 2000. CRUISE User Manual. University of Wisconsin-Madison, Department of Statistics, Technical Report 989, November 10, 2000.

Kim, H., and W.-Y. Loh. 2001. Classification trees with unbiased multiway splits. Journal of the American Statistical Association 96:589-604.

Loh, W.-Y., and Y.-S. Shih. 1997. Split selection methods for classification trees. Statistica Sinica 7:815-840.

Sauer, J.R., J.E. Hines, and J. Fallon. 2003. The North American Breeding Bird Survey, Results and analysis 1966-2002. Version 2003.1. USGS Patuxent Wildlife Research Center, Laurel, Maryland.

Shih, Y.-S. 2002. QUEST User Manual. Department of Mathematics, National Chung Cheng University, Taiwan, April 17, 2002.

Thornton, P.E., and S.W. Running. 1999. An improved algorithm for estimating incident daily solar radiation from measurements of temperature, humidity, and precipitation. Agricultural and Forest Meteorology 93:211-228.

Vogelmann, J., T. Sohl, and S. Howard. 1998. Regional characterization of land cover using multiple sources of data. Photogrammetric Engineering and Remote Sensing 64:45-57.