The use of classification trees (CT) for land cover mapping is becoming increasingly common (Hansen et al. 1996, Lawrence and Wright 2001, Pal and Mather 2003, Brown de Colstoun et al. 2003). Classification trees, sometimes called decision trees or CART (Classification and Regression Trees), offer several advantages over classification algorithms traditionally used for land cover mapping. One advantage is the ability to use both categorical and continuous predictor data sets with different measurement scales. Other advantages include the ability to handle nonparametric training and predictor data, good computational efficiency, and an intuitive hierarchical representation of discrimination rules. Classification trees use multiple explanatory variables to predict a single response variable.
A major challenge with using classification trees for land cover mapping lies in spatially applying the rules generated from the CT software within a geographic information system (GIS). StatMod for ArcView 3.X was developed by Christine Garrard at Utah State University with the purpose of interfacing the CT tools available in S-PLUS with ArcView GIS (Garrard 2002). StatMod is available free and can be downloaded, with an accompanying user’s guide, from http://www.gis.usu.edu/~chrisg/avext/.
StatMod provides the option to automatically submit jobs to S-PLUS, in which case all interactions with S-PLUS are through an ArcView dialog box. Alternatively, StatMod allows manual creation of tree models in S-PLUS, which can thereafter be spatially applied in ArcView. This flexibility extends the functionality of StatMod to a range of users, from those with little experience using S-PLUS to experienced S-PLUS users. A basic knowledge of ArcView is, however, necessary to successfully use StatMod.
The response variable, or training theme, is represented by training sites distributed throughout the study area and may be either a point or polygon theme. The training theme must have an attribute field containing codes or descriptions for the land cover classes to be modeled using the CT. Explanatory variables are spatial data layers from which CT rules will be generated to predict the spatial distribution of land cover. Examples of explanatory variables include individual satellite image bands, band transformations such as NDVI (Normalized Difference Vegetation Index), a digital elevation model, and topographic aspect or geology GIS data sets. When the training variable is a point theme, explanatory variables can be either polygon or grid themes. When the training variable is a polygon theme, explanatory variables must be grid themes.
Once a training theme and multiple explanatory themes are added to the View, the associated value for each training site is obtained by intersecting the training theme through the explanatory themes. When the training theme is a point theme, the value in each training site is the intersected value taken from the explanatory themes. When the training theme is a polygon theme, StatMod provides a choice of statistics such as mean, maximum, or majority value to characterize each training site. Refer to Figure 1 for a graphic depiction of how the response (dependent) variable and explanatory (independent) variables are identified in the StatMod dialog box.

Figure 1. StatMod dialog box for Classification and Regression Trees.
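The way training-site values are drawn from the explanatory themes can be illustrated with a minimal Python sketch. It is not StatMod code: the function names, the small grids, and the use of row/column cell indices in place of map coordinates are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def value_at_point(grid, row, col):
    """Return the grid value under a point training site (hypothetical helper)."""
    return grid[row][col]


def summarize_polygon(grid, cells, statistic="mean"):
    """Summarize the grid cells covered by a polygon training site."""
    values = [grid[r][c] for r, c in cells]
    if statistic == "mean":
        return mean(values)
    if statistic == "maximum":
        return max(values)
    if statistic == "majority":
        # Most frequent value among the covered cells.
        return Counter(values).most_common(1)[0][0]
    raise ValueError(f"unsupported statistic: {statistic}")


# Invented example grids: a continuous layer and a categorical layer.
elevation = [
    [1500, 1520, 1540],
    [1600, 1620, 1640],
    [1700, 1720, 1740],
]
landform = [
    [1, 1, 2],
    [1, 2, 2],
    [3, 2, 2],
]

# A point site takes the value of the cell it falls in.
point_value = value_at_point(elevation, 1, 2)

# A polygon site covers several cells and is reduced to one statistic.
polygon_cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
polygon_mean = summarize_polygon(elevation, polygon_cells, "mean")
polygon_majority = summarize_polygon(landform, polygon_cells, "majority")
```

A mean or maximum suits continuous layers such as elevation, while a majority value is the natural choice for categorical layers such as landform.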
The CT algorithm determines the appropriate characteristics of the response variable by recursively splitting the explanatory data into increasingly homogeneous groups (Figure 2), producing a hierarchical tree composed of “rules” defining the characteristics of each response category (Figure 3). CT models are commonly overfitted to the training data; that is, the CT algorithm recursively splits the data until rules are generated for specific training sites rather than entire response categories. Once an overfitted tree is generated, it can be reduced in size to create a tree that is neither precisely fitted to the training data nor so general that it is not meaningful. S-PLUS offers two methods for reducing tree size: “pruning” and “shrinking.”

Figure 2. Example of four cover types discriminated by elevation and Fall NDVI.

Figure 3. Discrimination rules from Figure 2 presented as a classification tree.
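A single step of the recursive splitting described above can be sketched in Python as follows. The sketch assumes a deviance-based measure of node homogeneity and searches one explanatory variable for the threshold that yields the most homogeneous pair of child groups. The function names and the sample values are hypothetical, and this is not the S-PLUS implementation.

```python
import math
from collections import Counter


def deviance(labels):
    """Multinomial deviance of a group of class labels (0.0 for a pure group)."""
    n = len(labels)
    return 2.0 * sum(c * math.log(n / c) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, summed child deviance) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, deviance(labels))  # no split: keep the parent deviance
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # cannot split between identical values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        total = deviance(left) + deviance(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Invented training sites: elevation in metres and a cover label.
site_elevation = [1500, 1600, 1700, 2400, 2500, 2600]
site_cover = ["sagebrush", "sagebrush", "sagebrush", "aspen", "aspen", "aspen"]

threshold, child_deviance = best_split(site_elevation, site_cover)
```

A full tree repeats this search over every explanatory variable and then recurses into each child group, which is how rules eventually come to describe individual training sites when the tree is grown without limit.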
The best tree reduction method is typically chosen by iteratively growing and reducing tree models, evaluating deviance or misclassification error rates, and testing different predictor variables and pruning or shrinking criteria. StatMod provides a convenient interface allowing the user to choose one of several methods of controlling tree size (Figure 4). These include a one standard error rule, Akaike’s information criterion (AIC), the size or number of tree nodes, and a cost complexity parameter. For more detailed information on options for controlling tree size, refer to the StatMod user’s guide (Garrard 2002b).

Figure 4. StatMod dialog box used to control tree size.
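The idea behind a cost complexity parameter can be sketched in Python under a common formulation, in which each candidate subtree is scored as its deviance plus a penalty k for every terminal node. The candidate sizes and deviances below are invented, the function names are hypothetical, and the exact criterion applied by StatMod and S-PLUS may differ.

```python
def cost_complexity(deviance, size, k):
    """Penalized score: fit (deviance) plus k for every terminal node."""
    return deviance + k * size


def select_tree_size(candidates, k):
    """Pick the size of the candidate subtree with the lowest penalized score.

    candidates is a list of (terminal_nodes, deviance) pairs.
    """
    best = min(candidates, key=lambda c: cost_complexity(c[1], c[0], k))
    return best[0]


# Invented pruning sequence: larger trees fit the training data better.
candidates = [(1, 400.0), (5, 220.0), (20, 150.0), (70, 120.0)]

size_no_penalty = select_tree_size(candidates, k=0)
size_moderate = select_tree_size(candidates, k=2)
size_heavy = select_tree_size(candidates, k=20)
```

With no penalty the largest tree always wins because it fits the training data best; raising k trades that fit for a smaller, more general tree.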
The mapping area comprises 5 million acres in Utah’s High Plateau region situated on the western edge of the Rocky Mountains. Vegetation cover includes basin big sagebrush at lower elevations, with expanses of pinyon-juniper communities at mid-elevations. Upper montane communities include Douglas-fir, aspen, and ponderosa pine, and at higher elevations spruce/fir mixes, aspen, and tundra dominate. Barren areas are present in the southeastern edge of the mapping area, which borders Utah’s slickrock country.
Approximately 3,800 training samples were available for the mapping area. All training samples were labeled with one of seven NLCD (National Land Cover Database) class codes. These correspond to Barren Lands, Deciduous Forest, Evergreen Forest, Mixed Forest, Shrub/Scrub, Grassland/Herbaceous, and Woody Wetlands. Twenty percent of the sample sites were randomly selected and withheld for accuracy assessment.
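Withholding a random share of the sample sites can be sketched in Python as shown below. The site identifiers, the site count, the fixed seed, and the function name are assumptions for illustration and do not reproduce the selection actually used in this study.

```python
import random


def split_holdout(site_ids, fraction=0.2, seed=0):
    """Randomly withhold a fraction of sites for accuracy assessment."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    n_holdout = round(len(site_ids) * fraction)
    holdout = set(rng.sample(site_ids, n_holdout))
    training = [s for s in site_ids if s not in holdout]
    return training, sorted(holdout)


# Hypothetical site identifiers, not the study's actual sample.
site_ids = list(range(1000))
training_sites, withheld_sites = split_holdout(site_ids)
```

Keeping the withheld sites out of model building is what makes the later accuracy assessment independent of the fit to the training data.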
Predictor layers used for the classification tree included a digital elevation model, a raster landform model, and Enhanced Thematic Mapper (ETM) bands 1-5 and 7 (converted to grids) for a summer and fall date. Using StatMod, a classification tree was created using default S-PLUS model parameters. The tree was pruned to optimal size using Akaike’s information criterion (AIC).
StatMod produced the predicted land cover map and a text file with an .smg extension. The .smg file reports the predictor variables used in the construction of the tree, the number of terminal nodes produced, and the misclassification error rate. It also contains a textual presentation of the rules that comprise the classification tree. For this study, all predictor variables were used, and the tree had 70 terminal nodes. The tree had a misclassification rate of 0.19, meaning that 81% of the training data were correctly predicted by the classification tree.
The predicted map was produced as a grid and was displayed in the active View. Attribute information stored in separate fields in the .vat of the grid includes the predicted land cover class, the probability of correct classification (one minus the misclassification rate), and the calculated deviance for each grid cell. Figure 5 shows the probability values associated with each cell for the predicted grid. Low (black) to high (white) probabilities of correct classification are displayed using a graduated color ramp. It should be noted that “probability” is based on misclassification rates determined by model fit and not from an independent data source.

Figure 5. Probability of correct classification ranging from low (black) to high (white).
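Spatially applying tree rules amounts to evaluating the same rule set at every grid cell. The Python sketch below illustrates this with a two-level rule set on elevation and fall NDVI. The thresholds, the per-node probabilities, and the function names are invented for illustration and are not the rules produced for this study.

```python
def classify_cell(elevation, fall_ndvi):
    """Hypothetical rule set; thresholds and probabilities are invented."""
    if elevation < 2000:
        if fall_ndvi < 0.3:
            return "Shrub/Scrub", 0.90
        return "Evergreen Forest", 0.70
    if fall_ndvi < 0.2:
        return "Barren Lands", 0.85
    return "Deciduous Forest", 0.80


def apply_rules(elevation_grid, ndvi_grid):
    """Evaluate the rules cell by cell, returning class and probability grids."""
    classes, probabilities = [], []
    for elev_row, ndvi_row in zip(elevation_grid, ndvi_grid):
        results = [classify_cell(e, n) for e, n in zip(elev_row, ndvi_row)]
        classes.append([label for label, _ in results])
        probabilities.append([p for _, p in results])
    return classes, probabilities


# Invented 2 x 2 explanatory grids with matching cell alignment.
elevation_grid = [[1800, 2500], [1900, 2600]]
ndvi_grid = [[0.1, 0.6], [0.5, 0.1]]

class_grid, probability_grid = apply_rules(elevation_grid, ndvi_grid)
```

Every cell that reaches the same terminal node receives the same class and the same probability, which is why the probability grid reflects model fit at each node rather than an independent accuracy estimate.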
StatMod also provides a convenient tool for assessing accuracy with a traditional error matrix and kappa calculation. Using the withheld 20% of the sample data, the Kappa tool in StatMod was used to intersect 712 withheld sample sites through the predicted land cover map. When polygon sample data are used, the tool assumes a correct classification when the majority of cells in the predicted map agree with the sample polygon. Overall accuracy was 75% with a kappa statistic of 0.67. User’s accuracies were as follows: Barren Lands (72%), Deciduous Forest (81%), Evergreen Forest (79%), Mixed Forest (55%), Shrub/Scrub (64%), Grassland/Herbaceous (76%), and Woody Wetlands (47%).
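The quantities reported from an error matrix can be computed as in the following Python sketch, which uses the standard definitions of overall accuracy, Cohen's kappa, and user's accuracy. The two-class matrix is invented and is not the error matrix from this study; the function name and the convention that rows hold mapped classes and columns hold reference classes are assumptions.

```python
def accuracy_report(matrix):
    """Overall accuracy, kappa, and user's accuracies from an error matrix.

    matrix[i][j] counts sites mapped as class i whose reference class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(n))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

    observed = diagonal / total
    # Agreement expected by chance from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)

    # User's accuracy: share of sites mapped as a class that truly belong to it.
    users = [matrix[i][i] / row_totals[i] for i in range(n)]
    return observed, kappa, users


# Invented two-class error matrix for illustration only.
error_matrix = [
    [40, 10],
    [5, 45],
]
overall, kappa, users_accuracy = accuracy_report(error_matrix)
```

Kappa is lower than overall accuracy because it discounts the agreement that would be expected by chance given the class totals.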
StatMod provides an easy-to-use and inexpensive tool for spatially applying the classification rules generated from the CT algorithm in S-PLUS. While the focus of this article was to use StatMod for classification trees, StatMod functions in a similar manner for regression trees. Classification trees are appropriate for discriminating distinct classes such as land cover. In a regression tree, the response variable is a continuous numeric field such as percent canopy cover. In addition to interfacing with S-PLUS for classification and regression trees, StatMod can be used to interface with SAS® to create and spatially apply logistic regression models.
Brown de Colstoun, E.C., M.H. Story, C. Thompson, K. Commisso, T.G. Smith, and J.R. Irons. 2003. National Park vegetation mapping using multitemporal Landsat 7 data and a decision tree classifier. Remote Sensing of Environment 85:316-327.
Garrard, C.M. 2002. StatMod: A tool for interfacing ArcView GIS with statistical software to facilitate predictive ecological modeling. Master of Science Thesis. Utah State University.
Garrard, C.M. 2002b. StatMod Zone user’s guide. WWW URL: http://www.gis.usu.edu/~chrisg/avext/
Lawrence, R.L., and A. Wright. 2001. Rule-based classification systems using Classification and Regression Tree (CART) Analysis. Photogrammetric Engineering and Remote Sensing 67:1137-1142.
Pal, M., and P.M. Mather. 2003. An assessment of effectiveness of decision tree methods for land cover classification. Remote Sensing of Environment 86:554-565.