Animal Modeling
Efforts to protect a species or its habitat usually require that we know where it occurs and its likely abundance. Where management actions might harm a species of concern, it may be important to be sure that the species is absent from or rare at particular sites. The development of population models that are used for management often requires information on the distribution and abundance of species (Akçakaya et al. 1995). Broad-scale conservation planning, such as that embodied by the Gap Analysis Program, also depends on information on the presence or abundance of species within large geographic areas (Possingham et al. 2000). Therefore, maps of the distribution and abundance of species are important tools for conservation management. In this article, we discuss the development, use, and evaluation of such maps.
Our article focuses on maps of species that are based on generalized linear models (GLMs), a particular class of statistical methods that includes simple linear regression and ANOVA. By including nonlinear terms, GLMs can incorporate nonlinear relationships between species and their habitat. Generalized additive models (GAMs), which are closely related to GLMs, can be used as an alternative to model nonlinear relationships. Details about GLMs and GAMs can be found in a range of sources (McCullagh and Nelder 1989, Hastie and Tibshirani 1990, Austin et al. 1984, Yee and Mitchell 1991, Guisan et al. 2002, Austin 2002). Most of our discussion about GLMs is also relevant to GAMs. We have focused on the use of GLMs partly because it is a method of species mapping with which we are familiar, but also because we believe it has clear advantages over alternative methods such as subjective judgement, envelope analysis, genetic algorithms, regression trees and neural networks (Elith and Burgman in press). GLMs provide a rigorous and statistically robust method for predicting the occurrence or abundance of species. The models are explicit and can be analysed for their ecological rationality (Austin 2002). They have the capacity for modeling complex relationships, including interactions, competition and population trends (Austin 2002, Fewster et al. 2000). Uncertainty in the predictions of GLMs can be assessed using confidence intervals, and the predictions can be tested (Guisan and Zimmerman 2000).
GLMs use data on the presence or abundance of species at sites. They relate these data to attributes of the sites, which become the explanatory variables of a regression model. The result is an equation that predicts the abundance or occurrence of a species based on the set of site attributes. For example, Parris (2001) developed a logistic regression equation for the probability (p) of encountering the cascade treefrog (Litoria pearsoniana) at night along a 100-meter section of stream within forests of eastern Australia:
p = 1 / [1 + exp(10.48 - 2.204 × log10(C) - 2.037 × P)],
where C is the annual volume of rain falling in the watershed above the stream, and P=1 if palms are present at the site and 0 otherwise. Cascade treefrogs are found more frequently in moist forest, as indicated by the presence of palms, and at larger streams (Figure 1). Maps of species can be developed by extrapolating the predictions to other sites based on the site attributes (Figure 2).
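As a minimal sketch, the equation above can be evaluated directly. The function name and the example stream sizes below are our own illustrative choices, and the units of C are those of Parris (2001), which are not restated here.

```python
import math


def treefrog_occurrence_probability(catchment_volume, palms_present):
    """Evaluate the Parris (2001) logistic equation quoted in the text.

    catchment_volume: annual volume of rain falling in the watershed
        above the stream (C), in the units used by Parris (2001).
    palms_present: True if palms occur at the site (P = 1), else False (P = 0).
    """
    if catchment_volume <= 0:
        raise ValueError("catchment_volume must be positive")
    palm_indicator = 1 if palms_present else 0
    exponent = (10.48
                - 2.204 * math.log10(catchment_volume)
                - 2.037 * palm_indicator)
    return 1.0 / (1.0 + math.exp(exponent))


# Hypothetical stream sizes, chosen only to show the shape of the curve.
for volume in (1e3, 1e4, 1e5, 1e6):
    without_palms = treefrog_occurrence_probability(volume, False)
    with_palms = treefrog_occurrence_probability(volume, True)
    print(f"C = {volume:.0e}: no palms {without_palms:.3f}, palms {with_palms:.3f}")
```

The predicted probability rises with stream size and is higher where palms are present, which matches the pattern described for Figure 1.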

Figure 1. Logistic regression model of the probability of occurrence of the cascade treefrog (Litoria pearsoniana) as a function of stream size, measured by the annual volume of rainfall in the watershed upstream of the site and the presence or absence of palms (from Parris 2001).
There are numerous methods for determining which explanatory variables should be included in a regression model. For example, a stepwise variable selection algorithm can be used to determine inclusion or exclusion on the basis of statistical significance. There are numerous philosophical and practical reasons, however, why this should not be done (Harrell 2001, Steyerberg et al. 2000). Stepwise variable selection will lead to biased estimates of the regression coefficients and their standard errors (Harrell 2001) and result in meaningless p-values for those variables that remain. An alternative method for variable selection is to use experts to choose the appropriate variables and then use the available data to estimate the parameters of the regression model. This approach is likely to produce better predictions than using statistical significance to determine whether a variable should be included in the model (Steyerberg et al. 2000). Where there is some uncertainty about which variables to include in the regression equation, multiple models can be developed and degrees of belief can be assigned to each (Burnham and Anderson 1998, Hilborn and Mangel 1997). In all cases, the ecological rationale behind the use of each variable needs to be clear. The best predictors are those that have a causal influence on species distribution at the scale of interest (Austin 2002).
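One common way to assign relative support to several candidate models is through Akaike weights, in the information-theoretic spirit of Burnham and Anderson (1998). The sketch below assumes that each candidate model has already been fitted and its AIC recorded. The model names and AIC values are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def akaike_weights(aic_values):
    """Convert a list of AIC values into weights that sum to one.

    Each weight is proportional to exp(-0.5 * delta), where delta is the
    difference between a model's AIC and the smallest AIC in the set.
    """
    best = min(aic_values)
    relative_likelihoods = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative_likelihoods)
    return [value / total for value in relative_likelihoods]


# Hypothetical candidate models and AIC values (not from any real analysis).
candidates = {
    "stream size": 214.0,
    "stream size + palms": 210.0,
    "stream size + palms + altitude": 211.5,
}
weights = akaike_weights(list(candidates.values()))
for name, weight in zip(candidates, weights):
    print(f"{name}: weight {weight:.3f}")
```

The weights depend only on differences in AIC among the models considered, so they express support relative to the candidate set rather than any absolute measure of truth.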

Figure 2. Spatially explicit prediction of the probability that Leionema ralstonii, a rare shrub associated with rocky outcrops of south-eastern Australia, will be present in a 25 m grid cell. The predictions were derived from a GLM with explanatory variables based on topography, mapped rock type, and aerial photo interpretation of the amount of outcropping rock in the vicinity of each site (from Elith 2002).
One question that must be addressed when developing a regression model is how many data points are necessary. One rule of thumb is that for each explanatory degree of freedom (df) there should be a minimum of 10 informative observations (Harrell 2001). When modeling abundance data, this is equivalent to 10 survey sites for each explanatory df. When using presence/absence data, it is equivalent to 10 absence records or 10 presence records, whichever is less common. An alternative approach to determining the level of survey effort is to determine how the precision of the predictions varies with sample size. An appropriate sample size will depend on the acceptable level of precision and the available resources. An approximate rule of thumb is that the standard errors of the regression coefficients will be halved for each quadrupling of the sample size. Surveys that are stratified to cover the range of variation in the explanatory variables, with allocation of samples designed to minimize variances, are necessary for estimating the real relationships and are likely to require fewer samples for the same level of precision compared to simple random samples (Austin and Heyligers 1989, Guisan and Zimmerman 2000).
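The two rules of thumb above reduce to simple arithmetic. The following sketch uses function names and survey counts of our own choosing. The standard error scaling assumes the usual inverse square root relationship with sample size, which is what the quadrupling rule implies.

```python
import math


def max_explanatory_df(n_presences, n_absences, observations_per_df=10):
    """Largest number of explanatory df supported by presence/absence data.

    The limiting count is whichever record type (presence or absence)
    is less common, following the rule of thumb in the text.
    """
    limiting_count = min(n_presences, n_absences)
    return limiting_count // observations_per_df


def standard_error_ratio(current_sample_size, new_sample_size):
    """Approximate ratio of new to current standard errors.

    Assumes standard errors scale with 1 / sqrt(sample size).
    """
    return math.sqrt(current_sample_size / new_sample_size)


# Hypothetical survey: 45 sites with the species present, 300 without.
print(max_explanatory_df(45, 300))

# Quadrupling a hypothetical survey from 100 to 400 sites.
print(standard_error_ratio(100, 400))
```

For abundance data, the same calculation applies with the total number of survey sites in place of the limiting count.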
Maps of the distribution of species invariably contain errors. Because the predictions of GLMs are based on a statistical model, precision in the predictions can be quantified by constructing confidence intervals (Elith et al. 2002). Interpreting such confidence intervals depends on the level of risk that is acceptable to the resource managers and where the burden of proof lies. For example, in order to protect habitat of endangered species, developers might be required to ensure that the upper confidence limit of the predicted probability of occupancy is below a prescribed threshold. Alternatively, habitat might be protected only if we are reasonably sure that it is utilized by the species of interest, i.e., if the lower confidence limit is above a prescribed threshold. The actual choice will depend on the management objectives, the costs and risks of action or inaction, and the acceptability of different levels of risk. Development of statistically based habitat maps allows these risks to be determined more easily than with alternative methods.
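For a logistic regression, one standard way to obtain such an interval is to build it on the scale of the linear predictor and then back-transform it to a probability. The sketch below assumes that the linear predictor and its standard error for a site are already available from a fitted model. The numerical values, the threshold, and the function names are hypothetical.

```python
import math


def inverse_logit(value):
    """Map a linear predictor onto the probability scale."""
    return 1.0 / (1.0 + math.exp(-value))


def occupancy_interval(linear_predictor, standard_error, z=1.96):
    """Return (lower, estimate, upper) for the probability of occupancy.

    The interval is a Wald interval on the logit scale, back-transformed,
    so both limits always lie between 0 and 1. z = 1.96 gives an
    approximate 95% interval.
    """
    lower = inverse_logit(linear_predictor - z * standard_error)
    upper = inverse_logit(linear_predictor + z * standard_error)
    return lower, inverse_logit(linear_predictor), upper


def development_permitted(upper_limit, threshold):
    """Burden of proof on the developer: upper limit must be below threshold."""
    return upper_limit < threshold


def habitat_protected(lower_limit, threshold):
    """Burden of proof on the conservation case: lower limit must exceed threshold."""
    return lower_limit > threshold


# Hypothetical site: linear predictor -1.2 with standard error 0.5.
lower, estimate, upper = occupancy_interval(-1.2, 0.5)
print(f"estimate {estimate:.2f}, interval {lower:.2f} to {upper:.2f}")
print(development_permitted(upper, threshold=0.3))
print(habitat_protected(lower, threshold=0.3))
```

The same site can fail both tests when its interval straddles the threshold, which is why the placement of the burden of proof matters.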
Evaluating the quality of predictions is often an important part of any modeling exercise. We have chosen not to use the term validation, because it might imply to some readers that the aim is to prove the predictions to be true (or false). Clearly, such an aim is meaningless, because we know a priori that any prediction will be incorrect to at least some degree. Evaluating the predictions indicates the level of bias in the predictions (calibration) and the accuracy of the relative ranking of occupied versus unoccupied sites (discrimination). Different statistics are required for these different types of evaluation, e.g., logistic calibration equations (Miller et al. 1991), area under the Receiver Operating Characteristic curve (Hanley and McNeil 1982), Kappa (Cohen 1960), and correlation (Zheng and Agresti 2000). It is also necessary to consider the source of the data that are used. Ideally, data would be derived from further survey, but various resampling methods, such as bootstrapping, can be used to good effect where cost is prohibitive (Steyerberg et al. 2001).
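Two of the statistics named above are simple enough to compute by hand. The sketch below calculates the area under the ROC curve as the proportion of presence and absence pairs that are ranked correctly (ties count as half), and Cohen's Kappa for predictions that have been converted to presence or absence. The evaluation data and the 0.5 classification threshold are hypothetical.

```python
def area_under_roc(presence_scores, absence_scores):
    """Proportion of (presence, absence) pairs ranked correctly.

    A pair is ranked correctly when the occupied site has the higher
    predicted probability; tied pairs contribute one half.
    """
    correct = 0.0
    for presence in presence_scores:
        for absence in absence_scores:
            if presence > absence:
                correct += 1.0
            elif presence == absence:
                correct += 0.5
    return correct / (len(presence_scores) * len(absence_scores))


def cohen_kappa(observed, predicted):
    """Agreement between 0/1 observations and 0/1 predictions, corrected for chance."""
    n = len(observed)
    agreement = sum(o == p for o, p in zip(observed, predicted)) / n
    observed_rate = sum(observed) / n
    predicted_rate = sum(predicted) / n
    chance = (observed_rate * predicted_rate
              + (1 - observed_rate) * (1 - predicted_rate))
    if chance == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (agreement - chance) / (1 - chance)


# Hypothetical evaluation data: predicted probabilities at surveyed sites.
occupied = [0.8, 0.4]
unoccupied = [0.6, 0.2]
print(area_under_roc(occupied, unoccupied))

observed = [1, 1, 0, 0]
predicted = [1 if score >= 0.5 else 0 for score in occupied + unoccupied]
print(cohen_kappa(observed, predicted))
```

The area under the ROC curve uses only the ranking of the predictions, whereas Kappa depends on the threshold chosen to classify sites, so the two statistics can disagree about which model performs better.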
One of the main reasons that GLMs are not used for species mapping is that most of the available data are not suitable. The data should be collected in an unbiased fashion; however, for most species presences are more likely to be recorded than absences. In GLMs, absences are as important as presence records. It is possible to use presence-only data (Zaniewski et al. 2002), but this can only provide relative predictions of occupancy or abundance, not actual values. However, this is in one way an advantage of GLMs; they emphasize that unbiased predictions require rigorous data collection. Any biases in the data, such as a failure to detect a species when it is present, will propagate through to the predictions. Although there are some recent examples where researchers have attempted to estimate and compensate for these sorts of errors (Tyre et al. in review, Wintle et al. in review), it is important to be mindful of the possible biases that are likely to occur.
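The effect of imperfect detection can be illustrated with a simple calculation. The sketch below is not the method of the studies cited above. It assumes that each survey visit detects the species, when present, with the same probability and independently of other visits. The detection probability and visit numbers are hypothetical.

```python
import math


def probability_of_false_absence(detection_probability, n_visits):
    """Chance that an occupied site is recorded as an absence after n visits.

    Assumes independent visits with a constant per-visit detection probability.
    """
    return (1.0 - detection_probability) ** n_visits


def visits_needed(detection_probability, acceptable_false_absence):
    """Smallest number of visits keeping the false-absence chance at or below a target."""
    if not 0.0 < detection_probability < 1.0:
        raise ValueError("detection_probability must be between 0 and 1")
    if not 0.0 < acceptable_false_absence < 1.0:
        raise ValueError("acceptable_false_absence must be between 0 and 1")
    visits = math.log(acceptable_false_absence) / math.log(1.0 - detection_probability)
    return math.ceil(visits)


# Hypothetical per-visit detection probability of 0.5.
print(probability_of_false_absence(0.5, 3))
print(visits_needed(0.5, 0.05))
```

Under these assumptions, a single visit to each site can leave a substantial proportion of occupied sites recorded as absences, and that error carries through to the fitted model.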
The confidence intervals developed using GLMs quantify the uncertainty in the predictions that arises due to random sampling error. They do not address error associated with incorrect model specification, biases in the data, errors in the explanatory variables, or ambiguity. However, such uncertainties could be quantified with multiple models and sensitivity analyses (Elith et al. 2002). A final word of caution is that occupancy or abundance may not reflect the habitat quality of the species (Tyre et al. 2001). In cases where habitat quality can be measured at sites (e.g., by measuring survival and/or reproductive rates), it is possible to construct a GLM of habitat quality.
In a conservation planning framework, the required level of detail and reliability of a predictive map should be determined primarily by the management context. This then has repercussions for data quality, selection of predictor variables, evaluation of predictions, and for how we communicate information about the final species map.
Akçakaya, H.R., M.A. McCarthy, and J.L. Pearce. 1995. Linking landscape data with population viability analysis: Management options for the Helmeted Honeyeater Lichenostomus melanops cassidix. Biological Conservation 73:169-176.
Austin, M.P. 2002. Spatial prediction of species distribution: An interface between ecological theory and statistical modelling. Ecological Modelling 157:101-118.
Austin, M.P., R.B. Cunningham, and P.M. Fleming. 1984. New approaches to direct gradient analysis using environmental scalars and statistical curve-fitting procedures. Vegetatio 55:11-27.
Austin, M.P., and P.C. Heyligers. 1989. Vegetation survey design for conservation: Gradsect sampling of forests in northeastern NSW. Biological Conservation 50:13-32.
Burnham, K.P., and D.R. Anderson. 1998. Model selection and inference: A practical information-theoretic approach. Springer-Verlag, New York.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20:37-46.
Elith, J. 2002. Predicting the distribution of plants. Ph.D. thesis (unpublished), School of Botany, The University of Melbourne, Australia.
Elith, J., and M.A. Burgman. In press. Chapter 8: Habitat models for PVA. In C.A. Brigham and M.W. Schwartz, editors. Population viability in plants. Springer-Verlag, New York.
Elith, J., M.A. Burgman, and H.M. Regan. 2002. Mapping epistemic uncertainties and vague concepts in predictions of species distribution. Ecological Modelling 157:313-329.
Fewster, R.M., S.T. Buckland, G.M. Siriwardena, S.R. Baillie, and J.D. Wilson. 2000. Analysis of population trends for farmland birds using generalized additive models. Ecology 81:1970-1984.
Guisan, A., T.C. Edwards, Jr., and T. Hastie. 2002. Generalized linear and generalized additive models in studies of species' distribution: Setting the scene. Ecological Modelling 157:89-100.
Guisan, A., and N.E. Zimmerman. 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135:147-186.
Hanley, J.A., and B.J. McNeil. 1982. The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology 143:29-36.
Harrell, F.E. 2001. Regression modeling strategies with applications to linear models, logistic regression and survival analysis. Springer Series in Statistics. Springer-Verlag, New York.
Hastie, T. 1991. Generalized additive models. Pages 249-308 in J.M. Chambers and T.J. Hastie, editors. Statistical models in S. Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, California.
Hilborn, R., and M. Mangel. 1997. The ecological detective: Confronting models with data. In S.A. Levin and H.S. Horn, editors. Monographs in population biology. Princeton University Press, Princeton, New Jersey.
McCullagh, P., and J.A. Nelder. 1989. Generalized linear models. In D.R. Cox, D.V. Hinkley, D. Rubin, and B.W. Silverman, editors. Monographs on Statistics and Applied Probability, 2nd edition. Chapman and Hall, London.
Miller, M.E., S.L. Hui, and W.M. Tierney. 1991. Validation techniques for logistic regression models. Statistics in Medicine 10:1213-1226.
Parris, K.M. 2001. Distribution, habitat requirements and conservation of the cascade treefrog (Litoria pearsoniana, Anura: Hylidae). Biological Conservation 99:285-292.
Possingham, H.P., I.R. Ball, and S. Andelman. 2000. Mathematical methods for identifying representative reserve networks. Pages 291-306 in S. Ferson and M. Burgman, editors. Quantitative methods for conservation biology. Springer-Verlag, New York.
Steyerberg, E.W., M.J.C. Eijkemans, F.E. Harrell, and J.D.F. Habbema. 2000. Prognostic modelling with logistic regression analysis: A comparison of selection and estimation methods in small data sets. Statistics in Medicine 19:1059-1079.
Steyerberg, E.W., F.E. Harrell, G.J.J.M. Borsboom, M.J.C. Eijkemans, Y. Vergouwe, and J.D.F. Habbema. 2001. Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology 54:774-781.
Tyre, A.J., H.P. Possingham, and D.B. Lindenmayer. 2001. Matching observed pattern with ecological process: Can territory occupancy provide information about life history parameters? Ecological Applications 11:1722-1738.
Tyre, A.J., B. Tenhumberg, S.A. Field, H.P. Possingham, D. Niejalke, and K. Parris (in review). Improving precision and reducing bias in biological surveys by estimating false negative error rates in presence-absence data. Ecological Applications.
Wintle, B.A., M.A. Burgman, and R.P. Kavanagh (in review). The magnitude and management consequences of false negative observation error in surveys of arboreal marsupials and large forest owls. Ecological Applications.
Yee, T.W., and N.D. Mitchell. 1991. Generalized additive models in plant ecology. Journal of Vegetation Science 2:587-602.
Zaniewski, A.E., A. Lehmann, and J.M. Overton. 2002. Predicting species distribution using presence-only data: A case study of native New Zealand ferns. Ecological Modelling 157:261-280.
Zheng, B., and A. Agresti. 2000. Summarizing the predictive power of a generalized linear model. Statistics in Medicine 19:1771-1781.