Ecological Archives E088-173-A1

D. Richard Cutler, Thomas C. Edwards, Jr., Karen H. Beard, Adele Cutler, Kyle T. Hess, Jacob Gibson, and Joshua J. Lawler. 2007. Random forests for classification in ecology. Ecology 88:2783–2792.

Appendix A. Technical details and additional capabilities of random forests.

TECHNICAL DETAILS

The Gini Index

For the K-class classification problem the Gini index at a node is defined to be G = Σk pk(1 – pk), where pk is the proportion of observations at the node in the kth class. The index is minimized (at zero) when one of the pk takes the value 1 and all the others have the value 0. In this case the node is said to be pure and no further partitioning of the observations in that node will take place. The Gini index takes its maximum value when all the pk take on the value 1/K, so that the observations at the node are spread equally among the K classes. The Gini index for an entire classification tree is a weighted sum of the values of the Gini index at the terminal nodes, with the weights being the numbers of observations at the nodes. Thus, in the selection of the next node to split, nodes that have large numbers of observations but for which only small improvements in the pk can be realized may be offset against nodes that have small numbers of observations but for which large improvements in the pk are possible.
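
As a minimal illustration, the short R function below computes the Gini index for a single node from the class labels of the observations at that node. The function name and example labels are ours for illustration only; they are not part of the randomForest package.

    ## Gini index for a single node, given the class labels of its observations
    gini <- function(classes) {
      p <- table(classes) / length(classes)   # class proportions pk
      sum(p * (1 - p))
    }
    gini(rep("A", 20))                  # pure node: index is 0
    gini(rep(c("A", "B"), each = 10))   # two equal classes: maximum value 1 - 1/2 = 0.5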

The Number of Variables Available for Splitting at Each Node

The parameter mtry controls the number of variables available for splitting at each node in a tree in a random forest. The default values of mtry are (the integer part of) the square root of the number of variables for classification, and the number of variables divided by three for regression. The smaller value of mtry for classification helps ensure that the fitted classification trees in the random forest have small pairwise correlations, a characteristic that is not needed for regression trees in a random forest. In principle, for both applications, if a few variables carry most of the predictive capability then a smaller value of mtry is appropriate, and if the data contain many variables that are weakly predictive of the response variable then larger values of mtry are appropriate. In practice, RF results are quite insensitive to the value of mtry that is selected. To illustrate this point, for Verbascum thapsus in the Lava Beds NM invasive plants data, we ran RF five times at the default settings (mtry = 5) and obtained out-of-bag PCC values of 95.3%, 95.2%, 95.2%, 95.3%, and 95.4%. Next, we ran RF once for each of the following values of mtry: 3, 4, 6, 7, 10, and 15. The out-of-bag PCC values for these six cases were 95.2%, 95.3%, 95.3%, 95.2%, 95.3%, and 95.3%. So, in this example, decreasing mtry to three and increasing it to half the total number of predictor variables had no effect on the correct classification rates. The other metrics we used (sensitivity, specificity, kappa, and AUC) exhibited the same insensitivity to changes in the value of mtry.
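
A minimal sketch in R of this kind of comparison is given below. It assumes a data frame named lavabeds whose first column, veth, is a two-level factor coding presence or absence of V. thapsus; these object and variable names are hypothetical.

    library(randomForest)

    ## Fit one forest for each candidate value of mtry (defaults otherwise)
    fits <- lapply(c(3, 5, 10, 15), function(m)
      randomForest(veth ~ ., data = lavabeds, mtry = m))

    ## Out-of-bag PCC for each fit: 1 minus the final out-of-bag error rate
    sapply(fits, function(fit) 1 - fit$err.rate[fit$ntree, "OOB"])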

The R implementation of RF (Liaw and Wiener 2002) contains a function called tuneRF that automatically selects the value of mtry that is optimal with respect to out-of-bag correct classification rates. We have not used this function, in part because the performance of RF is insensitive to the chosen value of mtry, and in part because no research has yet assessed how choosing RF parameters such as mtry to optimize out-of-bag error rates affects the generalization error rates of RF.
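
For readers who do wish to tune mtry, a typical call looks roughly like the sketch below (again using the hypothetical lavabeds data frame, with the response in the first column). tuneRF steps mtry up and down by the factor stepFactor until the out-of-bag error stops improving by at least the fraction improve.

    ## Search over mtry using out-of-bag error; x = predictors, y = response
    tuned <- tuneRF(x = lavabeds[, -1], y = lavabeds$veth,
                    ntreeTry = 100, stepFactor = 2, improve = 0.01)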

The Number of Trees in the Random Forest

Another parameter that may be controlled in RF is the number of bootstrap samples selected from the raw data, which determines the number of trees in the random forest (ntree). The default value of ntree is 500. Very small values of ntree can result in poor classification performance, but ntree = 50 is adequate in most applications. Larger values of ntree result in more stable classifications and variable importance measures, but in our experience, the differences in stability are very small for large ranges of possible values of ntree. For example, we ran RF with ntree = 50 five times for V. thapsus using the Lava Beds NM data and obtained out-of-bag PCC values of 95.2%, 94.9%, 95.0%, 95.2%, and 95.2%. These numbers show slightly more variability than the five values listed in the previous section for the default ntree = 500, but the difference is very modest.
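
As a brief sketch (same hypothetical data frame as above), ntree is set directly in the call, and plotting the fitted forest shows how the out-of-bag error stabilizes as trees are added, which is a simple way to judge whether ntree is large enough.

    fit50  <- randomForest(veth ~ ., data = lavabeds, ntree = 50)
    fit500 <- randomForest(veth ~ ., data = lavabeds, ntree = 500)   # the default

    ## Out-of-bag error as a function of the number of trees
    plot(fit500)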

ADDITIONAL APPLICATIONS OF RANDOM FORESTS IN ECOLOGICAL STUDIES

In this section we describe RF’s capabilities for types of statistical analyses other than classification.

Regression and Survival Analysis

RF may be used to analyze data with a numerical response variable without making any distributional assumptions about the response or predictor variables, or about the nature of the relationship between the response and predictor variables. Regression trees are fit to bootstrap samples of the data and the numerical predictions of the out-of-bag response variable values are averaged. Regression functions fit by regression trees are piecewise constant, or “stepped.” The same is true of regression functions from RF, but the steps are smaller and more numerous, allowing better approximations of continuous functions. Prasad et al. (2006) apply RF to the prediction of abundance and basal area of four tree species in the southeastern United States. When the response variable is a survival or failure time, with or without censoring, RF may be used to compute fully non-parametric survival curves for each distinct combination of predictor variable values in the data set. The approach is similar to Cox’s proportional hazards model, but it does not require the proportional hazards assumption, under which all the survival curves have the same general shape. Details of survival forests may be found in Breiman and Cutler (2005).
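
A minimal regression sketch in R follows. The data frames stands and new_stands and the numeric response basal_area are hypothetical names, used only to show that supplying a numeric response is all that is needed for randomForest to fit a regression forest.

    ## With a numeric response, randomForest fits a regression forest
    ## (mtry defaults to the number of predictors divided by three)
    rf_reg <- randomForest(basal_area ~ ., data = stands, ntree = 500)
    rf_reg                                        # prints out-of-bag MSE and % variance explained
    pred <- predict(rf_reg, newdata = new_stands) # predictions for new sites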

Proximities, Clustering, and Imputation of Missing Values

The proximity, or similarity, between any two points in a data set is defined as the proportion of times the two points occur in the same terminal node. Two types of proximities may be obtained from RF. Out-of-bag proximities, which use only out-of-bag observations in the calculations, are the default. Alternatively, proximities may be computed using all the observations. Proximities are currently the subject of intense research, and the relative merits of the two kinds of proximities have yet to be resolved. Calculation of proximities is very computationally intensive. For the Lava Beds NM data (n = 8251) the memory required to compute the proximities exceeded the memory available in the Microsoft Windows version of R (4 GB). The FORTRAN implementation of RF (Breiman and Cutler 2005) has an option that stores only a user-specified number of the largest proximities for each observation, which greatly reduces the amount of memory required.
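
In the R implementation, proximities are computed only when requested, as in the sketch below (same hypothetical lavabeds data frame); the oob.prox argument controls whether only out-of-bag observations are used in the calculation.

    ## Request the n x n proximity matrix; storing it is what becomes
    ## memory-limiting for large n
    rf_prox <- randomForest(veth ~ ., data = lavabeds,
                            proximity = TRUE, oob.prox = TRUE)
    dim(rf_prox$proximity)    # n x n matrix of pairwise proximities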

Proximities may be used for cluster analysis and for graphical representation of the data by multidimensional scaling (MDS) (Breiman and Cutler 2005). See Appendix C for an example of an MDS plot for the classification of the nest and non-nest sites in the cavity nesting birds’ data. Proximities also may be used to impute missing values. Missing numerical observations are initially imputed using the median for the variable. Proximities are computed, and the missing values are replaced by weighted averages of values on the variable using the proximities as weights. The process may be iterated as many times as desired (the default is five times). For categorical explanatory variables, the imputed value is taken from the observation that has the largest proximity to the observation with a missing value.
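
Both uses are available in the R package. Continuing the hypothetical lavabeds example, a rough sketch: MDSplot draws a multidimensional scaling plot based on the proximities, and rfImpute carries out the proximity-based imputation described above.

    ## MDS plot of the proximities, with points colored by class
    MDSplot(rf_prox, fac = lavabeds$veth, k = 2)

    ## Proximity-based imputation of missing predictor values;
    ## iter is the number of iterations (default 5)
    lavabeds_complete <- rfImpute(veth ~ ., data = lavabeds, iter = 5)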

As a sample application of imputation in RF, in three separate experiments using the LAQ data we randomly selected 5%, 10%, and 50% of the values of the variable Elevation and replaced them with missing values. We then imputed the missing values using RF, with the number of iterations ranging from 1 to 25. The results for all combinations of percentage of missing values and number of iterations of the RF imputation procedure were qualitatively very similar: the means of the original and imputed values were about the same (1069 vs. 1074, for one typical case); the correlations between the true and imputed values ranged from 0.964 to 0.967; and the imputed values were less dispersed than the true values, with standard deviations of about 335 for the imputed values compared to about 460 for the true values. This kind of contraction, or shrinkage, is typical of regression-based imputation procedures. When a large percentage of the values in a data set have been imputed, Breiman and Cutler (2005) warn that, in subsequent classifications using the imputed data, the out-of-bag estimates of correct classification rates may overestimate the true generalization correct classification rate.
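
A rough sketch of one such experiment in R is shown below, assuming a data frame laq whose factor response is named lichen_class; the names and the 10% deletion rate are purely illustrative.

    set.seed(42)
    true_elev <- laq$Elevation
    miss <- sample(nrow(laq), size = round(0.10 * nrow(laq)))   # delete 10% at random
    laq$Elevation[miss] <- NA

    laq_imp <- rfImpute(lichen_class ~ ., data = laq, iter = 5)

    cor(true_elev[miss], laq_imp$Elevation[miss])       # agreement of imputed and true values
    c(sd(laq_imp$Elevation[miss]), sd(true_elev[miss])) # check for shrinkage in the imputed values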

Detecting Multivariate Structure by Unsupervised Learning

Proximities may be used as inputs to traditional clustering algorithms to detect groups in multivariate data, but not all multivariate structure takes the form of clusters. RF uses a form of unsupervised learning (Hastie et al. 2001) to detect general multivariate structure without assuming that clusters exist within the data. The general approach is as follows. The original data are labeled class 1. A copy of the data in which the values of each variable have been independently permuted is labeled class 2. If there is no multivariate structure among the variables, RF should misclassify about 50% of the time. Misclassification rates substantially lower than 50% indicate multivariate structure that may be investigated using other RF tools, including variable importance, proximities, MDS plots, and clustering using proximities.
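
A rough R sketch of this idea follows, again with the hypothetical lavabeds predictors; randomForest also runs in an unsupervised mode of this kind automatically when the response is omitted from the call.

    ## Class 1: the original predictors; class 2: each column independently permuted
    x      <- lavabeds[, -1]
    x_perm <- as.data.frame(lapply(x, sample))
    y      <- factor(rep(c("original", "permuted"), each = nrow(x)))

    rf_unsup <- randomForest(x = rbind(x, x_perm), y = y, proximity = TRUE)
    rf_unsup$err.rate[rf_unsup$ntree, "OOB"]   # well below 0.5 suggests multivariate structure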

SOFTWARE USED IN ANALYSES

Stepwise discriminant analysis and preliminary data analyses and manipulations were carried out in SAS version 9.1.3 for Windows (SAS Institute, Cary NC). All other classifications and calculations of accuracy measures were carried out in R version 2.4.0 (R Development Core Team 2006). Logistic regression is part of the core distribution of R. LDA is included in the MASS package (Venables and Ripley 2002). Classification trees are fit in R using the rpart package (Therneau and Atkinson 2006). The R implementation of RF, randomForest, is due to Liaw and Wiener (2002).

SOURCES OF RANDOM FORESTS SOFTWARE

Three sources of software for RF currently exist. These are:

1. FORTRAN code is available from the RF website (http://www.math.usu.edu/~adele/forests).

2. Liaw and Wiener (2002) have implemented an earlier version of the FORTRAN code for RF in the R statistical package.

3. Salford Systems (www.Salford-Systems.com) markets a professional implementation of RF with an easy-to-use interface.

The use of trade, product, or firm names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.

LITERATURE CITED

Breiman, L., and A. Cutler. 2005. Random Forests website: http://www.math.usu.edu/~adele/forests

Hastie, T. J., R. J. Tibshirani, and J. H. Friedman. 2001. The elements of statistical learning: data mining, inference, and prediction. Springer series in statistics, New York, New York, USA.

Liaw, A., and M. Wiener. 2002. Classification and Regression by randomForest. R News: The Newsletter of the R Project (http://cran.r-project.org/doc/Rnews/) 2(3):18–22.

Prasad, A. M., L. R. Iverson, and A. Liaw. 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199.

R Development Core Team. 2006. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org

Therneau, T. M., and E. Atkinson. 2006. rpart: Recursive Partitioning. R package version 3.1.

Venables, W. N., and B. D. Ripley. 2002. Modern applied statistics with S (Fourth Edition). Springer, New York, New York, USA.


