Random Forests for Spatially Dependent Data

Abstract Spatial linear mixed-models, consisting of a linear covariate effect and a Gaussian process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is nonlinear. Random forests (RF) are popular for estimating nonlinear functions but applications of RF for spatial data have often ignored the spatial correlation. We show that this impacts the performance of RF adversely. We propose RF-GLS, a novel and well-principled extension of RF, for estimating nonlinear covariate effects in spatial mixed models where the spatial correlation is modeled using GP. RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate for dependence in linear models. RF becomes a special case of RF-GLS, and is substantially outperformed by RF-GLS for both estimation and prediction across extensive numerical experiments with spatially correlated data. RF-GLS can be used for functional estimation in other types of dependent data like time series. We prove consistency of RF-GLS for β-mixing dependent error processes that include the popular spatial Matérn GP. As a byproduct, we also establish, to our knowledge, the first consistency result for RF under dependence. We establish results of independent importance, including a general consistency result of GLS optimizers of data-driven function classes, and a uniform law of large number under β-mixing dependence with weaker assumptions. These new tools can be potentially useful for asymptotic analysis of other GLS-style estimators in nonparametric regression with dependent data.


Introduction
Geo-referenced data, exhibiting spatial correlation, are commonly analyzed in a mixed-model framework consisting of a fixed-effect component for the covariates and a spatial randomeffect (Banerjee, Carlin, and Gelfand 2014). If Y, X, and , respectively, denote the response, the D-dimensional vector of covariates, and the spatial location, then the spatial linear mixed model can be expressed as Y = X β + w( ) + error. The linear regression term x β parsimoniously models the fixed covariate effect m(x) = E(Y | X = x), while the spatial random effect w( ) is often modeled flexibly using Gaussian Processes (GPs) that encode the dependence in the data across locations. The mixed model allows for inference on the covariate effect as the estimate of β, while adjusting for the spatial dependence via w. At the same time, it leverages the conveniences of a GP to seamlessly predict the outcome at a new location via kriging. This flexibility has made the spatial linear mixed model with GP-distributed random effects the flag-bearer in geo-statistics.
This article focuses on applications where the linearity assumption for the covariate effect m(x) is inappropriate. We consider the spatial nonlinear mixed-effect model Y = m(X) + w( ) + error. A nonlinear m(x) can be modeled in terms of splines or other basis function expansions, while still modeling the spatial effect using GP. However, smooth and continuous basis functions are not ideal for modeling nonsmooth and possibly discontinuous (like piecewise constant) covariate effects.
Also, basis functions on multi-dimensional covariate domains experience curse of dimensionality even for only 3 or 4 covariates as the knots are usually too far apart to adequately represent the covariate space (Taylor and Einbeck 2013).
Random forest (RF) (Breiman 2001) estimates nonlinear regression functions using an ensemble of nonsmooth, dataadaptive family of basis functions, which have more expressive powers than traditional fixed basis methods (Lin and Jeon 2006;Scornet 2016). RF has become one of the most popular method for flexible nonlinear function estimation, with wide applications in different scientific and engineering fields. While RF have also been used for a number of geospatial applications (Ahijevych et al. 2016;Di et al. 2019;Lim et al. 2019;Viscarra Rossel, Webster, and Kidd 2014;Fayad et al. 2016), most of the aforementioned work do not make procedural considerations for the RF algorithm to address the spatial correlation in the data. This is fundamentally at odds with the tenets of spatial modeling where the spatial correlation is explicitly accommodated using GP-distributed random effects.
Very little attention has been paid to how ignoring this spatial correlation affects function estimation using RF. The criterion (loss function) used to recursively split the nodes of decision trees of RF is essentially an ordinary least-square (OLS) loss. It is well known that OLS is suboptimal for dependent data as it ignores data correlation. Also, RF creates and consolidates the trees using "bagging" (bootstrap aggregation) (Breiman 1996). As spatial data are correlated, this resampling violates the assumption of independent and identically distributed (iid) data units, fundamental to bootstrapping. We provide empirical evidence that these limitations manifest in inferior estimation/prediction performance of RF under spatial dependence.
There have been two ways of accounting for spatial dependence in RF. The first approach performs spatial adjustment through kriging after fitting an RF without accounting for the spatial dependence. For prediction at new locations, it uses the spatial mixed model framework to combine this vanilla RF estimate of the covariate effect m(X) with kriging from the GP part (using the residuals from the RF fit) (Fayad et al. 2016;Fox, Ver Hoef, and Olsen 2020). This is referred to as RF residual kriging (RF-RK). The other approaches attempt to explicitly use spatial information in RF. The spatial RF (RFsp) of Hengl et al. (2018) adds all pairwise spatial-distances as additional covariates. Georganos et al. (2019) proposed geographically local estimation of RF. These approaches abandon the spatial mixed model framework, only focusing on prediction, modeling the response as a joint function g of the covariates and the locations, that is, E(Y) = g(X, ), and are thus not able to isolate or estimate the covariate effect m(x). Also, to our knowledge, there is no asymptotic theory justifying these approaches.
In this article, we bridge the gap between spatial mixedeffect modeling using GP and regression function estimation using RF by proposing a well-principled rendition of RF to explicitly incorporate the spatial correlation structure implied by GP. This enables using RF for estimating a nonlinear covariate effect in the spatial mixed-effect model while explicitly accounting for spatial correlation via the GP-distributed random effects. Our approach is motivated by the fact that in a GP-based spatial linear mixed model, marginal likelihood maximization for β is equivalent to a generalized least-square (GLS) optimization. Under data dependence, GLS is more efficient than OLS for linear regression, leading to better finite sample performance.
GLS has been previously used for parameter estimation in nonlinear regression under data dependence (Kimeldorf and Wahba 1971;Diggle and Hutchinson 1989). Since regression with polynomials, splines or other known basis functions are essentially linear in the parameters, the extension of these approaches to a GLS-style quadratic form global optimization is immediate and the solution is available in closed form. However, RF uses basis functions obtained by a data-adaptive, greedy algorithm, so a GLS formulation for estimating the regression part using RF in a spatial GP regression is not immediate and has not been explored before.
An observation, central to our GLS extension of regression trees and forests is that to split a node of a decision tree in RF, the local (intra-node) loss can be equivalently represented as a global (in all nodes) OLS linear regression problem with a binary design matrix. The subsequent node representative assignment is simply the OLS fit given the chosen split. For dependent data, we can replace this OLS step with a GLS optimization problem at every node split and grow the tree accordingly. The node representatives also become the GLS fits instead of node means. The global GLS loss-based splitting and fits thus use information from all current nodes as is desirable under data-dependence and are more efficient than the OLS analogs.
Our GLS-style RF also naturally accommodates data resampling or sub-sampling used for creating a forest of trees. The key observation for this is that in a linear regression between response Y and covariates X, the GLS loss with covariance matrix 0 coincides with an OLS loss withỸ = −1/2 0 Y,X = −1/2 0 X. Thus GLS can be thought of as OLS with the decorrelated responsesỸ. Analogously, we introduce resampling after the decorrelation in our algorithm, essentially resampling the uncorrelated contrastsỸ instead of the correlated outcomes Y.
We refer to our method as RF-GLS and present a computationally efficient algorithm for implementing it. RF-GLS reduces exactly to RF when the working correlation matrix used in the GLS-loss is the identity matrix. For spatial mixed model regression, RF-GLS estimates the nonlinear covariate effect while using the GP to model the spatial random effect. Hence, traditional spatial tasks like kriging (spatial predictions) can be performed easily. For large spatial data, RF-GLS also avoids onerous big GP computations by harmonizing with the nearest neighbor GPs (Datta et al. 2016a) to yield an algorithm of linear time-complexity.
We show that RF-GLS has distinct advantages over competing methods for both estimation and prediction. For estimation, RF-GLS, accounting explicitly for the spatial correlation via GLS loss, provides improved estimates of the covariate effect m(X) in a wide range of simulation scenarios over RF which completely ignores the spatial information. Other methods like RFsp do not even produce a separate estimate of the covariate effect and cannot be considered for the task of estimation. For prediction, RF-GLS outperforms both RF-RK and RFsp. RF-GLS conveniently uses the spatial mixed model framework where the structured spatial dependence is parsimoniously encoded via GP. Methods like RFsp abandon this mixed-model framework and uses a large number of additional distance-based covariates in RF. A downside to this unnecessary escalation of the problem to high-dimensional settings is that the several additional distance-based covariates far outnumber the true set of covariates thereby biasing the covariate selection at each node-splitting toward the spatial covariates. This leads to poor prediction performance of RFsp when the spatial noise is small relative to the covariate signal. On the other hand, when the spatial noise is large, RF-RK, estimating the covariate effect m(X) without accounting for the spatial dependence, performs substantially poorly while RFsp (dominated by spatial covariates) performs comparably to RF-GLS. RF-GLS performs best at all ranges of the covariate-signal-to-spatial-noise-ratio (SNR), where each of RF-RK and RFsp suffers at opposite ends of this SNR spectrum (see Figure 3).

Theoretical Contributions
Another major contribution of this article is a thorough theoretical analysis of RF-GLS under dependent error processes like GP. For iid data, Scornet et al. (2015) proved the consistency of Breiman's RF, which allows the node splitting and node representation to be based on the same data. The other strand of RF theory consider "honest" trees (Mentch and Hooker 2016;Wager and Athey 2018) which use disjoint data subsets for splitting and representation. To our knowledge, study of RF under dependent processes has not been conducted in either paradigms. RF-GLS, like Breiman's RF, uses the same data for partitioning and node representation. Hence we adopt the framework of Scornet et al. (2015) to study consistency.
Our main result (Theorem 3.1) proves RF-GLS is L 2 consistent if the error process is absolutely-regular (β-) mixing (Bradley 2005). As corollary, we prove RF-GLS is consistent for nonlinear mean estimation in the spatial mixed model regression under the ubiquitous Matérn family of covariance functions. As a byproduct of the theory, we also establish consistency of the vanilla CART (Breiman et al. 1984) and of RF under βmixing error processes. To our knowledge, this is the first result on consistency of these procedures under dependence.
The theoretical analysis, besides being novel in terms of establishing consistency for forest estimators under dependence, required addressing several new challenges that do not arise for the analysis of classic RF for iid settings. These complications were introduced by the global GLS-style quadratic loss and the dependent errors. We summarize the new contributions here: 1. Limits of GLS regression-trees: Node splitting and node representation in regression trees of classic RF uses local (intranode) sample means and variances which trivially asymptote to their population analogs. For RF-GLS, to adjust for data correlation, we use a global GLS (quadratic form) loss for node-splitting and the set of node representatives is the global GLS estimate. Both steps now depend on data from all nodes (weighted by their spatial dependence) and their asymptotic limits are non-trivial. We meticulously track these dataweights from all nodes when splitting a given node to show in Lemma 2.1 and Theorem 2.1 that the RF-GLS split-criterion and node representatives converge to the same desirable local population limits as in RF while being empirically more efficient under dependence. These findings may seem similar to the well-known fact that GLS and OLS estimates both converge to the same limits with the GLS being more efficient under dependence. However, those classic results assume a true linear model whereas for us the true function is nonlinear E(y) = m(X), and is being estimated by a mis-specified tree estimate. To our knowledge, limits of the global GLS tree estimates as population-level local node means for that tree is a new contribution. 2. Equicontinuity of split-criterion: A center-piece in the proof of consistency of RF in Scornet et al. (2015) was stochastic equicontinuity of the CART split criterion, such that if two set of splits are close, their corresponding empirical splitcriterion values are close, irrespective of the location of the splits. For RF-GLS, the split criterion (GLS loss with plugged in GLS estimate) is is a complicated matrix-based function of the design matrix representing a set of splits. Hence, the scalar techniques of Scornet et al. (2015) do not apply here. Our equicontinuity proof is quite involved and is built on viewing GLS predictions as oblique projections on the design matrices corresponding to the splits and subsequently using results on perturbations of projection operators. More details about the technique used can be found in Section 5.2.1. 3. New uniform laws of large number (ULLNs): Controlling the quadratic form estimation error for RF-GLS under dependence by an ULLN is key to establishing consistency. The standard technique of first proving an ULLN that replaces the dependent errors by iid ones fails here as the iid error sequence does not preserve the distribution of the crossproduct terms of the quadratic form. We use a novel strategy of creating separate bivariate iid error sequences for each of the off-diagonal bands of cross-product terms in the quadratic form. We then establish a new ULLN (Proposition S2.1) for these cross-product terms having the same concentration bound (in terms of random L p norm entropy numbers) as the squared (diagonal) terms. Next, to generalize the ULLNs from iid to dependent settings, existing results require Lipschitz continuity or an uniform bound on the estimators none of which are satisfied by regression-trees. We established a general ULLN (Proposition 5.3) for β-mixing processes that uses a weaker (2 + δ)th moment assumption that is satisfied by regression-trees. Neither of the two ULLNs were needed in the asymptotic study of RF for iid data as there was no quadratic form loss or cross-product terms, and the error process was independent thereby not requiring any dependent ULLNs. 4. General tools for machine learning for dependent data: Using these ULLNs described above we established a general consistency result (Theorem 5.1) for GLS estimates for a broad class of functions under dependent error. This result generalizes Theorem 10.2 of Györfi et al. (2002) (which was for iid data and OLS estimates) to β-mixing dependent processes and GLS losses. We believe this general result would be of widespread interest. Just like the iid result of Györfi et al. (2002) was used to establish consistency of a wide range of OLS estimators (including piecewise polynomials, univariate and multivariate splines, data-driven partitioning based estimators like RFs and trees), our Theorem 5.1 can potentially be used to establish consistency of the analogs of each of these methods that explicitly accounts for dependence (spatial/serial correlation) in the data by switching to a GLS procedure. 5. Consistency for Matérn GPs: We prove consistency of RF-GLS when the true dependent error process comes from the Matérn Gaussian family (the staple choice for geospatial analysis due to interpretability of the parameters in terms of spatial surface smoothness). The proof combines several results spread over the fields of time-series, stochastic differential equations, information theory, and spatial statistics. To our knowledge, this is the first consistency result for an RFtype algorithm under Matérn spatial correlation.
While RF-GLS is motivated by the spatial GP-based mixed model framework, the method and the theory is developed much more broadly and can be used for regression function estimation in many other dependent data settings. The method only relies on knowledge of the residual covariance matrix and thus can work with any dependent error process with a valid second moment. The theory is applicable for the large class of β-mixing processes. We briefly discuss how RF-GLS can be used for function estimation under serially correlated errors (time series). In particular, we show that for autoregressive errors, one of the mainstays of time-series analysis, RF-GLS would yield a consistent estimate of the regression function.
The rest of the article is organized as follows. In Section 2 we present our methodology. The theory of consistency is presented in Section 3, including theory for common spatial and time-series processes in Section 3.3. Results from a variety of simulation experiments validating our method is presented in Section 4. While the formal proofs of all the results are provided in the supplementary materials, an outline of the proof of the main consistency result is presented in Section 5 that highlights the new technical tools developed in the process. We discuss future extension of the presented work in Section 6.

Notations
|S| denotes the cardinality of any set S. I(·) is the indicator function. The null set is {}. For any matrix M, M + denotes its generalized Moore-Penrose inverse, and M p , for 1 ≤ p ≤ ∞, denotes its matrix L p norm. For any n × n symmetric matrix when the sequence |a n /b n | is bounded above (or goes to 0) as n → ∞. A sequence of random variables is called O b (1) if it is uniformly bounded almost surely, O p (1) if it is bounded in probability, and o p (1) if it goes to 0 in probability. X ∼ Y implies X follows the same distribution as Y. R, Z, and N denote the set of real numbers, integers, and natural numbers respectively. For M ∈ R + , T M is the truncation operator, that is, T M (u) = max(−M, min(u, M)).

Review of Spatial GP Regression
The standard geospatial data unit is the triplet (Y i , X i , i ) where Y i is the univariate response, X i is a D-dimensional covariate (feature) vector, and i is the spatial location (often, geographical co-ordinates). A spatial linear mixed-model for such data is specified as Y i = X i β + w( i ) + * i , where X i β is the linear mean (covariate effect), w( i ) models the spatial structure in the response not accounted for by the covariates, and * i iid ∼N(0, τ 2 ) is the random noise. The spatial effect w( i ) is often posit to be a smooth surface across space and is modeled as a centered GP w(·) ∼ GP(0, C(·, · | θ)) with covariance function C. This essentially means that for any finite collection of locations 1 , . . . , n , We can marginalize over the spatial random effects w( i ) and write Y = (Y 1 , . . . , Y n ) ∼ N(Xβ, 0 ) where 0 = C + τ 2 I. For known 0 , maximizing this marginalized likelihood of Y is equivalent to minimizing the quadratic form min β . This expression can be recognized as the GLS loss which replaces the squared error loss (1/n) n i=1 (Y i − X i β) 2 of OLS when the data are correlated.
We focus on estimating a general mean function in the spatial nonlinear mixed model (1) When m(x) is modeled in terms of a fixed basis expansion, the marginalized model becomes Y ∼ N(B(X)γ , 0 ) where B(X) are the bases and γ are the coefficients. Hence, the model remains linear in the unknown coefficients γ which can be once again estimated using GLS.
As discussed in the introduction, the scope of fixed basis function expansions are limited due to their inability to model regression functions with discontinuities, and curse of dimensionality in multi-dimensional covariate domains. RF are particularly suitable for such general regression function estimation due to their use of regression trees, and natural accommodation of higher covariate-dimensionality using random selection. However, unlike fixed basis expansion models which are linear in the parameters, RF estimation is a nonlinear greedy algorithm for which a GLS extension is not straightforward. In the next sections, we will develop a GLS-style RF algorithm for estimating m(X) under dependence.

Revisiting the RF Algorithm
We first review the original RF algorithm. Given data (Y i , X i ) ∈ R × R D , i = 1, . . . , n, the RF estimate of the mean function m(x) = E(Y | X = x) is the average of n tree regression tree estimates of m. In a regression tree, data are split recursively into nodes of a tree starting from a root node. To split a node, a set of M try ( D) features are chosen randomly and the best split point is determined with respect to each feature d by searching over all the "gaps" in that feature of the data as cutoff candidate c := c(d).
Here, "best" is determined as the feature-cutoff combination (d, c) maximizing the CART (classification and regression trees) split criterion (Breiman et al. 1984 where, Y P i , Y R i , and Y L i denote the responses in parent node, right child, and left child, respectively.Ȳ P ,Ȳ R , andȲ L denote the respective node means, n P = n R + n L , n R , and n L are the respective node cardinalities. The feature and cut-off value combination that minimizes the CART-split criterion (2) is chosen to create the child nodes. As each split is only based on one feature, all nodes are hyper-rectangles. Each newly created node is assigned a node representative-the mean of the responses of the node members.
The nodes are iteratively partitioned this way till a prespecified stopping rule is met-we arrive at leaf nodes having single element (fully grown tree) or the number of data points in each leaf node is less than a prespecified number or the total number of nodes reaches a prespecified threshold. The regression tree estimate for an input feature x ∈ R D is given by the representative value of the leaf node containing x. RF is the average of an ensemble of regression trees with each tree only using a resample or subsample of the data.

Dependency Adjusted Node-Splitting
The RF algorithm of Section 2.2 does not utilize any information on the locations i or the ensuing spatial correlation.
The fundamental unit of the algorithm is splitting of a given node, and it is inherently local in nature (in the covariate domain). The split criterion (2) and subsequent assignment of the node representatives are both based only on members within the parent node. For iid data, this local approach is reasonable as members of one node are independent of the others. However, geo-referenced data units data can be distant in the covariate-domain while being close in the spatial domain, and therefore strongly correlated.
The spatial nonlinear mixed model from Section 2.1 can be written more generally as i is not an iid process but a stochastic (Gaussian) Process capturing the spatial dependence. Hence, cov(Y i , Y j ) = cov( i , j ) = 0 for most or all (i, j) pairs implying that members of the other nodes can be highly correlated with those of a node-to-be-split and operating locally (intra-node) leaves out this information. We now propose a new global split-criterion using all the correlations in the data, as is desirable and in line with the common practice in spatial analysis.
To explore generalizations of RF for dependent settings, we first recall an equivalent global representation of the CARTsplit criterion (2). Consider a regression tree grown up to the set of leaf nodes {C 1 , C 2 , . . . , C K } forming a partition of the feature space. The node representatives are the corresponding means. To split the next node, say C K without loss of generality, the CART-split criterion (2) determines the best feature-cutoff combination (d * , c * ) and creates child nodes C K (simply the means of the left and right child nodes respectively). We can write the optimal splitdirection (d * , c * ) and node representativesβ (3) After the split, the new set of nodes is We note but suppress the dependence of Z on (d, c). Then the optimization in Equation (3) can be rewritten in the following way.
This connection of OLS regression with RF is well-known (see, e.g., Friedman et al. 2008 where RF rules are used in downstream Lasso fit). To see why Equations (3) and (4) are equivalent, note that, for a given (d, c) theβ optimizing (4) is simply β OLS = (Z Z) −1 Z Y, the minimizer of the OLS loss Y − Zβ 2 2 . Since the membership matrices have orthogonal columns consisting of only 1s and 0s, the components of β OLS are simply the node means. Hence, the last two components of β areβ (L) K and β (R) K . Since the first (K − 1) columns of Z are same as that of Z (0) , this implies that the first (K−1) components ofβ are the same as that ofβ (0) , that is, the means of the first (K − 1) nodes. So althougĥ β in (4) is a global optimizer of a linear regression re-estimating all the node representatives, in practice, only the representatives of the child nodes of C K are updated and this is equivalent to the local optimization (3).
The global formulation of node-splitting offers an avenue for GLS-style generalization of the split criterion for the spatial GP regression. Let 0 = cov(Y) = cov( ) denote the covariance matrix of the marginalized response, and Q = −1 0 . Then à la GLS we can simply replace the squared error loss Y−Zβ 2 2 with a quadratic loss We refer to Equation (5) as the dependency-adjusted regression tree (DART) split criterion. For a split (d * , c * ) maximizing (5), the new set of node representatives are given by the GLS estimateβ Both the loss used to choose the optimal split and the node representatives now depend on data from all nodes, weighted by the precision (inverse covariance) matrix Q, akin to the spatial linear mixed-model. Even though we preserve the greedy recursive partitioning strategy of RF, for each node split our loss function and node representation are now global in the sense that they consider all the data points, not just the ones inside the node to be split. To demonstrate why the DART loss function and the GLS node representatives are respectively more appropriate for dependent data than the CART loss and node means used in the original RF, we now present a theoretical result and an empirical example. To split a given parent node B into left and right child nodes B L and B R , respectively, based on the split direction (d, c), it is almost immediate that the CART split criterion (2), being the difference between the sample variance of the parent node with the those of the children nodes, is an estimator of its asymptotic limit (7) that is, the population difference between the variance in the parent node and the pooled variance in the potential children nodes. Indeed, if we were privileged to infinite amount of data, we should use Equation (7) to split a node as it minimizes the total intra-node variances in the children. Hence, for iid data, it is reasonable to use the CART criterion (2) which is the finite sample analogue of Equation (7). Under dependence, however, it is unclear if Equation (2) is an efficient estimator of Equation (7).
Our construction of the DART split criterion is guided by the same principles of GLS used in linear regression for dependent data. The resulting loss function (5) depends on data from not only the node B to be split, but on all data, and it is not immediately clear what its population limit is. Optimization of Equation (5) for the optimal split (d * , c * ) relies on the GLS estimate (6) which is also a function of all data. Hence, we first provide a result about its asymptotic limit. We use assumptions on the dependence structure of the error process (Assumption 1) and regularity of the precision matrix (Assumption 2) discussed in more details later in Section 3 on consistency of our method. The proofs of these results can be found in Section S2.1 (supplementary materials).
Assumption 2 (Regularity of the working precision matrix). The working precision matrix Q = −1 admits a regular and sparse lower-triangular Cholesky factor − 1 2 such that and L is a fixed lower-triangular q × q matrix .
Lemma 2.1 shows that the GLS-estimate for the node representatives, although using data from all the nodes, asymptote to within node population means -the same limit as that of the sample node means used in RF. This is a new result of independent importance showing that the GLS estimate for a fixed-partition regression-tree, is a consistent estimator of the node-specific conditional means. This in turn leads to the following result about the DART loss.
Theorem 2.1 (Theoretical DART-split criterion). Under Assumptions 1 and 2, for a tree built with a fixed (data-independent) set of splits C 1 , . . . , C K , as n → ∞ the empirical DART-split criterion (5) to split a node B = C l into left and right child nodes B L and B R respectively, converges almost surely to the following The aforementioned result shows that remarkably the limit of the the DART-split criterion converge to the same respective limit (7) of the CART-split criterion up to a constant α. This result is intriguing as although the empirical DART split criterion depends on the entire set of splits and data from all the nodes, its asymptotic limit is simply the population variance difference (7) between the parent node and the children nodes and does not depend on the other nodes. As discussed before, the quantity (7) would be the ideal one to use for splitting a node if one had knowledge of the population distribution, and it is reassuring that the DART criterion asymptotes to it.
These asymptotic results are in line with the conventional GLS wisdom. GLS estimators are known to have the same asymptotic limit as the OLS estimators but are more efficient under dependence. To demonstrate how these asymptotic results translate to finite sample performance, we conduct a simple experiment using data generated from Y i = m(X i ) + i , where i is an GP with exponential covariance function on the regularly spaced one-dimensional lattice, and m(x) is a function of a single covariate supported on [0, 1] such that m(x) = 1 for x ≤ 0.5, and m(x) = 1.5 for x > 0.5. If we use a two-node decision tree to estimate m, then it is obvious that a good loss criterion should be maximized near the cut-off value of 0.5 where there is a discontinuity in the true regression function. In Figure 1(a), we plot the average CART and DART split criterion as a function of the choice of cutoff, and the point-wise confidence bands. The DART split criterion was scaled by α to have the same asymptotic limit as the CART split criterion. For reference we also plot the true regression function on the secondary y-axis. We see that the average curves for both the CART and DART criterion are quite identical, both peaking near the true cutoff of 0.5. However, the CART criterion using the OLS loss has large uncertainty as reflected by the much wider confidence bands, than the DART criterion. This in turn affects the estimates of the cutoff and the node representatives as reflected in Figure 1(b). While both losses lead to similar mean estimates of the cutoff, and node representatives, the variability is substantially higher for the OLS loss. This large variability is especially evident for the choice of the cutoff where the CART loss can choose cutoff far away from 0.5 with high probability whereas the DART loss chooses a cutoff near 0.5 almost always. We will show in Section 4 over a wide range of simulation studies how this leads to poor finite sample estimation and prediction performance of RF for dependent data.

GLS-Style Regression Tree
The DART split criterion (5) and node representation (6) constitute the fundamental node-splitting operation of a GLSstyle regression tree algorithm that incorporates the spatial information via the correlation matrix. We now introduce additional notation to formally detail the algorithm of this GLS-style regression tree. Creation of a forest from the trees will be discussed in Section 2.5.
For ease of presentation, so far we have talked about the case of splitting the last node (Kth) at a given level of the tree. More generally, we can denote the complete set of nodes in level k − 1 by , we consider the following membership matrices: Z (0) , which corresponds to the nodes in parent level, with the column for the node-to-besplit pushed to the last column, and Z which corresponds to the membership of the potential child nodes based on a split (d, c). Note that Z above is a function of the previous set of nodes C (k−1) , the node to be split l 1 and the split (d, c) all of which is kept implicit. We denote the GLS-style split criterion (5) as Equipped with the notation, the GLS-style regression tree algorithm is presented below:  (5) 13: 17: Representatives β = (β 1 , . . . ,β g (k) ) ← Equation (6) with Note that, if the true covariance matrix 0 is unknown, as in practice, Q will be −1 where is some working covariance matrix (estimate and/or computational approximation of 0 ). We discuss estimation of 0 and choice of Q in Section 2.7. The algorithm works with any choice of the working precision matrix Q. For example, with Q = I, the algorithm is identical to the usual regression-tree used in RF.

RF-GLS
We now focus on growing a RF from n tree number of GLS-style trees. In RF, each tree is built using a resample or subsample of the data D n,t = (Y t , X t ) where Y t is the resampled (or subsampled) vector of the responses and X t is the design matrix using the corresponding rows. For our GLS-style approach, naive emulation of this is not recommended. To elucidate, if one resorts to resampling, under dependent settings, this would mean resampling correlated data units (Y i , X i ) thereby violating the principle of bootstrap. Also, it is unclear what the working covariance matrix would be between the resampled points. If subsampling is used, this is avoided, as one can use the submatrix t corresponding to the subsample. However, each tree will use a different subsample and hence a different t . Inverting covariance matrices of dimension O(n) require O(n 3 ) operations, and this approach would require inverting n tree such matrices, thereby substantially increasing the computation.
Interestingly, the GLS-loss itself offers a synergistic solution to resampling of dependent data. To motivate our approach, we once again revisit RF, and note that the CART-split criterion (4) for a re(sub)sample can be expressed using the squared error loss P t Y − P t Zβ 2 2 , where P t is the selection matrix for the resample. Now GLS loss with Y and Zβ coincides with an OLS loss withỸ = −1/2 Y,Z = −1/2 Z. Hence, the immediate extension for the resample (subsample) in our setup would be using the loss Thus, to use a GLS-loss in RF, we essentially resample the contrastsỸ instead of the outcomes Y. This principle of contrast resampling has been used in parametric bootstrapping of spatial data (Pardo-Igúzquiza and Olea 2012; Saha and Datta 2018a). In our algorithm the resampling amounts to simply replacing the Q in the DART-split criterion (5) and node representative calculation (6) with Q t = − /2 P t P t −1/2 . Computationally, our approach has the advantage of only requiring a one-time evaluation of the Cholesky factor −1/2 that is used in all trees. As in RF, subsequent to growing n tree trees corresponding to each different resample we take the average to get the forest estimate. This completes the specification of a novel GLS-style RF for dependent data. We refer to the algorithm as RF-GLS and summarize it in Algorithm 2. It is clear when Q = I, the node-split criterion, the node representatives, and the resampling step all become identical to RF. Hence, RF is simply a sub-case of RF-GLS with an identity working correlation matrix.

Algorithm 2 RF-GLS
Input: Data D n = (Y 1 , X 1 , . . . , Y n , X n ), working correlation matrix , number of trees n tree , stopping rules for the trees: t n (maximum number of nodes) and t c (minimum number of members per node), number of features considered for each split M try , randomness R for t = 1 : n tree do 5:

Kriging Using RF-GLS With GPs
Subsequent to estimating the regression function m(x) using RF-GLS in the spatial nonlinear mixed model (1), we can seamlessly perform traditional spatial tasks like predictions (kriging) and recovery of the latent spatial random surface w( ). This is because RF-GLS only estimates the mean part, and the covariance is still being modeled using a GP. This facilitates easy formulation of the predictive or latent distribution conditional on the data as standard conditional normal distributions. If L denotes the training data locations, Y = (Y 1 , . . . , Y n ) , and m = ( m(X 1 ), . . . ,m(X n )) where m is the estimate of m from RF-GLS, then prediction at a new location new with covariate x new will simply be given by the kriging estimate where v = cov( ( new ), (L)), = cov( (L), (L)). The prediction equation (11) possess the advantage of nonparametrically estimating the mean function of the covariates while retaining the spatial structure encoded in the GP covariance function which adheres to the philosophy of first law of geography, that is, proximal things are more correlated than distant ones. The prediction framework is completely agnostic to the choice of the covariance function and can work with nonstationary or multi-resolutional covariance functions if deemed appropriate. One can also obtain estimates of the latent surface using the conditional distributions w( ) | Y, X akin to the spatial linear model.
Prediction equations of the form of Equation (11) has been used in applications of RF to spatial settings and has been termed as random forest residual kriging (RF-RK) (Viscarra Rossel, Webster, and Kidd 2014; Fayad et al. 2016). The estimate m in these applications come from a naive application of RF without accounting for the dependence. Our empirical studies in Section 4 will demonstrate how residual kriging using RF-GLS improves over use of RF in dependent settings.
RF-GLS also has several advantages over the RFsp of Hengl et al. (2018), which does not use GP but include pairwise distances between a location and all other locations, as additional covariates. For n locations, this adds n covariates. RF-GLS avoids such unnecessary escalation of the problem to highdimensional settings. Direct use of the GP and the mixed-model framework helps model the spatial structure parsimoniously via the covariance function parameters. Our simulation studies in Section 4 demonstrates the improved prediction performance of RF-GLS over RFsp. Most importantly, RF-GLS separates out the contribution of the covariates and the spatial component, thereby allowing estimation of the regression function m. Estimate of m is not available from the RFsp of Hengl et al. (2018) which only offers a prediction given the covariates and the location.

Practical Considerations
The development of the RF-GLS method has been presented using a working covariance matrix . In practice, the true correlation 0 will not be known and needs to be estimated assuming a parametric form based on the covariance function used. In Section 3 we present the result that both RF and RF-GLS are consistent under dependent errors (akin to both OLS and GLS being consistent for linear models). Hence, for data analysis, we recommend running the first pass of RF on the data to get a preliminary estimate of m, use it to obtain the residuals from which the parameters of can be estimated using a maximum likelihood approach. This once again parallels the practice in linear models where the oracle GLS assuming knowledge of the true covariance matrix is replaced by a feasible GLS where the covariance matrix is estimated based on residuals from an initial OLS estimate.
The second consideration concerns computational scalability of the approach. RF-GLS requires computing the Cholesky factor −1/2 . It is clear from Algorithm 2, that this is only a one-time cost, unlike the possible alternate approach discussed in Section 2.5 where subsampling is conducted before decorrelation, which would lead to computing a different Cholesky factor for each tree. However, spatial covariance and precision matrices arising from GP are dense and for large n, evaluating −1 even once still incurs the computational cost of O(n 3 ) and storage cost of O(n 2 ) both of which are taxing on typical personal computing resources.
Over the last decade, the inventory of approximation techniques attacking the computational weakness of GP has grown and become increasingly sophisticated (see Heaton et al. 2019, for a review). NNGP (Datta et al. 2016a;Finley et al. 2019;Datta et al. 2016b,c) has emerged as one of the leading candidates. Centered on the principle that a few nearby locations are enough to capture the spatial dependence at a given location (Vecchia 1988), NNGP replaces the dense graph among spatial locations with the nearest neighbor graphical model. This was shown to directly yield a sparse Cholesky factor −1/2 that offers an excellent approximation to the original dense −1/2 (Datta et al. 2016a). Software packages implementing NNGP are also publicly available (Finley, Datta, and Banerjee 2021;Saha and Datta 2018b). Hence, for very large data, we recommend using this NNGP sparse Cholesky factor −1/2 instead of −1/2 . This will reduce both the computation and storage cost from cubic to linear in sample size. We will show in Proposition 3.1, that using NNGP for RF-GLS ensures a consistent estimate of m even when the true data generation is from a full Matérn GP.

Consistency
In this section, we present the main theoretical result on consistency of RF-GLS for a very general class of dependent error processes. The outline of the proof highlighting the new theoretical challenges addressed are presented in Section 5 along with some general results of independent importance. The formal proofs are provided in the supplementary materials.

Assumptions
We first discuss Assumptions 1 and 2 and make additional assumptions required for the proof of consistency. In Assumption 1, we focus on absolutely regular or β-mixing processes, since this class of stochastic processes is rich enough to accommodate many commonly used dependent error processes like ARMA (Mokkadem 1988), GARCH (Carrasco and Chen 2002), certain Markov processes (Doukhan 2012) and GPs with Matérn covariance family. At the same time, uniform law of large numbers (ULLN) from independent processes can be extended to this dependent process under moderate restriction on the class of functions under consideration. No additional assumption is required on the decay rate of the β-mixing coefficients (which are often hard to check). Assumption 2 requires the Cholesky factor of the precision matrix to be sparse and regular. Such structured Cholesky factors routinely appear in time series analysis for AR(p) process. For spatial data, exponential covariance family on a 1-dimensional grid satisfies this. Other covariances like the Matérn family (except the exponential covariance) do not generally satisfy this assumption. However, NNGP covariance matrices satisfy this and are now commonly used as an excellent approximation to the full GP covariance matrices (Datta et al. 2016a). Since this assumption is on the working covariance matrix and not on the true covariance of the process, we can always use an approximate working covariance matrix like ones arising from NNGP to satisfy this. We discuss these examples in Section 3.3. Under Assumption 2, for any two vectors x and y, defining x i = y i = 0 for i ≤ 0, we have where α = ρ 2 2 ,Ã 1 ,Ã 2 ⊂ {1, 2, . . . , n} with |Ã 1 |, |Ã 2 | ≤ 2q, γ i,i 's are fixed (independent of n) functions of L and ρ. The expression of the quadratic form in Equation (12) makes it evident that λ max (Q) is bounded as n → ∞. As the third term is a sum of fixed (at most 4q 2 ) number of terms, it is O(1) as long as x and y are bounded.

Assumption 3 (Diagonal dominance of the working precision
Diagonal dominance implies λ min (Q) is bounded away from zero as n → ∞ which is needed to ensure stability of the GLS estimate. We will discuss in Section 3.3 how working correlation matrices from popular time series and spatial processes with regular design satisfy this Assumption. Note that under Assumption 2, checking that the first (q + 1) rows of Q are diagonally dominant is enough to verify Assumption 3.
Assumption 4 (Tail behavior of the error distribution).
We show for Gaussian errors, ζ n needs to be O(log n) 2 which makes the scaling condition in Assumption 4(a) as t n (log n) 9 /n → 0. This is the same scaling used in Scornet et al. (2015) for Gaussian errors and using the entire sample. In general, the choice of ζ n will be dependent on the error distribution. Assumption 4(a), (b) and (c) will all be satisfied by sub-Gaussian errors. As demonstrated in Scornet et al. (2015), additive models provide a rich enough environment to address the asymptotic properties of nonparametric methods like RF sans the additional complexities in controlling asymptotic variation of m in leaf nodes. Since RF is invariant to monotone transformations of covariates (Friedman, Hastie, and Tibshirani 2001;Friedman 2006), without loss of generality, the covariates can be distribution function transformed to be Unif[0, 1] distributed. Hence we assume that the components (functions) m d are supported on [0, 1], implying m is uniformly bounded by some constant M 0 .

Main Result
For the tth tree, the predicted value from our method at a new point x 0 in covariate space is denoted by m n (x 0 ; t , , D n ) where D n = {(x i , y i ) | i = 1, . . . , n} denote the data. Note that the ith data unit corresponds to location i and that the covariance matrix is based on the covariance function evaluated at pairs of locations i and j . But unless otherwise needed we suppress the locations i and simply use the subscript i. used in each tree as well as the choice of random splitting variable for iterative splitting in the tree. For tractability, Scornet et al. (2015) considered subsampling instead of resampling for the theoretical study. In our theoretical study, for analytical tractability of the GLS weights, we consider trees that use the entire set of samples and the randomness t in each tree is only used to choose the candidate set of features for each split. The randomness for each tree are iid, that is, t iid ∼ , ⊥ D n , ∀t ∈ {1, · · · , n tree }. The finite RF-GLS estimate m n,n tree (x 0 ; 1 , · · · , n tree , , D n ) that will be used in practice is given by the sample average of the individual tree estimates. Conceptually, n tree can be arbitrarily large; hence following Scornet et al. (2015), we focus on "infinite" RF-GLS estimate given bym n (x 0 ; , D n ) = E m n (x 0 ; , , D n ) where the expectation w.r.t is conditional on D n . For notational convenience, we hide the dependence of m n , m n,n tree ,m n on and D n throughout the rest of this article. Our main result on L 2 -consistency is stated next, the proof is deferred to Section 5.1.

Theorem 3.1. Under Assumptions 1-5 and if for some
The uniformly bounded (2 + δ)th moment assumption in Theorem 3.1 is needed to generalize uniform laws of large number bounding the GLS estimation error from the iid setting to the dependent setting. We discuss this in details in Section 5.3. The following corollaries discuss three specific cases where this assumption is met. Their proofs are deferred to Section S2.3.
For bounded errors (part (a)), the (2 + δ)th momentbound of Theorem 3.1 is immediately satisfied, and hence consistency can be established without further assumptions. For unbounded errors, a stronger form of diagonal dominance condition is needed in Corollary 3.1 Part (b). This is used to control the (2 + δ)th moment of the data weights arising from the gram-matrix (Z QZ) −1 which in turn ensures the moment-bound. We discuss examples and specific parameter choices ensuring this in Section 3.3. Also note that, the assumption of diagonal dominance is not on the true correlation matrix of the error process and hence is not a restriction on the data-generation mechanism, but rather on the working correlation matrix which is chosen by the user. One can always use parameters in the working correlation matrix that satisfies this (although enforcing this is not needed in practice).
RF is RF-GLS with Q = I. Hence the assumption of Corollary 3.1 part (b) is trivially satisfied. This proves consistency of RF under β-mixing dependence.
To our knowledge, Corollary 3.2 is the first result on consistency of RF under a dependent (β-mixing) error process. Since RF is simply RF-GLS with the working correlation matrix = I, Assumptions 2 and 3 are automatically satisfied, and hence we only need the Assumptions of β-mixing process, tail bounds, and additive model. The consistency result is analogous to the ordinary least-square estimate being consistent even for correlated errors. Besides its own importance, Corollary 3.2 also heuristically justifies the first step used in practical implementation of RF-GLS. The parameters in the working correlation matrix is unknown, and as highlighted in Section 2.7, we use the RF to get a preliminary estimate of m, estimate the spatial parameters using the residuals, and use these estimated parameters in the working correlation matrix for RF-GLS. This is again, analogous to feasible GLS which estimates the working correlation matrix using residuals based on OLS. Corollary 3.2 guarantees that the initial estimator used to obtain the residuals is consistent.

Examples
In this section, we give examples of two popular dependent error processes under which a consistent estimate of m can be obtained using RF-GLS.

Spatial Matérn GPs
Our main example focuses on the spatial nonlinear mixed model using GPs as described in Section 2.6. While many candidates exist for the covariance function of GP, the class of Matérn covariances enjoy hegemonic popularity in the spatial literature owing to its remarkable property of characterizing the smoothness of the spatial surface ( ) (Stein 2012). The stationary (isotropic) Matérn covariance function is specified by where φ = (σ 2 , φ, ν) and K ν is the modified Bessel function of second kind. We consider a Matérn process sampled on one-dimensional regular lattice. This regular design is considered both for tractability of the Matérn GP likelihood but also for ensuring stationarity of the process in the sense required in Theorem 3.1 as for irregular spaced data cov( 1 , 2 ) = cov( 2 , 3 ) whenever 1 − 2 2 = 2 − 3 2 . Such assumptions on the dimensionality and/or regularity of design has been widely used for theoretical studies of spatial processes (Du et al. 2009;Stein et al. 2002). By keeping the gap in the lattice fixed, we are also essentially using increasing-domain asymptotics, as parameters are generally not identifiable in fixed domain asymptotics for Matérn GPs (Zhang 2004).
The error process arising from the marginalization of (1) is the sum of a Matérn process and a nugget (random error) process. We consider half-integer ν ∈ 1.5, 2.5, . . .. This class of processes are popularly studied and used owing to their convenient state-space representation (Hartikainen and Särkkä 2010) which in turn leads to efficient computation of these Matérn GP likelihoods. The state-space representation of halfinteger Matérn GP is equivalent to that of a stable AR(q 0 ) process on the continuous one-dimensional domain with q 0 = ν + 1/2. However, unlike an AR(q 0 ) time series, the Matérn GP when sampled on the discrete integer lattice is no longer an AR(q 0 ) process. Consequently, unlike covariance matrices from AR processes, covariance matrices generated from Matérn GP (expect for exponential GP), do not satisfy the sparsity and regularity of the working correlation matrix of Assumption 2.
Instead, we consider the working correlation to come from the nearest neighbor Gaussian process (NNGP) (Datta et al. 2016a) based on the Matérn covariance. As discussed in Section 2.7, NNGP covariance matrices are one of the most successful surrogates for full GP covariances for large spatial data, reducing likelihood computations from O(n 3 ) to O(n). What is important for the theoretical study is that an NNGP is constructed by sequentially specifying the conditional distributions as When the locations are the integer grid, N q (i) becomes {i − 1, . . . , i − q}, and the NNGP construction is akin to an AR(q) process. Consequently, the Cholesky factor −1/2 from NNGP on an integer lattice satisfies Assumption 2 with ). This ensures the following consistency result of RF-GLS fitted with NNGP for data generated using Matérn GP. The proof is in Section S2.6.
One observation is central to the proof. The half-integer Matérn GP, which is an AR(q 0 ) process in the continuous domain, when sampled on a discrete lattice becomes an ARMA process ((Ihara 1993) Theorem 2.7.1). This will establish absolutely regular mixing of these Matérn processes using the result of Mokkadem (1988) on ARMA processes, and subsequently consistency of RF-GLS by Theorem 3.1.

Autoregressive Time Series
Our main focus in this article, is estimation of nonlinear regression function in the spatial mixed model (1). However, the scope of our RF-GLS algorithm is much broader. It can be used for functional estimation in the general nonlinear regression model Y i = m(X i ) + i where i is a dependent stochastic process with valid second moment. The method only relies on knowledge of an estimate of the residual covariance matrix = cov( ). The general consistency result (Theorem 3.1) is also not specific to the spatial GP setting, and only relies on general assumptions on the nature of the dependence, tail bounds of the error, and structure of the working correlation matrix. In this section, we demonstrate that RF-GLS can be used for consistent function estimation for time series data, that is, where i models the serial (temporal) correlation. In particular, we discuss consistency of RF-GLS for Autoregressive (AR) error processes, one of the mainstays of time series studies. An AR(q) model can be written as follows: where η i is a realization of a white-noise process at time i. AR processes are β-mixing (Mokkadem 1988), they also produce a banded Cholesky factor of the precision matrix as required in Assumption 2. Hence, we have the following assertion. Its proof is in Section S2.6.
Proposition 3.2. Consider a time series Y i = m(X i ) + i where i is the time, m satisfies Assumption 5, i denote a sub-Gaussian stable AR(q 0 ) process. Let denote a working correlation matrix from a stationary AR(q) process. Then RF-GLS using produces an L 2 consistent estimate of m if 1. q = 1 and the working autocorrelation parameter ρ used in satisfies |ρ| < 1 (for bounded errors) or |ρ| < 1/(2 √ 2) (for unbounded errors). 2. q > 1 and the AR(q) working precision matrix Q = −1 satisfies Assumption 3 (for bounded error) or min i Q ii > √ 2 max i j =i |Q ij | (for unbounded error).
We separate the results for q = 1 and q ≥ 2 since unlike AR(1), for general AR(q) it is challenging to derive closed form expressions of the constraints on the parameter space needed to satisfy Assumption 3 or the stronger diagonal dominance condition in Proposition 3.2 part 2 (for unbounded errors). However, verifying these conditions for a given AR precision matrix Q is straightforward due to stationarity and banded structure. One only needs to check the first q + 1 rows of Q irrespective of the sample size n. The necessary condition for row q + 1 is where κ equals 1 for bounded errors and equals √ 2 for unbounded errors. Additional checks are only needed for the first q rows of Q.
In practical implementation of AR processes, the order of the autoregression is often chosen based on analysis of the autocorrelation function of the residuals and may not equal the true order of the autoregression. Proposition 3.2 accommodates this scenario by not restricting the working autoregressive covariance to be of the same order or have the same coefficients as the ones generating the data.

Illustrations
We conduct simulation experiments to demonstrate the advantages of the RF-GLS over competing methods for both estimation and prediction in finite samples. We simulate data from the spatial nonlinear mixed model of Equation (1). We consider the following choices for the true mean function m 1 (x) = 10 sin(π x) and m 2 (x) = (10 sin(π x 1 x 2 ) + 20(x 3 − 0.5) 2 + 10x 4 + 5x 5 )/6 (Friedman function, Friedman 1991). w( ) is an exponential GP on a two-dimensional spatial domain with 27 different combinations of covariance parameters spatial variance σ 2 , spatial decay φ and error variance τ 2 as a % of the spatial variance. For each setting, we perform simulations for 100 replicate datasets. To evaluate prediction performance, we keep 10% hold-out data. Details of the parameter choices and hold-out design are in Section S1.1.
For evaluating estimation performance, we consider RF (which does not use spatial information), RF-GLS (Oracle) using true covariance parameter values, and RF-GLS which obtains an estimate of the covariance parameters from RF residuals using the BRISC package (Saha and Datta 2018b) and uses them in RF-GLS. The estimation performance of the methods are evaluated based on a discrete Mean Integrated Square Errors (MISE) for m, evaluated over uniformly generated data points from the covariate space that provides a good coverage of the entire space. MISE for the estimated functionm is given as: where x 1 , . . . , x n 0 are a dense set of points in the covariate domain (See Section S1.1 for details). We observe that the RF-GLS performs at par with RF-GLS (Oracle) which assumes knowledge of the true spatial parameters (supplementary material, Section S1.2). As in reality, we won't have knowledge of the true parameters, for the rest of the article we will be comparing the competing methods only with RF-GLS, where model parameters are unknown and estimated from RF residuals. We present the estimation and prediction results for m = m 2 . The results for m = m 1 are similar, and are provided in the supplementary materials, Section S1.3.
The median MISE over 100 simulations for all the 27 setups are shown in Figure 2(a). RF-GLS outperforms RF across all the scenarios demonstrating that exploiting the spatial information substantially improves estimation performance over the vanilla RF. Additionally, we also notice that as the strength of the spatial variation increases (σ 2 goes from 1 to 10), the ratio of MISE of RF and RF-GLS increases indicating a greater gain in terms of MISE.
We consider five candidates for evaluating prediction performance: RF, RF-RK (Fox, Ver Hoef, and Olsen 2020), RF-GLS, RF-Loc (the 2-D spatial locations of the data are used as additional covariates in RF to account for the spatial structure), and RFsp (Hengl et al. 2018). Among these, RF is the only one that does not use spatial information and only accounts for the mean. For RFsp, we used unordered pairwise Euclidean distances as additional covariates in RF to account for the spatial structure in the data. The prediction performance are evaluated based on the relative mean squared error (MSE) for the test data. MSE measures the average squared difference between the estimated values and the actual value of the response. Relative MSE helps compare MSE across different simulation setups, while standardizing by the variance of the response. Figure 2(b) shows the median relative MSE over 100 simulations for all the 27 setups. Expectedly, all methods performed better than RF, as unlike the other methods, RF does not use any spatial information. RF-GLS performs best or at par with the best across all settings.
RF-Loc and RFsp belong to the class of approaches that add extra spatial covariates to RF. RF-Loc always performed worse than RFsp and RF-GLS. However, when the spatial variance (σ 2 ) is high, RFsp performs comparably to that of RF-GLS, both of which outperform the others. For lower σ 2 , RF-GLS significantly outperforms all the other methods. In this case, RFsp performs worse than RF-RK which is now the second best method.
We conducted an in-depth study in Section S1.4 to understand the performance of RFsp. We summarize the findings here. Mentch and Zhou (2020) showed that for iid errors, adding additional noise covariates can improve prediction performance of RF for low (covariate-)signal-to-(random-)noise ratio (SNR). For high SNR, the trend was reversed and using extra noise covariates (when uncorrelated with the true covariates) did not help. For our setting of spatially correlated errors, akin to how noise covariates can be added to RF to explain random variation, RFsp adds distance-based covariates to explain spatial variation. The appropriate quantity to understand the performance of RFsp would be the (covariate-)signal-to-(spatial-)noise ratio SNR = var(m(X))/var(w( )) = var(m(X))/σ 2 . This SNR is low when σ 2 is large and vice versa ( Figure S3(a)).
RFsp adds n pairwise distance-based covariates to RF to explain the spatial variation. If D denotes the number of true covariates, then RFsp uses a total of n + D covariates in RF and the probability to include a true covariate in the M try set of candidates for splitting a node becomes vanishingly small when n D. When σ 2 is small (high SNR), the true covariates predominantly dictates the variation in the outcome but is rarely selected in RFsp ( Figure S3(b)) resulting in its poor performance. In fact for high SNR, RFsp performs even worse than RF-RK which can estimate m reasonably well in this setting despite ignoring the spatial dependence as the covariate signal dominates. For low SNR (σ 2 = 10), as the spatial contribution increases, RFsp (being naturally equipped to capture the spatial correlation in the data) performs comparably to RF-GLS ( Figure 3). For low SNR, RF-RK, ignoring the dominant spatial variation when estimating m, performs worse than both RF-GLS and RFsp.
RF-RK and RFsp outperforms each other in two ends of the SNR spectrum. RF-GLS combines the strengths of both, using RF to estimate a nonlinear covariate effect while parsimoniously modeling the spatial effect using a GP specified by only 2 or 3 parameters thereby avoiding introduction of a large number of spatial covariates. Consequently, RF-GLS produce comparable or better results to both RF-RK and RFsp across the SNR spectrum (Figure 3).
Section S1 of the supplementary materials file contains a number of additional simulation studies. These include results for larger sample size (Section S1.5, supplementary material), mean function with more covariates (Section S1.6, supplementary material), misspecified GP smoothness (Section S1.7, supplementary material), and misspecified entire spatial effect (not generated from a GP, Section S1.8, supplementary material). For each study, and all choices of data-generation parameters, RF-GLS performed as the best or comparably with the best method across all scenarios for both estimation and prediction.

Proof of Consistency
For studying RF-GLS with dependent data, we adopt the framework of consistency analysis of nonparametric regression (introduced in Nobel et al. 1996, generalized in Györfi et al. 2002. In Scornet et al. (2015), the authors also adopted this framework to prove consistency of RF for iid errors. After presenting an informal outline of the consistency argument in Györfi et al. (2002), we provide a road map of how we extend different pieces of this argument for RF-GLS with dependent data, and highlight additional technical challenges that we resolved in this work. All formal proofs are provided in the supplementary materials. Györfi et al. (2002) considered a least-square estimator of the form where F n is a carefully chosen, data-dependent function class which is large enough to control approximation error (i.e., how well the true function is estimated by the function class), and small enough to control estimation error (i.e., how far the estimate m n is from the representative of F n closest to m) in the sense that a Uniform Law of Large Number (ULLN) holds on this class. m n is consistent if both errors vanish asymptotically. In order to use powerful exponential inequalities which hold for classes of bounded functions, the proof of L 2 -consistency in Györfi et al. (2002) uses a standard truncation argument in probability theory with a diverging sequence of truncation thresholds {ζ n }. They first show that if uniformly over the class T ζ n F n of truncated functions in F n , 1. Approximation Error of m by a data-driven class F n containing m n is small, that is, Estimation Error is small so that a ULLN holds over F n for squared error loss, that is, then the truncated estimator T ζ n m n is L 2 -consistent for f . Then they extended the consistency guarantee from truncated to the original estimators by showing that the truncation error vanishes under suitable tail decay assumptions on the error distribution.
In the analysis of RF-GLS, there are two main challenges. The error process i is no longer an iid process but a stochastic process capturing the dependence. Also, we work with a quadratic loss . . , f (X n ) − Y n . Addressing these require nontrivial generalizations of each of the above pieces.

Consistency of Quadratic Loss Optimizers in Data-Driven Function Classes Under Dependent Errors
Our main technical statement (Theorem 5.1) is a generalization of (Györfi et al. 2002, Theorem 10.2) to the setting of dependent error processes and quadratic loss functions. Let D n = (X 1 , Y 1 ), . . . , (X n , Y n ) be the data where Y i = m(X i ) + i . With randomness parameter , let F n = F n (D n , ) be a data-dependent class of functions. We will consider an optimal estimator m n ∈ F n with respect to quadratic loss: This simply states that m n is the GLS estimate with respect to the working precision matrix Q in the class F n . This is analogous to the OLS assumption used in Györfi et al. (2002). When F n is the class of piecewise constant functions on the partitions generated by a regression tree, m n is our GLS-style regressiontree estimate. Hence studying estimators of the form (16) more generally suffices to prove consistency of GLS-style regression tree and henceforth of RF-GLS. We now state a general technical result establishing L 2consistency of such GLS estimators m n under β-mixing (absolutely regular) error processes. This is a sufficiently large class of processes that includes spatial Matérn GP and autoregressive time series as discussed in Sections 3.3.1 and 3.3.2. The result is applicable beyond RF to more general nonparametric GLS estimators from dependent data using other suitable function classes.
Theorem 5.1. Let { i } be a stationary β-mixing process satisfying Assumption 1, and the matrix Q satisfies Assumptions 2 and 3. Let m n (., ) : R D → R denote a quadratic-loss optimizer (with respect to Q) of the form (16) in a data-dependent function class F n . If m n and F n satisfies the following conditions: (C.1) (Truncation error) ∃ {ζ n } such that lim n→∞ ζ n = ∞ and ζ 2 n /n → 0, such that we have, . . , n} be identically distributed as D n but independent of D n . Define f (Ẋ) andẎ similar to f (X) and Y. Then, we have for all arbitrary L > 0 Then we have lim n→∞ E E X (m n (X, ) − m(X)) 2 ) = 0, and lim n→∞ E X (m n (X) − m(X)) 2 = 0; wherem n (X) = E m n (X, ) and X is a new sample independent of the data.
The proof is deferred to Section S2.3. Theorem 5.1 is a result of independent importance as it is a general statement on L 2 consistency of a wide class of GLS estimates under β-mixing dependent errors. Besides data-driven-partitioning-based estimates like RF or RF-GLS, it can be used to study properties of histograms, kernel-density estimates, local polynomials, etc., under β-mixing error processes. The second part of the theorem states that the consistency also holds for an ensemble estimator m n that averages many such estimates m n each specified with random parameters . This will be used to show consistency of the RF-GLS forest estimate subsequent to showing consistency of each RF-GLS tree estimate.

Approximation Error
The condition C.2 of asymptotically vanishing approximation error ensures that as sample size increases, the growing class of approximating functions (e.g., piece-wise constant functions in the case of regression trees) is rich enough to approximate the target function. In earlier works on consistency of RF, approximation error was controlled under a stringent assumption of vanishing diameter of leaf nodes. Scornet et al. (2015) replaced this by a condition that the variance of Y (or equivalently, variation of m) within a leaf node of a regression-tree vanishes asymptotically, and verified this condition for RF. There are two steps to show this. (i) Establish a theoretical or population-level split-criterion-an asymptotic limit of the empirical split-criterion used in practice, such that variation of m in the leaf-nodes of a hypothetical regression tree generated using the theoretical criterion is small. (ii) Establish stochastic equicontinuity of the empirical splitcriterion, such that if two set of qualifying splits Z (1) and Z (2) are close, their corresponding empirical split-criterion values are close, irrespective of the location of the splits.
For our RF-GLS trees, the partitioning is driven by the DART criterion (5) and is different from the CART (2) criterion used in the RF trees. Since the GLS loss and estimator involves the matrix Q, they are not available in simple scalar expressions unlike the OLS loss (sum of squares) and estimator (mean response within a node). So we address a number of technical challenges for steps (i) and (ii) that do not appear in the analysis of RF.
For (i), Lemma 2.1 and Theorem 2.1, establishes that the DART split-criterion remarkably has the same limit of as that for the CART criterion. Hence, variation of m in trees generated by this theoretical criterion is controlled in the same way as for RF.
For (ii), we require an entirely new and involved proof of equicontinuity for the DART-split criterion of RF-GLS as the previous arguments of Scornet et al. (2015) do not immediately generalize for RF-GLS loss function (5). We discuss the new contributions in Section 5.2.1.

Equicontinuity of the Split Criterion
Equicontinuity of the CART-split criterion 1 was the centerpiece of the theory in Scornet et al. (2015), requiring involved but elegant arguments on the geometry of splits. Since the CART criterion only concerns the parent node to be split and its potential two child nodes, the equicontinuity essentially boiled down to showing closeness of the respective means and variances of these three nodes for the two sets of splits. These three scalar mean and variance differences are functions of the difference in volumes of the respective nodes which goes to zero uniformly as the splits come closer.
For RF-GLS, to update each node, the entire set of node representatives get updated via the GLS-estimate (6) which is analytically intractable due to the matrix inversion. Also, the DART-split criterion (5) is a quadratic form of the plugged-in GLS-estimate and thus a function of the representatives of all nodes and not just the 3 nodes as in RF. Our equicontinuity proof is built on viewing GLS predictions as oblique projections on the design matrices corresponding to the splits. We consider two scenarios: R1-where for at least one set of split, both potential child nodes have substantial volumes, and R2-where for each set of split, one potential child node have ignorable volume. Under R1, all nodes for at least one set of split have substantial representation ensuring that the gram matrix has norm bounded away from zero and equicontinuity can be established using perturbation bounds on orthogonal projection operators (Chen, Chen, and Li 2016). Under R2, this will not be the case as for both set of splits one child node will have small volume. Instead, we first show that difference of the DART-split criterion at the previous level of (parent) splits is small using the same perturbation argument as the parent node volumes are bounded away from zero. Subsequently we argue that the creation of the children nodes do not change the split criterion substantially as one of the child nodes is essentially empty. This new matrix-based proof for equicontinuity also circumvents the need to invoke mathematical induction as required in Scornet et al. (2015).
Proposition 5.1 (Equicontinuity of empirical DART-split criterion). Under Assumptions 2-4(b) and (c), the DART split criterion (5) is stochastically equicontinuous with respect to the set of splits.
The proof is involved, and is deferred to Section S2.1 where the Proposition is more technically phrased using additional notation on splits. Subsequent to proving equicontinuity, we can prove the following result of approximation error Proposition 5.2. Let F n = F n ( ) is the set of all functions f : [0, 1] D → R, piecewise constant on each cell of the partition obtained by an RF-GLS tree. Then, under Assumptions 1-5, the class T ζ n F n satisfies the approximation error condition (C.2).

Estimation Error
Theorem 5.1 shows that for GLS estimators like Equation (16), one needs to control the quadratic form estimation error (ULLN C.3). A common technique for proving ULLNs under dependence is (i) to prove an analogous ULLN for iid error processes, and then (ii) use mixing conditions to generalize the result for the dependent process of interest.
Both steps require addressing new challenges for RF-GLS which we discuss in the next two subsections.

Cross-Product Function Classes
For step (i), it is difficult to directly state an iid analogue of the ULLN C.3 as it is not possible to have an error process { * i } which is simultaneously iid, satisfies * i ∼ i and E(y Qy) = E(y * Qy * ). To see this, simply note that as cov( . Instead, we create separate iid analogues of ULLN for each term in the expansion of the quadratic form, that is, both the squared error and the crossproduct terms. These ULLN are stated as condition C.3.iid in the supplementary materials. Condition C.3.iid(a) for the squared (diagonal) terms is the standard squared error ULLN using the iid error processes, and has been proved in Györfi et al. (2002) Theorem 10.2 generally, and in Scornet et al. (2015) in particular for RF.
For the cross-product (off-diagonal) terms, the ULLN is stated in Condition C.3.iid(b) of the supplementary material, and is to our knowledge a novel strategy. The analysis of RF under iid settings in Scornet et al. (2015) only used a squared error loss which does not involve any cross-product terms and hence did not need to prove any ULLN for cross-product terms. We construct separate ULLN for the cross-product terms i q i,i−j i i−j for each lag j = 1, . . . , q. As mentioned earlier, use of the univariate copy { * i } will not allow to generalize to the corresponding term in C.3. This is because * i ⊥ * i−j but i and i−j are correlated. Instead, for each lag j = 1, . . . , q, we create bivariate iid sequences (˜ i ,¨ i−j ) that are identically distributed as the joint (bivariate) distribution of the pairs ( i , i−j ), but are independent over i. We thus exploit the banded nature of the working precision matrix Q to prove iid ULLNs for crossproduct terms at each of the q lags. Combining these with the ULLN for the squared terms gives us an ULLN for the entire quadratic form.
Formulation and establishing this cross-product ULLN using lag-specific bivariate iid copies is a new contribution and is of independent importance for establishing vanishing limits of any estimation error that involves interaction terms. We prove this ULLN in Proposition S2.1 of the supplementary material by showing that cross-product function classes has the same concentration rate as that for squared-error function classes with respect to the random L p norm entropy number.

ULLN for β-Mixing Processes
For step (ii), we need to go from Conditions (C.3.iid)(a) and (C.3.iid)(b) for iid processes to their analogs for dependent error processes, which would then immediately establish C.3. It has been shown that the mixing condition of the stochastic process determines the assumptions required on the class F n (Dehling and Philipp 2002). If we look at the "hierarchy" of dependence structures, strong-mixing or α-mixing (Bradley 2005), is one of the broadest family of dependent processes accommodating dependent structures "furthest" from independence. However, existing ULLN results for α-mixing processes require the class of functions in F n to be Lipschitz continuous (Dehling and Philipp 2002). As regression-tree estimates are inherently discontinuous due to nature of discrete partitioning, this will not be satisfied here.
Hence, we focus on absolutely regular or β-mixing process. This class of mixing processes is rich enough to include a number of commonly used spatial or time-series structures as discussed in the examples of Section 3.3. Our main challenge here is that no existing ULLN for β-mixing processes apply to the class of functions on partitions of the RF-GLS trees. ULLN for Glivenko-Cantelli classes under β-mixing was established in Nobel and Dembo (1993). Similar results have been established for a class ofφ-mixing processes in Peligrad (2001). Both results, do not need any convergence rate on the mixing coefficients, but require the class F n to have an envelope (dominator) F (free of n). For data-driven partitioning based estimates like RF or RF-GLS trees such uniform envelopes are not available (as the envelop is the truncation threshold ζ n → ∞). Instead, we propose an ULLN for dependent processes that uses a weaker assumption of a n-varying envelop with a moment-bound. The proof is deferred to Section S2.2 (supplementary material).
Proposition 5.3. (A general ULLN for β-mixing processes) Let {U i } be an R d -valued stationary β-mixing process. Let G n ({U i } n−1 i=0 ) be a class of functions R d → R with envelope G n ≥ sup g∈G n |g|, such that G n is "uniformly mean integrable", that is, Let {U * i } be such that U * i is identically distributed as U i , ∀ i and U * i ⊥ U * j ; ∀i = j. Then, sup g∈G n 1 n Proposition 5.3 ensures that ULLN for iid errors is enough to generalize to β-mixing error processes as long as the function classes are contained within a sequence of mean uniform integrable envelopes in the sense of Equation (17). Next, we show that the ULLN holds for RF-GLS trees under a (2+δ)th moment assumption that is sufficient for mean uniform integrability.
Proposition 5.4 (Estimation error for RF-GLS). Let F n = F n ( ) is the set of all functions f : [0, 1] D → R, piecewise constant on each cell of the partition obtained by an RF-GLS tree. If any subsetF n ⊆ F n satisfies the following condition: (C.4) (Moment bound:) ∃ an envelope F n sup f ∈F n |f |, such that lim n→∞ E 1 n i |F n (X i )| 2+δ < ∞ for some δ > 0, then T ζ nF n satisfies the ULLN (C3) for β-mixing error processes, and under Assumption 2.
The proof is in Section S2.2. The (2+δ)th moment condition C.4 is easier to verify for RF or RF-GLS as discussed in Corollaries 3.1 and 3.2.

Proof of Theorem 3.1
Equipped with Theorem 5.1, we can prove Theorem 3.1 by showing that RF-GLS meets the conditions C.1-C.3. Proposition S2.3 in the supplementary material) shows that the truncation error condition C.1 is met for any ζ n satisfying the scalings of Assumption 4(a) required for proving the approximation and estimation error conditions.
We have already shown conditions C.2 (Proposition 5.2) and C.3 (Proposition 5.4) for two separate choices of function classes. The last step of the proof is choosing the function class that satisfies both conditions. Let F n = F n ( ) be the set of all functions f : [0, 1] D → R piece-wise constant on each cell of the partition P n ( ) created by an RF-GLS tree with data D n and randomization . We have already shown in in Proposition 5.2 that F n satisfies C.2. To apply Proposition 5.4, F n also needs to satisfy the moment condition of C.4. As T ζ n F n is only bounded by ζ n which goes to ∞, clearly F n will not satisfy C.4 .
We carefully carve out a subclassF n ⊆ F n which is still wide enough to satisfy the approximation error condition (C.2), while satisfying the additional restriction (C.4). For a given partition P n ( ), we defineF n as follows: Since by construction of RF-GLS, m n is the optimizer over a much larger set F n ( ), trivially m n is also the optimizer inF n . The first step of the proof of Proposition 5.2 makes it evident why Condition C.2 will also hold for this smaller classF n . To apply Proposition 5.4 and show Condition C.3, the final piece is to show that the condition (C.4) is satisfied by F n . Since apart from m n , F n consists of functions that are bounded by M 0 , we can have the envelope to be F n = |m n | + M 0 . Hence, for Condition (C.4) to hold, it is enough to show lim n→∞ 1 n i E|m n (X i )| 2+δ < ∞ which is an assumption of the Theorem.

Discussion
We considered nonlinear regression function estimation in the spatial mixed model and developed a random forest method (RF-GLS) for estimating the nonlinear covariate effect, while still modeling the spatial effect using GPs, as is conventional. Retaining the GP framework facilitates parsimonious encoding of structured spatial dependence, and conducting all traditional spatial tasks like kriging (prediction at a new location), and recovery of the latent spatial random effect surface. We show in Section 4 how these advantages of RF-GLS manifest into superior estimation performance over naive RF that does not use any spatial information, and superior predictive performance over many existing RFsp methods.
Our method RF-GLS, more generally, can be used for functional estimation under many types of dependence. RF-GLS uses the same fundamental principle that generalizes OLS to GLS. We show how adapting the concept of GLS in RF synergistically mitigates all the issues encountered in naive application of RF to dependent settings. While simple in principle, RF-GLS algorithm differs inherently from RF, by optimizing globally (across all nodes) for each split, to account for dependence across all data points. We show RF is a special case of RF-GLS with an identity working correlation matrix, and is substantially outperformed by RF-GLS for both estimation and prediction under dependence.
We present a thorough theoretical study establishing consistency of RF-GLS under a general assumption of β-mixing dependence. In particular, we establish consistency of function estimation by RF-GLS for the spatial nonlinear mixed-model using the ubiquitous Matérn GP. We also establish consistency of RF-GLS for functional estimation under auto-regressive timeseries errors. Finally, as a byproduct of the theory, we also establish consistency of RF for dependent settings, which to our knowledge, is the first such result.
The theoretical results required involved proofs to address the challenges of accommodating dependent error processes and use of a quadratic form loss for node-splitting. In the process, we developed a number of tools of independent importance. The general result (Theorem 5.1) on L 2 consistency of data-driven partitioning-based GLS estimates under β-mixing dependence extends the analogous result in Györfi et al. (2002) Theorem 10.2 which was for iid settings and OLS estimates. This will be useful for studying other function estimators like local polynomials under dependence. Proposition S2.2 establishes random-norm entropy number concentration bounds for general cross-product function classes, which can find its use in establishing ULLN for any class of functions containing interaction terms. Finally, Proposition 5.3 proposes a general ULLN for β-mixing processes requiring less restrictive assumptions than existing results.
For future work, extension to multivariate outcomes is an important direction. The spatial community is increasing shifting toward multivariate analysis, as GIS systems are empowered to collect data on many variables. Multivariate extension of RF-GLS can leverage the rich literature of multivariate cross-covariance functions for GP (Genton and Kleiber see 2015, for a review). When working with a very large number of variables p, a potential computational roadblock for RF-GLS will be evaluation of the np × np Cholesky factor. Strategies like graphical models may need to be adopted to enforce sparsity and effectuate computational speedup.

Supplementary Materials
An online supplementary materials file contains additional data analyses, and proofs of all the theoretical results.