Local Gaussian Process Approximation for Large Computer Experiments

We provide a new approach to approximate emulation of large computer experiments. By focusing expressly on desirable properties of the predictive equations, we derive a family of local sequential design schemes that dynamically define the support of a Gaussian process predictor based on a local subset of the data. We further derive expressions for fast sequential updating of all needed quantities as the local designs are built up iteratively. Then we show how independent application of our local design strategy across the elements of a vast predictive grid facilitates a trivially parallel implementation. The end result is a global predictor able to take advantage of modern multicore architectures, providing a nonstationary modeling feature as a bonus. We demonstrate our method on two examples using designs with thousands of data points, and compare to the method of compactly supported covariances. Supplementary materials for this article are available online.


Introduction
The Gaussian process (GP) is a popular choice for emulating computer experiments (Santner et al., 2003). As priors for nonparametric regression, they are unparalleled; they are rarely beaten in out-of-sample predictive tests, have appropriate coverage, and have an attractive ability to accomplish these feats while interpolating the response if so desired. That said, GPs do have disadvantages, and our aim is not to make an exhaustive or even representative list of them here. But perhaps the two most common criticisms are that (1) computation (for inference and prediction) scales poorly as data sets get large, and (2) best results require an assumption of stationarity in the data-generating mechanism, which may not be appropriate. The second issue is usually coupled with the first. Emulators relaxing the stationarity assumption exist, but often require further computational expense. Examples include learning mappings from the original input space to one where stationarity reigns (e.g., Schmidt and O'Hagan, 2003, and references therein), and process convolution approaches that allow the kernels to vary smoothly in parameterization as an unknown function of their spatial location (e.g., Paciorek and Schervish, 2006, and references therein).
Several works in the recent literature have made inroads in reducing the computational burden for stationary models, usually by making some kind of approximation. Examples include searching for a reduced set of "pseudo-inputs" (Snelson and Ghahramani, 2006), building up estimators iteratively in batches (e.g., Haaland and Qian, 2011), sequentially updating via particle methods tailored to design applications (Gramacy and Polson, 2011), fixed rank kriging (Cressie and Johannesson, 2008), and the use of compactly supported covariance (CSC) matrices and fast sparse linear algebra packages (Kaufman et al., 2011; Sang and Huang, 2012). Our contribution has aspects in common with all of these approaches and several others by association (e.g., Nychka et al., 2002; Quiñonero-Candela and Rasmussen, 2005; Furrer et al., 2006). It is reminiscent of ad hoc methods based on local kriging neighborhoods (e.g., Cressie, 1991, pp. 131-134). What we propose is more modern, more scalable, and more easily automated, which makes it ideal for multicore computing, and tailored to computer experiments. It draws in part on recent findings for approximate likelihoods in spatial data (e.g., Stein et al., 2004), and active learning techniques (e.g., Cohn, 1996).
We start by focusing on the prediction problem, locally, at an input x. In computer model emulation this is the primary problem of interest. We recognize, as many authors have before us, that data points far from x have vanishingly small influence on the predictive distribution (assuming the usual choices of covariance function). Thus, even though a large covariance matrix may have been calculated and inverted at great computational expense, not many of its rows/columns contribute substantively to the linear predictor. We focus on finding those data points (relative to x), and corresponding rows/columns, without considering the full matrices. In particular, we illustrate how a sensible objective criterion reduces to a key component of an active learning heuristic that has found recent application in the design of computer experiments literature (see, e.g., Gramacy and Lee, 2009).
Our localized sub-designs differ from a local nearest neighbor (NN) approach, and yield more accurate predictions for a fixed local design size. This echoes a result by Stein et al. (2004), who observe that NNs may not be ideal for learning correlation parameters. We extend that to obtaining predictions, although we recognize that a modern take on NN can represent an attractive option, computationally, in many contexts. Our methods (in which we include an updated NN) are robust, in a certain sense, to the covariance parameterization. First, they are applicable with any differentiable correlation structure. Second, we deliberately choose a very naïve one and nonetheless compare favorably to others having more thoughtful choices (e.g., Kaufman et al., 2011). The explanation is that we leverage a globally nonstationary effect that compensates for an overly smooth/simplistic local structure.
There are strong parallels between our suggested approach and CSC, although there are two key distinctions. The first is that our method is purely local, whereas a substantial innovation in CSC is to take a local-global hybrid approach. CSC estimates a global mean structure via a polynomial basis in order to "soak up" large-scale variabilities in the spatial field. A local correlation structure is estimated on the residuals from the global fit. Second, we are not applying any kind of tapering of influence in learning that local structure. We think of tapering as a continuous or smooth operation which would retain most of the features, e.g., stationarity, of an un-tapered analog. However, our approach is more discrete and therefore could not retain stationarity or smoothness. Our main advantage is that accuracy can be explicitly linked to a computational budget: we know exactly how "sparse" our matrices will be before running the code, and we can monitor convergence to the (intractable) full-data counterparts. Also, since we focus on particular locations x, independent of others, our method is highly parallelizable. Local inference for many x's can proceed without communication between nodes, meaning that OpenMP pragmas can yield nearly linear speedups. A disadvantage is that a lack of global structure means that our approach may not be ideal for estimating stationary fields when computational resources are abundant.
The remainder of the paper is outlined as follows. In Section 2 we review GP predictive modeling, inference and prediction. In Section 3 we derive a new sequential design criterion for greedily building local designs by considering future mean-squared prediction error (MSPE). We also derive expressions for the efficient updating of all required quantities in order to keep computational costs down, and argue that a simpler heuristic related to one from the active learning literature may prove to be even more efficient. Then in Section 4 we consider the global prediction problem, for many x, and discuss global inference for the correlation structure. Finally, all of the elements are brought together in Section 5 for empirical illustrations and comparisons to CSC. We conclude with a discussion in Section 6. An implementation of our methods can be found in the laGP package (Gramacy, 2013) for R.

Gaussian process predictive modeling
A GP prior for functions Y : R^p → R, where any finite collection of outputs is jointly Gaussian (Stein, 1999), is defined by its mean µ(x) = E{Y(x)} and covariance C(x, x') = E{[Y(x) − µ(x)][Y(x') − µ(x')]}. A common simplifying assumption in the computer modeling literature is to take the mean to be zero, and this is the convention we will follow. Typically, one separates out the variance τ² in C(x, x') and works with correlations K(x, x') = τ^{-2} C(x, x') based on Euclidean distance and a small number of unknown parameters. In this paper we focus on the isotropic Gaussian correlation K_{θ,η}(x, x') = exp{−||x − x'||²/θ} + ηδ_{x,x'}, where θ is called the lengthscale parameter, and η the nugget.
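As a concrete reference for the notation above, here is a minimal NumPy sketch of the isotropic Gaussian correlation. The paper's own implementation is the laGP package for R; the function name here is ours, for illustration only.

```python
import numpy as np

def gauss_corr(X1, X2, theta, eta=0.0):
    """Isotropic Gaussian correlation K(x, x') = exp{-||x - x'||^2 / theta},
    with nugget eta added on the diagonal when both arguments are one design."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    K = np.exp(-d2 / theta)
    if X1 is X2:
        K = K + eta * np.eye(X1.shape[0])
    return K

X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]])
K = gauss_corr(X, X, theta=2.0, eta=1e-6)
```

With η > 0 the diagonal entries are 1 + η, giving a near-interpolative surface as discussed next.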
In what follows we hold η at a pre-determined small value in order to obtain a near-interpolative surface, which is appropriate when the computer model is deterministic. It is important to note that our main results do not heavily depend on the choice of correlation function so long as derivatives (with respect to θ) are available. The isotropic, one-parameter (θ) family simplifies the exposition. However, the derivations of our main results [Appendix A.2] assume a separable family. The laGP package facilitates estimating η if desired.
Although technically a prior over functions, the regression perspective recommends a likelihood interpretation. Using data D = (X, Y), where X is an N × p design matrix, the N × 1 response vector Y has a multivariate normal (MVN) likelihood with mean zero and covariance Σ(X) = τ²K, where K is an N × N matrix with (i, j)th entry K(x_i, x_j).
Conditional on K, the MLE of τ² is available analytically. Profile likelihoods may then be used to infer other unknown parameters (e.g., θ).
Empirical Bayes (Berger et al., 2001) has gained in popularity recently. In this approach one specifies a reference prior, π(τ²) ∝ τ^{-2}. Having an MVN likelihood, and a prior algebraically equivalent to IG(0, 0), makes integrating over τ² analytic. We obtain

p(Y|K) = [Γ(N/2) / ((2π)^{N/2} |K|^{1/2})] × (ψ/2)^{-N/2},  where ψ = Y^T K^{-1} Y.    (1)

Numerical methods for finding parameters like θ generally work well when leveraging derivative information for the log likelihood, quoted here for convenience in later developments; derivations are included in Appendix A.2:

∂ log p(Y|K)/∂θ = (N/2) (Y^T K^{-1} K̇ K^{-1} Y)/ψ − (1/2) tr{K^{-1} K̇},    (2)

∂² log p(Y|K)/∂θ² = (N/2) [ Y^T K^{-1}(K̈ − 2K̇K^{-1}K̇)K^{-1}Y / ψ + (Y^T K^{-1} K̇ K^{-1} Y / ψ)² ] − (1/2) tr{K^{-1}K̈ − (K^{-1}K̇)²},    (3)

where K̇ and K̈ are shorthand for ∂K/∂θ and ∂²K/∂θ², respectively. These can be used as the basis of a Newton-Raphson scheme, but complications arise in the not uncommon situation where the likelihood is multimodal (e.g., Rasmussen and Williams, 2006, Chapter 5).
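The profile approach above can be sketched numerically. Below is an illustrative Python translation assuming the τ²-integrated marginal likelihood just described; `neg_log_marginal` and the coarse grid search (a stand-in for Newton-Raphson) are our choices, not part of the laGP implementation.

```python
import numpy as np

def neg_log_marginal(theta, X, Y, eta=1e-6):
    """Negative log marginal likelihood with tau^2 integrated out, up to a
    constant: 0.5*log|K| + (N/2)*log(psi), where psi = Y' K^{-1} Y."""
    N = len(Y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / theta) + eta * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    psi = Y @ np.linalg.solve(K, Y)
    return 0.5 * logdet + 0.5 * N * np.log(psi)

# profile theta over a coarse grid (a stand-in for derivative-based search)
X = np.linspace(0, 1, 15)[:, None]
Y = np.sin(4 * np.pi * X[:, 0])
grid = np.array([0.01, 0.05, 0.1, 0.5, 1.0])
theta_hat = grid[np.argmin([neg_log_marginal(t, X, Y) for t in grid])]
```

In practice the derivatives (2-3) make Newton-like search far more efficient than a grid, which is why they are quoted above.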
The marginalized predictive equations for GP regression are available in closed form. Specifically, the distribution of the response Y(x) conditioned on data D and covariance K(·, ·), i.e., p(y(x)|D, K), is Student-t with degrees of freedom N, mean

µ(x|D, K) = k^T(x) K^{-1} Y,    (4)

and scale

σ²(x|D, K) = ψ [K(x, x) − k^T(x) K^{-1} k(x)] / N,    (5)

where k(x) is the N-vector whose ith component is K(x, x_i).
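A minimal Python/NumPy sketch of these predictive equations, assuming the zero-mean, integrated-τ² setup above (function and variable names are ours). Predicting at a training input recovers the near-interpolation property mentioned earlier.

```python
import numpy as np

def gp_predict(x, X, Y, theta, eta=1e-6):
    """Student-t predictive equations: mean k(x)' K^{-1} Y and scale
    psi * (1 + eta - k(x)' K^{-1} k(x)) / N, with psi = Y' K^{-1} Y."""
    N = len(Y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / theta) + eta * np.eye(N)
    k = np.exp(-((X - x) ** 2).sum(axis=1) / theta)
    KiY = np.linalg.solve(K, Y)
    mu = k @ KiY
    psi = Y @ KiY
    scale2 = psi * (1 + eta - k @ np.linalg.solve(K, k)) / N
    return mu, scale2

X = np.linspace(0, 1, 5)[:, None]
Y = np.cos(2 * np.pi * X[:, 0])
mu, s2 = gp_predict(X[2], X, Y, theta=0.05)
```

At the training input X[2], the mean essentially reproduces Y[2] and the scale is near zero, since η is tiny.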

Localized approximate emulation
If obtaining fast approximate prediction at x is the goal, then a first stab may be to use a small sub-design of X locations closest to x. That is, choose the n-nearest neighbor (NN) sub-design X_n(x) ⊆ X and apply the predictive equations (4-5) with D_n(x) = (X_n, Y_n), where n is chosen as large as computational constraints will allow. This is a sensible approach. It is clearly an approximation: as n → N, predictions of Y(x)|D_n and Y(x)|D converge trivially. Moreover, since the predictive equations are valid for any design, interpreting the accuracy of the approximation involves the same quantities as those obtained from an analysis based on the full design. Clearly, when n ≪ N, the variance V(x)|D_n can be much larger than V(x)|D. However, if N is so big as to preclude obtaining any predictor at all within reasonable computational constraints, obtaining one with marginally inflated uncertainty is better than nothing.
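The NN baseline just described can be sketched in a few lines. This illustrative Python version (names ours) subsets the n nearest inputs and applies the predictive equations to D_n(x).

```python
import numpy as np

def nn_predict(x, X, Y, n=20, theta=0.01, eta=1e-6):
    """GP prediction at x using only the n-nearest-neighbor sub-design D_n(x)."""
    idx = np.argsort(((X - x) ** 2).sum(axis=1))[:n]   # n closest rows of X
    Xn, Yn = X[idx], Y[idx]
    K = np.exp(-((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(-1) / theta) + eta * np.eye(n)
    k = np.exp(-((Xn - x) ** 2).sum(axis=1) / theta)
    KiY = np.linalg.solve(K, Yn)
    mu = k @ KiY
    s2 = (Yn @ KiY) * (1 + eta - k @ np.linalg.solve(K, k)) / n
    return mu, s2

X = np.linspace(0, 2, 40)[:, None]
Y = np.sin(np.pi * X[:, 0])
mu, s2 = nn_predict(np.array([1.03]), X, Y)
```

Total cost is O(N) for the neighbor search plus O(n³) for the local decomposition, independent of any full N × N matrix.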
Though simple, one naturally wonders if it is optimal for fixed n. It is not: Vecchia (1988) observed that the best sub-design for prediction depends on the correlation parameters (θ) and is usually not the NN design. Stein et al. (2004) go further by showing that NN is uniformly suboptimal under certain regularity conditions. Still, finding the best design for n > 1 is hard. It would involve a high-dimensional non-convex optimization, a computational nightmare in its own right. Joint inference with θ is more difficult still, compounding the issue. Although the simple yet sub-optimal NN idea seems attractive again, the main goal of this paper is to demonstrate that it is possible to do better without much extra effort.
Our main idea, which is akin to forward stepwise variable selection in regression, is as follows. For a particular predictive location x, we search for a sub-design X_n(x), with accompanying responses Y_n(x), together comprising data D_n(x), by making a sequence of greedy decisions D_{j+1}(x) = D_j(x) ∪ (x_{j+1}, y(x_{j+1})), j = n_0, ..., n. Each choice of x_{j+1} depends on the previous choices saved in D_j(x) by searching over a criterion. Evaluating the criterion, and updating the quantities required for the next iteration, must not exceed O(j²) so that the total scheme remains O(n³), like NN. We start with a very small D_{n_0}(x) comprised of n_0 NNs, which is easy to motivate in light of results shown in Section 3.4.

Greedy search to minimize mean-squared predictive error
Given x and D_j(x), we search for x_{j+1} by considering its impact on the predictive variance of Y(x), taking into account uncertainty in θ, through an empirical Bayes mean-squared prediction error (MSPE) criterion: J(x_{j+1}, x) = E{[Y(x) − µ_{j+1}(x; θ̂_{j+1})]² | D_j(x)}, which should be minimized. The predictive mean µ_{j+1}(x; θ̂_{j+1}) follows (4), although we embellish notation here to explicitly indicate dependence on x_{j+1} and the future/unknown y_{j+1}. This mean represents the hypothetical prediction of Y(x) arising after choosing x_{j+1}, observing y_{j+1}, and calculating the MLE θ̂_{j+1} based on the resulting D_{j+1}(x). Where convenient we drop the x argument in the local design, D_j ≡ D_j(x). Also note that θ̂_j ≡ θ̂_j(x), since it depends on D_j(x).
In Appendix A.2 we derive the approximation

J(x_{j+1}, x) ≈ V_j(x | x_{j+1}; θ̂_j) + (∂µ_j(x; θ)/∂θ |_{θ=θ̂_j})² / G_{j+1}(θ̂_j),    (6)

where G_{j+1}(θ̂_j) denotes the expected future Fisher information about θ after x_{j+1} is added. The terms in (6) are explained below in turn.
• V_j(x | x_{j+1}; θ) is an estimate of the new variance that will result after adding x_{j+1} into D_j, treating θ as known:

V_j(x | x_{j+1}; θ) = [(j+1)ψ_j / (j(j−1))] v_{j+1}(x; θ).

It captures x_{j+1}'s contribution to the reduction in variance by placing its correlations to X_j in the (j+1)st row/column of K_{j+1}. The correlation between x and X_{j+1}(x) ≡ (X_j(x), x_{j+1}) comes in through the (j+1)st entry of k_{j+1}, the rest being k_j. The other part of the expression, involving an estimate of τ² via ψ_j/j from D_j, deliberately avoids using y_{j+1} but recognizes its effect as an additional design point in the denominator.
• ∂µ_j(x; θ)/∂θ is the partial derivative of the predictive mean (4) at x, given D_j, with respect to the lengthscale parameter. In the appendix this is derived as

∂µ_j(x; θ)/∂θ = (k̇_j^T(x) − k_j^T(x) K_j^{-1} K̇_j) K_j^{-1} Y_j,

where k̇_j(x) is the j-length column vector of derivatives of the correlation function K(x, x_k), k = 1, ..., j, with respect to θ.
The approximation in (6) is three-fold: (a) the partial derivative of µ_{j+1} is approximated by that of µ_j; (b) the Student-t predictive equations are approximated by moment-matched Gaussian ones (9); and (c) the future θ̂_{j+1} is replaced by the current estimate θ̂_j based on the data D_j obtained so far. All are required since averaging over the future expected y_{j+1} is analytically intractable. Even though a y_{j+1} is technically available in our context, we deliberately do not condition on this value, which would result in the undesirable avoidance of selecting an x_{j+1} whose y_{j+1} is at odds with D_j. [More discussion in Appendix A.2.] Observe that when the Fisher information is large, the MSPE (6) is dominated by the reduced variance for x_{j+1}, the first term. We return to this special case in Section 3.3. For smaller j the extra term gives preference to choosing an x_{j+1} that is expected to improve estimates of θ nearby to x. Considering that a sensible non-local design strategy is to maximize (the determinant of) the Fisher information, the MSPE is thus making a local adjustment by (a) weighting the Fisher information by the rate of change of the predictive mean near x, and (b) balancing that with the aim of reducing local predictive variance. The appropriate balance is automatically dictated by the definition of J(x_{j+1}, x), which considers the effect of uncertainty in θ on the predictive uncertainty in Y(x). Aspect (a) is suggestive of potential benefit in estimating a localized, i.e., nonstationary, spatial field.

Fast updates
MSPE is deployable because everything it requires can be obtained by fast updating schemes from one iteration to the next: j → j + 1. Below, we outline updates that require time in O(j²) so that n applications are in O(n³). This is possible since K_{j+1}^{-1} can be obtained efficiently from K_j^{-1}, both conditional on the same parameter θ, via the partition inverse equations (Barnett, 1979, Appendix A.1), written in terms of v_j(x') = K(x', x') − k_j^T(x') K_j^{-1} k_j(x') and g_j(x') = −v_j(x')^{-1} K_j^{-1} k_j(x'). We use them in several applications. For example,

v_j(x; θ) − v_{j+1}(x; θ) = k_j^T(x) G_j(x_{j+1}) v_j(x_{j+1}) k_j(x) + 2 k_j^T(x) g_j(x_{j+1}) K(x_{j+1}, x) + K(x_{j+1}, x)² / v_j(x_{j+1}),    (10)

where G_j(x') ≡ g_j(x') g_j^T(x'). Each product above requires at most O(j²) time. Adding a row and column k_j(x_{j+1}) to K_j, together with K(x_{j+1}, x_{j+1}) = 1 + η, yields K_{j+1} in O(j) extra time and space. In addition to aiding in computation, Eq. (10) helps clarify how variance is reduced at x as a function of x_{j+1}. The K(x_{j+1}, x)² term is suggestive of a larger reduction the closer x is to x_{j+1}; however, k_j(x) suggests that distances from x to X_j(x) are also a factor. The log marginal likelihood requires log |K_j| which, by similar considerations, may be updated in O(j) time as

log |K_{j+1}| = log |K_j| + log v_j(x_{j+1}).

Fast updates are also available for key aspects of the predictive equations (4-5):

µ_{j+1}(x; θ) = µ_j(x; θ) + [k_j^T(x) g_j(x_{j+1}) + K(x_{j+1}, x)/v_j(x_{j+1})] y_{j+1} + [v_j(x_{j+1}) k_j^T(x) g_j(x_{j+1}) + K(x_{j+1}, x)] h_j(x_{j+1}),

and

ψ_{j+1} = ψ_j + [y_{j+1} + v_j(x_{j+1}) h_j(x_{j+1})]² / v_j(x_{j+1}),

where h_j(x_{j+1}) = Y_j^T g_j(x_{j+1}). Both updates take time in O(j).
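The partition inverse update can be sketched and checked numerically. The following Python fragment (names ours) extends K_j^{-1} to K_{j+1}^{-1} without a fresh O(j³) decomposition, and compares against a direct inverse on a small correlation matrix.

```python
import numpy as np

def partition_inverse(Kinv, knew, kappa):
    """Extend K_j^{-1} to K_{j+1}^{-1} in O(j^2) time via the partition inverse
    equations: v_j = kappa - knew' K_j^{-1} knew, g_j = -K_j^{-1} knew / v_j."""
    Kik = Kinv @ knew
    v = kappa - knew @ Kik          # v_j(x_{j+1})
    g = -Kik / v                    # g_j(x_{j+1})
    j = len(knew)
    out = np.empty((j + 1, j + 1))
    out[:j, :j] = Kinv + np.outer(g, g) * v
    out[:j, j] = g
    out[j, :j] = g
    out[j, j] = 1.0 / v
    return out

# check against a direct O(j^3) inverse on a small correlation matrix
Xd = np.array([[i / 2.0, j / 2.0] for i in range(3) for j in range(3)])[:8]
d2 = ((Xd[:, None, :] - Xd[None, :, :]) ** 2).sum(-1)
K8 = np.exp(-d2 / 0.3) + 1e-6 * np.eye(8)
Kinv8 = partition_inverse(np.linalg.inv(K8[:7, :7]), K8[:7, 7], K8[7, 7])
```

The byproducts v_j(x_{j+1}) and g_j(x_{j+1}) are exactly the quantities reused in the variance-reduction, determinant, mean, and ψ updates above.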
For fast updates of the observed information F_{j+1}(θ), write the log likelihood as a sum of components from D_j and from (x_{j+1}, y_{j+1}), with the latter denoted by ℓ(y_{j+1}; θ). This suggests the simple recursion

ℓ_{j+1}(θ) = ℓ_j(θ) + ℓ(y_{j+1}; θ).

In Appendix A.2 we show that, under the moment-matched Gaussian approximation to the predictive distribution of y_{j+1},

E{ −∂²ℓ(y_{j+1}; θ)/∂θ² | D_j } = µ̇_j² / V_j + V̇_j² / (2V_j²).

For compactness we are suppressing arguments to the mean and variance functions, and shorthanding their derivatives, e.g., V_j ≡ V_j(x_{j+1}; θ) and µ̇_j ≡ ∂µ_j(x_{j+1}; θ)/∂θ. Expressions for second derivatives of predictive quantities may be found in the appendix. This expectation (given D_j) augments F_j(θ) to complete G_{j+1}(θ) in (9).
Updates of all other quantities required for the calculations in Section 3.1 follow trivially. Since they require fixed θ, our greedy scheme is only efficient if MLEs are not recalculated after each sequential design decision, suggesting further approximation is warranted to obtain a thrifty MSPE-based local design. We find that working with a reasonable fixed value of θ throughout the iterations works well. At the end of the n local design and update steps we can find the MLE θ̂_n(x) by Newton-like methods with analytic derivatives (2-3), which would incur O(n³) cost once. More details on local inference are deferred to Section 4.2.

A special case
When the Fisher information is large, the MSPE (6) reduces to V_j(x | x_{j+1}; θ), which is the new variance at x when x_{j+1} is added into the design [Section 3.1]. A similar statistic has been used in the sequential design of computer experiments before, but in a different context: choosing design locations for new computer simulations (e.g., Seo et al., 2000; Gramacy and Lee, 2009). Rather than focus on a single x, those works average reductions in variance over the whole input space X:

∆V_j(x_{j+1}; θ) = ∫_X [V_j(x; θ) − V_j(x | x_{j+1}; θ)] dx.

Part of the integrand does not depend on x_{j+1}, and so can be ignored. The other part is proportional to v_{j+1}(x; θ) in (10). In the special case where K(·, ·) yields the identity (Anagnostopoulos and Gramacy, 2013), or when X is a finite and discrete set, the integral is also analytic. Otherwise numerical methods are needed, which may add significant computational burden.
Choosing x_{j+1} to maximize ∆V_j(x_{j+1}; θ) is a sensible heuristic since it augments the design with an input-output pair that has the most potential to reduce variance globally. In repeated application, j = 1, ..., N, the resulting designs have been shown to approximate maximum information designs (Cohn, 1996). To credit its originator, subsequent authors have referred to this technique as active learning Cohn (ALC). However, the expensive numerical integration required means that alternative heuristics are usually preferred.
A local design criterion based on the integrand alone, the simplest implementation of which involves maximizing (10), may therefore be a sensible alternative to MSPE, not just a further approximation. On the one hand, it leverages an information-theoretic criterion. On the other, it shortcuts substantial derivative calculations required by MSPE. In what follows we overload the terminology a bit and refer to local design based on maximizing (10) as ALC. However, it is important to note that its application in this context is novel. ALC in its original form has never been used to select a local sub-design and, since no integration is needed (numerical or otherwise), the computation is straightforward. In the supplementary material we show that this heuristic can be derived by studying partial correlations akin to those used to obtain sequential least squares estimators.
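A minimal sketch of greedy ALC local design in Python. For clarity it recomputes each candidate's new variance by brute force rather than with the O(j²) updates of Section 3.2; the candidate-set restriction and all names are our illustrative choices.

```python
import numpy as np

def corr(A, B, theta):
    """Isotropic Gaussian correlation between row sets A and B."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)

def pred_var(x, Xj, theta, eta=1e-6):
    """Scaled predictive variance v_j(x) = 1 + eta - k_j(x)' K_j^{-1} k_j(x)."""
    K = corr(Xj, Xj, theta) + eta * np.eye(len(Xj))
    k = corr(x[None, :], Xj, theta)[0]
    return 1 + eta - k @ np.linalg.solve(K, k)

def alc_local_design(x, X, n0=6, n=20, theta=0.1, ncand=100):
    """Greedy ALC sketch: start from the n0 NNs of x, then repeatedly add the
    candidate whose inclusion most reduces the predictive variance at x."""
    order = np.argsort(((X - x) ** 2).sum(axis=1))
    chosen = list(order[:n0])
    cand = list(order[n0:n0 + ncand])    # nearby candidates only, for speed
    while len(chosen) < n:
        newv = [pred_var(x, X[chosen + [c]], theta) for c in cand]
        best = int(np.argmin(newv))      # max reduction = min new variance
        chosen.append(cand.pop(best))
    return np.array(chosen)

X = np.array([[i, j] for i in np.linspace(0, 1, 20) for j in np.linspace(0, 1, 20)])
x = np.array([0.33, 0.57])
idx = alc_local_design(x, X)
```

Selected sites need not be the n nearest neighbors: reducing variance at x rewards a mix of close-by points and points that spread out the local design.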

An illustration
Consider a surface first studied by Gramacy and Lee (2011b), defined by f(x_1, x_2) = −w(x_1)w(x_2), where w(x) = exp(−(x − 1)²) + exp(−0.8(x + 1)²) − 0.05 sin(8(x + 0.1)). Figure 1 compares the local designs chosen by the MSPE and ALC heuristics, which share the same initial design D_{n_0} comprising the closest six NNs to x. Although the two heuristics agree on some sites chosen, observe that they are not selected in the same order after iteration nine. The patterns diverge most for further-out sites, but their qualitative flavor remains similar. In both cases, the bounding box [−2, 2]² impacts local design symmetry.
The full 46 local design iterations take less than a tenth of a second on a modern iMac, although in repeated trials (to remove OS noise) we found that MSPE takes 2-3 times longer. Inverting a 40K × 40K matrix cannot, to our knowledge, be done on a modern desktop, primarily due to memory swapping issues. For a point of reference, inverting a 4000 × 4000 matrix took us about seven seconds using a multithreaded BLAS/Lapack. Memory issues aside, we deduce that the alternative full-GP result would have taken days.
Figure 2 illustrates the accuracy of the approximate predictive distribution thus obtained, with comparison to the NN alternative. These results summarize the output of a Monte Carlo experiment repeated 1000 times, whose setup is similar to that above. The open circles and whiskers at 52 on the j-axis provide an approximate benchmark, based on a NN local design of size 1000, for the best possible results.
The left pane shows the bias, defined as the predictive mean µ_n(x) minus the true y-value. Notice how, relative to the greedy/MSPE method, the NN approach is slow to converge to the true value and is consistently biased low after the initial design. Both start biased low, which we attribute to 0-mean reversion typical (for GPs) in areas of the input space sparsely covered by the design. The right pane shows the predicted standard deviation over design iterations. Notice that the NN standard deviations are consistently lower than those from the greedy/MSPE method. So the NN method is both biased low and more confident than its comparator. At j = n = 50 both methods obtain a predictive mean close to the truth, with lower variability for the greedy/MSPE method. Moreover, both adequately acknowledge the uncertainty inherent in the approximation obtained using a much smaller local design compared to one obtained with orders-of-magnitude larger designs. In the supplementary material we show that an added benefit of the greedy approach is that the condition numbers of K_j are lower than those obtained from NN.

Global emulation on a dense design
The simplest way to extend the analysis to cover a dense design of predictive locations x ∈ X is to serialize: loop over each x collecting approximate predictive equations, each in O(n³) time. For T = |X| the total computational time is in O(Tn³). Obtaining each of the full GP sets of predictive equations, by contrast, would require computational time in O(TN² + N³), where the latter N³ is attributable to obtaining K^{-1}. One of the nice features of standard GP emulation is that once K^{-1} has been obtained the computations are fast O(N²) operations for each location x. However, as long as n ≪ N our approximate method is even faster despite having to rebuild and re-decompose K_j(x)'s for each x.

Parallelization
The approximation at x is built up sequentially, but completely independently of other predictive locations. Since a high degree of local spatial correlation is a key modeling assumption, this may seem like an inefficient use of computational resources, and indeed it would be in serial computation for each x. However, independence allows trivial parallelization. Predicting at a dense design of 10K locations with 8 threads on a 4-core hyperthreaded iMac, with a 40K design [as in Section 3.4], took about 30 seconds and required only token programmer effort: we augmented a single C "for" loop over x ∈ X with an OpenMP "pragma":

#ifdef _OPENMP
#pragma omp parallel for private(i)
#endif
for(i=0; i<npred; i++) { ...
That is actual code from our implementation, where npred = |X |.This led to a nearly linear speedup: runtimes for P processors scale roughly as 1/P .
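The same embarrassingly parallel structure carries over to other languages. Below is a hedged Python analog using a thread pool (NumPy's BLAS releases the GIL during factorizations, so threads can help; a process pool would be the analog for pure-Python work). Here `predict_one` is a stand-in NN predictor, not the laGP routine.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def predict_one(x, X, Y, n=20, theta=0.05, eta=1e-6):
    """Local NN predictive mean at a single x; a stand-in for one local-GP call."""
    idx = np.argsort(((X - x) ** 2).sum(axis=1))[:n]
    Xn, Yn = X[idx], Y[idx]
    K = np.exp(-((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(-1) / theta) + eta * np.eye(n)
    k = np.exp(-((Xn - x) ** 2).sum(axis=1) / theta)
    return k @ np.linalg.solve(K, Yn)

def predict_grid(XX, X, Y, workers=8):
    """Each x in XX is handled independently, mirroring the OpenMP "for" loop;
    the thread pool stands in for the pragma."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return np.array(list(ex.map(lambda x: predict_one(x, X, Y), XX)))
```

Because no state is shared across locations, the parallel result is identical to a serial loop, which is what makes the speedup essentially free.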

Illustration continued
Figure 3 offers visualizations of the behavior of our greedy, local approximations applied globally in the input space. The left column contains surfaces for both input dimensions, whereas the right one shows a 1d slice at x_2 = 1. The mean surfaces (top row) exhibit essentially no visual discrepancy from the truth. There are, however, small discrepancies (middle row), and the level of accuracy is not uniform across the input space. Pockets of higher (though still small) inaccuracy are visible. Finally, despite the uniform gridded design, the predicted standard deviation (bottom row) is non-uniform throughout the input space, appearing to vary smoothly albeit across a narrow range of values. This subtly nonstationary behavior is due to the localized estimation of τ² via ψ(x). It is not due to θ, since that was fixed globally. Closer inspection reveals that areas of lower predictive variability (bottom row) correspond to greater inaccuracy (middle). Both results suggest that the localized greedy approximation struggles to cope with the assumed stationary covariance, and may benefit from a more deliberate localized approach to the estimation of spatial correlation.

Localized parameter inference and stagewise design
Wishing to leverage fast sequential updating, retain independence in calculations for each predictive location x so they can be parallelized, and allow for local inference of the correlation structure for nonstationary modeling, we propose the following iterative scheme.
1. Choose a sensible starting global lengthscale parameter θ_x = θ_0 for all x.
2. Calculate local designs D_n(x, θ_x) based on sequential application of the MSPE or ALC design heuristics, independently for each x.
3. Also independently for each x, calculate the local MLE θ̂_n(x) conditional on D_n(x, θ_x).
4. Set θ_x = θ̂_n(x), possibly after smoothing spatially over all x locations.
5. Repeat steps 2-4 as desired.
6. Predict at each x, independently, based on D_n(x) and the possibly smoothed θ_x.
Usually only one repetition of steps 2-4 is required for joint convergence of local designs D_n(x) and parameter estimates θ̂_n(x), for all locations x. The initial θ_0 is worthy of some consideration, since certain pathological values, very small or very large on the scale of squared distances in the input space, can slow convergence. We find that a sensible default θ_0 setting can be chosen from the lower quantiles of squared distances in X.
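The default just described can be sketched directly; the 10% quantile level below is our assumption, not a prescription from the text.

```python
import numpy as np

def theta0_default(X, q=0.1):
    """Starting lengthscale theta_0 from a lower quantile (here 10%, our
    assumed level) of the squared pairwise Euclidean distances in X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    pairs = d2[np.triu_indices(len(X), k=1)]   # unique off-diagonal pairs
    return float(np.quantile(pairs, q))
```

The full N × N distance matrix is shown for clarity; for very large X one would subsample pairs instead.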

MLE calculations in Step 3 proceed via Newton-like methods as described in Section 3.2. We take θ_x for initialization, which may be θ_0 in the first stage or a possibly smoothed θ̂_n(x) in later stages. Convergence is very fast in the latter case (1-4 iterations), as we illustrate below. The former can require about twice as many iterations depending on θ_0; however, this can be reduced if a nearby θ̂_n(x') can be used for initialization instead. As a share of the computational cost of the entire local design scheme, the burden of finding local MLEs is dwarfed by the MSPE calculations when N ≫ n. So reducing the number of Newton steps may not be a primary concern in practice.
The smoothing suggestion in Step 4 stems from both pragmatic and modeling considerations. GP likelihood surfaces can be multimodal, so the Newton method is not guaranteed to find a global mode. Smoothing can offer some protection against rapid mode-switching in local estimators. On the modeling end, it is sensible to posit a priori that the correlation parameters be constrained somewhat to vary slowly across the input space. As we illustrate below, a simple neighborhood scheme leads to nice (smooth) visualizations of spatially varying lengthscale parameters. However, we find that this step is not absolutely necessary to obtain good prediction results, though it may aid in convergence of the two-stage scheme.
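As a crude stand-in for the loess smoothing of Step 4, one can average each local MLE over its nearest predictive-grid neighbors; the neighborhood size and the log-scale averaging below are our illustrative choices.

```python
import numpy as np

def smooth_thetas(XX, thetas, k=12):
    """Average each local lengthscale MLE over its k nearest grid neighbors
    (including itself), on the log scale so the result stays positive."""
    lt = np.log(thetas)
    out = np.empty_like(lt)
    for i, x in enumerate(XX):
        idx = np.argsort(((XX - x) ** 2).sum(axis=1))[:k]
        out[i] = lt[idx].mean()
    return np.exp(out)
```

A constant field passes through unchanged, while noisy local estimates are shrunk toward their neighbors, discouraging rapid mode-switching.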

Illustration continued
Figure 4 shows the estimated log local lengthscale parameters on our illustrative example from Section 3 using ALC; the MSPE results are similar but require 50% more computational effort. The first pane shows a snapshot after the first stage; the second after the second stage. These have been smoothed using loess in R with a span of 0.01, giving non-negligible weight to only the twelve or so NNs in our particular predictive grid of x values. Notice the increase in fidelity from one stage to the next. The variation of the lengthscale over the input space is unmistakable after either stage. Observe the similarity to the predictive error and variance results from Figure 3, with larger lengthscales corresponding to areas of poorer emulation under the model with a fixed, global lengthscale. The final pane in Figure 4 shows a histogram of the θ̂_n(x)'s obtained after the second stage. The Newton steps from the first stage required about 6.5 iterations on average, with 90% between 6 and 8, whereas the second stage took 4.2 (3 to 6) when primed with previous-stage values.
Empirical demonstration and comparison

Illustrative example
We start by making an out-of-sample comparison of our local approximation methods on the prediction problem described above. Comparators include variations based on NN, and greedy selection via MSPE and ALC. All greedy methods are initialized with θ_0 = 0.7, which is in the lower quartile of squared distances in X. We also consider a second iteration of MSPE and ALC local design initialized with smoothed MLEs θ̂_n(x) from the first stage. NN with θ̂_n(x) is included too, which we regard as one of our novel (if small) contributions. Note that fully placing NN in the context of the scheme outlined in Section 4 has limits, since later-stage local NN designs would be identical to those from the first stage, as they do not depend on θ. We consider variations where MLEs are not calculated. In those cases we used θ_0 = 1, which is a reasonable but not optimal value, inside but towards the upper extreme of the 95% interval of θ̂_n(x)-values found, ex poste, by the other methods. The timings were obtained with 8 threads on a 4-core hyperthreaded iMac, via OpenMP pragmas. All other particulars are exactly as described by the illustrations above: the local methods start with n_0 = 6 NNs, and grow iteratively until n = 50, i.e., using the laGP package defaults. We also consider a "big" NN benchmark, which uses n = 200.

[Table 1: accuracy (√(1 − NSE) and RMSE, smaller is better) and uncertainty (95% coverage and predicted standard deviation) on a single out-of-sample experiment. The timings for two-stage estimators ("mspe2" and "alc2") include those from the first stage ("mspe" and "alc"). A "nomle" indication flags estimates based on θ_0 = 1.]
Table 1 summarizes predictive accuracy with RMSE, which is self-explanatory, and √(1 − NSE), included here for completeness as NSE was used as the main metric of comparison in an earlier study [see Section 5.2 for more details]. The rows of the table are sorted by accuracy, with smaller being better. Predictive uncertainty is summarized with pointwise 95% coverage rates and predicted standard deviations.
We observe the following. First, all methods exhibit high precision on the scale of the response, and all over-cover. Still, the best methods (by RMSE) are three times better than the worst. The ordering of the methods by RMSE is as expected: more computational effort reaps rewards. Methods which do not infer θ̂_n(x) fare worst, and NN is only competitive with the greedy methods when it uses an order-of-magnitude larger local design. Notice that the "nnbig" comparator is competitive with the smaller greedy methods in terms of accuracy per unit time, even giving lower predicted standard deviation. However, this comes at the expense of storing a 4x larger output object, which may limit portability. [See Section 6 for further discussion of time/memory in implementation.]

Comparison to compactly supported covariances
For a broader comparison, consider the borehole function (Worley, 1987), a common benchmark for computer experiments (e.g., Morris et al., 1993). The response y is given by the borehole equation, with eight inputs constrained to lie in a rectangular domain. Part of our reason for choosing this data is that it was used as an illustrative example for the CSC method described by Kaufman et al. (2011), which may be the most widely adopted modern approach to fast approximate emulation for computer experiments. The authors provide a script illustrating their R package sparseEM on this data: http://www.stat.berkeley.edu/~cgk/rcode/assets/SparseEmExample.R In the first part of our comparison below we follow that script verbatim, except for augmenting their 99% sparse estimator with a 99.9% one for faster results. They generate an LHS of size 4500, using the first 4000 for training and the rest for testing, and use Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) for evaluation.
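For reference, the borehole equation and input ranges, as standardly given in the literature for this benchmark, can be sketched in Python (the experiments in this paper use R with the laGP and sparseEM packages; this stand-alone version is only for illustration):

```python
import math

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Standard borehole function (water flow rate in m^3/yr),
    as commonly given for this benchmark (Worley, 1987)."""
    lnr = math.log(r / rw)
    num = 2.0 * math.pi * Tu * (Hu - Hl)
    den = lnr * (1.0 + (2.0 * L * Tu) / (lnr * rw**2 * Kw) + Tu / Tl)
    return num / den

# Standard rectangular input domain for the benchmark:
# rw in [0.05, 0.15],  r in [100, 50000],  Tu in [63070, 115600],
# Hu in [990, 1110],   Tl in [63.1, 116],  Hl in [700, 820],
# L in [1120, 1680],   Kw in [9855, 12045]
```

Evaluating the function at the center of the domain gives a response on the order of tens of cubic meters per year, which matches the scale of values reported in the benchmark literature.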
NSE is defined as

NSE = 1 − Σ_i (Ŷ(x_i) − Y(x_i))² / Σ_i (Y(x_i) − Ȳ)².  (13)

Here, Ŷ(x) represents the predicted value of Y(x), which for all models under comparison will be the posterior predictive mean; Ȳ is the average value of Y(x) over the design space.
The second term in (13) is the estimated mean-squared predictive error (i.e., realized RMSE², that which we approximately optimize over when calculating a local MSPE design) over the unstandardized variance of Y(x). NSE is an attractive metric because it can be interpreted as an analog of R². However, it has the disadvantage of providing misleadingly large values in deterministic (noise-free) computer model emulation contexts. We therefore report √(1 − NSE) = sd(prediction error)/sd(response) instead, in addition to other metrics like RMSE. NSEs can still be backed out for direct comparison to tables in the CSC paper.
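Both metrics are simple to compute from held-out predictions; a minimal sketch (Python; array names are illustrative):

```python
import numpy as np

def accuracy_metrics(y, yhat):
    """Return RMSE, NSE, and sqrt(1 - NSE) = sd(prediction error)/sd(response)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    err = yhat - y
    rmse = np.sqrt(np.mean(err ** 2))
    nse = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return rmse, nse, np.sqrt(1.0 - nse)
```

Note that √(1 − NSE) is just RMSE divided by the standard deviation of the response, so perfect prediction gives 0 and predicting the mean gives 1.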
Table 2 summarizes the results of two comparisons, both involving thirty Monte Carlo repetitions, each with new random LHS testing and training sets: one (top) for the original borehole comparison, and another (bottom) with double-sized training and testing sets.
Package defaults are used in all cases; second-stage greedy methods (e.g., "mspe2") used unsmoothed first-stage θ_x = θ̂_n(x) values (from "mspe"). The timings, obtained on the same 4-core hyperthreaded iMac throughout, comprise the sum of the fitting and prediction stages. OpenMP pragmas, with 8 threads, were used to parallelize our local design methods. Although CSC is not explicitly multi-threaded, the sparse linear algebra libraries it uses (from the spam package) are threaded, and therefore do utilize multiple cores. We can see clearly from the table(s) that even without parallelization (∼8x slower) our local methods are orders of magnitude faster than the CSC ones.
Differences in the raw accuracy numbers, via average √(1 − NSE) or RMSE, are much greater than they were in our earlier illustrative example. Acknowledging the randomness in the Monte Carlo setup, the third column gives a p-value for a one-sided paired t-test under the alternative that that row's NSE came from a different population than the row with the next-lowest average (next row down). For example, at the 5% level, the best method ("alc2") is not statistically better than the second best ("mspe2"), whereas both are convincingly better than the third best, and so on. At a high level, observe that the local methods without θ̂(x) fare worst. NN (n = 50) is dominated by the greedy methods across the board. Due to the lack of statistical significance in differences between MSPE and ALC, the former might not be worth the added computational expense. As before, a big NN (n = 200) is competitive in terms of accuracy per second, however only when local MLEs are calculated. Without local inference and parallelization (both our contributions), the "big" NN method is amongst the most expensive (∼70 secs) and the least accurate.
Focusing on the uncertainty estimates in the table, observe that the greedy methods over-cover, whereas the CSCs have the correct nominal pointwise coverage. They also have, with the exception of "nnbig", lower predicted standard deviation. Interpreting these results is not straightforward, however. The true response is smooth and deterministic, so the 5% which CSC mis-covers must comprise a small number of contiguous regions of the input space where it was (possibly substantially) biased. We therefore draw comfort from the fact that the local methods are both more accurate and more conservative. The low standard deviation of "nnbig" would seem to be due solely to the larger n in the denominator of ψ_n(x) in (1), not to a better estimate of the global variance structure. This can be verified by substituting 50 · ψ_n(x)/200 into the coverage calculations for the other n = 50 estimators, which results in values nearly identical to the others. We conclude that our greedy methods are (perhaps overly) conservative. However, next to "nn", which provides deceptively good coverage results, caution in terms of predictive uncertainty pays dividends in accuracy.
The double-sized experiment in the bottom table tells a similar story. Accuracy results improve slightly, but relative orders are similar. What is noteworthy is that the local methods require, as expected, about double the computational effort. By contrast, the CSC methods take four to eight times longer, suggesting a quadratic scaling in N. Anecdotally, we remark that the 99% sparse CSC method on (N, n) = (8000, 1000) approaches the limit of the size of problem possible on this machine due to memory constraints. We tried N = 10000. Frequent memory pages to disk led to dramatic slowdowns; our estimates suggest the code would have taken days to finish. By contrast, the local/greedy methods required only seconds more.

Discussion
We outlined a family of sequential design schemes to modernize an old idea of approximate kriging based on local neighborhoods [for a recent reference, see Emery (2009)]. The result is a predictor that is fast, nonstationary, highly parallelizable, and whose "knobs" offer direct control on speed versus accuracy. Really the only tuning parameter is n. We argue in favor of calculating local designs which differ from a simpler (yet modernized) NN, and give two greedy algorithms which have the same local O(n³) computational order as NN. But the order notation hides a large constant, which means that the greedy methods demand compute cycles that could have been spent, e.g., by NN with a larger n budget. Indeed, our results show that this approach is competitive, but not without modern enhancements like local θ̂_n(x) and parallel computation. Nevertheless, instinct suggests that smaller n is better. Many of our ideas for extension, which we outline below, bear this out.
By saving a small amount of information (indices of the local designs at each x, and corresponding θ̂_n(x) values), designs can be augmented, picking up where they left off if more compute cycles become available. Those resources need not be allocated uniformly. Larger local designs n(x) can be sought where, e.g., the potential to reduce variance or increase accuracy is greater. Such decisions can be made more judiciously (i.e., accurately, via ALC) and efficiently (smaller-n calculations) under a greedy local design scheme, compared to NN say. This suggests benefits from differential effort through the stages of global inference in Section 4.2. One could start with a small n and local inference based on NN; then iterate with ALC and/or MSPE, possibly increasing n; and finally allocate additional n(x), refining hard-to-predict areas until the computing budget is exhausted.
Having a stopping criterion might be desirable when computational limits are less constrained; applied independently to obtain n_thresh(x) for each x, it could represent an alternative mechanism for allocating computational resources differentially amongst the predictive locations. One option is to track the ratio of predictive and reduced variances (V_j(x)/V_{j+1}(x)) over the iterations, and stop once it drops below a certain minimum threshold. Both depend on ψ_j, so cancellation would lead to a metric independent of the responses {y_i, i = 1, ..., j} (other than via a common estimate of θ). However, non-uniform full-data designs coupled with spatially varying choices for θ make the actual time to convergence, which is related cubically to n_thresh(x), highly unpredictable. Since lack of predictability would be compounded over thousands of independent local searches, practical considerations may necessitate a low global cap on n_thresh(x), recommending against a simplistic NN approach.
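As a hypothetical sketch of such a stopping rule (names and defaults here are illustrative, not laGP functionality): track the per-iteration variance ratio and stop once the relative reduction falls below a tolerance.

```python
def stop_index(variances, rel_tol=1e-3, n_min=6):
    """Given per-iteration predictive variances V_j(x), decreasing as the
    local design grows, return the first j >= n_min at which the relative
    reduction V_j / V_{j+1} - 1 drops below rel_tol; otherwise the full
    budget len(variances)."""
    for j in range(n_min, len(variances) - 1):
        if variances[j] / variances[j + 1] - 1.0 < rel_tol:
            return j
    return len(variances)
```

The returned index plays the role of n_thresh(x); a hard cap on the loop length would implement the global ceiling recommended in the text.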
We remark that there is a trade-off between the size of the local designs and the nonstationary flexibility of the global predictor: as the designs get larger, the global surface can exhibit less spatial heterogeneity. With larger n, the approximation will improve relative to the full-data GP, but it may exhibit poorer out-of-sample performance, suggesting it could be beneficial to stop early. Post-process pruning steps, perhaps via an out-of-sample validation scheme, could be deployed to find n*(x) ≤ n, a search which is more efficient the smaller n is to start with. Another option is to augment the GP with an elaborate mean structure, following Kaufman et al. (2011). That would allow a sparser covariance structure, which for us means a sparser local design (again, smaller n). That mean structure is one of the big strengths of the CSC method, which we anticipate would fare better than ours in extrapolation exercises, or when predicting in an area of the input space with very sparse design coverage.
When the design and predictive grids are dense, and when the same θ parameters are used globally, the patterns of greedy local designs in the interior of the input space (away from the boundaries) are highly regular. For example, Figure 1 suggests a simple template-based rule where, for each x, n/2 NNs are chosen along with n/2 other points spaced out along orthogonal rays from x. This may prove to be a cheap alternative to expensive MSPE/ALC calculation, and may therefore widen the accuracy-per-flop gap between more thoughtful small-n and much larger-n NN alternatives. However, it may also be that other approaches, like structured Gaussian process predictors (e.g., Gilboa et al., 2012), are better able to leverage high regularity in dense designs. Exploring alternatives in this context represents future work. Yet regular designs are rare for computer experiments, where space-filling ones are preferred. That sort of looser, yet uniform, structure can be leveraged towards more efficient greedy searches, representing a hybrid between a template-oriented approach and an exhaustive search. We have found that substantial speedups can be obtained by searching first over the closest members of X − X_j(x). One option is to stop early once some tolerance is reached on changes to ALC/MSPE. Even simpler is to restrict searches to a more modestly sized local neighborhood of N′ ≪ N candidates for m-dimensional input spaces. In fact, for all experiments in this paper, we restricted searches to the nearest N′ = 1000 candidates; results (except time) were essentially unchanged with N′ = 10000.
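A minimal sketch of this candidate restriction in Python (illustrative only; laGP implements its own search in C): prune the full candidate set to the N′ nearest points before any greedy ALC/MSPE evaluation.

```python
import numpy as np

def local_candidates(X, x, Nprime=1000):
    """Indices of the Nprime rows of X closest to x (squared Euclidean
    distance); only these need be scanned by the greedy criterion."""
    d2 = np.sum((np.asarray(X) - np.asarray(x)) ** 2, axis=1)
    return np.argsort(d2)[: min(Nprime, len(X))]
```

For a dense design, a KD-tree or similar spatial index would reduce this O(N) scan further; the point is simply that the greedy search never needs to consider candidates far from x.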
We envision further potential benefits from parallelization. Symmetric multiprocessor machines are commonplace in modern desktop computing. Core counts are doubling every few years, and our methods are well-positioned to take advantage. Graphical processing units (GPUs) are taking off for scientific computing, having thousands of stripped-down computing cores. Some authors have already jumped on that bandwagon in the context of computer experiments (e.g., Franey et al., 2012; Eidsvik et al., 2013). Although requiring nontrivial extra programming effort (by comparison with OpenMP, say), our beta GPU version of ALC-based local design search has resulted in 20-70x efficiency gains (Gramacy et al., 2013). Such gains effectively obsolete large-n NN alternatives, by drastically narrowing the time gap between NN and greedy alternatives for fixed n.
Finally, we anticipate that a modern local approach to GP emulation will have applicability beyond emulation. For example, it may be ideal for computer model calibration, where predictions are only required over part of the input space: at pairings of the calibration parameter(s) with a small number of field data sites. In that case it might be sensible to build "local" designs jointly for the field data locations. Although the details for doing this by MSPE are still under development, an ALC version can be invoked by providing a matrix of x-values to the laGP function in the R package. Other contexts clearly pose challenges for local emulation, for example, when optimizing black-box functions by expected improvement, where

A.2 Mean-squared predictive error
Here we derive the expressions behind the MSPE (6) development in Section 3. We allow any form of the correlation function whose derivatives are analytic. Although our expressions in the main body of the text assume a scalar parameter θ, the derivations here allow for a p-parameter family, where p ≥ 1. Therefore our partial derivatives are expressed componentwise and vectorized as appropriate. In most cases, the single-parameter result is obtained straightforwardly by removing subscripts and collapsing products of identical second derivatives into squares and/or factors of two.
We begin with the batch log-likelihood derivatives which lead to the Fisher information. Define w_j = K_j^{-1} Y_j, which is a component of the marginal (log) likelihood (1) for data D_j via ψ_j. Much of what follows repeatedly leverages that

∂K_j^{-1}/∂θ_k = −K_j^{-1} (∂K_j/∂θ_k) K_j^{-1},

where ∂K_j/∂θ_k is the matrix of partial derivatives corresponding to K_j, calculated via K_{θ,η}(x, x′) for all (x, x′) pairs in D_j. Recursive application yields the higher-order derivatives. Also useful is

∂ log |K_j|/∂θ_k = tr(K_j^{-1} ∂K_j/∂θ_k).

Note that the trace of a product of square matrices M^(1) and M^(2) may be efficiently computed as Σ_{k,l} M^(1)_{lk} M^(2)_{kl}. Using the above results, the first and second derivatives of the log likelihood ℓ_j(θ) ≡ log p(Y_j | θ, X_j) given in (1) follow (for 1 ≤ k, l ≤ p). The Fisher information matrix F_j(θ) is obtained by negating the (k, l) elements of the second derivative above.
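The trace shortcut in the text avoids forming the matrix product; in code (a generic numpy illustration, not tied to any particular kernel):

```python
import numpy as np

def trace_prod(M1, M2):
    """Compute tr(M1 @ M2) in O(n^2) flops via sum_{k,l} M1[l,k] * M2[k,l],
    rather than the O(n^3) cost of forming M1 @ M2 first."""
    return float(np.sum(np.asarray(M1) * np.asarray(M2).T))
```

This is exactly the pattern needed for likelihood derivatives like tr(K^{-1} ∂K/∂θ_k), where both factors are already available as dense matrices.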
Sequential updating of the Fisher information leverages a recursive expression for the log likelihood, which follows trivially from a cascading conditional representation of the joint probability of the responses given the parameters:

ℓ_{j+1}(θ) = log p(Y_{j+1} | θ) = log p(Y_j | θ) + log p(y_{j+1} | Y_j, θ) ≡ ℓ_j(θ) + ℓ(y_{j+1}; θ),

where the final term represents the conditional log likelihood for y_{j+1} given Y_j and θ. Taking (negative) second derivatives yields the updating equations. Deriving the final term requires expressions for y_{j+1} | Y_j, θ, i.e., the predictive equations, and their derivatives, thereby defining some shorthand that will be useful going forward. The derivatives of the log likelihood follow immediately from applications of the chain rule. Eqs. (16-17) require the derivatives of the predictive equations. In turn, those require w_j and its derivatives (15), and a new quantity z_j = K_j^{-1} k_j ≡ K_j^{-1} k_j(x_{j+1}; θ) and its derivatives, which follow from applications of (14). The derivatives of the mean then follow. For the variances it helps to write V_j = ψ_j v_j / (j − 2), where v_j ≡ v_j(x_{j+1}; θ) = K_η(x_{j+1}, x_{j+1}) − k_j(x_{j+1})⊤ K_j^{-1} k_j(x_{j+1}) and ψ_j ≡ ψ_j(θ). We assume that when the arguments to K_{θ,η} are identical, as in v_j, the correlation function is no longer a function of θ. The derivatives then follow. This completes the expressions required for updating the Fisher information matrix.
The MSPE calculation (6) requires that the conditional likelihood be replaced with its expectation, because we wish to select the next x_{j+1} in a manner that does not utilize the "future" observation y_{j+1}. Unfortunately, the Student-t predictive equations (4-5) preclude a tractable analytic expectation calculation. Therefore, we approximate by employing Gaussian surrogate equations with matched moments. The (exact) derivatives for the approximation (23) then follow, for 1 ≤ k, l ≤ p, and taking the (negative) expectation of (25) gives (26). Now, given a specified input site x at which predictions of Y(x) are desired, and given data D_j containing previously selected sites, the objective is to choose the next site x_{j+1}, recognizing that the data pairs will lead to improved predictions via (4-5) and an improved estimate of the correlation parameters θ. If the latter were of primary concern, one possible way to choose x_{j+1} would be to maximize the determinant of G_{j+1}(θ) for some choice of θ. While reasonable, this does not directly consider the prediction error variance of Y(x), and it therefore possesses no mechanism for giving a local character to the estimate of θ. So instead we propose to choose x_{j+1} to minimize the predictive variance of Y(x) in a manner that considers the impact of estimation uncertainty in θ. Specifically, we choose x_{j+1} to minimize the Bayesian mean squared prediction error defined in (27), where µ_{j+1}(x; θ̂_{j+1}) depends on (x_{j+1}, y_{j+1}) and represents the hypothetical prediction of Y(x) that would result after choosing x_{j+1}, observing y_{j+1}, calculating the MLE θ̂_{j+1} based on the expanded data set D_{j+1}, and using that MLE in the prediction of Y(x). Eq. (27) becomes (28). Regarding the first term in (28), we approximate it as

E[V_{j+1}(x; θ) | Y_j] = E[ψ_{j+1}(θ) v_{j+1}(x; θ) / (j − 1) | Y_j] ≈ (j + 1) ψ_j v_{j+1}(x; θ̂_j) / (j(j − 1)),   (29)

which can be recognized as the new/reduced variance (10) implied after adding x_{j+1} into D_j, but which does not condition on y_{j+1} and, therefore, uses the MLE calculated from D_j.
Regarding the last term in (28), we again approximate. The final approximation in (30) involves two steps. First, the partial derivative of µ_{j+1}(x; θ) at the future θ̂_{j+1} is approximated by the partial derivative at the current θ̂_j. The second approximation uses the partial derivative of µ_j(x; θ) instead. Both are required for tractability, because θ̂_{j+1} and µ_{j+1}(x; θ) depend on the future y_{j+1} in a manner that would make the expectation in (30) difficult to evaluate analytically. Note that µ_{j+1}(x; θ) and µ_j(x; θ), and their partial derivatives at a common value of θ, will generally be similar for larger j, because the most influential points are included in the initial iterations. For situations in which y_{j+1} is available in advance, one might consider using it for calculating the partial derivative of µ_{j+1}(x; θ), and thereby avoid an approximated expectation. However, even though this would lend tractability and accuracy in one sense, it may not be as attractive a criterion as using µ_j(x; θ). A strategy that used the partial derivative of µ_{j+1}(x; θ) might avoid adding a site at which y_{j+1} was not consistent with the other {y_i, i = 1, 2, ..., j}, and would thereby ignore unusual or unpredictable features of the response surface.

B Supplementary Materials

B.1 Comparison of condition numbers for local designs
A knock-on effect of the spacing of the sub-design points found by the greedy method is that the condition numbers (ratio of the largest to smallest eigenvalue) of the resulting covariance matrices, K_j, are lower relative to those of the NN method as j becomes large. Condition numbers for local designs corresponding to the right panel of Figure 1, using ALC, illustrate this; the right plot uses a smaller nugget value, η. Lower condition numbers indicate greater numerical stability and thus more reliable matrix decompositions. This is relevant for GP surrogate modeling since it means that smaller nuggets, and thus truer interpolations, can be achieved with the greedy method if so desired. Although some recent papers have argued that the quest for truer interpolation in surrogate modeling may not be efficient from a statistical perspective (Gramacy and Lee, 2011a; Peng and Wu, 2012), seeking out the smallest possible nugget while retaining numerical stability remains an aspiration in the literature (Ranjan et al., 2011).
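The effect is easy to reproduce with a generic squared-exponential correlation (a sketch; the kernel form and nugget values here are stand-ins, not the paper's exact parameterization): tightly clustered designs and smaller nuggets both raise the condition number.

```python
import numpy as np

def sq_exp_cov(X, theta=0.7, eta=1e-6):
    """Isotropic squared-exponential correlation matrix with nugget eta."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / theta) + eta * np.eye(len(X))

def condition_number(K):
    """Ratio of largest to smallest eigenvalue (symmetric positive definite K)."""
    ev = np.linalg.eigvalsh(K)  # eigenvalues in ascending order
    return ev[-1] / ev[0]
```

Spread-out greedy designs thus tolerate smaller η than NN designs of the same size, which is the point of this supplementary comparison.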

B.2 An interpretation of the reduction in variance criterion in terms of partial correlations
The ALC criterion (10) has a helpful interpretation in terms of the partial correlation between the prediction errors at x and at some x′, a potential new site to be added into the design. The quantity in (33) is the reduction in predictive variance of Y(x) after including the response at x′. Noting that Cov[e(x|Y_j), e(x′|Y_j)] = τ²[K(x, x′) − k_j(x)⊤ K_j^{-1} k_j(x′)], it is straightforward to show that (33) is equivalent to the criterion in (10) with ψ_j/(j − 2), an estimate of τ², replaced by τ². Because Var[e(x|Y_j)] does not depend on x′, the expression (33) for the reduction in variance implies that the next site x_{j+1} should be chosen to maximize the correlation between the errors in predicting Y(x) and Y(x_{j+1}) given Y_j.
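This identity is easy to verify numerically for any positive-definite correlation (a sketch with a generic squared-exponential kernel and unit variance; nothing here is specific to laGP):

```python
import numpy as np

def kern(a, b):
    """Generic squared-exponential correlation (stand-in kernel)."""
    return np.exp(-np.subtract.outer(a, b) ** 2)

def pred_var(Xj, x):
    """Kriging error variance Var[e(x | Y_j)] for a unit-variance GP."""
    Ki = np.linalg.inv(kern(Xj, Xj))
    kx = kern(Xj, x)
    return 1.0 - kx @ Ki @ kx

def err_cov(Xj, x, xp):
    """Cov[e(x | Y_j), e(x' | Y_j)] under the same GP."""
    Ki = np.linalg.inv(kern(Xj, Xj))
    return kern(x, xp) - kern(Xj, x) @ Ki @ kern(Xj, xp)
```

The check below mirrors (32)-(33): the drop in kriging variance at x from adding x′ equals the squared error covariance divided by Var[e(x′|Y_j)].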

Figure 1: A comparison between ALC (black) and MSPE (green) heuristics at two predictive locations. The numbers plotted indicate the order in which the design sites were chosen.

Figure 2: Comparing the NN predictor to the greedy selection approach over 1000 repetitions, each using an LHS of size 10001: 10K for training and 1 for testing. Both methods are initialized with an NN design of n_0 = 6. The left pane shows the actual bias and precision, and the right shows the predicted precision (via the square root of the GP predictive variances). Each repetition is shown in light shades; the darker dashed lines are 90% quantiles, and the dark solid line is the average. The open circles with whiskers at j = 52 are the 90% quantile and average results from a size-1000 NN design.

Figure 3: The left column shows 2d visualizations of the posterior mean surface µ(x|D_n), the difference between the mean and the truth µ(x|D_n) − y(x), and the predicted standard deviation √V(x|D_n). In the image plots, lighter colors (yellows/whites) are higher values; darker (reds) are lower. The right column shows the same for the slice x_2 = 0.5.

Figure 4: Smoothed log θ̂_n(x) values after the first and second stages (reds are lower values), followed by a histogram of those values from the second stage.
For the case when all parameters are known, denote the error in predicting Y(x), given data D_j, by e(x|Y_j) = Y(x) − µ_j(x; θ). Note that e(x|Y_j) and Y_j are orthogonal under the standard correlation inner product, and consider the orthogonal decomposition

Y(x) = µ_j(x; θ) + e(x|Y_j) = µ_j(x; θ) + ρ √(Var[e(x|Y_j)] / Var[e(x′|Y_j)]) e(x′|Y_j) + e(x|Y_j, x′),   (31)

where

ρ = Corr[e(x|Y_j), e(x′|Y_j)] = Cov[e(x|Y_j), e(x′|Y_j)] / √(Var[e(x|Y_j)] Var[e(x′|Y_j)])

denotes the correlation coefficient between the errors in predicting Y(x) and Y(x′). Sometimes this is called the partial correlation coefficient between Y(x) and Y(x′) given Y_j. Because of the orthogonality of the three terms in (31), the error variance in predicting Y(x) given Y_j is

Var[e(x|Y_j)] = ρ² Var[e(x|Y_j)] + Var[e(x|Y_j, x′)].   (32)

The right-most term in (32) is defined as the error variance in predicting Y(x) after including the response at x′. Hence,

ρ² Var[e(x|Y_j)] = Corr²[e(x|Y_j), e(x′|Y_j)] Var[e(x|Y_j)] = Cov²[e(x|Y_j), e(x′|Y_j)] / Var[e(x′|Y_j)].   (33)

Table 2: Average timings (in seconds) and accuracy (in √(1 − NSE)) values from two thirtyfold Monte Carlo experiments on the borehole data. The top table echoes the experiments for CSC; the bottom table is for a double-sized experiment. The final column in both tables contains p-values from one-sided t-tests of adjacent performers (better vs. next-best). The rows of the tables are ordered by the accuracy estimate(s).