Predictive Subdata Selection for Computer Models

Abstract An explosion in the availability of rich data driven by technological advances is hindering efforts at statistical analysis due to constraints on time and memory storage, regardless of whether researchers employ simple methods (e.g., linear regression) or complex models (e.g., Gaussian processes). Recent approaches to overcoming these limits include information-based optimal subdata selection and Latin hypercube subagging. In the current study, we develop a novel subdata selection method for large-scale computer models based on expected improvement optimization. During the optimization procedure, the proposed scheme exploits the geometry of the input region as well as information related to the output values. The data points associated with the largest improvement in prediction accuracy are combined in the construction of a subdataset that can be used to formulate predictions with affordable computing time. Numerical studies and empirical analyses of real-world data demonstrate that the selected subdata yield accurate predictions. Supplementary materials for this article, including proofs of theorems and additional numerical results, are available online.


Introduction
Computer models (otherwise referred to as computer codes) are increasingly popular as surrogates for physical experiments. Typical computer models are deterministic, meaning that the same input produces the same output up to numerical noise (e.g., from inverting ill-conditioned matrices), and are implemented through complex mathematical equations that describe the relationship between input and output variables.
Ideally, a computer model should cover all required input variables, including those in the physical experiments as well as all environmental and model variables. Owing to the "black box" nature of computer models, statisticians have devised various statistical models by which to unveil the input-output relationship. Gaussian process (GP) models have been popular emulators. A comprehensive account of designing and analyzing computer models is given in Santner, Williams, and Notz (2018).
Recent technological advances have made it possible to collect enormous volumes of data, exceeding the capacity of advanced statistical methods and even most of the computer systems on which they depend. These constraints can result in numerical instability during computation, which increases variability with a corresponding decrease in prediction accuracy. Innovative approaches to statistical analysis are thus required to circumvent these difficulties. Haaland and Qian (2011) described the advent of large-scale computer models in aerospace engineering and information technology; see also Lopez and Hamann (2011). The SARCOS dataset considered in Rasmussen and Williams (2006), of size 48,933, is based on an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. Issues pertaining to big data can also be found in other disciplines. Wang, Yang, and Stufken (2019) mentioned an airline on-time dataset from the 2009 ASA Data Expo, which included flight arrival and departure information for all commercial flights in the United States between October 1987 and April 2008, with 123,534,969 observations on 29 variables.
Statistical methods such as Gaussian process (GP) modeling and linear regression have been rendered impractical in the face of exponential growth in the size of datasets. Researchers have proposed a number of remedies for such computational issues. To facilitate the use of linear regression, Xiong et al. (2016) proposed orthogonalizing expectation-maximization (EM) for large-scale penalized least-squares problems. Nie, Huling, and Qian (2017) extended this approach to a richer class of models, including generalized linear models and Cox's proportional hazard model. To facilitate GP modeling, Haaland and Qian (2011) proposed the use of a multi-step predictor to avoid ill-conditioned results in large-scale computer experiments. Peng and Wu (2014) examined the use of a nugget in kriging to overcome the numerical instability inherent to large-scale computer models. The local GP approximation proposed by Gramacy and Apley (2015) uses a subset of data around a testing input location to construct a GP predictor for large datasets. Sung, Gramacy, and Haaland (2018) extended that approach to achieve further improvements in computing speed.
Another approach to streamlining computation involves selecting a subset of data for analysis, rather than tackling all of the data simultaneously. The feasibility of this approach is supported by the argument that not all data points provide valuable information relevant to the input-output relationship. Furthermore, nearby data points often share more information than those located at a distance. Ma and Sun (2015) employed leveraging in the sampling of subdata. Ma, Mahoney, and Yu (2015) evaluated the statistical properties of algorithmic leveraging, and Ma et al. (2020) went on to study the asymptotic distributions of leveraging-based sampling estimators. Härdle et al. (2018) also discussed statistical leveraging methods in detail. Zhao, Amemiya, and Hung (2018) developed a subdata selection method combining Latin hypercube sampling with subsample aggregation, in which the input region is divided into grid blocks and Latin hypercube samples are repeatedly selected from the blocks to serve as subdata for formulating predictions via the bagging technique. Wang, Yang, and Stufken (2019) borrowed the concept of optimal experimental designs (Kiefer 1959) in their subdata sampling scheme, which they called information-based optimal subdata selection (IBOSS). IBOSS approximates D-optimality and selects data points pertaining to input variables with extreme values in formulating subdata, yielding impressive benefits in linear regression models. Ai et al. (2021) further studied optimal subsampling plans for generalized linear models. Wang et al. (2021) proposed an orthogonal sampling method for linear models. Drovandi et al. (2017) encouraged interaction between the optimization and experimental design communities in big data analysis; Deldossi and Tommasi (2022) subsequently applied the theory of optimal design to incorporate the inferential goal into the subsampling strategy.
Intuitively, representing a smooth output/response surface requires fewer data sites, whereas reconstructing output surfaces with high variability calls for many more data points. Based on this concept, Joseph and Mak (2021) developed a supervised data compression scheme, which is robust to model misspecification. They demonstrated that their method produces good prediction results under a variety of models, including the 1-nearest-neighbor regression and 1-layer/2-layer neural networks. A review of statistical techniques for big data is given by Yao and Wang (2021).
Unlike one-stage experimental designs, sequential experimentation such as response surface methodology (Box and Wilson 1951) gradually updates the information of the output surface. Response surface methodology is a popular approach to elucidating the relationship between input factors and the output (Wu and Hamada 2021, chap. 10). In global optimization, Mockus, Tiesis, and Zilinskas (1978) proposed an optimal sequential algorithm by which to find the minimum for a computer model. Schonlau, Welch, and Jones (1998) and Jones, Schonlau, and Welch (1998) proposed and popularized the efficient global optimization (EGO) algorithm, which uses a sequential design strategy by initiating a small space-filling design and then adding one design point at a time. An expected improvement (EI) function was derived for sequential optimization. Ranjan, Bingham, and Michailidis (2008) adopted a similar approach to contour estimation. Lam and Notz (2008) used the EGO algorithm to search for designs capable of good overall fit. An overview of EI-based methods was completed by Notz (2015).
In this article, we propose an improvement function based on the idea of EGO to select subdata with good prediction performance for computer models. The EI function corresponding to a given statistical model (as an emulator) is first derived. If some data points outside the current subdataset contribute significantly to the EI function, then adding them to the subdata would result in a large improvement in prediction accuracy. A major difference between design construction and subdata selection is that no output value is available when constructing (one-stage) designs, while for subdata selection, one can exploit the information of the output surface contained in the full data. The method proposed in this article considers the geometry of the input region as well as the information related to the output surface.
The remainder of this article is organized as follows. Section 2 illustrates a motivating example and introduces GP modeling. Section 3 proposes an improvement function for the selection of subdata with a desirable prediction ability and provides corresponding theoretical properties. Section 4 presents several numerical examples, including two real-world datasets. Finally, we discuss potential applications and conclude our work in Section 5. Technical proofs and more numerical examples are summarized in the supplementary materials.

Motivating Example and GP Modeling
In the following, we present a motivating example, illustrating why the information related to the output variable must be incorporated in the selection of subdata. We then introduce GP modeling as an emulator for computer models wherein we assume that the output variable is quantitative (a common assumption for computer models).

Motivating Example
The performance of a subdata selection scheme largely depends on the integration of output information. Intuitively, more data points should be selected from the input region that produces high output variation, while fewer points are needed elsewhere. Here we consider the Shubert function f_s(x) and the (two-dimensional) Michalewicz function f_m(x) as an illustrative example (see Surjanovic and Bingham 2013 for their definitions), with 0 ≤ x_j ≤ 1. For convenience, the input region of the Shubert function is scaled to [0, 1]^2. Figure 1 presents the contour plots of f_s(x) and f_m(x). Note that the two functions produce a smooth output surface across most of the area, while only a small portion of the input region produces wiggly outputs. Suppose that we are given a dataset of size 1000, where the outputs are generated by f_s(x) and f_m(x) with input x generated from the two-dimensional uniform distribution on [0, 1]^2. Our task is to select a subdataset of size 200. Using simple random sampling without replacement (SRS), we obtain the black dots in the contour plots as the subdatasets. The two subdatasets do not appear attractive because the input region that produces the wiggly outputs does not include as many data points as the region that produces the smooth output surface. This illustration makes it clear that SRS and other unsupervised subdata selection schemes do not accurately capture the correspondence between the input and output. We propose a novel supervised subdata selection method in Section 3 and revisit these two functions in Section 4.

GP Model
GP models are widely used to model the output from computer models (Santner, Williams, and Notz 2018). These models are also popular in the machine learning community, due to the theoretical connection with kernel ridge regression based on reproducing kernel Hilbert spaces (Hastie, Tibshirani, and Friedman 2009).
Without loss of generality, we denote the input space as the d-dimensional space [0, 1]^d. We consider GPs over [0, 1]^d → R based on an output vector y = (y(x_1), . . . , y(x_n))^T at n input points x_1, . . . , x_n, where x_i = (x_{i,1}, . . . , x_{i,d})^T is the ith input setting. For an input x, a general form of GP models can be written as

y(x) = m(x) + Z(x) + ε(x),   (1)

where Z(x) is a second-order stationary GP with zero mean and constant variance σ^2, and ε(x) independently follows a zero-mean normal distribution over x ∈ [0, 1]^d with variance σ^2_ε(x) and is uncorrelated with Z(x). If m(x) = ∑_{j=1}^{m} f_j(x)β_j with known functions f_j's and unknown parameters β_j's, then Model (1) is referred to as universal kriging.
Let ϕ(h) be the correlation function of Z(x), which is positive semidefinite with ϕ(h) = ϕ(−h) and ϕ(0) = 1. A popular choice is the power exponential family

ϕ(h) = exp( −∑_{j=1}^{d} θ_j |h_j|^{p_j} ),

where positive θ_j and 0 < p_j ≤ 2, j = 1, . . . , d, are the parameters that control the smoothness of Z(x). When p_1 = · · · = p_d = 2, it reduces to the Gaussian correlation function. An alternative approach is the Matérn correlation function, which is defined through the modified Bessel function K_ν(·) with positive parameters ν and θ_j, j = 1, . . . , d.
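To fix ideas, the following R sketch evaluates the power exponential correlation between two sets of input points; the function name powexp_corr and the example values are illustrative choices rather than part of the article's implementation.

```r
# Power exponential correlation matrix between the rows of X1 and X2.
# theta: positive scale parameters; p: exponents in (0, 2]; p = 2 gives the
# Gaussian correlation function.
powexp_corr <- function(X1, X2, theta, p) {
  K <- matrix(0, nrow(X1), nrow(X2))
  for (j in seq_len(ncol(X1))) {
    hj <- abs(outer(X1[, j], X2[, j], "-"))
    K <- K + theta[j] * hj^p[j]
  }
  exp(-K)
}

# Example: correlation matrix of five random points in [0, 1]^2
set.seed(1)
X <- matrix(runif(10), ncol = 2)
R <- powexp_corr(X, X, theta = c(5, 5), p = c(2, 2))
```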
With data x 1 , . . . , x n and y in hand, it is important to build a predictor for unobserved input locations. We derive a result similar to that in Santner, Williams, and Notz (2018, p. 70) for Model (1) to provide the best linear unbiased predictor (BLUP) and its mean squared prediction error (MSPE). The proof is presented in the supplementary materials.
Theorem 1. Under Model (1), the BLUP η(x_0) at an input location x_0 and its MSPE s^2(x_0) admit closed-form expressions in terms of m(x), σ^2, the covariance matrix Σ_ε of (ε(x_1), . . . , ε(x_n))^T, and the correlation function ϕ(h).

Computing η(x_0) and s^2(x_0) requires knowledge of m(x), σ^2, Σ_ε, and the parameters in ϕ(h). In practice, one can plug in their estimates if they are unknown. It is common to estimate Σ_ε, σ^2, and the parameters in ϕ(h) using (residual) maximum likelihood estimation. To estimate m(x), one can use nonparametric regression techniques such as the local regression estimate and the k-nearest-neighbor estimate. If m(x) = ∑_{j=1}^{m} f_j(x)β_j with known functions f_j's and unknown parameters β_j's, then generalized least squares estimation for the β_j's is a popular choice. Plugging in the corresponding estimators of m(x), σ^2, Σ_ε, and the correlation matrix R = [ϕ(x_i − x_j)]_{1≤i,j≤n}, the resulting predictor is called an empirical BLUP. We use the notation η̂(x_0) and s^2(x_0) throughout this article to represent the empirical BLUP and its MSPE.
Theorem 1 incorporates the case where m(x) is a general mean function, not necessarily of the form ∑_{j=1}^{m} f_j(x)β_j. Thus, we can implement this theorem under a standard nonparametric regression framework by setting σ^2 = 0 and Σ_ε = σ^2_ε I (i.e., constant variance) in Model (1). The identity that follows from Theorem 1 in this case is used in our proof to recover the result for universal kriging reported in Santner, Williams, and Notz (2018, p. 70). By respectively replacing m and m(x_0) with F and f_0 in Theorem 1 via a suitable matrix transpose operation, where F = [f_j(x_i)]_{1≤i≤n, 1≤j≤m} is the model matrix and f_0 = (f_1(x_0), . . . , f_m(x_0))^T, we obtain the BLUP and its MSPE under universal kriging. Universal kriging reduces to ordinary kriging if m = 1 and f_1(x) = 1. Alternatively, if the GP term Z(x) is removed (i.e., σ^2 = 0), then universal kriging reduces to a linear model. Readers can refer to Santner, Williams, and Notz (2018) for further details.
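As a concrete reference point, the R sketch below implements ordinary kriging (m = 1, f_1(x) = 1) with a small nugget and fixed Gaussian correlation parameters. It is a generic textbook implementation meant only to illustrate the empirical BLUP and a plug-in prediction variance; it does not reproduce the exact expressions of Theorem 1, and the function names are illustrative.

```r
# Ordinary kriging with fixed correlation parameters (a minimal sketch).
corr_mat <- function(X1, X2, theta) {
  K <- matrix(0, nrow(X1), nrow(X2))
  for (j in seq_len(ncol(X1)))
    K <- K + theta[j] * outer(X1[, j], X2[, j], "-")^2
  exp(-K)                                  # Gaussian correlation
}

ok_fit <- function(X, y, theta, nugget = 1e-6) {
  R    <- corr_mat(X, X, theta) + diag(nugget, nrow(X))
  Rinv <- solve(R)
  beta <- sum(Rinv %*% y) / sum(Rinv)      # generalized least squares mean
  res  <- y - beta
  list(X = X, theta = theta, Rinv = Rinv, beta = beta,
       w = Rinv %*% res, s2 = drop(t(res) %*% Rinv %*% res) / nrow(X))
}

ok_predict <- function(fit, Xnew) {
  r <- corr_mat(Xnew, fit$X, fit$theta)
  list(mean = fit$beta + drop(r %*% fit$w),                        # empirical BLUP
       var  = fit$s2 * pmax(0, 1 - rowSums((r %*% fit$Rinv) * r))) # plug-in variance
}

# Toy usage on a one-dimensional test function
set.seed(2)
X   <- matrix(sort(runif(15)), ncol = 1)
y   <- sin(2 * pi * X[, 1])
fit <- ok_fit(X, y, theta = 20)
str(ok_predict(fit, matrix(seq(0, 1, by = 0.1), ncol = 1)))
```

The prediction variance above uses the simple kriging form and omits the correction for estimating the mean; it suffices to illustrate the quantities that enter the EI criterion of Section 3.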
Several extensions and modifications to universal kriging have been proposed in the literature. Joseph (2006) modified ordinary kriging to make it more robust to model misspecifications. Joseph, Hung, and Sudjianto (2008) proposed blind kriging by applying a model selection algorithm to m j=1 f j (x)β j . Ba and Joseph (2012) designed a composite GP model for nonstationary response surfaces. More recently, researchers have advocated the use of additive GPs (Deng et al. 2017; Lin and Joseph 2020). Nevertheless, ordinary kriging is still attractive and supported by Welch et al. (1992), who argued that erroneously selecting a structure of m j=1 f j (x)β j can lead to high prediction variation. This phenomenon could be partially attributed to the complex effect aliasing between m j=1 f j (x)β j and Z(x) (Chang, Cheng, and Cheng 2019). For the numerical examples presented in this article, we model the data by ordinary kriging with the Matérn or Gaussian correlation function.

Methodology
In this section, we outline a method by which to select predictive subdata based on the EGO algorithm. Our focus here is on GP models; however, the proposed method could be applied to a variety of statistical models. Additionally, we define a quantity in our algorithm in order to facilitate the optimization procedure.

Improvement Function and its Properties
The EGO algorithm presented in Schonlau, Welch, and Jones (1998) and Jones, Schonlau, and Welch (1998) was originally developed for global minimization/maximization tasks, in which an improvement function was explicitly defined. The EGO algorithm computes the conditional expectation of the improvement function based on the current dataset. Then, the data point that reaches the maximal EI is added to the dataset. The EGO algorithm is an example of an EI algorithm. This algorithm, adding one point at a time, has been applied to different purposes, such as contour estimation and overall model fitting (Lam and Notz 2008; Ranjan, Bingham, and Michailidis 2008). Here we propose a new EI algorithm capable of selecting multiple predictive data points simultaneously.
We develop our method under the model presented in (1). Suppose that we have a large dataset D = {x_i : i = 1, . . . , n} with the outputs y(x_i)'s. Let f(x) be the true value of the output at x. For a deterministic process, we observe y(x) = f(x); otherwise, y(x) = f(x) + ε(x), where ε(x) is a zero-mean random error. For a given k ≥ 1, a predictive model should be close to the true output f(x) at any k unobserved input locations x_{i_1}, . . . , x_{i_k}. We define our improvement function as

I(x_{i_1}, . . . , x_{i_k}) = ‖η_0 − t_0‖^2,

where η_0 and t_0 are respectively the vectors of η(x) and f(x) evaluated at x_{i_1}, . . . , x_{i_k}. For a given k, we anticipate that x_{i_1}, . . . , x_{i_k} contribute considerably to learning f(x) and are to be added to the current dataset as long as the improvement is large. Because I(x_{i_1}, . . . , x_{i_k}) is a random quantity due to the uncertainty in η(x) and ε(x), the improvement is averaged over the function class of η(x). Let D_s ⊆ D be a subdataset with the output vector y and let Σ_η = var(η_0 | y), which is the covariance matrix of η_0 conditioned on y. By taking the expectation of the improvement function conditional on y, we obtain the EI function

E(I(x_{i_1}, . . . , x_{i_k}) | y) = tr(Σ_η) + ∑_{j=1}^{k} { (E[η(x_{i_j}) | y] − y(x_{i_j}))^2 + var(ε(x_{i_j})) }.   (3)

The EI function E(I(x_{i_1}, . . . , x_{i_k}) | y) contains unknown components such as m(x) and the parameters in ϕ(h). In practice, we replace E(η_0 | y) with the empirical BLUPs η̂_0 = [η̂(x_{i_j})]_{1≤j≤k} introduced in Theorem 1. All of the unknown parameters, including those in Σ_η and Σ_ε, are replaced with their estimates.
To select the k data points with the largest EI, one may compute E(I(x)|y) for each x ∈ D − D_s and then select those with the largest k values. The EI function in (3) has a similar connotation to the q-EI criterion proposed by Schonlau (1997). Contrary to q-EI, our EI function is computationally feasible under massive data. The EI function in (3) is a sum of three terms that are interpretable. Here, we explain this under k = 1 for the sake of simplicity. A data point is added to the current subdata if it has the largest E(I(x)|y) among all the data points belonging to D − D_s. The term var(η(x)|y) measures the uncertainty of the prediction at x, (η̂(x) − y(x))^2 presents the prediction error, and var(ε(x)) provides the noise variation. Thus, a data point in D − D_s is expected to yield a large improvement in prediction accuracy if, given the current subdata D_s, it has a large prediction variance as well as a large prediction error. Note that the prediction error (η̂(x) − y(x))^2 cannot be computed in the context of design construction due to the fact that y(x) is unknown there, while it is available here because y(x) is known for every x ∈ D. For deterministic computer models, var(ε(x)) is zero and η̂(x) is an interpolator. Suppose that the full data D = {x_i : i = 1, . . . , n} have extremely large size and are "dense" in the sense that for any x ∈ [0, 1]^d, the correlation vector r(x) = [ϕ(x − x_s)]_{1≤s≤n} between x and the data points is arbitrarily close to some column R_i of the correlation matrix [ϕ(x_l − x_j)]_{1≤l,j≤n}. Then, under the ordinary kriging model y(x) = β + Z(x) with a power exponential correlation function, it follows from Theorem 1 in Ranjan, Bingham, and Michailidis (2008) that E(I(x)|y_S) tends toward zero as the subdata size M approaches n, where y_S denotes the outputs of the subdata and η̂_S(x) is the BLUP fitted by the subdata. In this case, the EI gradually declines and eventually no improvement can be achieved with an increase in the amount of subdata.
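Because all outputs in the full data are stored, the plug-in EI only needs the predictive mean and variance at the candidate points together with their recorded outputs. A minimal sketch of this computation is given below; the function and argument names are illustrative, and the noise term defaults to zero as in the deterministic case.

```r
# Plug-in EI of (3) for candidate points in D - D_s.
# pred_mean, pred_var: empirical BLUP and its conditional variance at the
#   candidates, computed from a model fitted to the current subdata D_s.
# y_cand: stored outputs of the candidates (known, since they belong to D).
# noise_var: var(eps(x)); zero for deterministic computer models.
expected_improvement <- function(pred_mean, pred_var, y_cand, noise_var = 0) {
  pred_var + (pred_mean - y_cand)^2 + noise_var
}

# The k candidates with the largest EI values are the ones added to the subdata:
# idx_add <- order(expected_improvement(mu, v, y_cand), decreasing = TRUE)[1:k]
```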
The proposed method requires an initial subdataset. SRS would no doubt work for this; however, we use the SPlit method proposed by Joseph and Vakayil (2022), which was developed to facilitate the selection of a representative portion of points from the entire dataset. The SPlit method selects the subdataset with the minimal energy distance (Székely and Rizzo 2013), such that the empirical distribution of the resulting subdata is as similar as possible to that of the entire dataset. Joseph and Vakayil (2022) used a difference-of-convex programming technique and a sequential nearest neighbor assignment based on kd-trees. When applied to subdata selection, the SPlit method can be regarded as an unsupervised selection method. After selecting an initial subdataset, we then need to decide the number of points k to be added in each update of the algorithm. Setting k = 1 throughout is consistent with the spirit of the traditional EGO algorithm; however, we allow k to vary in each iteration of subdata updating to ensure that the overall computational cost is acceptable.

Proposed Algorithm
Given a subdataset D_s, we add the data points in D − D_s that attain the largest EI values in (3), and the procedure is repeated until a stopping condition is met. In this article, the stopping condition is defined as a prespecified subdata size based on the computational capacity of the user.
The aim of the proposed method is to yield accurate predictions; therefore, we define a quantity indicating the prediction ability of a dataset, referred to as predictive R^2 and denoted by R^2_pred. Given a subdataset D_s, we can build a predictor η̂(x) and evaluate the prediction performance of D_s by computing the mean squared error

MSE(D_s) = (1/|D − D_s|) ∑_{x ∈ D − D_s} (η̂(x) − y(x))^2.

If MSE(D_s) < MSE(D'_s) for two subdatasets D_s and D'_s, then the predictions of η̂(x) on D − D_s given D_s are more accurate than those on D − D'_s given D'_s. Therefore, we define

R^2_pred(D_s) = 1 − MSE(D_s)/MSE(D_0),

where D_0 denotes the initial subdata. As with the adjusted R squared in linear regression (Faraway 2014), R^2_pred(D_s) becomes negative if D_s is less predictive than D_0. We propose to use an unsupervised method, such as SRS or SPlit, to select an initial subdataset. Because the proposed method is a supervised one, the number |D_0| determines the balance between data points respectively selected using unsupervised and supervised methods. The proposed implementation can be summarized as follows:

1. Determine the size of the subdata, denoted by M.

2. Use an unsupervised method to select an initial subdataset D_0 of suitable size, which should exceed the number of parameters in the model. Empirically, we recommend |D_0| = ⌈0.3M⌉, with ⌈·⌉ the ceiling function, and the SPlit method for selecting D_0 (Joseph and Vakayil 2022).

3. Denote by a_s the number of points to be added to D_s. Set a_0 = 1 and D_1 = D_0. Given D_s with s ≥ 1, if R^2_pred(D_s) ≥ R^2_pred(D_{s−1}), then set a_s > a_{s−1} (e.g., a_s = 2a_{s−1}); otherwise set a_s < a_{s−1} (e.g., a_s = ⌈a_{s−1}/2⌉). If |D_s| + a_s > M, then set a_s = M − |D_s|.

4. Fit the model to D_s, compute the EI in (3) for the data points in D − D_s, and add the a_s points with the largest EI values to D_s to form D_{s+1}.

5. Repeat Steps 3 and 4 until the size of the subdata reaches M.

In Step 3, we allow the number of points to be added, denoted by a_s, to vary across iterations in order to reduce the computational cost. The idea is that if R^2_pred(D_s) increases, we have more confidence in the prediction ability of D_s. Thus, we set a_{s+1} > a_s in the next iteration to mitigate the computational burden of the proposed method. However, a decrease in R^2_pred(D_s) would indicate that the current subdata may not be capable of providing the best predictions. We therefore set a_{s+1} < a_s to minimize the effects of potentially poor predictions. In our experience, the update size is inversely related to prediction performance; in other words, there is a tradeoff between prediction performance and computational speed.
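The following self-contained R sketch runs Steps 1 to 5 on a small one-dimensional example, using SRS for the initial subdata, a plain ordinary kriging fit with fixed correlation parameters as the emulator, and the doubling/halving rule for a_s. All function names, the stand-in test function, and the tuning values are illustrative assumptions; the sketch is not the authors' implementation.

```r
# Schematic EI-based subdata selection (Steps 1-5) on a 1-d toy problem.
set.seed(3)
corr_mat <- function(x1, x2, theta = 50) exp(-theta * outer(x1, x2, "-")^2)

ok_fit <- function(x, y, nugget = 1e-4) {
  R    <- corr_mat(x, x) + diag(nugget, length(x))
  Rinv <- solve(R)
  beta <- sum(Rinv %*% y) / sum(Rinv)
  res  <- y - beta
  list(x = x, Rinv = Rinv, beta = beta, w = Rinv %*% res,
       s2 = drop(t(res) %*% Rinv %*% res) / length(x))
}
ok_predict <- function(fit, xnew) {
  r <- corr_mat(xnew, fit$x)
  list(mean = fit$beta + drop(r %*% fit$w),
       var  = fit$s2 * pmax(0, 1 - rowSums((r %*% fit$Rinv) * r)))
}

# Full data: a stand-in deterministic test function observed at n random inputs
n      <- 500
x_full <- runif(n)
y_full <- sin(6 * pi * x_full) * exp(-2 * x_full)

M   <- 60                          # Step 1: target subdata size
sel <- sample(n, ceiling(0.3 * M)) # Step 2: initial subdata (SRS here; SPlit in the article)

mse <- function(sel) {             # mean squared error on D - D_s
  pred <- ok_predict(ok_fit(x_full[sel], y_full[sel]), x_full[-sel])
  mean((pred$mean - y_full[-sel])^2)
}
mse0    <- mse(sel)                                  # reference value from D_0
r2_pred <- function(sel) 1 - mse(sel) / mse0         # predictive R^2

a <- 1; r2_old <- 0                # Step 3: initial update size a_s
while (length(sel) < M) {
  a    <- min(a, M - length(sel))
  fit  <- ok_fit(x_full[sel], y_full[sel])
  cand <- setdiff(seq_len(n), sel)
  pred <- ok_predict(fit, x_full[cand])
  ei   <- pred$var + (pred$mean - y_full[cand])^2    # EI in (3), deterministic case
  sel  <- c(sel, cand[order(ei, decreasing = TRUE)[seq_len(a)]])   # Step 4
  r2_new <- r2_pred(sel)
  a      <- if (r2_new >= r2_old) 2 * a else max(1, ceiling(a / 2))
  r2_old <- r2_new                 # Step 5: loop until |D_s| = M
}
r2_pred(sel)                       # prediction ability of the selected subdata
```

Replacing the SRS initialization, the fixed Gaussian correlation, and the stand-in test function with the article's choices (SPlit initialization and estimated Matérn parameters) recovers the procedure described above.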
It is possible to use R 2 pred as a stopping condition for the algorithm if the subdata size is not specified in advance. The proposed algorithm can be stopped at subdata D s if R 2 pred (D s ) is high and stable, as shown in Figure 7.
Model fitting is needed to compute E(I(x i 1 , . . . , x i k )|y) in Step 4. Unlike unsupervised selection methods, the proposed method introduces extra computational cost associated with model fitting. In Section 4, we demonstrate that this cost tends to be affordable because M is usually far smaller than full data size n. Furthermore, the inclusion of output-related information yields subdata that is well-suited to prediction performance.
We analyze the computational complexity of the proposed algorithm in selecting a subdataset of size M under the following setting: a_s = c × a_{s−1} for s ≥ 1 and c > 1 in Step 3. The proposed algorithm performs model fitting and model prediction in each iteration. First, we consider the complexity of model fitting. Let h(B) be the computational complexity of model fitting given a dataset of size B. Then, letting Q be the total number of iterations in the algorithm, we obtain Q = ⌈log_c{1 + (c − 1)(M − 1)}⌉, where ⌈·⌉ denotes the ceiling function. The complexity of obtaining a subdataset of size M is then ∑_{q=1}^{Q} h(B_q), where B_q is the size of the subdata at the qth iteration. It has been established that ordinary kriging yields computational complexity O(B^3) when fitting a dataset of size B, where O(·) denotes the big O notation. Thus, under ordinary kriging, it follows from h(B) = O(B^3) that the model-fitting complexity of the proposed method is

∑_{q=1}^{Q} O(B_q^3) = O((1 + (c − 1)(M − 1))^3 / c^3) = O(M^3),

where the last equality follows because c is far smaller than M in practice (e.g., c = 2 and M = 1000). Second, we consider the complexity of model prediction. We let g(B) be the computational complexity of model prediction at one input location given a dataset of size B, so that the complexity under Q iterations is of the order ∑_{q=1}^{Q} L × g(B_q), where L ≤ n is the number of prediction locations per iteration. Under ordinary kriging, we need only to compute (R + (σ̂^2_ε/σ̂^2) I_n)^{−1}(y − 1β̂) once, resulting in complexity O(B × B × 1) = O(B^2) if the subdataset at this iteration is of size B. After that, the complexity of predicting L points is L × O(1 × B × 1) = L × O(B), yielding total complexity O(B^2) + L × O(B) at this iteration. Thus, the prediction complexity of the proposed method is ∑_{q=1}^{Q} {O(B_q^2) + L × O(B_q)} = O(M^2) + L × O(M). The total computational complexity for the proposed algorithm is therefore O(M^3 + M^2 + LM) = O(M^3 + LM). Since the subdata size M is usually far smaller than the full data size n and L ≤ n, this is of a lower order than O(n^3), which is the complexity of directly fitting ordinary kriging to the full data.
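For a rough sense of scale, the iteration count and the leading-order operation counts can be evaluated directly; the numbers below use c = 2, M = 1000, and n = 100,000 purely as illustrative values.

```r
c_ <- 2; M <- 1000; n <- 1e5
Q  <- ceiling(log(1 + (c_ - 1) * (M - 1), base = c_))  # number of iterations: 10
Q
M^3 + n * M   # leading-order cost of the proposed method: about 1.1e9 operations
n^3           # cost of fitting ordinary kriging to the full data: 1e18 operations
```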

Effect of Adding Data Points
In the following, we discuss how a GP predictor is affected when data points are added to the subdataset. The stochastic simple kriging model (m(x) = 0 in Model (1)) is considered for illustration. Here, we denote the full dataset as D, the initial subdataset as D_0, and the current subdataset as D_s. Denoting by U_i the set of data points in D added to D_i for i = 0, . . . , s−1, we have D_{i+1} = D_i ∪ U_i, i = 0, . . . , s−1. Given two datasets H_1 and H_2, let η_{H_1,H_2} = [η̂(x)]_{x∈H_1} be the vector of predictions at the input locations of H_1, where η̂(x) is the empirical BLUP fitted by H_2 under Model (1). Then we have the following result, with the proof in the supplementary materials.

Theorem 2. Let η_{T,D_{i+1}} and η_{T,D_i} be the prediction vectors at a set of testing input locations T, let η_{U_i,D_i} = [η̂(x)]_{x∈U_i}, and let y_{D_i} and y_{U_i}, respectively, indicate the vectors of outputs of D_i and U_i. Given Model (1), we can obtain

η_{T,D_{i+1}} = η_{T,D_i} + η_{T,U_i|D_i},  with  η_{T,U_i|D_i} = C_i^T V_i^{−1} (y_{U_i} − η_{U_i,D_i}),

where C_i = cov(η_{U_i}, η_T | y_{D_i}) is the estimated covariance matrix between η_{U_i} and η_T given D_i, and V_i = var(η_{U_i} | y_{D_i}) is the estimated covariance matrix of η_{U_i} given D_i.
From Theorem 2, we observe that if subdataset D_i contains enough information about U_i for some i ∈ {0, . . . , s − 1}, in the sense that η_{U_i,D_i} = y_{U_i}, then η_{T,U_i|D_i} = 0 and the predictions at T are unchanged regardless of whether D_i or D_{i+1} = D_i ∪ U_i is used to build the model. Suppose that one data point is to be added to D_s; that is, U_s = {u}. Then, in accordance with Theorem 2, we can obtain

η_{T,D_{s+1}} = η_{T,D_s} + c^T (y(u) − η_{u,D_s}) / var(η(u) | y_{D_s}),

where η_{u,D_s} = η̂(u) is the empirical BLUP fitted by D_s under Model (1). In this setting, c is the 1 × |T| vector of the cov(η(u), η(x) | y_{D_s})'s for x ∈ T. It is known that for any vector norm ‖·‖ the reverse triangle inequality | ‖a‖ − ‖b‖ | ≤ ‖a − b‖ holds. Thus, the difference between the prediction accuracy of η_{T,D_{s+1}} and that of η_{T,D_s} is bounded by the difference between η_{T,D_{s+1}} and η_{T,D_s}. In accordance with the theory of matrix algebra, any two vector norms are equivalent in the sense that one is bounded by a constant multiple of the other. Here we take ‖·‖ to be the max-norm, denoted by ‖·‖_∞, and we obtain the following result.
Theorem 3. Given a set of testing input locations T and a subdataset D_s, assume that var(η(u) | y_{D_s}) ≥ max_{x∈T} var(η(x) | y_{D_s}), where u ∈ D − D_s is a data point to be added to D_s. Then

‖η_{T,D_{s+1}} − η_{T,D_s}‖_∞ ≤ |y(u) − η_{u,D_s}|.

The equality holds when var(η(u) | y_{D_s}) = max_{x∈T} var(η(x) | y_{D_s}) and corr(η(u), η(x*) | y_{D_s}) = 1 for x* = arg max_{x∈T} var(η(x) | y_{D_s}).
Note that the EI criterion in (3) contains the two terms (η_{u,D_s} − y(u))^2 and var(η(u) | y_{D_s}); therefore, we argue that maximizing the EI criterion tends to result in the selection of data points that can lead to a large gap in prediction accuracy. However, Theorem 3 does not indicate whether prediction accuracy increases or decreases after u is added to the subdata. The R^2_pred can be used to determine whether additional data points, beyond u, need to be selected.

Numerical Results
We compare the proposed method with the supervised data compression method of Joseph and Mak (2021), denoted by Supercom and implemented with the R (R Core Team 2021) package supercompress under the recommended setting lam = 1/(1 + d), under various simulation settings and two real-world datasets. We also consider SRS, which has been widely used as a benchmark due to its low computational cost. Note that in the motivating examples, we also include the SPlit method (Joseph and Vakayil 2022). Since the IBOSS method (Wang, Yang, and Stufken 2019) was developed under the linear regression framework, we consider it only in one real-world data analysis (the WEC dataset), in which linear regression is used for model fitting and prediction. All numerical results are implemented using R on a desktop computer equipped with a 2.6 GHz CPU and 32 GB of RAM, where the CPU computation time is expressed in seconds.
In the following, we present seven simulation examples (two in the supplementary materials) and analyses based on two real-world datasets. The functions that generate the outputs in the simulations are detailed in Surjanovic and Bingham (2013). In formulating predictions, we adopt ordinary kriging with the Matérn or Gaussian correlation function to fit the data, using the mleHomGP function in the R package hetGP for implementation (except for one case in the WEC data).
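Prediction accuracy throughout this section is summarized by the root mean squared prediction error over a held-out testing set, repeated across replications; a minimal helper is sketched below with illustrative object names.

```r
# RMSPE of predictions against the testing outputs
rmspe <- function(pred, y_test) sqrt(mean((pred - y_test)^2))

# Typical pattern over 100 replications (schematic):
# rmspes <- replicate(100, { # generate data, select subdata, fit, predict, then
#                            rmspe(pred, y_test) })
# summary(rmspes)   # summarized as violin-box plots in the figures
```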

Two-Dimensional Functions
For the examples in this section, the input locations of the full data and testing data are generated from a uniform distribution on [0, 1] 2 at 1000 and 100,000 points, respectively. We repeatedly generate full datasets and testing datasets 100 times, and examine the prediction performance based on the 100 prediction errors. The first example revisits the motivating example and the second example presents a highly nonlinear function referred to as the Drop-Wave function. We employ ordinary kriging with the Matérn correlation function with p = 2 for model fitting and prediction.

Motivating Examples Revisited
We revisit the motivating example from Section 2.1. Here, we use M = 80 and M = 200, respectively, for the Michalewicz and Shubert functions as the subdata size for all subdata selection methods. When the proposed method is applied, the initial data are obtained by randomly selecting 0.3M data points from the full dataset. The update size in our algorithm is set at one in each iteration; that is, a_s = 1 for all s. Figure 2 and the upper two panels of Figure 5, respectively, present the contour plots of the Michalewicz function and the Shubert function. In each figure, the black dots in the contour plots indicate the subdata selected by the proposed method and Supercom from the 100th full dataset. The subdata selected by the proposed method for the Michalewicz function and for the Shubert function, respectively, yield RMSPEs of 0.0751 and 18.3306, and have R^2_pred = 0.95 and R^2_pred = 0.91, where Figure 7 presents their values during the iterations. Our subdata yield high and stable R^2_pred under both functions. As shown in the contour plots, both the proposed method and Supercom select a larger number of points around the region that produces high output variation, with fewer points in other regions. The results obtained using the two methods are similar; however, the proposed method selects a larger number of points in the blue region, which produces the lowest output surface, than Supercom does. Figure 3 and the bottom two panels of Figure 5 provide the contour plots fitted by ordinary kriging in conjunction with the subdatasets selected by the proposed method and Supercom, respectively. The proposed method yields contour plots closer to the Michalewicz and Shubert functions.

Figures 4 and 6 present the violin-box plots of the 100 root mean squared prediction errors (RMSPEs) at the testing input locations for all methods. Unsurprisingly, using the full dataset yields the lowest RMSPE. We can see that the proposed method, with an RMSPE of 0.0757 on average, performs well for the Michalewicz function. For the Shubert function, all the methods perform similarly, while the proposed method sometimes yields a far lower RMSPE than the other methods, including the use of the full data. This result could be attributed to the ill-conditioned problem of ordinary kriging, in which the full dataset includes many input locations that are too close together (Haaland and Qian 2011).

Figure 8 gives the subdata selected by the proposed method from the 100th full dataset for the Michalewicz function at different iterations of the algorithm. Each number indicates the order in which a data point was added. The upper-left panel shows the subdata with eight data points, where the first three form the initial subdata; the upper-right panel presents the subdata with 32 points; the lower-left panel presents the subdata with 64 points; and the lower-right panel presents the subdata at the final iteration, with 80 points. We can see that as the size of the subdata increases to 32, a large portion of the points are located in the region that produces the wiggly output surface. When the subdata size reaches 64, most of the data points selected by the proposed method are located in the flat region. Thus, it appears that these 64 data points are sufficient to reconstruct the Michalewicz function via ordinary kriging.
The computing time associated with the selection of subdata and prediction at testing locations is an important issue in data reduction. Unsurprisingly, the proposed method, which is a supervised selection method, is slower than unsupervised selection methods. For the Michalewicz function, the fastest methods are SPlit (an average of 1.15 sec) and Supercom (an average of 1.29 sec). The proposed method takes an average of 3.04 sec, whereas using the full dataset takes an average of 73.77 sec. For the Shubert function, the fastest methods are SPlit (an average of 3.74 sec) and Supercom (an average of 4.34 sec). The proposed method takes an average of 16.82 sec, whereas using the full dataset takes an average of 74.06 sec. We note that the proposed method can be sped up by setting a larger a_s in the algorithm. For the Michalewicz function, if we set a_{s+1} = 3a_s provided that R^2_pred(D_s) ≥ R^2_pred(D_{s−1}) and a_{s+1} = ⌈a_s/3⌉ otherwise, then the proposed method takes roughly 1.80 sec with an RMSPE of 0.1247, which is lower than the average RMSPEs obtained using the other methods.
The functions considered in the motivating example have a large amount of input area that produces nearly constant outputs. In the next example, we consider a two-dimensional function with a highly complex and nonlinear input-output relationship.

Drop-Wave Function
The Drop-Wave function is multimodal and highly complex, with expanding ripples similar to those generated when an object is dropped onto a liquid surface. It is defined as

f(x) = − (1 + cos(12 √(x_1^2 + x_2^2))) / (0.5(x_1^2 + x_2^2) + 2),

where −2 ≤ x_i ≤ 2, i = 1, 2. Here the input region is scaled to [0, 1]^2. Using the full data takes an average of 147.31 sec. When using the proposed method, we set the size of the subdata to M = 200 and the initial subdataset to 0.3M = 60, where the initial subdata are selected by the SRS method. We fix a_s = 1 in each iteration. The computing time for the proposed method to select a size-200 subdataset is 25.81 sec on average, while the Supercom method takes an average of 4.27 sec. The upper two panels of Figure 9 present the contour plot of the Drop-Wave function, in which the unfilled and black circles denote the 100th full data and subdata, respectively. The proposed method selects more data points around the center, as there are more peaks and a larger variation in f(x) there. The two bottom panels in Figure 9 present the contour plots fitted by ordinary kriging under the subdatasets selected by the proposed method and Supercom, respectively. The proposed method yields a contour plot closer to the Drop-Wave function. For the comparison of the prediction ability of the different methods, Figure 10 shows the violin-box plots of the 100 RMSPEs for the full dataset and the subdatasets selected by the proposed method, Supercom, and SRS. The proposed method performs the best among all approaches except for the use of the full dataset.

High-Dimensional Functions
In the following, we consider three high-dimensional functions: the Borehole function (d = 8), Welch function (d = 20), and Robot Arm function (d = 20 with 8 active inputs and 12 inert inputs). The Borehole function models the water flow through a borehole. The Robot Arm function models the position of a robot arm that has four segments. More details can be found in Surjanovic and Bingham (2013).
For each of the three functions, we assume that the d input variables are equally correlated with correlation coefficient ρ. We generate the full dataset of 100,000 points, respectively, from: (a) the d-dimensional uniform distribution with ρ = 0 and (b) the d-dimensional normal distribution with ρ = 0.5. To evaluate prediction performance, we generate 1000 testing input locations from the same distribution as the full dataset. We repeat the procedure described above 100 times and then compute the resulting 100 RMSPEs. With this full data size, ordinary kriging is not feasible due to memory limitations, thereby demonstrating the importance of subdata selection. Although linear regression is still workable, as shown later, it does not produce desirable predictions.
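The two input-generating settings can be sketched as follows; the mapping of the correlated normal inputs onto each function's input domain is left implicit, and the object names are illustrative.

```r
library(MASS)          # for mvrnorm

n <- 100000; d <- 8    # e.g., the Borehole function
# (a) independent uniform inputs (rho = 0)
X_unif <- matrix(runif(n * d), nrow = n, ncol = d)

# (b) equally correlated normal inputs with rho = 0.5
rho    <- 0.5
Sigma  <- (1 - rho) * diag(d) + rho     # compound-symmetric correlation matrix
X_norm <- mvrnorm(n, mu = rep(0, d), Sigma = Sigma)
```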
The subdata size is set to M = 1000. In the proposed method, we set the size of the initial subdataset to M × 0.3 = 300, selected by the SPlit method. We also set a_{s+1} = 2a_s if R^2_pred(D_{s+1}) ≥ R^2_pred(D_s) and a_{s+1} = ⌈a_s/2⌉ otherwise. The methods included in this comparison are the proposed method, Supercom, and SRS, in which ordinary kriging with the Matérn correlation function with p = 2 is used for model fitting and prediction. For the sake of comparison, we also include linear regression under the full dataset.

Borehole Function
The Borehole function has the advantage of simplicity and quick evaluation, making it widely used when testing new methods. It models the water flow through a borehole and is given by

f(x) = 2π x_3 (x_4 − x_6) / { ln(x_2/x_1) [ 1 + 2 x_7 x_3 / (ln(x_2/x_1) x_1^2 x_8) + x_3/x_5 ] },

where x_1 stands for the radius of the borehole, x_2 the radius of influence, x_3 the transmissivity of the upper aquifer, x_4 the potentiometric head of the upper aquifer, x_5 the transmissivity of the lower aquifer, x_6 the potentiometric head of the lower aquifer, x_7 the length of the borehole, and x_8 the hydraulic conductivity of the borehole. The input region is scaled to [0, 1]^8. Applying linear regression to the full dataset takes an average of 0.61 sec. For each of the settings (a) and (b), the computing time for the proposed method to select a size-1000 subdataset is 367.11 sec on average, while Supercom takes an average of 433.83 sec. Figure 11 presents the violin-box plots of the 100 RMSPEs at the testing locations, where the upper and bottom panels, respectively, correspond to settings (a) and (b). Overall, the proposed method yields the lowest RMSPEs among the competing methods.
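A direct R implementation of the Borehole function in the x_1, ..., x_8 notation above is given below. The input ranges in the comments are the ones commonly used in the literature (e.g., Surjanovic and Bingham 2013) and are stated here as assumptions; the article rescales these ranges to [0, 1]^8.

```r
# Borehole function: water flow through a borehole.
# Commonly used input ranges (before rescaling to [0, 1]^8):
#   x1 (radius of borehole)               [0.05, 0.15]
#   x2 (radius of influence)              [100, 50000]
#   x3 (transmissivity, upper aquifer)    [63070, 115600]
#   x4 (potentiometric head, upper)       [990, 1110]
#   x5 (transmissivity, lower aquifer)    [63.1, 116]
#   x6 (potentiometric head, lower)       [700, 820]
#   x7 (length of borehole)               [1120, 1680]
#   x8 (hydraulic conductivity)           [9855, 12045]
borehole <- function(x) {
  lr  <- log(x[2] / x[1])
  num <- 2 * pi * x[3] * (x[4] - x[6])
  den <- lr * (1 + 2 * x[7] * x[3] / (lr * x[1]^2 * x[8]) + x[3] / x[5])
  num / den
}

# Example evaluation at the midpoints of the ranges above
borehole(c(0.10, 25050, 89335, 1050, 89.55, 760, 1400, 10950))
```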

Welch Function
We consider a function of higher dimension than the Borehole function in this example. The Welch function (Welch et al. 1992) has d = 20 inputs; its definition can be found in Surjanovic and Bingham (2013).

Robot Arm Function: Noise Outputs with Inert Inputs
Computer models may generate data with numerical noise. In this example, we consider an output that includes random errors, in conjunction with several inert input variables. The Robot Arm function, used commonly in the neural network literature, models the position of a robot arm that has four segments. The output is the distance from the end of the robot arm to the origin. This function is defined as

f(x, w) = √(u^2 + v^2),  with  u = ∑_{i=1}^{4} x_i cos(∑_{j=1}^{i} w_j)  and  v = ∑_{i=1}^{4} x_i sin(∑_{j=1}^{i} w_j),

where 0 ≤ x_i ≤ 1 denotes the length of the ith segment and 0 ≤ w_i ≤ 2π the ith joint angle. Here the input region is scaled to [0, 1]^8. In addition to the eight inputs, we add twelve inert input variables with values randomly selected from the uniform distribution on [0, 1]^12; the input region then becomes [0, 1]^20. We include random errors in this example. The output at an input location (x, w) is obtained as f(x, w) + ε, where ε is a standard normal random variable.
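A minimal R implementation of the noisy output used in this example is sketched below, following the standard form of the Robot Arm function on its native scales (the article further rescales the inputs to [0, 1]); the twelve inert inputs are generated but deliberately ignored by the function.

```r
# Robot Arm function: distance from the end of a four-segment arm to the origin.
# x: segment lengths in [0, 1]^4; w: joint angles in [0, 2*pi]^4.
robot_arm <- function(x, w) {
  cumang <- cumsum(w)                 # cumulative joint angles
  u <- sum(x * cos(cumang))
  v <- sum(x * sin(cumang))
  sqrt(u^2 + v^2)
}

set.seed(4)
x     <- runif(4)                     # active inputs: segment lengths
w     <- runif(4, 0, 2 * pi)          # active inputs: joint angles
inert <- runif(12)                    # inert inputs: do not affect the output
y     <- robot_arm(x, w) + rnorm(1)   # f(x, w) plus a standard normal error
```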
Here, we consider setting (a) only. Supercom takes an average of 2269.74 sec to select a size-1000 subdataset, whereas the proposed method consumes 1403.59 sec on average to reach a subdataset of the same size. Figure 13 presents the violin-box plots of the 100 RMSPEs at the testing locations. The proposed method performs well despite the presence of the random errors and the inert inputs.

Applications to Real-World Data
We evaluate the proposed method by applying it to two real-world datasets, each of them divided into a training dataset and a testing dataset. We assess the prediction accuracy of the models trained using the selected subdata by computing the RMSPE for the testing dataset. We also include the following methods for comparison: (i) SRS and (ii) linear regression under the full data. When using the first dataset (SARCOS), we consider the Supercom method. For the second dataset (WEC), we include IBOSS when the model is fitted by linear regression. Note that ordinary kriging is not feasible for the two full datasets due to their large sizes.
The initial subdata size in the proposed algorithm is set to 0.3M for selecting a size-M subdataset. For the proposed algorithm, when the fitted model is ordinary kriging, we set a_{s+1} = 2a_s if R^2_pred(D_{s+1}) ≥ R^2_pred(D_s) and a_{s+1} = ⌈a_s/2⌉ otherwise. When the fitted model is linear regression, we set a_{s+1} = 2a_s if R^2_pred(D_{s+1}) ≥ R^2_pred(D_s) and a_{s+1} = 1 otherwise. Except for IBOSS and linear regression under the full data, which are deterministic, each method is repeated 100 times.

SARCOS Data
In the following, we consider the SARCOS dataset mentioned in Rasmussen and Williams (2006), which relates to an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (seven joint positions, seven joint velocities, seven joint accelerations) to the corresponding seven joint torques. Rasmussen and Williams (2006) presented the results for just one of the seven mappings, from the 21 input variables to the first of the seven torques. The inverse dynamics model of the robot is highly nonlinear due to the robot dynamics. As described in Rasmussen and Williams (2006), this dataset has been split into a training dataset of size 44,484 and a testing dataset of size 4,449.
We set the subdata size to M = 1000 and use ordinary kriging with the Matérn correlation function with p = 1 to fit the data. In the proposed method, the initial subdataset is selected by the SPlit method. The proposed method takes an average of 280.13 sec, while Supercom takes 392.89 sec on average. Figure 14 presents the violin-box plot of the 100 RMSPEs pertaining to the testing data. Overall, the predictions generated using the proposed method are superior to those of the other methods.

Wave Energy Converters Data
This dataset, provided by the UCI Machine Learning Repository (Dua and Graff 2019) and considered in Wang et al. (2021), consists of positions and absorbed power outputs of wave energy converters in four real wave scenarios from the southern coast of Australia. The full dataset is of size 288,000 with 48 input variables, including 32 location variables and 16 absorbed power variables. We employ the SPlit method to divide the full data into a training dataset of size 201,600 and a testing dataset of size 86,400. As mentioned in Section 2, ordinary kriging and linear regression can be obtained as special cases of universal kriging. In the following, we fit the training dataset, respectively, using ordinary kriging with the Gaussian correlation function and using linear regression. For the proposed method, the initial subdataset is selected using SRS.
For ordinary kriging modeling, we set the subdata size M = 1000. The proposed method takes 2332.33 sec on average to obtain the subdataset. Figure 15 presents the violin-box plots for the 100 RMSPEs at the testing data. The proposed method outperforms SRS in terms of prediction accuracy. We note that the overall RMSPEs are high, which may indicate that ordinary kriging is ill-suited to modeling the outputs given this size-1000 dataset. Increasing the subdata size could enhance prediction accuracy.
For linear regression modeling, we set the subdata size at M = 50,000. Here, an average of 30.24 sec is required for the proposed method to select a subdataset. As shown in Figure 16, the proposed method outperforms IBOSS and a quarter of the SRS subdatasets in terms of prediction accuracy. While the average RMSPE for the proposed method is higher than that of SRS, the prediction performance of the proposed method tends to be more stable. Given that the subdata size far exceeds that used for ordinary kriging, we obtain far lower RMSPEs.

Discussion
Many statistical methods are inapplicable to massive datasets. In this article, we address this problem from the perspective of subsampling. We propose a method by which to select subdata capable of providing accurate predictions based on the EI algorithm. A key feature of our approach is that we develop a supervised method capable of exploiting information related to the input variables as well as the outputs. In the numerical simulations, the proposed method performs well relative to other approaches with reasonable computing time. The proposed method also selects a larger number of data points corresponding to larger variations in output, thereby making it possible to obtain subdata that are highly representative of the entire dataset. The proposed scheme achieves outcomes similar to minimum energy design (Joseph et al. 2015), support points (Mak and Joseph 2018), and supervised data compression (Joseph and Mak 2021). In Sections S3 and S4 of the supplementary materials, we present two numerical studies suggested by the reviewers, which investigate the performance of parameter estimation and the impact of input dimension under subdatasets.
We must emphasize that although the proposed method outperforms the other methods in the numerical examples in Section 4, this does not mean we can claim the proposed method is universally better. In previous studies, IBOSS demonstrated extraordinary performance when applied to data from a linear model. Moreover, Supercom is able to accommodate a diversity of statistical models, whereas the proposed method is designed specifically for deterministic computer codes, in which ordinary or universal kriging is commonly used in emulation. As such, the proposed method appears to be a sensible approach to the selection of subdata for computer models.
Most statistical models run into numerical difficulties when applied to big data, and distinct models differ in the data sizes they can accommodate. Although we use GP models for illustration purposes, the proposed method could be modified to suit other models, such as the multiresolution functional ANOVA model (Sung et al. 2020) and neural networks (Goodfellow, Bengio, and Courville 2016; Chollet 2021). Based on the theoretical connection with kernel ridge regression, the proposed method can be readily applied to the selection of basis functions for smoothing splines (Meng et al. 2020). As the EI function in (3) is crucial to the implementation of the proposed method, the corresponding EI functions of other models would have to be derived.
Bootstrap aggregating (Bagging) is a general technique for improving unstable predictive machine learning algorithms (Breiman 1996). The prediction accuracy of the proposed method can be improved via bagging. One can randomly select initial subdatasets multiple times and then use the proposed method to obtain multiple subdatasets. The prediction at a new input location is obtained by averaging the predictions from these multiple subdatasets. While bagging enhances prediction accuracy, it also introduces additional computational costs, which might make it impractical for real applications. In addition, the subdatasets obtained via bagging are not always the same, which would make it difficult to identify representative subdata.
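The bagging idea sketched above amounts to fitting a model to each of several independently selected subdatasets and averaging the resulting predictions. The sketch below illustrates only the aggregation mechanics, using SRS and linear regression as stand-ins for the proposed EI-based selection and the kriging emulator.

```r
# Bagging over B subdatasets: average the predictions from B models, each
# fitted to an independently selected subdataset (SRS and lm() are stand-ins).
set.seed(5)
n <- 5000; d <- 3; M <- 200; B <- 20
X <- matrix(runif(n * d), n, d)
y <- X[, 1] + sin(2 * pi * X[, 2]) + 0.1 * rnorm(n)
X_new <- matrix(runif(10 * d), 10, d)

preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, M)                                    # one subdataset
  dat <- data.frame(y = y[idx], X[idx, , drop = FALSE])
  fit <- lm(y ~ ., data = dat)
  predict(fit, newdata = data.frame(X_new))
})
bagged_pred <- rowMeans(preds)                           # aggregated prediction
```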
Several research directions arise. One involves setting an optimal update size for the proposed algorithm, which is user-specified in the current study. Another involves developing a model-robust or nonparametric version of the proposed method. While GP models are highly relevant to kernel ridge regression, which is widely treated as nonparametric, there are real-world scenarios in which the input-output relationship cannot be well approximated by GP models. The use of (2) provides a solution, but still needs a thorough study. Another interesting issue, raised by a reviewer, involves the volume of data. Given extremely large datasets, there is not always enough computing memory for the implementation of the proposed method, which requires loading the entire database. A possible remedy is random division of the full dataset into several subsets small enough to be loaded into the computer. The proposed method could then be applied to each subset, whereupon the predictions from all subsets could be aggregated. Thorough investigations of these problems represent nontrivial tasks, which remain for future study.

Supplementary Materials
The supplementary materials include proofs of theorems and additional numerical results.
Appendix: Section S1: the proofs of (3), Theorems 1 and 2. Section S2: the numerical results for Piston function (d = 7) and Wing Weight function (d = 10). Sections S3 and S4: the numerical studies for Bias-Prediction and Larger-d investigations. R code: R programs which can be used to replicate the numerical results in this article.