Pairwise Meta-Modeling of Multivariate Output Computer Models Using Nonseparable Covariance Function

The Gaussian process (GP) is a popular method for emulating deterministic computer simulation models. Its natural extension to computer models with multivariate outputs employs a multivariate Gaussian process (MGP) framework. Nevertheless, with a significant increase in the number of design points and the number of model parameters, building an MGP model is a very challenging task. Under a general MGP model framework with nonseparable covariance functions, we propose an efficient meta-modeling approach featuring a pairwise model building scheme. The proposed method has excellent scalability even for a large number of output levels. Some properties of the proposed method have been investigated, and its performance has been demonstrated through several numerical examples. Supplementary materials for this article are available online.


INTRODUCTION
Many areas in science and engineering are confronted with the need for computer simulations to study complex real-world systems or solve challenging design problems. Due to their increasing complexity, computer simulations for many real-world problems may take an exceedingly long time. Gaussian process-based emulators (Santner, Williams, and Notz 2003; Fang, Li, and Sudjianto 2005) have become a popular remedy to this issue.
The standard statistical framework for emulating (deterministic) computer simulations deals only with univariate output models (Sacks, Schiller, and Welch 1989; Currin et al. 1991). Much less effort has been spent on computer models with multivariate outputs, despite their prevalence in many applications. In addition, computer models with both qualitative and quantitative factors, and ensemble analysis of multiple computer models (e.g., climate models based on various physical models; Tebaldi and Knutti 2007), can also be equivalently dealt with using the framework of multivariate output models. A simple approach to modeling the multivariate output is to separately construct an individual GP (kriging) model for each output. This method is sufficient when there is no correlation across different output levels. However, different output levels from computer simulation models in many real-world applications can be correlated. Ignoring the cross-correlations among different output levels may cause a substantial loss of efficiency, leading to poor parameter estimation and prediction accuracy.
Some attempts have been made at building meta-models with multiple outputs. Kennedy and O'Hagan (2000) used an autoregressive structure to jointly model several computer codes with different levels of accuracy, in the hope of improving the prediction accuracy of the high-fidelity code with information from lower fidelity codes. Han et al. (2009) proposed a hierarchical Bayesian model, called HQQV, to address computer experiments that have both quantitative and qualitative input variables. In their model, different output levels are modeled by independent univariate GPs with similar parameters governed by a hierarchical Bayesian structure. By combining data from all levels, better parameter estimates can be obtained. Another work was by Qian, Wu, and Wu (2008), further refined in Zhou, Qian, and Zhou (2011), where a general framework for building multivariate Gaussian process (MGP) models was proposed. The covariance functions in their framework are separable ones that share a common marginal covariance function at all output levels. This assumption simplifies the covariance structure and significantly reduces the number of model parameters. However, the separability assumption is not appropriate in some applications. Li et al. (2008) developed a test for separability of covariance functions in multivariate random fields, which sheds some light on what type of covariance function may be chosen.
In this work, we deal with the situation where a nonseparable covariance structure is necessary to adequately model the outputs. Nonseparable covariance functions allow different covariance parameters for different output levels, which is much more flexible than separable ones. Some related work on nonseparable covariance functions has recently been done by Melkumyan and Ramos (2011) and Fricker, Oakley, and Urban (2013). Despite the advantages of using nonseparable covariance functions, the model building cost can be prohibitive even with a moderate number of output levels, due to the dramatic increase in the number of model parameters.
Under the general framework of multivariate emulators from Fricker, Oakley, and Urban (2013), we propose an efficient pairwise modeling approach, motivated by the idea of Fieuws and Verbeke (2006) in pairwise modeling of multivariate longitudinal profiles. Our proposed pairwise modeling approach is made possible by a reparameterization technique called hypersphere decomposition, which allows sequential estimation of MGP model parameters without losing the validity of its overall covariance structure. The proposed method has excellent scalability, even with a large number of output levels when regular modeling approaches would normally cease to work. In this article, we restrict our discussions to the deterministic computer simulation models.
The remainder of this article is organized as follows. Section 2 introduces the MGP model with nonseparable covariance functions. Our pairwise modeling approach is developed and discussed in Section 3. In Section 4, several numerical examples are conducted to illustrate the effectiveness of the proposed method. Finally, Section 5 concludes the article with a short discussion.

THE MULTIVARIATE GAUSSIAN PROCESS MODEL
MGP models are natural extensions of univariate GP models (also called kriging; refer to Santner, Williams, and Notz 2003 for more details) to multivariate output problems. Assume that for each input point x ∈ R^p, the computer model produces a k-variate output, denoted as y(x) = [y_1(x), y_2(x), ..., y_k(x)]^T. At the ith output level, the user-specified regression functions are f_i(x) = [f_{i,1}(x), ..., f_{i,m_i}(x)]^T. The MGP model is then defined as

y_i(x) = f_i(x)^T β_i + ε_i(x),  i = 1, ..., k,  (1)

where β_i is the vector of regression coefficients at the ith level. The k-variate error term ε(x) = [ε_1(x), ..., ε_k(x)]^T is assumed to be a stationary MGP with zero mean and covariance function

C_{i,j}(x, x') = cov[ε_i(x), ε_j(x')] = σ_i σ_j R_{i,j}(x, x'),

where R_{i,j}(x, x') is the correlation between x at the ith output level and x' at the jth output level, and σ_i^2 and σ_j^2 are, respectively, the variances of the ith and jth levels. Define σ = [σ_1, σ_2, ..., σ_k]^T. In Model (1), each marginal process ε_i(x) is a stationary univariate GP with zero mean and covariance function C_{i,i}(x, x') = σ_i^2 R_{i,i}(x, x'). The formulation of an MGP as in (1) is quite general, as it directly extends the standard univariate GP model framework, to which it reduces when k = 1. The same model is also mentioned in Santner, Williams, and Notz (2003) as a general framework for modeling multivariate outputs. In practice, if every single output level of a computer code can be suitably modeled by a standard GP, then the collection of all its output levels may be handled by Model (1).
Nonseparable covariance functions are much more flexible than separable ones as they do not assume the same marginal covariance function for all outputs. There are mainly two ways to build valid nonseparable cross-covariance functions: coregionalization and kernel convolution.
The linear model of coregionalization (LMC) was proposed by Goulard and Voltz (1992). The key idea is to model the output processes ε(·) as linear combinations of a set of independent GPs δ(·):

ε(x) = A δ(x),

where commonly δ(·) is a vector of k independent GPs with zero mean and unit variance, and A is a k × k full rank matrix, though theoretically δ(·) may consist of any number of independent GPs. The between-outputs covariance of ε(·) is hence Σ = AA^T. There are several decomposition methods for A. Gelfand et al. (2004) employed the Cholesky decomposition for computational convenience; this method assumes a hierarchical dependence of the outputs. Higdon et al. (2008) used spectral decomposition to solve computer experiments with high-dimensional output, where the matrix A was obtained via singular value decomposition (SVD) of the standardized simulation output matrix.
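As a concrete illustration of the LMC construction, the sketch below uses a hypothetical 3 × 3 full-rank mixing matrix A and checks empirically that the between-outputs covariance of ε = Aδ is Σ = AA^T. White noise stands in for the independent unit-variance GPs, since only the between-outputs covariance, not the spatial correlation, is at issue here.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 200_000

# A hypothetical k x k full-rank mixing matrix (lower triangular here,
# as in a Cholesky-style parameterization of Sigma).
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.8, 0.0],
              [0.3, 0.2, 0.9]])

# delta: k independent zero-mean, unit-variance processes sampled at n points.
delta = rng.standard_normal((k, n))
eps = A @ delta                       # outputs as linear combinations

Sigma_empirical = np.cov(eps)         # estimated between-outputs covariance
Sigma_theoretical = A @ A.T           # the LMC identity Sigma = A A^T
```

With a large n, the empirical covariance of the mixed outputs matches AA^T to within sampling error, which is the whole content of the coregionalization identity.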
The nonseparable covariance functions by kernel convolution have been discussed in Ver Hoef et al. (1998), Higdon (2002), Majumdar and Gelfand (2007), and Melkumyan and Ramos (2011). In a similar vein, a more flexible cross-correlation structure was proposed by Fricker, Oakley, and Urban (2013). In this article, we adopt their correlation function for more flexibility, which is given by

R_{i,j}(x, x') = T_{i,j} ( 2^p |Φ_i|^{1/2} |Φ_j|^{1/2} / |Φ_i + Φ_j| )^{1/2} exp{ −2 (x − x')^T Φ_i (Φ_i + Φ_j)^{−1} Φ_j (x − x') },  (2)

where Φ_i = diag(φ_{i,1}, φ_{i,2}, ..., φ_{i,p}), and T_{i,j} ∈ [−1, 1] is the (ith, jth) element of a k × k matrix T that is positive definite with unit diagonal elements (PDUDE). T_{i,j} denotes the cross-correlation factor between the ith and jth levels, with T_{i,i} = 1. A nice feature of (2) is that the ith marginal process ε_i(x) has the most popular Gaussian correlation function, R_{i,i}(x, x') = exp{ −∑_{l=1}^p φ_{i,l} (x_l − x'_l)^2 }. Hence (2) can be viewed as a generalization of the Gaussian correlation function to an MGP framework. Parameters in the MGP model can be estimated by the maximum likelihood method in a similar fashion as in the standard univariate GP model, with the additional constraint that T must be a PDUDE. In the rest of this article, we focus our discussion on the MGP model with the correlation function defined in (2).
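The correlation function (2) is short to code. The sketch below follows the kernel-convolution form of Fricker, Oakley, and Urban (2013) as reconstructed above, and verifies the stated reduction to the Gaussian correlation when Φ_i = Φ_j and T_{i,j} = 1.

```python
import numpy as np

def cross_corr(x, xp, Phi_i, Phi_j, T_ij):
    """Nonseparable cross-correlation R_{i,j}(x, x') of Eq. (2): the
    cross-correlation factor T_ij scales a normalized Gaussian term
    built from the two diagonal roughness matrices Phi_i and Phi_j."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    p = x.size
    S = Phi_i + Phi_j
    # Normalizing constant: equals 1 when Phi_i == Phi_j, so the marginal
    # correlation (i == j, T_ii = 1) is exactly Gaussian.
    const = np.sqrt(2.0**p * np.sqrt(np.linalg.det(Phi_i) * np.linalg.det(Phi_j))
                    / np.linalg.det(S))
    d = x - xp
    quad = 2.0 * d @ Phi_i @ np.linalg.solve(S, Phi_j @ d)
    return T_ij * const * np.exp(-quad)
```

For Φ_i = Φ_j = Φ this gives exp{−(x − x')^T Φ (x − x')}, the usual Gaussian correlation of the marginal processes; for unequal roughness matrices, the normalizing constant is strictly less than one, damping the cross-correlation below T_{i,j}.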

THE PAIRWISE MODELING APPROACH
TECHNOMETRICS, NOVEMBER 2016, VOL. 58, NO. 4

While the aforementioned MGP model is flexible and theoretically appealing, model building can be prohibitively expensive in practice. In modern scientific and engineering applications, it is not rare for computer models to have a large number of output levels (e.g., Higdon et al. 2008; McFarland et al. 2008). For computer models with qualitative factors using the MGP framework, the number of output levels grows rapidly, since it is determined by the total number of level combinations of all the qualitative factors. For a moderately large problem, the number of parameters in the covariance function can easily reach hundreds or even thousands. In addition, more design points are usually required for multivariate outputs, resulting in a large-scale covariance matrix C. In building the meta-model, parameter estimation is conducted in a high-dimensional space, with inversion of a large-scale covariance matrix at each step. Hence, standard modeling approaches are plagued by exceedingly high computational load, numerical issues, and entrapment in local maxima of the high-dimensional parameter space, rendering them inapplicable.
To solve this issue, we propose an alternative modeling approach with excellent scalability, especially for high-dimensional outputs. The basic idea is to reparameterize the cross-correlation matrix T by a technique called hypersphere decomposition (HD; Rebonato and Jäckel 1999). The introduction of the HD has two major advantages. First, it guarantees T to be a PDUDE, an idea used in Zhou, Qian, and Zhou (2011). More importantly here, it permits an efficient model building scheme by sequential construction of T from marginal bivariate GP models, instead of simultaneously from the full MGP model. This method is motivated by the idea of Fieuws and Verbeke (2006) for jointly modeling multivariate longitudinal profiles, but without the drawback of producing a potentially invalid covariance structure.

The Sequential Construction of T Based on HD
We first introduce some properties of the HD, which are essential for our method. The HD provides a flexible way to build a PDUDE matrix. First, a Cholesky-type decomposition is applied to T,

T = AA^T,  (3)

where A is a k × k lower triangular matrix. The elements of A are then parameterized by a spherical coordinate system:

a_{1,1} = 1,
a_{i,1} = cos(ω_{i,1}),  i = 2, ..., k,
a_{i,j} = cos(ω_{i,j}) ∏_{l=1}^{j−1} sin(ω_{i,l}),  2 ≤ j < i ≤ k,
a_{i,i} = ∏_{l=1}^{i−1} sin(ω_{i,l}),  i = 2, ..., k,  (4)

where ω_{i,j} ∈ (0, π). Further, let Ω = {ω_{i,j}}_{i>j} be a k × k matrix whose elements are defined only in the strict lower triangular part. With Ω defined in the space (0, π)^{k(k−1)/2}, the resulting T is guaranteed to be a PDUDE. The HD establishes a one-to-one correspondence between Ω and T, retaining the number of free parameters, which is k(k − 1)/2. An important property of the HD is that it allows T to be constructed in a sequential manner based on the elements of Ω. First, we introduce some more notation. Define u_{i,j} = {ω_{i,1}, ..., ω_{i,j}} ∪ {ω_{j,1}, ..., ω_{j,j−1}}, for 1 ≤ j < i ≤ k. From the construction of T, it can be observed that T_{i,j} depends only on the parameters contained in u_{i,j}. For convenience, we denote this dependence by T_{i,j} = f_{i,j}(u_{i,j}), where f_{i,j} can be derived straightforwardly from (3) and (4).
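The HD construction in (3) and (4) is short to implement. The sketch below maps the angles in the strict lower triangle of a k × k array to T = AA^T, and checks both the PDUDE guarantee and the sequential property that T_{i,j} involves only angles in rows i and j up to column j.

```python
import numpy as np

def hd_to_T(omega):
    """Hypersphere decomposition: map angles omega[i, j] in (0, pi),
    j < i, to a matrix T = A A^T that is guaranteed to be positive
    definite with unit diagonal elements (PDUDE)."""
    k = omega.shape[0]
    A = np.zeros((k, k))
    A[0, 0] = 1.0
    for i in range(1, k):
        sin_prod = 1.0                      # running product of sines
        for j in range(i):
            A[i, j] = np.cos(omega[i, j]) * sin_prod
            sin_prod *= np.sin(omega[i, j])
        A[i, i] = sin_prod                  # each row of A has unit norm
    return A @ A.T
```

Each row of A has unit Euclidean norm, which forces the unit diagonal of T, and angles restricted to (0, π) keep the diagonal of A positive, hence A full rank and T positive definite. Since T_{i,j} = Σ_{l ≤ j} a_{i,l} a_{j,l}, it depends only on the angle set u_{i,j}, which is what permits the sequential pairwise construction.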

The Pairwise Modeling Approach
Assume the design X for the MGP model has n training points in total, with n_i points at the ith level, denoted by X_i, and let y_i denote the corresponding outputs. The full likelihood function of the MGP model is then

L(β, σ, Φ, T; y) = (2π)^{−n/2} |C|^{−1/2} exp{ −(1/2)(y − Fβ)^T C^{−1} (y − Fβ) },  (5)

where y stacks the outputs of all levels, F is the corresponding regression matrix, and the n × n covariance matrix C consists of blocks C_{i,j}, the covariance matrix between the ith and jth output levels. Denote the correlation matrix by R, whose (ith, jth) block is R_{i,j}. The idea of composite likelihood, or more generally pseudolikelihood (Besag 1975), is to approximate the full likelihood by multiplying the marginal or conditional likelihoods of some subsets of the data. Based on this idea, the composite likelihood version of (5) with marginal bivariate GP models is

L_c = ∏_{1 ≤ j < i ≤ k} L_{i,j},  (6)

where L_{i,j} is the likelihood of the bivariate GP model formed by the ith and jth output levels. It is clear that both (5) and (6) contain the same set of model parameters. To reduce the dimensionality, which is the major cause of difficulties in estimating the MGP meta-model, the k(k − 1)/2 pairwise submodels based on marginal bivariate GPs are built and estimated individually. The HD is introduced to reparameterize the T matrix so as to guarantee the validity of the overall covariance structure. Based on the sequential construction procedure of the HD described in Section 3.1, the pairwise submodels need to be built in the specified sequence.
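To make the relation between (5) and (6) concrete, the following sketch evaluates a multivariate normal log-likelihood and its composite (pairwise) counterpart from precomputed covariance blocks; assembling the blocks from the correlation function is omitted. For k = 2 there is a single pair, so the composite log-likelihood coincides with the full one.

```python
import numpy as np

def gauss_loglik(y, mean, C):
    """Log-density of a multivariate normal, via Cholesky for stability."""
    r = y - mean
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, r)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet + z @ z)

def composite_loglik(y_blocks, mean_blocks, C_blocks):
    """Composite (pairwise) log-likelihood: the sum over all pairs (i, j),
    i > j, of bivariate marginal log-likelihoods, replacing the full joint
    log-likelihood of (5) by the log of the product in (6).

    C_blocks is a full symmetric grid: C_blocks[i][j] is the n_i x n_j
    covariance block between output levels i and j."""
    k = len(y_blocks)
    total = 0.0
    for i in range(1, k):
        for j in range(i):
            y = np.concatenate([y_blocks[i], y_blocks[j]])
            m = np.concatenate([mean_blocks[i], mean_blocks[j]])
            C = np.block([[C_blocks[i][i], C_blocks[i][j]],
                          [C_blocks[j][i], C_blocks[j][j]]])
            total += gauss_loglik(y, m, C)
    return total
```

Each term inverts only an (n_i + n_j) × (n_i + n_j) matrix, which is the source of the computational savings over the full n × n matrix C in (5).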
The building of pairwise submodels is carried out in two steps. First, the (k − 1) pairwise submodels based on the 1st level and the ith levels (i = 2, ..., k) are built separately, and their parameters are estimated by maximizing the likelihood

L_{i,1}(σ_i, σ_1, β_i, β_1, Φ_i, Φ_1, ω_{i,1}; y_i, y_1),  (7)

where T_{i,1} = f_{i,1}(ω_{i,1}) = cos(ω_{i,1}). In the second step, subsequent submodels in the specified sequence are built one by one, conditional on those already obtained. Assume we are at the step of building the pairwise submodel for the ith and jth output levels; the cross-correlation parameter is T_{i,j} = f_{i,j}(u_{i,j}). Write v_{i,j} = u_{i,j} \ {ω_{i,j}}. At this step, all the ω's contained in v_{i,j} have already been obtained in previous submodels; denote them by v̂_{i,j}. Therefore, the only free parameter in determining T_{i,j} is ω_{i,j}, which can be estimated by maximizing the pairwise likelihood

L_{i,j}(σ_i, σ_j, β_i, β_j, Φ_i, Φ_j, ω_{i,j} | v̂_{i,j}; y_i, y_j),  (8)

conditional on v̂_{i,j}. For any pairwise marginal likelihood defined in (7) or (8), the maximum likelihood estimation can be carried out following the procedure given in Appendix A of the online supplementary material. The above procedure is repeated until all pairwise submodels are built. Denote the model parameters estimated from the (ith, jth) pair by {σ̂_i^{(i,j)}, σ̂_j^{(i,j)}, β̂_i^{(i,j)}, β̂_j^{(i,j)}, Φ̂_i^{(i,j)}, Φ̂_j^{(i,j)}, ω̂_{i,j}}. Parameters with multiple estimates are averaged at the end to produce a single estimate. The procedure of the proposed approach is summarized below.
1. Build the (k − 1) pairwise submodels between the 1st output level and each of the other levels, estimating their parameters by maximizing (7).
2. Assume all submodels involving the first j − 1 (2 ≤ j ≤ k − 1) output levels have been built. Then, the submodels between the jth level and the ith levels (i = j + 1, ..., k) can be built one by one by maximizing (8), conditional on the parameter estimates already obtained.
3. Repeat the preceding step, increasing j by 1 each time, until j = k − 1.
4. Take the following averages of the multiple parameter estimates, for i = 1, 2, ..., k:

σ̂_i = (1/(k − 1)) ∑_{j ≠ i} σ̂_i^{(i,j)},  β̂_i = (1/(k − 1)) ∑_{j ≠ i} β̂_i^{(i,j)},  Φ̂_i = (1/(k − 1)) ∑_{j ≠ i} Φ̂_i^{(i,j)}.

Finally, calculate T̂ by T̂_{i,j} = f_{i,j}(û_{i,j}). Note that T̂_{i,j} and the averaged estimates σ̂_i, σ̂_j, β̂_i, β̂_j, Φ̂_i, Φ̂_j no longer form an optimal parameter set for the likelihood function L_{i,j}. To alleviate this, we can reestimate each T_{i,j} conditional on the values of σ̂_i, σ̂_j, β̂_i, β̂_j, Φ̂_i, Φ̂_j, following the same sequential pairwise estimation procedure specified above. For example, we reestimate ω_{i,1} (hence T_{i,1}) by maximizing the conditional likelihood L_{i,1}(ω_{i,1} | σ̂_i, σ̂_1, β̂_i, β̂_1, Φ̂_i, Φ̂_1; y_i, y_1) as in (7). Our experience with extensive numerical simulations indicates that adding the reestimation step for T consistently produces better results than omitting it.
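Since each output level i appears in k − 1 of the pairwise submodels, its level-specific parameters receive k − 1 estimates, one per pair. A minimal sketch of the final averaging, for a scalar per-level parameter such as σ_i:

```python
import numpy as np

def average_estimates(pair_estimates, k):
    """Average the per-level parameter estimates over all pairs.

    pair_estimates[(i, j)] = (theta_i_hat, theta_j_hat) for i > j: the two
    estimates of a scalar level parameter produced by the (i, j) submodel.
    Returns the k averaged estimates, one per output level."""
    sums = np.zeros(k)
    counts = np.zeros(k)
    for (i, j), (t_i, t_j) in pair_estimates.items():
        sums[i] += t_i; counts[i] += 1
        sums[j] += t_j; counts[j] += 1
    return sums / counts   # each level is averaged over its k - 1 pairs
```

For vector-valued parameters (β_i, or the diagonal of Φ_i) the same average is taken componentwise.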
In Fieuws and Verbeke (2006), a similar pairwise modeling approach was implemented in a different context, without the HD reparameterization and the reestimation step. We shall refer to their method as the FV method in the remainder of this article. Since all the pairwise submodels are built independently in FV, the estimated matrix T, denoted by T̂_FV = {T̂^FV_{i,j}}, may not be positive definite, leading to an invalid covariance function. As can be seen later in the numerical examples, the probability of producing an invalid covariance function can be quite high in some cases.
In our proposed approach, some submodels are built conditional on some upstream submodels in the sequence specified. This leads to the concern of potential error propagation, that is, inaccurately estimated upstream parameters may adversely affect the estimate of some downstream parameters. This mostly affects the parameters in the cross-correlation matrix T. We provide some insights into this issue by introducing two propositions in Appendix B of the online supplementary material, each followed by a short discussion. This issue will also be examined in the numerical examples.
In most GP-based models, a variance estimator of the model prediction at x_0 can be provided with plug-in parameter estimates. In our case, the formula is complicated by the pairwise estimation approach, particularly the estimation of β. Hence, we suggest estimating the prediction variance with the bootstrap method proposed by Den Hertog et al. (2006). Though originally proposed for the univariate GP model, the algorithm can be easily extended to our case.
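The bootstrap variance estimate can be organized as the outer loop below. Here `sample_conditional` and `refit_predict` are placeholders for the model-specific steps (drawing a synthetic training response from the fitted model, then refitting and predicting at x_0), so this is only the skeleton of the Den Hertog et al. (2006) scheme, not their full algorithm.

```python
import numpy as np

def bootstrap_pred_variance(sample_conditional, refit_predict, B=100):
    """Parametric-bootstrap variance of a plug-in predictor: draw B
    synthetic training responses from the fitted model, refit on each,
    and take the sample variance of the resulting predictions."""
    preds = np.array([refit_predict(sample_conditional()) for _ in range(B)])
    return preds.var(ddof=1)
```

As a smoke test, if the "model" is an i.i.d. standard normal sample of size 50 and the "predictor" is its mean, the bootstrap variance should be close to 1/50.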

NUMERICAL EXAMPLES
This section gives some numerical examples to demonstrate the performance of the proposed method. The proposed method in Section 3.2.1 is denoted as PW. Four other reference methods are introduced for comparison: (1) the individual kriging method (IK), where a standard univariate GP model is applied to each output; (2) the full multivariate approach (MV), based on maximizing the full likelihood in (5), with T parameterized by the HD to ensure its validity; (3) the HQQV method by Han et al. (2009) introduced in Section 1; and (4) the LMC method with spectral decomposition for the between-outputs covariance in Higdon et al. (2008) (with maximum likelihood estimates). Note the autoregressive model in Kennedy and O'Hagan (2000) is a special case of the LMC model.
In all models, a constant term is employed for the regression function, which is standard practice in GP models (Kleijnen 2009). The HQQV method draws posterior samples by MCMC, so its running time heavily depends on the chain length. In general, there is no standard rule for choosing the chain length, and it varies with applications. To achieve a fair comparison on the running time of HQQV, a chain length of 20,000 is used. The reason is that, in the examples, a chain length of 10,000 yields discernibly worse prediction results, while one of 30,000 produces no substantial improvement but requires a much longer running time.

Example 1: A Simulated MGP
In this example, data are obtained from repeatedly simulating an MGP with specified parameters. The advantage of simulating from an MGP is that the underlying true parameters are known, allowing us to assess not only the accuracy of the model prediction but also the accuracy of parameter estimation.
Example 1.1. The MGP is defined in the one-dimensional space [0, 8] and has four levels of outputs. The true values of σ, β, Φ, and T are listed in the second column of Table 1. A typical random realization from this MGP is illustrated in Figure 1.
The training points are generated by the sliced Latin hypercube design (SLHD) in Qian (2012). Parameter estimation results are summarized in Table 1, and model prediction accuracy is given in Figure 2(a). It can be seen that the parameters can be properly estimated by our method, and its prediction performance is close to that of the full multivariate approach.
Example 1.2. Here, we rerun Example 1.1 using smaller cross-correlation parameters to investigate the method's performance when outputs are less correlated. All other settings remain the same as in Example 1.1. The true values of the parameters and the simulation results are also given in Table 1, and the predictive accuracy is shown in Figure 2(b). Similar conclusions to those in Example 1.1 can be made, except that the prediction accuracy of MV and PW is closer to that of IK in this case, which is expected, as an integrated modeling approach will not gain much advantage if the cross-correlations are not strong enough.
It is also noted that the proposed method depends on the ordering of output levels, which is often arbitrarily set. To give some numerical insight into the sensitivity to the ordering, the above example has been repeated 24 times, each with a distinct ordering. The prediction results based on all orderings are given in Figure 3. Clearly, the ordering of output levels does not play an important role in this example.

Example 1.3. This example modifies the T matrix of Example 1.1; the true parameter values and simulation results are given in Table 2. Compared with that in Example 1.1, the modified T matrix is closer to singularity, which leads to 34.33% of the T̂_FV matrices obtained from the FV method being nonpositive definite. By Proposition 1 in the supplementary material, potential error propagation may exist in the PW method when T̂_FV is nonpositive definite. The results show that this phenomenon has negligible impact in this example. Comparing Table 2 with Table 1, it is also interesting to see that the RMSEs of the T_{i,j}'s are significantly smaller for the modified T matrix. This agrees with the results from Proposition 2 in the supplementary material.
Example 1.4. Now we compare the methods using a further modified MGP defined in the two-dimensional input space X ∈ [0, 3]^2 with k = 15 levels of outputs. This MGP contains 135 parameters in its correlation function, in addition to the process variances and regression parameters. We set β_i = 0, σ_i = 2 + U_{i,3}, and Φ_i = diag(2 + U_{i,1}, 2 + U_{i,2}) for i = 1, 2, ..., k, and T_{i,j} = 0.85 + 0.05U^T_{i,j} for 1 ≤ j < i ≤ k, where the U_{i,1}, U_{i,2}, U_{i,3}, and U^T_{i,j}'s are obtained by independently sampling from the uniform distribution on [0, 1] in each iteration. In some iterations the resulting T may fail to be positive definite due to the random generation mechanism; it is then simply regenerated until a valid one is found. The training data points are generated by an SLHD with 100 points at each level, and the testing data contain 400 grid points {0, 3/19, 6/19, ..., 3}^2. Due to the long model building time of the MV method, only 100 iterations have been conducted.
The boxplots of predictive RMSEs as well as model building time are shown in Figure 4. It is clear that the PW method reaches a similar level of prediction accuracy as the MV method, while significantly reducing the model building time. Note that it takes about 10 hours to build each MV model for this example, which has only two-dimensional input and a moderate number of output levels. This shows the excellent scalability of the PW method, as well as its capability in delivering similar prediction performance as the full multivariate approach.

Example 2: Deterministic Quadratic Curves
In this section, we adopt the quadratic function example from Han et al. (2009) with some modifications. The multivariate output model is simulated by a set of similar quadratic functions. Following Han et al. (2009), denote a set of k quadratic curves by y_j(x) = H_{1,j} x^2 + H_{2,j} x + H_{3,j}, for j = 1, 2, ..., k, where x ∈ [0, 1]. The coefficient H_{i,j} is the (ith, jth) element of a 3 × k matrix H_k. Three levels of output are used in the original example, but here we extend them to 20 and 50 levels as high-dimensional examples. For the jth output level, the jth column of H_k (k = 20 or 50) is defined by H_{1,j} = a, H_{2,j} = −2ab, and H_{3,j} = c, where a = −5 − 3U_{j,1}, b = (2 + U_{j,2})/5, c = −1 + 2U_{j,3}, and U_{j,1}, U_{j,2}, U_{j,3} are independently generated from the uniform distribution on [0, 1]. The training points are from an SLHD with five points at each level, and the testing points are 15 evenly spaced points {0, 1/14, 2/14, ..., 1} at each level. The two experiments, with 20 and 50 output levels, respectively, are both repeated for 100 iterations; each time a new SLHD and a new H_k matrix are randomly generated.
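The data-generating step of this example is easy to reproduce. The sketch below builds a random H_k following the construction above and evaluates the k curves at the 15 test points (the training SLHD itself is omitted).

```python
import numpy as np

def make_quadratics(k, rng):
    """Coefficient matrix H_k for the k quadratic curves
    y_j(x) = H[0, j] x^2 + H[1, j] x + H[2, j] on [0, 1], following
    H_1j = a, H_2j = -2ab, H_3j = c as described above."""
    U = rng.uniform(size=(k, 3))
    a = -5.0 - 3.0 * U[:, 0]          # a in [-8, -5]: concave curves
    b = (2.0 + U[:, 1]) / 5.0         # vertex location b in [0.4, 0.6]
    c = -1.0 + 2.0 * U[:, 2]
    return np.vstack([a, -2.0 * a * b, c])

rng = np.random.default_rng(2016)
H = make_quadratics(20, rng)
x_test = np.linspace(0.0, 1.0, 15)    # the 15 evenly spaced test points
Y = np.outer(x_test**2, H[0]) + np.outer(x_test, H[1]) + H[2]   # 15 x 20
```

Each curve is a downward parabola with vertex at x = b ∈ [0.4, 0.6], so the curves differ in scale and location but share similar shapes, which is what makes joint modeling of the levels profitable.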
The boxplots of predictive RMSE and model fitting time for both cases are shown in Figure 5. The example with 50 output levels is very computationally demanding: a single model building process for the LMC method costs more than one day on average, followed by the MV and HQQV methods at about 10 hours each, while only about half an hour is required by the PW method. The time saving is not as dramatic in the 20-level example, but the PW method still uses the least amount of modeling time.
In terms of prediction accuracy, the PW and MV methods are the best, while HQQV and LMC are similar and relatively inferior. But all four integrated modeling methods outperform IK significantly, once again demonstrating the benefit of jointly modeling correlated outputs. It is also observed that the PW method has produced slightly better prediction results than the MV method in the 50-level case. We will see this phenomenon again in the borehole example in Section 4.3. This should not be too surprising. Searching for optimal model parameters in a very high-dimensional space easily leads to suboptimal results. Maximizing the complex full likelihood function in a search space of 1275 dimensions is certainly a daunting task for any search algorithm. The process is also plagued by numerical issues in inverting the 250 × 250 covariance matrix at each step of searching.
It is also observed that the appealing features of the PW method may gradually disappear as the number of output levels decreases. The MV method may be well-qualified to handle problems at smaller scales. Therefore, we only recommend the PW method for moderate to large-scale problems which cannot be handled well by other methods.

Example 3: The Borehole Example
The borehole function is a popular testing example in the computer experiments literature. It models the flow rate of water through a borehole. Here, we use a slightly modified version of the borehole function; detailed descriptions and ranges for each variable can be found in Morris, Mitchell, and Ylvisaker (1993). To produce multiple output levels, three variables are chosen as qualitative factors, each with three levels (as shown in Table 3), and the remaining variables are the quantitative factors. The computer code hence has 27 level combinations of the qualitative factors in total, each corresponding to an output level of a multivariate output system. We consider three cases with different numbers of training points: 8, 15, and 20 sample points at each level. The training points are generated by an SLHD. Fifty testing points at each level are generated from a Latin hypercube design (LHD). The test is repeated for 100 iterations, each time with newly generated sets of training and testing points.
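For orientation, the standard (unmodified) borehole function is sketched below; the modified version used in this article is not reproduced in this excerpt, so this is shown only as background, with the usual variable roles from Morris, Mitchell, and Ylvisaker (1993).

```python
import numpy as np

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Standard borehole function: water flow rate (m^3/yr) through a
    borehole. rw: borehole radius, r: radius of influence, Tu/Tl: upper/
    lower aquifer transmissivities, Hu/Hl: potentiometric heads,
    L: borehole length, Kw: hydraulic conductivity of the borehole."""
    log_rr = np.log(r / rw)
    return (2.0 * np.pi * Tu * (Hu - Hl)) / (
        log_rr * (1.0 + 2.0 * L * Tu / (log_rr * rw**2 * Kw) + Tu / Tl))

# A mid-range input (each variable at the center of its usual range).
flow = borehole(rw=0.10, r=25050.0, Tu=89335.0, Hu=1050.0,
                Tl=89.55, Hl=760.0, L=1400.0, Kw=10950.0)
```

The flow rate is positive over the usual input ranges and increases with the head difference Hu − Hl, which makes the function a convenient smooth test bed for emulators.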
The results are given in Figure 6. It can be seen that the proposed pairwise method gives consistently better and more stable prediction results than the other methods, while using the least amount of time. The reason is that the full MGP models fail to explore the likelihood properly in the high-dimensional parameter space and hence give unsatisfactory parameter estimates. It is also seen that, as the number of training points reduces to eight, the PW method does not gain much advantage over MV in terms of either prediction accuracy or running time. This suggests that PW may not be preferable if the sample size is small, when very limited data points are available for building each bivariate model.
The HQQV method is much more time-consuming than the other methods as it uses MCMC to draw the posterior samples. Even so, it yields unsatisfactory results in this example. In fact, the underlying mechanism of borrowing information across levels in HQQV differs from that of the other methods, which work through the cross-covariance of an MGP. HQQV assumes that the univariate GPs are independent of each other, but with similar parameters. By capturing the output similarity at the parameter level of the GPs, HQQV is able to handle data that look different at all levels but share similar smoothness and dynamic behaviors. Methods based on an MGP framework, such as ours, are less likely to provide substantial improvements over IK if the univariate GPs are truly independent. It is hard to conclude which mechanism is better in general, as this depends on the actual application, but the HQQV framework apparently is not suitable for this example.

Example 4: A Heat Sink Computer Experiment
In this section, the proposed method is applied to a computer experiment for a heat sink design. IC chips are widely used in computers, cell phones, and other modern electronic products. Thermal management of IC chips is essential to ensuring their normal functioning without overheating. In some applications, heat sinks are required for cooling down the chips, for example, CPUs and graphics processors in desktop computers. In this example, a simulation model based on finite element analysis is used to demonstrate the design of a disk-stack heat sink. The model is built and simulated in COMSOL Multiphysics; details of the model can be found in COMSOL (2015). The goal of this experiment is to investigate the cooling effects of various designs of the heat sink. Figure 7 shows the schematic layout of the circuit board. Several peripheral chips surround a main chip on the board, and a disk-stack heat sink is mounted on the main chip. The heat sink has several disks supported by a central hollow column, mounted on a metallic base attached to the surface of the main chip. The power of the main chip is 20 W, each of the small chips is 1 W, and the two elongated chips are 2 W each. A convective heat transfer coefficient of 20 W/m²·K is assumed, corresponding to an airflow of about 1 m/s. The air temperature is set to 25 °C. Without the heat sink, simulation shows that the main chip can reach a temperature of about 89 °C, too high for most commercial IC chips to function properly (desktop computer CPUs typically work in the range of 40 °C to 70 °C).
For demonstration, seven design factors are considered in the experiments, three of which are qualitative and four continuous. The qualitative factors are (1) the material of the heat sink, with four types: Aluminum, Copper, Aluminum Alloy 6061, and Aluminum Alloy 6063; (2) the number of disks, with three levels: 5, 10, and 15; and (3) whether a copper bottom is attached to the PCB. The continuous factors, which control the size and shape of the heat sink, are (1) the inner radius of the disks (x_1: 5 mm-20 mm); (2) the outer radius of the disks (x_2: 25 mm-40 mm); (3) the space between two adjacent disks (x_3: 1 mm-4 mm); and (4) the thickness of the metallic base (x_4: 1 mm-4 mm). Therefore, this simulation model has 24 level combinations of the qualitative factors, which can be considered as a 24-level multivariate output (Zhou, Qian, and Zhou 2011) defined in a four-dimensional input space. The sampling points are generated by an SLHD with 30 points at each level. The response is the working temperature at the center of the main chip. Parameters are estimated based on the entire experiment data using the IK, MV, and PW methods. The model fitting times for the MV and PW methods are 458 and 21 minutes, respectively.
Cross-validation based on repeated random subsampling is used to evaluate the RMSE of the predictive accuracy. In each iteration, one tenth of the dataset is randomly selected at all output levels for validation, while the rest is used for building the meta-models. The procedure is repeated for 100 iterations and the results are given in Figure 8. It should be noted that, since fitting an MV meta-model is very time-consuming, its model parameters in the covariance function are estimated only once, based on the entire dataset; these parameters are then treated as known in the 100 iterations. It is evident from Figure 8 that the PW method performs similarly to the MV method, with a small fraction of the modeling time. Both multivariate approaches are much better than the IK method.
A leave-one-out cross-validation procedure is also carried out for this example. Since parameter estimates are essentially the same whether based on all 720 sample points or on the 719 training points in each cross-validation fold, we estimate the model parameters from all sample points and do not reestimate them in each fold. The RMSE obtained, based on deviations between the predicted values and the corresponding true values at these 720 points, is 0.08. Compared with the range of the response variable (40-70), the prediction accuracy is very good. To further investigate the uncertainty of the predicted value in each fold, we employ the bootstrap variance estimation method of Den Hertog et al. (2006), as previously discussed. A bootstrap sample size of 6000 is used to estimate the variation and provide a 95% confidence interval for each predicted value in one fold. Of these 720 intervals centered at the predicted values, 94.2% contain the true values. Since it is not possible to show all 720 intervals here, we randomly select one point at each of the 24 level combinations and plot the prediction intervals against their respective true values in Figure 9.

The meta-model can then be used to find the optimal design of the heat sink. The lowest possible temperatures reached by adding the heat sink are listed in Table 4. The best and the worst designs have a difference of about 8°C, and the impact of adding a copper bottom layer on the temperature of the main chip is negligible. The results suggest that the optimal heat sink should use copper as its material with 15 disks, and the lowest temperature is achieved with (x1, x2, x3, x4) = (18.09, 39.58, 1.06, 3.31) mm. A verification simulation conducted with these factor values yields 42.66°C, very close to the meta-model prediction in Table 4.
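The interval coverage computation can be illustrated with a minimal sketch. The synthetic predictions below merely mimic the magnitudes reported above (720 points, response range 40-70, RMSE about 0.08) and are not the actual model output:

```python
import numpy as np

def coverage_95(preds, boot_sd, truths):
    """Fraction of 95% normal intervals, centered at the predicted values
    with bootstrap standard deviations, that contain the true values."""
    lo = preds - 1.96 * boot_sd
    hi = preds + 1.96 * boot_sd
    return float(np.mean((truths >= lo) & (truths <= hi)))

# Synthetic stand-ins mimicking the magnitudes reported in the example;
# in practice boot_sd comes from the 6000 bootstrap replicates per fold.
rng = np.random.default_rng(1)
truths = rng.uniform(40, 70, size=720)
preds = truths + rng.normal(0.0, 0.08, size=720)
cov = coverage_95(preds, np.full(720, 0.08), truths)
print(cov)
```

With well-calibrated bootstrap standard deviations, the observed coverage should be near the nominal 95%, as in the 94.2% reported above.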

SUMMARY AND DISCUSSION
MGP models with nonseparable covariance functions offer a general solution to building meta-models for computer models with multivariate outputs, computer models with both quantitative and qualitative input factors, and ensembles of computer models. However, conventional modeling approaches face significant challenges due to the high-dimensional parameter space and the large-scale covariance matrix. In this article, we have proposed a pairwise modeling approach to solve this problem. The proposed method takes advantage of the sequential construction property of the HD: the high-dimensional MGP model is broken down into a series of bivariate GP models, each built with a much smaller number of parameters and a small-scale covariance matrix. The method achieves excellent scalability, particularly for high-dimensional output. Several numerical examples have demonstrated that the proposed method can deliver prediction performance similar to the full multivariate approach with only a small fraction of its computational cost. In cases where the output dimension is high, the proposed method may even outperform the full multivariate approach.
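The decomposition into bivariate subproblems can be sketched as follows; `fit_bivariate` is a placeholder for a bivariate GP likelihood maximizer (the actual estimator is the one developed in this article), and the toy stand-in below is only there to make the sketch runnable:

```python
import numpy as np

def pairwise_fit(data_by_level, fit_bivariate):
    """Sketch of the pairwise scheme: instead of one q-level MGP fit,
    estimate a sequence of bivariate models on consecutive output levels
    and collect their parameter estimates."""
    q = len(data_by_level)
    return [fit_bivariate(data_by_level[j], data_by_level[j + 1])
            for j in range(q - 1)]  # q - 1 small problems, not one big one

# Toy stand-in: "fitting" a pair just returns the pooled sample mean;
# a real implementation would maximize a bivariate GP likelihood here.
rng = np.random.default_rng(0)
levels = [rng.normal(j, 1.0, size=30) for j in range(5)]
est = pairwise_fit(levels, lambda a, b: float(np.mean(np.concatenate([a, b]))))
print(len(est))  # 4 bivariate fits for 5 output levels
```

Each subproblem involves only two output levels, so its covariance matrix and parameter count stay small regardless of the total number of levels, which is the source of the method's scalability.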
A reasonable variation of our method is to use the model parameters estimated by the pairwise approach as initial values for the full likelihood approach, and to further improve them by maximizing the full likelihood function. In our numerical examples, we have tried this technique but found no clear differences, and the modeling time of MV was not substantially reduced. Improvements may still be obtained in some applications by carefully tuning the optimization algorithm in MV, so users can consider this approach if the additional model building time can be tolerated. Another concern is the ordering of the output levels, which is often chosen arbitrarily. Although results from the proposed method do depend on the ordering, our extensive experience with the numerical examples suggests that model performance is in general not very sensitive to this choice, much like what is shown in Figure 3.
Finally, it should be pointed out that we do not advocate the use of our method in all scenarios, and users should apply it at their discretion. When the sample size and/or the number of output levels is small, the PW method can be inferior to the MV method due to loss of efficiency. The good news is that such small-scale problems can generally be handled well by MV already. The PW method is most useful when the scale of the problem is beyond the capability of a full likelihood approach.

SUPPLEMENTARY MATERIALS
Online supplementary materials for this article include appendices, technical proofs, and the MATLAB code used for the examples.