A preposterior analysis to predict identifiability in the experimental calibration of computer models

ABSTRACT When using physical experimental data to adjust, or calibrate, computer simulation models, two general sources of uncertainty that must be accounted for are calibration parameter uncertainty and model discrepancy. This is complicated by the well-known fact that systems to be calibrated are often subject to identifiability problems, in the sense that it is difficult to precisely estimate the parameters and to distinguish between the effects of parameter uncertainty and model discrepancy. We develop a form of preposterior analysis that can be used, prior to conducting physical experiments but after conducting the computer simulations, to predict the degree of identifiability that will result after conducting the physical experiments for a given experimental design. Specifically, we calculate the preposterior covariance matrix of the calibration parameters and demonstrate that, in the examples that we consider, it provides a reasonable prediction of the actual posterior covariance that is calculated after the experimental data are collected. Consequently, the preposterior covariance can be used as a criterion for designing physical experiments to help achieve better identifiability in calibration problems.


Introduction
In many engineering and scientific applications, computer simulations have become an important tool for design, optimization, or simply to better understand physical phenomena (Hoffman et al., 2003;Higdon et al., 2004;Malhotra et al., 2012). However, computer models never perfectly represent reality. Two general sources of uncertainty that account for differences between the computer model and physical reality are parameter uncertainty and model discrepancy (Kennedy and O'Hagan, 2001). The former results from imperfect knowledge of underlying physical parameters, called calibration parameters (e.g., material properties, friction coefficients, etc.), whereas the latter results from approximations, missing physics, and other inaccuracies of the computer simulation and is represented by a discrepancy function (see Equation (1), introduced in Section 2). Socalled calibration methods for learning these uncertainties via combining (usually quite abundant) simulation data with (usually more limited) physical experimental data have been developed (Kennedy and O'Hagan, 2001;Higdon et al., 2004;Reese et al., 2004;Bayarri et al., 2007;Higdon et al., 2008). It should be noted that if an uncertain parameter is a distributional parameter of a random input variable, one might be able to reduce its uncertainty by directly observing a sample of the input variable realizations, as in Song and Nelson (2015). We treat the very different situation (calibration) in which the uncertain parameters are fixed, deterministic variables that cannot be observed directly, and all inference on them must be made indirectly CONTACT Daniel W. Apley apley@northwestern.edu Color versions of one or more of the figures in this article can be found online at www.tandfonline.com/uiie.
Supplemental data for this article can be accessed on the publisher's website at www.tandfonline.com/uiie via observations of the simulated and physical experimental response variable. When there is no discrepancy function, calibration is typically a straightforward, statistically identifiable problem (Kumar, 2008;Drignei, 2009;Akkaram et al., 2010;Huan and Marzouk, 2013). However, it is well known (Loeppky et al., 2006;Han et al., 2009;Arendt et al., 2012aArendt et al., , 2012b) that when a discrepancy function is present the estimation problem is often poorly identifiable, in the sense that it is difficult to distinguish the effects of calibration parameter uncertainty from the discrepancy function or to estimate these quantities individually. Much of the prior calibration work that has considered a discrepancy function, including work on experimental design for calibrating computer models (Kennedy and O'Hagan, 2001;Higdon et al., 2004;Bayarri et al., 2007;Ranjan et al., 2011;Williams et al., 2011), focused on the more easily attainable objective of good response prediction with the calibrated computer model, even if one is unable to accurately identify the calibration parameters and distinguish their effects from the discrepancy function. Higdon et al. (2004), Loeppky et al. (2006), and Han et al. (2009) distinguished between two different types of parameters-tuning parameters and calibration parametersthat can serve as input variables in a computer model. Tuning parameters have no meaning in the physical experiment, whereas calibration parameters do have concrete physical meaning but are unknown in reality. Examples of tuning parameters are mesh density in a finite element simulation or some constant in an empirically postulated material flow law (e.g., Maheshwari Copyright ©  "IIE" et al. (2010)). Han et al. (2009) and the references therein provide a number of examples of calibration parameters that have concrete physical meaning, such as the friction between bone and prosthesis in a prosthestic knee simulation. They discuss the importance of identifying the true values of the calibration parameters and, in light of the identifiability challenges, the need for further research on improving identifiability.
In this article we consider the situation in which there are unknown calibration parameters that have physical meaning, and it is desired to estimate their true values and distinguish their effects from the effects of the discrepancy function. We take the viewpoint of Han et al. (2009) and Arendt et al. (2012b) that good calibration identifiability may be important for a number of reasons. First, learning the calibration parameters may itself be the primary goal of the calibration; e.g., for scientific discovery purposes when the parameters cannot be directly measured. Second, learning the model discrepancy function may provide insight into the deficiencies of the computer model for improving future generations of the simulation code. Third, if the calibration parameters and discrepancy function cannot be individually identified, they may still allow reasonably accurate correction of the computer model in regions of the input space near where physical experimental runs have been conducted; however, better knowledge of the calibration parameters would likely allow a less myopic correction to the computer model that provides better prediction over a broader range of input settings.
As in much of the prior work that assessed calibration identifiability, we quantify identifiability via the posterior covariance matrix of the set of calibration parameters (posterior to observing the physical experimental data, in addition to the simulation data). One might also consider using the posterior covariance function of the discrepancy function in the measure of identifiability. However, this is infinite dimensional and cumbersome. Moreover, Arendt et al. (2012b) demonstrated that the posterior covariance of the discrepancy function is closely related to the posterior covariance of the calibration parameters, in the sense that precisely estimated calibration parameters generally imply a precisely estimated discrepancy function (which is somewhat obvious from Equation (1) in Section 2). Consequently, we work with the more manageable posterior covariance matrix of the calibration parameters. Here, we are using the term "identifiability" rather informally to refer to whether one can expect to achieve reasonable and useful estimation of the calibration parameters with typical finite sample sizes (which is sometimes possible), as opposed to whether estimators are consistent and guaranteed to asymptotically converge to the true values when sample sizes approach infinity in some sense (which is perhaps never possible under realistic assumptions). In Section 5.4, we discuss why the latter may not be particularly meaningful for most computer model calibration problems.
It should be noted that one view within the calibration community is that identifiability of the calibration parameters in the Kennedy and O'Hagan model is impossible, or nearly impossible. We argue in Section 5.4 that good identifiability, although sometimes impossible and usually difficult, is sometimes achievable. Moreover, after collecting the simulation and experimental data and conducting the Kennedy and O'Hagan analysis, if identifiability is poor, the posterior standard deviations produced in the analysis will indicate that this is the case. The main issue that we address in this article is how to assess the level of identifiability prior to conducting the physical experiment.
Others have also demonstrated that identifiability is sometimes possible and depends on the specifics of the example (e.g., Higdon et al. (2004) and Han et al. (2009)). Arendt et al. (2012b) showed that the degree of identifiability strongly depends on the nature of the computer model response as a function of the input variables and the calibration parameters, as well as on the prior assumptions regarding the discrepancy function (e.g., the smoothness of the discrepancy function). Computer simulations are inherently multi-response, as there are many intermediate and final response variables at many different spatial and temporal locations that are automatically calculated during the course of the simulation. Arendt et al. (2012a) showed that for systems with poor identifiability based on a single measured physical experimental response, identifiability may be improved by measuring multiple physical responses. In all situations, identifiability obviously depends on the design of the physical experiment; i.e., the number of experimental runs and the input settings for each run.
The same can be said when estimating the parameters of any parametric model based on physical experimental data, and standard criteria for designing such physical experiments are often based on measures related to parameter identifiability. However, calibration of computer simulation models involves a fundamental distinction; prior to designing the physical experiment, one can learn via simulation a great deal about the nature of the functional dependence of the response(s) on the input variables and on the calibration parameters; this knowledge can be exploited when designing the physical experiment to provide better identifiability.
In order to accomplish this, one needs to predict or approximate the degree of identifiability prior to conducting the physical experiments but after conducting the computer simulations. The primary objective of this article is to propose and investigate the use of the preposterior covariance matrix (Berger, 1985;Carlin and Louis, 2000) of the calibration parameters for this purpose and to demonstrate that it provides a reasonable prediction of the actual posterior covariance, at least for the examples that we consider. Consequently, the preposterior covariance matrix can serve as a reasonable criterion for designing physical experiments in order to achieve good identifiability when calibrating computer simulation models.
We note that a preposterior analysis cannot improve one's knowledge of the calibration parameters, because it is conducted prior to observing any physical experimental data. However, it can provide a reasonable quantification of the uncertainty of the calibration parameters that one expects will result after observing the experimental data. This is analogous to what occurs in standard optimal design of experiments (Wu and Hamada, 2000;Montgomery, 2005) for fitting a response surface via linear regression. Given the design matrix and the standard deviation of the random observation error, one can calculate the covariance matrix of the estimated parameters prior to conducting the experiment and use this as the experimental design criterion. Of course, one cannot estimate the parameters until the experimental data are collected.
The format of the remainder of the article is as follows. In Section 2 we provide a brief background on the standard formulation of the computer model calibration problem and how computer simulation and physical experimental data are combined to implement the calibration. In Section 3 we present our approach to calculate the preposterior covariance matrix of the calibration parameters. In Section 4 we discuss a modification of the preposterior analysis that provides insight into the behavior of the preposterior covariance as a means of predicting the posterior covariance. In Section 5, we use a number of examples to illustrate the effectiveness of the preposterior analysis in predicting identifiability prior to conducting physical experiments and discuss the issue of identifiability of the calibration parameters. Section 6 concludes the paper.

Background on combining computer simulation and physical experimental data for computer model calibration
We use the standard computer model calibration formulation introduced in Kennedy and O'Hagan (2001) and used in many subsequent works (Higdon et al., 2004;Loeppky et al., 2006;Williams et al., 2006;Bayarri et al., 2007;Han et al., 2009). For ease of exposition, we consider only a single response variable. The standard formulation is (Kennedy and O'Hagan, 2001) where y e (x) denotes the physical experimental response, y m (x,θ) denotes the computer model response, x denotes a d × 1 vector of controllable (or design) variables that can be controlled in the physical experiments, and θ denotes an r × 1 vector of calibration parameters. The calibration parameters are unknown constants in reality, but in the simulations their values can be specified as inputs in the same way the values of x are specified. Consequently, if we let θ * denote the vector of true calibration parameters, then Equation (1) expresses y e (x) as the summation of y m (x,θ * ), a discrepancy function δ(x), and a random experimental observation/replication error ε. One can take the definition of δ(x) to be the difference between y e (x) (with no random error) and y m (x,θ * ), which we write as a function of only x. The random error ε is assumed to be zero-mean, Gaussian, and independent for different experimental observations. To estimate θ * and δ(x) and quantify their uncertainties, we use the modular Bayesian approach of Kennedy and O'Hagan (2001) and Bayarri et al. (2007), which is comprised of the four modules illustrated in the left panel of Fig. 1 the notation in which will be explained shortly. Next we briefly describe each module. Additional discussion can be found in Kennedy and O'Hagan (2001) and Arendt et al. (2012b).
In the modular Bayesian approach, both y m (x,θ) and δ(x) are modeled as Gaussian Processes (GPs) (O'Hagan, 1978;Cressie, 1993;Handcock and Stein, 1993). In Module 1, the hyperparameters (i.e., the prior correlation and variance parameters) of the GP model for y m (x,θ) are estimated using Maximum Likelihood Estimation (MLE), based on only the observed computer simulation data. The physical experimental data are not used in Module 1 for computational reasons and because the physical experimental data contain little further information about the hyperparameters of the GP model for y m (x,θ). Next, in Module 2, the hyperparameters for the GP model for δ(x) are estimated using MLE based on all of the observed data, treating the MLEs from Module 1 as fixed parameters and marginalizing the likelihood over the assumed prior distribution for the calibration parameters. Subsequently, in order to quantify the calibration parameter uncertainty, Module 3 calculates the posterior distribution of the calibration parameters based on all of the observed data and on the prior for the calibration parameters, with the MLEs from Modules 1 and 2 treated as fixed values.
Lastly, Module 4 calculates a posterior distribution of the experimental response and the discrepancy function using the results of Modules 1 to 3. Thus, the modular Bayesian approach provides posterior distributions that quantify the uncertainty of the calibration parameters, the discrepancy function, and the experimental response. In the next section, we will use the first three modules of the modular Bayesian approach when calculating the preposterior covariance matrix of the calibration parameters prior to observing the physical experimental data, which will provide a prediction of the posterior covariance matrix that will result after observing the physical experimental data.

Calculating the preposterior covariance matrix of the calibration parameters
We next provide an overview of our algorithm for calculating the preposterior covariance matrix, illustrated in Fig. 1. Following this, we give detailed descriptions of each step in the algorithm. The algorithm begins by fitting the GP model for y m (x,θ) based on the computer simulation data (Step 0a). The user then specifies the physical experimental input settings and assigns priors for the GP model of δ(x), for the calibration parameters, and for ε (Steps 0b and 0c). Because the preposterior analysis is conducted prior to observing actual experimental data, the main body of the algorithm is a Monte Carlo (MC) simulation (Steps 1 to 7) in which, on each MC replicate, we generate (Steps 1 to 5) a hypothetical set of physical experimental response observations at the specified input settings based on the knowledge of y m (x,θ) from the computer simulation data and the relevant prior distributions for δ(x), θ, and ε. Then we calculate (Steps 6 and 7) a posterior covariance matrix for θ for the experimental response observations generated in that MC replicate. The preposterior covariance matrix for θ is taken to be (Step 8) the average of the posterior covariance matrices over all MC replicates. Within Steps 1 to 7, Modules 2 and 3 of the standard modular Bayesian approach are used. Next we describe each step of the preposterior analysis in greater detail.
Step 0: Preliminaries. In this preliminary step, the GP model for y m (x,θ) is fitted, and the user specifies several prior distributional quantities related to the physical experimental responses. By "fitting" the GP model for y m (x,θ), we mean finding the MLEs (Schabenberger and Gotway, 2005) T used in the computer simulations, from which we obtain expressions for the posterior distribution of y m (x,θ). This takes place in Step 0a using Module 1 of the modular Bayesian approach. In the preceding, x m i denotes the d × 1 vector of inputs, and θ m i denotes the r × 1 vector of calibration parameters, used on the ith run of the computer simulation. In the GP model, y m has mean vec- , σ m 2 denotes the prior variance of the GP model, and ω m denotes a (d + r) × 1 vector of prior correlation (aka, scale or roughness) parameters. For all examples in this article, we used a constant prior mean function of unknown magnitude (i.e., h m (x) = 1 and β m a scalar hyperparameter to be estimated) and a Gaussian correlation function of the form , where the summation is over the elements of the vectors z and z . For numerical stability of the MLE algorithm, X m and m are transformed to the range [0, 1], and y m is standardized to have a sample mean of zero and a sample standard deviation of one (Rasmussen, 1996). Letφ m denote the MLE of φ m . Later, within the MC Steps 1 to 7, we generate a realization of y m (x,θ) at various sets of {x, θ} values using a multivariate normal random number generator with mean and covariance taken to be the posterior mean and covariance of the GP model of y m (x,θ) from Step 0a. The posterior mean and covariance are calculated using standard results from the literature (O'Hagan, 1978;Rasmussen and Williams, 2006). Here we use plug-in MLEs for σ m 2 and ω m with a non-informative prior for β m . Over the remainder of the preposterior analysis, the GP model of y m (x,θ) is not updated from that of Step 0a. In Step 0b the user specifies the N e experimental input locations X e = [x e 1 , . . . , x e N e ] T at which one intends to conduct the N e physical experimental runs (and at which the hypothetical experimental response observations will be generated within the MC loop). As described in Section 2 in the context of Module 2 of the modular Bayesian approach, we represent the discrepancy function as a GP model with hyperparameters φ δ = {β δ , σ δ 2 , ω δ } (analogous to the hyperparameters φ m for the GP model of y m (x,θ)), to which the user assigns prior distributions in Step 0c. This prior for φ δ is used only in Step 3 of the MC loop, in which values of φ δ are sampled from the specified prior before generating a hypothetical discrepancy function. Because the modular Bayesian approach calculates MLEs of φ δ in Step 6, we do not incorporate the prior for φ δ via maximum a posteriori estimators, although one could modify the approach to do this if desired.
Additionally in Step 0c, the user specifies the prior distribution for λ, the variance of the experimental observational error (which is assumed to be independent and identically distributed normal with mean 0). We treat λ in the same manner as φ δ . Its prior is used in Step 4 when generating a random value for ε, but this prior is not used when calculating the MLE of λ in Step 6.
Lastly, in Step 0c, the user assigns a prior distribution p(θ) to the calibration parameters. Since we are directly interested in the uncertainty in the calibration parameters, Module 3 of the modular Bayesian approach in Step 7 calculates a full posterior distribution for θ based on the specified prior, in contrast with the MLE-only estimation for φ m and φ δ in Modules 1 and 2. Based on the posterior of θ, Module 3 also calculates the posterior covariance of θ for each MC replicate. MC loop (for i = 1, …, N mc ). Steps 1 to 7 are the steps within the MC loop for generating N mc hypothetical sets of physical experimental response observations and calculating a hypothetical posterior covariance matrix for θ for each set. We add a superscript (i) to denote that the quantity was generated in the ith MC replicate.
Step 1: Generate θ (i) from the prior for θ specified in step 0c-iii. The generated θ (i) will be treated as a "true" value for θ when generating the hypothetical physical experimental response observations on the ith MC replicate.
Step 2: Generate y m(i) at {X e , θ (i) }. The N e × 1 vector y m(i) denotes a random realization of values for the computer model's response at the experimental input settings X e (from Step 0b) and the calibration parameter values θ (i) (from Step 1). The vector y m(i) is generated from the multivariate normal posterior distribution of y m (x,θ) computed in Step 0a, and it will be used in Step 5 to calculate a realization of the experimental data y e(i) .
Step 3: Generate δ (i) at X e . The N e × 1 vector δ (i) denotes a random realization of values for the discrepancy function at input settings X e . We first generate a random realization of φ δ , denoted by φ δ(i) , from their priors specified in Step 0c. Then δ (i) is generated from a multivariate normal distribution with mean and covariance based on the prior GP model for δ(x) from Step 0c using parameters φ δ(i) . Specifically, the mean is matrix with kth-row, jth-column entry equal to the prior correlation between δ(x k e ) and δ(x j e ) using parameters ω δ(i) .
Step 4: Generate ε (i) . The N e × 1 vector ε (i) denotes a random realization of values for the experimental random error, with each element generated as a normal random variable with mean 0 and variance λ (i) . Here, λ (i) is first generated from its prior distribution specified in Step 0c-i.
Step 5: Calculate y e (i) . Based on the model of Equation (1), the N e × 1 vector of experimental response observations at X e for the ith MC replicate is calculated via y e(i) = y m(i) + δ (i) + ε (i) .
Step 6: Based on y e(i) , estimate φ δ and λ. For each MC replicate, we calculate the MLEs of the hyperparameters φ δ of the GP model for δ(x) and the MLE of λ using Module 2 of the modular Bayesian approach with the hypothetical experimental data y e(i) from Step 5. We denote the MLEs for the ith MC replicate bŷ φ δ(i) andλ (i) .
Step 7: Calculate the posterior covariance of θ for replicate i. We calculate the posterior covariance Cov (i) [θ| y e(i) , y m ] using Module 3 of the modular Bayesian approach, which treats φ m , φ δ , and λ as fixed at their MLEs from Steps 0 and 6. Cov (i) [θ| y e(i) , y m ] can be viewed as the actual posterior covariance (as calculated via Module 3) that would result if the actual physical experimental response observations y e were equal to y e(i) . Details on calculating the posterior of the calibration parameter are discussed in Kennedy and O'Hagan (2001) and Arendt et al. (2012b). Since r = 1 or r = 2 in the examples in this article, we used numerical integration (with Gauss-Legendre quadrature) to calculate Cov (i) [θ| y e(i) , y m ]. For higher-dimensional problems, if one uses Markov Chain Monte Carlo (MCMC) (Robert and Casella, 1999) methods to calculate the posterior distribution of θ, Cov (i) [θ| y e(i) , y m ] can be directly calculated as the sample covariance matrix of the MCMC samples.
Step 8: Calculate the preposterior covariance of θ. After replicating Steps 1 to 7 in the MC loop a total of N mc times, we calculate the preposterior covariance of the calibration parameters as the average of the N mc posterior covariance matrices. That is, if we denote the preposterior covariance by θ,pp = E[Cov(θ|Y e ,y m )|y m ], where Y e denotes the (as yet) unobserved actual experimental data, then the estimate iŝ (2)

A modified algorithm for investigating the behavior of the preposterior analysis
Consider the following modification to the algorithm of Fig. 1. In Step 1, instead of randomly generating a different θ (i) from its prior distribution on each MC replicate, we use the same fixed value (denoted by θ t , a value specified by the user outside the MC loop) for all replicates. We refer to this algorithm as the fixedθ preposterior analysis and denote the resulting estimate of the preposterior covariance matrix byˆ θ , pp (θ t ). Recall that θ (i) in the algorithm in Fig. 1 is only used when generating the hypothetical experimental observations y e(i) , for which θ (i) is treated as the true values of the calibration parameters. In reality there is only a single true value θ * that the laws of nature will dictate when generating the actual experimental data y e , based on which the actual posterior covariance of θ will be calculated. In light of this, one might wonder how well the preposterior analysis of Fig. 1 (which has no knowledge of the true θ * when generating y e(i) and must, therefore, average the results over values of θ (i) drawn from the prior p(θ)) can predict the actual posterior covariance matrix. In the ideal situation thatˆ θ,pp (θ t ) does not depend on θ t , we might expect the prediction to be good. In Section 5 we use the fixed-θ preposterior analysis to investigate the extent to whichˆ θ,pp (θ t ) depends on θ t in the examples considered.
Although it may seem counterintuitive thatˆ θ,pp (θ t ) might not depend (strongly) on θ t , this is precisely the situation in the standard linear regression formulation y = Xθ + error, where X is the design matrix, and y is the observation vector. In the linear regression formulation, under mild assumptions the covariance matrix of the regression estimate of θ is [X T X] −1 multiplied by the error variance, which is independent of the true θ. This characteristic allows one to design experiments, prior to observing y, which are "optimal" regardless of the true θ. The situation is more complex for the calibration problem, in part because of the black-box, potentially nonlinear dependence of y m (x,θ) on θ. However, in the next section we demonstrate that the relative (when comparing different experimental designs) dependence ofˆ θ,pp (θ t ) on θ t in the examples is mild enough that the preposterior covariance still allows one to distinguish good experimental designs from poor ones.
Notice that the fixed-θ preposterior covarianceˆ θ,pp (θ t ) is related to the preposterior covarianceˆ θ,pp viâ where p(•) is the prior distribution of θ specified in Step 0c-i. In other words, the preposterior covariance is the expected value of the fixed-θ preposterior covariance with respect to the prior of θ.

Examples and discussion
In this section, we consider various examples that illustrate how the preposterior analysis can be used, after conducting the computer simulations but before conducting the physical experiments, to predict the identifiability that will result from a proposed experimental design. The system in the first example (Section 5.1) is inherently difficult to identify even with a large amount of observed physical experiments, whereas the system in the second example (Section 5.2) is more identifiable with an appropriate experimental design.

... Computer model, physical experiments, and preliminaries
We investigate the simply supported beam example considered in Arendt et al. (2012b). Briefly, the simply supported beam is composed of a 2 m long beam with a rectangular cross-section (height of 52.5 mm and width of 20 mm). One end of the beam is rigidly fixed to a support, whereas the other end is supported by a roller. The design variable x is the magnitude of a static force, in Newtons, applied to the midpoint of the beam, and the response y is the strain, in meters, measured at the midpoint of the beam (at the top of the cross-section). The computer simulation is a finite element analysis (FEA) model, implemented in Abaqus 6.9, with a simplified material law that depends on the Young's modulus (the calibration parameter θ ). For Step 0a, the simulations were observed on a 4×4 evenly spaced grid (N m = 16) over the input space (1300 ࣘ x ࣘ 2300 N and 150 ࣘ θ ࣘ 300 GPa). The simulation response observations y m and the posterior mean of y m (x,θ) from Step 0a are shown in Fig. 2. Because of the smooth nature of y m (x,θ), there was very small posterior uncertainty in y m (x,θ) over the entire input space. Regarding Step 0b, we will conduct different analyses for different values of N e . For each N e , X e is evenly spaced over the input range 1300 ࣘ x ࣘ 2300. For purposes of comparing the preposterior covariance to the posterior covariance, we generated actual "physical" experimental data via the same FEA model but using a more elaborate material law that results in Figure . Plot of the fixed-θ preposterior STD and the posterior STD (both normalized by the prior STD of θ ) versus θ t for the beam example. "θ PP STD" is the fixed-θ preposterior STD and "P STD" is the posterior STD.
a discrepancy between the computer model and the experimental data. Observation error with variance λ = 6.63e-12 m 2 was added to the experimental data, and the true value of the Young's modulus calibration parameter was taken to be θ * = 206.8 GPa. These physical experimental data are not used in the preposterior algorithm of Fig. 1 and are used only to verify the results of the preposterior algorithm. For Step 0c, we assigned to φ δ a point mass prior with mass at {β δ , σ δ 2 , ω δ } = {0.00, 1.11×10 −7 , 0.002}. The value 1.11×10 −7 m 2 for σ δ 2 was chosen so that the bounds of the 99.7% prediction interval of the discrepancy function were approximately [−1, 1] ×10 −3 m. The correlation parameters ω δ were chosen to represent a relatively smooth GP model for the discrepancy function. We also assigned to λ a point mass prior with mass at 6.63e-12 m 2 . Recall that the prior distributions for φ δ and λ are used only in Steps 1 to 5 to generate random realizations of y e(i) . In general, the MLEs of φ δ and λ are calculated in Step 6. For the examples in this article, however, we treated λ as a known value and did not estimate it in Step 6. This is reasonable in systems for which the random error variance can be estimated externally, simply by taking replicate experimental measurements at the same input settings. Lastly, we chose p(θ ) to be a non-informative uniform distribution over the full range of θ .

... Comparing the preposterior and posterior covariances
We begin by comparing the fixed-θ preposterior covariancê θ,pp (θ t ) for various θ t to the actual posterior covariance based on the experimental data from the more elaborate FEA model. As described in Section 4, for each θ t ,ˆ θ,pp (θ t ) was calculated using the algorithm of Fig. 1 with θ (i) from Step 1 replaced by the same value θ t for all N mc replicates. Fig. 3. plots the fixedθ preposterior standard deviation (STD) (i.e., the square root ofˆ θ,pp (θ t )) versus θ t for two different experimental designs with N e = 3 and N e = 8. The individual points plotted at θ = 206.8 GPa and represented by the shaded circle (for the design with N e = 3) and shaded square (for the design with N e = 8) are the actuals posterior STDs of θ that result by applying Modules 1 to 3 of the standard modular Bayesian approach to a set of experimental data generated using the true value θ * = 206.8 GPa. These individual points are included in Fig. 3 for purposes of comparing to the preposterior STDs that our algorithm calculates, which are represented by the series of connected points. In Fig. 3 both the fixed-θ preposterior STD and the posterior STD have been normalized (divided) by the STD of the prior p(θ ).
Although the fixed-θ preposterior STDs differ from the posterior STD, their relative change as N e increases is comparable. In particular, the relative difference in the fixed-θ preposterior covariance for the design with N e = 3 and the design with N e = 8 is reasonably independent of the fixed θ t , and this relative difference is quite consistent with the relative difference in the actual posterior covariance for the true θ * = 206.8 GPa. Consequently, the relative difference in the preposterior STD (which we calculated as the integral of the fixed-θ preposterior STD in Fig. 3 with respect to p(θ )) accurately reflects the relative difference in the actual posterior STD in this example. Specifically, when N e is increased from three to eight in Fig. 3 the posterior STD decreases by 10.3% (from 0.8231 to 0.7383), and the preposterior STD decreases by 12.5% (from 0.6676 to 0.5840). This correctly indicates that the experimental design with N e = 8 will result in somewhat better identifiability than the experimental design with N e = 3, as measured by the posterior STD that would result after the experiment is conducted.
We then calculated the preposterior STD from the fixedθ preposterior STDs in Fig. 3 as described in Section 4. We repeated the analyses for various N e , and Fig. 4 shows a scatter plot of the preposterior STD versus the posterior STD (each normalized by prior STD θ ) for N e = 3, 4, …, 15. The scatter plot conveys a number of interesting characteristics of the beam example. Although the preposterior STD slightly underestimates the posterior STD for the different designs, the trend is the same; when N e increases from three to four, the preposterior STD correctly predicts that the posterior STD will decrease. Then, as N e increases from four to 15, the preposterior STD again correctly predicts that there will be negligible further decrease in the posterior STD. The overall conclusion that a user might draw from this preposterior analysis is that the system inherently suffers from a lack of identifiability, because an increase in N e beyond a value of four results in almost no further improvement in the preposterior STD. Arendt et al. (2012aArendt et al. ( , 2012b discussed the reasons behind the lack of identifiability in detail and also demonstrated that identifiability is greatly improved when multiple responses are measured experimentally. In the next section we will explore a second example in which the identifiability of the system continues to improve as experimental data are added.

Sinusoidal example
In this section, we explore an example involving a mathematical test function to investigate various aspects of the preposterior analysis. We compare the fixed-θ preposterior covariance to the posterior covariance for two different experimental data input settings and show results analogous to those shown in Fig. 3 and Fig. 4. Further analyses are presented in the online supplement.

... Computer model, physical experiments, and
preliminaries Suppose the computer model is the function: The simulation was run on an 8×8 evenly spaced grid (N m = 64) over the input space (1 ࣘ x ࣘ π and 1 ࣘ θ ࣘ 4). The simulation response observations y m and the posterior mean of y m (x,θ) from Step 0a are shown in Fig. 1. Although the model shown in Equation (4) was used to generate the computer model data used in Step 0a, no knowledge of this parametric model was used anywhere in the algorithm to calculate the preposterior standard deviations. In the GP modeling in Step 0a, these computer model data were treated as having come from a black-box, and in Step 2 the computer model responses were generated from the posterior distribution of y m (x,θ) obtained in Step 0a without using knowledge of the underlying sinusoidal relationship.

Figure .
Plot of the fixed-θ preposterior STD versus θ t and the posterior STD versus θ * (the vertical axis is normalized by the prior STD of θ ) for the sinusoidal example. "θ PP STD" is the fixed-θ preposterior STD and "P STD" is the posterior STD.
For the experimental data, we consider different values of N e , and for each, X e is evenly spaced over the input range 1 ࣘ x ࣘ π . Within Steps 1 to 5 of the algorithm of Fig. 1 the experimental data are generated according to the GP models. However, to generate the actual physical experimental data (used only for the purpose of calculating actual posterior STDs to compare to the preposterior STDs calculated by our algorithm), we use the model where we have added a superscript k on [δ k (x)] because we consider a number of different discrepancy functions in the online supplement. Here we consider the discrepancy function: which we use only to generate the actual experimental data (the discrepancy function is still generated from a GP model within the algorithm for calculating the preposterior STD). Observation error with variance λ = 1×10 −14 was added to the experimental data. Regarding Step 0c, we assigned to φ δ a point mass prior with mass at {β δ , σ δ 2 , ω δ } = {0.00, 0.028, 3.09}. The value 0.028 for σ δ 2 was chosen so that the bounds of the 99.7% interval for the discrepancy function were approximately [−0.5, 0.5] (other values for σ δ 2 are considered in the online supplement), and ω δ was chosen to represent a relatively smooth GP model. We also assigned to λ a point mass prior with mass at 1×10 −14 . As in the beam example, we treat λ as known but estimate φ δ in Step 6. We chose p(θ ) to be a non-informative uniform distribution over the full range of θ .

... Comparing the preposterior and posterior covariances
We start by comparing the fixed-θ preposterior STD for various θ t to the actual posterior STD based on the experimental data generated from Equation (6) for two different experimental designs with N e = 3 and N e = 8. When generating the experimental data, we considered a range of values for θ * . Figure 6  plots the fixed-θ preposterior STD versus θ t and the actual posterior STD versus θ * . The posterior and preposterior STDs are again normalized by the prior STD of p(θ ). In Fig. 6, the fixedθ preposterior STD and the posterior STD are similar when θ t = θ * , and the preposterior STD reflects the relative differences (for different N e ) in the actual posterior STD reasonably well for most θ * . For example, if θ * = 2.25 the posterior STD decreases from 0.6302 to 0.2861 when N e increases from three to eight, and the preposterior STD (the integral of the fixed-θ preposterior STD in Fig. 6, with respect to p(θ )) decreases from 0.8304 to 0.2650. Although these are different, the preposterior STD correctly indicates that the experimental design with N e = 8 will result in substantially better identifiability compared with the identifiability achieved using the experimental design with N e = 3.
We next compare the preposterior and posterior STDs for various N e . Instead of specifying a single value for θ * as in Fig. 4, we consider the mean posterior STD (the mean of the posterior STD with respect to a uniform distribution for θ * over the range 1 ࣘ θ * ࣘ 4). And as a measure of the extent to which the posterior STD depends on θ * , we also calculate the standard deviation of the posterior STD (also with respect to a uniform distribution for θ * over the range 1 ࣘ θ * ࣘ 4). Figure 7 shows a scatter plot of the preposterior STD versus the mean posterior STD (the error bands are ±1.0 standard deviations of the posterior STD) for N e = 3, 4, …, 10. There is a very tight relationship between the preposterior STD and the mean posterior STD. Notice that the wide error bands for N e = 3 and the much narrower error bands for N e = 8 are consistent with Fig. 6 in the following sense. For N e = 8 the posterior STD in Fig. 6 is much less dependent on θ * than for N e = 3. There is actually a stronger trend in Fig. 7 than the wide error bands would seem to imply. If one fixed θ * and plotted the posterior STD for the fixed θ * versus the preposterior STD (as was done in Fig. 4), there would be a very tight linear trend in the graph for each θ * . Consequently, for this example we conclude that the preposterior STD provides an accurate indication of the relative improvements in identifiability that would be achieved by increasing the size of the physical experiment. See the online supplement for further analyses.

Using the preposterior analysis to select N e
One way in which the preposterior analysis can be used is to help the user choose the number N e of experimental runs. Figure 8, which plots the normalized preposterior STD versus N e for the beam and sinusoidal examples, illustrates how this might be accomplished. For the beam example, one might conclude that increasing N e beyond four will provide little further knowledge of θ * and that the beam system is inherently difficult to identify with only the single experimentally measured response (which was strain at the beam midpoint). See Arendt et al. (2012a) for discussions on how experimentally measuring multiple responses (e.g., angle of beam deflection, internal energy, etc.) that share a mutual dependence on the calibration parameter may substantially improve identifiability. For the sinusoidal example (for which results with two different choices of σ δ 2 are shown in Fig. 8), one might conclude that the system is more identifiable than the beam system but that there is little further improvement to be gained by using N e > 8. See the online supplement for an additional example with two inputs.

Is calibration identifiability possible with the Kennedy and O'Hagan model?
The focus of this work is on assessing the level of identifiability in the Kennedy and O'Hagan calibration model, prior to conducting the physical experiment. This is premised on the assumption that reasonable identifiability is sometimes possible for some applications. In the calibration literature, the nearly universal consensus is that identifiability is extremely difficult for many systems, because of the difficulty in distinguishing between the discrepancy function and the effects of the calibration parameters. However, there appears to be some debate regarding whether one can ever claim good identifiability with any level of confidence. We believe that good identifiability is indeed possible in some cases, and in this section we provide an example to support this and demonstrate how it strongly depends on the experimental design. We also argue that the Kennedy and O'Hagan approach does a good job of letting the user know whether or not reasonable identifiability was achieved for a particular situation via the posterior standard deviations for θ. We also discuss implications of the form of identifiablity studied in Loeppky et al. (2006) and Tuo and Wu (2013), which is asymptotic in nature.
To illustrate situations in which good identifiability is possible, depending on the experimental design, suppose that there are two inputs, two calibration parameters, and the com- 10,20], and θ 2 ∈ [0, 2π ]. Further suppose, that the computer simulation is cheap, so that infinitely many observations of y m are available, and the physical experimental data are generated from Equation (1) with true values {θ * 1 = 15, θ * 2 = π/2} (in radians, with x unitless), ε ∼ N(0, 0.1 2 ), and discrepancy function [δ(x) = 0.5 + x 2 1 ]. As always, we treat the parametric computer model and discrepancy function as a black-box and use no knowledge of their parametric forms in the analyses, other than as a mechanism to generate the data. We used point mass priors for φ δ and λ with masses at {β δ , σ δ 2 , ω δ } = {0.00, 1.0, 1.0} and λ = 0.1 2 and a noninformative prior for θ over its full range.
Each panel in the top row of Fig. 9 shows y e (x) and y m (x, θ) (the latter for three different values of θ) as a function of x 1 for a particular physical experimental design. Each panel in the bottom row shows the three differentδ(x) corresponding to the three different potential values of θ for the data in the top row panel immediately above. In panels (a) and (b), the physical experimental design is N e = 60 points evenly spaced over the full range x 1 ∈ [−1, 1] with x 2 = 1 fixed. For the design in panel (a), one can observe a very close match (aside from a vertical offset due to the quadratic discrepancy) between the observed experimental data y e (x) and the computer model response y m (x, θ) only at the true values θ = θ * = {15, π/2}. For the other two values of θ ({10, π/2} and {15, 0}), the resulting y m (x, θ) is a far poorer match with y e (x).
Would the data in panel (a) constitute reasonably compelling evidence that the true θ * is close to the correct value {15, π/2}? This is debatable, but our opinion is that it would. The response y m (x, θ) in panel (a) exhibits very distinct behavior as a function of x 1 , and this behavior is highly dependent on θ. Although one cannot rule it out with absolute certainty, it seems highly unlikely that the close match in this distinct behavior of y m (x, θ) and y e (x) only at θ ≈ θ * is just coincidence and the true θ is really some other value for which the match between y m (x, θ) and y e (x) is far inferior.
As an analogy, much of the inferential statistical objective when fitting linear regression models to limited-size samples of data is distinguishing between what is likely a meaningful relationship versus what is just coincidental random chance. There are established methods of quantifying the statistical significance and uncertainty (P-values and confidence intervals), and even though one may not be able to conclude with absolute certainty that an observed relationship in the sample data is more than just coincidence, one may be able to conclude this with reasonable certainty.
Fortunately, the Kennedy and O'Hagan approach has an inherent mechanism for quantifying the statistical uncertainty and assessing (what it thinks is) the level of identifiability; namely, the posterior standard deviations of θ. For the data in Figs. 9(a) and 9(b), the posterior standard deviations of θ 1 and θ 2 were 0.081 and 0.050, respectively, which are quite small and indicate that there is reasonably good identifiability for this example and this design (point estimates wereθ 1 = 14.978 and θ 2 = 1.610, which are close to the true values). We discuss the preposterior standard deviation shortly.
Contrast this with the data for the design shown in Figs. 9 (c) and 9(d), for which there were still 60 points evenly spaced over the full range x 1 ∈ [−1, 1], but now with x 2 = 0.1 fixed at a smaller value than for Figs. 9(a) and 9(b). With smaller x 2 , the amplitude of oscillation in the response is much smaller. The match between y e (x) and y m (x, θ) is slightly better at the true values θ = θ * = {15, π/2} than at the two other values of θ, but the difference is not nearly as pronounced as in Figs. 9(a) and 9(b). Visually, the data for the Figs. 9(c) and 9(d) design present far less compelling evidence that θ * ≈ {15, π/2} than the corresponding data for Figs. 9(a) and 9(b) design. Quantitatively, the Kennedy and O'Hagan approach correctly assessed the poorer level of identifiability, producing posterior standard deviations of 0.834 and 0.521 for θ 1 and θ 2 , respectively, which are a full order of magnitude larger than for Figs. 9(a) and 9(b) (point estimates wereθ 1 = 14.88 andθ 2 = 1.232).
The design shown in Figs. 9(e) and 9(f), for which the 60 design points were evenly spaced over a much smaller range x 1 ∈ [−0.1, 0.1] with x 2 = 1.0, resulted in even poorer identifiability for θ 1 . Intuitively, the reason for the poorer θ 1 identifiability is that the distinct oscillatory behavior of y e (x) and y m (x, θ) is not present over the smaller range x 1 ∈ [−0.1, 0.1]. The Kennedy and O'Hagan approach was again able to assess the poorer level of identifiability, producing posterior standard deviations of 1.273 and 0.214 for θ 1 and θ 2 , respectively (point estimates wereθ 1 = 14.77 andθ 2 = 1.501). Figure 10 is a scatter plot of the posterior standard deviation versus preposterior standard deviation for θ 1 and θ 2 for the three different experimental designs corresponding to Figs. 9(a), 9(c), and 9(e). There is quite strong correlation between the posterior and preposterior standard deviations, which indicates that the preposterior analysis has done a good job of predicting the level of identifiability that will result after conducting the physical experiment. Figure 10 also highlights the importance of the experimental design and the extent to which it can affect identifiability. In this case, the preposterior analysis could have been used to choose the design in Fig. 9(a), on the basis of providing far better identifiability than the designs in Figs. 9(c) and 9(e).
It is worth noting that Tuo and Wu (2013) investigated the identifiability issue via an asymptotic analysis. They ignored the random error ε in Equation (1) and considered the likelihood-based estimator of θ, which (for cheap simulation

code) minimizes
where y e and y m (θ) are the vectors of experimental and simulation response observations at the same input sites X e , and R δ denotes the covariance matrix of the discrepancy function at X e . They showed that as N e → ∞ and the observations more and more densely cover the input domain (denoted by ) for x, the minimizer of the discrete-sample likelihood criterion (7) converges to the minimizer θ ≡ argmin θ ∫ ∫ [y e (x) − y m (x, θ)]R −1 δ (x, z)[y e (z) − y m (z, θ)]dxdz of its continuous-domain counterpart. The integral here is the Reproducing Kernel Hilbert Space (RKHS) norm of the difference y e (x) − y m (x, θ) between the experimental and simulation response functions, where the kernel R δ (x, z) is the discrepancy covariance function, and its inverse R −1 δ (x, z) can be informally viewed as the Karhunen-Loève expansion of R δ (x, z) with the eigenvalues inverted (y e (x) − y m (x, θ) must be an element of the same RKHS, for which its RKHS basis representation coefficients converge at a faster rate than the eigenvalues of R δ (x, z), which was assumed in Tuo and Wu (2013)). They also defined θ 0 ≡ argmin θ ∫ [y e (x) − y m (x, θ)] 2 dx (i.e., the minimizer of the L 2 -norm of y e (x) − y m (x, θ), which is the RKHS norm with a Dirac delta kernel) and termed the likelihood-based estimator as inconsistent because it approaches θ , as opposed to θ 0 , as the observations cover more and more densely. Loeppky et al. (2006) derived the similar, but less general, result that the likelihood-based estimator approaches θ 0 in the special case that y m (x, θ 0 ) = y e (x) for all x ∈ (i.e., when there exists a value of θ that makes the difference y e (x) − y m (x, θ) identically zero). In light of this, in the context of tuning (for which there is no true value θ * of the parameters), Tuo and Wu (2013) recommended a modified estimator of θ that essentially minimizes [y e − y m (θ)] T [y e − y m (θ)] instead of Equation (7). Their modified estimator converges to θ 0 as the observations cover more and more densely.
In the context of calibration (for which one would like to estimate the true value θ * , as opposed to θ 0 ), one may be tempted to view the inconsistency result of Tuo and Wu (2013) as casting the prospect of calibration in a dismal light. To quote one of the anonymous referees for this article, the (likelihood version of the) estimator in the Kennedy and O'Hagan model will "converge to the wrong value. " However, we believe that this does not preclude reasonable calibration identifiability for a number of reasons. Although the Tuo and Wu (2013) result is asymptotic in sample size, the input domain remains fixed. And for most, if not all, real simulation and experimental systems, the size of is limited. Because of the highly spatially correlated nature of response surface data, when sample size increases beyond some value, very little new information is gained by more and more densely covering a limited-size input domain with additional observations. In this sense, one can view the Tuo and Wu (2013) result as a finite-sample result, and for finite samples, virtually all estimators will give the wrong value with probability one. If the size of were allowed to grow to infinity in a truly asymptotic scenario, it is likely that one could devise conditions on y m (x, θ) and y e (x) under which the estimator of θ is guaranteed to converge to the true θ * . However, considering the inherently limited size of in real applications, it is doubtful that any such asymptotic analysis would be meaningful.
For the calibration problem, we believe that the more relevant question is not whether the estimator is wrong for a given set of infinitely dense experimental and simulation data over a finite, limited input domain but, rather, whether reasonably good identifiability can be achieved for the finite sample of data that will be used. We also believe that the examples in this article and elsewhere (e.g., Higdon et al. (2004), Loeppky et al. (2006), Han et al. (2009), and Arendt et al. (2012a, 2012b) demonstrate that reasonably good identifiability is achievable in some situations and, in order to assess the level of identifiability, the posterior standard deviation (after conducting the experiment) and the preposterior standard deviation (prior to conducting the experiment) can be used.

Conclusions
Due to identifiability issues, it is often difficult to distinguish parameter uncertainty from uncertainty in the discrepancy function when using the GP-based method of Kennedy and O'Hagan (2001) for calibrating computer models. In this article, we have shown that the preposterior covariance of θ calculated via the algorithm of Fig. 1 constitutes a reasonable prediction of the posterior covariance that will result when the physical experimental data are collected, at least for the examples considered. As such, the preposterior covariance can provide insight into the identifiability of a system and serve as a criterion for helping to design an effective physical experiment, based on the results of a previously conducted set of computer simulations.
Regarding the conditions on δ(x) required for reasonable identifiability, clearly there must be some, and δ(x) is not allowed to be any function. Our assumption that δ(x) can be represented by a GP model is not a particularly strong one, because for various choices of correlation function a GP prior can produce an extremely wide variety of stochastic realizations of δ(x). This flexibility of the GP model is one of the reasons for its popularity in the calibration literature. As discussed in Arendt et al. (2012b), good identifiability is more difficult to achieve if δ(x) is not relatively smooth, which is reflected by the correlation parameters ω δ (smaller ω δ corresponds to a smoother discrepancy function). After collecting the physical experimental data, one can always estimate ω δ to gauge the smoothness of δ(x). However, perhaps the best and most direct measure of whether δ(x) is sufficiently smooth to allow identifiability is to simply look at the posterior standard deviation of θ, which directly reflects the impact of the smoothness of δ(x) on identifiability. However, prior to collecting the physical experimental data, it may be difficult to accurately gauge the smoothness of δ(x). In this case, if one specifies a specific ω δ (or a range of ω δ with some prior distribution assigned to it) to use in the preposterior algorithm, then the estimated preposterior standard deviation of θ should be interpreted as what would result if the discrepancy function truly had the assumed degree of smoothness. One should keep in mind that this complication (not knowing the actual smoothness of δ(x) until after the experimental data are collected and, hence, not knowing how good the design will be until after the experiment is conducted) is certainly not unique to the calibration problem. In any experimental design problem, one never knows with certainty whether the design is good until after the experiment is conducted. For example, an experiment designed for estimating a linear model will end up being a very poorly designed experiment if, after conducting the experiment, it is discovered that the true response surface is really quadratic. The prevailing viewpoint in experimental design is to accept the fact that one may need to conduct a follow-up experiment based on what was observed in the initial experiment, and this same viewpoint applies to the calibration problem. If one designs a calibration experiment based on an assumed ω δ , and after conducting the experiment it is discovered that the actual ω δ is much different, then one could design and conduct a better follow-up experiment based on the improved knowledge of δ(x) learned from the initial experiment.
As noted by a referee, for the examples summarized in Fig. 8, there was little further improvement in the preposterior standard deviation when N e increased beyond the number of settings for x used in the computer experiment, and for the grid designs in those examples the x settings for the computer and physical experiments coincided. This suggests that some form of nested designs such as those proposed in Qian (2009) and Qian et al. (2009) may be useful designs for calibration purposes, a point that was also noted by Han et al. (2009). There are certainly other issues that would influence which design would provide the lowest posterior standard deviation for a particular calibration problem. For example, Arendt et al. (2012aArendt et al. ( , 2012b have shown that the posterior standard deviation strongly depends on which response variables are measured experimentally when there are multiple responses, and they have also shown that regions of the x space over which the simulation response y m (x,θ) depends more strongly on θ are prime candidates for input settings at which to run physical experiments. A proper investigation of the relative merits of nested designs versus other designs in general should use existing methods (e.g., the Bayesian analyses of Kennedy and O'Hagan (2001), Higdon et al. (2004), Loeppky et al. (2006), Williams et al. (2006), Bayarri et al. (2007), and Han et al. (2009)) to evaluate the actual posterior standard deviation and/or Bayesian point estimates of θ that result after conducting experiments with various designs. The preposterior analysis of this article is not intended to serve as a measure of performance after the design is conducted or as a method of comparing the general performance of different classes of experimental designs over a broad range of examples. It is only intended to serve as a rough proxy for performance that can be calculated prior to conducting the physical experiment, to aid in selecting a reasonable design for a particular problem. The examples of Section 5 illustrate how the preposterior standard deviation can be used for these purposes. For example, Fig. 8 indicates that the designs with N e = 4 and 8 are reasonable designs for the beam and sinusoidal examples, respectively, in the sense that the preposterior standard deviation does not further decrease if N e is further increased. The actual posterior standard deviations shown in Figs 4, 7, 9, and 10 confirm that the relative performance differences between the different designs were accurately predicted by the preposterior standard deviations.
In the examples, we have used the preposterior standard deviation to compare and choose from among a small set of experiments that were designed using other methods. The examples considered evenly spaced grid designs and Latin hypercube designs of various sizes and focused on selecting the appropriate size. More generally, the algorithm in Fig. 1 could be used to directly compare any set of given designs (one would specify the input settings for each design in Step 0b). The approach could, in principle, also be used as a formal design optimization criterion to customize the exact values for the experimental x settings, choose which of the simulation response variables to measure experimentally, etc. However, in its current form the computational expense of the preposterior algorithm is prohibitive for these purposes. For the beam (Section 5.1) and sinusoidal (Section 5.2) examples, each MC replicate (i.e., each iteration of Steps 1 to 7 in Fig. 1) took roughly 0.018 minutes on average on a single-processor, 3 GHz, i7 machine. The computational expense is largely independent of the size N e of the physical experiment if N e is less than N m , as will often be the case. Because Steps 1 to 7 account for most of the computational expense of the entire algorithm, the overall expense is roughly proportional to N mc . For example, using 1700 MC replicates for the beam example, the entire algorithm took roughly 31 minutes to calculate the preposterior standard deviation for each experimental design. Using 1700 MC replicates was more than sufficient for this example, and we found no appreciable difference in the results when we increased to 8500 MC replicates. The computational expense will also increase with the number of variables. For the example depicted in Figures S4 and S5 of the online supplement, each MC replicate took roughly 0.028 minutes on average, and this was also largely independent of N e . In order to use the approach in a formal design optimization algorithm, one would have to calculate the preposterior standard deviation for a great many different candidate designs as the exact locations of the x settings are varied. Hence, further research on reducing computational expense is needed to adapt the algorithm for these purposes. We are currently investigating methods of improving the computational expense; for example, taking into account the fact that the computer simulation design is the same for each physical experimental design being evaluated.

Funding
The grants that supported this work from the National Science Foundation (CMMI-1233403, CMMI-0928320 and CMMI-0758557) and also the U.S. Army Tank-Automotive Research Development & Engineering Center (TARDEC) (contract number W911NF11D0001-0037) are gratefully acknowledged.