Optimal Simulator Selection

Abstract Computer simulators are widely used for the study of complex systems. In many applications, multiple simulators are available, each with a different scientific interpretation of the underlying mechanism, and the goal is to identify an optimal simulator based on the observed physical experiments. To achieve this goal, we propose a selection criterion based on leave-one-out cross-validation. The criterion consists of a goodness-of-fit measure and a generalized degrees of freedom penalizing the simulator's sensitivity to perturbations in the physical observations. Asymptotic properties of the selected optimal simulator are discussed. It is shown that the proposed procedure includes a conventional calibration method as a special case. The finite-sample performance of the proposed procedure is demonstrated through numerical examples. In an application to cell biology, an optimal simulator is selected, which can shed light on the T-cell recognition mechanism in the human immune system. Supplementary materials for this article are available online.


Introduction
There are generally two types of experiments for the study of complex systems: physical and computer experiments. Physical experiments refer to actual experiments performed in a laboratory or observed in the field. They are often expensive and/or infeasible to conduct. Therefore, computer experiments, which refer to simulations using complex mathematical models and numerical tools, commonly serve as alternatives.
Based on the observed physical experiments, there is a growing demand in many applications for identifying an optimal computer simulator among multiple candidates with different scientific interpretations of the underlying mechanisms. For example, among different queuing models representing different types of patient flow, it is important to identify the best simulator for a particular medical service in a hospital (Lakshmi and Iyer 2013). Geologists want to know which of several global weather models, governed by different fluid dynamics and thermodynamics equations, best predicts the weather of a local region (Richardson 2007). Biologists need to select differential equations to represent the growth (or decline) of a biological population (Brauer and Castillo-Chavez 2012). However, most existing developments in the computer experiment literature assume that only one simulator is available (Fang, Li, and Sudjianto 2006; Santner et al. 2018). To the best of our knowledge, no systematic procedure has been developed for selecting an optimal simulator.
The goal of optimal simulator selection differs from variable selection problems in computer experiments (Bastos and O'Hagan 2009; Overstall and Woods 2016), where the focus is to identify significant variables using a single computer simulator. It also differs from studies of multi-fidelity simulations, where multiple simulations are developed based on the same physical law but with different approximation accuracies, and the objective is to incorporate information efficiently from all the computer simulators (Kennedy and O'Hagan 2000; Tuo, Wu, and Yu 2014).
To identify an optimal simulator, we propose a new criterion based on leave-one-out cross-validation (LOOCV). LOOCV is commonly used for model selection, but implementing this idea for simulator selection is nontrivial for two reasons. First, unlike typical model selection problems, the current setting involves two types of data: one from physical experiments and one from computer simulations. Second, in addition to a set of regular parameters shared across all the simulators, each simulator is associated with a unique set of calibration parameters. These two types of parameters play different roles, and it is crucial to distinguish their impacts in optimal simulator selection. The proposed LOOCV criterion addresses these issues by incorporating a goodness-of-fit measure for each simulator and a generalized degrees of freedom penalizing the sensitivity of the simulator due to calibration. Different from typical AIC- and BIC-type methods (Burnham and Anderson 2011; Wood, Pya, and Säfken 2016), where the penalty is defined directly by the total number of parameters, the proposed criterion takes into account the unique feature of calibration and offers a data-driven penalty for the simulator sensitivity.
This article is organized as follows. In Section 2, we propose a criterion for selecting the optimal simulator. In Section 3, the proposed criterion is shown to decompose into a measure of goodness of fit for the physical experiments and a generalized degrees of freedom that properly penalizes the sensitivity of the simulator due to the calibration parameters. It is also shown that the proposed criterion includes the conventional L_2 calibration criterion (Tuo and Wu 2015) as a special case when only one simulator is available. The asymptotic properties of the selected optimal simulator are also discussed in Section 3. Simulation studies are provided in Section 4, and an application of the proposed method to selecting the optimal antigen recognition mechanism for T-cell signaling is provided in Section 5.

Cross-Validation for Optimal Simulator Selection
Assume that there are n observations available from physical experiments, denoted by D = {(x_i, y_i)}_{i=1}^n. For notational simplicity, we first assume the outputs y_i are continuous and follow the model

y_i = ξ(x_i) + ε_i,  i = 1, . . . , n,  (1)

where y(x_i) = y_i, ξ(·) is known as the true process in the computer experiment literature, and the ε_i are identically distributed random variables with zero mean and finite variance (Tuo and Wu 2015). The true process ξ can be estimated by nonparametric regression methods, such as kernel ridge regression (Shawe-Taylor and Cristianini 2004) and Gaussian processes (Fang, Li, and Sudjianto 2006; Santner et al. 2018), and the estimator is denoted by ξ̂(·). Apart from the physical experiments, there are K candidate computer simulators, which refer to K different mathematical models representing different underlying mechanisms. These simulators are often computationally intensive to run, especially for the study of complex systems; therefore, they are approximated by surrogate models for further analysis and inference. These surrogate models, known as emulators, are denoted by f_k(x; θ_k, β_k), where k = 1, . . . , K, θ_k is a set of unknown parameters called calibration parameters (Santner et al. 2018), and β_k collects the remaining unknown parameters for constructing the kth emulator. Both θ_k and β_k can differ over k. Various surrogate models, such as Gaussian process models or spline-based models (Wahba 1990), are applicable here for constructing emulators, and the parameters β_k can be estimated accordingly. Given the computer experiment outputs D_k^s collected from the kth simulator, we assume that one emulator, f_k(x; θ_k, β̂_k), is constructed and serves as the surrogate for the simulator to perform prediction, inference, and uncertainty quantification.
By incorporating the information from the physical experiments D, the calibration parameter estimate θ̂_k(D) can be obtained by calibration methods, including L_2 calibration, which minimizes the L_2 discrepancy between the simulator and the true process ξ(·) (Tuo and Wu 2015), and the least-squares approach, which minimizes the sum of squared differences between the simulator outputs and the physical observations.
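For intuition, the L_2 calibration step can be sketched as follows: minimize the empirical L_2 discrepancy between the estimated true process and the simulator over candidate calibration parameters. The toy true process, the linear simulator, and the grid search below are illustrative assumptions, not models from this article.

```python
import numpy as np

def l2_calibrate(xi_hat, simulator, theta_grid, x_grid):
    """Return the theta minimizing the empirical L2 discrepancy
    between the estimated true process and the simulator."""
    best_theta, best_disc = None, np.inf
    for theta in theta_grid:
        disc = np.mean((xi_hat(x_grid) - simulator(x_grid, theta)) ** 2)
        if disc < best_disc:
            best_theta, best_disc = theta, disc
    return best_theta

# toy example (assumed): true process xi(x) = 2x, simulator f(x, theta) = theta * x
xi_hat = lambda x: 2.0 * x
sim = lambda x, theta: theta * x
x_grid = np.linspace(0, 1, 101)
theta_hat = l2_calibrate(xi_hat, sim, np.linspace(0, 4, 401), x_grid)
```

In this toy setting the discrepancy is minimized at theta_hat = 2, recovering the true process exactly; in practice a continuous optimizer would replace the grid search.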
Given the outputs from the K simulators and the observations from the physical experiments, the goal is to identify an optimal emulator f_0(x; θ_0, β_0) satisfying

f_0 = argmin_{k ∈ {1, . . . , K}} ||ξ(·) − f_k(·; θ_k, β_k)||_{L_2},  (2)

where θ_k and β_k are the true parameter settings associated with the surrogate model f_k and a prespecified calibration procedure, and ||·||_{L_2} is the L_2 norm. We call the simulator corresponding to the optimal emulator the optimal simulator. To estimate (2), we propose a leave-one-out cross-validation (LOOCV) method as follows. Define a LOOCV score by

Êrr_k = (1/n) Σ_{i=1}^n Q(ξ̂(x_i), f_k(x_i; θ̂_k(D^(−i)), β̂_k)),  (3)

where D^(−i) denotes the physical data with the ith observation removed, θ̂_k(D^(−i)) is the calibration parameter estimated from D^(−i), and ξ̂(·) is the estimated true process. To guarantee the theoretical properties in Section 3, θ̂_k(·) is required to be a √n-consistent estimator, which is obtainable by several calibration methods, including those discussed in Tuo and Wu (2015), Wong, Storlie, and Lee (2017), and Sung et al. (2020a); the necessary condition for ξ̂(·) can be achieved by commonly used nonparametric methods, such as kernel ridge regression and Gaussian processes (Stone 1982; Tsybakov 2008). In this article, we consider three types of loss functions for Q(·, ·): the squared loss, the zero-one loss, and the deviance loss (Efron 1986; Gneiting and Raftery 2007). Because β̂_k is estimated from D_k^s, we have β̂_k(D^(−1)) = · · · = β̂_k(D^(−n)). Therefore, with a slight abuse of notation, the dependence of β̂_k on D^(−i) is omitted in the cross-validation iterations. Based on Equation (3), we obtain the estimated optimal emulator

f_T(x; θ̂_T(D), β̂_T),  T = argmin_{k ∈ {1, . . . , K}} Êrr_k,  (4)

where θ̂_T(D) is the calibration parameter estimated from D. The procedure is summarized in Algorithm 1. This procedure can also be generalized to non-Gaussian outputs. Taking binary output as an example, the same procedure follows by replacing the true process with ξ(x) = P(y(x) = 1). A demonstration of Algorithm 1 for binary outputs is given in Section 5. The proposed procedure assumes that each simulator is represented by one emulator.
In practice, there are a number of surrogate models that can be used to construct emulators; therefore, it is crucial to carefully select one of them to represent the simulator. One approach is to modify the proposed LOOCV procedure by replacing f_k in Algorithm 1 with different surrogate models and selecting the one that minimizes Êrr_k. Furthermore, an accurate estimate of the true process, ξ̂(·), is crucial, as shown in the next section. To further enhance the asymptotic performance, the estimation of ξ(·) can be incorporated into the proposed LOOCV procedure by re-estimating it on each leave-one-out dataset, denoted by ξ̂(x_i; D^(−i)), after line 5 of Algorithm 1 instead of line 2.
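The LOOCV selection of Equations (3) and (4) can be sketched in a few lines: for each candidate simulator, recalibrate on each leave-one-out dataset and score the calibrated emulator against the estimated true process. The toy true process, the polynomial stand-in for kernel ridge regression, and the grid-search calibration below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# physical data from an assumed true process xi(x) = x + 2x^2, plus noise
n = 30
x = np.linspace(0.05, 1, n)
y = x + 2 * x**2 + rng.normal(0, 0.05, n)

# smoothed estimate of the true process (a polynomial fit standing in
# for kernel ridge regression, for brevity)
xi_hat = np.polyval(np.polyfit(x, y, 3), x)

# two toy "emulators" with a scalar calibration parameter theta
simulators = {
    1: lambda xx, th: th * xx,          # misspecified: linear
    2: lambda xx, th: th * xx**2 + xx,  # well specified up to theta
}

def ls_theta(xx, yy, sim):
    # one-dimensional least-squares calibration via grid search
    grid = np.linspace(0, 4, 801)
    sse = [np.sum((yy - sim(xx, th)) ** 2) for th in grid]
    return grid[int(np.argmin(sse))]

def loocv_score(sim):
    # Equation (3): compare the emulator, calibrated on D^(-i),
    # against the estimated true process at x_i
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        th = ls_theta(x[mask], y[mask], sim)
        errs.append((xi_hat[i] - sim(x[i], th)) ** 2)
    return np.mean(errs)

scores = {k: loocv_score(s) for k, s in simulators.items()}
T = min(scores, key=scores.get)  # Equation (4)
```

Here the well-specified simulator attains the smaller LOOCV score and is selected as T.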

Theoretical Properties
Algorithm 1 The algorithm for simulator selection
1: procedure SimulatorSelection(D, D_1^s, . . . , D_K^s)
2:   Estimate the true process ξ̂.
3:   for each k in 1, 2, . . . , K do
4:     Construct emulator f_k(x; θ_k, β̂_k) using D_k^s.
5:     for each i in 1, 2, . . . , n do
6:       Obtain the estimated calibration parameter θ̂_k(D^(−i)).
7:       Compute Q(ξ̂(x_i), f_k(x_i; θ̂_k(D^(−i)), β̂_k)).
8:     end for
9:     Compute the LOOCV score Êrr_k by Equation (3).
10:  end for
     return The selected optimal simulator f_T, T = argmin_k Êrr_k.
11: end procedure

In the following lemma, it is shown that the LOOCV score in Equation (3) can be decomposed into the goodness of fit of the emulator and a quantity penalizing the flexibility of the emulator. Therefore, minimizing (3) implies a minimization of not only the discrepancy between physical and computer experiments but also the sensitivity of the emulator. The detailed proofs can be found in the supplemental material Section 2.
The decomposition of Equation (3) requires an expression of the loss function introduced by Efron (1986). That is,

Q(u, v) = q(v) + q̇(v)(u − v) − q(u),  (5)

where q(·) is a concave function and q̇(·) is its first-order derivative. For example, as shown in the supplementary material Section 6, the squared loss function Q(u, v) = (u − v)^2 can be expressed by Equation (5) with q(u) = −u^2.

Lemma 1. Suppose the loss function Q admits the expression in Equation (5). Then the LOOCV score in Equation (3) can be decomposed as

Êrr_k = ērr_k + GD_k/n,  (6)

where

ērr_k = (1/n) Σ_{i=1}^n Q(ξ̂(x_i), f_k(x_i; θ̂_k(D), β̂_k))  (7)

and GD_k is the generalized degrees of freedom of the kth emulator given in Equation (8).

The quantity ērr_k in Equation (7) determines how well the emulator fits the physical observations, and the quantity GD_k in (8) is the generalized degrees of freedom for the kth emulator, which is an analogue of the "optimism" in Efron (1986) and of the generalized degrees of freedom for linear models in Ye (1998). The quantity GD_k can be estimated by ĜD_k = n(Êrr_k − ērr_k) and can be interpreted as the sum of the sensitivities of the kth estimated emulator to perturbations in the corresponding physical observations y_i − ξ̂(x_i). If the emulator is highly flexible/sensitive, then the values of f_k tend to have a higher correlation with y_i − ξ̂(x_i), which leads to a larger penalty. The sensitivity is mainly associated with calibration because β̂_k is fixed across the cross-validation iterations, so only the calibration parameters vary. In contrast, for a naive application in which LOOCV is evaluated against the physical observations y_i directly instead of ξ̂ in (3), the value of GD_k is always zero, indicating no penalty for sensitivity. This result demonstrates the novelty of the proposed LOOCV, where a data-driven penalty is incorporated and the penalty automatically distinguishes the impact of calibration from that of regular parameter estimation. The following special case shows that the generalized degrees of freedom equals the number of calibration parameters. The detailed proof is given in supplemental material Section 3.
Proposition 1. Suppose f_k(x_i; θ_k, β̂_k) is linear in θ_k for i = 1, . . . , n and k = 1, . . . , K. If θ_k is estimated by the least-squares method, the generalized degrees of freedom GD_k in (8) is equal to the dimension of θ_k.
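The estimate ĜD_k = n(Êrr_k − ērr_k) can be computed directly from the two scores. Below is a minimal sketch in which a deliberately underfit linear emulator is calibrated by closed-form least squares and a polynomial fit stands in for ξ̂; the data-generating model and these simplifications are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
x = np.linspace(0.05, 1, n)
y = 3 * x + 2 * x**2 + rng.normal(0, 0.1, n)  # assumed physical observations

# stand-in for the estimated true process (a cubic polynomial fit
# instead of kernel ridge regression, for brevity)
xi_hat = np.poly1d(np.polyfit(x, y, 3))(x)

sim = lambda xx, th: th * xx  # deliberately underfit emulator, scalar theta

def ls_theta(xx, yy):
    # closed-form least-squares calibration for f(x, theta) = theta * x
    return float(xx @ yy / (xx @ xx))

# err-bar (goodness of fit, Equation (7)): theta estimated from all of D
err_bar = np.mean((xi_hat - sim(x, ls_theta(x, y))) ** 2)

# Err (LOOCV score, Equation (3)): theta re-estimated on each D^(-i)
loo = []
for i in range(n):
    m = np.arange(n) != i
    loo.append((xi_hat[i] - sim(x[i], ls_theta(x[m], y[m]))) ** 2)
Err = np.mean(loo)

GD = n * (Err - err_bar)  # estimated generalized degrees of freedom
```

The leave-one-out recalibration inflates the score relative to the full-data fit, so the estimated GD is positive and grows with the emulator's sensitivity to each left-out observation.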
The proposed selection procedure can also be applied to the conventional calibration problem with K = 1. The following result shows that the estimated calibration parameters and the resulting discrepancy based on the proposed leave-one-out procedure asymptotically converge in probability to those obtained by the conventional L_2 calibration (Tuo and Wu 2015; Sung et al. 2020a). The proof is given in supplemental material Section 4.
For the estimated optimal emulator, its estimated prediction error and sensitivity are denoted by Êrr_T and ĜD_T = n(Êrr_T − ērr_T). For the optimal emulator f_0(x; θ_0, β_0) defined in (2), we denote its prediction error by Err_0 = (1/n) Σ_{i=1}^n Q(ξ(x_i), f_0(x_i; θ_0, β_0)) and the corresponding sensitivity by GD_0. In the following theorem, it is shown that the estimated prediction error Êrr_T and the estimated ĜD_T are asymptotically equivalent to those calculated for the optimal emulator f_0(x; θ_0, β_0). The proof is given in supplemental material Section 5.

Numerical Studies
In this section, the true process ξ(x) is estimated by kernel ridge regression, that is, by minimizing the penalized loss function

ξ̂ = argmin_{g ∈ N} (1/n) Σ_{i=1}^n (y_i − g(x_i))^2 + λ||g||^2_N,  (10)

where λ > 0 is a penalty parameter and ||·||_N is the norm of the reproducing kernel Hilbert space N generated by a kernel function Φ(·). We consider a Gaussian kernel Φ(h) = exp(−h^2/(2τ^2)), where τ is the length-scale parameter. The penalty parameter λ in Equation (10) and the length-scale parameter τ are chosen by 10-fold cross-validation. The emulators are then constructed by Gaussian process (GP) models,

f_k(x; θ_k, β_k) ~ GP(μ_k, σ^2_k Φ((x′, θ′_k), (x, θ_k))),  (11)

where Φ((x′, θ′_k), (x, θ_k)) is a Matérn kernel with roughness coefficient 2.5 for the kth emulator, that is,

Φ(h) = (2^{1−ν}/Γ(ν)) (√(2ν) h/ρ_k)^ν K_ν(√(2ν) h/ρ_k),  ν = 2.5,

where h is the distance between the inputs, Γ(·) is the gamma function, K_ν(·) is the modified Bessel function of the second kind, and ρ_k is the range parameter. The model parameters β_k = (μ_k, σ^2_k, ρ_k) in (11) include the unknown mean, variance, and range parameter for the kth emulator and are estimated by the empirical maximum likelihood method (Santner et al. 2018, sec. 3.3). The calibration parameters are estimated by the L_2 calibration method (Tuo and Wu 2015).
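The kernel ridge regression step in Equation (10) has the closed-form solution α = (K + nλI)^{-1} y, with ξ̂(x) = Σ_i α_i Φ(x_i, x). A minimal sketch with a Gaussian kernel; the toy data and the fixed (λ, τ) values are illustrative assumptions (this article selects them by 10-fold cross-validation):

```python
import numpy as np

def gauss_kernel(a, b, tau):
    # Gaussian kernel Phi(h) = exp(-h^2 / (2 tau^2)) for 1-d inputs
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * tau**2))

def krr_fit(x, y, lam, tau):
    # kernel ridge regression: alpha = (K + n*lam*I)^{-1} y
    K = gauss_kernel(x, x, tau)
    alpha = np.linalg.solve(K + len(x) * lam * np.eye(len(x)), y)
    return lambda xnew: gauss_kernel(xnew, x, tau) @ alpha

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 50)  # assumed toy data
xi_hat = krr_fit(x, y, lam=1e-3, tau=0.2)
```

The returned xi_hat can then be evaluated at the physical design points in the LOOCV score of Equation (3).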

Example 1: The Branin Function
Two simulators are constructed from the Branin function with two different sets of calibration parameters:

m_k(x_1, x_2; θ_k) = (x_2 − b x_1^2 + c x_1 − r)^2 + 10(1 − 1/(8π)) cos(x_1) + 10,

where simulator m_1 contains the calibration parameters θ_1 = (b, r), simulator m_2 contains the calibration parameters θ_2 = (b, c), the remaining coefficient in each simulator is held fixed, and b ∈ [0, 2], r ∈ [5, 7], c ∈ [4, 6]. For both simulators, computer experiments are conducted using a 60-run maximum projection design (Joseph, Gul, and Ba 2015). Simulator m_2 is used as the true process to generate physical experiments by y(x_1, x_2) = m_2(x_1, x_2; θ_2) + ε, where the inputs x_1 and x_2 are 30 Sobol points, the calibration parameters are set to θ_2 = (1.275, 5), and ε ~ N(0, 4). The true process is estimated by minimizing the loss function (10) with the length-scale parameter 2.635, selected by 10-fold cross-validation. Based on Equation (3) and Lemma 1, the leave-one-out cross-validation scores for the two simulators are reported in Table 1 with the estimated generalized degrees of freedom. By the proposed criterion, the selected optimal simulator is T = 2, which agrees with the numerical settings. Furthermore, the estimated generalized degrees of freedom for the two simulators are similar, which implies a similar sensitivity due to the calibration parameters for the two simulators. This observation also agrees with the numerical settings, in which an equal number of calibration parameters is associated with each simulator.
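Under the assumption that the two simulators share the standard Branin-type surface with (b, c, r) as adjustable coefficients, the data-generating step of this example can be sketched as follows; the uniform design points are an illustrative simplification of the maximum projection and Sobol designs used here.

```python
import numpy as np

def branin(x1, x2, b, c, r):
    # Branin-type response surface; (b, c, r) are the coefficients
    # that the two simulators treat as calibration parameters
    return (x2 - b * x1**2 + c * x1 - r) ** 2 \
        + 10 * (1 - 1 / (8 * np.pi)) * np.cos(x1) + 10

# physical data generated from m2 with theta_2 = (b, c) = (1.275, 5),
# r held fixed at 6 (an assumed value), and N(0, 4) noise
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 30), rng.uniform(0, 1, 30)
y = branin(x1, x2, b=1.275, c=5.0, r=6.0) + rng.normal(0, 2.0, 30)
```

With these data, each candidate simulator would be calibrated on the leave-one-out sets and scored via Equation (3), exactly as in Algorithm 1.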

Example 2: Multi-Fidelity Simulators
The proposed procedure is demonstrated using two simulators introduced by Goh et al. (2013) for the study of multi-fidelity simulations: a low-fidelity simulator m_1 and a high-fidelity simulator m_2, with calibration parameters θ_1 = (t_s, t) ∈ [0, 1]^2 and θ_2 = (t_s, t_h) ∈ [0, 1]^2. When the high-fidelity simulator is used, the calibration parameter t in θ_1 is set to 0.1. The physical experiments are generated by

y(x_1, x_2) = m_2(x_1, x_2; θ_2) + (10x_1^2 + 4x_2^2)/(50x_1x_2 + 10) + ε,

with the calibration parameters set to θ_2 = (0.2, 0.3) and ε ~ N(0, 0.25). These functions are shown in Figure 1. The goal is to identify the optimal simulator based on the observed physical data, which is different from the conventional goal in the study of multi-fidelity simulations.
A 40-run maximum projection design is used for the two simulators, and the physical experiments are performed based on 30 Sobol' points. The true process is estimated by minimizing (10) with the length-scale parameter 0.584, selected by 10-fold cross-validation. For the two simulators, the estimated LOOCV scores and the estimated generalized degrees of freedom are reported in Table 2. Because Êrr_1 > Êrr_2, the high-fidelity simulator m_2 is selected as the optimal simulator according to Equation (4). Based on Table 2, the high-fidelity simulator has a larger ĜD_k than the low-fidelity simulator, which indicates a slightly higher flexibility and therefore a larger penalty for sensitivity.

Example 3: The Study of Simulator Sensitivity
To demonstrate the performance of the generalized degrees of freedom with respect to different sensitivities of simulators, we consider two simulators with different numbers of calibration parameters: m_1(x; θ_1) = δ_1 x + δ_2 x^2 + δ_3 x^3 and m_2(x; θ_2) = θ_2 x, where θ_1 = (δ_1, δ_2, δ_3) and θ_2 are the calibration parameters. Physical experiments are generated from y(x) = x + 2x^2 + 3x^3 + 0.1 sin(20x) + ε, where ε ~ N(0, 0.25). These functions are illustrated in Figure 2(a). A 100-run maximum projection design is used to generate computer experiments based on the two simulators, and a 61-run maximin design is implemented for the physical experiments.
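Because both simulators here are linear in their calibration parameters, least-squares calibration has a closed form, and the richer simulator m_1 necessarily achieves a smaller residual on the same data. A minimal sketch (the noise seed and the equally spaced inputs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 61)
y = x + 2 * x**2 + 3 * x**3 + 0.1 * np.sin(20 * x) + rng.normal(0, 0.5, 61)

# both simulators are linear in their calibration parameters, so
# least-squares calibration reduces to ordinary least squares
X1 = np.column_stack([x, x**2, x**3])  # m1: theta_1 = (d1, d2, d3)
X2 = x[:, None]                        # m2: theta_2 scalar

theta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
theta2, *_ = np.linalg.lstsq(X2, y, rcond=None)

sse1 = np.sum((y - X1 @ theta1) ** 2)
sse2 = np.sum((y - X2 @ theta2) ** 2)
```

The nesting guarantees sse1 <= sse2; the generalized degrees of freedom is what converts m_1's extra flexibility into a penalty rather than an automatic advantage.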
Based on 100 replicates, the average of ĜD_1 is 3.143 with standard deviation 0.342, and the average of ĜD_2 is 1.294 with standard deviation 0.154. These results are summarized in the boxplots in Figure 2(b). The estimated generalized degrees of freedom for the first simulator is roughly three times that of the second simulator, which reflects the higher sensitivity of the first simulator due to its larger number of calibration parameters and higher-order polynomial.

Optimal Simulator for T-cell Signaling
It has long been known that the adaptive immune system defends the organism against disease through the recognition of pathogens by T cells. The T-cell receptor (TCR) is the primary molecule on the T cell for detecting foreign antigens, which are presented by peptide-major histocompatibility complex (pMHC) molecules expressed by infected cells. However, much remains unknown about the underlying antigen recognition mechanism.
To understand the recognition mechanism through TCR-pMHC interactions, biologists (Rittase 2018) have developed micropipette adhesion frequency assays, which are physical experiments performed in a laboratory. Although micropipette assays allow accurate measurements, they are time-consuming and often involve complicated experimental manipulation. Furthermore, some variables of interest cannot be studied in the lab due to the technical complexity of carrying out the experiments. A more cost-effective approach is to illuminate the unknown recognition mechanism through computer simulations. Based on the idea of the kinetic proofreading model, two simulators are developed under two different recognition mechanisms: one is the conformation-change mechanism (denoted by CC in Figure 3(a)), and the other is the receptor-pulling mechanism (denoted by RP in Figure 3(b)). The two mechanisms correspond to two different modes of TCR-pMHC interaction: either the molecules undergo a conformational change due to the binding, or a force is involved due to the pulling of the TCR-pMHC bond (Rittase 2018). Biologists are interested in understanding which mechanism is behind the recognition process, but it cannot be directly detected by physical experiments. Therefore, the goal of this study is to identify the optimal mechanism based on the observed experimental data from the laboratory. Two control variables, contact time x_ct and waiting time x_wt, are involved both in the lab experiments and in the simulators. Denote x = (x_wt, x_ct). Four calibration parameters, denoted by x_Kf, x_Kr, x_Kr,p, and x_Kc, are involved in the CC mechanism, while only the first three are involved in the RP mechanism. The descriptions of the variables are given in Table 3, and further details can be found in Sung et al. (2020a). The two mechanisms are simulated by the Gillespie (1976) algorithm, a stochastic simulation algorithm.
The experimental outputs are binary, indicating whether a TCR-pMHC binding occurs. A 60-run OA-based Latin hypercube design (Tang 1993) is implemented for the two simulators, and each design point consists of 10 replicates to capture the cell-to-cell variability; therefore, the sample size of the computer experiment is 600 for each mechanism. For the physical experiments, the sample size is n. Given the binary binding outcomes y(x) observed in the laboratory, the true process is defined as the binding probability, ξ(x) = P(y(x) = 1), and estimated by a kernel logistic regression,

logit{ξ̂(x)} = α_0 + Σ_{i=1}^n α_i Φ(x_i, x),

where logit{·} is the logistic link function, {α_i}_{i=0}^n are the estimated coefficients, and Φ(x′, x) is the Matérn kernel with roughness parameter ν_0 = 2.5. Define p(x; θ) = P(y^s(x; θ) = 1), where y^s(x; θ) is the simulated binary outcome; its emulators are constructed by the generalized Gaussian process models (Sung et al. 2020b), where Φ((x′, θ′_k), (x, θ_k)) is the Matérn kernel with roughness parameter ν = 1.5 and β_k = (μ_k, σ^2_k, ρ_k) includes the mean, variance, and range parameters for the CC simulator (k = 1) or the RP simulator (k = 2). β_k is estimated by the empirical maximum likelihood method as in Section 4. The calibration parameters are estimated by minimizing the L_2 discrepancy proposed by Sung et al. (2020a).
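A minimal sketch of the kernel logistic regression step with a Matérn-2.5 kernel, fit by plain gradient descent on a ridge-penalized logistic loss; the toy binding data, the ridge penalty (in place of an RKHS penalty), and the optimization settings are illustrative assumptions, not the estimation details of this article.

```python
import numpy as np

def matern25(a, b, rho):
    # Matern kernel with roughness parameter 2.5 for 1-d inputs
    h = np.abs(a[:, None] - b[None, :]) / rho
    s = np.sqrt(5.0) * h
    return (1 + s + s**2 / 3) * np.exp(-s)

# toy binary outcomes: binding occurs for x above 0.5 (an assumption)
x = np.linspace(0, 1, 40)
y = (x > 0.5).astype(float)

K = matern25(x, x, rho=0.3)
alpha, alpha0, lam, lr = np.zeros(40), 0.0, 1e-3, 0.2

# fit logit{xi(x)} = alpha0 + sum_i alpha_i * Phi(x_i, x)
# by gradient descent on the penalized logistic loss
for _ in range(3000):
    p = 1 / (1 + np.exp(-(K @ alpha + alpha0)))
    g = K.T @ (p - y) / len(x) + lam * alpha
    alpha -= lr * g
    alpha0 -= lr * np.mean(p - y)

xi_hat = 1 / (1 + np.exp(-(K @ alpha + alpha0)))  # estimated binding probability
```

The fitted xi_hat plays the role of the estimated true process in Equation (3) when the procedure is applied to binary outputs.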
The leave-one-out cross-validation errors for the two simulators are summarized in Table 4, along with the estimated generalized degrees of freedom. The optimal simulator is the CC mechanism because its LOOCV score is smaller, while its sensitivity is slightly higher than that of the RP mechanism. From a biological perspective, the selection of the CC mechanism indicates that the molecules undergo conformational changes due to the TCR-pMHC binding. Analyzing the CC mechanism using all the data, we have β̂_1 = (μ̂_1, σ̂^2_1, ρ̂_1) = (−0.322, 1.732, 3.089) and θ̂_1 = (1.560, 8.563 × 10^−7, 1.425, 1.589). By plugging in the estimated calibration parameters, we obtain the simulated binding probability according to the CC mechanism, shown as red dashed lines in the corresponding figure.

Summary and Concluding Remarks
In many applications, identifying an optimal simulator for the observed physical experiments can provide scientific insights that are not available from lab experiments. There has been, however, no systematic statistical method to tackle this problem. We propose a new criterion based on the idea of leave-one-out cross-validation. Theoretical properties of the selection method based on this criterion and of the estimated optimal simulator are discussed. It is also shown that, asymptotically, the proposed approach includes the L_2 calibration method as a special case. Simulation studies are conducted to demonstrate the performance of the proposed method. By applying the proposed method, the selected optimal T-cell signaling simulator suggests that the true binding mechanism operates through conformational changes in the molecules, which may shed new light on the antigen recognition mechanism in the human immune system.
As pointed out by one of the reviewers, an important and interesting research direction is to understand the convergence properties of the estimated optimal emulator f_T, such as its convergence rate to the optimal emulator f_0. A promising direction is to extend the recent results of Wang, Tuo, and Wu (2020), which apply to deterministic functions, to stochastic functions. This will be considered in future research.

Supplementary Materials
The online supplemental material contains more technical details of this paper, including the assumptions used in Theorems 3.3 and 3.4, and the detailed proofs of the lemma, proposition, and theorems.