Quickly Assessing Contributions to Input Uncertainty

“Input uncertainty” refers to the (often unmeasured) variability in simulation-based performance estimators that is a consequence of driving the simulation with input models (e.g., fully specified univariate distributions of i.i.d. inputs) that are based on real-world data. In 2012 Ankenman and Nelson presented a quick-and-easy diagnostic experiment to assess the overall effect of input uncertainty on simulation output. When their method reveals that input uncertainty is substantial, then the natural next questions are: which input distributions contribute the most to input uncertainty, and from which input distributions would it be most beneficial to collect more data? They proposed a possibly lengthy sequence of additional diagnostic experiments to answer these questions. In this paper we provide a method that obtains an estimator of the overall variance due to input uncertainty, the relative contribution to this variance of each input distribution, and a measure of the sensitivity of overall uncertainty to increasing the real-world sample size used to fit each distribution, all from a single diagnostic experiment. Our approach exploits a metamodel that relates the means and variances of the input distributions to the mean response of the simulation output, and bootstrapping of the real-world data to represent input-model uncertainty. Further, we investigate whether and how the simulation outputs from the nominal and diagnostic experiments may be combined to obtain a better performance estimator. For the case when the analyst obtains additional real-world data, refines the input models, and runs a follow-up experiment, we analyze whether and how the simulation outputs from all three experiments should be combined. Numerical illustrations are provided.


Introduction
There is increasing recognition of the need to quantify all sources of error in mathematical and computer models, including stochastic simulations. Every simulation language measures the statistical error due to sampling from the input models, typically via Confidence Intervals (CIs) on the performance measures. However, these CIs do not account for the possible (in fact, likely) misspecification of the input models when they are estimated from real-world data. For instance, later in this article we consider the simulation of a remote order-taking system for customers using a drive-in service at a chain of fast-food restaurants; this simulation was created to estimate a measure of customer delay. Real-world data on customer arrivals, the time it takes an agent to obtain a customer's order, and the time needed for a car to move beyond the order board are used to fit input models that drive the simulation. Because we only have a finite quantity of real-world data, these input models are imperfect representations of the actual processes. As shown in many papers (e.g., Barton (2012), Barton et al. (2014), Cheng and Holland (1998, 2004), Chick (2001), and Wilson (2003, 2004)), the error due to "input uncertainty" can overwhelm the simulation sampling error. These papers provide overall measures of input uncertainty, such as adjusted CIs or Bayesian credible intervals, whereas we focus on assessing the contribution of each input model to input uncertainty as a guide to collecting more real-world data.
A predecessor of this article, Ankenman and Nelson (2012) presented a quick-and-easy diagnostic experiment to assess the overall effect of input uncertainty relative to simulation sampling variability, and a follow-up method for estimating contributions. Unfortunately, their method for identifying the input models that contribute the most to input uncertainty requires a sequence of additional diagnostic experiments; in the worst case it requires as many experiments as there are input models, and each of these experiments can be substantial. Furthermore, the variance model that underlies their diagnostic experiments has no rigorous justification.
In this article, we provide a new analysis that requires only one diagnostic experiment to assess the overall effect of input uncertainty, the relative contribution of each input distribution, and a measure of sample size sensitivity of each distribution. Using these results, the analyst can decide whether it is worth the time and expense to collect additional data and on which input processes to do so. If the analyst decides to collect additional real-world input data to refine the input models, then yet another simulation experiment would be conducted. Thus, there are potentially three sets of simulation output data: nominal experiment, diagnostic experiment, and follow-up experiment. We study when and how these data may be combined to produce better performance estimates.
We obtain our measures of overall input uncertainty, contributions, and sample-size sensitivities using the following approach:
1. Following the nominal experiment, we take repeated bootstrap samples from the real-world data and use these data to create alternative sets of input distributions representing what could have occurred with different real-world samples.
2. Using these alternative input distributions, we fit a regression metamodel that relates the mean of the simulation output to the means and variances of the input distributions.
3. From the metamodel, we derive expressions for the overall variance in the simulation output due to input uncertainty, the contribution to this variance of each input distribution, and the reduction in overall variance that would result from one additional real-world sample of data to fit each input distribution.
The measures computed in step 3 can be used to guide additional real-world data collection and to heuristically adjust measures of error, such as CIs, to account for both input and simulation variability. Ours is not the first attempt to decide from which input processes to collect more real-world data to reduce input uncertainty. Freimer and Schruben (2002) considered uncertainty in the estimated parameters of parametric input distributions (e.g., exponential with parameter λ, gamma with parameters α and β). Similar to the present article, they used bootstrapping of real-world input data to mimic the effect of different possible real-world samples and the corresponding input-parameter estimates they would yield. The basic premise of Freimer and Schruben (2002) was that sufficient real-world data on the parameters have been collected when their sampling distributions, as represented by bootstrap values, have no statistically detectable effect on the simulation output.
After what we call the nominal experiment, Chick (2001, 2006) attempted to optimally allocate a finite amount of additional effort-additional real-world input-data collection and additional replications of the simulation-to reduce overall uncertainty about the simulated system performance. Similar to our approach, they employed a regression model to relate the inputs to the outputs. Their goal was to collect additional real-world input observations and additional simulation replications to minimize the posterior variance of the simulation point estimator subject to a budget constraint. Freimer and Schruben (2002) collected additional input data until they could establish that input uncertainty was negligible, whereas Chick (2001, 2006) optimally allocated the data-collection and simulation-replication budget to minimize overall uncertainty. Both assumed that it was possible to collect input data from any of the input distributions in whatever quantity was desired or affordable. We take the perspective that additional real-world data are often unattainable, at least for some of the input processes, and the quantity that can be obtained is more likely to be constrained by time than cost per observation; therefore, if we can get more data, we will get as much as possible. The insight we deliver starts with an overall assessment of input uncertainty, which is useful for understanding risk even if there is no follow-up experiment. When additional input data are to be collected, then our relative contributions and sensitivities provide guidance about the best targets.
This article is organized as follows. Section 2 defines the input uncertainty problem and sets up our model of it. In Section 3 we describe the sequence of experiments (nominal, diagnostic, and follow-up), focusing on the diagnostic experiment for assessing input uncertainty and the contribution of each input distribution. When and how to combine output data from these experiments is addressed in Section 4. Section 5 provides guidelines for the design of the diagnostic experiment. Section 6 summarizes results from an empirical study and an illustrative example, followed by conclusions in Section 7.

Problem formulation
In this section we present a definition of "input uncertainty" and introduce our model of it.

Definition of input uncertainty
In this article, we consider mutually independent input processes, each consisting of independent and identically distributed (i.i.d.) random variables whose marginal distribution is unknown. We use estimated or "fitted" distributions based on real-world data as stand-ins for the unknown, true distributions. We do not consider multivariate or time-dependent input processes here.
Suppose that there are L mutually independent input processes characterized by a collection of true real-world marginal distributions F^c = {F^c_1, F^c_2, ..., F^c_L}, where c denotes "correct." Since these distributions are unknown, we use a corresponding collection of estimated distributions F̂ = {F̂_1, F̂_2, ..., F̂_L} to drive the simulation. The ℓth estimated marginal distribution, F̂_ℓ, can be either a parametric or an empirical distribution, but in either case it is inferred from observed real-world data X_{ℓ1}, X_{ℓ2}, ..., X_{ℓ m_ℓ} ~ i.i.d. F^c_ℓ, where m_ℓ indicates the number of observations for the ℓth input model. We only consider m_ℓ > 0 and thus do not address subjectively specified distributions for which we have no data.
Given a collection of input distributions F̂, the simulation generates performance output Y_j(F̂) on i.i.d. replication j = 1, 2, ..., n. Ankenman and Nelson (2012) represented this output as
\[ Y_j(\hat F) = \mathrm{E}[Y(\hat F) \mid \hat F] + \varepsilon_j, \qquad (1) \]
where ε_1, ε_2, ..., ε_n are i.i.d. with mean zero and variance σ², representing the natural stochastic variability from replication to replication of the simulation. They assumed that ε_j does not depend on the input model F̂, which is clearly an approximation and one that we also adopt. It is important to notice that E[Y(F̂) | F̂] is a random variable since it is a functional of F̂, which is estimated from real-world data.
In other words, depending on the real-world observations that are used to infer F^c, we could have different values of E[Y(F̂) | F̂]. The goal of our simulation is to estimate E[Y(F^c)], which we typically do with the sample mean Ȳ(F̂) = Σ_{j=1}^{n} Y_j(F̂)/n for given F̂. This article focuses on how to assess the relative impact of each F̂_ℓ on the variability of the simulation estimator Ȳ(F̂). As the number of simulation replications n increases, Ȳ(F̂) converges to E[Y(F̂) | F̂], which is not necessarily the same as E[Y(F^c)]. In fact, there is typically a bias in the estimator Ȳ(F̂), arising because F̂ is inferred from a finite number of observations and the simulation is a nonlinear transformation of its inputs. The traditional confidence interval captures only Var[Ȳ(F̂) | F̂] = σ²/n for given F̂. Therefore, we need a different approach to account for the variability of the simulation estimator that is due to the input models, which we refer to as input uncertainty.
Formally, the input uncertainty σ_I² of the simulation estimator Ȳ(F̂) is defined as
\[ \sigma_I^2 = \mathrm{Var}\bigl[\mathrm{E}[Y(\hat F) \mid \hat F]\bigr], \qquad (2) \]
where the Var[·] is with respect to the sampling distribution of F̂. Therefore, Var[Ȳ(F̂)] can be decomposed as
\[ \mathrm{Var}[\bar Y(\hat F)] = \mathrm{E}\bigl[\mathrm{Var}[\bar Y(\hat F) \mid \hat F]\bigr] + \mathrm{Var}\bigl[\mathrm{E}[\bar Y(\hat F) \mid \hat F]\bigr] = \frac{\sigma^2}{n} + \sigma_I^2, \]
where the first expression is general, and the second follows from the definition of σ_I² and the homoscedasticity assumption in Model (1). Ankenman and Nelson (2012) introduced the ratio γ = σ_I/(σ/√n) as a measure of the significance of input uncertainty relative to the simulation estimator's variability. Suppose that the analyst chooses n large enough so that the estimator variance σ²/n is reasonably small. Then a very small γ implies that σ_I² ≪ σ²/n; i.e., the input uncertainty is insignificant. On the other hand, if γ is large, then σ_I² ≫ σ²/n; i.e., there is significant input uncertainty in the simulation estimator. In the latter case, a natural question is: which input models contribute the most to the input uncertainty?
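To make the roles of σ²/n and σ_I² concrete, the short sketch below computes γ and the variance decomposition for purely illustrative numbers; in practice σ_I² would come from the diagnostic experiment described later, so treat every value here as an assumption.

```python
import math

# Hypothetical values for illustration only (not from the paper's experiments).
sigma2 = 4.0      # natural simulation variance per replication, sigma^2
n = 1000          # replications in the nominal experiment
sigma2_I = 0.25   # input-uncertainty variance, sigma_I^2 (from a diagnostic experiment)

est_var = sigma2 / n                     # Var[Ybar | F-hat] = sigma^2 / n
total_var = est_var + sigma2_I           # Var[Ybar] = sigma^2/n + sigma_I^2
gamma = math.sqrt(sigma2_I) / math.sqrt(est_var)   # gamma = sigma_I / (sigma / sqrt(n))

print(f"simulation variance of the estimator: {est_var:.4f}")
print(f"input-uncertainty variance:           {sigma2_I:.4f}")
print(f"gamma = {gamma:.2f}")
# A CI half-width based on sigma^2/n alone understates the error by roughly sqrt(1 + gamma^2).
print(f"CI inflation factor ~ {math.sqrt(1 + gamma**2):.2f}")
```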
Our proposed definition of the contribution of F̂_ℓ to input uncertainty is
\[ V_\ell(m_\ell) = \mathrm{Var}\bigl[\mathrm{E}[Y(F_1^c, \ldots, F_{\ell-1}^c, \hat F_\ell, F_{\ell+1}^c, \ldots, F_L^c) \mid \hat F_\ell]\bigr]. \qquad (4) \]
In other words, V_ℓ(m_ℓ) is the variability in the simulation's expected value when all of the true input distributions except F^c_ℓ are known and F^c_ℓ is estimated by F̂_ℓ. Notice that V_ℓ is written as a function of the sample size m_ℓ to make it clear that the contribution depends on the number of observations; the larger m_ℓ is, the smaller V_ℓ(m_ℓ) becomes as F̂_ℓ approaches F^c_ℓ. The relative contribution of the ℓth input model is V_ℓ(m_ℓ)/Σ_{i=1}^{L} V_i(m_i).
We also define the (sample-size) sensitivity of the variance of Ȳ(F̂) with respect to the ℓth input model by approximating m_ℓ as real valued and taking the derivative
\[ \frac{\partial\, \mathrm{Var}[\bar Y(\hat F)]}{\partial m_\ell}\bigg|_{m_\ell}, \]
which is the same as ∂σ_I²/∂m_ℓ evaluated at m_ℓ if we assume homoscedastic simulation variance. The sensitivity can be interpreted as a measure of how much input uncertainty can be reduced by observing one more real-world sample from the ℓth input process, given that we already have m_ℓ observations.
The input uncertainty problem is related to global sensitivity analysis; here we briefly describe the similarities and differences. Suppose we have a response y = g(x) that is a function of some parameters x = (x_1, x_2, ..., x_L). The response y might be the objective-function value in an optimization problem or the key output from a deterministic numerical simulation, for instance. However, the parameters x are not actually known with certainty and therefore could be modeled as a random vector X with known distribution F^c. Loosely speaking, the goal of global sensitivity analysis is to assign to each parameter X_1, X_2, ..., X_L a measure of impact on the random variable Y = g(X); the focus is often on decomposing Var[Y]. Many of these measures are computationally expensive.
For instance, Wagner (1995) defined two global sensitivity measures for the objective function of an optimization problem with uncertain parameters. One measure of sensitivity for the ℓth parameter was based on the variance of the conditional expectation of g(X) given all parameters except X_ℓ, whereas the other was the variance of the conditional expectation of g(X) given only X_ℓ. Homma and Saltelli (1996) proposed a related variance-based sensitivity measure: the ratio of the variance of the total effect (main effect and all the interaction effects of the parameter of interest) to the variance of Y. Oakley and O'Hagan (2004) introduced the idea of replacing evaluations of g(x) with evaluations of a Gaussian process metamodel ĝ(x) that is fit to a chosen set of parameters and outputs (x_i, g(x_i)), i = 1, 2, ..., k. Because generation of X ~ F^c and evaluation of ĝ(X) are fast, this facilitates computing any of the global sensitivity measures described above as well as others; furthermore, the Gaussian process metamodel supports incorporating uncertainty about the true function g(·) into the analysis. Marrel et al. (2012) extended this idea to stochastic simulations in which we can only observe g with noise: g(x) + ε(x). They used joint metamodels for g(·) and Var[ε(·)] to estimate a variance-based sensitivity measure using a functional analysis of variance (ANOVA) decomposition of g. Their setting is the closest to ours in that they do sensitivity analysis in the presence of stochastic simulation output.
Returning to deterministic outputs, Plischke et al. (2013) proposed density-based sensitivity measures as an alternative to variance-based measures. They evaluated the expected difference between the unconditional probability density of Y and its conditional density given X = x . The larger the expected difference is, the more sensitive the output is to this parameter.
From our perspective, global sensitivity analysis tries to assess the effect of each distribution F^c_1, F^c_2, ..., F^c_L on Y(F^c); that is, it decomposes the (simulation) variability represented by ε. Input uncertainty arises when F^c is estimated by F̂, and an assessment tries to measure the impact of variability in F̂ (and in this paper, each F̂_ℓ) in the presence of simulation variability. Input uncertainty due to F̂_ℓ can depend on how sensitive the output is to the ℓth distribution but also on how well that distribution is estimated.

The mean-variance effects model
As noted earlier, E[Y(F̂) | F̂] is a random variable depending on F̂; therefore, we can think of it as a functional of F̂; i.e., g(F̂) = E[Y(F̂) | F̂]. We suggest (and justify below) the following mean-variance effects model for g(F̂):
\[ g(\hat F) = \beta_0 + \sum_{\ell=1}^{L} \beta_\ell\, \mu(\hat F_\ell) + \sum_{\ell=1}^{L} \nu_\ell\, \sigma^2(\hat F_\ell), \qquad (6) \]
where μ(F̂_ℓ) and σ²(F̂_ℓ) represent the mean and the variance of a random variable with distribution F̂_ℓ, respectively, and the β_ℓ and ν_ℓ are constant coefficients. The philosophy behind this model is that the sensitivity of the mean simulation output to the particular realization of F̂_ℓ is largely captured by the realized center (mean) and spread (variance) of the distribution. This model could be extended to include higher moments such as skewness and kurtosis, or the 25th and 75th percentiles could be chosen instead of the mean and variance. However, the essence of the model is to represent the relationship between each F̂_ℓ and E[Y(F̂) | F̂] through some characteristic properties of F̂_ℓ. As we show later, there are advantages to using the mean and variance when we want to estimate the contribution of each input model. This model does not include interaction terms; we expect that in many cases the main effects are more significant than the interaction effects and capture a large part of the impact of F̂ on the output.

The contribution of F̂_ℓ can be derived by plugging this model into the definition in Equation (4):
\[ V_\ell(m_\ell) = \beta_\ell^2\, \mathrm{Var}[\mu(\hat F_\ell)] + \nu_\ell^2\, \mathrm{Var}[\sigma^2(\hat F_\ell)] + 2\beta_\ell \nu_\ell\, \mathrm{Cov}[\mu(\hat F_\ell), \sigma^2(\hat F_\ell)], \qquad (7) \]
where the terms involving the other distributions drop out because μ(F^c_i) and σ²(F^c_i), i ≠ ℓ, are constants. The overall input uncertainty σ_I² can be derived by plugging Model (6) into the definition in Equation (2):
\[ \sigma_I^2 = \sum_{\ell=1}^{L} \bigl\{ \beta_\ell^2\, \mathrm{Var}[\mu(\hat F_\ell)] + \nu_\ell^2\, \mathrm{Var}[\sigma^2(\hat F_\ell)] + 2\beta_\ell \nu_\ell\, \mathrm{Cov}[\mu(\hat F_\ell), \sigma^2(\hat F_\ell)] \bigr\} = \sum_{\ell=1}^{L} V_\ell(m_\ell). \qquad (8) \]
The first equality follows from independent sampling from the input models and the second follows directly from Equation (7). This result shows that under Model (6) the overall input uncertainty is the sum of the individual contributions; i.e., σ_I² = Σ_ℓ V_ℓ(m_ℓ). Thus, under our model the overall input uncertainty can be decomposed into the individual contributions, and the individual contributions are independent of each other. Also, under this model the sensitivity with respect to F̂_ℓ becomes ∂V_ℓ(m_ℓ)/∂m_ℓ, which makes the sensitivity simply the derivative of the contribution with respect to the sample size, evaluated at the current number of samples m_ℓ.

The variance decomposition in Equation (8) coincides with a result in Cheng and Holland (1998) that was obtained by a different argument. They approximated the input uncertainty variance σ_I² as a function of the variances of parameter estimators of parametric input distributions. If we let F^c(· | θ) be the true parametric family of input distributions, θ^c be the collection of true parameters, and θ̂ be an estimator of θ^c, then a Taylor series expansion of g(θ̂) around θ^c gives
\[ \sigma_I^2 \approx \nabla g(\theta^c)^{\top}\, \mathrm{Var}[\hat\theta]\, \nabla g(\theta^c), \qquad (9) \]
where Var[θ̂] is the variance-covariance matrix of θ̂ and ∇g(θ^c) is the gradient of g(·) at θ^c. By further assuming θ̂ is a Maximum Likelihood Estimator (MLE) of θ^c, they argued that the inverse of the Fisher information gives the asymptotic variance-covariance matrix of θ̂ under some regularity conditions. To connect this to our formulation, assume that each input distribution F^c_ℓ can be parameterized by μ(F^c_ℓ) and σ²(F^c_ℓ). Then we can represent g(F̂) as a function of the parameters g(θ̂), where θ̂ = (μ(F̂_1), σ²(F̂_1), ..., μ(F̂_L), σ²(F̂_L)). In fact, in our case the approximation in Equation (9) is an equality because Model (6) is linear in θ and therefore the first-order Taylor approximation is the model itself. A feature of our approach is that we can directly compute Var[θ̂] without making further assumptions.
Since we have independence among the input processes, Var[θ̂] is a block-diagonal matrix with 2 × 2 blocks
\[ \begin{bmatrix} \mathrm{Var}[\mu(\hat F_\ell)] & \mathrm{Cov}[\mu(\hat F_\ell), \sigma^2(\hat F_\ell)] \\ \mathrm{Cov}[\mu(\hat F_\ell), \sigma^2(\hat F_\ell)] & \mathrm{Var}[\sigma^2(\hat F_\ell)] \end{bmatrix}, \quad \ell = 1, 2, \ldots, L. \]
Plugging Var[θ̂] into Equation (9) gives the same expression as in Equation (8). Cheng and Holland (1998) provided an approximate analysis of an exact model that requires parametric input distributions and MLEs; we provide an exact analysis of an approximate model using any form of input distribution, but assuming that most of the sensitivity of the simulation response to the input distributions is captured by their means and variances.
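As a small numerical illustration of Equations (7) and (8), the sketch below computes each V_ℓ and σ_I² from coefficients and moment variances/covariances; every numeric input is a placeholder that would in practice come from the regression fit and the moment expressions discussed in Section 3.

```python
import numpy as np

def contribution(beta_l, nu_l, var_mu, var_sig2, cov_mu_sig2):
    """Equation (7): V_l = beta_l^2 Var[mu] + nu_l^2 Var[sigma^2] + 2 beta_l nu_l Cov[mu, sigma^2]."""
    return beta_l**2 * var_mu + nu_l**2 * var_sig2 + 2.0 * beta_l * nu_l * cov_mu_sig2

# Placeholder inputs for L = 2 input models (illustrative numbers only).
beta = np.array([1.5, -0.8])        # mean-effect coefficients beta_l
nu = np.array([0.4, 0.2])           # variance-effect coefficients nu_l
var_mu = np.array([0.02, 0.05])     # Var[mu(F_l-hat)]
var_sig2 = np.array([0.01, 0.03])   # Var[sigma^2(F_l-hat)]
cov_ms = np.array([0.004, 0.006])   # Cov[mu(F_l-hat), sigma^2(F_l-hat)]

V = np.array([contribution(beta[l], nu[l], var_mu[l], var_sig2[l], cov_ms[l])
              for l in range(len(beta))])
sigma2_I = V.sum()                  # Equation (8): overall input uncertainty
print("contributions V_l:     ", V)
print("relative contributions:", V / sigma2_I)
print("sigma_I^2 =", sigma2_I)
```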
To evaluate the contribution of each input model, we will use least-squares regression to estimate β 0 , β 1 , . . . , β L , ν 1 , ν 2 , . . . , ν L . In the next section we introduce a sequence of experiments to estimate these coefficients and to evaluate the contribution and sensitivity of each input model.

The sequence of experiments
In this section we describe a sequence of experiments that an analyst might conduct: nominal, diagnostic, and follow-up. Current simulation practice is to run only the nominal experiment.
The nominal experiment involves the analyst collecting input data, choosing input models F̂ to use, building the simulation model, and running n replications to obtain the point estimator Ȳ(F̂). From this experiment, the analyst obtains the traditional CI for E[Y(F̂) | F̂], which is typically not the same as a CI for E[Y(F^c)], as discussed earlier. The number of replications n is either chosen arbitrarily, to achieve a certain level of simulation error, or because it can be completed in the available time.
We are suggesting that this be followed by a diagnostic experiment to evaluate the contribution and sensitivity of each input model as well as the overall input uncertainty. The contributions can be calculated from Equation (7) given the coefficients β_1, β_2, ..., β_L, ν_1, ν_2, ..., ν_L and the variances and covariance of μ(F̂_ℓ) and σ²(F̂_ℓ). We describe a method for estimating them in the next section.
Upon completion of the diagnostic experiment, the analyst is either satisfied that input uncertainty is not substantial or is concerned that it is substantial and has a better understanding of how significant it is. In either case, the analyst has simulation results from the diagnostic experiment that could perhaps be used to improve the estimate of E[Y(F c )]; we study whether and how to do this as well.
When input uncertainty is substantial, the analyst may also undertake a follow-up experiment, which involves collecting additional real-world input data (with our sensitivities providing the most valuable targets), refining the estimated input models, and conducting another simulation with the refined input models. Conclusions could be based on this final experiment only, but we investigate whether simulation outputs from the nominal and diagnostic experiments should also be incorporated. Of course, there could be additional cycles of diagnostic and follow-up experiments as desired.

Diagnostic experiment
The diagnostic experiment is conducted to fit the mean-variance effects model (6) and derive our measures of input uncertainty. To estimate the unknown coefficients β_0, β_1, ..., β_L, ν_1, ν_2, ..., ν_L by least squares, we need at least 2L + 2 design points, which means we need 2L + 2 different F̂s. However, this is typically impossible since we do not have more than one data set from which to fit F̂; even if we did have multiple data sets, they would usually be pooled to enhance the precision of F̂. Instead, Song and Nelson (2013) suggested a bootstrap approach: treat F̂ as if it were the true real-world distribution F^c and sample from F̂ multiple times instead of gathering multiple real-world samples. In our context, a "bootstrap sample" is an i.i.d. sample X*_{ℓ1}, X*_{ℓ2}, ..., X*_{ℓ m_ℓ} from F̂_ℓ. We use the notation X*_{ℓj} to denote a sample from the empirical cumulative distribution function (ecdf) or fitted parametric distribution F̂_ℓ, as opposed to X_{ℓj}, which is a sample from the true real-world distribution F^c_ℓ. More generally, a superscript * denotes a quantity defined by a bootstrap sample.
In our method a bootstrap sample X*_{ℓ1}, X*_{ℓ2}, ..., X*_{ℓ m_ℓ} from F̂_ℓ is treated as a real-world sample X_{ℓ1}, X_{ℓ2}, ..., X_{ℓ m_ℓ} from F^c_ℓ. From the bootstrap sample we can fit an ecdf F̂*_ℓ, which plays the role of F̂_ℓ, and calculate μ(F̂*_ℓ) and σ²(F̂*_ℓ). Repeating this for ℓ = 1, 2, ..., L, a collection of ecdfs F̂* = {F̂*_1, F̂*_2, ..., F̂*_L} is obtained, and we can run replications of the simulation with F̂* to obtain the estimator Ȳ(F̂*). If we repeat this process B times, we get F̂*^(1), F̂*^(2), ..., F̂*^(B) and the corresponding means and variances of the input models, as well as the simulation estimators Ȳ(F̂*^(1)), Ȳ(F̂*^(2)), ..., Ȳ(F̂*^(B)), from which we can fit Model (6).
Why use bootstrap samples to create design points for fitting the mean-variance effects model (6) instead of a classic designed experiment? First, this is not a simple design space: it is the space of possible fitted input distributions that could result from sampling from the true distribution F c . Even if parametric distributions are used (which we do not require) so that the design space becomes the space of parameter values, some sets of parameters are far more likely than others, and our mean-variance model will be most effective if we get a good fit near F c rather than a global fit across the space. By using bootstrap samples from F we create design points that are representative of what is likely, providing a good fit where it matters most. Furthermore, simulating at bootstrap random samples, rather than chosen design points, allows us to combine simulation results from the nominal and diagnostic experiments without introducing lack-of-fit bias; see Section 4. There are additional advantages, which we describe below.
The analogy between real-world sampling and bootstrap sampling is equivalent to assuming that the input uncertainty σ_I² is well approximated by the variability of the bootstrap estimators; that is,
\[ \sigma_I^2 \approx \mathrm{Var}\bigl[\mathrm{E}[Y(\hat F^*) \mid \hat F^*] \;\big|\; \hat F\bigr]. \qquad (10) \]
Under Model (6), σ_I² is the sum of the contributions of all input models as in Equation (8). Therefore, in view of the approximation (10), the analogy will be valid if
\[ \mathrm{Var}[\mu(\hat F_\ell^*) \mid \hat F_\ell] \approx \mathrm{Var}[\mu(\hat F_\ell)], \quad \mathrm{Var}[\sigma^2(\hat F_\ell^*) \mid \hat F_\ell] \approx \mathrm{Var}[\sigma^2(\hat F_\ell)], \quad \mathrm{Cov}[\mu(\hat F_\ell^*), \sigma^2(\hat F_\ell^*) \mid \hat F_\ell] \approx \mathrm{Cov}[\mu(\hat F_\ell), \sigma^2(\hat F_\ell)] \qquad (11) \]
for ℓ = 1, 2, ..., L. As the real-world sample size m_ℓ increases, this approximation is asymptotically justified under some conditions on F̂_ℓ, as discussed in Section 3.3. A valuable advantage of the approximation (11) is that it provides expressions for the variance and covariance components in Equation (7) that we need to calculate the contributions. Since we use an empirical distribution F̂*_ℓ, μ(F̂*_ℓ) and σ²(F̂*_ℓ) are simply the sample mean and the second sample central moment of X*_{ℓ1}, X*_{ℓ2}, ..., X*_{ℓ m_ℓ}, which is an i.i.d. sample from the known distribution F̂_ℓ. Therefore, as shown in Appendix A of the online supplement, we can derive exact expressions for the conditional variances and covariance in Equation (11) in terms of the central moments of F̂_ℓ. If F̂_ℓ is a parametric distribution, then we can calculate these central moments from the known parametric form; for instance, if F̂_ℓ is a gamma distribution with estimated shape parameter α̂ and rate parameter β̂, then the second, third, and fourth central moments are α̂/β̂², 2α̂/β̂³, and 3α̂²/β̂⁴ + 6α̂/β̂⁴, respectively.
One of the major advantages of our approach is that these variance/covariance expressions are not approximations, which improves the estimation of the contributions. Also notice that this is a very general method that is applicable to any empirical distribution and many parametric distributions; the only time we face difficulty is when F is a parametric distribution whose parameters are in the range for which not all moments up to the fourth exist (e.g., a log-logistic distribution with shape parameter less than four). Even then, we can use our method provided that the offending distribution can be represented as a transformation of another distribution whose first four moments always exist. In this case we fit the mean-variance model to the moments of the underlying distribution. For instance, the log-logistic distribution is a transformation of a logistic distribution whose first four moments are finite, so we fit to the moments of the underlying logistic distribution. Since every distribution can be viewed as a transformation of the uniform (0, 1) distribution, our method is (at least in theory) completely general.
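The following sketch illustrates, for a single input model, how the variability of μ(F̂*_ℓ) and σ²(F̂*_ℓ) across bootstrap samples could be estimated by brute force; in our method the closed-form expressions referenced above (Appendix A of the online supplement) replace this Monte Carlo step, so this is only a numerical check under an assumed data set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend real-world data for one input process (assumed, for illustration).
m = 100
real_data = rng.gamma(shape=2.0, scale=1.5, size=m)   # plays the role of X_1, ..., X_m

B = 5000
boot_means = np.empty(B)
boot_vars = np.empty(B)
for b in range(B):
    resample = rng.choice(real_data, size=m, replace=True)   # bootstrap sample from the ecdf
    boot_means[b] = resample.mean()                          # mu(F*-hat)
    boot_vars[b] = resample.var()                            # sigma^2(F*-hat), second central moment

print("Var[mu(F*-hat) | F-hat]      ~", boot_means.var())
print("Var[sigma^2(F*-hat) | F-hat] ~", boot_vars.var())
print("Cov[mu, sigma^2 | F-hat]     ~", np.cov(boot_means, boot_vars)[0, 1])
# For comparison: Var[mu(F*-hat) | F-hat] should be close to sigma^2(F-hat)/m.
print("sigma^2(F-hat)/m             =", real_data.var() / m)
```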
Inserting the moment expressions into Equation (7), the contribution of F̂_ℓ based on m_ℓ real-world samples can be written in closed form as a function of m_ℓ and the central moments of F̂_ℓ (Equation (12)), and differentiating that expression with respect to m_ℓ gives the sample-size sensitivity of F̂_ℓ (Equation (13)). Notice that the sensitivity is always negative, since input uncertainty is reduced by additional real-world data. Notice also that the rank order of the contributions and of the sensitivities do not always coincide; even if F̂_ℓ has the largest contribution, if m_ℓ is also large then input uncertainty may be less sensitive to F̂_ℓ than to some other input model. In other words, an additional sample from F^c_ℓ may not make much difference to the input uncertainty variance if we already have a large sample.
The algorithm for the diagnostic experiment is as follows:

Diagnostic Experiment
1. Given the estimated input models F̂ = {F̂_1, F̂_2, ..., F̂_L}, do the following:
2. For b = 1 to B:
   a. For ℓ = 1 to L, generate a bootstrap sample X*_{ℓ1}, X*_{ℓ2}, ..., X*_{ℓ m_ℓ} from F̂_ℓ, form the ecdf F̂_ℓ^*(b), and calculate the mean μ(F̂_ℓ^*(b)) and variance σ²(F̂_ℓ^*(b)).
   b. Run R replications of the simulation driven by F̂^*(b) = {F̂_1^*(b), ..., F̂_L^*(b)} and record Ȳ(F̂^*(b)).
3. Fit Model (6) by least-squares regression of Ȳ(F̂^*(1)), ..., Ȳ(F̂^*(B)) on the means and variances from step 2; use the fitted coefficients to compute the estimated contributions V̂_ℓ(m_ℓ), the sensitivities, and σ̂_I² = Σ_ℓ V̂_ℓ(m_ℓ).
4. Estimate σ² from the nominal experiment and report γ̂ = σ̂_I/(σ̂/√n).
We estimate σ² in step 4 using the sample variance from the nominal experiment, rather than using the residual mean squared error (MSE) of the fitted model in step 3; this avoids bias due to lack of fit. Of course, n in step 4 is the number of replications used in the nominal experiment and need not be the same as R.
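A compact sketch of the diagnostic experiment is given below. The simulation itself is a placeholder function (a toy response), and the real-world data sets are assumed; the structure, namely bootstrap the data, simulate R replications at each of B bootstrap design points, and regress the outputs on the bootstrap means and variances to fit Model (6), follows the algorithm above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(dists, R, rng):
    """Placeholder simulation: R replications driven by sampling from the given
    bootstrap samples (arrays of values). Replace with the real simulation model."""
    out = []
    for _ in range(R):
        draws = [rng.choice(d) for d in dists]   # one draw per input process (toy logic)
        out.append(sum(draws) + rng.normal(scale=0.5))
    return float(np.mean(out))

# Assumed real-world data sets, one per input process (L = 2 here).
real_data = [rng.exponential(1.0, size=100), rng.exponential(2.0, size=150)]
L, B, R = len(real_data), 50, 20

X_design, Ybar = [], []
for b in range(B):
    boot = [rng.choice(d, size=len(d), replace=True) for d in real_data]  # step 2(a)
    mus = [s.mean() for s in boot]
    sig2s = [s.var() for s in boot]
    X_design.append([1.0] + mus + sig2s)          # row: [1, mu_1..mu_L, sigma^2_1..sigma^2_L]
    Ybar.append(simulate(boot, R, rng))           # step 2(b)

X_design = np.array(X_design)
Ybar = np.array(Ybar)
coef, *_ = np.linalg.lstsq(X_design, Ybar, rcond=None)   # step 3: least squares for Model (6)
beta0, beta, nu = coef[0], coef[1:1 + L], coef[1 + L:]
print("beta-hat:", beta, " nu-hat:", nu)
# Step 4 would combine these coefficients with the moment variance/covariance
# expressions (Appendix A) to estimate V_l, the sensitivities, sigma_I^2, and gamma.
```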
As discussed in Section 2.1, the ratio γ expresses input uncertainty in units of the simulation estimation error. If γ = 0.5, for instance, it implies that the input uncertainty is only half of the simulation estimation error, which may be acceptable depending on the type of decision that the simulation is expected to support. If γ is large (e.g., γ = 20), then we can conclude that the simulation estimation error as measured, say, by the width of a CI, should be inflated by a factor of roughly √(1 + γ²). Whether or not this is acceptable depends on the situation: if the simulation estimation error is very small, as would occur if the number of replications n is very large, then a CI that is roughly 20 times longer might have little effect on the decision that the simulation model is designed to support. We believe that it will often be the case that such an inflation is unacceptable, so the analyst may choose to collect more real-world data for the input models that have greater (more negative) sensitivities. This decision also depends on the feasibility and the cost of additional data collection.
The key insight is that γ must be interpreted in light of the remaining simulation estimation error, not as an absolute number, and it will be most meaningful when n was explicitly chosen to achieve a specified level of error. For instance, if n was chosen (perhaps sequentially) to attain a CI with a specified half-width, then γ can be interpreted as large or small by comparing √(1 + γ²) times that half-width with the precision the decision actually requires.

Follow-up experiment
If the analyst collects more real-world data from some or all of the input processes, then they will have an updated collection of input models with m′_ℓ > m_ℓ for at least one ℓ ∈ {1, 2, ..., L}. Using updated input models that are fit to all of the accumulated data, the analyst can run a follow-up experiment to obtain an estimator of E[Y(F^c)] with reduced input uncertainty. We assume that the follow-up experiment employs at least as many simulation replications as the nominal experiment, so n′ ≥ n. If needed, this sequence of experiments can be repeated by regarding the results from the follow-up experiment in the previous sequence as a new nominal experiment.
The primary question with respect to the follow-up experiment is whether we should use simulation outputs from the nominal or diagnostic experiments in the overall estimator. We address this question in a later section.

Validity of the bootstrap approximation
Here we consider the validity of the bootstrap approximation introduced above, where we suggested that
\[ \sigma_I^2 \approx \mathrm{Var}\bigl[\mathrm{E}[Y(\hat F^*) \mid \hat F^*] \;\big|\; \hat F\bigr]. \]

Given our model, this is equivalent to the moment conditions in Equation (11): the variances and covariance of μ(F̂*_ℓ) and σ²(F̂*_ℓ), conditional on F̂_ℓ, must approximate the sampling variances and covariance of μ(F̂_ℓ) and σ²(F̂_ℓ).

This assumption can be asymptotically justified under certain conditions as the sample size m_ℓ gets large; we describe some cases below. Recall that in our method the bootstrap distribution F̂*_ℓ is an ecdf. Assuming that F̂_ℓ is also an ecdf, and the relevant moments exist, the asymptotic variances and covariance of μ(F̂*_ℓ) and σ²(F̂*_ℓ) given F̂_ℓ match, with probability 1, those of μ(F̂_ℓ) and σ²(F̂_ℓ) (Equation (15)). Furthermore, when F̂_ℓ is an ecdf it is easy to show that the corresponding finite-sample relationship, Equation (16), also holds; see Zhang (2007), Cho and Cho (2008), and the online supplement. However, if F̂_ℓ is a parametric distribution whose parameters are estimated from the observed real-world data, then neither Equation (15) nor Equation (16) is guaranteed to hold. For instance, if we fit the wrong parametric family to the data then differences can occur. A sufficient condition for both Equation (15) and Equation (16) to hold when F̂_ℓ is a parametric distribution is that F̂_ℓ is flexible enough to match any first four moments of the data, and the distribution is fit using the Method of Moments (MM) up to at least the fourth moment.
Even when we have the correct parametric family, there could still be differences depending on how we estimate the parameters. As mentioned above, if we use the MM then the moments of the fitted distribution are the sample moments; and in many cases the MLEs and MM estimators are asymptotically equivalent (e.g., for the normal distribution). This is not always the case, however.
Suppose that F^c_ℓ is a uniform distribution on (α, β) and we have m_ℓ i.i.d. real-world observations X_{ℓ1}, X_{ℓ2}, ..., X_{ℓ m_ℓ}. The MLEs are α̂ = X_{ℓ(1)} and β̂ = X_{ℓ(m_ℓ)}, where X_{ℓ(i)} is the ith order statistic. Except for the covariance term, the resulting asymptotic variances are smaller than those of the MM estimators because the MLEs are asymptotically more efficient. Nevertheless, even in this case our bootstrap approximation provides a valid representation of the variability of the real-world data that could have been obtained, although not a perfect representation of the variability of the estimated parameters. As a practical matter, we would apply our method for any sample sizes m_ℓ, ℓ = 1, 2, ..., L, that the analyst is comfortable using to fit distributions if the alternative is to ignore input uncertainty. Based on the bootstrapping literature (e.g., Hall (1992, Appendix I)), the performance of sample moment estimators, and our own experience in empirical studies, we are comfortable with m_ℓ ≥ 100.

Combining results from the nominal, diagnostic, and follow-up experiments
In some situations it is not feasible to gather additional real-world input data even if there is substantial input uncertainty; in others the input uncertainty is so small that there is little value in reducing it further. In either of these situations the analyst terminates the study at the diagnostic phase, which generated Ȳ(F̂*^(b)), b = 1, 2, ..., B, to fit Model (6). The first question we address is whether these results can be used to improve the estimator Ȳ(F̂) from the nominal experiment by using a weighted estimator
\[ \tilde Y = \alpha\, \bar Y(\hat F) + (1 - \alpha)\, \bar Y^*, \quad \text{where } \bar Y^* = \frac{1}{B}\sum_{b=1}^{B} \bar Y(\hat F^{*(b)}) \]
and α ∈ [0, 1]. This estimator only makes sense because the bootstrap distributions F̂*^(b) are sampled directly from F̂, and therefore indirectly from F^c. Deterministically chosen design points would introduce an unknown and likely significant bias.
We answer this question by seeking α that minimizes MSE[Ỹ| F], rather than minimizing MSE[Ỹ]. Minimizing MSE[Ỹ] would make sense if we were actually able to obtain multiple real-world data sets, whereas minimizing MSE[Ỹ| F] acknowledges that we only have one real-world data set and therefore cannot improve our estimate of F c beyond F.
From the derivation in Appendix B of the online supplement, and under the assumption that Model (6) holds, we obtain an expression for MSE[Ỹ | F̂] (Equation (17)) in terms of σ²/n, the variability of Ȳ*, and a term b, which is the bias from bootstrapping. The optimal α is then given by Equation (18). Notice that this α is strictly less than one; hence, it is always better to pool Ȳ(F̂) and Ȳ*. The key term is σ²/n, which represents the simulation variance: the larger it is, the more weight is given to the diagnostic results, which are biased but have smaller simulation variance. An unbiased estimator of b is Ȳ* − Ȳ(F̂), so every term in α is either known or estimable from the nominal and diagnostic experiments.
Previously we suggested that the diagnostic experiment could be used to provide a heuristic adjustment to the CI or standard error of the mean of the nominal experiment by multiplying them by √(1 + γ̂²). In Appendix C of the online supplement, we justify using the sum of σ̂_I² and an estimate of MSE[Ỹ | F̂] as a plug-in approximation for the MSE of the combined estimator Ỹ, where the latter is Equation (17) with the optimal weight α from Equation (18) but with estimates from the diagnostic experiment inserted for all of the unknown quantities. As a rough CI we could multiply the square root of this quantity by an appropriate normal quantile.
We next consider the case when all three experiments have been conducted. Now it makes sense to try to minimize the MSE of the final point estimator with respect to both input and simulation uncertainty. For this reason we discard the simulation results from the diagnostic experiment since they introduce additional bias due to bootstrapping.
For a clear distinction, let F̂_m denote the collection of input models used in the nominal experiment that were estimated from m = {m_1, m_2, ..., m_L} real-world observations, and let Ȳ_n(F̂_m) denote the estimator from n replications using F̂_m. Similarly, F̂_{m′} denotes the collection of input models used in the follow-up experiment that were estimated from m′ = {m′_1, m′_2, ..., m′_L} real-world observations, where m′_ℓ ≥ m_ℓ for all ℓ, and Ȳ_{n′}(F̂_{m′}) is the corresponding estimator from n′ ≥ n replications. Notice that it is very likely that Ȳ_n(F̂_m) and Ȳ_{n′}(F̂_{m′}) are positively correlated, since the follow-up m′_ℓ real-world observations include the nominal m_ℓ observations.
The weighted estimator Ŷ is defined as
\[ \hat Y = \alpha\, \bar Y_n(\hat F_m) + (1 - \alpha)\, \bar Y_{n'}(\hat F_{m'}). \]
Due to the correlation between Ȳ_n(F̂_m) and Ȳ_{n′}(F̂_{m′}), finding the α that minimizes MSE(Ŷ) is more complicated than in the previous case. As shown in Appendix D of the online supplement, if we let b_m and b_{m′} denote the biases of Ȳ_n(F̂_m) and Ȳ_{n′}(F̂_{m′}), respectively, as estimators of E[Y(F^c)], and let σ_I² and (σ′_I)² denote the input uncertainty of the nominal and follow-up experiments, respectively, then the optimal α can be written in terms of σ²/n, σ²/n′, σ_I², (σ′_I)², Cov[Ȳ_n(F̂_m), Ȳ_{n′}(F̂_{m′})], b_m, and b_{m′}. If this α is negative we use α = 0. In general it is not easy to find a useful expression for Cov[Ȳ_n(F̂_m), Ȳ_{n′}(F̂_{m′})]. However, if we assume all input distributions are ecdfs, then under Model (6) we can derive explicit expressions for this covariance and for the bias terms, in which σ²(F^c_ℓ) represents the variance of the true distribution F^c_ℓ; these expressions are derived in Appendix D of the online supplement. Substituting them gives, for this special case, the expression for α in Equation (19). From this result, we can immediately observe that as n′ gets bigger, α gets smaller: the more replications we make in the follow-up experiment, the less we value the replications from the nominal experiment, which makes sense.
For the simplest case, suppose that the analyst did not collect additional real-world input data and ran the follow-up experiment with the same input models F̂_m. Then α = n/(n + n′) because m′_ℓ = m_ℓ for all ℓ. This is clearly the weight we would use to pool two estimators from n and n′ replications generated by the same input models.
More generally, when m′_ℓ > m_ℓ for at least one ℓ there is a trade-off, because pooling introduces more bias: Ȳ_n(F̂_m) tends to have a larger bias than Ȳ_{n′}(F̂_{m′}). If Ȳ_n(F̂_m) is significantly more biased than Ȳ_{n′}(F̂_{m′}) then it is less attractive to pool the two estimators, and we suspect this is often the case. To illustrate this point, suppose that overall input uncertainty is substantial (e.g., γ = 20) and F̂_1 has the largest contribution among all input models while the others have negligible contributions. In this case, the analyst might collect a large additional sample (m′_1 ≫ m_1) from F^c_1 to update F̂_1, while keeping all other input models the same as in the nominal experiment. When this happens α becomes
\[ \alpha = \frac{\dfrac{\sigma^2}{n'} + \dfrac{\bigl(\nu_1 \sigma^2(F_1^c)\bigr)^2}{m_1' - 1}\left(\dfrac{1}{m_1' - 1} - \dfrac{1}{m_1 - 1}\right)}{\dfrac{\sigma^2}{n} + \dfrac{\sigma^2}{n'} + \bigl(V_1(m_1) - V_1(m_1')\bigr) + \left\{\nu_1 \sigma^2(F_1^c)\left(\dfrac{1}{m_1' - 1} - \dfrac{1}{m_1 - 1}\right)\right\}^2}. \qquad (20) \]
The denominator in Equation (20) is strictly positive because V_1(m_1) > V_1(m′_1). However, the numerator can be negative, when the (negative) second term exceeds σ²/n′ in magnitude, in which case α = 0. Even if the numerator is greater than zero, we suspect α will be near zero for the following reasons.
1. The analyst may run more replications in the follow-up experiment than in the nominal experiment (n′ ≥ n), since it provides a less biased (more accurate) estimator, which makes σ²/n′ ≤ σ²/n.
2. Because the second term in the numerator of Equation (20) is negative, the numerator is less than σ²/n′.
3. Since m′_1 ≫ m_1, we expect V_1(m_1) − V_1(m′_1) ≈ V_1(m_1).
4. Since γ is large and F̂_1 has the biggest contribution, V_1(m_1) ≈ σ_I² ≫ σ²/n.
From items 1 to 4, the numerator of Equation (20) is much smaller than its denominator, and therefore α becomes small. Hence, the more real-world data we collect for the follow-up experiment, the less benefit there is in variance reduction from pooling the two estimators, whereas the relative disadvantage from introducing additional bias increases. As mentioned earlier, the result in Equation (19) holds under Model (6) when we assume that F̂_m and F̂_{m′} are collections of ecdfs. We believe that α is typically near zero in this case, which implies that it is better to use the estimator from the follow-up experiment without pooling. We also suspect a similar conclusion holds when any of the input distributions are parametric.

Design of the diagnostic experiment
The diagnostic experiment is an essential component of our method. There are three key experiment design questions.
1. How should we select design points? As discussed in Section 3.1, to fit Equation (14) we have chosen the bootstrap-generated empirical distributions F̂*^(1), F̂*^(2), ..., F̂*^(B) as design points. This concentrates the design where we need to fit well, and it facilitates combining results from the nominal and diagnostic experiments.

2. Should we use Common Random Numbers (CRNs) across the diagnostic simulations?
The bootstrap design points F̂*^(1), F̂*^(2), ..., F̂*^(B) must be sampled independently from F̂, but we can choose to use the same random numbers for the simulations conducted with each F̂*^(b). Kleijnen (1988) and others have shown that CRNs tend to reduce the variance of the slope-parameter estimators in least-squares regression, which here are β̂_ℓ, ν̂_ℓ, ℓ = 1, 2, ..., L in Model (6). Since the variance of V̂_ℓ is an increasing function of the variances of β̂_ℓ and ν̂_ℓ, it seems clear that using CRNs is desirable.

3. Given a budget of N simulation replications, how should it be divided between design points (B) and simulation replications per design point (R) so that RB = N?
Ankenman and Nelson (2012) showed that with their method for assessing overall input uncertainty, if N is not too small then B ≈ 10 is optimal in terms of minimizing the expected width of the CI for γ . However, our objective is different: we focus on providing estimates of the contribution of each input model. Below we argue that B = N (R = 1) is the best choice in terms of statistical efficiency, but B = 2L + 3 is best for computation. Therefore, we provide a recommendation that balances these two objectives.
First, B = N is the optimal design to minimize the MSE of the estimator that combines the nominal experiment (Ȳ(F̂)) and the diagnostic experiment (Ȳ*). This is because the conditional variance of Ȳ* can be approximated as Var[Ȳ* | F̂] ≈ σ_I²/B + σ²/N under the bootstrap approximation in Equation (10); see Appendix B of the online supplement.
Second, as described in Section 3, we estimate the parameters β_0, β_1, ..., β_L and ν_1, ν_2, ..., ν_L in Model (6) by regressing the B simulation output estimators on the means and variances of the bootstrapped ecdfs. The design matrix for the regression is
\[ \begin{bmatrix} 1 & \mu(\hat F_1^{*(1)}) & \cdots & \mu(\hat F_L^{*(1)}) & \sigma^2(\hat F_1^{*(1)}) & \cdots & \sigma^2(\hat F_L^{*(1)}) \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & \mu(\hat F_1^{*(B)}) & \cdots & \mu(\hat F_L^{*(B)}) & \sigma^2(\hat F_1^{*(B)}) & \cdots & \sigma^2(\hat F_L^{*(B)}) \end{bmatrix}. \]
As we have 2L + 1 parameters, we need at least 2L + 2 unique rows in the design matrix, which implies that it is necessary to have B ≥ 2L + 2. However, B ≥ 2L + 2 is not always sufficient to obtain the required number of unique rows. If our input models include at least one continuous parametric distribution among F̂_1, F̂_2, ..., F̂_L, then with probability one all rows in the design matrix will be unique, because the probability of two bootstrap samples having the same sample moments is zero for a continuous distribution. On the other hand, if all input models F̂_1, F̂_2, ..., F̂_L are ecdfs or discrete parametric distributions, then we need to be more cautious. Bernoulli distributions, particularly when the probability of success is extreme, are the most challenging cases, since ties in the sample moments are quite likely when m_ℓ is small. Thus, a larger m_ℓ gives more opportunities for unique rows. Also worth noting is that the likelihood of identical rows diminishes as the number of input models L increases, so the problem is less pronounced in simulations with many input models.
Finally, the contribution estimator V̂_ℓ in Equation (12) is a function of β̂_ℓ and ν̂_ℓ, so a good stand-in for the properties of V̂_ℓ are the properties of β̂_ℓ and ν̂_ℓ. To illustrate why B = N is statistically best, we temporarily simplify Model (6) to include only the "mean effects" and no CRN:
\[ \bar Y(\hat F^{*(b)}) = \beta_0 + \sum_{\ell=1}^{L} \beta_\ell\, \mu(\hat F_\ell^{*(b)}) + \bar\varepsilon_b, \]
so that V_ℓ(m_ℓ) = β_ℓ² Var[μ(F̂_ℓ)]. If we assume that both μ(F̂*_ℓ) and the errors are normally distributed, which is plausible since μ(F̂*_ℓ) is asymptotically normally distributed when the sample size m_ℓ is large, then standard regression results give the variance and the bias of the resulting contribution estimator as functions of B and R when we force RB = N (Equations (22) and (23)). Clearly, B = N is best for minimizing variance and bias, but the marginal impact of increasing B diminishes when BR = N is fixed. If we extend this analysis in the natural way to include the "variance effects" as in Model (6), then the −L terms in the denominators of Equations (22) and (23) become −2L.
On the other hand, from a computational efficiency point of view using B < N (R > 1) has advantages over B = N (R = 1). There is typically a computational set-up cost for simulating a new design point (which is really a new simulation model) but very little setup required for each additional replication at a design point. If B < N then there are fewer setups. Furthermore, the number of rows in the design matrix is B, implying the need to manipulate a B × (2L + 1) matrix to fit the regression model. This argues for smaller B.
We must have B ≥ 2L + 2. When a larger N is feasible, we recommend making B large enough that the incremental decrease in the variance and bias from one additional design point is small, say less than δ, which implies selecting the smallest B such that
\[ B > 2L + 2 + \sqrt{\frac{2L + 2}{\delta}}, \]
and then R = ⌊N/B⌋. For instance, if L = 5 and δ = 0.01 (a 1% marginal decrease), then B ≈ 12 + 35 = 47.
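A small helper, following the rule just described, that chooses B and then R given the simulation budget N; the rule is as reconstructed above, so treat this as a sketch rather than a definitive prescription.

```python
import math

def design_size(L, N, delta=0.01):
    """Smallest B with B > 2L + 2 + sqrt((2L + 2) / delta), then R = floor(N / B)."""
    B = math.floor(2 * L + 2 + math.sqrt((2 * L + 2) / delta)) + 1
    if B > N:
        raise ValueError("budget N is too small for the required number of design points")
    return B, N // B

# Example from the text: L = 5 and delta = 0.01 gives B of about 47.
print(design_size(L=5, N=10000))   # -> (47, 212)
```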

Empirical results
This section summarizes an empirical study of the proposed method applied to two simple examples and also an illustration on a realistic problem.
In the two simple examples we apply our method and provide intuitive explanations for the results, as well as compare them to true input-model contributions as defined in Equation (4) that were estimated precisely from side experiments. These side experiments exploit the fact that the true, correct real-world distributions are known, which is obviously not the case in practice. In both examples we define F c for each input and then sample m observations that we treat as the real-world data.
We first evaluate the method using a well-known Stochastic Activity Network (SAN); see Fig. 1. The goal is to estimate the mean time to complete the network. We defined a set of real-world input distributions for the activity times, F^c = {F^c_1, F^c_2, ..., F^c_5}. Experiments under different settings of sample size and of the mean and variance of the activity-time distributions were conducted, and the variance contribution and sensitivity of each input distribution were estimated.
The second example is an M/M/1/k queueing system simulation that has two input distributions: interarrival time and service time. The goal of the simulation is to estimate the mean steady-state queue length, which is known to be a highly nonlinear function of the means of the two  input distributions. We use this example to show the performance of our method when the mean response is a nonlinear function of the means and variances of the input distributions. Different settings of the means of the interarrival times and service times as well as the capacity of the queue were tested to evaluate the performance of our method.
We conclude with a more realistic example of simulating a remote order taking system to illustrate the practical application of our method.

SAN
For each activity time X_ℓ we have a correct real-world distribution F^c_ℓ, which we pretend is unknown. Using m_ℓ samples generated from F^c_ℓ, we fit an ecdf F̂_ℓ for each X_ℓ. The total time to finish the project is Y = max{X_1 + X_4, X_1 + X_3 + X_5, X_2 + X_5}, and the purpose of the simulation is to estimate the expected value of Y. In the remainder of this section we describe empirical results for the contribution and sensitivity of the input distributions under two experimental settings: (i) different real-world sample sizes and (ii) different means and variances of the activities.
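For concreteness, a minimal version of this SAN simulation is sketched below; the exponential activity-time distributions match case (i) in the next subsection, and the completion time is exactly the maximum of the three paths given above. The function names are our own.

```python
import numpy as np

rng = np.random.default_rng(42)

def san_completion_time(x):
    """Time to complete the network: the longest of the three paths."""
    x1, x2, x3, x4, x5 = x
    return max(x1 + x4, x1 + x3 + x5, x2 + x5)

def simulate_san(n, means, rng):
    """n replications with independent exponential activity times (one mean per activity)."""
    times = rng.exponential(means, size=(n, 5))
    return np.array([san_completion_time(row) for row in times])

y = simulate_san(10000, means=[1.0] * 5, rng=rng)
print("estimated E[Y]:", y.mean(), "+/-", 1.96 * y.std(ddof=1) / np.sqrt(len(y)))
```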
To evaluate how well our method measures the contribution of each input model, a series of side experiments was conducted for each experimental setting. The side experiment estimates the contribution directly from the definition in Equation (4) by exploiting the fact that we know the true distributions in this example.

Side Experiment
1. Given the true distributions F^c_1, F^c_2, ..., F^c_5 for activity times X_1, X_2, ..., X_5, do the following:
2. For b = 1 to B:
   a. Generate m_1 observations from F^c_1 and fit the ecdf F̂_1^(b).
   b. Run R replications of the simulation driven by {F̂_1^(b), F^c_2, ..., F^c_5} and record Ȳ(F̂_1^(b), F^c_2, ..., F^c_5).
3. Calculate the sample variance of Ȳ(F̂_1^(b), F^c_2, ..., F^c_5), b = 1, 2, ..., B.     (24)

Conduct steps 2 and 3 for each distribution
A key point is that we make R large enough that the simulation error σ²/R is negligible relative to the contribution being estimated. This implies that Ȳ(F̂_1^(b), F^c_2, ..., F^c_5) ≈ E[Y(F̂_1^(b), F^c_2, ..., F^c_5) | F̂_1^(b)], and we can treat the sample variance (24) as an estimate of the variance due to different real-world samples; i.e., the contribution V_1 only. We used (B = 100, R = 5000), which gave relative errors of less than 2%. Notice that B and R in these side experiments are unrelated to the choices we make for the diagnostic experiment.
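A sketch of this side experiment for V_1 in the SAN example follows; it reuses the completion-time logic above and exploits the known true exponential distributions, which is only possible because this is a constructed example. All settings other than (B, R, m_1) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def san_completion_time(x1, x2, x3, x4, x5):
    return np.maximum(x1 + x4, np.maximum(x1 + x3 + x5, x2 + x5))

B, R, m1 = 100, 5000, 100
ybars = np.empty(B)
for b in range(B):
    fitted1 = rng.exponential(1.0, size=m1)          # "real-world" sample defining the ecdf F1-hat
    # R replications: activity 1 from the ecdf, activities 2-5 from their true distributions.
    x1 = rng.choice(fitted1, size=R, replace=True)
    x2, x3, x4, x5 = (rng.exponential(1.0, size=R) for _ in range(4))
    ybars[b] = san_completion_time(x1, x2, x3, x4, x5).mean()

V1_estimate = ybars.var(ddof=1)    # sample variance (24): estimates the contribution V_1(m_1)
print("side-experiment estimate of V_1:", V1_estimate)
```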

SAN with different activity-time sample sizes
In this experiment, all activity times have exponential real-world distributions with mean one. Thus, the path X_1 + X_3 + X_5 is likely to be the longest. To see the effect of different real-world sample sizes on V_ℓ, we conducted simulations for two cases: (i) m_ℓ = 100, ℓ = 1, 2, ..., 5; and (ii) m_ℓ = 100, ℓ = 1, 2, 4, 5, and m_3 = 50. We used (B = 50, R = 200) for the diagnostic experiment. Figure 2 displays the simulation results averaged over 1000 macroreplications to provide a relative error of less than 3%. Plotted are the relative contributions V̂_ℓ(m_ℓ)/Σ_{i=1}^{5} V̂_i(m_i) and the sensitivities, scaled so that the greatest sensitivity has value −1. In case (i), X_1 and X_5 have the largest contributions and are more sensitive to additional real-world data, since these two activity times are involved in two out of the three paths in the network. On the other hand, X_2 and X_4 have the smallest contributions and sensitivities since they are only involved in one path, which is likely to be dominated by X_1 + X_3 + X_5. The higher contribution/sensitivity of X_3 compared with X_2 and X_4 can be explained by the same reasoning. This trend changes slightly when m_3 decreases to 50 in case (ii): since there are fewer real-world samples for X_3 in this case, X_3 shows the largest contribution and sensitivity. This is a good illustration that input uncertainty is a combination of how much the distribution itself matters in the simulation and how well it has been estimated. Figure 3 compares the estimated contributions from the diagnostic experiments and the side experiments for cases (i) and (ii). Notice that the contributions are not scaled in this graph. Clearly, the estimated contributions from the two approaches are close, which implies that our diagnostic experiment successfully estimated the true contributions.

SAN with different activity-time distributions
In this section, gamma distributions are used as the true distributions for the activity times. Two sets of experiments were conducted to investigate the effect of different mean and variance values on the contribution of X_3. In the first set of experiments, we fixed the variance of all activity times to 0.5 and set X_1, X_2, X_4, X_5 to have mean one and X_3 to have mean 10. In the second set, the means of all activities are fixed to 10, but X_1, X_2, X_4, X_5 have variance one, whereas X_3 has variance five. In all cases, m_ℓ = 100 real-world samples were collected for each activity time to fit the ecdf F̂_ℓ. For the diagnostic experiment we used (B = 50, R = 200), and the results were averaged over 1000 macroreplications. Figure 4 shows the experimental results. Observe that in the first set the trend in contribution/sensitivity is not much different from case (i) in Section 6.1.1. This might not seem intuitive, since one might think that the large mean value of X_3 would increase its contribution. However, a large mean for X_3 only makes the path X_1 + X_3 + X_5 more likely to be dominant; therefore, the input uncertainty due to X_1 and X_5 still has a significant impact on output variability, since they are also included in other paths. In the second set, X_3 has the largest contribution and is more sensitive to additional data collection, which can be explained by its relatively large variance. Figure 5 presents the estimated input contributions from the side experiments and compares them to the results from our method. As in Section 6.1.1, the two estimated values are close in both cases (i) and (ii).

M/M/1/k queueing system
In this section, we apply our method to a queueing system with finite capacity k in which the interarrival times and service times are i.i.d. exponential random variables with means θ_1 and θ_2, respectively. Our measure of interest is the steady-state mean number of customers in the system, Y. Given θ_1 and θ_2, it is known that the steady-state number of customers in the system, n, follows a truncated geometric distribution with probability
\[ P_n = (\theta_2/\theta_1)^n\, \frac{1 - \theta_2/\theta_1}{1 - (\theta_2/\theta_1)^{k+1}}, \quad \text{for } n = 0, 1, \ldots, k, \qquad (25) \]
provided θ_1 ≠ θ_2. Therefore, letting ρ = θ_2/θ_1, the expected number of customers in the system is
\[ \sum_{n=0}^{k} n P_n = \frac{\rho}{1 - \rho} - \frac{(k+1)\rho^{k+1}}{1 - \rho^{k+1}}, \qquad (26) \]
which is clearly a nonlinear function of θ_1 and θ_2. Assume for the moment that the expression in Equation (26) is unknown and we have to estimate it by simulation, where the true parameters θ^c_1 and θ^c_2 are estimated from the observed interarrival times and service times, respectively. Then we can use our method to analyze the input uncertainty in the simulation estimator and evaluate the contributions of the two input models. However, our method fits Model (6) to the simulation response; in other words, it assumes E[Y | θ̂_1, θ̂_2] to be a linear function of θ̂_1, θ̂_1², θ̂_2, and θ̂_2². Therefore, the estimation quality of the contributions depends on how well Model (6) approximates E[Y | θ̂_1, θ̂_2] near θ^c_1 and θ^c_2. To investigate the performance of our method, the estimated contributions from diagnostic experiments at different settings of θ^c_1, θ^c_2, and k are compared to the estimated contributions from the side experiments. As in the SAN example in Section 6.1, we exploit the fact that θ^c_1 and θ^c_2 are known. In all cases the sample sizes for the real-world interarrival times and service times are m_1 = 200 and m_2 = 100, respectively. Table 1 summarizes the results.
In all settings, the estimated contributions are of the same order of magnitude (rounding to the first decimal place, if necessary) as the results from the side experiments. Also, the order of importance is preserved; i.e., the contribution of the service time is higher than that of the interarrival time in all cases in both the diagnostic and side experiments. The first two cases show the results when the queue is congested, so the system is likely to be filled to its capacity k. In fact, in both the k = 1 and k = 100 cases Equation (26) is quite flat near (θ^c_1 = 0.2, θ^c_2 = 1). Therefore, Model (6) captures the shape of Equation (26) well, which results in good estimation quality. In Cases 3 and 4, the queue is lightly loaded and therefore the expected queue length is less than k. Especially when k = 100, the system effectively has no capacity limit and behaves like the corresponding M/M/1 queueing system. When k = 1, Equation (26) shows smooth curvature near (θ^c_1 = 1, θ^c_2 = 0.5), which can easily be captured by Model (6). When k = 100, Equation (26) is flat near (θ^c_1 = 1, θ^c_2 = 0.5). Therefore, in both cases the diagnostic experiments approximate the contributions well. When θ^c_1 = 1 and θ^c_2 = 0.9, our method performs better in Case 5 than in Case 6. This difference arises because the traffic intensity θ^c_2/θ^c_1 is close to one. In Case 5, the size of the queue is limited by k = 1; therefore, even if the estimated traffic intensity θ̂_2/θ̂_1 is greater than one, the queue length is still less than one. In Case 6, however, if θ̂_2/θ̂_1 > 1, the simulated average queue length becomes close to k = 100, whereas it is much less than 100 when θ̂_2/θ̂_1 < 1.
Therefore, as small change in θ 1 and θ 2 can cause the mean response to change dramatically and Model (6) approximates the surface relatively poorly in Case 6. However, our method still estimates the contributions of the input distributions to the right order of magnitude and the order of importance is also correct in this case compared to the results from the side experiment.
From this example we can confirm that, even when the mean response is a highly nonlinear function of the input distributions' means and variances, our method works reasonably well. Notice that our goal is not to approximate the mean response surface globally but rather to fit Model (6) near the true means and variances of the input distributions. Therefore, even when the mean response is nonlinear globally, if it is approximately linear in the means and variances within the neighborhood of interest, our method works well.
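Equation (26) can be evaluated directly, which is what makes this example a convenient test case. The sketch below is a standalone illustration rather than part of the method itself: it computes the steady-state mean number in system for given $\theta_1$, $\theta_2$, and $k$, so one can check how flat or curved the response surface is near each $(\theta_1^c, \theta_2^c)$ setting discussed above.

```python
# Steady-state mean number in system for an M/M/1/k queue (Equations (25)-(26)),
# with theta1 = mean interarrival time and theta2 = mean service time.
def mm1k_mean_number_in_system(theta1: float, theta2: float, k: int) -> float:
    rho = theta2 / theta1                      # traffic intensity
    if abs(rho - 1.0) < 1e-12:                 # limiting case theta1 = theta2: uniform on 0..k
        return k / 2.0
    # E[N] = rho/(1-rho) - (k+1) rho^(k+1) / (1 - rho^(k+1))
    return rho / (1.0 - rho) - (k + 1) * rho ** (k + 1) / (1.0 - rho ** (k + 1))

if __name__ == "__main__":
    # The six (theta1^c, theta2^c, k) settings discussed in the text.
    for theta1, theta2, k in [(0.2, 1.0, 1), (0.2, 1.0, 100),
                              (1.0, 0.5, 1), (1.0, 0.5, 100),
                              (1.0, 0.9, 1), (1.0, 0.9, 100)]:
        print(theta1, theta2, k, round(mm1k_mean_number_in_system(theta1, theta2, k), 3))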

Illustration: remote order-taking system
In this section we apply our method to a more realistic simulation to illustrate the sequence of experiments proposed in Section 3. All experiments presented in this section were conducted using Simio (www.simio.com) and functionality that has been added to Simio to perform the diagnostic experiment. The goal of the simulation study, a proof-of-concept project (2013), is to evaluate replacing the current drive-through order windows for a chain of fast-food restaurants with the equivalent of a call center. The current design for each store has two fully staffed windows, one for order taking and another for food delivery. The proposal is to replace the first window with a remote order-taking service in which agents communicate with customers through the electronic order board and then relay the order to the store. The fast-food chain has high standards for customer service and requires that the average waiting time for a customer to be greeted by an agent once they reach the order board be less than 1.5 seconds. The proof-of-concept simulation uses data from seven stores to investigate whether the call center can meet this standard with substantially fewer agents than the number of stores served.
There are nine sets of real-world data (actually created by us): customer interarrival times from the seven stores; order-taking times; and the time for a car to pull away from the order board and the next one in line (if there is one) to pull up. Interarrival times were collected during the busiest 3-hour period over 10 days. The numbers of observations of order-taking time and car-moving time available were 150 and 70, respectively. With these data, parametric distributions were fitted for the initial experiment (the fitted distributions were exponentials, Weibulls, and gammas, and we did not exploit knowing the true distribution families). To evaluate busy-period performance, steady-state simulations were conducted using a replication-deletion experiment design employing n = 1000 replications. With four agents at the call center, the 95% confidence interval for mean customer waiting time in the nominal experiment was 0.99 ± 0.04 seconds, which clearly excludes 1.5 seconds.
To analyze the impact of input uncertainty, we conducted a diagnostic experiment with (B = 80, R = 125). The design was chosen using the guidelines presented in Section 5. Below we report results from the simulation with four agents, which we concluded to be adequate based on the nominal experiment.
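To make the structure of such a diagnostic experiment concrete, the following sketch outlines one way it can be organized: each of the B bootstrap resamples of the real-world data yields refitted input models, R replications are run under those models, and the between-bootstrap and within-bootstrap variability are separated with a generic one-way random-effects decomposition. The simulate_waiting_time function and the specific estimators are our own stand-ins for illustration, not the exact procedure of the paper; we take the reported ratio to be the input-uncertainty standard deviation divided by the standard error of the nominal estimator, which is consistent with the inflation factor used in the next paragraph.

```python
# Sketch of a bootstrap diagnostic experiment for input uncertainty (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def simulate_waiting_time(fitted_means, rng):
    """Hypothetical placeholder: one replication of the simulation under the given input models."""
    return rng.normal(loc=sum(fitted_means), scale=1.0)

def diagnostic_experiment(real_data_sets, B=80, R=125, nominal_halfwidth=0.04):
    boot_means = np.empty(B)
    within_vars = np.empty(B)
    for b in range(B):
        # Resample each real-world data set with replacement and refit its input model
        # (here represented simply by its sample mean, as a stand-in for distribution fitting).
        fitted = [rng.choice(data, size=len(data), replace=True).mean()
                  for data in real_data_sets]
        reps = np.array([simulate_waiting_time(fitted, rng) for _ in range(R)])
        boot_means[b] = reps.mean()
        within_vars[b] = reps.var(ddof=1)
    sigma2 = within_vars.mean()                                  # simulation (within-bootstrap) variance
    sigma2_I = max(boot_means.var(ddof=1) - sigma2 / R, 0.0)     # input-uncertainty (between) variance
    se_nominal = nominal_halfwidth / 1.96                        # standard error of the nominal estimator
    gamma = np.sqrt(sigma2_I) / se_nominal                       # ratio behind the sqrt(1 + gamma^2) inflation
    return sigma2_I, sigma2, gamma
```

With B = 80 and R = 125, as in the text, the diagnostic experiment uses BR = 10 000 replications in total.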
The diagnostic experiment gave $\hat{\gamma} = 17.48$, which implies that input uncertainty is substantially larger than the simulation sampling uncertainty. If we adjust the CI to account for input uncertainty, the 95% CI half-width should be more like $0.04 \times \sqrt{1 + 17.48^2} \approx 0.63$. This is a rough adjustment, and we do not claim that the result is a CI in the formal sense. Clearly, the adjusted interval 0.99 ± 0.63 includes the critical value of 1.5 seconds. Therefore, to reduce the input uncertainty we want to know which distributions are the largest contributors. Table 2 displays the estimated scaled contribution and sensitivity of each input model, as well as its real-world sample size and its sample mean (in seconds). The input model for the order-taking time makes the largest contribution to input uncertainty, and it is the most sensitive to collecting additional data. This makes sense: every customer requires an order-taking time, and the real-world sample size for this input is relatively small. If it is possible to collect additional input data, then this is the distribution from which we would obtain the most benefit.
Notice that the rank orders of contribution and sensitivity do not always coincide. For instance, the interarrival times of stores 6 and 7 have the same contribution; however, the sensitivity of the former is greater because it has a smaller sample size, so additional real-world samples for store 6 can reduce the variability in the simulation result by a larger amount.

Since we concluded that the order-taking time has the largest contribution, we collected more real-world data and performed a follow-up experiment (the "real-world" order-taking time in this example is an exponential distribution with mean 90 seconds). Figure 6 shows the trend in $\hat{\gamma}$, indicated by •, and the roughly adjusted CI for expected waiting time, as a function of the total amount of real-world data on order taking. These are CIs from the nominal experiments inflated by $\sqrt{1 + \hat{\gamma}^2}$, not CIs that use combined nominal and diagnostic experiment results (we discuss one such CI below). The number of simulation replications remains $n = 1000$.
We observe that $\hat{\gamma}$ tends to decrease as we collect more order-taking data. However, even with 50 000 observations of the order-taking time, $\hat{\gamma}$ does not become zero because there is still residual input uncertainty from the other distributions. The contribution and the sensitivity of the order-taking time also decrease as the sample size grows. At $m = 50\,000$, the relative contribution of the order-taking time is 4.1%, which is smaller than that of any other distribution except the car-moving time. We also observe that the estimated average customer waiting time increases as we collect more data. This implies that the first 150 real-world observations did not provide a particularly good approximation of the true distribution. With more real-world data the adjusted CI length decreases, and it becomes clearer that the 1.5-second standard is not met. Therefore, we need to consider increasing the number of agents from four to five, which we would not have considered based on the nominal experiment alone.
In this illustration we chose to collect more real-world data on the order-taking time to mitigate input uncertainty. Had that not been possible, we could have improved our point estimate from the nominal experiment by using the combined estimator

$$\tilde{Y} = \hat{\alpha}\,\bar{Y}(\widehat{F}) + (1 - \hat{\alpha})\,\bar{\bar{Y}}(\widehat{F}), \quad \text{where} \quad \hat{\alpha} = \frac{(\hat{b})^2 + \hat{\sigma}_I^2/B + \hat{\sigma}^2/(BR)}{(\hat{b})^2 + \hat{\sigma}_I^2/B + \hat{\sigma}^2/(BR) + \hat{\sigma}^2/n},$$

$\bar{Y}(\widehat{F})$ is the average from the nominal experiment, and $\bar{\bar{Y}}(\widehat{F})$ is the grand average from the diagnostic experiment.
To estimate $\hat{\alpha}$ we used $\hat{\sigma}_I^2$ from the diagnostic experiment, $\hat{\sigma}^2$ from the nominal experiment, and $\hat{b} = \bar{\bar{Y}}(\widehat{F}) - \bar{Y}(\widehat{F})$ to estimate the bias. This gave $\hat{\alpha} = 0.88$, putting most of the weight on the nominal experiment. In this case the adjustment was so small that $\tilde{Y}$ is the same as $\bar{Y}(\widehat{F})$ to two decimal places. It is important to note that the estimated bias from bootstrapping matters: had we ignored it (set $\hat{b} = 0$) then $\hat{\alpha} = 0.80$, putting somewhat more weight on the results from the diagnostic experiment. The heuristically adjusted CI, which uses data from both the nominal and diagnostic experiments, has half-width $\pm 1.96\sqrt{\hat{\sigma}_I^2 + \widehat{\mathrm{MSE}}} = \pm 0.45$ seconds, which is narrower than if we had used the nominal data alone.
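The combination step itself is only a few lines of arithmetic. The sketch below (variable names are ours; the commented usage values are placeholders, not the case-study estimates) evaluates $\hat{\alpha}$ and $\tilde{Y}$ from the estimated pieces described above.

```python
def combine_estimates(y_nominal, y_diag, sigma2_I, sigma2, B, R, n):
    """Weight the nominal and diagnostic point estimates by (approximate) inverse MSE."""
    b_hat = y_diag - y_nominal                                # estimated bias of the diagnostic average
    mse_diag = b_hat ** 2 + sigma2_I / B + sigma2 / (B * R)   # approx. MSE of the diagnostic average
    var_nominal = sigma2 / n                                  # sampling variance of the nominal average
    alpha = mse_diag / (mse_diag + var_nominal)               # weight on the nominal experiment
    y_tilde = alpha * y_nominal + (1.0 - alpha) * y_diag
    return alpha, y_tilde

# Hypothetical usage with placeholder inputs (not the estimates from the case study):
# alpha, y_tilde = combine_estimates(y_nominal=0.99, y_diag=1.05,
#                                    sigma2_I=0.10, sigma2=0.40, B=80, R=125, n=1000)
```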

Conclusions
In this article we presented a method to quantify the overall impact of input-model uncertainty on simulation-based performance estimates, to identify which input models make the largest contribution to this uncertainty, and to identify the input data sources from which additional real-world observations would lead to the greatest reduction in input uncertainty. Our approach builds on Ankenman and Nelson (2012), but it obtains the contribution and sensitivity results from the same experiment that they used only to measure the overall input uncertainty.
Our philosophy is to try to obtain a lot of useful information without substantial additional effort or sophisticated computation. In fact, all we require is the capability to sample from the ecdf or fitted parametric distribution of the real-world input data, which is necessary for doing simulation anyway, and least-squares regression to fit the mean-variance model. As a proof of concept, our diagnostic experiment and uncertainty measures have been implemented as a standard feature in Simio.
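To indicate how little machinery is involved, the sketch below resamples from the ecdf of each real-world data set and fits a metamodel that is linear in the sample means and variances of the resampled inputs by ordinary least squares. The "simulation" and the data are synthetic placeholders, and the regression is shown only to illustrate the kind of computation required, not the paper's exact estimators.

```python
# Minimal sketch: sample from each ecdf (bootstrap), run a placeholder simulation under the
# refitted inputs, and fit a mean-variance metamodel by least squares.
import numpy as np

rng = np.random.default_rng(1)

def run_simulation(boot_samples, rng):
    """Placeholder for averaging R replications of the real simulation model."""
    return sum(s.mean() for s in boot_samples) + rng.normal(scale=0.05)

# Synthetic stand-ins for the real-world data sets.
real_data = [rng.exponential(2.0, size=200), rng.exponential(1.0, size=100)]

B = 80
rows, responses = [], []
for _ in range(B):
    boot = [rng.choice(d, size=len(d), replace=True) for d in real_data]   # sample from each ecdf
    feats = [1.0]
    for s in boot:
        feats += [s.mean(), s.var(ddof=1)]        # mean and variance of each input's resample
    rows.append(feats)
    responses.append(run_simulation(boot, rng))

X, y = np.array(rows), np.array(responses)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares fit of the mean-variance metamodel
print("fitted coefficients:", np.round(coef, 3))
```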
To achieve all of this we modeled the relationship between the input distributions and the simulation's output through the means and variances of these distributions, without interactions. We know that higher moments can matter, and there certainly could be interactions. However, our goal is to rank the input distributions as to their contributions to input uncertainty, rather than to precisely estimate each contribution. For this goal, a first-order model that characterizes a distribution by its mean and variance will often suffice, as illustrated by our experiments.
Open questions remain about the total budget N that should be expended on the diagnostic experiment to obtain reliable results. We suspect that no blanket recommendation is possible, but instead N would have to be discovered adaptively. Important extensions include assessing the contributions of multivariate (e.g., age, health status, and income of customers) and time-dependent (e.g., arrivals to a call center) input processes.