General-purpose ranking and selection for computer simulation

ABSTRACT Many indifference-zone Ranking-and-Selection (R&S) procedures have been invented for choosing the best simulated system. To obtain the desired Probability of Correct Selection (PCS), existing procedures exploit knowledge about the particular combination of system performance measure (e.g., mean, probability, variance, quantile) and assumed output distribution (e.g., normal, exponential, Poisson). In this article, we take a step toward general-purpose R&S procedures that work for many types of performance measures and output distributions, including situations where different simulated alternatives have entirely different output distribution families. There are only two versions of our procedure: with and without the use of common random numbers. To obtain the required PCS we exploit intense computation via bootstrapping, and to mitigate the computational effort we create an adaptive sample-allocation scheme that guides the procedure to quickly reach the necessary sample size. We establish the asymptotic PCS of these procedures under very mild conditions and provide a finite-sample empirical evaluation of them as well.


Introduction
Although invented in the 1950s and 1960s for application to biostatistics problems, the statistical methods of Ranking and Selection (R&S) have been embraced by the stochastic simulation community as a standard tool for selecting the best of a finite (and relatively small) number of alternative system designs. In fact, R&S procedures are featured in several commercial simulation products. A common characteristic of selection-of-the-best procedures that have been extensively used in simulation is that "best" is defined to be the smallest or largest mean performance. Furthermore, whether Bayesian or frequentist in philosophy, these procedures typically assume that the simulation output data are normally distributed; even procedures that are shown to be asymptotically valid under more general assumptions are derived based on normality. Later we provide a realistic illustration where the output data are highly non-normal.
There is also a literature on R&S problems that differ from the normal-mean case. For instance, there are procedures that define "best" to be the largest or smallest probability, variance, or qth quantile. Also, there are procedures designed for output data that are known to be non-normal, including Poisson, Bernoulli, and exponential. The procedures for these situations are customized for the particular performance measure or type of data, exploiting the mathematical-statistical properties of the relevant estimator or distribution.
Suppose, however, that the alternatives under consideration involve distinct technologies: manual versus automated; in-house versus outsourced; or synthetic versus biological. Not only may the simulation output data be non-normal;
there is no a priori reason to believe that the alternatives even generate output data from the same distribution family. The R&S literature provides no direct approach for such settings, other than batching the output data to achieve approximate normality. In this article, we take a big step toward general-purpose R&S, by which we mean procedures that work for many types of performance measures (e.g., means, probabilities, variances, or quantiles) and types of data (discrete- or continuous-valued, with almost arbitrary distributions), where not all systems need to have the same output distribution family. We exploit intense computation, via bootstrapping, instead of clever mathematical analysis. To do this we employ a connection between fixed-width Confidence Intervals (CIs) and the desired Probability of Correct Selection (PCS). Our approach is sequential and frequentist in philosophy and incorporates an indifference zone. Finally, our procedures are fixed-precision (as measured by the PCS), not fixed-budget. Therefore, our first concern is to be able to deliver the desired PCS on virtually any problem, and a secondary (but still important) concern is doing so efficiently. This means that we simulate until a specified PCS is achieved and stop, rather than trying to smartly allocate a fixed budget to achieve the best PCS attainable with it.
As we substitute computation for analysis, our generic procedure will not be competitive when simulation output data are so computationally cheap that we can simulate each alternative system until its point estimator has effectively zero variance. We will also not beat procedures that directly exploit (correct) distributional information; for instance, if we know that our output data are really Poisson, then we expect a procedure based on that knowledge to be more efficient than ours, although in our empirical studies we come surprisingly close (Lee and Nelson, 2014). On the other hand, we make only very mild assumptions about the output data, and there are only two versions of our procedure: with or without Common Random Numbers (CRNs). When CRNs are employed we require no conservative inequalities to establish the validity of our procedure. The PCS guarantees are proven asymptotically, but because they bootstrap the actual simulation output data, our procedures work well in finite samples across a variety of situations.
Preliminary versions of the procedures described in this article and a small empirical evaluation of them were presented in Lee and Nelson (2014). Here we prove their asymptotic validity and supplement the empirical results in Lee and Nelson (2014) with a more carefully chosen set of experiments. Also, we introduce a significant computational speed-up that allows our methods to be applied very rapidly to small problems (say 10 or fewer alternatives) and to be computationally feasible for larger problems.
The article is organized as follows. In Section 2 we describe the related literature that supports our work. Sections 3 and 4 present our R&S procedures and prove their asymptotic validity. A method for significantly reducing the number of sequential steps required to reach a correct selection is introduced in Section 5. Sections 6 and 7 contain experimental results. Conclusions are offered in the final section.

Background
Comprehensive treatments of R&S outside of computer simulation can be found in Gupta and Panchapakesan (1979) and Bechhofer et al. (1995). Simulation-focused surveys are provided by Kim and Nelson (2006b) and Kim (2013). Although this background is relevant, our approach to creating an R&S procedure is different in that we exploit a connection between indifference-zone-δ selection of the best and fixed-width-δ CIs.
Let $X_{ij}$ represent the $j$th observed output of system $i$, for $i = 1, 2, \ldots, k$, so that $\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{kj})'$ is a $k \times 1$ vector representing the $j$th observed output across all systems. Throughout the article we assume that $X_{i1}, X_{i2}, \ldots$ are independent and identically distributed (i.i.d.) with marginal distribution $F_i(x) = \Pr\{X_{ij} \le x\}$. When we employ CRNs it will be useful to think of $\mathbf{X}_1, \mathbf{X}_2, \ldots$ as i.i.d. with common joint distribution function $F(\mathbf{x}) = \Pr\{X_{1j} \le x_1, \ldots, X_{kj} \le x_k\}$, $\mathbf{x} = (x_1, \ldots, x_k) \in \mathbb{R}^k$. We neither assume nor fit any specific distribution to the simulation output. We use boldface type, as in $\mathbf{X}_j$, to indicate a vector of observations across all $k$ systems, and underlining, as in $\underline{X}_{in}$, to indicate a vector of $n$ observations from system $i$. A hat ($\hat{\ }$) denotes an estimator, and we append a $*$ to quantities defined by bootstrapping. Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_k)'$ be a vector whose $i$th element is a statistical property of the marginal distribution $F_i$, such as its mean, variance, a quantile, or a probability. We are interested in finding the sample size that allows us to select the system with the largest value of $\theta_i$, with a specified PCS, by choosing the one with the largest empirical estimate $\hat{\theta}_i$. Notice that the objective is finding an (ideally small) sample size that satisfies the PCS constraint.
For $k \ge 2$ systems, suppose that we can build fixed-width-$\delta$ CIs for all $\theta_i - \theta_j$, $i \ne j$, with simultaneous coverage $1 - \alpha$; that is, suppose we have
$$\Pr\left\{ \hat{\theta}_i - \hat{\theta}_j - \delta \le \theta_i - \theta_j \le \hat{\theta}_i - \hat{\theta}_j + \delta,\ \forall i \ne j \right\} \ge 1 - \alpha. \qquad (1)$$
As shown in Hsu (1996), it then follows that, with probability greater than or equal to $1 - \alpha$,
$$\theta_i - \max_{j \ne i} \theta_j \ge \hat{\theta}_i - \max_{j \ne i} \hat{\theta}_j - \delta$$
for $i = 1, 2, \ldots, k$. If $M$ is the index of the system with the largest performance estimate (i.e., $\hat{\theta}_M \ge \hat{\theta}_i$ for all $i \ne M$), then it follows from Equation (1) that, with probability at least $1 - \alpha$,
$$\theta_M \ge \max_{i \ne M} \theta_i - \delta.$$
This result implies that if we select the system with the largest performance estimate $\hat{\theta}_M$ as the best system, the selected system will be the best system or a system within $\delta$ of the best system with probability at least $1 - \alpha$. Moreover, if the difference between the largest and the second-largest parameter value is strictly greater than $\delta$, then the selected system is the best system with probability at least $1 - \alpha$. This is exactly the desired inference for an indifference-zone R&S procedure. In R&S, $\delta$ is a user-specified parameter that corresponds to the smallest practically significant difference worth detecting. Thus, if we have a procedure to create fixed-width-$\delta$ CIs for $\theta_i - \theta_j$ with overall coverage $1 - \alpha$, then we also have a selection-of-the-best procedure.

Swanepoel et al. (1983) described a sequential bootstrapping procedure for generating a single fixed-width CI with a specified coverage probability $1 - \alpha$ when $\theta$ is either the mean or the median. Given an i.i.d. sample of size $n$, denoted $\underline{X}_n = \{X_1, X_2, \ldots, X_n\}$, from a population with distribution $F$ having a distributional property $\theta$, let $F_n$ denote the empirical cumulative distribution function (e.c.d.f.) of $\underline{X}_n$, defined as
$$F_n(x) = \frac{1}{n} \sum_{j=1}^{n} I(X_j \le x).$$
Let $\hat{\theta}_n$ be the corresponding distributional property of $F_n$. Furthermore, let $\underline{X}_n^* = \{X_1^*, X_2^*, \ldots, X_n^*\}$ denote a random sample of size $n$ from $F_n$, $F_n^*$ the implied e.c.d.f., and $\hat{\theta}(\underline{X}_n^*)$ (also denoted by $\hat{\theta}_n^*$) the corresponding distributional property of $F_n^*$. The bootstrap estimator of the probability that $\theta$ is contained within the interval $[\hat{\theta}_n - \delta, \hat{\theta}_n + \delta]$ is
$$P_n^* = \Pr_*\left\{ |\hat{\theta}_n^* - \hat{\theta}_n| \le \delta \right\}. \qquad (2)$$
Exact computation of $P_n^*$ is often difficult, but Equation (2) can be estimated, given $B$ random samples of size $n$ from $F_n$, say $\underline{X}_{nb}^* = \{X_{1b}^*, X_{2b}^*, \ldots, X_{nb}^*\}$, $b = 1, 2, \ldots, B$, by using
$$P_{nB}^* = \frac{1}{B} \sum_{b=1}^{B} I\left( |\hat{\theta}_{nb}^* - \hat{\theta}_n| \le \delta \right),$$
where $\hat{\theta}_{nb}^*$, $b = 1, 2, \ldots, B$, is the estimate of the distributional property of interest from the $b$th bootstrap sample.
In their procedure, Swanepoel et al. (1983) sequentially increased the number of observations of $X$ until the stopping time $N^* = \inf\{n \ge n_0 : P_n^* \ge 1 - \alpha\}$, where $1 - \alpha$ is the desired bootstrap coverage probability. The asymptotic properties of $N^*$ were established when $\theta$ is the mean or median of $X$: as $\delta \downarrow 0$, $\delta^2 N^* \to c$ with probability 1, and the interval $[\hat{\theta}_{N^*} - \delta, \hat{\theta}_{N^*} + \delta]$ covers $\theta$ with probability converging to $1 - \alpha$. The limit $c$ depends on the distributional property of interest: $c = E(X - \theta)^2\, z_{1-\alpha/2}^2$ when $\theta$ is the mean and $c = z_{1-\alpha/2}^2 / (4 f(\theta)^2)$ when $\theta$ is the median, where $f$ is the density function of $X$ and $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. Building from this foundation, we establish similar, but significantly more difficult, asymptotic results for fixed-width-$\delta$ simultaneous CIs for the $k(k-1)/2$ pairs of differences $\theta_i - \theta_j$. This, in turn, provides an indifference-zone R&S procedure. Furthermore, we propose a method to jump more quickly to the stopping time $N^*$, avoiding many costly evaluations of $P_{nB}^*$.

There are, of course, parametric (in particular, normal-theory) methods to construct fixed-width CIs; see, for instance, Hochberg and Tamhane (1987). More closely related to this research, Aerts and Gijbels (1993) and Hlávka (2003) proposed three-stage procedures that use bootstrapping to estimate the critical constant that determines the sample size (e.g., the $z$-value in the normal-theory case): the first stage provides a crude estimate; the second stage refines it; and the third stage jumps to the stopping time $N^*$. These methods do not directly estimate the PCS but, more important, seem difficult to generalize beyond a single CI.
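To make this single-parameter building block concrete, the following sketch (ours, for illustration only; it assumes the mean is the property of interest, grows the sample one observation at a time, and uses function names of our choosing) implements the Swanepoel-style stopping rule: resample from the e.c.d.f., estimate $P_{nB}^*$, and stop at the first $n$ with $P_{nB}^* \ge 1 - \alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)

def boot_coverage(x, delta, B=200):
    """Estimate P*_{nB}: the bootstrap probability that the point
    estimate moves by at most delta under resampling from the e.c.d.f."""
    n = len(x)
    theta_hat = x.mean()                       # theta is the mean here
    idx = rng.integers(0, n, size=(B, n))      # B resamples with replacement
    theta_star = x[idx].mean(axis=1)
    return np.mean(np.abs(theta_star - theta_hat) <= delta)

def sequential_ci(sample_fn, delta, alpha=0.05, n0=20, B=200):
    """Add one observation at a time until the bootstrap coverage
    estimate reaches 1 - alpha; return N* and the final interval."""
    x = sample_fn(n0)
    while boot_coverage(x, delta, B) < 1 - alpha:
        x = np.append(x, sample_fn(1))
    return len(x), (x.mean() - delta, x.mean() + delta)

# Example: fixed-width CI for an exponential mean.
N_star, ci = sequential_ci(lambda m: rng.exponential(5.0, m), delta=0.5)
```

In practice the one-at-a-time increment is exactly what the jump-ahead method of Section 5 is designed to avoid; the sketch is meant only to show the stopping logic.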

Procedures
In this section we describe algorithms for performing R&S for $k \ge 2$ systems using the bootstrap-based fixed-width CI approach. We present two versions of the algorithm, one that exploits CRNs and one without CRNs. An algorithm that does not use CRNs was presented in Bekki et al. (2010), where the sample size was incremented one at a time. We generalize the algorithm here to allow $\Delta n \ge 1$ additional observations on each iteration; this has the effect of speeding up the algorithm at the possible cost of taking more observations than necessary to guarantee a correct selection. Later we present a method for adaptively choosing $\Delta n$ that provides a substantial computational speed-up without noticeable overshoot of the PCS.
First, we describe the procedure without using CRNs. Let $\underline{X}_{in} = \{X_{i1}, X_{i2}, \ldots, X_{in}\}$ be a sample of size $n$ from a system with output distribution $F_i$ having distributional property $\theta_i$, and let $F_{in}$ be the e.c.d.f. based on $\underline{X}_{in}$ for system $i = 1, 2, \ldots, k$. Let $\hat{\theta}(\underline{X}_{in})$ be an estimate of $\theta_i$ based on $\underline{X}_{in}$ for $i = 1, 2, \ldots, k$ and $\hat{\theta}_{ij}(\underline{X}_n) = \hat{\theta}(\underline{X}_{in}) - \hat{\theta}(\underline{X}_{jn})$ for all $i \ne j$. We want to build simultaneous fixed-width-$\delta$ confidence intervals for all pairs of differences $\theta_i - \theta_j$, $i \ne j$, by finding $n$ such that
$$\Pr\left\{ \hat{\theta}_{ij}(\underline{X}_n) - \delta \le \theta_{ij} \le \hat{\theta}_{ij}(\underline{X}_n) + \delta,\ \forall i \ne j \right\} \ge 1 - \alpha,$$
where $\theta_{ij} = \theta_i - \theta_j$. The value of $n$ will be the smallest one for which the estimated coverage probability using bootstrapping is at least $1 - \alpha$. Specifically, given $B$ bootstrap samples of size $n$, the bootstrap coverage probability is estimated by
$$P_{nB}^* = \frac{1}{B} \sum_{b=1}^{B} I\left( \left|\hat{\theta}_{ij}(\underline{X}_{nb}^*) - \hat{\theta}_{ij}(\underline{X}_n)\right| \le \delta,\ \forall i \ne j \right).$$
The procedure without CRNs described below starts with a sample of size $N = n_0$ from each system $i = 1, 2, \ldots, k$, a desired PCS $1 - \alpha$, a half width (indifference-zone parameter) $\delta$ for the CIs, and a sample-size increment $\Delta n$.
If $P_{NB}^* \ge 1 - \alpha$, select $\operatorname{argmax}_{i=1,\ldots,k} \hat{\theta}(\underline{X}_{iN})$ as the best system. Else, obtain $\underline{X}_i^{\Delta n}$, a sample of size $\Delta n$ from the distribution $F_i$, for $i = 1, 2, \ldots, k$; set $\underline{X}_{iN} = \underline{X}_{iN} \cup \underline{X}_i^{\Delta n}$ for $i = 1, 2, \ldots, k$ and $N = N + \Delta n$; and go to Step 3. End If.
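For concreteness, here is a minimal sketch of the procedure without CRNs (ours, not the published listing; it takes the mean as the property of interest and a fixed increment, and all function names and numerical settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def pcs_estimate(samples, delta, B=200):
    """Bootstrap estimate P*_{NB} of simultaneous fixed-width-delta
    coverage for all pairwise differences; without CRNs each system
    is resampled from its own marginal e.c.d.f."""
    theta = np.array([s.mean() for s in samples])      # theta is the mean here
    covered = 0
    for _ in range(B):
        boot = np.array([rng.choice(s, size=len(s)).mean() for s in samples])
        diffs = (boot[:, None] - boot[None, :]) - (theta[:, None] - theta[None, :])
        covered += np.all(np.abs(diffs) <= delta)
    return covered / B

def bootstrap_rs(sample_fns, delta, alpha=0.05, n0=50, dn=10, B=200):
    """Bootstrap R&S without CRNs: grow every system's sample by dn
    until P*_{NB} >= 1 - alpha, then pick the largest point estimate."""
    samples = [f(n0) for f in sample_fns]
    while pcs_estimate(samples, delta, B) < 1 - alpha:
        samples = [np.append(s, f(dn)) for s, f in zip(samples, sample_fns)]
    best = int(np.argmax([s.mean() for s in samples]))
    return best, len(samples[0])

# Example: two exponential systems with means 5 and 5.5.
best, N = bootstrap_rs([lambda m: rng.exponential(5.0, m),
                        lambda m: rng.exponential(5.5, m)], delta=0.5)
```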
We next present the bootstrap R&S procedure that exploits CRNs. The goal of CRNs is to induce a positive covariance between $\hat{\theta}(\underline{X}_{iN})$ and $\hat{\theta}(\underline{X}_{jN})$, which reduces the variance of the difference $\hat{\theta}_{ij}(\underline{X}_N)$ and, in turn, tends to reduce the sample size $N$ required to attain the desired PCS relative to independent sampling, as shown later in Section 6. We recommend using CRNs wherever they can be effectively employed. In the algorithm with CRNs, a sample is taken from each of the $k$ systems using CRNs across systems to induce a joint distribution on $\{F_1, F_2, \ldots, F_k\}$; we denote that distribution by $F$. Correspondingly, we draw bootstrap samples from the joint e.c.d.f. $F_N$, rather than from each marginal e.c.d.f. $F_{iN}$. Below we sketch only the step that changes from the bootstrap R&S procedure without CRNs.
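Under the same illustrative setting as the sketch above, and assuming the CRN outputs are stored as an $N \times k$ matrix whose $j$th row is $\mathbf{X}_j$, the resampling step becomes:

```python
import numpy as np

rng = np.random.default_rng(3)

def pcs_estimate_crn(X, delta, B=200):
    """CRN variant of the bootstrap PCS estimate: X is an N x k matrix
    of output vectors generated with common random numbers.  Whole rows
    are resampled from the joint e.c.d.f. F_N, so the induced dependence
    across systems is preserved in each bootstrap sample."""
    N, k = X.shape
    theta = X.mean(axis=0)
    covered = 0
    for _ in range(B):
        rows = rng.integers(0, N, size=N)      # sample N rows from F_N
        boot = X[rows].mean(axis=0)
        diffs = (boot[:, None] - boot[None, :]) - (theta[:, None] - theta[None, :])
        covered += np.all(np.abs(diffs) <= delta)
    return covered / B
```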
Remark 1: Although our algorithms state that we form the e.c.d.f.s of the output data, we do not need to actually create the e.c.d.f.s since bootstrapping simply requires random sampling from the data with replacement. However, the e.c.d.f.s are needed for the proofs presented in Section 4.

Asymptotic analysis
This section provides theoretical support for the R&S procedures introduced in the previous section. Recall that our goal is to achieve the desired PCS on virtually any problem, taking only as many observations as needed. For an indifference-zone R&S procedure, the most difficult case is when the true differences (which are unknown in practice) are small and we still demand to detect them. Recall that $\delta$ is a fixed quantity that represents the smallest difference that is practically important to the user. Therefore, we employ an asymptotic regime in which we let the indifference-zone parameter $\delta$ go to 0. This drives the required sample size $N^*$ (obtained from the procedures) to infinity and, hence, we show that the procedures achieve the desired PCS asymptotically.
The theorems stated below and proved in the Appendix extend Swanepoel et al. (1983) from a CI for a univariate mean or median to simultaneous CIs for the means or any quantile from k ≥ 2 systems. We then show in the corollaries that these results justify our use of bootstrap R&S for either mean or quantile performance measures by providing simultaneous CIs for all pairwise differences.
We first review the key notation. Let $\underline{\mathbf{X}}_n = \{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\}$ be a random sample of size $n$ from distribution $F$ (in $\mathbb{R}^k$) with a $k \times 1$ vector of marginal distribution properties $\Theta$, where $\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{kj})'$, $j = 1, 2, \ldots, n$. Furthermore, let $F_n(\mathbf{x})$ be the e.c.d.f. based on $\underline{\mathbf{X}}_n$, defined in two different ways for use in the procedure without CRNs, as in Equation (6), and with CRNs, as in Equation (7):
$$F_n(\mathbf{x}) = \prod_{i=1}^{k} F_{in}(x_i), \quad \text{where } F_{in}(x_i) = \frac{1}{n}\sum_{j=1}^{n} I(X_{ij} \le x_i), \qquad (6)$$
$$F_n(\mathbf{x}) = \frac{1}{n}\sum_{j=1}^{n} I(\mathbf{X}_j \le \mathbf{x}). \qquad (7)$$
The results below are valid in either case. Let $\underline{\mathbf{X}}_n^* = \{\mathbf{X}_1^*, \mathbf{X}_2^*, \ldots, \mathbf{X}_n^*\}$ denote a random sample of size $n$ from $F_n$. Let $\Pr$ and $\Pr_*$ denote probabilities under $F$ and $F_n$, and let $E$ and $E_*$ denote expectations under $F$ and $F_n$, respectively. Probabilities and expectations under $*$ are equivalent to estimating them using $B \to \infty$ bootstrap samples.
The bootstrap stopping variable $N^*$ is given by
$$N^* = \inf\left\{ n \ge n_0 : \Pr_*\left\{ -\delta \mathbf{1}_k \le \hat{\Theta}(\underline{\mathbf{X}}_n^*) - \hat{\Theta}(\underline{\mathbf{X}}_n) \le \delta \mathbf{1}_k \right\} \ge 1 - \alpha \right\}, \qquad (8)$$
where $\mathbf{1}_k$ is the $k \times 1$ column vector of ones. The inequality between vectors used in Equation (8) is defined elementwise: $\mathbf{x} \le \mathbf{y}$ if $x_i \le y_i$ for $i = 1, 2, \ldots, k$, where $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ and $\mathbf{y} = (y_1, y_2, \ldots, y_k)'$. When $\Theta = E(\mathbf{X})$, then $\hat{\Theta}(\underline{\mathbf{X}}_n)$ and $\hat{\Theta}(\underline{\mathbf{X}}_n^*)$ are the sample mean vectors based on $\underline{\mathbf{X}}_n$ and $\underline{\mathbf{X}}_n^*$, respectively; that is, $\hat{\Theta}(\underline{\mathbf{X}}_n) = \bar{\mathbf{X}}_n = \sum_{j=1}^{n} \mathbf{X}_j / n$ and $\hat{\Theta}(\underline{\mathbf{X}}_n^*) = \bar{\mathbf{X}}_n^* = \sum_{j=1}^{n} \mathbf{X}_j^* / n$. Notice that the "mean" case includes probabilities, as they are expected values of indicator outputs. We assume that $n_0$ in Equation (8) grows as $\delta \downarrow 0$, as in Swanepoel et al. (1983).
Before we state our theorems, we introduce the following definition: for any $c > 0$ and positive-definite covariance matrix $\Sigma$, let $\Phi_\Sigma(c) = \Pr\{-c\mathbf{1}_k \le \mathbf{Z} \le c\mathbf{1}_k\}$, where $\mathbf{Z}$ has a mean-zero multivariate normal distribution with covariance matrix $\Sigma$. From here on, whenever we raise a vector or matrix to a power, as in $\mathbf{X}^2$, we mean element-by-element exponentiation. Matrix multiplications are written explicitly, as in $\mathbf{X}\mathbf{X}'$.
Theorem 1. Suppose $\Theta = E(\mathbf{X})$ and the covariance matrix $\Sigma$ of $\mathbf{X}$ is positive definite, and consider $N^*$ as defined in Equation (8). (a) As $\delta \downarrow 0$, we have $\delta^2 N^* \to a_{1-\alpha}^2$ with probability 1, where $a_{1-\alpha} = \Phi_\Sigma^{-1}(1 - \alpha)$. (b) As $\delta \downarrow 0$, $\Pr\{-\delta\mathbf{1}_k \le \hat{\Theta}(\underline{\mathbf{X}}_{N^*}) - \Theta \le \delta\mathbf{1}_k\} \to 1 - \alpha$.

Notice that the assumptions are very weak when $\Theta = E[\mathbf{X}]$ and the statistic is a sample average of i.i.d. outputs. Suppose next that $\Theta$ is the vector of variances of the $k$ marginal distributions, whose $i$th element is $\theta_i = E(X_i - \mu_i)^2$, $i = 1, 2, \ldots, k$. Unless $\mu_i$ is known, $\theta_i$ is not the expected value of the average of i.i.d. observations. Instead, $\hat{\Theta}(\underline{\mathbf{X}}_n)$ and $\hat{\Theta}(\underline{\mathbf{X}}_n^*)$ are the sample variance vectors based on $\underline{\mathbf{X}}_n$ and $\underline{\mathbf{X}}_n^*$, respectively, where the $i$th element of $\hat{\Theta}(\underline{\mathbf{X}}_n)$ (denoted by $S_{in}^2$) is the sample variance of $X_{i1}, X_{i2}, \ldots, X_{in}$ and the $i$th element of $\hat{\Theta}(\underline{\mathbf{X}}_n^*)$ (denoted by $S_{in}^{*2}$) is the sample variance of $X_{i1}^*, X_{i2}^*, \ldots, X_{in}^*$; that is,
$$S_{in}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_{ij} - \bar{X}_{in}\right)^2, \qquad S_{in}^{*2} = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_{ij}^* - \bar{X}_{in}^*\right)^2.$$
Nevertheless, we can establish analogous asymptotic results for the sample variance in a manner very similar to Theorem 1, so we omit the proof.
Corollary 1. Let $\Theta$ be the vector of marginal variances defined above and let $N^*$ be as defined in Equation (8). (a) As $\delta \downarrow 0$, we have $\delta^2 N^* \to a_{1-\alpha}^2$ with probability 1, where $a_{1-\alpha} = \Phi_\Sigma^{-1}(1 - \alpha)$ and $\Sigma$ is now the covariance matrix of the squared deviations $((X_1 - \mu_1)^2, \ldots, (X_k - \mu_k)^2)'$. (b) As $\delta \downarrow 0$, $\Pr\{-\delta\mathbf{1}_k \le \hat{\Theta}(\underline{\mathbf{X}}_{N^*}) - \Theta \le \delta\mathbf{1}_k\} \to 1 - \alpha$.

Next let $\Theta$ be a vector of specific quantiles of the $k$ marginal distributions, whose $i$th element is $\theta_i = F_i^{-1}(q) = \inf\{x : F_i(x) \ge q\}$ for a given $0 < q < 1$. Then $\hat{\Theta}(\underline{\mathbf{X}}_n)$ and $\hat{\Theta}(\underline{\mathbf{X}}_n^*)$ are the sample $q$th quantile vectors based on $\underline{\mathbf{X}}_n$ and $\underline{\mathbf{X}}_n^*$, respectively, where the $i$th element of $\hat{\Theta}(\underline{\mathbf{X}}_n)$ is the sample $q$th quantile of $X_{i1}, X_{i2}, \ldots, X_{in}$ and the $i$th element of $\hat{\Theta}(\underline{\mathbf{X}}_n^*)$ is the sample $q$th quantile of $X_{i1}^*, X_{i2}^*, \ldots, X_{in}^*$. Stronger assumptions on the distributions $F_i$ are required to obtain similar asymptotic results for quantiles.

Theorem 2. Let $F_i$ be twice continuously differentiable in a neighborhood of $\theta_i$ and $\xi_i = f_i(\theta_i) > 0$ for $i = 1, 2, \ldots, k$, where $f_i$ is the density associated with $F_i$. Furthermore, let $F_{ij}$ be the $(i, j)$th bivariate marginal distribution function, and let $\Sigma$ be the $k \times k$ matrix with elements $\sigma_{ij} = (F_{ij}(\theta_i, \theta_j) - q^2)/(\xi_i \xi_j)$. Consider $N^*$ as defined in Equation (8) and suppose that $\Sigma$ is positive definite. (a) As $\delta \downarrow 0$, we have $\delta^2 N^* \to a_{1-\alpha}^2$ with probability 1, where $a_{1-\alpha} = \Phi_\Sigma^{-1}(1 - \alpha)$. (b) As $\delta \downarrow 0$, $\Pr\{-\delta\mathbf{1}_k \le \hat{\Theta}(\underline{\mathbf{X}}_{N^*}) - \Theta \le \delta\mathbf{1}_k\} \to 1 - \alpha$. Proof. See Appendix A3.
The results above justify simultaneous fixed-width-$\delta$ CIs for $k$ means or quantiles based on bootstrapping. They establish the asymptotic validity of our generic R&S procedure when we extend them to all pairs of difference estimates using the linear transformation $A$ defined as
$$A\Theta = (\theta_1 - \theta_2, \theta_1 - \theta_3, \ldots, \theta_{k-1} - \theta_k)', \qquad (9)$$
where each row of the $k(k-1)/2 \times k$ matrix $A$ has a $1$ in position $i$, a $-1$ in position $j$, and zeros elsewhere, for one pair $i < j$. The linear transformation $A$ maps vectors in $\mathbb{R}^k$ to vectors of all pairwise differences in $\mathbb{R}^{k(k-1)/2}$. We are now prepared to state the asymptotic validity of our generic R&S procedures in Corollaries 2 and 3. The stopping time used in our procedures is
$$N_A^* = \inf\left\{ n \ge n_0 : \Pr_*\left\{ -\delta\mathbf{1} \le A\hat{\Theta}(\underline{\mathbf{X}}_n^*) - A\hat{\Theta}(\underline{\mathbf{X}}_n) \le \delta\mathbf{1} \right\} \ge 1 - \alpha \right\}, \qquad (10)$$
where $\mathbf{1}$ is the $k(k-1)/2 \times 1$ vector of ones.

Corollary 2. Under the same assumptions as in Theorem 1, consider $N_A^*$ as defined in Equation (10). (a) As $\delta \downarrow 0$, we have $\delta^2 N_A^* \to a_{1-\alpha}^2$ with probability 1, where $a_{1-\alpha} = \Phi_{A\Sigma A'}^{-1}(1 - \alpha)$ and $\Sigma$ is defined as in Theorem 1. (b) As $\delta \downarrow 0$, $\Pr\{-\delta\mathbf{1} \le A\hat{\Theta}(\underline{\mathbf{X}}_{N_A^*}) - A\Theta \le \delta\mathbf{1}\} \to 1 - \alpha$.

Proof. See Appendix A4.
Next let $\hat{\Theta}$ be the vector of sample quantiles defined in Theorem 2.
Corollary 3. Under the same assumptions as in Theorem 2, consider $N_A^*$ as defined in Equation (10). (a) As $\delta \downarrow 0$, we have $\delta^2 N_A^* \to a_{1-\alpha}^2$ with probability 1, where $a_{1-\alpha} = \Phi_{A\Sigma A'}^{-1}(1 - \alpha)$ and $\Sigma$ is defined as in Theorem 2. (b) As $\delta \downarrow 0$, $\Pr\{-\delta\mathbf{1} \le A\hat{\Theta}(\underline{\mathbf{X}}_{N_A^*}) - A\Theta \le \delta\mathbf{1}\} \to 1 - \alpha$.
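As a computational aside (our illustration, not from the article), the pairwise-difference transformation $A$ is easy to construct mechanically for any $k$:

```python
import numpy as np
from itertools import combinations

def pairwise_diff_matrix(k):
    """Build the k(k-1)/2 x k matrix A whose rows map a vector of k
    system estimates to all pairwise differences theta_i - theta_j."""
    A = np.zeros((k * (k - 1) // 2, k))
    for row, (i, j) in enumerate(combinations(range(k), 2)):
        A[row, i], A[row, j] = 1.0, -1.0
    return A

# For k = 3: rows give theta_1-theta_2, theta_1-theta_3, theta_2-theta_3.
A = pairwise_diff_matrix(3)
```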
The assumptions that support our theorems are weaker than those that support other $\delta \to 0$ analyses in R&S. For instance, Robbins et al. (1968) assumed that output data from each system are independent and have a common, finite variance. Damerdji et al. (1996) also assumed that output data from each system are independent and, further, that their variances are known, along with other conditions on the distribution mean, variance, and absolute centered moments. Similarly, Kim and Nelson (2006a) assumed that the output data from each system are mutually independent and that the standardized partial sum of each system's output converges to a Brownian motion process.
Remark 2: Notice that means (which include probabilities), variances, and quantiles cover much of what is used in practice. One possible extension of our asymptotic results is when the performance measure $\theta$ is a function, such as a cost, of (say) the mean: $\theta = g(\mu)$, where we can estimate $\mu$. Then our procedure can still be shown to be asymptotically valid for continuously differentiable $g$ using a delta-method argument.
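As an illustration of Remark 2 (the cost function $g$ below is hypothetical, chosen only to be continuously differentiable), the same bootstrap machinery applies with $g$ composed on top of the plug-in estimator:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical smooth cost as a function of the mean; any continuously
# differentiable g fits the delta-method argument in Remark 2.
def g(mu):
    return 100.0 / (1.0 + mu)

def boot_estimates_g(x, B=200):
    """Bootstrap replicates of theta = g(sample mean): the resampling
    step of the procedures is unchanged; only the statistic changes."""
    n = len(x)
    idx = rng.integers(0, n, size=(B, n))
    return g(x[idx].mean(axis=1))

# Example: spread of g(mean) under resampling for exponential output.
theta_star = boot_estimates_g(rng.exponential(5.0, 200))
```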

Jump-ahead procedure
In the generic procedures described in Section 3, the number of observations $N$ obtained from each simulated system is increased until the estimated PCS is at least $1 - \alpha$. The sample-size increment on each iteration, $\Delta n$, can be made larger than one to speed up the procedure, and this is important because estimating the PCS is a non-trivial calculation for large $k$. In Lee and Nelson (2014) we suggested $\Delta n = 10$, but this could be too aggressive in some problems and unnecessarily conservative (small) when the stopping time $N^*$ is in the hundreds or thousands. Recall that our primary goal is to provide a PCS guarantee, but we also want to attain it efficiently in the sense of keeping both the required sample size $N^*$ and the computational overhead from bootstrapping small. Bootstrapping occurs after we collect each increment $\Delta n$ of additional output data. When variances are unknown, the minimal number of stages or "jumps" that an indifference-zone R&S procedure can have is two, and this is only possible if we know the distributions of the outputs. Since we assume no distributional information, the best we can hope for is an approximation that keeps the number of jumps small.
Here we outline an adaptive method for selecting $\Delta n$ that is aggressive when $P_{NB}^*$ is far from $1 - \alpha$ but cautious as we approach it. The idea is straightforward: from any intermediate observed $(N, P_{NB}^* < 1 - \alpha)$ pair, fit a simplified normal-theory approximation for PCS as a function of $N$ and use it to project what the stopping time $N^*$ will be; call this projection $\hat{N}^*$. Then set $\Delta n = c(\hat{N}^* - N)$, where $c$ is a fraction such as 0.8, to avoid overshoot. Now simulate the additional observations, estimate the new PCS and new projection, and repeat until $P_{NB}^* \ge 1 - \alpha$. To simplify notation, let $\hat{\theta}_i = \hat{\theta}(\underline{X}_{iN})$ be the point estimator of $\theta_i$, $i = 1, 2, \ldots, k$, where the sample size $N$ will be clear from the context. Our approximation assumes that $(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k)$ are jointly normally distributed with means $(\theta_1, \theta_2, \ldots, \theta_k)$, common variance $\sigma^2/N$, and common correlation $\rho$ (which is zero unless CRNs are employed). We would like to approximate the sample size $N^*$ for which
$$\Pr\left\{ \left|(\hat{\theta}_i - \hat{\theta}_j) - (\theta_i - \theta_j)\right| \le \delta,\ \forall i \ne j \right\} = 1 - \alpha$$
by fitting this simplified model to the observed PCS.
Let $Z_{ij} = \hat{\theta}_i - \hat{\theta}_j - (\theta_i - \theta_j)$. Then $\mathbf{Z} = \{Z_{ij}, \forall i \ne j\}$, appropriately organized and under our simplified model, has a multivariate normal distribution with mean vector $\mathbf{0}$ and variance-covariance matrix
$$\frac{\sigma^2(1 - \rho)}{N}\, A I A',$$
where $A$ is as defined in Equation (9) and $I$ is the $k \times k$ identity matrix. Notice that only the term $\sigma^2(1 - \rho)$ is unknown, since $A$ and $I$ depend only on $k$, and $N$ is given. For $0 < \beta < 1$, let $h(\beta) = \Phi_{AIA'}^{-1}(\beta)$; $h(\beta)$ is the $\beta$ quantile of the maximum of the absolute values of the components of a mean-zero multivariate normal distribution with covariance matrix $AIA'$. Given $\beta$, $h(\beta)$ is also only a function of $k$. Then for any such $\beta$,
$$\Pr\left\{ \max_{i \ne j} |Z_{ij}| \le h(\beta)\sqrt{\frac{\sigma^2(1 - \rho)}{N}} \right\} = \beta.$$
Suppose we have run our procedure for $N$ observations and obtained estimated PCS $P_{NB}^* < 1 - \alpha$. Recall that $P_{NB}^*$ is also the estimated coverage probability; that is, for the current value of $N$, it estimates $\Pr\{\max_{i \ne j}|Z_{ij}| \le \delta\}$. We can calibrate our approximation to this observed probability by selecting $\sigma^2(1 - \rho)$ so that
$$h(P_{NB}^*)\sqrt{\frac{\sigma^2(1 - \rho)}{N}} = \delta.$$
Here we are treating the entire term $\sigma^2(1 - \rho)$ as a single parameter to match the observed PCS and not worrying about separating $\sigma^2$ and $\rho$. Given this parameter, the target sample size to obtain PCS of at least $1 - \alpha$ satisfies
$$h(1 - \alpha)\sqrt{\frac{\sigma^2(1 - \rho)}{\hat{N}^*}} = \delta,$$
so that
$$\hat{N}^* = N\left(\frac{h(1 - \alpha)}{h(P_{NB}^*)}\right)^2$$
and the $\sigma^2(1 - \rho)$ terms cancel. The intuition is that $h(1 - \alpha)^2$ is the scale-free sample-size multiplier needed to attain PCS $1 - \alpha$; thus, the ratio $(h(P_{NB}^*)/h(1 - \alpha))^2$ is the fraction of the curve we have climbed if the current sample size $N$ corresponds to a PCS of only $P_{NB}^*$. Unfortunately, calculating $h(\beta) = \Phi_{AIA'}^{-1}(\beta)$ is a difficult root-finding problem for large $k$, but we have a straightforward empirical solution: let $M = \max_{i \ne j} |Z_{ij}|$, where the $\{Z_{ij}, \forall i \ne j\}$ have a mean-zero multivariate normal distribution with covariance matrix $AIA'$. Then $h(\beta)$ is the $\beta$ quantile of $M$, whose distribution depends only on $k$. Therefore, from an i.i.d. Monte Carlo sample $M_1, M_2, \ldots, M_t$, $h(\beta)$ can be estimated by the order statistic $M_{(\lceil \beta t \rceil)}$, and only one sample is needed for any fixed $k$, as $M_{(\lceil \beta t \rceil)}$ is a function of $\beta$ once the data are sorted. The jump-ahead approximation moves our procedure from theoretically interesting to practically useful because, as we illustrate in Section 6, it allows us to find $N^*$ in a small number of jumps, thereby avoiding the computational overhead of repeatedly evaluating $P_{NB}^*$, while not overshooting and being inefficient.
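The following sketch (ours; variable names and the Monte Carlo sample size are illustrative) shows how $h(\beta)$ can be tabulated once per $k$ and the projection $\hat{N}^* = N(h(1-\alpha)/h(P_{NB}^*))^2$ turned into the next increment. It uses the fact that for $\mathbf{Z} \sim \mathrm{N}(\mathbf{0}, AIA')$, $\max_{i \ne j}|Z_{ij}|$ equals the range of $k$ i.i.d. standard normals.

```python
import numpy as np

rng = np.random.default_rng(5)

def h_table(k, t=100_000):
    """Monte Carlo sample of M = max_{i != j} |Z_ij| for Z ~ N(0, AIA'):
    draw k i.i.d. standard normals and take the largest absolute
    pairwise difference, i.e., the sample range.  One sorted sample
    serves every beta for a fixed k."""
    Z = rng.standard_normal((t, k))
    M = Z.max(axis=1) - Z.min(axis=1)
    return np.sort(M)

def jump_increment(N, pcs_hat, alpha, M_sorted, k, c=0.8):
    """Project the stopping time from the current (N, P*_NB) pair and
    return the next increment dn = c * (N_hat - N), floored at 1."""
    t = len(M_sorted)
    pcs_hat = max(pcs_hat, 1.0 / k)            # floor at pure-guessing PCS
    h_now = M_sorted[int(np.ceil(pcs_hat * t)) - 1]
    h_goal = M_sorted[int(np.ceil((1 - alpha) * t)) - 1]
    N_hat = N * (h_goal / h_now) ** 2
    return max(1, int(c * (N_hat - N)))

# Example: at N = 500 with estimated PCS 0.60 for k = 10, alpha = 0.05.
M_sorted = h_table(k=10)
dn = jump_increment(500, 0.60, 0.05, M_sorted, k=10)
```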

Empirical evaluation
In this section we present an empirical evaluation of the algorithms described in Section 3 combined with the jump-ahead procedure of the previous section. We consider k = 2, 4, 10, and 20 systems. For one set of results, the output data have a mix of exponential and normal marginal distributions, and the performance measures include the mean and 0.8-quantile. For a second set of experiments the output data have negative binomial distributions and the performance measure is the mean.
In preliminary work, favorable empirical results for a single distribution family (normal or Poisson), with $\theta$ being the mean or various quantiles and $k$ up to 10, were presented in Bekki et al. (2010) and Lee and Nelson (2014). Lee and Nelson (2014) also contains results from varying the initial sample size $n_0$, the indifference-zone parameter $\delta$, and the number of bootstrap resamples $B$; they also compared the efficiency of the algorithm with Poisson outputs to procedures designed specifically for Poisson data. In all cases the algorithms performed very well. Therefore, we focus here on the mixed-distribution case, on a second discrete case (both highly skewed and less-skewed instances), on larger numbers of systems $k$, and on the use of the jump-ahead procedure. We show below and in Appendix A6 that the algorithms achieve the desired PCS in five to ten jumps provided $n_0$ is large enough. They can be more or less efficient than the normal-theory procedure of Rinott (1978), depending on the initial sample size, but they retain their PCS guarantee for non-normal data, whereas Rinott's procedure may not.

Selecting the best mean
All results presented here are averaged over 100 macroreplications of the entire experiment. Tables 1 to 3 contain results for selecting the system with the largest mean when using CRNs with an induced correlation $\rho = 0.9$. We varied the initial sample size $n_0$, the number of systems $k$, and the configuration of the means. The number of bootstrap samples was $B = 200$, the desired confidence level was $1 - \alpha = 0.95$, and the indifference-zone parameter was $\delta = 0.1$ for all experiments. We present only CRN results, as we recommend using CRNs whenever possible. Corresponding results without CRNs are presented in Tables A1 to A3, respectively, of Appendix A6, and we compare them to the results with CRNs at the end of this section.
In Tables 1 and 2, the simulation outputs have a mix of distributions; specifically, half of the systems have normally distributed output and the other half have exponentially distributed output, with the means either in a slippage configuration or a monotone-decreasing-means configuration. The true mean vector in Table 1 is the slippage configuration $\Theta = (5, 5, \ldots, 5, 5 + \delta)'$, so system $k$ is the best. The true mean vector used in Table 2 has $\theta_i = 5 + (i - 1)\delta$ for $i = 1, 2, \ldots, k$, so system $k$ is again the best system. In the monotone-decreasing-means configuration, we alternated exponentially distributed and normally distributed systems; for example, when $k = 4$, systems 1 and 3 have normally distributed outputs, whereas systems 2 and 4 have exponentially distributed outputs; we then reverse the assignments so that the best system can be of either type. The normally distributed outputs have variance 1, whereas the exponentially distributed outputs have variance $\theta_i^2$. In the tables, "n" or "e" indicates that the best system has normally or exponentially distributed outputs, respectively.
In Table 3 negative binomial output data are used; the distribution has two parameters, the target number of successes $r$ and the probability of success $p$, which imply that the mean is $\theta = r(1 - p)/p$. We used $p = 0.5$ (therefore, $\theta = r$) for all experiments. The true mean vector used in Table 3 is the slippage configuration $(\theta, \theta, \ldots, \theta, \theta + \delta)'$ with $\theta = 1$ or $10$, implying that system $k$ is the best system. The negative binomial distribution is highly skewed when $\theta = 1$ and less skewed when $\theta = 10$.
To induce correlation representing the effect of CRNs for the mixed-distribution case, we used the NORTA method described in Nelson (2013), inducing a common correlation of 0.9 between all pairs of systems. To represent CRNs for negative binomial (NB) distributions, we set $X_i = Z + W_i$, where the $W_i$ are independent NB($\beta_i, p$) for $i = 1, 2, \ldots, k$ and $Z$ is NB($\beta_0, p$). Then the correlation between $X_i$ and $X_j$ for $i \ne j$ is
$$\mathrm{Corr}(X_i, X_j) = \frac{\beta_0}{\sqrt{(\beta_0 + \beta_i)(\beta_0 + \beta_j)}}.$$
With $p = 0.5$, when $\theta_i = \theta_j = 1$ we have $\mathrm{Corr}(X_i, X_j) = \beta_0$; when $\theta_i = \theta_j = 10$ we have $\mathrm{Corr}(X_i, X_j) = \beta_0/10$. In Table 3, $\beta_0 = 0.9$ is used to induce a common correlation of 0.9 when $\theta = 1$; similarly, we set $\beta_0 = 9$ when $\theta = 10$. The PCS is the fraction of the 100 macroreplications in which a system whose mean is within $\delta$ of the true best mean was selected. The estimated coverage probability is $P_{NB}^*$ from Step 6 of the algorithm, and the true coverage probability is the fraction of the macroreplications in which the $\binom{k}{2}$ CIs simultaneously cover all pairwise differences $\theta_i - \theta_j$ for all $i \ne j$. The average number of jumps that the jump-ahead procedure makes until $P_{NB}^* \ge 1 - \alpha$ is denoted by "Ave jumps." We quickly discovered that $P_{NB}^*$ is often 0 when $N$ is small and $k$ is large, such as when $N = n_0$ and $k \ge 10$, which provides no information to the jump-ahead procedure. Therefore, we adjusted the procedure to use $\max\{1/k, P_{NB}^*\}$, since pure guessing provides PCS $1/k$. Figure 1 shows the sample sizes generated by the jump-ahead procedure with this adjustment for a case with $k = 20$ systems in the slippage configuration and a mix of distribution types; the left-hand plot is the result of one macroreplication, whereas the right-hand plot shows 10 macroreplications. Notice that the observed PCS values are zero initially, but after a couple of jumps the procedure is able to make rapid progress without significant overshoot. In the left-hand plot, each dot denotes a pair of sample size $N$ (starting from $N = n_0$) and its corresponding estimated PCS $P_{NB}^*$ at each jump; for this example, the procedure stops at the seventh jump, as $P_{NB}^* \ge 0.95$, which results in $N^* = 50\,824$. Recall that in the best case with unknown variances R&S requires two jumps and that a fully sequential procedure uses $\Delta n = 1$ (i.e., $N^*$ jumps). We observe that five to ten jumps were made for $c = 0.8$ in all experiments; however, if the user wants to make the procedure less conservative (fewer jumps), the constant $c$ can be set closer to one.
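For reproducibility, here is a sketch of the common-shock construction just described (the code is ours; it assumes NumPy's negative binomial parameterization, with mean $r(1-p)/p$, and the settings mirror the $\theta = 1$ slippage case):

```python
import numpy as np

rng = np.random.default_rng(6)

def correlated_nb(k, theta, beta0, p=0.5, n=1):
    """Common-shock construction for CRN-like dependence across k
    negative binomial systems: X_i = Z + W_i with Z ~ NB(beta0, p) and
    independent W_i ~ NB(beta_i, p), so X_i ~ NB(beta0 + beta_i, p).
    With p = 0.5 the mean of X_i is beta0 + beta_i = theta_i."""
    beta_i = np.asarray(theta, dtype=float) - beta0   # requires theta_i > beta0
    Z = rng.negative_binomial(beta0, p, size=(n, 1))  # shared shock
    W = rng.negative_binomial(beta_i, p, size=(n, k)) # independent parts
    return Z + W

# Slippage configuration with theta = 1, delta = 0.1, beta0 = 0.9
# (common correlation close to 0.9 across the pairs).
X = correlated_nb(k=4, theta=[1, 1, 1, 1.1], beta0=0.9, n=1000)
```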
The results from all tables show that the number of jumps and the required terminal sample size $N^*$ increase as the number of systems increases, as expected. The desired PCS is achieved in all but one case, and the desired CI coverage is attained except in a few cases when $n_0 = 50$. It seems clear that using CRNs implies that we should start with a larger initial sample size $n_0$. This makes sense: when we employ CRNs we bootstrap entire vectors, rather than each marginal e.c.d.f. individually, and 50 vectors is a small number with which to represent a joint distribution, particularly when $k$ is large. As we have observed in previous papers, $n_0$ should be larger for bootstrap R&S than for distribution-specific procedures, as we need the bootstrap distribution, even at $n_0$, to be a good representation of the true distribution.
Notice that when the best system has exponentially distributed outputs, the required sample size $N^*$ tends to be larger than when the best system has normally distributed outputs, because the variances of the exponentially distributed outputs are much higher than in the normal case. The required sample size $N^*$ for the monotone-decreasing-means configuration is larger than for the slippage configuration for the same reason.
For both output-distribution cases, the use of CRNs greatly reduces the required sample size, with savings of as much as 90%. As a specific instance, in the mixed case (half of the systems normal, half exponential) with $k = 20$ and the best system exponentially distributed, the required sample size obtained from the algorithm with CRNs is 11 437 (Table 1), whereas the required sample size without CRNs is 50 257 (Table A1 in Appendix A6); when $k = 2$, the required sample size with CRNs is 7466 (Table 1), whereas without CRNs it is 11 061 (Table A1 in Appendix A6). For the case of negative binomial distributions, the sample-size savings obtained from the use of CRNs are all close to 90% (compare Tables 3 and A3).

Selecting the best 0.8-quantile
Next we present results for selecting the system with the largest 0.8-quantile, with or without CRNs when we have a mix of output distributions. In all cases we used the slippage configuration.
The best 0.8-quantile was set as the 0.8-quantile of a normal distribution with mean $5 + \delta$ and variance 1; that is, $\theta_k = 5 + \delta + \Phi^{-1}(0.8)$. The inferior systems had 0.8-quantile $\theta_i = 5 + \Phi^{-1}(0.8)$. When a distribution was exponential, we solved for its mean so that it had the desired value $\theta_k$ or $\theta_i$ for its 0.8-quantile. To incorporate CRNs, we used the NORTA method cited earlier. The results with CRNs (with induced correlation $\rho = 0.9$) are listed in Table 4, and those without CRNs are in Table A4 of Appendix A6. The PCS values are all greater than or equal to 0.95, and the CI simultaneous coverage probabilities are mostly greater than or equal to 0.95. As expected, CRNs greatly reduce the required sample size.

Comparison with Rinott's procedure
In this section we consider four different output distributions (normal, Poisson, negative binomial, and a mix in which half of the systems have normal and half have exponential output) to compare the PCS and efficiency of our bootstrap R&S algorithm without CRNs to Rinott's procedure. Rinott's procedure is a two-stage, normal-theory procedure in which the sample size from each system is proportional to its first-stage sample variance; it does not exploit CRNs but does guarantee the desired PCS when the output data are normally distributed.
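For reference, here is a bare-bones sketch of Rinott's two-stage logic (our illustration, not Rinott's published code; the constant $h$ solves Rinott's integral equation and is typically read from a table, so it is treated as an input here):

```python
import numpy as np

rng = np.random.default_rng(7)

def rinott(sample_fns, delta, h, n0):
    """Sketch of Rinott's two-stage procedure: the second-stage sample
    size grows with the first-stage sample variance, which is what
    makes the procedure sensitive to the initial sample size n0."""
    best, best_mean = None, -np.inf
    for i, f in enumerate(sample_fns):
        x = f(n0)                                          # stage 1
        Ni = max(n0, int(np.ceil((h * x.std(ddof=1) / delta) ** 2)))
        if Ni > n0:
            x = np.append(x, f(Ni - n0))                   # stage 2
        if x.mean() > best_mean:
            best, best_mean = i, x.mean()
    return best
```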
The results in Table 5 show the required sample sizes, denoted by $N_B$ and $N_R$, obtained from our bootstrap R&S algorithm and Rinott's procedure, respectively, and their corresponding estimated PCS values, denoted by $\mathrm{PCS}_B$ and $\mathrm{PCS}_R$, for each output distribution. For this table, we consider $k = 10$ systems with the slippage configuration of the means $(5, 5, \ldots, 5, 5.1)'$ for all distributions. The results show that the attained PCS values (0.33 and 0.27) from Rinott's procedure are far from the desired PCS of 0.95 for the normal/exponential mix; the sample sizes 12 619 and 12 354 from Rinott's procedure are much smaller than the sample sizes 37 383 and 37 745 obtained from our procedure, showing that Rinott's procedure under-samples in this case.
For the case of negative binomial distributions, the attained PCS from Rinott's procedure is 0.83 when $n_0 = 10$, which is significantly below the desired PCS of 0.95, whereas our bootstrap R&S algorithms achieved the desired PCS in all cases. Rinott's procedure is more sensitive to the initial sample size than our bootstrap R&S procedure; for example, when $n_0 = 10$ and $50$, $N_R < N_B$ for all distributions, whereas $N_R > N_B$ when $n_0 = 5$, even when the output data are normal, as assumed by Rinott's procedure. For instance, the sample size obtained from applying our procedure to normal distributions when $n_0 = 5$ is 2030, which is much smaller than the sample size 3494 from Rinott's procedure; moreover, our procedure achieves the desired PCS in this case. This is because Rinott's procedure is a two-stage procedure, and with a small initial sample size in the first stage it must become conservative in the second stage to guarantee the desired PCS, whereas our jump-ahead procedure takes several stages to estimate the PCS and stops as soon as it is achieved.

Illustration
This section illustrates the use of our procedure on a difficult R&S problem that Goldsman et al. (1991) employed to demonstrate methods for selecting the best system. The goal is to find the airline-reservation system with the largest expected time-to-failure (TTF). In the experimental setup, each system works if either of its two computers works. Assuming that the two computers are identical, E[TTF] depends on two parameters, a failure rate $\lambda$ and a repair rate $\mu$. In our experiment, $k = 4$ airline-reservation systems were considered, $\lambda = 1$ for all systems, and we varied the repair rates $\mu$ as shown in Table 6. Clearly, the system with the fastest repair rate is the best, but we applied the procedures as if this were not known.
Note that computer failures are rare and repair times are fast, so the resulting E[TTF] is large. The replication outputs are highly variable and non-normal (closer to exponential), so much so that Goldsman et al. (1991) batched the data before applying the two-stage R&S procedure of Rinott (1978) created for normally distributed output data. We applied our procedure directly, without any pre-processing of the data. Table 6 shows the results without and with CRNs; the CRN cases are indicated by *. Notice that the desired PCS of 0.95 is attained, provided $n_0$ is not too small, and the required sample sizes are significantly reduced by using CRNs.

Conclusions
In this article we have provided essentially generic R&S procedures for computer simulation. Under very mild conditions on the simulation output data they provide asymptotically correct selection for all of the typical performance measures: means, probabilities, variances, and quantiles. The distribution families of the simulation outputs need not be known, or even the same across systems, and there are only two versions of the procedure: without or with CRNs.
We achieved this generality by using bootstrapping to estimate the PCS at the current sample size and sequentially increasing the sample size until the bootstrap PCS is at least $1 - \alpha$. Thus, the generality comes at the cost of data storage (we need to retain the output data to bootstrap it) and calculation of the estimated PCS. To mitigate the computational expense, particularly as the number of systems $k$ increases, we provided a procedure that quickly jumps to the required sample size $N^*$.
To derive and prove the correctness of our procedures, we exploited a connection between PCS and the coverage of simultaneous CIs for all pairs of differences. Since coverage of all pairwise differences is a more demanding objective than correct selection of the best (see, for instance, Hsu (1996)), there is clearly room to make the procedures more efficient. Also, whereas our procedure is sequential, it is not an eliminating procedure, meaning that all $k$ systems receive $N^*$ replications. This provides another opportunity for savings, as reducing $k$ reduces the required effort for both the simulation and bootstrap computations.