Use of the bias-corrected parametric bootstrap in sensitivity testing/analysis to construct confidence bounds with accurate levels of coverage

Abstract Sensitivity testing often involves sequential design strategies in small-sample settings that provide binary data which are then used to develop generalized linear models. Model parameters are usually estimated via maximum likelihood methods. Often, confidence bounds relating to model parameters and quantiles are based on the likelihood ratio. In this paper, it is demonstrated how the bias-corrected parametric bootstrap, used in conjunction with approximate pivotal quantities, can provide an alternative means for constructing bounds when using a location-scale model. In small-sample settings, the coverage of bounds based on the likelihood ratio is often anticonservative due to bias in estimating the scale parameter. In contrast, bounds produced by the bias-corrected parametric bootstrap can provide accurate levels of coverage in such settings when both the sequential strategy and method for parameter estimation effectively adapt (are approximately equivariant) to the location and scale. A series of simulations illustrates this contrasting behavior in a small-sample setting when assuming a normal/probit model in conjunction with a popular sequential design strategy. In addition, it is shown how a high-fidelity assessment of performance can be attained with reduced computational effort by using the nonparametric bootstrap to resample pivotal quantities obtained from a small-scale set of parametric bootstrap simulations.


Introduction
Sensitivity tests have been used for a long time in a variety of fields, including testing of explosives (e.g., Dixon and Mood 1948; Chao and Fuh 2001; Lease et al. 2020), testing the ballistics resistance of armor (Johnson et al. 2014; Cline et al. 2020), and in biological/pharmaceutical applications (Finney 1952; Pace and Stylianou 2007). The tests are performed in a binary response setting in which each experimental unit (object) has a critical stimulus level (threshold) that cannot be observed directly, such that the object will respond only at stress levels greater than or equal to the threshold. In the case of testing explosives, the stimulus might be drop height and the binary response could be either a fire or a no-fire (e.g., Marrs et al. 2021; Lease et al. 2020). In the case of a drug tolerance test, the stimulus might be the dose of the drug administered and the binary response relates to whether or not a reaction to the drug was observed (e.g., Pace and Stylianou 2007). Each object is exposed to a single stimulus level that is determined based on a fixed experimental design or a sequential experimental strategy. The purpose of the testing is to characterize the distribution of thresholds (e.g., via fitted model parameters) across a population of interest (e.g., lot of explosives). Often, there is also an interest in extreme quantiles of the threshold distribution. Such interest, commonly combined with a very small sample size, further necessitates the reliance on an assumed parametric model for analysis. In many cases, the parametric form of the threshold distribution is also assumed for purposes of selecting the stimulus levels that are used. In the context of explosives, there is often interest in the "all-fire" and "no-fire" levels (e.g., Young and Easterling 1994; Collins, Weaver, and Hamada 2017).
Relating to reliability, the "all-fire" level refers to the stimulus level necessary to fire nearly all (e.g., .999 quantile) of the explosives with a high level of confidence (e.g., 90%). Relating to safety, the "no-fire" level refers to a stimulus level that will result in a "no-fire" for nearly all of the explosives (e.g., .001 quantile) with a high level of confidence.
Thus, there is interest in a one-sided tolerance bound for an extreme quantile (Collins, Weaver, and Hamada 2017). The objective of this paper is to propose and illustrate the use of the bias-corrected parametric bootstrap to produce confidence bounds for the model parameters (and extreme quantiles) with accurate levels of coverage. Note that this paper does not comprehensively review and/or compare sequential strategies per se. See Wang, Tian, and Wu (2020) for a recent such review and comparison.
The remainder of this paper is organized as follows. Experimental strategies, models and analysis methods commonly used to acquire and analyze sensitivity data are described in Sec. 2. Experimental strategies described include the popular sequential D-optimal method described by Neyer (1994), which is used for illustration throughout. Section 3 examines the performance of Neyer's method using simulations with maximum likelihood (ML) estimation in the context of small sample sizes and an underlying probit/normal model. Section 4 describes how a bias-corrected parametric bootstrap procedure, based on approximate pivotal quantities, can be used to produce confidence bounds. A small-scale (or pilot) simulation study demonstrates the use and efficacy of this procedure to construct tolerance bounds in the case of an extreme quantile that are superior in coverage accuracy to those produced by using the likelihood ratio. The dependence of coverage accuracy on correct model specification is also illustrated. In Sec. 5, a nonparametric bootstrap procedure that resamples the large set of simulated pivotal quantities produced by the pilot study is used to generate a higher fidelity assessment of confidence bound accuracies. Section 6 contains some concluding remarks.

Some experimental strategies, models, and methods of analysis for sensitivity data
The experimental results associated with sensitivity tests can be summarized by (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), where X_i is the stimulus level associated with the ith object tested and Y_i represents the binary response of the ith object (0 = no response, 1 = response). It is assumed that objects can be tested only once. For example, if an explosive detonator is tested and fires, it is consumed. If it does not fire, the condition of the detonator is altered so that any future test would not represent the performance of the unit in its natural unused condition. In some situations, testing is conducted at a predetermined sequence of stimulus levels (Finney 1952). However, here the focus is on adaptive/sequential strategies for setting the stimulus levels for one object at a time.
Adaptive strategies can be grouped into the following three generic categories.

Up-and-down strategy
The up-and-down (or Bruceton) method (Dixon and Mood 1948) is an early example of the first category and is still frequently used due to its simplicity (Chao and Fuh 2001; Pace and Stylianou 2007; Xiang, Tsung, and Pu 2017). Although the up-and-down method does not assume a probability model for the threshold distribution, a probit/normal model is often used for analyzing the data generated. The up-and-down method involves a set of possible stimulus levels that are usually evenly spaced by a step size, D. Assuming that an increase in the stimulus results in an increased probability of a response, one must choose the initial stimulus level and step size prior to testing. The up-and-down method can provide useful estimates of the location and scale of the threshold distribution provided an appropriate step size is chosen (about one standard deviation). Although simple and heavily used, the up-and-down method is viewed to be inferior when compared to more modern adaptive methods (see e.g., Dror and Steinberg 2008).
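As a concrete illustration, the stepping logic described above can be sketched as follows. This is a minimal simulation sketch, not an implementation from the cited references; the function and argument names are illustrative, and the simulated thresholds (unobservable in a real test) are used only to generate the binary outcomes.

```python
import numpy as np

def up_and_down(thresholds, x_start, step):
    """Simulate the up-and-down (Bruceton) scheme: after a response the
    next stimulus is lowered by one step; after a non-response it is
    raised by one step."""
    x, xs, ys = x_start, [], []
    for t in thresholds:
        y = int(x >= t)               # response iff stimulus >= threshold
        xs.append(x)
        ys.append(y)
        x = x - step if y else x + step
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(1)
xs, ys = up_and_down(rng.normal(60.0, 10.0, size=25), x_start=60.0, step=10.0)
```

Note that every stimulus differs from its predecessor by exactly one step, which is why the method can adapt only coarsely when the step size is poorly chosen.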

Langlie strategy
An example of the second category is the Langlie One-Shot method (Langlie 1965). As in the case of the up-and-down method, the Langlie method does not assume a probability distribution for the thresholds. It is particularly effective for estimating the median of the threshold distribution. Although it is more complicated than the up-and-down method, it is easy to implement. The Langlie method requires the selection of two stimulus levels, L_1 and L_2. The lower level (L_1) is such that no experimental units would be expected to respond (e.g., "fire") when subjected to that level. Conversely, the upper level (L_2) is such that all experimental units would be expected to respond when subjected to that level. The logic for the test sequence is as follows. The first unit is tested at the midpoint M = (L_1 + L_2)/2. If the unit fails to respond (i.e., Y_1 = 0), the stimulus for the next unit is stepped up to X_2 = (M + L_2)/2. Otherwise, the next unit is tested at a stimulus level of X_2 = (M + L_1)/2. The general rule for obtaining the stimulus level X_{m+1}, where m > 2, is to work backward in the test sequence, starting at the mth trial, until a previous trial (say the pth trial) is found such that there are as many successes as failures in the pth through mth trials. If such a trial is found, X_{m+1} = (X_p + X_m)/2; otherwise, X_{m+1} is the average of X_m and L_1 (if the mth unit responded) or L_2 (if it did not). The Langlie method tends to seek out and concentrate stimulus levels in the vicinity of the median of the threshold distribution. By concentrating testing where mixed results are likely, the Langlie method tends to achieve "overlap" relatively quickly. Overlap occurs when the largest stimulus associated with a "no response" exceeds the smallest stimulus associated with a "response." The presence of overlap in the experimental results guarantees the existence of a ML estimate of the parameters of an assumed location-scale model (Silvapulle 1981).
Because of the concentration of data around the median, the Langlie method can provide relatively good estimates of the location of the distribution of interest. Conversely, for that same reason, the Langlie method does not provide precise estimates of the scale parameter. Note that the possible stimulus values allowed prior to overlap are limited to the interval [L_1, L_2]. Thus, it is important that this interval spans a sufficiently large part of the region where mixed results are likely. Otherwise, the number of trials required to achieve overlap may increase dramatically.
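The backward-search rule above can be sketched as follows. This is a minimal sketch of the Langlie update as reconstructed here (the balanced-window form of the rule), with `langlie_next` and its argument names being illustrative choices rather than names from Langlie (1965).

```python
def langlie_next(xs, ys, L1, L2):
    """Next Langlie stimulus level given the trials so far. Searches
    backward for a window of trials containing equal numbers of
    responses and non-responses; failing that, steps toward the
    appropriate extreme (L1 or L2)."""
    m = len(xs)
    if m == 0:                        # first trial: midpoint of [L1, L2]
        return (L1 + L2) / 2.0
    for p in range(m - 1, -1, -1):
        window = ys[p:m]              # trials p+1 through m (1-indexed)
        if 2 * sum(window) == len(window):
            return (xs[m - 1] + xs[p]) / 2.0
    return (xs[m - 1] + (L1 if ys[m - 1] == 1 else L2)) / 2.0
```

For example, with L_1 = 20 and L_2 = 100, the first test is at 60; a non-response moves the second test to 80; a response there (together with the earlier non-response) forms a balanced pair, placing the third test at 70, i.e., back toward the median.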

Parametric models and confidence bounds
While neither the up-and-down nor Langlie strategies assume a distribution, parametric models are usually assumed for making inferences regarding the threshold distribution based on sensitivity test data. Here, the focus is on cases where the form of the assumed threshold distribution is from the location-scale family, with special cases including the normal and logistic distributions. Likelihood-based methods are generally used to make inferences concerning the model parameters due to straightforward computation and their desirable asymptotic properties (Neyer 1994). For a location-scale distribution with cumulative distribution function F(z) and location and scale parameters μ and σ, the likelihood function based on the experimental data is given by

L(μ, σ) = ∏_{i=1:n} [F((X_i − μ)/σ)]^{Y_i} · [1 − F((X_i − μ)/σ)]^{1−Y_i}.

In addition to providing a foundation for estimating model parameters, the likelihood function provides a generally useful mechanism for finding approximate confidence regions for model parameters and, by extension, given the invariance property of ML estimation under re-parameterization, functions of model parameters (e.g., see Meeker and Escobar 1998). Functions of model parameters that might be of interest include quantiles and the probability of responding at a specific stimulus level. The basis for constructing approximate confidence bounds is the likelihood ratio

R(μ, σ) = L(μ, σ) / L(μ̂, σ̂),

which is used to assess the plausibility of possible values for the model parameters (μ, σ) given the experimental data, using L(μ̂, σ̂) as a benchmark. Consider the region F in the μ × σ plane where −2 · ln R(μ, σ) ≤ χ²_{df=1, 1−α}. Then, an approximate 100(1 − α)% confidence interval for μ is given by the range of μ values contained in F. Similarly, an approximate 100(1 − α)% confidence interval for σ is given by the range of σ values contained in F. In the case of a quantile, given by x_p = μ + Φ⁻¹(p) · σ, where Φ⁻¹(·) is the inverse c.d.f. of a standardized location-scale distribution, an approximate 100(1 − α/2)% upper tolerance bound for x_p is given by the maximum value of μ + Φ⁻¹(p) · σ over (μ, σ) ∈ F. The determination of such intervals and bounds is easily accomplished by using constrained optimization methods. For example, refer to the function fmincon in MATLAB (2019). Another method for finding approximate confidence bounds for model parameters is the Wald (normal theory) approximation which is based on the asymptotic normality of maximum likelihood estimates (Meeker and Escobar 1995; Meeker, Hahn, and Escobar 2017). The Wald approximation can provide useful bounds in cases with moderate to large sample sizes. However, in cases with small sample sizes (or moderate sample sizes exacerbated with censoring), the coverage of bounds on σ and extreme percentiles produced by the Wald approximation can be significantly anticonservative (Collins, Weaver, and Hamada 2017; Young and Easterling 1994). The likelihood-ratio based approach, although more computationally intensive, generally produces bounds with more accurate levels of coverage than the Wald approximation (Meeker, Hahn, and Escobar 2017).
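The likelihood and likelihood-ratio machinery above can be sketched in a few lines. This is a simplified illustration under the probit/normal model using SciPy rather than MATLAB's fmincon; the grid-profiling implementation and the names `nll` and `lr_interval_mu` are the author's of this sketch, not from the cited sources, and constrained optimization would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, chi2

def nll(theta, x, y):
    """Negative log-likelihood for the probit/normal threshold model:
    P(response at stimulus x) = Phi((x - mu) / sigma)."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    p = np.clip(norm.cdf((x - mu) / sigma), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def lr_interval_mu(x, y, alpha=0.10):
    """Approximate 100(1 - alpha)% likelihood-ratio interval for mu,
    profiling sigma over a grid of candidate mu values."""
    fit = minimize(nll, x0=[np.mean(x), np.std(x)], args=(x, y),
                   method="Nelder-Mead")
    mu_hat, sigma_hat = fit.x
    # keep mu0 whenever -2 log R(mu0) <= chi-square(1) critical value
    cut = fit.fun + chi2.ppf(1 - alpha, df=1) / 2
    keep = []
    for mu0 in np.linspace(mu_hat - 6 * sigma_hat, mu_hat + 6 * sigma_hat, 241):
        prof = minimize(lambda s: nll([mu0, s[0]], x, y),
                        x0=[sigma_hat], method="Nelder-Mead")
        if prof.fun <= cut:
            keep.append(mu0)
    return min(keep), max(keep)
```

The same profiling idea extends to σ and to x_p; the grid search is shown only because it makes the inversion of the likelihood-ratio region transparent.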
Finally, it is also noted that the validity of approximate bounds produced by these and similar methods can be adversely affected by the use of sequentially constructed designs. This is due to the fact that the bounds are computed as if the observations are independent, which is not the case for sequential designs (Ford, Titterington, and Kitsos 1989). For more in this regard, also see Silvey (1980), Ford, Titterington, and Wu (1985), and Morgan (1992).

Neyer's D-optimal strategy
An example of the third category of adaptive strategies, which is linked to an assumed parametric distribution, is a three-part strategy developed by Neyer (1994). The Neyer strategy requires guesses for three inputs (lower and upper bounds for μ, given by μ_min and μ_max, and a guess for σ, given by σ_guess). The first part of Neyer's method closes in on the mean of the distribution, adjusting for possible poor guesses regarding μ and σ. The second part is designed to achieve overlap. Once overlap is achieved, the third part of Neyer's method uses a D-optimization procedure to pick X_{i+1} based on {X_1, X_2, ..., X_i; μ̂_i, σ̂_i, F}, where μ̂_i and σ̂_i are the current estimates of μ and σ. ML estimation is used to obtain the current estimates of μ and σ (μ̂_i and σ̂_i) unless the ML estimates are deemed to be "wild." In such cases (which might rarely occur soon after achieving overlap), the estimates are adjusted (see Neyer 1994 for details). The next stimulus (X_{i+1}) is selected to maximize the determinant of the Fisher information matrix of (μ, σ), evaluated at the current estimates and based on the stimulus levels X_1, ..., X_i together with the candidate level X_{i+1}. By maximizing the determinant of the Fisher information matrix, the intent is to select stimulus levels that, when associated with the assumed model, will tend to minimize the variances of the parameter estimates. Comparative simulations performed by Neyer (1994) show that his method is much more efficient than either the up-and-down or Langlie methods for estimating model parameters. The simulations also show that the use of all methods considered (Neyer, up-and-down, and Langlie) resulted in biased estimates of the scale parameter in the context of small sample sizes. This is not surprising given that all of the methods are often used in conjunction with ML estimation (e.g., Cordeiro and McCullagh 1991). The effect of this bias on estimates of extreme quantiles can be significant.
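The D-optimal selection step can be sketched as follows, assuming the probit/normal model. This is a simplified stand-in for, not a reproduction of, Neyer's implementation; the candidate-grid search and the names `fisher_info` and `next_stimulus` are illustrative.

```python
import numpy as np
from scipy.stats import norm

def fisher_info(xs, mu, sigma):
    """Fisher information matrix for (mu, sigma) under the probit/normal
    model, summed over the stimulus levels in xs."""
    z = (np.asarray(xs, dtype=float) - mu) / sigma
    p = np.clip(norm.cdf(z), 1e-12, 1 - 1e-12)
    w = norm.pdf(z) ** 2 / (p * (1 - p))   # per-trial information weight
    I = np.zeros((2, 2))
    for zi, wi in zip(z, w):
        I += (wi / sigma ** 2) * np.array([[1.0, zi], [zi, zi ** 2]])
    return I

def next_stimulus(xs, mu_hat, sigma_hat, candidates):
    """D-optimal choice: the candidate level that maximizes the
    determinant of the updated Fisher information, evaluated at the
    current parameter estimates."""
    dets = [np.linalg.det(fisher_info(list(xs) + [c], mu_hat, sigma_hat))
            for c in candidates]
    return candidates[int(np.argmax(dets))]
```

Because a single level contributes no information about σ, the determinant criterion pushes the next stimulus away from the current location estimate, consistent with the two limiting levels discussed in Sec. 4.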
Based on independent comparative simulations, Young and Easterling (1994) recommend Neyer's method for estimating extreme quantiles if the assumed model is symmetric. One deficiency of Neyer's method is the additional requirement of a guess for σ. An inaccurate guess for σ can delay the onset of overlap, hence reducing the efficiency of Neyer's method. Also, Neyer's method involves parameter estimation and optimization at each sequential step once overlap is achieved, making it considerably more computationally intensive than either the up-and-down or Langlie methods. A number of sequential strategies utilize the D-optimization idea from Neyer's method. For example, the 3pod method (Wu and Tian 2014) incorporates the third part of Neyer's method as the second phase of a three-phase strategy. In the third phase, 3pod uses a nonparametric approach to target a specific percentile, rather than the model parameters. Dror and Steinberg (2008) describe a strategy that is Bayesian in nature and also involves D-optimality. Their strategy targets model parameters, and although relatively complex, has the advantage of being applicable to multifactor settings. Another strategy that borrows from Neyer's method, termed LND, was proposed by Wang et al. (2013). The LND strategy replaces the first two parts of Neyer's method with Langlie's method. Thus, LND uses the notion of D-optimality from Neyer after achieving overlap with Langlie's method and thus requires only two inputs (L_1 and L_2). Even with these other choices, Neyer's method remains a popular choice for practitioners (e.g., see Lease et al. 2020; Kay 2021; Marrs et al. 2021). Thus, this paper uses the Neyer method (as implemented in Neyer's commercial software, SenTest™) for illustration. As an aside, note that there has been some misinterpretation concerning the details of the Neyer method as described in Neyer (1994) (see Ray, Roediger, and Neyer 2014; Neyer Software LLC; Wang, Tian, and Wu 2020).

Small-sample performance of Neyer method
Sensitivity tests are often expensive to conduct, as they consume costly units and may involve elaborate experimental setups, as for example in the testing of explosives (Cooper 1996). Collins, Weaver, and Hamada (2017) refer to the analysis of data derived from such tests as "panning for gold." That is, one is trying to extract all available information from small quantities of data. Therefore, the focus here is on relatively small samples where the intent is to characterize the threshold distribution of interest, including extreme quantiles. It is recognized that such ambitions are highly dependent on parametric assumptions. Nevertheless, this is common practice (e.g., see Young and Easterling 1994).
For example, Kay (2021) performed sensitivity testing of high explosives with a sample size of 25 using Neyer's method. In this study, each experimental unit involved a small amount of high explosives that was placed between a ceramic pin and plate. A frictional force was applied across the high explosive. If the force was sufficient, the high explosive responded with a "pop" or "snap." The experimenters used generalized linear modeling (with the probit link function) in conjunction with maximum likelihood and the test data to estimate the parameters of the assumed normal model of thresholds. In this case, μ̂ = 61.8 newtons (N) and σ̂ = 11.4 N.
Simulation experiments were performed to illustrate and assess the performance of Neyer's D-optimal method combined with ML estimation in this small-sample context. See Neyer (1994), Young and Easterling (1994), and Wang, Tian, and Wu (2020) for other evaluations of Neyer's method. Here, and throughout the remainder of this paper, thresholds were simulated to be normally distributed with a mean and a standard deviation of μ = 60 and σ = 10 (no units), respectively. The specification of these particular parameter values, similar to those estimated by Kay (2021), is made for illustration and does not invalidate the generality of the conclusions obtained from the simulations. It is of interest to understand how the performance varies due to sample size and the required inputs (μ_min, μ_max, and σ_guess) relative to the true model parameters. Neyer (1994) concluded that the performance of his method is relatively insensitive to μ_min and μ_max, while potentially being more sensitive to σ_guess. Forty different conditions were explored by considering all combinations of the levels associated with three 2-level factors and one 5-level factor. The first factor is the sample size, n ∈ {25, 40}. The second factor (in illustrative and general/relative terms) is (μ_min + μ_max)/2 ∈ {60, 80} = {μ, μ + 2σ}, which provides an initial guess for μ. The two levels of this factor represent well-centered and poorly-centered guesses for μ, respectively. The third factor, representing the size of the search range, is defined by (μ_max − μ_min) ∈ {40, 80} = {4σ, 8σ}. Together, the second and third factors provide an initial search range for μ. Unlike the Langlie method, possible stimulus values (prior to overlap) are not limited to this initial search range.
The fourth factor (with 5 levels) is the initial guess for σ, σ_guess ∈ {2.5, 5, 10, 20, 40} = {0.25σ, 0.5σ, σ, 2σ, 4σ}. For each experimental condition, 2,000 sets of simulated binary data were generated via the Neyer method. ML estimation was used to analyze the binary data. Table 1 summarizes the performance in terms of estimating μ and σ across the experimental conditions considered. Overlap was achieved across virtually all simulations. Notable exceptions occurred in cases when n = 25 and σ_guess = 40 = 4σ, regardless of the initial search range. In those cases, about 1% of the simulations failed to produce an overlap. When σ_guess = 40, the average number of trials required to produce an overlap was approximately twice that observed at most other conditions. It is easy to see that sample size had the greatest impact on performance, as the benefit of additional samples in terms of reducing the variance of both μ̂ and σ̂ is clear. The effects of the other factors are most apparent at n = 25, where σ_guess clearly had a dominant effect. The impact of varying the search range for μ had only minor effects on performance. Not surprisingly, the estimates of μ appear to be effectively unbiased across all cases considered. The variance of μ̂ is affected not only by the sample size, but to a lesser extent by σ_guess, where Var(μ̂) is largest when σ_guess = 40. Consistent with the known small-sample behavior of ML estimation (e.g., see Griffiths, Hill, and Pope 1987), it is also not surprising that the estimates of the scale parameter (σ) are biased (low) across all cases considered. Note that the level of relative bias observed here (about −15% for n = 25 and about −10% for n = 40) is consistent with what Neyer (1994) observed. In cases where n = 25, the Neyer strategy performs relatively poorly for estimating σ when σ_guess/σ is small (here when σ_guess = 2.5), resulting in increased bias and variance.
It is also interesting to compare the efficiencies of estimating the model parameters across the various conditions relative to the case where one would be able to observe the thresholds directly (complete data). In the case of normally distributed complete data, Var(μ̂) = σ²/n and Var(σ̂) = σ²/(2n). The relative inefficiency in estimating σ is striking. With a sample size of 25, one is effectively trying to estimate σ with fewer than 5 complete observations. For example, when n = 25, [μ_min, μ_max] = [20, 100], and σ_guess = 10, n_effective(σ̂) = σ²/(2 · Var(σ̂)) = 4.2. Even with the assumption of a correctly-specified parametric model, estimates of extreme quantiles are clearly going to be associated with substantial uncertainty.
An accurate assessment of the associated uncertainty is needed for users of such estimates.
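The effective-sample-size arithmetic above can be written out explicitly. In this small sketch, the Var(σ̂) value is back-solved from the reported n_effective = 4.2 with σ = 10; it is not quoted directly from Table 1.

```python
def n_effective_scale(sigma, var_sigma_hat):
    """Effective number of complete observations implied by a simulated
    variance of sigma-hat, using Var(sigma_hat) = sigma^2 / (2n) for
    complete normal data: n_eff = sigma^2 / (2 * Var(sigma_hat))."""
    return sigma ** 2 / (2.0 * var_sigma_hat)

# sigma = 10 with n_eff = 4.2 implies Var(sigma_hat) of about 11.9
n_eff = n_effective_scale(10.0, 11.9)
```

As a check, plugging in the complete-data variance σ²/(2n) recovers n itself.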

Parametric bootstrap
While the likelihood-ratio based approach is commonly used for making inference in the case of sensitivity testing, it is known to produce inaccurate levels of coverage in situations involving small samples (see e.g., Jeng and Meeker 2000). Alternatively, the parametric bootstrap (e.g., see Efron and Tibshirani 1993 and Davison, Hinkley, and Young 2003) can be used as a method for producing confidence bounds in the context of the simulations presented in Sec. 3. In a pilot simulation study, tolerance bounds (for a particular percentile and level of confidence) produced by both the percentile parametric bootstrap and bias-corrected parametric bootstrap are compared with asymptotic bounds that are produced by using the likelihood ratio.
Bootstrap samples should be generated in a way that accurately simulates the methods by which the actual data are generated and analyzed. Here, with simulated data in contrast to actual experimental data, the stimulus levels are set and known perfectly (within computer precision). In the case of actual experimental data, the uncertainty associated with the stimulus levels also needs to be considered. A fitted normal/probit model is assumed with parameter estimates given by (μ̂, σ̂). B sets of simulated data are generated based on the Neyer adaptive method/fitted model and analyzed using ML estimation, resulting in B pairs of parameter estimates given by {(μ̂*_k, σ̂*_k)}_{k=1:B}. In the pilot study, the primary interest concerns developing an approach for constructing an upper 90% tolerance bound for the 0.99 quantile (x_0.99) of the threshold distribution. A simple approach is to use the percentile parametric bootstrap, where one can compute a set of B estimates for x_0.99 based on the associated estimates of μ and σ, given by

x̂*_0.99(k) = μ̂*_k + Φ⁻¹(0.99) · σ̂*_k,

where Φ⁻¹ is the inverse standardized normal cumulative distribution function. The 90th percentile of {x̂*_0.99(k)}_{k=1:B} constitutes the percentile bootstrap approximation of the upper 90% tolerance bound for x_0.99. Examples of using the percentile parametric bootstrap to provide approximate confidence intervals in a quantal response setting are given in Chao and Fuh (2001) and Kerns (2021). Note that the method introduced by Chao and Fuh (2001) for implementing the parametric percentile bootstrap (and bootstrap-t) is specific to the Bruceton method.
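The percentile procedure can be sketched generically as follows. Here `run_experiment` is a hypothetical stand-in for the full Neyer-design-plus-ML pipeline (which is the computationally expensive part); everything else follows the construction above.

```python
import numpy as np
from scipy.stats import norm

def percentile_bootstrap_bound(run_experiment, mu_hat, sigma_hat,
                               B=1000, p=0.99, conf=0.90, rng=None):
    """Percentile parametric bootstrap upper tolerance bound for x_p.
    run_experiment(mu, sigma, rng) must regenerate and refit one data
    set, returning (mu_hat_star, sigma_hat_star)."""
    rng = rng if rng is not None else np.random.default_rng()
    xq = np.empty(B)
    for k in range(B):
        m_k, s_k = run_experiment(mu_hat, sigma_hat, rng)
        xq[k] = m_k + norm.ppf(p) * s_k   # bootstrap estimate of x_p
    return float(np.quantile(xq, conf))
```

Because σ̂*_k inherits the downward bias of ML scale estimation, the resulting percentile bound tends to sit too low, which motivates the bias correction of the next subsection.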

Bias-corrected parametric bootstrap using approximate pivotal quantities
The preferred approach considered here for constructing tolerance bounds uses a bias-corrected parametric bootstrap. Bias-corrected bootstrap methods have been described and analyzed elsewhere in the literature (see e.g., Efron 1987; Hall 1988; Chao and Fuh 2001). Here, the use of the bias-corrected bootstrap is motivated by the omnipresent small-sample bias associated with ML estimation of the scale parameter and the availability of approximate pivotal quantities relating to the use of a location-scale model. Such bias is observed in Table 1. Bias in the estimated scale parameter causes bias in {x̂*_0.99(k)}_{k=1:B} and leads to an underestimate of the upper 90% tolerance bound when using the percentile bootstrap. An estimate of this bias is obtained through the bootstrap simulations and is used to correct the tolerance bounds.
In the case of using complete data (or Type 2 censored data) with a location-scale model where μ̂ and σ̂ are the ML estimates of the location and scale parameters, Q = (μ − μ̂)/σ̂ and R = σ/σ̂ are exact pivotal quantities (Appendix E of Lawless 2003). In such cases, an exact pivotal quantity for a quantile (x_p) is given by T(p) = (x_p − x̂_p)/σ̂ = Q + Φ⁻¹(p) · (R − 1), where x̂_p = μ̂ + Φ⁻¹(p) · σ̂ (see Meeker, Hahn, and Escobar 2017). By definition, the distributions of exact pivotal quantities do not depend on the true unknown values of {μ, σ}. The presence of exact pivotal quantities can provide a straightforward basis for making exact statistical inference regarding a parameter or quantile (e.g., see Lawless and Fredette 2005; Wang, Hannig, and Iyer 2012).
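The defining property of an exact pivot is easy to verify numerically in the complete-data normal case. The sketch below draws Q = (μ − μ̂)/σ̂ under two very different (μ, σ) pairs; with a shared seed, numpy's loc + scale parameterization makes the two sets of draws coincide, illustrating the invariance. The function name and settings are illustrative.

```python
import numpy as np

def simulate_Q(mu, sigma, n=10, reps=20000, seed=0):
    """Empirical draws of Q = (mu - mu_hat) / sigma_hat from complete
    normal samples; exact pivotality means the distribution of Q is the
    same whatever (mu, sigma) generated the data."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(reps, n))
    return (mu - x.mean(axis=1)) / x.std(axis=1, ddof=1)

q1 = simulate_Q(0.0, 1.0)
q2 = simulate_Q(60.0, 10.0)
```

For the sequential binary-data setting of this paper no such exactness holds, which is why Q, R, and T are treated only as approximate pivots.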
However, the sequential strategies considered here generate a complex mixture of right- and left-censored data defined by multiple censoring limits. Based on such data, the quantities Q, R, and T are not exact pivotal quantities. In these cases involving small sample sizes, the generated stimulus levels (censoring limits) are not equivariant with respect to changes in location and scale. The sequential strategies discussed in Sec. 2 vary in how well they adjust to changes in location and scale. For example, because of its predetermined set of possible stimulus levels with fixed step size (D), the up-and-down strategy is limited in how it can adjust to the location and scale of the threshold distribution. In contrast, the Neyer strategy and related methods adjust well to the location and scale of the threshold distribution. Based on the consistency of ML estimates, it follows that the limiting stimulus levels associated with the Neyer method are μ − a · σ and μ + a · σ, where a = 1.138 in the case of a probit model (e.g., see Banerjee 1980). Given that the two limiting stimulus levels are equivariant with respect to changes in location and scale, and assuming they occur with equal frequency, it follows that Q, R, and T are asymptotic pivotal quantities when using the Neyer method. In the context considered here, Q, R, and T are regarded as approximate pivotal quantities. While approximate pivotal quantities have been introduced elsewhere to solve inference problems (see e.g., Lawless and Fredette 2005; Wang, Hannig, and Iyer 2012), the intent here is to demonstrate their utility in the context of a small-sample setting involving binary data generated via a complex sequential design strategy.
Although the distributions of Q, R, and T are unknown, the parametric bootstrap provides a means to obtain empirical approximations to these distributions via simulation of the quantities

Q*_k = (μ̂ − μ̂*_k)/σ̂*_k,   R*_k = σ̂/σ̂*_k,   and   T*_k(p) = Q*_k + Φ⁻¹(p) · (R*_k − 1).

The quantities Q*_k, R*_k, and T*_k simulate Q, R, and T. As such, the distributions of Q*_k, R*_k, and T*_k approximate the distributions of Q, R, and T. The computed values of Q*_k, R*_k, and T*_k are plugged into the expressions for Q, R, and T, which are then inverted in terms of μ, σ, and x_p. The resulting values are bias-corrected estimates of μ, σ, and x_p. Due to the approximate pivotal quantity assumption, the distributions of these estimates are reflective of the uncertainties in μ, σ, and x_p. For example, a set of B bias-corrected estimates for x_0.99 is given by

x̂**_0.99(k) = μ̂ + Φ⁻¹(0.99) · σ̂ + σ̂ · T*_k(0.99).  [6]

The first part of the RHS of Eq. [6], μ̂ + Φ⁻¹(0.99) · σ̂, is a simple estimate of x_0.99 that is based solely on μ̂ and σ̂ and includes no information from the bootstrap. The remainder of the RHS of Eq. [6] corrects for the bias observed at each iteration within the bootstrap. In this context, the bias correction applies mainly to the systemic under-estimate of σ. Based on the bias-corrected bootstrap, an approximate upper 90% tolerance bound for x_0.99 is given by the 90th percentile of {x̂**_0.99(k)}_{k=1:B}. The above strategy relies on the credibility of the approximate pivotal quantity assumption in order to produce accurate levels of coverage. One can also use the quantities Q*_k and R*_k to construct bias-corrected bounds for μ and σ. An approximate 100(1 − α)% confidence interval for μ is given by

[μ̂ + Q*(α/2) · σ̂, μ̂ + Q*(1 − α/2) · σ̂],

where Q*(p) is the p quantile of {Q*_k}_{k=1:B}. An approximate 100(1 − α)% confidence interval for σ is given by [R*(α/2) · σ̂, R*(1 − α/2) · σ̂], where R*(p) is the p quantile of {R*_k}_{k=1:B}.
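Putting the pieces together, the bias-corrected bound can be sketched generically. As before, `run_experiment` is a hypothetical stand-in for the Neyer-design-plus-ML refit; the pivot computations follow the definitions above.

```python
import numpy as np
from scipy.stats import norm

def bias_corrected_bound(run_experiment, mu_hat, sigma_hat,
                         B=1000, p=0.99, conf=0.90, rng=None):
    """Bias-corrected parametric bootstrap upper tolerance bound for x_p,
    built from the approximate pivotal quantities Q*, R*, and T*.
    run_experiment(mu, sigma, rng) returns (mu_hat_star, sigma_hat_star)."""
    rng = rng if rng is not None else np.random.default_rng()
    zp = norm.ppf(p)
    t = np.empty(B)
    for k in range(B):
        m_k, s_k = run_experiment(mu_hat, sigma_hat, rng)
        Q = (mu_hat - m_k) / s_k      # simulates (mu - mu_hat) / sigma_hat
        R = sigma_hat / s_k           # simulates sigma / sigma_hat
        t[k] = Q + zp * (R - 1.0)     # simulates T(p)
    x_bc = mu_hat + zp * sigma_hat + sigma_hat * t   # bias-corrected x_p
    return float(np.quantile(x_bc, conf))
```

Since the scale estimate tends to be biased low, R* exceeds one on average and the correction pushes the bound upward, counteracting the underestimate seen with the percentile bootstrap.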

Assessment of bootstrap performance, pilot study
Bootstrap performance is measured here by coverage accuracy. The performance assessment, limited by the computational cost of simulating Neyer's method, is associated with 16 of the 40 experimental conditions in Table 1, with relatively few simulations (100) at each condition. The selected conditions span the range of performance observed in Table 1 in terms of trials to achieve overlap and ability to estimate the model parameters. Eight of the 16 experimental conditions comprise a full factorial in terms of the three 2-level factors (defining sample size and search range) with the fourth factor set at its nominal level, σ_guess = 10. The other 8 conditions (along with two of the first 8 conditions) comprise a full factorial in terms of the two levels of sample size and the five levels of σ_guess with the search range fixed at [μ_min, μ_max] = [40, 80]. Based on the results summarized in Table 1, there is no evidence to suggest strong interactions between the search range and σ_guess in terms of effects on performance.
For each of the experimental conditions selected, simulations involved two nested stages. The purpose of the two stages is to enable the assessment of the accuracy of the bootstrap coverage at the simulation target (μ = 60, σ = 10). At the first (outer) stage, iterated N = 100 times under a fixed experimental condition, the Neyer strategy combined with ML estimation was used to obtain estimates of μ and σ given that thresholds are simulated from a Normal distribution at the simulation target. The number of iterations performed at the outer stage was limited due to the computational cost of simulating Neyer's method. Each set of these estimates, given by {μ̂_j, σ̂_j}_{j=1:N}, is viewed as a plausible outcome assuming such a distribution. These estimates were then used as the basis for B = 1,000 bootstrap simulations at the second (inner) stage. For each bootstrap simulation in the jth iteration of the outer loop, thresholds are simulated from a Normal(μ̂_j, σ̂_j) distribution. At the same experimental condition, the Neyer method was used in combination with ML estimation to estimate μ and σ. The result is a large set of simulated parameter estimates, {μ̂*_jk, σ̂*_jk}_{j=1:N, k=1:B}. The number of trials needed to achieve overlap in the inner stage tends to increase as σ̂_j decreases. In cases where σ̂_j ≪ σ_guess, overlap was not always achieved in the inner stage. In these cases (limited to when n = 25), μ̂*_jk and σ̂*_jk were set to μ̂_j and σ̂_j, respectively. With the exception of instances where σ_guess = 40, the "non-overlap" cases occurred less than 1% of the time. When using the Neyer strategy with σ_guess = 40 and n = 25, the "non-overlap" cases occurred about 5% of the time, hence portending possible problems regarding the coverage accuracy of bootstrap bounds involving small samples when σ_guess greatly exceeds σ.
Following Eq. [5], for the jth iteration of the outer loop, one can compute a set of B estimates for x_0.99, {x̂**_0.99(j,k)}_{k=1:B}. Specific to each of the N iterations of the outer loop and based on the bias-corrected bootstrap, an approximate upper 90% tolerance bound for x_0.99 is given by the 90th percentile of this set. It is useful to examine whether there is any obvious dependence of the distribution of the computed pivotal quantities (Q*_jk and R*_jk) on the parameter estimates (μ̂_j and σ̂_j). Such dependence would be a contraindication for using the approximate pivotal quantities as useful surrogates for pivotal quantities. Figures 1-3 display the empirical cumulative distribution functions (ECDF's) of subgroups of approximate pivotal quantities (Q*_jk and R*_jk) obtained via several experimental conditions that span the performance of the Neyer method as summarized in Table 1. The subgroups are defined in terms of their associated deciles of μ̂_j and σ̂_j. For example, the distributions in decile 1 reflect the 10,000 values of Q*_jk (or R*_jk) associated with the 10 (out of N = 100) smallest values of μ̂_j (or σ̂_j). Each display contains 10 ECDF's, one per decile. The spread of the ECDF's associated with Q* (or R*) within a given experimental condition gives a good indication of the credibility of the pivotal quantity approximation. Figures 1-3 indicate good credibility for all conditions studied when n = 40. Figure 1, which is representative of the cases considered where σ_guess = 10, indicates good credibility when n = 25 for cases where σ_guess is close to σ. On the other hand, there is noticeable spread of the ECDF's within an experimental condition (particularly relating to R*) in cases where n = 25 and σ_guess is substantially off-target (see Figures 2 and 3). When σ_guess = 40, the discontinuities observed in Figure 2 (at Q* = 0 and R* = 1) are associated with the correction made (discussed above) to μ̂*_jk and σ̂*_jk when overlap was not obtained.
These corrections, which tend to artificially shift the values of R* to the left (disproportionately so when associated with small σ̂_j), could decrease coverage by a slight amount. When σ_guess = 2.5, the dependence of R* on σ̂_j is noticeable and is due to the relatively poor estimates of σ that are obtained at that condition (see Table 1). In this case, there is a systematic shift of the distribution of R* to the right for large values of σ̂_j due to a significant disparity between σ̂_j and σ_guess. This shift is likely associated with some degree of coverage inaccuracy. In summary, the accuracy of the bootstrap-derived confidence bounds is affected at some level by the following two factors: sample size and the discrepancy between σ̂ (the actual estimate of scale based on the experimental data) and σ_guess. In practice, a comparison of σ̂ with σ_guess (which are both used by the bootstrap simulations) could provide users with some ability to detect potential credibility issues when n is relatively small (e.g., n = 25). Alternatively, if the consequences of a poor guess for σ are of concern, the LND method (with inputs L_1 and L_2 set sufficiently wide) may offer a useful alternative to the Neyer method when using BCB to construct bounds.
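The two-stage simulation and the decile diagnostic can be sketched as follows. This is a sketch under stated assumptions: the full Neyer/ML step is abstracted as a caller-supplied function `run_design`, and the pivot definitions (Q* = (μ̂ − μ̂*)/σ̂*, R* = σ̂/σ̂*) are illustrative, not reproduced from the paper's equations.

```python
def nested_pivots(run_design, mu=60.0, sigma=10.0, N=100, B=1000):
    """Two-stage skeleton of the pilot study. run_design(mu, sigma) stands in
    for one full sequential experiment plus ML estimation and must return
    (mu_hat, sigma_hat); the real Neyer/ML machinery is abstracted away."""
    out = []
    for j in range(N):                        # outer stage: plausible estimates
        mu_j, sig_j = run_design(mu, sigma)
        Q, R = [], []
        for _ in range(B):                    # inner stage: parametric bootstrap
            mu_s, sig_s = run_design(mu_j, sig_j)
            Q.append((mu_j - mu_s) / sig_s)   # assumed location pivot
            R.append(sig_j / sig_s)           # assumed scale pivot
        out.append((mu_j, sig_j, Q, R))
    return out

def decile_groups(results, by="sigma"):
    """Pool pivotal quantities by deciles of the outer-stage estimates;
    overlaying the ECDF of each pooled group mirrors the Figure 1-3
    diagnostic (a tight overlay suggests a credible pivotal approximation)."""
    key = 1 if by == "sigma" else 0           # sort by sigma_hat or mu_hat
    piv = 3 if by == "sigma" else 2           # pool R or Q values
    order = sorted(range(len(results)), key=lambda j: results[j][key])
    size = max(1, len(order) // 10)
    return [sorted(v for j in order[d * size:(d + 1) * size]
                   for v in results[j][piv])
            for d in range(10)]
```

With a toy `run_design` that simply draws noisy estimates, the skeleton runs end to end; substituting a real sequential-design simulator recovers the structure of the pilot study.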
The simulations from the pilot study were used to assess and compare the coverage accuracy of upper 90% confidence bounds for x_0.99 produced by the percentile bootstrap (PB), the bias-corrected bootstrap (BCB), and the likelihood ratio (LR) from Eq. [2]. Table 2 summarizes the levels of coverage observed for these methods at each of the experimental conditions selected. The assessed coverage levels produced by BCB are close to nominal across the range of conditions considered, particularly at n = 40. On the other hand, it is clear that the PB and LR approaches are anticonservative and fail to cover x_0.99 at the 90% level, sometimes by a large amount at n = 25. The percentile bootstrap method is clearly less accurate than the likelihood-ratio-based approach. It is also seen that the levels of coverage associated with LR and PB improve, and the level of coverage associated with BCB becomes more stable across experimental conditions, as n increases.
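The coverage assessment itself is simple to express: at the simulation target, the true quantile is x_0.99 = 60 + Φ⁻¹(0.99)·10 ≈ 83.26, and the assessed coverage is the fraction of simulated upper bounds that meet or exceed it. A minimal sketch:

```python
def assessed_coverage(upper_bounds, target=83.26):
    """Fraction of simulated upper bounds that cover the true quantile
    (x_0.99 = 60 + 2.3263 * 10 ~= 83.26 at the simulation target); for a
    nominal upper 90% bound this fraction should be close to 0.90."""
    return sum(b >= target for b in upper_bounds) / len(upper_bounds)
```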
Figures 4 and 5 provide graphical insight for several experimental conditions by allowing a comparison of the ECDF's of upper tolerance bounds produced by the methods considered. Here, the interest relates to how often the tolerance bounds surpass x_0.99 = 83.26, which is indicated by the heavy vertical line in each sub-figure. The heavy horizontal lines intersect the tenth percentile of the tolerance bound distributions generated by each of the strategies. The ECDF's displayed in Figures 4 and 5 are relatively rough due to the small size of the pilot study (N = 100). One can also see that the distributions of bounds for x_0.99 produced by BCB are much broader than those produced by LR and PB, particularly for n = 25.
It is informative to visually compare bounds produced by the BCB and LR methods. As an example, consider the experimental condition where n = 40, [μ_min, μ_max] = [40, 80], and σ_guess = 10. Figure 6 shows the close association between the upper 90% confidence bounds for x_0.99 and the upper 95% confidence bounds for σ produced by each of the two methods in the pilot study. The shift in both axes from LR to BCB is clear and reflects the effect of the bias correction in creating upper bounds for both x_0.99 and σ. Figures 7 and 8 compare the upper and lower 95% confidence bounds produced for μ and σ at each simulation iteration. In terms of μ, while LR produces tighter intervals than BCB, the upper bounds are similar across methods (as are the lower bounds). Figure 8 shows that while the lower bounds for σ are reasonably similar across methods, the upper bounds for σ produced by BCB often greatly exceed those produced by LR.

Figure 1. Distributions of Q*_jk and R*_jk, by sample size (n) and deciles of μ̂_j (σ̂_j), given inputs: μ_min = 40, μ_max = 80, and σ_guess = 10.

Other location-scale models and model misspecification
The form of the assumed model influences the sequential strategy of the Neyer method (through F(·) and f(·) in Eq. [4]), estimation, and the construction of confidence bounds. In addition to the normal distribution, practitioners assume various other location-scale models. Wang, Tian, and Wu (2020) consider the heavier-tailed logistic and asymmetric skewed-logistic distributions along with the normal distribution when comparing the performance of sequential design strategies. Following that approach, the supplementary material accompanying this paper summarizes the performance of the bias-corrected parametric bootstrap when using the Neyer method in conjunction with these alternative distributions. In the case of the logistic model, it is assumed that F(x) = (1 + exp(−(x − μ)/σ))⁻¹ with μ = 60 and σ = 5.513. In the case of the skewed-logistic model, it is assumed that F(x) = (1 + exp(−(x − μ)/σ))⁻² with μ = 53.5 and σ = 6.6. In both cases, the location and scale parameter values were chosen to match the mean and variance of the Normal(60, 10) threshold model. The results summarized in the supplementary material show that the coverage performance of the bias-corrected parametric bootstrap when used with the alternative models is comparable to when it is used with the normal model. When using the skewed-logistic model, ML produces estimates of μ and σ that are both biased and negatively correlated. However, the bivariate bias leads to somewhat greater dependence of the distributions of both computed pivotal quantities (Q*_jk and R*_jk) on both μ̂_j and σ̂_j. This could negatively impact the credibility of the pivotal quantity assumption when n is relatively small (e.g., n = 25).

Figure 2. Distributions of Q*_jk and R*_jk, by sample size (n) and deciles of μ̂_j (σ̂_j), given inputs: μ_min = 40, μ_max = 80, and σ_guess = 40.

Figure 3. Distributions of Q*_jk and R*_jk, by sample size (n) and deciles of μ̂_j (σ̂_j), given inputs: μ_min = 40, μ_max = 80, and σ_guess = 2.5.
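The stated parameter choices can be checked numerically. Inverting the skewed-logistic CDF gives the quantile function Q(u) = μ − σ·ln(u^(−1/2) − 1), and a logistic distribution with scale σ has standard deviation σπ/√3; a midpoint-rule integration (a quick sketch, not taken from the paper) confirms the approximate match to the Normal(60, 10) moments:

```python
import math

# Logistic(mu=60, sigma=5.513): sd = sigma * pi / sqrt(3), which should be
# close to 10 to match the Normal(60, 10) threshold model.
sd_logistic = 5.513 * math.pi / math.sqrt(3)

# Skewed-logistic with F(x) = (1 + exp(-(x - mu)/sigma))^(-2):
# inverting F gives the quantile function Q(u) = mu - sigma*ln(u**-0.5 - 1).
mu_s, sig_s = 53.5, 6.6
n = 200_000
us = [(i + 0.5) / n for i in range(n)]        # midpoint rule on (0, 1)
xs = [mu_s - sig_s * math.log(u ** -0.5 - 1) for u in us]
mean_s = sum(xs) / n                          # approx. mean of the model
sd_s = math.sqrt(sum((x - mean_s) ** 2 for x in xs) / n)   # approx. sd
```

The computed values come out near 60 and 10 (the mean of the skewed-logistic model is approximately 60.1), consistent with the moment-matching rationale given in the text.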
It is also interesting to understand how model misspecification influences the accuracy of the upper 90% confidence bounds for x_0.99. Several situations, defined by the true underlying model and the assumed model form, are investigated via simulations with n = 40. Table 3 provides the inputs used by the Neyer method for each situation as well as a summary of the simulation results. Note that the accuracy of both the BCB and LR bounds is affected by model misspecification. For example, when the true model form is logistic and the assumed form is normal, the confidence bounds become anticonservative due to tails that are heavier than assumed. Conversely, when the true model form is normal and the assumed form is logistic, the bounds become conservative due to tails that are lighter than assumed. Note also that in the case when both the true and assumed model forms are the same, BCB appears to provide accurate coverage while LR appears to be anticonservative. While correct model specification is clearly an important issue, it is hard to distinguish between distributions without a large sample.

Large-scale assessment of coverage accuracy
The Neyer strategy is straightforward to code, but when used in a bootstrap framework (with estimation and D-optimization within nested loops) it can consume nontrivial computing resources. Motivated by the limitations of the small-scale pilot study (N = 100) and the computations required to perform larger-scale simulations, a straightforward and computationally efficient method was developed to assess the coverage accuracy of confidence bounds with a potentially higher level of fidelity. This method utilizes the large quantity of approximate pivotal quantities generated by the pilot parametric bootstrap study. For each experimental condition, N·B = 100,000 combinations of pivotal quantities were generated. The sets of pivotal quantities for a particular experimental condition are denoted by Q* and R*. It is useful to vectorize these sets of pivotal quantities, resulting in q* = vec(Q*) and r* = vec(R*). Because of the nature of the sequential strategies, the pivotal quantities were computationally expensive to obtain. However, once obtained, re-using them involves little cost. A useful way to exploit the large set of pivotal quantities is through resampling, as described below.
Simulations, again specific to an experimental condition and consisting of two nested stages, were used to assess bound accuracy. The first stage is expanded 10-fold (N = 1,000) in order to generate a larger number of plausible pairs of parameter estimates, {μ̂_j, σ̂_j}_{j=1:N}. As before, the Neyer method combined with ML estimation was used to obtain these estimates of μ and σ. The second (inner) stage is unlike that described previously. Rather than using the computationally expensive parametric bootstrap, where sequential strategies are applied to simulated thresholds, the second stage simulates the inner loop of the pilot study by using a nonparametric bootstrap process with resampled pivotal quantities. That is, B = 5,000 uniformly distributed pseudorandom integers are selected with replacement from {1, 2, …, 100,000}, resulting in U_j = {U_j1, U_j2, …, U_jB}. The sampled versions of q* and r* are q*_j = q*(U_j) and r*_j = r*(U_j), respectively. The same set of integers, U_j, is used to define q*_j and r*_j in order to preserve any existing dependence between elements of q* and r*. Let q*_j(k) and r*_j(k) be the kth elements of q*_j and r*_j, respectively. These resampled pivotal quantities are then used, in the manner described for the pilot study, to produce an approximate upper 90% tolerance bound for x_0.99 at each iteration of the outer loop. Similarly, resampling pivotal quantities can also be used to produce confidence bounds/intervals for the model parameters μ and σ. Specific to the jth iteration of the outer loop, an approximate (100 − α)% confidence interval for μ is constructed from μ̂_j, σ̂_j, and the quantiles of q*_j. Similarly, an approximate (100 − α)% confidence interval for σ is constructed from σ̂_j and the quantiles of r*_j. The endpoints of these intervals can be used to define approximate one-sided (100 − α/2)% upper or lower confidence bounds for the model parameters. Figures 9 and 10 display the distributions of upper confidence bounds for x_0.99 obtained from the large-scale simulations (described above) at selected conditions. Table 4 summarizes the levels of coverage obtained for x_0.99 when using this approach.
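The resampling surrogate for the inner stage can be sketched as follows. The pivot-to-bound formulas here (bounds of the form μ̂_j + q·σ̂_j and r·σ̂_j, with a per-draw bias-corrected quantile estimate μ̂_j + (q + Φ⁻¹(0.99)·r)·σ̂_j) are illustrative assumptions, not the paper's exact equations; the key mechanics shown are sampling with replacement and reusing the same indices for q* and r*.

```python
import random

def resampled_bounds(q_star, r_star, mu_j, sig_j, B=5000, alpha=0.10,
                     rng=random):
    """Surrogate inner stage: draw B pivot pairs with replacement, using the
    SAME indices for q* and r* to preserve their dependence, then form bounds
    around (mu_j, sig_j) without re-running the sequential design."""
    Z99 = 2.3263478740408408                  # Phi^{-1}(0.99)
    idx = [rng.randrange(len(q_star)) for _ in range(B)]
    Q = [q_star[i] for i in idx]
    R = [r_star[i] for i in idx]
    # assumed per-draw bias-corrected estimates of x_0.99
    x99 = sorted(mu_j + (q + Z99 * r) * sig_j for q, r in zip(Q, R))
    ub_x99 = x99[int(0.90 * B) - 1]           # upper 90% tolerance bound
    Qs, Rs = sorted(Q), sorted(R)
    lo, hi = int(alpha / 2 * B), min(B - 1, int((1 - alpha / 2) * B))
    ci_mu = (mu_j + Qs[lo] * sig_j, mu_j + Qs[hi] * sig_j)
    ci_sig = (Rs[lo] * sig_j, Rs[hi] * sig_j)
    return ub_x99, ci_mu, ci_sig
```

Once q* and r* are in hand, each call costs only a sort of B resampled values, which is what makes the 10-fold expansion of the outer stage affordable.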
Due to the 10-fold expansion of plausible pairs of parameter estimates, the large-scale simulations can provide a more precise perspective on coverage accuracy when compared to the pilot study. However, use of this approach for the BCB method assumes that the aggregate cumulative distributions of pivotal quantities from the pilot study are applicable across the enlarged number of plausible pairs of parameter estimates. If this assumption is flawed, the assessment of coverage for BCB could be affected, especially in the case of extreme confidence levels. For example, when σ_guess = 2.5 and n = 25, the distribution of R* depends on σ̂_j (see Figure 3). In that case, the overall distribution of R* is shifted to the right through incorporation of the subpopulation of R* associated with large values of σ̂_j. When applied to smaller values of σ̂_j, that shift could slightly increase (bias) the assessed coverage. Consider the experimental condition defined by n = 25, [μ_min, μ_max] = [40, 80], and σ_guess = 2.5. While it is possible that some of the increase in BCB's coverage level in Table 4 versus Table 2 at that condition is due to that shift, neither of the observed coverage fractions (906/1,000 and 85/100) is inconsistent with 90% coverage. Note that a computationally less expensive assessment of bound accuracy could be obtained by also simulating the first (outer) stage by resampling from q* and r*. However, that would preclude a direct comparison with the LR method and would increase the consequences of deviations from the pivotal quantity assumption.
The large-scale simulations were also used to compare confidence intervals obtained for μ and σ by using BCB with those obtained by using LR. Tables 5 and 6 summarize the results. In terms of μ, both BCB and LR generate confidence intervals that are typically centered around μ = 60. BCB tends to generate wider intervals than LR but, unlike LR, generates accurate levels of coverage over all of the conditions studied. The disparity in interval widths diminishes as n increases from 25 to 40. The coverage accuracy of LR clearly deteriorates in cases with the smaller sample size (n = 25). With respect to σ, BCB again generates accurate levels of coverage over all of the conditions studied. The confidence intervals tend to be centered above σ = 10. LR is anticonservative with respect to coverage at all conditions (particularly at n = 25) while generating narrower intervals that tend to be centered slightly below σ = 10. As in the case of μ, the disparity in interval widths between BCB and LR diminishes as n increases from 25 to 40. When comparing LR's coverage for μ versus σ at comparable conditions, it is clear that LR's coverage inaccuracy is more significant in the case of σ. This is not surprising given the small-sample bias associated with ML estimation of σ.

Concluding remarks
Sensitivity tests based on modern sequential strategies are often used to make statistical inferences in small-sample settings. Attempting to make inferences about any characteristic of a population with a small sample, let alone an extreme quantile, is a difficult task. This task is made more difficult by the incomplete nature of the observations. By necessity, there is a strong reliance on assumed parametric models, which leads to the use of likelihood-based methods for estimation and for constructing confidence bounds. Sequentially constructed designs pose additional problems for constructing accurate confidence bounds. As a consequence of this, combined with the bias associated with estimating σ, the coverage associated with confidence bounds generated by using the likelihood ratio is generally anticonservative in small-sample settings.
Assuming the model form is accurate, which admittedly is difficult to support (or refute) with limited incomplete data, this paper has demonstrated the efficacy of using a bias-corrected parametric bootstrap method involving approximate pivotal quantities for constructing confidence bounds. Specifically, it was demonstrated that the bounds can provide accurate levels of coverage when using a probit model in conjunction with a popular sequential method and method of analysis that effectively adapt to the model parameters. The supplementary material shows that this approach is also applicable in the case of other location-scale models. Also, while the BCB is sensitive to model misspecification, it appears to be no more so than other methods that have a parametric basis. A direct assessment of the performance of the proposed method is computationally burdensome due to the nature of the sequential strategies considered. Assuming that the approximate pivotal quantity assumption is credible, this paper proposes and demonstrates a way to lessen the computational burden of assessment by using a nonparametric bootstrap procedure to re-sample approximate pivotal quantities obtained from a small-scale parametric bootstrap simulation. It is expected that this approach could be useful in many other contexts where bootstrap simulations are computationally expensive.

Note to Tables 5 and 6: Coverage frequency is the number of simulations (out of N = 1,000) where the interval covers the true parameter value (e.g., σ = 10). Median confidence interval width (logarithmic): median(log₁₀(UB_j) − log₁₀(LB_j)). Median confidence interval midpoint (logarithmic): median(10^(0.5·(log₁₀(UB_j) + log₁₀(LB_j)))).

About the author
Edward Thomas is a statistician in Albuquerque, New Mexico. His email address is edvthomas@msn.com.