Comparison of Test Sizing Approaches for Initial and Follow-On Evaluation of Strategic Weapon Systems

ABSTRACT Testing programs exist to establish initial estimates of reliability for strategic weapon systems and to detect significant changes in reliability. The methods used to determine the appropriate sample sizes are based on one- and two-sample estimates of proportions. Existing guidelines provide specific methods for determining sample sizes for strategic weapon systems, but these methods are considered quite conservative. This article therefore investigates alternative approaches to both initial and follow-on test sizing for weapon system reliability estimates and compares the methods through Monte Carlo simulation.


INTRODUCTION
Testing programs exist to establish initial estimates of reliability for a device and to determine whether a significant decrease in reliability has occurred relative to an initial value. The device being evaluated may be a complex system with many components, assemblies, and subsystems. At the Johns Hopkins University Applied Physics Laboratory, missile systems are typically analyzed to establish an initial estimate of system reliability and to detect a significant decrease in reliability over time. Though other performance metrics such as accuracy are also evaluated, the focus here is on system reliability. Determining the number of tests needed to estimate reliability with confidence and to detect changes in reliability is important when destructive testing of expensive items is involved. The necessary sample size determines the number of items to be tested and helps quantify the cost of the testing program(s).
There are some methodologies appropriate for establishing initial estimates of reliability for weapon systems, whereas other methodologies are more appropriate for detecting changes in reliability. There are various assumptions and approximations possible for both test sizing situations. Guidelines are often established to define the number of full system tests required. These guidelines are based on exact distributional assumptions that typically require more testing than necessary to establish estimates with a specified confidence. Therefore, this article investigates several methods for initial and follow-on test sizing and provides a comparison of the results using simulations.
The remainder of the article is organized as follows. The first section explains the current initial assessment methodology based on specified guidelines and describes several methods in the literature used for confidence interval estimation of a single proportion. The second section provides details of the simulation study to compare the coverage probabilities and confidence interval widths of the considered methods. The third section outlines the current follow-on assessment methodology based on specified guidelines and describes several other methods in the literature used for confidence interval estimation of two proportions. The fourth section provides the simulation study comparing these methods. The final section provides concluding remarks based on the comparisons and a discussion of other alternatives.

Confidence Interval Methods for a Single Proportion
The standard approach for sample size determination is based on the formulation of a confidence interval for a single proportion. Let X denote a binomial random variable for sample size n and proportion p. To avoid approximation, the Clopper and Pearson (1934) "exact" confidence interval is typically used. The interval for p is based on inverting equal-tailed binomial tests of H0: p = p0. Its endpoints are the solutions in p0 to the equations

  sum_{k=x}^{n} (n choose k) p0^k (1 − p0)^(n−k) = α/2  (lower endpoint),
  sum_{k=0}^{x} (n choose k) p0^k (1 − p0)^(n−k) = α/2  (upper endpoint).

In the case when x = 0 the lower bound is 0, and when x = n the upper bound is 1. This interval estimator is guaranteed to have coverage probability of at least 1 − α for every possible value of p. The Clopper-Pearson exact interval is typically treated as the "gold standard" because it guarantees nominal coverage; however, it is necessarily conservative due to the discreteness of the binomial distribution. For any fixed parameter value, the actual coverage probability can be much larger than the nominal confidence level unless n is quite large. Another issue is that this method only considers integer numbers of successes, whereas a highly complex system with many components could yield partial successes during an operational test.
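As an illustrative sketch (the function name and bisection tolerance are ours, not from the guidelines), the Clopper-Pearson endpoints can be found numerically by bisection on the binomial tail sums:

```python
import math

def clopper_pearson(x, n, conf=0.95):
    """Clopper-Pearson 'exact' interval: the endpoints solve the
    equal-tailed binomial test equations in p0, found here by bisection."""
    alpha = 1.0 - conf

    def binom_cdf(k, p):  # P(X <= k) for X ~ binomial(n, p)
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def bisect(f):  # root of a decreasing function f on (0, 1)
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower endpoint solves P(X >= x | p0) = alpha/2; upper solves
    # P(X <= x | p0) = alpha/2, with the stated boundary cases.
    lower = 0.0 if x == 0 else bisect(lambda p: alpha / 2 - (1 - binom_cdf(x - 1, p)))
    upper = 1.0 if x == n else bisect(lambda p: binom_cdf(x, p) - alpha / 2)
    return lower, upper
```

For example, 9 successes in 10 tests at 95% confidence gives an interval of roughly (0.555, 0.997), illustrating how wide the exact interval is at small n.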
Confidence intervals that use the asymptotic normality of a maximum likelihood estimator are called Wald confidence intervals (Mukhopadhyay 2000). Let p̂ = X/n denote the sample proportion. Then the 100(1 − α)% Wald confidence interval for p is

  p̂ ± z_(1−α/2) √(p̂(1 − p̂)/n).

The discreteness of the binomial distribution often makes the normal approximation work poorly, resulting in a confidence interval that often has a confidence level lower than stated. There are a considerable number of methods in the literature for forming confidence intervals for a single proportion that try to address the "conservativeness" of the exact confidence interval and the "liberalness" of the normal approximation. Agresti and Coull (1998) adjusted the Wald interval by adding z²_(1−α/2) pseudo observations (half successes and half failures), resulting in p̃ = (X + z²_(1−α/2)/2)/(n + z²_(1−α/2)) and ñ = n + z²_(1−α/2). One can think of p̃ as a Bayes estimator with a beta(z²_(1−α/2)/2, z²_(1−α/2)/2) prior distribution (see Gelman et al. 2004). The adjusted Agresti-Coull confidence interval for p is

  p̃ ± z_(1−α/2) √(p̃(1 − p̃)/ñ).

This simple adjustment to the ordinary Wald interval changes it from a highly liberal to a slightly conservative confidence interval. Other adjustments can be made using the Jeffreys prior [beta(1/2, 1/2)] or the uniform prior [beta(1, 1)] distribution (Hamada et al. 2010). In the simulation study, these two intervals will be denoted the adjusted Jeffreys and adjusted uniform confidence intervals, respectively.
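The two intervals above are straightforward to compute; the sketch below (function names are ours) uses the standard normal quantile from the Python standard library:

```python
import math
from statistics import NormalDist

def wald_interval(x, n, conf=0.95):
    """Ordinary Wald interval: p_hat +/- z * sqrt(p_hat(1 - p_hat)/n),
    clipped to [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def agresti_coull_interval(x, n, conf=0.95):
    """Agresti-Coull: add z^2 pseudo observations (half successes and
    half failures) before applying the Wald formula."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - half), min(1.0, p_tilde + half)
```

For 9 successes in 10 tests at 95% confidence, the Wald lower bound sits near 0.71 while the Agresti-Coull adjustment pulls it down toward 0.57, reflecting its more conservative behavior.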
Instead of using a Bayes point estimator, one could calculate a 100(1 − α)% equal-tailed Bayesian interval. If p has a beta(a, b) prior distribution, then the posterior distribution for p given X and n is beta(X + a, n − X + b). The interval is given by the α/2 and 1 − α/2 quantiles of this beta distribution. Brown et al. (2001) recommended the equal-tailed Bayesian interval with the Jeffreys prior (Bayesian Jeffreys) for smaller sample sizes. Borkowf (2006) proposed a confidence interval that guarantees nominal coverage like the Clopper-Pearson method but is easy to construct and interpret like the Agresti-Coull method. This is done by centering the proportions, adding one failure in the construction of the lower confidence bound and one success in the construction of the upper confidence bound. The lower and upper bounds of the Borkowf confidence interval are

  p1 − z_(1−α/2) √(p1(1 − p1)/n)  and  p2 + z_(1−α/2) √(p2(1 − p2)/n),

where the centering proportions are defined as p1 = X/(n + 1) and p2 = (X + 1)/(n + 1). The actual sample size n is used to construct the lower and upper bounds rather than the augmented sample size (n + 1). Newcombe (2001) investigated the logit Wald interval and the Wilson score interval. The logit Wald interval's major drawback is that it cannot be directly calculated when the sample proportion is very close to 0 or 1, which makes it difficult to recommend for general use. The logit Wald confidence interval is

  log(p̂/(1 − p̂)) ± z_(1−α/2) √(1/(n p̂(1 − p̂))),

followed by a transformation back to the proportion scale. The Wilson score interval is the inversion of the normal-approximation test statistic but uses the null standard error instead of the estimated standard error used in the standard Wald hypothesis test (Wilson 1927):

  [p̂ + z²_(1−α/2)/(2n) ± z_(1−α/2) √(p̂(1 − p̂)/n + z²_(1−α/2)/(4n²))] / (1 + z²_(1−α/2)/n).

All of the methods considered have some advantages and disadvantages.
For example, the Wald-based methods are easy to explain but could give subnominal coverage at particular proportion values. The exact methods guarantee at least nominal coverage but can be conservative under many circumstances, resulting in larger sample sizes (i.e., more cost). The Bayesian approaches provide a slightly different interpretation from the traditional confidence interval. The methods considered are not a complete summary of the literature but include methods that are commonly used.
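The Wilson score limits are the roots of a quadratic in p; the following sketch computes them in closed form (the function name is ours):

```python
import math
from statistics import NormalDist

def wilson_interval(x, n, conf=0.95):
    """Wilson score interval: invert the score test, which uses the null
    standard error sqrt(p(1 - p)/n) rather than the estimated one."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half
```

For 9 successes in 10 tests at 95% confidence this gives roughly (0.596, 0.982), noticeably narrower than the Clopper-Pearson exact interval for the same data.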

Simulation Comparisons of Single Proportions Methods
This section provides an evaluation of the alternative intervals considered through Monte Carlo simulation. The goal of the study is to determine whether or not there is a confidence interval approach that is uniformly better than other methods. Special attention is given to comparisons against the Clopper-Pearson exact method since it is the gold standard. Simulation results are compared based on two criteria: (1) the coverage probability against a nominal confidence level (i.e., its conservativeness) and (2) the width of the confidence interval.
For a fixed value of a parameter, the actual coverage probability is the proportion of times the confidence interval contains or "covers" the true parameter. If confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter should approximately match the nominal confidence level. A confidence interval with a coverage probability higher than the nominal confidence level is "conservative." The simulation is set up as follows: 1. A random sample, X, is drawn from a binomial(n, p) distribution, where n is the sample size and p is the true proportion. 2. All confidence intervals are constructed based on the formulations in the prior section. 3. A value of 1 is given to the interval if it contains the true proportion p or a value of 0 is given otherwise. 4. The confidence interval width (the upper bound minus the lower bound) is calculated.
The steps are completed 10,000 times for sample sizes ranging from five to seventy-five, proportions from 0.70 to 0.95, and nominal confidence levels from 0.80 to 0.95. Sample sizes above seventy-five were not studied because the testing involved is destructive and expensive, making anything larger unachievable. The range of proportions reflects the fact that missile systems are typically highly reliable.
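The four simulation steps can be sketched in a few lines. The example below estimates coverage and average width for the Wald interval (included inline so the sketch is self-contained); the 10,000-replicate count matches the study, but the seed and the choice of method are illustrative:

```python
import math
import random
from statistics import NormalDist

def wald_interval(x, n, conf):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def coverage_and_width(interval_fn, n, p, conf=0.90, reps=10000, seed=1):
    """Steps 1-4: draw X ~ binomial(n, p), build the interval, record
    whether it covers p and its width, and return the averages."""
    rng = random.Random(seed)
    hits, total_width = 0, 0.0
    for _ in range(reps):
        x = sum(rng.random() < p for _ in range(n))  # binomial(n, p) draw
        lo, hi = interval_fn(x, n, conf)
        hits += lo <= p <= hi
        total_width += hi - lo
    return hits / reps, total_width / reps

cov, width = coverage_and_width(wald_interval, 15, 0.90)
```

At n = 15 and p = 0.90 the estimated Wald coverage falls well below the 0.90 nominal level, consistent with the "liberalness" discussed above.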
Tables A1 through A5 in the Appendix (see the online supplementary information) show the average coverage probabilities and root mean square errors of the confidence interval methods for sample sizes of fifteen through seventy-five. For each sample size and confidence level, the methods are evaluated using a Freeman-Tukey test (Freeman and Tukey 1950); methods with similar coverage probabilities are indicated by the same superscript. Tables A6 through A10 in the Appendix show the average confidence interval widths and associated standard deviations. Tukey's studentized range test was used to group the methods based on confidence interval widths (Montgomery 2009). Figures 1 and 2 provide the average coverage probabilities and average confidence interval widths of each method considered for select sample sizes at the 0.90 and 0.95 confidence levels.
When the true proportion is between 0.70 and 0.80, most interval procedures provide at least the nominal confidence level. The average coverage probability of the adjusted uniform interval is typically the closest to the desired confidence level but becomes near 1 for very high proportions. Generally, these confidence intervals provide at least the nominal confidence, except in a few cases for smaller sample sizes. In terms of average interval width, the adjusted uniform method runs in the middle of the pack. Therefore, for proportions less than 0.80, it is recommended to use the adjusted uniform method.
For proportions greater than 0.80, the Bayesian Jeffreys and Wilson score methods provide intervals with the smallest widths regardless of sample size. Whereas most confidence intervals have coverage probabilities approaching 1 as the true proportion increases, these two methods are not as highly conservative as the others. For small sample sizes, their coverage probabilities are always at least the nominal confidence level. Whereas the mean coverage probability of the adjusted uniform interval becomes more conservative for proportions greater than 0.80, the coverage probability of the Bayesian Jeffreys method remains relatively stable. The Bayesian Jeffreys method performs relatively well even for higher true proportions, though it does not provide the same coverage probability as the adjusted uniform. For proportions less than 0.80 both methods provide intervals of approximately the same width, but for proportions greater than 0.80 the difference in interval widths is statistically significant, in favor of Bayesian Jeffreys. Therefore, the Bayesian Jeffreys confidence interval is recommended for proportions greater than 0.80. The other confidence interval methods considered become highly conservative, with the exception of the Wald and logit intervals. The coverage probabilities of the Wald and logit intervals improve as the sample size increases but fall well below the nominal confidence level for higher proportions. The Clopper-Pearson, Borkowf, and adjusted Agresti-Coull procedures consistently produce the most conservative and widest confidence intervals across all sample sizes. When the true proportion is above 0.95, their coverage probabilities are approximately 1. The adjusted Jeffreys method produces intervals of relatively small widths, yet its coverage probability is relatively high for proportions close to 1. This method performs relatively well in certain cases; however, its coverage probabilities are not consistently better or worse than the nominal confidence level.

Hypothesis Test Based Methods for Two Proportions
The second critical part of the test and evaluation program for a strategic weapon system is the follow-on test program. The goal of the follow-on test program is to detect a possible decrease in system reliability with a specified probability of detection (power) and a specified false alarm rate (significance level). This naturally leads to the comparison of two independent binomial proportions, which occurs frequently in statistical practice.
Let X and Y denote the numbers of successes in two independent samples from binomial(n1, p1) and binomial(n2, p2) populations, respectively. The joint probability of a realization is

  P(X = x, Y = y) = (n1 choose x) p1^x (1 − p1)^(n1−x) (n2 choose y) p2^y (1 − p2)^(n2−y)

for x = 0, 1, . . ., n1 and y = 0, 1, . . ., n2. The goal is to test H0: p1 = p2 against H1: p1 ≠ p2, or to test H0: p1 = p2 against H1: p1 > p2. The one-sided test is of interest because one wants to know whether there has been a significant reliability degrade in a given year. To avoid approximation, a quasi-exact method is used to determine the number of tests n2 required: the significance level and power are obtained by summing the joint probabilities over the rejection region x/n1 − y/n2 > C, for C between 0 and Δ, with p2 defined as p1 − Δ. The follow-on sample size n2 can then be determined for a degrade of Δ in reliability from the current reliability p1 and number of tests n1, for a specified power 1 − β and confidence level 1 − α. As with the initial test program approach, this test procedure is also thought to be necessarily conservative because of the discreteness of the binomial distribution. Several articles in the literature have focused on assessing confidence interval methods for the difference between two proportions; Brown and Li (2005), among others, provided comprehensive reviews of several of those methods. The approach taken here is to evaluate the confidence interval methods with respect to coverage probability and confidence interval width. Each candidate confidence interval method is then inverted into the hypothesis-testing setup and evaluated against the quasi-exact method with respect to power and significance level for specified reliability degrades. Berger (1994) evaluated several exact unconditional tests for comparing two binomial proportions and concluded that Boschloo's (1970) test, with the confidence interval modification by Berger and Boos (1994), generally has the best power properties.
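The quasi-exact calculation can be sketched directly from the description above: sum the joint binomial probabilities over the rejection region, evaluating at p2 = p1 for the false-alarm rate and at p2 = p1 − Δ for the power. The function name, example sample sizes, and threshold C below are illustrative choices, not values from the guidelines:

```python
import math

def reject_prob(n1, p1, n2, p2, C):
    """Probability of rejection: the joint binomial pmf summed over the
    region x/n1 - y/n2 > C."""
    def pmf(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(pmf(x, n1, p1) * pmf(y, n2, p2)
               for x in range(n1 + 1)
               for y in range(n2 + 1)
               if x / n1 - y / n2 > C)

# Significance level (no degrade, p2 = p1) and power at a degrade of 0.30:
alpha_hat = reject_prob(25, 0.90, 10, 0.90, C=0.20)
power_hat = reject_prob(25, 0.90, 10, 0.90 - 0.30, C=0.20)
```

Sweeping n2 upward until `power_hat` reaches 1 − β while `alpha_hat` stays at or below α reproduces the test-sizing logic described above.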
Let p̂1 = X/n1 and p̂2 = Y/n2 denote the sample proportions. The standard large-sample Wald confidence interval (Mukhopadhyay 2000) for the difference between two proportions is

  p̂1 − p̂2 ± z_(1−α/2) √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2).

Though this method is straightforward and widely used, it behaves poorly, as shown in the one-sample case. The adjustment to the Wald confidence interval by Agresti and Coull (1998) can be extended to the two-sample case (Agresti-Coull extension) by adding z²_(1−α/2) pseudo observations (half successes and half failures) to both samples, giving p̃1 = (X + z²_(1−α/2)/2)/(n1 + z²_(1−α/2)) and p̃2 = (Y + z²_(1−α/2)/2)/(n2 + z²_(1−α/2)), with ñ1 = n1 + z²_(1−α/2) and ñ2 = n2 + z²_(1−α/2), and resulting in the confidence interval

  p̃1 − p̃2 ± z_(1−α/2) √(p̃1(1 − p̃1)/ñ1 + p̃2(1 − p̃2)/ñ2).

Agresti and Caffo (2000) provided a simpler adjustment to the Wald interval by adding one success and one failure to both samples; this adjusted interval will be referred to as the adjusted uniform method in the simulation study. The simple adjustment has advantages in computation and presentation and has been shown through simulation to work quite well, though there is no theoretical support for those simulation conclusions. Other adjustments could be made using a Jeffreys prior [beta(1/2, 1/2)]; this adjusted interval will be referred to as the adjusted Jeffreys method.
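The Wald-type intervals above differ only in how many pseudo observations are added to each sample, so one sketch covers both the plain Wald (t = 0) and the Agresti-Caffo "add one success and one failure" adjustment (t = 1). The function name and defaults are ours:

```python
import math
from statistics import NormalDist

def diff_wald(x, n1, y, n2, conf=0.95, t=0):
    """Wald interval for p1 - p2 after adding t pseudo successes and t
    pseudo failures to each sample (t=0: plain Wald; t=1: Agresti-Caffo)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    m1, m2 = n1 + 2 * t, n2 + 2 * t
    p1, p2 = (x + t) / m1, (y + t) / m2
    half = z * math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
    d = p1 - p2
    return d - half, d + half
```

For 45/50 initial successes against 8/10 follow-on successes at 95% confidence, the plain Wald interval is roughly (−0.16, 0.36); the adjustment shifts the estimates toward 1/2 and shrinks the effective difference.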
An alternative way to adjust the Wald confidence interval is to take a fully Bayesian approach (Bayesian Jeffreys). If we assign the Jeffreys prior distribution to both p1 and p2, the joint posterior distribution for p1 and p2 is proportional to the likelihood

  p1^x (1 − p1)^(n1−x) p2^y (1 − p2)^(n2−y)

times the Jeffreys prior densities, so that, a posteriori, p1 and p2 are independently distributed beta(x + 1/2, n1 − x + 1/2) and beta(y + 1/2, n2 − y + 1/2); one can then calculate the posterior probability of various functions of p1 and p2. Brown et al. (2005) calculated confidence bands based on the fully Bayesian approach and showed that it did not outperform the adjusted Jeffreys method. Brown et al. (2005) also suggested a recentered Wald confidence interval, which replaces the normal quantile with t = T_(1−α; n1+n2−2) and recenters the interval at a truncated estimate p̃ lying between (p̂1 − p̂2)n2/(n1 + n2) and 1 − (p̂1 − p̂2)n1/(n1 + n2). Empirical results showed that using a T distribution performed much better than the normal when n1 and n2 were small. The Newcombe hybrid score confidence interval is formed by calculating Wilson score intervals for each of the two independent binomial proportions p1 and p2. Let (li, ui) be the roots in p of the quadratic equation z_(α/2) = |p̂i − p| / √(p(1 − p)/ni); then the confidence interval has the form

  ( p̂1 − p̂2 − z_(α/2) √(l1(1 − l1)/n1 + u2(1 − u2)/n2),  p̂1 − p̂2 + z_(α/2) √(u1(1 − u1)/n1 + l2(1 − l2)/n2) ).

Zhou et al. (2004) proposed a confidence interval based on the Edgeworth expansion for the studentized difference between two binomial proportions.
Unlike the Wald approximation, this method takes into account the impact of the skewness of the binomial distribution through a correction term derived from the Edgeworth expansion of the studentized difference. All of the methods considered have some advantages and disadvantages, which have been addressed in each method's description. The methods considered are not a complete summary of the literature but include methods that are commonly used.
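The Newcombe hybrid score construction can be sketched directly from the description above: compute the Wilson score limits (li, ui) for each proportion, then combine them around the observed difference (function names are ours):

```python
import math
from statistics import NormalDist

def newcombe_hybrid(x, n1, y, n2, conf=0.95):
    """Hybrid score interval for p1 - p2 built from the single-sample
    Wilson score limits of each proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)

    def wilson_limits(k, n):
        p = k / n
        denom = 1 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return centre - half, centre + half

    l1, u1 = wilson_limits(x, n1)
    l2, u2 = wilson_limits(y, n2)
    d = x / n1 - y / n2
    lower = d - z * math.sqrt(l1 * (1 - l1) / n1 + u2 * (1 - u2) / n2)
    upper = d + z * math.sqrt(u1 * (1 - u1) / n1 + l2 * (1 - l2) / n2)
    return lower, upper
```

For example, comparing 56 successes in 70 trials against 48 in 80 at 95% confidence yields an interval of roughly (0.052, 0.334).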

Simulation Comparisons of Two Proportion Methods
This section provides an evaluation of the alternative two-sample confidence interval approaches considered through Monte Carlo simulation. The goal of the study is to determine whether or not there is a two-sample approach that is uniformly better than the other methods. Simulation results are compared based on two criteria: (1) the coverage probability for a given confidence and delta difference and (2) the width of the confidence interval. For fixed values of the parameters p 1 and p 2 , the actual coverage probability is the proportion of times the confidence intervals contain or "cover" the true difference. A confidence interval with a coverage probability higher than the nominal confidence level is "conservative." The simulation is set up as follows: 1. Two random samples, X and Y, are drawn from a binomial(n 1 , p 1 ) and binomial(n 2 , p 2 ), respectively, where n 1 and n 2 are the sample sizes and p 1 and p 2 are the true proportions. 2. All confidence intervals are constructed based on the formulations in the prior section. 3. A value of 1 is given to the confidence interval if it contains the true delta difference or a value of 0 is given otherwise. 4. The confidence interval width (upper bound minus the lower bound) is calculated.
The steps are completed 10,000 times for initial sample sizes (n1) ranging from twenty-five to seventy-five, follow-on sample sizes (n2) ranging from two to twenty, initial proportions from 0.70 to 0.95, nominal confidence levels from 0.75 to 0.95, and delta differences ranging from 0.10 to 0.30. The estimated coverage probability for a test is the proportion of intervals containing the true delta difference. Follow-on sample sizes above twenty were not studied because the expensive destructive testing is performed annually, making anything larger unachievable. The two-proportion methods considered are not limited to this program of testing; they can be applied to system tests conducted on a more or less frequent basis. The range of proportions reflects the fact that missile systems are typically highly reliable, and the range of delta differences reflects the fact that with limited testing one may only be able to detect significant drops in reliability.
Tables A11 through A16 in the Appendix show the average coverage probabilities and root mean square errors of each confidence interval method for sample sizes of fifteen through seventy-five. For each sample size and confidence level, the methods are evaluated using a Freeman-Tukey test (Freeman and Tukey 1950); methods with similar coverage probabilities are indicated by the same superscript. Tables A17 through A22 in the Appendix show the average confidence interval widths and associated standard deviations. Tukey's studentized range test was used to group the methods based on confidence interval widths (Montgomery 2009). Figures 3-14 provide the average coverage probabilities and average confidence interval widths of each method considered for select sample sizes at the 0.90 and 0.95 confidence levels.
The adjusted Jeffreys, Bayesian Jeffreys, and Newcombe hybrid intervals provide coverage probabilities close to the nominal confidence level without being as conservative as the Agresti-Coull and adjusted uniform intervals. For smaller sample sizes, the adjusted and Bayesian Jeffreys intervals typically do not achieve the nominal confidence level for proportions between 0.70 and 0.80, but for more extreme initial proportions the intervals become conservative. The Newcombe hybrid method is comparable to the two Jeffreys methods but fails to maintain adequate coverage for higher initial proportions. For larger sample sizes, the coverage probabilities improve for all three intervals, but the same general pattern still holds. The Bayesian Jeffreys and Newcombe hybrid methods provide intervals with the smallest widths regardless of sample size. For initial proportions between 0.70 and 0.85, the Newcombe hybrid method provides the smallest intervals, whereas the Bayesian Jeffreys method produces smaller intervals for initial proportions greater than 0.85. Therefore, the Newcombe hybrid method is recommended for initial proportions less than 0.85 and the Bayesian Jeffreys method for initial proportions greater than 0.85. For small sample sizes, the coverage probabilities of the Edgeworth expansion, Wald, and recentered Wald methods fall well below the nominal confidence level. The coverage probabilities of these three methods improve for larger sample sizes but still fall short of the other methods discussed. The Agresti-Coull and adjusted uniform methods have coverage probabilities exceeding 0.95 in almost all cases, even when the nominal confidence level is 0.90. Table 1 provides the results comparing the Bayesian Jeffreys method to the quasi-exact method with respect to power. The actual significance level is listed for the quasi-exact method because that method cannot always achieve the desired significance level.
Overall the methods are comparable, with the quasi-exact method being slightly more powerful as the sample size increases.

CONCLUSIONS
Testing programs exist to establish initial estimates of reliability for systems and to determine whether a significant decrease in reliability has occurred in relation to the initial value. For initial test programs, the methodology is based on confidence intervals for a single proportion. Several methods were investigated through a simulation study. Based on the simulation results, the Bayesian Jeffreys method performs the best in the majority of situations, with the adjusted uniform performing best for proportions less than 0.80. For follow-on test programs, the methodology is based on hypothesis testing for two proportions. Several confidence interval methods for the difference between two proportions were investigated through a simulation study. Based on the simulation study results, the Newcombe hybrid score method and the Bayesian approach with Jeffreys prior show the most promise. However, the Bayesian method provided very little gain in power when compared to the quasi-exact method.

ABOUT THE AUTHORS
Joseph D. Warfield is a senior research statistician and section supervisor of the Statistical Sciences Section of the Force Projection Sector at the Johns Hopkins University Applied Physics Laboratory. His current areas of research focus on optimal design of experiments and reliability-related methods. He received a B.S. degree in Mathematics from Loyola University, an M.S. degree in Statistics from Virginia Tech, and a Ph.D. in Statistics from the University of Maryland, Baltimore County.
Sarah Elise Roberts is a research statistician in the Force Projection Sector at the Johns Hopkins University Applied Physics Laboratory in Laurel, Maryland. She received a B.S. degree in Mathematics from the Pennsylvania State University and an M.S. degree in Statistics from the University of Maryland, Baltimore County.

SUPPLEMENTAL MATERIAL
Supplemental data for this article can be accessed on the publisher's website.