One-sided asymptotic inferences for a proportion

ABSTRACT Two-sided asymptotic confidence intervals for an unknown proportion p have been the subject of a great deal of literature. Surprisingly, very few papers have been devoted, like this article, to the one-tailed case, despite its great importance in practice and the fact that its behavior is usually different from that of the two-tailed case. This paper evaluates 47 methods and concludes that (1) the optimal method is the classic Wilson method with a correction for continuity and (2) a simpler option, almost as good as the first, is the new adjusted Wald method (Wald's classic method applied to the data increased by the values proposed by Borkowf: adding a single imaginary failure or success).


Introduction
Obtaining a confidence interval (CI) for an unknown binomial proportion p is a very common aim of inferential statistics. The two-sided case has traditionally been the subject of a great deal of attention, with Yu et al. [27] and Martín Andrés and Álvarez [16] being two of the most recent publications. The one-tailed case has aroused much less interest; most recently, only Pradhan et al. [22] and Cai [10] have addressed the subject, analyzing a small number of methods for a confidence of 95% or 97.5%, respectively. One-tailed CIs are extremely useful in quality control [20] and in tests of non-inferiority and superiority in biomedical studies [22]. Moreover, as Cai [10] showed, the performance of the same method can vary strongly depending on whether the CI has one or two tails, so each situation should be analyzed separately. The different behavior of the same formula in one-tailed and two-tailed inferences has been observed in other areas as well; for example, when two independent proportions are compared, Martín Andrés et al. [18] verified that the validity conditions are stricter for the one-tailed test than for the two-tailed test. In general, this different behavior is due to the fact that in two-tailed inferences a coverage that is too small in one tail can be compensated for by a coverage that is too large in the other, whereas in one-tailed inferences no such compensation can occur.
The exact CI for p can be obtained relatively simply by the Clopper-Pearson method, whose solution may be determined using Snedecor's F distribution. However, the procedure is too conservative and, in practice, not a good choice unless it is absolutely necessary to respect the error α [10]. This is why many authors approach the problem from the approximate standpoint. Approximate CIs may be of two types: the 'almost exact' CIs [1], which require rather intensive computation, and the 'asymptotic' CIs, which tend to be simpler to apply and are of great pedagogical interest [2]. In this paper we focus on the asymptotic CI, which, surprisingly, frequently performs well even for relatively small sample sizes.
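The exact Clopper-Pearson lower bound can also be obtained without Snedecor's F distribution, by solving the binomial tail equation P(X ≥ x | p) = α numerically. A minimal sketch in Python (the function names are ours, for illustration only):

```python
from math import comb

def upper_tail(p: float, x: int, n: int) -> float:
    """P(X >= x) for X ~ B(n; p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def clopper_pearson_lower(x: int, n: int, alpha: float = 0.05) -> float:
    """Exact left-sided lower bound p_L: the p solving P(X >= x | p) = alpha,
    found by bisection (the upper-tail probability is increasing in p)."""
    if x == 0:
        return 0.0   # with no observed successes the exact lower bound is 0
    lo, hi = 0.0, 1.0
    for _ in range(200):          # bisection to (essentially) machine precision
        mid = (lo + hi) / 2
        if upper_tail(mid, x, n) < alpha:
            lo = mid              # tail still too small: the bound lies above mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For x = n the equation reduces to p^n = α, so the bound is α^(1/n); this provides a quick check of the implementation.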
From an asymptotic standpoint, the most frequent two-sided CIs are based on the Wilson and Wald methods and on the likelihood ratio test, although occasionally methods based on the arcsine transformation are also used. Because the Wald method is among those with the worst performance [3,6,7,11,13,21], various adjusted Wald procedures have been suggested to improve it. These procedures are based on applying the Wald method to the sample increased by a given number of virtual successes and failures [3,8,11], a methodology that has practically never been applied to the one-sided CI.
The aim of this paper is to furnish an asymptotic CI for a proportion p which, based on the Normal distribution, respects as far as possible the nominal error. In addition, for pedagogic reasons, the CI should be easy to calculate.
From a Bayesian point of view, the CI depends on the distribution assigned to the parameter p. Several authors (such as Brown et al. [9]) select a beta prior distribution B(0.5, 0.5), which produces Jeffreys' Bayesian method. This method is selected even in preference to the 'almost exact' inferences, so it was advisable to include it in the present study. The methods proposed most recently [14,27] are also included.

General observations
In order for the inference to be coherent, the one-sided CI should be obtained by inverting the associated one-sided test. The consequences are twofold. On the one hand, the definition of a method can be made either from the perspective of the test or from the perspective of the CI: in this paper, both perspectives are adopted. On the other hand, evaluating a CI method is equivalent to evaluating the associated test method (if both are realized to the same nominal error α). To evaluate a CI, the parameters of real coverage and average length are usually used. To evaluate a test, the parameters of type I real error and power are used. But real coverage and real error add up to 1 and moreover, the greater the power of the test, the shorter the length of the CI obtained by inverting it. In the following, the evaluation of each method is carried out on the basis of the evaluation of the test that defines it.
The one-sided test may be right-sided (H: p ≤ π vs. K: p > π) or left-sided (H: p ≥ π vs. K: p < π), with 0 < π < 1. The present paper deals only with the right-sided version, because the left-sided test for p is the right-sided test for the proportion of failures 1−p (H: 1−p ≤ 1−π vs. K: 1−p > 1−π). By inverting the right-sided test one obtains the left-sided CI (both at the same error α): p ≥ p_L, where p_L = inf{π | the test of H: p ≤ π vs. K: p > π is not significant}. In asymptotic inferences, if P(π) is the p-value of the test of H vs. K, then p_L is the only solution in π of the equation P(π) = α (where 1−α is the nominal confidence of the CI).

Statistics to be used and the procedures they yield
Let x ∼ B(n; p) be a binomial random variable and let y = n−x, p̂ = x/n, q̂ = 1−p̂ = y/n and q = 1−p, where p is the parameter of interest. From the perspective of the classic statistic, various statistics can be used to test the hypotheses H: p ≤ π vs. K: p > π (with 0 < π < 1). The most traditional statistics are based on the typification of the binomial proportion, z = (p̂ − π)/√(p(1 − p)/n). The classic procedure in elementary textbooks consists in substituting p with p̂, which gives rise to the Wald procedure W; Wilson [25] proposed substituting p with π, thus obtaining the Wilson score procedure S. The resulting statistics are

W and S procedures: z_W = (p̂ − π)/√(p̂q̂/n) and z_S = (p̂ − π)/√(π(1 − π)/n).    (1)

Once the sample has been obtained and the experimental value of the statistic determined, the test is carried out in the classic way: by comparing z_W or z_S with z_α (and similarly in the following cases), where z_α is the 100 × (1−α)th percentile of the standard normal distribution.
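As an illustration, the two statistics in expression (1) can be computed as follows (a sketch with our own function names; in each case the experimental value is compared with z_α):

```python
from math import sqrt

def z_wald(x: int, n: int, pi: float) -> float:
    """Wald statistic z_W: the variance is estimated at the sample proportion."""
    p_hat = x / n
    return (p_hat - pi) / sqrt(p_hat * (1 - p_hat) / n)

def z_score(x: int, n: int, pi: float) -> float:
    """Wilson score statistic z_S: the variance is evaluated at the null value pi."""
    p_hat = x / n
    return (p_hat - pi) / sqrt(pi * (1 - pi) / n)
```

For example, with x = 60, n = 100 and π = 0.5, z_S equals exactly 2.0, while z_W is slightly larger; note also that z_W degenerates when x = 0 or x = n (its estimated variance is zero), which is part of the reason for its bad reputation.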
Other less traditional options are based on the arcsine transformation (procedure A) and on the likelihood ratio test (procedure R); their statistics are

A and R procedures: z_A = 2√n (sin⁻¹√p̂ − sin⁻¹√π) and z_R = sign(p̂ − π) √(2n[p̂ ln(p̂/π) + q̂ ln(q̂/(1 − π))]).    (2)

In this paper, the statistics based on the logarithmic transformation (of type ln or type logit) are omitted for two reasons: in the two-tailed case both behave very badly and, above all, their performance is not coherent [16].
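These two statistics can be sketched similarly; the standard forms are assumed here (z_A based on the variable sin⁻¹√p̂, and z_R as the signed root of the likelihood ratio statistic), with our own function names:

```python
from math import asin, copysign, log, sqrt

def z_arcsine(x: int, n: int, pi: float) -> float:
    """Arcsine-transformation statistic (standard form assumed):
    z_A = 2*sqrt(n)*(asin(sqrt(p_hat)) - asin(sqrt(pi)))."""
    return 2 * sqrt(n) * (asin(sqrt(x / n)) - asin(sqrt(pi)))

def z_lr(x: int, n: int, pi: float) -> float:
    """Signed root of the likelihood ratio statistic (standard form assumed)."""
    p_hat = x / n
    q_hat = 1 - p_hat

    def xlog(a: float, ratio: float) -> float:
        # a * log(ratio), with the usual convention 0 * log(0) = 0
        return 0.0 if a == 0 else a * log(ratio)

    lam = 2 * n * (xlog(p_hat, p_hat / pi) + xlog(q_hat, q_hat / (1 - pi)))
    return copysign(sqrt(max(lam, 0.0)), p_hat - pi)
```

Both statistics vanish at p̂ = π and, like z_W and z_S, are referred to the standard normal percentile z_α.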
All these procedures, and those that follow, can be found in Table 1 (for easy reference later).

Statistics with correction for continuity
When a discrete distribution is approximated by a continuous one, it is common to carry out a correction for continuity (cc in the following); the theoretical justification may be seen in Cox [12]. This occurs in the present case, in which the binomial distribution is approximated by the normal distribution. Haber [15] proposed that the cc should consist of adding to or subtracting from the variable half its average jump. In the case of the statistic z_W (and similarly for z_S), the variable is p̂ = x/n. Because it has a total jump of 1 (it takes values between 0 and 1) distributed over n jumps (since 0 ≤ x ≤ n), the cc will be the classic c = 1/2n. As a result, the statistics in expression (1) become the following classic statistics with cc, z_Wc [6] and z_Sc:

Wc and Sc procedures: z_Wc = (p̂ − π − c)/√(p̂q̂/n) and z_Sc = (p̂ − π − c)/√(π(1 − π)/n), where c = 1/2n.

For the statistic z_A in expression (2) a similar procedure can be followed, with the sole difference that the new variable sin⁻¹√p̂ has a total jump of 3.1416/2, which yields the following statistic with cc:

Ac procedure: z_Ac = 2√n (sin⁻¹√p̂ − c_A − sin⁻¹√π), where c_A = 3.1416/4n.
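A sketch of the corrected W and S statistics, assuming the correction c = 1/(2n) is subtracted in the numerator for the right-sided test (function names are ours):

```python
from math import sqrt

def z_wald_cc(x: int, n: int, pi: float) -> float:
    """Wald statistic with continuity correction c = 1/(2n) (procedure Wc)."""
    p_hat, c = x / n, 1 / (2 * n)
    return (p_hat - pi - c) / sqrt(p_hat * (1 - p_hat) / n)

def z_score_cc(x: int, n: int, pi: float) -> float:
    """Wilson score statistic with continuity correction c = 1/(2n) (procedure Sc)."""
    p_hat, c = x / n, 1 / (2 * n)
    return (p_hat - pi - c) / sqrt(pi * (1 - pi) / n)
```

Subtracting c shrinks the experimental value of the statistic, so the corrected tests are more conservative than their uncorrected counterparts: with x = 60, n = 100 and π = 0.5, z_Sc = 1.9 instead of z_S = 2.0.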

Data of samples and the methods they yield
Because the behavior of the classic and simple W procedure is worse than that of its traditional rivals [3,6,7,11,13,21], several authors have tried to improve it while preserving the simplicity of its expression. The traditional improvement consists in applying the statistic not to the original sample (x, y, n) but to the sample increased by h virtual successes and h′ virtual failures, that is, to the sample (x + h, y + h′, n + h + h′). The most frequent increases are those indicated in what follows; each increase produces a different case, numbered from 0 to 6 (Case 0 refers to the original, unincreased sample). The origin of the various cases is the following: Case 1 ([26], in another context; [5]), Cases 2 and 3 [3,11], Case 4 [17], Case 5 [8] and Case 6 [4]. Note that all the cases except Case 5 are symmetric, in the sense that h = h′. Note also that when 0 ≤ x < n, Case 3 produces the same results as Case 4; but if x = n, then h = z²_α/2 in Case 3 and h = z²_α in Case 4. Similarly, Cases 2 and 3 should produce very similar results for α = 2.5% (because 1.96²/2 ≈ 2). The seven increases above (Cases 0-6) may be applied to any of the seven procedures defined in the previous sections (W, S, A, R, Wc, Sc and Ac), giving rise to 49 possible inference methods. For example, procedure A produces seven methods: A0, A1, A2, A3, A4, A5 and A6. In fact, only 44 inference methods will be analyzed, because Case 6 has meaning only with procedures A and Ac (it is associated with the arcsine transformation). Martín and Álvarez [16] analyzed many of the earlier procedures for the two-tailed case.
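The augmentation step itself is straightforward; the sketch below applies the Wald statistic to an increased sample (the helper names are ours, and the illustrative choice h = h′ = 2 corresponds to an Agresti-Coull-style increase):

```python
from math import sqrt

def adjusted_sample(x: int, n: int, h_succ: float, h_fail: float):
    """Return the augmented sample: successes and total size after the increase."""
    return x + h_succ, n + h_succ + h_fail

def z_wald_adjusted(x: int, n: int, pi: float,
                    h_succ: float, h_fail: float) -> float:
    """Wald statistic computed on the augmented sample (hypothetical helper)."""
    x_a, n_a = adjusted_sample(x, n, h_succ, h_fail)
    p_t = x_a / n_a
    return (p_t - pi) / sqrt(p_t * (1 - p_t) / n_a)
```

With h_succ = h_fail = 2 the Wald statistic becomes usable even at x = 0 or x = n, where the unadjusted version degenerates because its estimated variance is zero.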

Methods based on CIs
In recent years, several authors [14,16,27] have proposed new two-sided asymptotic CIs for a proportion. Guan [14] and Yu et al. [27] proposed new two-sided CIs based on the generalized score interval given by the second expression in Equation (1): the first modifies the range of the interval; the second modifies its center. For the one-sided case, these intervals yield procedures G and Y, respectively. Adjusted methods are not proposed for them because these methods can themselves be understood as adjusted-type procedures; thus each one yields a single inference method: G0 and Y0.
From a Bayesian standpoint, the CI depends on the distribution assigned to the parameter p. Traditionally a beta distribution is assumed, the most frequent choices being the uniform distribution [24] and the non-informative Jeffreys prior distribution [9]. Brown et al. [9] compare several CIs and select the Jeffreys prior interval (procedure B), whose one-sided version is defined by p ≥ p_L = B_α(x + 0.5; y + 0.5), with B_α(a; b) being the αth quantile (0 ≤ α ≤ 1) of the beta distribution with parameters a and b. Owing to its origin, once again only the method B0 (with unincreased data) is contemplated here.
In the three previous cases, the test at type I error α is carried out using the classic procedure: H is accepted if π lies within the obtained CI.
The asymptotic method of Cai [10] is not considered in this paper, as it is more complicated to calculate, infrequently used and similar to B0 (according to its author). Our analysis indicates that the two methods are almost always similar and that, when they are not, method B0 is the better one.
If the 3 methods here (G0, Y0 and B0) are added to the 44 methods mentioned in the previous section, the 47 inference methods to be analyzed below are obtained.

Procedure for obtaining and analyzing results
In order to compare and evaluate the 47 methods referred to above, it is necessary to obtain certain parameters that synthesize the quality of each method (the parameters Δα, θ and F indicated below). To this end, the following steps must be carried out:

(1) Select a triplet of values (α, π, n) from among the values α = 1%, 5% and 10%; π = 0.05, 0.1 (0.1), 0.9, 0.95; and n = 20 (20), 100 and 200 (3 × 11 × 6 = 198 triplets).

(2) Construct the critical region CR = {x | z_exp(π) ≥ z_α}, with z_exp(π) referring to the experimental value of the statistic in the chosen method, and calculate the type I real error of the test, α* = 100 × Σ_{x∈CR} C(n, x) π^x (1 − π)^{n−x}, and the increase Δα = α − α* of the nominal error with regard to the real error. As the CI is obtained by inverting the test, each observation x gives rise to a CI for p given by CI(x) = {π_0 | z_exp(π_0) < z_α}, so that the real coverage will be γ* = 100 × Σ_{x∉CR} C(n, x) π^x (1 − π)^{n−x} = 100 − α*.

(3) Calculate θ = 100 × R/(n + 1), where R is the number of points in the CR. The value of θ is a good indicator of the power of the test, because as the number of points in the CR increases, the power increases accordingly. Martín and Silva [19], in another context, called this the 'long-term power' (because it is the average power of the test when the proportion p is drawn from a uniform distribution). The long-term power θ allows the power of two tests realized at the same error α to be compared globally, something the traditional power does not allow (because it depends on the particular alternative hypothesis under consideration).

(4) Determine whether the method 'fails', that is, whether Δα ≤ −1%, −2% or −4% for α = 1%, 5% or 10%, respectively, which permits monitoring the number of times that the test is too liberal. From the perspective of the CI, this means that a method fails if, for α = 5%, it produces CIs whose real coverage is ≤ 93% when the nominal coverage is 95%.
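Step (2) can be sketched as follows, taking the Wilson score statistic (procedure S0) as the example method; the function names are ours:

```python
from math import comb, sqrt

def z_score_stat(x: int, n: int, pi: float) -> float:
    """Wilson score statistic (procedure S0)."""
    return (x / n - pi) / sqrt(pi * (1 - pi) / n)

def real_type_I_error(n: int, pi: float, z_alpha: float,
                      stat=z_score_stat) -> float:
    """Exact size of the right-sided test: P(x in CR | p = pi), where
    CR = {x : z_exp(pi) >= z_alpha} and x ~ B(n; pi)."""
    cr = [x for x in range(n + 1) if stat(x, n, pi) >= z_alpha]
    return sum(comb(n, x) * pi**x * (1 - pi)**(n - x) for x in cr)
```

The real error is computed exactly from the binomial distribution, not by simulation; raising z_alpha shrinks the critical region and therefore the real error.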
When Δα > +2% (for α = 5%, a real error lower than 3%, or a real coverage greater than 97%), the test will be very conservative; it will not give false significances, but its possible bad performance will be obvious from its low value of θ. This criterion will be used below only in a negative sense, that is, to discard those methods of inference with too many failures. The criterion described for α = 5% is traditional in various fields [3,16,17,23]. With all these data, Table 2 (which contains the results for three of the cited methods) is constructed; the data for the remaining 44 methods may be obtained from the authors. In order to compare the different methods easily, the results are synthesized in a final step: the values of Δα and θ are globalized across all the pairs (n, π), and F records the number of failures.

Once the results are obtained, the selection of the optimal method is carried out under the following criterion (giving special importance to the case α = 5%, as it is the most frequent nominal error): (A) Discard the methods with an excessive number of failures, giving preference to the methods with few failures (and preferably none). (B) Of the remaining methods, choose those with a Δα closest to 0 (i.e. methods with an average error close to the nominal one) and a greater value of θ (i.e. methods which yield more powerful tests and narrower intervals). Additionally, in order for the significances to be reliable, preference will be given to conservative methods (Δα > 0) over liberal ones (Δα < 0). Moreover, when evaluating the magnitude of θ, the fact that a liberal method will generally be more powerful than a conservative one will be taken into consideration (though this does not mean that the first is better than the second). (C) Finally, of the remaining methods, preference will be given to those that are simplest to apply. Table 3 contains the values of F, Δα and θ (at the type I error α = 5%) for the 47 methods evaluated, grouped according to the statistic which produces them.
These values have been obtained by globalizing all the pairs of values (n, π). It can be noted that
(a) Among the type W methods, the only one that does not present failures is W5. The cc (type Wc methods) substantially improves the methods W2, W3 and W4, but not W5. The classic method W0 behaves badly even when a cc is carried out. The best methods are W5, Wc2, Wc3 and Wc4, which are equally conservative and have similar power; W5 is selected as it is the simplest of them.
(b) Of the type S methods, none behaves well without cc and almost all improve when it is applied. The only methods with no failures are Sc0 and Sc5 (both conservative), but Sc0 is chosen because its behavior is the best.
(c) Of the type A methods, only A5 has no failures when no cc is performed, while almost all improve when one is carried out. Methods Ac1, Ac5 and Ac6 do not have failures either, but A5 is chosen because overall it is the best (it is less conservative and more powerful than the other three).
(d) None of the methods of types R, G, Y or B should be chosen, because they all contain a large number of failures.
(e) Bearing in mind all of the above, the methods of greatest interest are W5, Sc0 and A5; for this reason Table 2 only displays the outcomes for these three methods.

Selection of the optimal method
With the aim of analyzing the behavior of the three chosen methods at the three type I errors, Table 4 gives the values of F, Δα and θ for the errors α = 1%, 5% and 10%. Although all three methods behave well, a more detailed analysis indicates that
• For α = 1%, W5 is discarded first because it has one failure and is less powerful than the other two, and Sc0 is selected in preference to A5 because it behaves somewhat better and is simpler.
• For α = 5%, no method fails and the increase in the error is similar in the three cases, but method Sc0 is simpler and slightly more powerful, so it is again the selected method.
• For α = 10%, none of the methods fails, but Sc0 is the least conservative and slightly more powerful, so again it is the one chosen.
In short, the general optimal method will be method Sc0, although a good and simple alternative (only a little worse than the previous one) is method W5.

Comparison of the results with published literature
Very few papers devoted to the analysis of the one-tailed CI have been published. Table 5 shows the results of the four most studied methods. Cai [10], in one of the two papers devoted exclusively to the case of the one-sided CI, compares methods W0, S0 and B0 with his own (a method that is practically the same as B0, as was mentioned in Section 2.5) at the error of 5%. His conclusions are that methods W0 and S0 behave very badly, and he recommends the Bayesian method B0. In the other article, Pradhan et al. [22] compare the asymptotic methods Wc0, S0, Sc0, W3, B0 and Cai's [10] at an error of 2.5%. Their conclusion is that, if it is not necessary to respect the error α, the best method is Sc0. From the outcomes in Table 5, it can be concluded that W0 performs very badly, which agrees with the results in Cai [10], but that the behavior of B0 is only slightly better than that of S0 and neither should be recommended, because they have too many failures. Additionally, our selection of method Sc0 as the optimal method is compatible with the findings of Pradhan et al. [22].

Comparison with the results of the case of two-sided CI
Martín and Álvarez [16] evaluate different asymptotic methods for obtaining a two-sided CI for p, and conclude that (a) the classic Wilson method (similar to S0 in this paper) only performs well in general when n ≥ 50, although it is ultimately the optimal method for any n when used at an error α = 1%; (b) for α ≥ 5%, the optimal method consists in increasing the data by 0.5 successes and 0.5 failures and then applying the arcsine transformation (a method similar to A1 in this paper); this procedure performs in practically the same way as the Jeffreys Bayesian method; (c) a simpler option, only a little worse than those mentioned above, is the classic adjusted Wald method of Agresti and Coull, which consists in increasing the data by two successes and two failures and then applying the Wald method (a method similar to W2 in this paper); (d) the method based on the likelihood ratio test for data increased by 0.5 successes and 0.5 failures (similar to R1 here) performs very well, but is almost the same as the method based on the arcsine transformation and rather more complicated (it does not permit an explicit expression).
These conclusions, contrasted with those obtained in this paper, confirm Cai's observation [10] that the performance of a specific method may vary depending on whether it is used to obtain a CI with one or two tails; hence the need for the two situations to be analyzed separately (as has been done in this paper).

Explicit CIs
The aim of this section is to present the explicit expressions of the test and the CI for the selected methods, in both right-sided and left-sided cases.
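As a numerical companion to those expressions, the lower bound of the selected Sc0 method can be obtained by inverting its statistic directly (a sketch under the definitions used above, with our own function names; the closed-form expression would follow from solving the corresponding quadratic in π):

```python
from math import sqrt

def z_score_cc(x: int, n: int, pi: float) -> float:
    """Wilson score statistic with continuity correction c = 1/(2n) (Sc0)."""
    c = 1 / (2 * n)
    return (x / n - pi - c) / sqrt(pi * (1 - pi) / n)

def lower_bound_sc0(x: int, n: int, z_alpha: float = 1.645) -> float:
    """Left-sided lower limit p_L of the Sc0 interval, found by bisection:
    the value of pi at which z_Sc(pi) falls below z_alpha."""
    if x == 0:
        return 0.0           # no lower bound above 0 without any success
    lo, hi = 1e-12, x / n    # z_Sc exceeds z_alpha near 0, falls below it at p_hat
    for _ in range(100):
        mid = (lo + hi) / 2
        if z_score_cc(x, n, mid) >= z_alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The same inversion scheme applies to any of the statistics in this paper; only z_score_cc needs to be swapped for the statistic of the chosen method.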

Disclosure statement
No potential conflict of interest was reported by the authors.