A Rate of Incoherence Applied to Fixed‐Level Testing

It has long been known that the practice of testing all hypotheses at the same level (such as 0.05), regardless of the distribution of the data, is not consistent with Bayesian expected utility maximization. According to de Finetti’s “Dutch Book” argument, procedures that are not consistent with expected utility maximization are incoherent and they lead to gambles that are sure to lose no matter what happens. In this paper, we use a method to measure the rate at which incoherent procedures are sure to lose, so that we can distinguish slightly incoherent procedures from grossly incoherent ones. We present an analysis of testing a simple hypothesis against a simple alternative as a case‐study of how the method can work.

Introduction.
Cox (1958) and Lindley (1972) have shown that the practice of testing all hypotheses at the same level, regardless of the distribution of the data, can lead to inadmissibility and incompatibility with Bayesian decision theory. One of the most compelling arguments for Bayesian decision theory and the use of probability to model uncertainty is the "Dutch Book" argument, which says that if you are willing to accept either side of each bet implied by your statements, and finite combinations of these together, then either (a) those statements are "coherent," that is, they comport with the axioms of probability, or (b) a gambler betting against you can choose bets so that you are a sure loser.
Excellent introductions to the concepts of coherence and Dutch Book can be found in Shimony (1955), Freedman and Purves (1969), and de Finetti (1974, Section 3.3). As a practical matter, it is very difficult to structure one's statements of probabilities (i.e., previsions) in such a way that they both reflect one's beliefs and are coherent (see Kadane and Wolfson 1998). Yet the dichotomy above does not allow for discussion of which sets of previsions may be "very" incoherent and which only "slightly" incoherent. This paper explores a remedy by studying how quickly an incoherent bookie can be forced to lose money. A faster rate of sure financial decline for the bookie, or a faster rate of guaranteed profit for the gambler, is associated with a greater degree of incoherence.
The problem as stated so far requires some normalization. Suppose that a particular combination of gambles yields a sure loss y for the bookie. Then multiplying each gamble by the same constant k > 0 will create a combination of gambles that yields sure loss ky. In this paper we explore how to perform the normalization from the bookie's perspective. We introduce normalizations in Section 3. To fix ideas, however, consider that the bookie cannot be assumed to have infinite resources. A wise gambler would want to be sure that the bookie could cover all the bets. One way to do this would be to require the bookie to escrow the maximum amount that the bookie could lose on each gamble separately. Thus we can ask how much the bookie can be forced to lose for sure, given a specified level of escrow that the bookie can offer. In Section 4, we apply these ideas to assess the incoherence of the practice of fixed level hypothesis testing.

Gambles and Incoherence.
Think of a random variable X as a function from some space T of possibilities to the real numbers. We assume that, for a bounded random variable X, a bookie might announce some value x such that he/she finds acceptable all gambles whose net payoff to the bookie is α(X − y) for α > 0 and y < x. Each such x will be called a lower prevision for X. In addition, or alternatively, the bookie may announce some value x such that the gamble α(X − y) is acceptable when α < 0 and y > x. These x will be called upper previsions for X. We allow that the bookie might announce only upper previsions or only lower previsions or both. For example, if X is the indicator I_A of an event A, the bookie might announce that he/she finds acceptable all gambles of the form α(I_A − y) for y < p if α > 0 but no other gambles; in particular, not for y = p. It will turn out not to matter for any of our results whether or not the bookie finds the gamble α(I_A − p) acceptable. In the special case in which x is both an upper prevision and a lower prevision, we call x a prevision of X and denote it P(X). Readers interested in a thorough discussion of upper and lower previsions should refer to Walley (1991).
It will be convenient to assume that, whenever both an upper prevision x⁺ and a lower prevision x⁻ have been assessed for the same random variable X, x⁻ ≤ x⁺; otherwise the bookie is willing to sell X for a certain price and then buy it right back for a higher price. Although such incoherence could be measured, it requires cumbersome bookkeeping that makes general results difficult to understand. (See Examples 4 and 5 of Schervish et al. 1999.) In particular, this assumption implies that there can be at most one prevision of X.
A collection x_1, . . . , x_n of upper and/or lower previsions for X_1, . . . , X_n respectively is incoherent if there exist ε > 0 and a collection of acceptable gambles α_i(X_i − y_i), i = 1, . . . , n, such that

Σ_{i=1}^n α_i(X_i(t) − y_i) ≤ −ε for all t, (1)

in which case we say that a Dutch Book has been made against the bookie. Of course, we would need α_i > 0 and y_i < x_i if x_i is a lower prevision for X_i, and we would need α_i < 0 and y_i > x_i if x_i is an upper prevision for X_i. When a collection of upper and/or lower previsions is incoherent, we would like to be able to measure how incoherent they are. As we noted earlier, the ε in (1) is not a good measure because we could make ε twice as big by multiplying all of the α_i in (1) by 2, but the previsions would be the same. Instead, we need to determine some measure of the sizes of the gambles and then consider the left-hand side of (1) relative to the total size of the combination of gambles. This is what we do in Section 3.
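To make the definition concrete, the following minimal sketch (in Python; the event and the numerical previsions are hypothetical, not taken from the paper) exhibits a Dutch Book against a bookie who announces lower previsions of 0.6 both for an event A and for its complement.

```python
# A minimal numeric sketch (hypothetical numbers, not from the paper): a bookie
# announces lower previsions of 0.6 for an event A and 0.6 for its complement.
# Because these sum to more than 1, the previsions are incoherent, and the
# gambler can construct a Dutch Book with alpha_1 = alpha_2 = 1 and prices y_i
# just below each lower prevision.

def bookie_payoff(indicators, alphas, prices):
    """Net payoff to the bookie from the gambles alpha_i * (X_i - y_i)."""
    return sum(a * (x - y) for x, a, y in zip(indicators, alphas, prices))

alphas = [1.0, 1.0]      # positive coefficients, as required for lower previsions
prices = [0.59, 0.59]    # y_i < 0.6, so the bookie finds both gambles acceptable

# The two possible states: A occurs (I_A, I_Ac) = (1, 0) or A fails (0, 1).
for state, indicators in {"A occurs": (1, 0), "A fails": (0, 1)}.items():
    print(state, round(bookie_payoff(indicators, alphas, prices), 4))
# Both states give the bookie -0.18, a sure loss (epsilon = 0.18 in (1)).
```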

Normalizations.
To begin, consider a single acceptable gamble such as Y = α(X − y). There are a number of possible ways to measure the size of Y. For example, sup_t |Y(t)| or sup_t −Y(t) might be suitable measures. This last one has a nice interpretation: it is the most that the bookie can lose on the one particular gamble. It measures a gamble by its extreme value in the same spirit as Dutch Book measures incoherence in terms of an extreme value (the minimum payoff to the gambler) of a combination of gambles. Alternatively, if we think of the gambler and bookie as adversaries with regard to this one gamble Y, the gambler might want to be sure that the bookie will be able to pay up when the bet is settled. We could imagine that the gambler requests that the bookie place funds in escrow to cover the maximum possible loss. So, for the remainder of the paper, we will call e(Y) = sup_t −Y(t) the escrow for gamble Y. Note that e(cY) = ce(Y) for all c > 0. We use the escrow to measure the size of the gamble Y.
Example 1. Let A be an arbitrary event which is neither certain to occur nor certain to fail. Suppose that a lower prevision p is given, and consider the gamble Y(t) = α(I_A(t) − p) with α > 0. Then sup_t −Y(t) = αp, and the escrow is e(Y) = αp. If an upper prevision q is given and α < 0, then sup_t −Y(t) = −α(1 − q) = e(Y).
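The two cases of Example 1 can be checked directly; the following small sketch (illustrative numbers only) computes e(Y) = sup_t −Y(t) for gambles on an indicator by taking the worst case over I_A ∈ {0, 1}.

```python
# A small check of Example 1 (illustrative numbers only): the escrow
# e(Y) = sup_t -Y(t) of a gamble on an indicator I_A is the worst loss
# to the bookie over the two possible values of I_A.

def escrow_indicator(alpha, price):
    """Escrow of Y(t) = alpha * (I_A(t) - price), the worst case over I_A in {0, 1}."""
    return max(-alpha * (ind - price) for ind in (0, 1))

print(escrow_indicator(alpha=2.0, price=0.3))    # alpha > 0, lower prevision p: 2 * 0.3 = 0.6
print(escrow_indicator(alpha=-2.0, price=0.7))   # alpha < 0, upper prevision q: 2 * (1 - 0.7) = 0.6
```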
When we consider more than one gamble simultaneously, we need to measure the size of the entire collection. We assume that the size of (escrow for) a collection of gambles is some function of the escrows for the individual gambles that make up the collection, that is, e(Y_1, . . . , Y_n) = f_n(e(Y_1), . . . , e(Y_n)). In order for a function to be an appropriate measure of size, we have a few requirements. First,

f_n(cx_1, . . . , cx_n) = c f_n(x_1, . . . , x_n), for all c > 0 and all x_1, . . . , x_n. (2)

Equation (2) says that the function f_n must be homogeneous of degree 1 in its arguments, so that scaling up all the gambles by the same amount will scale the escrow by that amount as well. Second, since we are not concerned with the order in which gambles are made, we require

f_n(x_1, . . . , x_n) = f_n(y_1, . . . , y_n), for all n, all x_1, . . . , x_n, and all permutations (y_1, . . . , y_n) of (x_1, . . . , x_n). (3)

Third, in keeping with the use of escrow to cover bets, we will require that, if a gamble is replaced by one with higher escrow, the total escrow should not go down:

f_n(x_1, . . . , x_n) is nondecreasing in each of its arguments. (4)

If a gamble requires 0 escrow, we will assume that the total escrow is determined by the other gambles:

f_{n+1}(x_1, . . . , x_n, 0) = f_n(x_1, . . . , x_n), for all x_1, . . . , x_n and all n. (5)

Since nobody can lose more than the sum of the maximum possible losses from all of the accepted gambles, we require that

f_n(x_1, . . . , x_n) ≤ x_1 + · · · + x_n. (6)

Small changes in the component gambles should produce only small changes in the escrow, so we require that

f_n is continuous for every n. (7)
Finally, since we have already decided how to measure the size of a single gamble, we require

f_1(x) = x for all x ≥ 0. (8)

We set e(Y_1, . . . , Y_n) = f_n(e(Y_1), . . . , e(Y_n)) for some function f_n satisfying (2)-(8) and call e(Y_1, . . . , Y_n) an escrow for the collection of gambles. Every sequence of functions {f_n, n = 1, 2, . . .} that satisfies (2)-(8) leads to its own way of defining escrow. Such a sequence is called an escrow sequence. Each function in the sequence is an escrow function.
We can find a fairly simple form for all escrow sequences. Combining (8), (4), and (5), we see that f_n(x_1, . . . , x_n) ≥ max{x_1, . . . , x_n}. From (3), we conclude that f_n is a function of the ordered values x_(1) ≤ · · · ≤ x_(n). Combining these results with (6), we get

max{x_1, . . . , x_n} ≤ f_n(x_1, . . . , x_n) = x_(n) k_n(x_(1), . . . , x_(n)) ≤ x_(1) + · · · + x_(n) (9)

for some function k_n ≥ 1. In order to satisfy (5), we need k_n(0, x_(2), . . . , x_(n)) = k_{n−1}(x_(2), . . . , x_(n)). In order to satisfy (2), k_n must be invariant under common scale changes for all of its arguments. That is, k_n(cx_(1), . . . , cx_(n)) = k_n(x_(1), . . . , x_(n)).
Every such function can be written as

k_n(x_(1), . . . , x_(n)) = 1 + c_n(x_(1)/x_(n), . . . , x_(n−1)/x_(n)) (10)

for some function c_n. In order to satisfy (4), we must have c_n nondecreasing in each of its arguments. In order to satisfy (9), we must have 0 ≤ c_n(z_1, . . . , z_{n−1}) ≤ z_1 + · · · + z_{n−1}. In summary, every escrow sequence satisfies

f_n(x_1, . . . , x_n) = x_(n)[1 + c_n(x_(1)/x_(n), . . . , x_(n−1)/x_(n))] (11)

for some sequence c_1, c_2, . . . of continuous functions, where c_1 ≡ 0 and, for n > 1, the functions satisfy the following properties: c_n(y_1, . . . , y_{n−1}) is defined and continuous for 0 ≤ y_1 ≤ · · · ≤ y_{n−1} ≤ 1, is nondecreasing in each of its arguments, satisfies 0 ≤ c_n(y_1, . . . , y_{n−1}) ≤ y_1 + · · · + y_{n−1}, and is such that f_n in (11) is nondecreasing in x_(n). It is straightforward to show that every sequence that meets this description satisfies (2)-(8). One example is c_n(z_1, . . . , z_{n−1}) = c(z_1 + · · · + z_{n−1}) for each 0 ≤ c ≤ 1, which yields the family of escrow functions

f_{c,n}(x_1, . . . , x_n) = x_(n) + c(x_(1) + · · · + x_(n−1)). (12)

Another example is c_n(z_1, . . . , z_{n−1}) = z_{n−1} for n > 1. This one makes the total escrow equal to the sum of the two largest individual gamble escrows. Other functions are possible, but we will focus on f_{c,n} for 0 ≤ c ≤ 1. It is easy to see that the two extreme escrow functions correspond to c = 0 and c = 1: f_{0,n}(x_1, . . . , x_n) = max{x_1, . . . , x_n} and f_{1,n}(x_1, . . . , x_n) = x_1 + · · · + x_n.
We now propose to measure the incoherence of a collection of incoherent previsions based on a normalization by an escrow. For a combination Y = Σ_{i=1}^n Y_i of acceptable gambles Y_i = α_i(X_i − y_i), the guaranteed loss to the bookie is G(Y) = inf_t[−Y(t)]. So, Dutch Book can be made if there exists a combination of acceptable gambles whose guaranteed loss is positive. The rate of guaranteed loss relative to a particular escrow function f_n is

H(Y) = max{G(Y), 0} / f_n(e(Y_1), . . . , e(Y_n)).

Notice that the rate of guaranteed loss is unchanged if all α_i are multiplied by a common positive number. Also, the rate of guaranteed loss is interesting only when Dutch Book is made; otherwise the numerator is 0. The denominator f_n(e(Y_1), . . . , e(Y_n)) is 0 if and only if e(Y_i) = 0 for all i. This will occur if and only if the agent who is required to escrow cannot lose any of the individual gambles, in which case the numerator is 0 as well, and we then define the rate of guaranteed loss to be 0 (since we cannot guarantee loss). The extent of incoherence, relative to an escrow, of a collection of previsions is the supremum of H(Y) over all combinations Y of acceptable gambles. If the previsions are incoherent, then the maximum rate of guaranteed loss is positive; otherwise it is 0. There is a slightly simpler way to compute the extent of incoherence corresponding to a finite set of previsions than directly from the definition.
The supremum and infimum in Theorem 1 are taken over those α_i that have the appropriate signs.
As with all of the more lengthy proofs in this paper, the proof of Theorem 1 is in Schervish et al. (1999). Theorem 1 allows us to ignore the fact that the gamble α(X − x) might not be acceptable when x is a lower or upper prevision for X if we are proving results concerning the rate of incoherence.
Note that if a collection of gambles satisfies the escrow condition h(α_1, . . . , α_n) ≤ 1, then every subcollection also satisfies the escrow condition because of (5). Also, note that, since every escrow function f_n is between f_{0,n} and f_{1,n}, the maximum and minimum possible rates of incoherence correspond to these two escrows.
When we use the bookie's escrow with e(Y) = sup_t −Y(t) for each individual gamble Y, we call the extent of incoherence the maximum rate of guaranteed loss, since the extent of incoherence is the maximum rate at which the bookie can be forced to lose relative to the particular escrow chosen. We focus on the family of escrows f_{c,n} defined in (12). The corresponding maximum rates of guaranteed loss will be denoted q_c.
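As an illustration, the sketch below (hypothetical previsions, not an example from the paper) evaluates the rate of guaranteed loss for one particular combination of gambles against a bookie who announces a lower prevision of 0.7 and an upper prevision of 0.6 for the same event A, using the escrow family f_{c,n} of (12). The extent of incoherence q_c is the supremum of this quantity over all acceptable combinations; the code evaluates only the single combination shown.

```python
# Sketch (hypothetical previsions): rate of guaranteed loss for one particular
# combination of gambles when the bookie announces a lower prevision of 0.7 and
# an upper prevision of 0.6 for the same event A, normalized by the escrow
# family f_{c,n} of (12).

def f_c(escrows, c):
    """Escrow function f_{c,n}: the largest escrow plus c times the sum of the rest."""
    s = sorted(escrows)
    return s[-1] + c * sum(s[:-1])

def rate_of_loss(alphas, prices, c):
    """Guaranteed loss of sum_i alpha_i * (I_A - y_i), divided by the total escrow."""
    payoffs = [sum(a * (ind - y) for a, y in zip(alphas, prices)) for ind in (0, 1)]
    guaranteed_loss = max(0.0, -max(payoffs))   # positive only when Dutch Book is made
    escrows = [max(-a * (ind - y) for ind in (0, 1)) for a, y in zip(alphas, prices)]
    total = f_c(escrows, c)
    return guaranteed_loss / total if total > 0 else 0.0

# Buy at the lower prevision (alpha > 0) and sell at the upper one (alpha < 0);
# using the previsions themselves as prices is harmless here, since Theorem 1
# lets us ignore whether those boundary gambles are acceptable.
alphas, prices = [1.0, -1.0], [0.7, 0.6]
print(rate_of_loss(alphas, prices, c=1.0))   # sum escrow: 0.1 / (0.7 + 0.4) ~ 0.091
print(rate_of_loss(alphas, prices, c=0.0))   # max escrow: 0.1 / 0.7 ~ 0.143
```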

Testing Simple Hypotheses at a Fixed Level.
Lindley (1972, 14) argues that it is incoherent to test all hypotheses at the same level, such as 0.05. (See also Seidenfeld, Schervish, and Kadane 1990.) Cox (1958) gave an example of how testing all hypotheses at the same level leads to inadmissibility. In this section, we show how this incoherence and inadmissibility can be measured using the measure of incoherence q.
Consider the case of testing a simple hypothesis against a simple alternative. Let f_0 and f_1 be two possible densities for a random quantity X, and let f be the "true" density of X. Suppose that we wish to test the hypothesis H_0: f = f_0 versus the alternative H_1: f = f_1. To write this as a decision problem, let the parameter space and the action space both be {0, 1}, where action a = 0 corresponds to accepting H_0 and action a = 1 corresponds to rejecting H_0. Also, parameter i corresponds to f = f_i for i = 0, 1. Let the loss function have the form

L(i, a) = 0 if a = i, L(0, 1) = c_0, and L(1, 0) = c_1. (16)

With this loss, a Bayes rule rejects H_0 when the likelihood ratio f_1(x)/f_0(x) exceeds π_0 c_0/[(1 − π_0)c_1], where π_0 is the prior probability of H_0, so each such test can be associated with an implied prior. Of course, a classical statistician who refuses to use prior and posterior probabilities will not acknowledge the implied prior. However, incoherence will arise if two tests about the same parameter imply different priors. We illustrate this with a version of the example of Cox (1958). Since the only part of the loss function that matters is c_0/c_1, let c_1 = 1. As an example, let f_0 and f_1 be normal densities with the same variance σ² but different means, and suppose that the hypothesis is H_0: θ = 0 versus H_1: θ = 1, with c_0 = 1. (The phenomenon we illustrate here applies more generally, as shown in Theorem 2.) Suppose that either σ = 1 or σ = 0.3 will be true, but we will not know which until we observe the data. That is, the data consist of the pair (X, σ). Let Pr(σ = 1) = 0.5, so that σ is ancillary. A classical statistician who prefers level 0.05 tests whenever available might think that, after observing σ, a conditional level 0.05 test should still be preferred to a test whose conditional level given σ is something else. The most powerful conditional level 0.05 test is to reject H_0: θ = 0 if X > 1.645σ. The most powerful marginal level 0.05 test rejects H_0 if X > 0.5 + 0.9438σ² and is the Bayes rule with respect to the prior Pr(θ = 0) = 0.7199. The marginal power of the Bayes rule is 0.6227, while the marginal power of the conditional level 0.05 test is 0.6069. Since both tests have the same level, the conditional test is inadmissible.
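The numerical claims in this example are easy to check. The following sketch (assuming SciPy is available; the variable names are ours) recomputes the constant 0.9438 in the marginal level 0.05 test, its implied prior, and the two marginal powers quoted above.

```python
# Numerical check of the Cox-style example: sigma is 1 or 0.3 with probability
# 1/2 each, H0: theta = 0 versus H1: theta = 1, and c0 = c1 = 1.
from math import exp
from scipy.stats import norm
from scipy.optimize import brentq

sigmas = (1.0, 0.3)

def marginal_size(t):
    # Marginal type I error of the likelihood-ratio test: reject H0 if X > 0.5 + t * sigma^2.
    return 0.5 * sum(norm.sf((0.5 + t * s**2) / s) for s in sigmas)

t = brentq(lambda u: marginal_size(u) - 0.05, 0.0, 10.0)
print("threshold constant:", t)                       # ~0.9438

implied_prior = exp(t) / (1.0 + exp(t))               # t = log(pi0 / pi1) when c0 = c1 = 1
print("implied prior Pr(theta = 0):", implied_prior)  # ~0.7199

power_bayes = 0.5 * sum(norm.sf((0.5 + t * s**2 - 1.0) / s) for s in sigmas)
power_cond = 0.5 * sum(norm.sf((1.645 * s - 1.0) / s) for s in sigmas)
print("marginal power of the Bayes rule:", power_bayes)            # ~0.6227
print("marginal power of the conditional 0.05 test:", power_cond)  # ~0.6069
```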
To see how this inadmissibility translates into incoherence, we interpret the preference of one test d_1 over another d_2 as a preference for suffering a loss equal to the risk function of d_1 rather than a loss equal to the risk function of d_2, so that the gamble whose payoff is the difference between the risk function of d_2 and the risk function of d_1 is an acceptable gamble. In our example, let α_d(σ) and β_d(σ) denote the size and power of test d conditional on σ. Also, let β_cl(σ) denote the power of the most powerful conditional level 0.05 test. Then, for each σ, the classical statistician prefers the level 0.05 test to every other test. So, for each σ and all d that are not the most powerful level 0.05 test, the following gamble is acceptable, even favorable: the difference between the risk function of d and the risk function of the level 0.05 test, written in the form a(I_A − b) with A = {θ = 0}. (17) In other words, b is an upper or lower prevision for A depending on whether a < 0 or a > 0. We can make use of the construction in (17) to obtain a general result. Theorem 2 has a technical condition (concerning risk sets) that is known to be satisfied for problems of testing simple hypotheses against simple alternatives using fixed sample size and sequential tests. For more detail on risk sets, see Sections 3.2.4 and 4.3.1 of Schervish (1995).
Theorem 2. Let θ be a parameter, and let the parameter space consist of the two points {0, 1}. Consider two decision problems D_0 and D_1, both with this same parameter space and with nonnegative loss functions L_0 and L_1. Let the data in problem D_i be denoted X_i. Suppose that the risk sets for the two decision problems are closed from below. Suppose that an agent prefers the admissible decision rule d_i to all others in problem D_i for i = 0, 1. For each decision rule w in problem D_i, let a_i(w) and b_i(w) be the coefficient and price obtained by writing the difference between the risk functions of w and d_i as a gamble a_i(w)(I_A − b_i(w)) on the event A = {θ = 0}, as in the construction of (17).
If w is admissible in problem D_i and is not equivalent to d_i, then the gamble a_i(w)(I_A − b_i(w)) is acceptable to the agent. If d_0 and d_1 are not Bayes rules with respect to a common prior, then there exist real numbers δ_0 and δ_1 and decision rules w_0 (in problem D_0) and w_1 (in problem D_1) such that the two gambles δ_0 a_0(w_0)(I_A − b_0(w_0)) and δ_1 a_1(w_1)(I_A − b_1(w_1)) are both acceptable, but δ_0 a_0(w_0)(I_A − b_0(w_0)) + δ_1 a_1(w_1)(I_A − b_1(w_1)) < 0. Also, the rate of incoherence q_c can be expressed as a supremum over pairs of implied priors, where the supremum is over all p_0 > p_1 such that either d_i is a Bayes rule with respect to prior p_i for i = 0, 1, or d_i is a Bayes rule with respect to prior p_{1−i} for i = 0, 1.
As an example of Theorem 2, return to the test of H_0: θ = 0 versus H_1: θ = 1, where X ~ N(θ, σ²) with σ being one of two known values. Figure 1 plots the rate of incoherence for pairs of level 0.05 tests with σ_i = 2/√n_i for i = 0, 1, as a function of n_1, with one curve for each of several values of n_0. Each curve equals 0 at n_1 = n_0, since there is no incoherence in that case. Some of the curves come close to 0 in another location as well. For example, the n_0 = 27 curve comes close to 0 near n_1 = 2, and the n_0 = 2 curve comes close to 0 near n_1 = 27. The reason is that the implied priors corresponding to σ = 2/√2 and σ = 2/√27 are nearly the same (0.7137 and 0.7107, respectively), making these two level 0.05 tests nearly coherent. Indeed, the entire curves corresponding to n_0 = 2 and n_0 = 27 are nearly identical for this same reason. Another interesting feature of Figure 1 is that all of the curves are rising as n_1 → ∞, but not to the same level. As n_1 → ∞, the implied prior on A = {θ = 0} converges to 0. But if n_0 is large also, then the implied prior corresponding to σ = 2/√n_0 is also close to 0. For example, with n_0 = 100, the implied prior is 7.3 × 10⁻⁴. There is not much room for incoherence between 0 and 7.3 × 10⁻⁴, so the curve corresponding to n_0 = 100 will not rise very high. On the other hand, with n_0 = 11, the implied prior is 0.1691, leaving lots of room for incoherence. In fact, since 0.1691 is the largest possible implied prior in this example, all of the other curves have local maxima near n_1 = 11, and the n_0 = 11 curve rises higher than all the others as n_1 increases. Since the limiting implied prior is 0 as n_1 → ∞, the height to which the n_0 curve rises as n_1 increases is determined by the implied prior corresponding to n_0.
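A short sketch of where such implied priors come from: matching the fixed-level rejection region X > 1.645σ with the Bayes threshold X > 0.5 + σ² log(π_0 c_0/(1 − π_0)) gives the implied prior as a function of n when σ = 2/√n. We treat the loss ratio c_0 as a parameter; with c_0 = 19 (the loss ratio used below for Figure 2) the formula gives roughly 0.169 at n = 11 and 7.3 × 10⁻⁴ at n = 100, while c_0 = 1 gives 0.7137 and 0.7107 at n = 2 and n = 27.

```python
# Implied prior of the conditional level 0.05 test (reject when X > 1.645 * sigma)
# for sigma = 2 / sqrt(n), with the loss ratio c0 treated as a parameter.
from math import exp, sqrt

def implied_prior_level_test(n, c0, z=1.645):
    sigma = 2.0 / sqrt(n)
    odds = exp((z * sigma - 0.5) / sigma ** 2) / c0   # pi0 / (1 - pi0)
    return odds / (1.0 + odds)

for n in (2, 11, 27, 100):
    print(n, implied_prior_level_test(n, c0=19.0), implied_prior_level_test(n, c0=1.0))
# With c0 = 19 the implied prior peaks near n = 11 and falls toward 0 as n grows,
# which is the behavior of the implied priors described for Figure 1.
```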
The curve corresponding to n_0 = 4 illustrates the original example of Cox (1958), in which the alternative value of θ equals the larger of the two standard deviations. Lehmann (1958) offered a rule of thumb for choosing tests based on both their size and their power. One chooses a positive number k (such as c_0 in the loss function (16)) and then chooses the test so that the probability of type II error equals k times the probability of type I error. In our case of testing one normal distribution against another one with the same variance σ², this procedure will produce the minimax rule with loss (16) if k = c_0. When k = 1, it is easy to check that Lehmann's suggestion is the Bayes rule with respect to the prior with Pr(θ = 0) = 1/(1 + c_0) for all σ. In this special case q_c = 0 for all c. However, when k ≠ 1, each σ leads to a Bayes rule with respect to a different implied prior. Assuming that the test will be to reject H_0 if X > y, one must solve the equation

Φ((y − 1)/σ) = k[1 − Φ(y/σ)]. (23)

The implied prior, assuming, still, that the loss is (16), is then

p_L(σ) = exp{(y − 1/2)/σ²} / (c_0 + exp{(y − 1/2)/σ²}). (24)

When k = 1, y = 1/2 solves (23). Plugging this into (24) yields p_L(σ) = 1/(1 + c_0) for all σ, as we noted earlier. Two other limiting cases are of interest. If σ → ∞, then y/σ must converge to Φ⁻¹(k/[1 + k]) in order for (23) to hold. This would make the type I error probability 1/(1 + k), and the limit of p_L(σ) would be 1/(1 + c_0). It is not difficult to see that the type I error probability is highest for σ = ∞, so it must be less than 1/(1 + k) for all finite σ. If σ → 0, then (y − 1/2)/σ² must converge to log(k) in order for (23) to hold. In this case, p_L(σ) converges to k/(k + c_0). For the case of k = c_0 = 19, Figure 2 shows the value of q_1 with σ_i = 2/√n_i for i = 0, 1 for various values of n_0 and n_1, in the same spirit (and on the same vertical scale) as Figure 1. The curves in Figure 2 are higher for large n_1 than the corresponding curves in Figure 1. This means that, when c_0 = 19, Lehmann's procedure with k = 19 is more incoherent (as measured by q_1) for large values of n_1 than testing at level 0.05. Lehmann (1958) made his suggestion for testing not to be more coherent than fixed level testing, but rather to avoid a different problem exhibited by fixed level testing. Testing all hypotheses at the same level, regardless of how much data one has, allows the probability of type II error to become much smaller than the probability of type I error as the sample size increases. This amounts to behaving as if the null hypothesis were not very important compared to the alternative. Indeed, the fact that the implied prior goes to zero as the sample size increases reflects this fact. Lehmann's procedure forces the type I and type II error probabilities to decrease together as the sample size increases, thereby making sure that both the null and the alternative remain important as the sample size increases. In fact, the implied prior approaches a value strictly between 0 and 1 as the sample size increases. What makes Lehmann's procedure less coherent than fixed level testing is the rate at which the implied prior approaches its limit as the sample size increases. For Lehmann's procedure, the implied prior differs from its limit by approximately a constant divided by the sample size, whereas the implied prior for a fixed level test differs from 0 by approximately exp(−cn) for some constant c.
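To make the comparison concrete, the sketch below (assuming SciPy; we take k = c_0 = 19 as for Figure 2) solves (23) numerically for the cutoff y and evaluates the implied prior (24) for several sample sizes, again with σ = 2/√n.

```python
# Lehmann's rule for the same normal testing problem: choose the cutoff y
# (reject H0 when X > y) so that the type II error probability equals k times
# the type I error probability, as in (23), then read off the implied prior (24).
from math import exp, sqrt
from scipy.stats import norm
from scipy.optimize import brentq

def lehmann_cutoff(sigma, k):
    # Solve Phi((y - 1) / sigma) = k * (1 - Phi(y / sigma)) for y.
    g = lambda y: norm.cdf((y - 1.0) / sigma) - k * norm.sf(y / sigma)
    return brentq(g, -10.0, 10.0)

def lehmann_implied_prior(sigma, k, c0):
    y = lehmann_cutoff(sigma, k)
    odds_times_c0 = exp((y - 0.5) / sigma ** 2)   # = pL * c0 / (1 - pL), as in (24)
    return odds_times_c0 / (c0 + odds_times_c0)

for n in (4, 25, 100, 400):
    sigma = 2.0 / sqrt(n)
    print(n, round(lehmann_implied_prior(sigma, k=19.0, c0=19.0), 4))
# The implied prior rises toward k / (k + c0) = 1/2 as n grows, at roughly a 1/n
# rate, in contrast with the exponentially fast decay to 0 for fixed-level tests.
```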
In this simple testing problem, Lehmann's procedure with k = 1 leads to coherent choices of admissible tests for all sample sizes. Lehmann's procedure with k = 1 here corresponds to an implied prior for the null hypothesis of 1/(1 + c_0) = 0.05 when c_0 = 19, and an implied prior of 1/2 when the losses are equal (c_0 = 1). As we noted, Lehmann's rule gives the minimax risk solution for k = c_0. However, as Lindley (1972, 14) points out, it is not guaranteed that minimax risk solutions from different families of admissible tests correspond to the same Bayes model. In our testing problem, this is what happens with Lehmann's rule when k ≠ 1, which explains why it suffers a positive degree of incoherence. An alternative procedure to Lehmann's, which also lets the type I and type II error probabilities decrease as the sample size increases but which is coherent, is to minimize a positive linear combination of those error probabilities.

Summary.
In this article we introduce a family of indices of incoherence of previsions, based on the gambling framework of de Finetti (1974). When a bookie is incoherent, a gambler can choose a collection of gambles acceptable to the bookie that results in a sure loss to the bookie (and a sure gain to the gambler). That is, the gambler can make a Dutch Book against the bookie. Our index of incoherence in the bookie's previsions is the maximum guaranteed rate of loss to the bookie that the gambler can create through his/her choice of coefficients, relative to the bookie's escrow. Throughout, we mean by "escrow" an amount needed to cover the bookie's possible losses, as developed in Section 3.
In Section 4, we apply this idea to identify the degrees of incoherence in two policies for testing simple hypotheses. First, we consider testing at a level that is fixed regardless of the sample size, as in the example of Cox (1958). We show, through a trade of risks, how the gambler can make a "Dutch Book" against a statistician who follows such a testing policy. That is, our index of incoherence coincides with the extent to which the fixed alpha level tests can be dominated by combinations of other tests.
When tests are based on small sample sizes, the degree of incoherence in a fixed-level testing policy behaves in a complicated way, as illustrated in Figure 1. However, the degree of incoherence between two such tests decreases as the sample sizes for these tests increase. Nonetheless, we do not find this fact sufficient to justify the policy, even with large samples, because the statistician's near-to-coherent behavior then requires treating one of the hypotheses as practically impossible. That is, the Bayes model that the fixed level testing policy approaches with increasing sample size assigns probability 0 to the null hypothesis. Why bother to collect data if that is your behavioral policy? Obviously, mere coherence of a policy is not sufficient to make it also a reasonable one!
A second testing policy that we examine is due to Lehmann (1958), who proposes admissible tests based on a fixed ratio of the two risks involved, i.e., with a fixed ratio of type II to type I error probabilities, denoted by his parameter k. Except for the case in which that ratio is 1, this too proves to be an incoherent policy for testing two simple hypotheses. Figure 2 shows the plot of the degree of incoherence for Lehmann's rule (k = 19) applied to tests with differing sample sizes. Surprisingly, even in a comparison of two tests based on large sample sizes, Lehmann's policy is sometimes more incoherent by our standards than the fixed .05 level policy for the same two sample sizes. Thus, in order to gain the benefits of approximate coherence, it is neither necessary nor sufficient merely to shrink the level of tests with increasing sample sizes, as happens with Lehmann's rule. In tests based on increasing sample sizes, Lehmann's policy (with k fixed) is approximately coherent with a Bayes model that assigns a fixed prior probability to each of the two hypotheses; when k = c_0, the implied priors converge to 1/2. Of course, the choice of k = 1 in Lehmann's rule assures exact coherence at all sample sizes, with implied prior 1/(1 + c_0). Our work on degrees of incoherence, illustrated here with an analysis of testing simple statistical hypotheses, indicates the importance of having finer distinctions than are provided by de Finetti's dichotomy between coherent and incoherent methods. We see the interesting work of Nau (1989, 1992) as providing useful algorithms for computing the rate of guaranteed loss with the escrow used in this paper. (See Nau 1989, 389.)
In conclusion, we believe that approaches like Nau's and those we have developed here and in Kadane (1997, 1999) permit a more subtle treatment of such longstanding issues as the debate over coherence versus incoherence of some classical statistical practices. Knowing that a policy is incoherent is not the whole problem. Rather, we need to know how far from coherent a particular policy is once we learn that it is incoherent, and how it compares with other incoherent methods that have been adopted in practice. We hope to continue this line of investigation in our future work.