Ockham Efficiency Theorem for Stochastic Empirical Methods

Abstract
Ockham's razor is the principle that, all other things being equal, scientists ought to prefer simpler theories. In recent years, philosophers have argued that simpler theories make better predictions, possess theoretical virtues like explanatory power, and have other pragmatic virtues like computational tractability. However, such arguments fail to explain how and why a preference for simplicity can help one find true theories in scientific inquiry, unless one already assumes that the truth is simple. One new solution to that problem is the Ockham efficiency theorem (Kelly 2002; Kelly in Minds Mach 14:485–505, 2004; Philos Sci 74:561–573, 2007a, b; Theor Comp Sci 383:270–289, 2007c, d; Kelly and Glymour 2004), which states that scientists who heed Ockham's razor retract their opinions less often and sooner than do their non-Ockham competitors. The theorem neglects, however, to consider competitors following random ("mixed") strategies, and in many applications random strategies are known to achieve better worst-case loss than deterministic strategies. In this paper, we describe two ways to extend the result to a very general class of random, empirical strategies. The first extension concerns expected retractions, retraction times, and errors, and the second concerns retractions in chance, times of retractions in chance, and chances of errors.


Introduction
When confronted by a multitude of competing theories, all of which are compatible with existing evidence, scientists prefer theories that minimize free parameters, causal factors, independent hypotheses, or theoretical entities. Today, that bias toward simpler theories, known popularly as "Ockham's razor", is explicitly built into statistical software packages that have become everyday tools for working scientists. But how does Ockham's razor help one find true theories any better than competing strategies could? 1 Some philosophers have argued that simpler theories are more virtuous than complex theories. Simpler theories, they claim, are more explanatory, more easily falsified or tested, more unified, or more syntactically concise. 2 However, the scientific theory that truly describes the world might, for all we know in advance, involve multiple, fundamental constants or independent postulates; it might be difficult to test and/or falsify, and it might be "dappled" or lacking in underlying unity (Cartwright 1999). Since the virtuousness of scientific truth is an empirical question, simplicity should be the conclusion of scientific inquiry, rather than its underlying premise (van Fraassen 1980).
Recently, several philosophers have harnessed mathematical theorems from frequentist statistics and machine learning to argue that simpler theories make more accurate predictions. 3 There are three potential shortcomings with such arguments. First, simpler theories can improve predictive accuracy even when it is known that the truth is complex (Vapnik 1998). Thus, one is led to an anti-realist stance according to which the theories recommended by Ockham's razor should be used as predictive instruments rather than believed as true explanations (Hitchcock and Sober 2004). Second, the argument depends essentially on randomness in the underlying observations (Forster and Sober 1994), whereas Ockham's razor seems no less compelling in cases in which the data are discrete and deterministic. Third, the assumed notion of predictive accuracy does not extend to predictions of the effects of novel interventions on the system under study. For example, a regression equation may accurately predict cancer rates from the prevalence of ash-trays but might be extremely inaccurate at predicting the impact on cancer rates of a government ban on ash-trays. 4 Scientific realists are unlikely to agree that simplicity has nothing to do with finding true explanations, and even the most ardent instrumentalist would be disappointed to learn that Ockham's razor is irrelevant to vital questions of policy. Hence, the question remains, "How can a systematic preference for simpler theories help one find potentially complex, true theories?" Bayesians and confirmation theorists have argued that simpler theories merit stronger belief in light of simple data than do complex theories. Such arguments, however, assume either explicitly or implicitly that simpler possibilities are more probable a priori. 5 That argument is circular: a prior bias toward complex possibilities yields the opposite result.
So it remains to explain, without begging the question, why a prior bias toward simplicity is better for finding true theories than is a prior bias toward complexity.
One potential connection between Ockham's razor and truth is that a systematic bias toward simple theories allows for convergence to the truth in the long run even if the truth is not simple (Sklar 1977; Friedman 1983; Rosenkrantz 1983). In particular, Bayesians argue that prior biases "wash out" in the limit (Savage 1972), so that one's degree of belief in a theory converges to the theory's truth value as the data accumulate. But prior biases toward complex theories also allow for eventual convergence to the truth (Reichenbach 1938; Hempel 1966; Salmon 1966), for one can dogmatically assert some complex theory until a specified time t_0, and then revise back to a simple theory after t_0 if the anticipated complexities have not yet been vindicated. One might even find the truth immediately that way, if the truth happens to be complex. Hence, mere convergence to the truth does not single out simplicity as the best prior bias in the short run. So the elusive, intuitive connection between simplicity and theoretical truth is not explained by standard appeals to theoretical virtue, predictive accuracy, confirmation, or convergence in the limit. It is, nonetheless, possible to explain, without circularity, how Ockham's razor finds true theories better than competing methods can. The Ockham efficiency theorems (Kelly 2002, 2004, 2007a–e, 2010; Kelly and Glymour 2004) imply that scientists who systematically favor simpler hypotheses converge to the truth in the long run more efficiently than can scientists with alternative biases, where efficiency is a matter of minimizing, in the worst case, such epistemic losses as the total number of errors committed prior to convergence, the total number of retractions performed prior to convergence, and the times at which the retractions occur.
The efficiency theorems are sufficiently general to connect Ockham's razor with truth in paradigmatic scientific problems such as curve-fitting, causal inference, and discovering conservation laws in particle physics.
One gap in the efficiency argument for Ockham's razor is that worst-case loss minimization is demonstrated only with respect to deterministic scientific methods. Among game theorists, it is a familiar fact that random strategies can achieve lower bounds on worst-case loss than deterministic strategies can, as in the game "rock-paper-scissors", in which playing each of the three actions with equal probability achieves better worst-case loss than playing any single option deterministically can. Thus, an important question is: "Do scientists who employ Ockham strategies find true theories more efficiently than do arbitrary, randomized scientific strategies?" In this paper, we present a new stochastic Ockham efficiency theorem that answers the question in the affirmative. The theorem implies that scientists who deterministically favor simpler hypotheses fare no worse, in terms of the losses considered, than those who employ randomizing devices to select theories from data. The argument is carried out in two distinct ways, for expected losses and for losses in chance. For example, expected retractions are the expected number of times an answer is dropped prior to convergence, whereas retractions in chance are the total drops in probability of producing some answer or another. A larger ambition for this project is to justify Ockham's razor as the optimal means for inferring true statistical theories, such as acyclic causal networks. It is expected that the techniques developed here will serve as a bridge to any such theory, especially those pertaining to losses in chance.

Empirical Questions
Scientific theory choice can depend crucially upon subtle or arcane effects that can be impossible to detect without sensitive instrumentation, large numbers of observations, or sufficient experimental ingenuity and perseverance. For example, in curve fitting with inexact data 6 (Kelly and Glymour 2004; Kelly 2007a–e, 2008), a quadratic or second-order effect occurs when the data rule out linear laws, a cubic or third-order effect occurs when the data rule out quadratic laws, etc. (figure 62). Such effects are subtle in the above sense because, for example, a very flat parabola may generate data that appear linear even in fairly large samples. For a second example, when explaining particle reactions by means of conservation laws, an effect corresponds to a reaction violating some conservation law (Schulte 2001). When explaining patterns of correlation with a linear causal network, an effect corresponds to the discovery of new partial correlations that imply a new causal connection in the network (Spirtes et al. 2000; Schulte, Luo, and Greiner 2007). To model such cases, we assume that each potential theory is uniquely determined by the empirical effects it implies, and we assume that empirical effects are phenomena that may take arbitrarily long to appear but that, once discovered, never disappear from scientific memory. Formally, let E be a non-empty, countable (finite or countably infinite) set of empirical effects. 7 Let K be the collection of possible effect sets, any one of which might be the set of all effects that will ever be observed. We assume in this paper that each effect set in K is finite. The true effect set is assumed to determine the correctness (truth or empirical adequacy) of a unique theory, but one theory may be correct of several, distinct effect sets. Therefore, let T, the set of possible theories, be a partition of K. Say that a theory T is correct of effect set S in K just in case S is an element of T.
If S is in K, let T_S denote the partition cell of T that contains S, so that T_S represents the unique theory in T that is correct if S is the set of effects that will ever be observed. Say that Q = (K, T) is an empirical question, in which K is the empirical presupposition and T is the set of informative answers. Call K the uninformative answer to Q, as it represents the assertion that some effect set will be observed. Let A_Q be the set of all answers to Q, informative or uninformative.
An empirical world w is an infinite sequence of finite effect sets, so that the nth coordinate of w is the set of effects observed or detected at stage n of inquiry. Let S_w denote the union of all the effect sets occurring in w. An empirical world w is said to be compatible with K just in case S_w is a member of K. Let W_K be the set of all empirical worlds compatible with K. If w is in W_K, then let T_w = T_{S_w}, which is the unique theory correct in w. Let w|n denote the finite initial segment of w received by stage n of inquiry. Let F_K denote the set of all finite initial segments of worlds in W_K. If e is in F_K, say that e is a finite input sequence and let e− denote the result of deleting the last entry in e when e is non-empty. The set of effects presented along e is denoted by S_e, and let K_e denote the restriction of K to finite sets of effects that include S_e. Similarly, let T_e be the set of theories T in T such that there is some S in K_e such that T_S = T. The restriction Q_e of question Q to finite input sequence e is defined as (K_e, T_e).
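The definitions above can be made concrete with a small computational sketch. The following toy model (all names are illustrative, not from the original text) encodes a finite fragment of a polynomial-style question: the effects, the collection K of possible effect sets, a theory partition T with singleton cells, and the restrictions S_e and K_e induced by a finite input sequence.

```python
# Toy model of an empirical question Q = (K, T); names are illustrative.
effects = ["e1", "e2", "e3"]

# K: possible effect sets -- here, the first n effects for each n.
K = [frozenset(effects[:n]) for n in range(len(effects) + 1)]

# T: a partition of K; in this toy case each theory is correct of
# exactly one effect set, so each cell is a singleton.
T = [frozenset([S]) for S in K]

def theory_of(S):
    """T_S: the unique partition cell of T containing effect set S."""
    return next(cell for cell in T if S in cell)

def S_of(e):
    """S_e: the union of the effect sets presented along input sequence e."""
    out = set()
    for stage in e:
        out |= stage
    return frozenset(out)

def K_of(e):
    """K_e: the effect sets in K that include everything seen along e."""
    return [S for S in K if S >= S_of(e)]

e = [frozenset(), frozenset({"e1"})]   # effect e1 appears at stage 2
print(S_of(e))                          # frozenset({'e1'})
print(len(K_of(e)))                     # 3 effect sets remain possible
```

Note that once "e1" has been presented, the empty effect set is no longer in K_e, mirroring the assumption that effects never disappear from scientific memory.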

Deterministic Methodology
A deterministic method or pure strategy for pursuing the truth in problem Q is a function M that maps each finite input sequence in F_K to some answer in A_Q. Method M converges to the truth in Q (or converges in Q for short) if and only if lim_{i→∞} M(w|i) = T_w, for each world w compatible with K. Our focus is on how best to find the truth, so we consider only deterministic methods that converge to the truth.
Methodological principles impose short-run restrictions on methods. For example, say that M is logically consistent in Q if and only if M never produces an answer refuted by experience, i.e., M(e) is in A_{Q_e}, for all e in F_K.
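A pure strategy and the logical-consistency requirement can be sketched as follows (a toy setup with illustrative names, for the special case in which each theory is correct of exactly one effect set, so theories can be identified with effect sets).

```python
# A toy pure strategy and a logical-consistency check; names illustrative.

def S_of(e):
    """S_e: union of the effect sets presented along e."""
    out = set()
    for stage in e:
        out |= stage
    return frozenset(out)

def M(e):
    """A convergent method: conjecture the theory of the effects seen so far."""
    return S_of(e)

def consistent_at(M, e, K):
    """M(e) must not be refuted: some effect set in K_e must make it correct."""
    K_e = [S for S in K if S >= S_of(e)]
    return M(e) in K_e

K = [frozenset(), frozenset({"e1"}), frozenset({"e1", "e2"})]
e = [frozenset({"e1"})]
print(consistent_at(M, e, K))   # True: {e1} is still a possible effect set
```

This M converges in any world whose total effect set lies in K, since its conjecture stabilizes as soon as the last effect has appeared.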
The methodological principle of main concern in this paper is Ockham's razor. Consideration of the polynomial degree example suggests that more complex theories are theories that predict more relevant effects, where an effect is relevant only if it changes the correct answer to Q. To capture this intuition, define a path in K to be a nested, increasing sequence of effect sets in K. A path (S_0, . . . , S_n) is skeptical if and only if T_{S_i} is distinct from T_{S_{i+1}}, for each i less than n. Each step along a skeptical path poses the classical problem of induction to the scientist, since effects in the next effect set could be revealed at any time in the future.
Define the empirical complexity c_{Q,e}(S) of effect set S in K to be the result of subtracting 1 from the length of the longest skeptical path to S in K_e (we subtract 1 so that the complexity of the simplest effect sets in K is zero). Henceforth, the subscript Q will be dropped to reduce clutter when the question is clear from context. The complexity c_e(T) of theory T in T is defined to be the least empirical complexity c_e(S) such that S is in T. For example, it seems that the theory "either linear or cubic" is simpler, in light of linear data, than the hypothesis "quadratic", and that the theory "quadratic" is simpler in light of quadratic data than "linear or cubic". The complexity c_e(w) of world w is just c_e(S_w). The nth empirical complexity cell C_e(n) in the empirical complexity partition of W_K is defined to be the set of all worlds w in W_K such that c_e(w) = n.
Answer A is Ockham in K at e if and only if A = K or A is the unique theory T such that c e (T ) = 0. Method M satisfies Ockham's razor in K at e if and only if M(e) is Ockham at e. Note that Ockham's razor entails logical consistency and does not condone choices between equally simple theories. A companion principle, called stalwartness, is satisfied at e if and only if M(e) = M(e − ) when M(e − ) is Ockham at e. Ockham's razor and stalwartness impose a plausible, diachronic pattern on inquiry. Together, they ensure that theories are visited in order of ascending complexity, and each time a theory is dropped, there may be a long run of uninformative answers until a new, uniquely simplest theory emerges and the method becomes confident enough in that theory to stop suspending judgment.
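Empirical complexity and the Ockham answer can be computed directly in the toy setting where theories are identified with effect sets (a sketch with illustrative names; the recursion realizes the longest-skeptical-path definition above).

```python
# Sketch: empirical complexity and the Ockham answer; names illustrative.

def longest_skeptical_path(K, theory_of, S):
    """Length of the longest nested path in K ending at S whose successive
    theories all differ."""
    best = 1
    for S0 in K:
        if S0 < S and theory_of(S0) != theory_of(S):
            best = max(best, 1 + longest_skeptical_path(K, theory_of, S0))
    return best

def complexity(K, theory_of, S):
    """c(S): longest skeptical path length minus 1."""
    return longest_skeptical_path(K, theory_of, S) - 1

def ockham_answer(K, theory_of):
    """The unique simplest theory if there is one; otherwise the
    uninformative answer, written "K" here."""
    simplest = {theory_of(S) for S in K if complexity(K, theory_of, S) == 0}
    return next(iter(simplest)) if len(simplest) == 1 else "K"

K = [frozenset(), frozenset({"e1"}), frozenset({"e1", "e2"})]
tid = lambda S: S                  # each theory correct of a unique effect set
print(complexity(K, tid, frozenset({"e1", "e2"})))   # 2
print(ockham_answer(K, tid))                         # frozenset(): simplest
```

On the chain of effect sets above, the complexities come out as 0, 1, 2, so the empty effect set's theory is the unique Ockham answer before any effect appears.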
Say that a skeptical path in Q is short if and only if, first, it is not a proper sub-sequence of any skeptical path in Q and second, there exists at least one longer skeptical path in Q. Then Q has no short skeptical paths if and only if for each e in F K , there exists no short skeptical path in Q e . Commonly satisfied sufficient conditions for non-existence of short skeptical paths are (i) that all skeptical paths in Q are extendable and (ii) that (K, ⊂) is a ranked lattice and each theory in T implies a unique effect set. The problem of finding polynomial laws of unbounded degree and the problem of finding the true causal network over an arbitrarily large number of variables both satisfy condition (i). The problem of finding polynomial laws and the problem of finding the true causal network over a fixed, finite set of variables both satisfy condition (ii) (Kelly and Mayo-Wilson 2010b).

Deterministic Inquiry
We consider only methods that converge to the truth, but justification requires more than that-a justified method should pursue the truth as directly as possible. Directness is a matter of reversing course no more than necessary. A fighter jet may have to zig-zag to pursue its quarry, but needless course reversals during the chase (e.g., performance of acrobatic loops) would likely invite disciplinary action. Similarly, empirical science may have to retract its earlier conclusions as a necessary consequence of seeking true theories, in the sense that a theory chosen later may fail to logically entail the theory chosen previously (Kuhn 1970, Gärdenfors 1988), but needless or gratuitous reversals en route to the truth should be avoided. We sometimes hear the view that minimizing retractions is a merely pragmatic rather than a properly epistemic consideration. We disagree. Epistemic justification is grounded primarily in a method's connection with the truth. Methods that needlessly reverse course or that chase their own tails have a weaker connection with the truth than do methods guaranteed to follow the most direct pursuit curve to the truth.
Let M be a method and let w be a world compatible with K (or some finite initial segment of one). Let ρ(M, w, i) be 1 if M retracts at stage i in w and 0 otherwise, and let the total retraction loss in world w be ρ(M, w) = Σ_{i=0}^∞ ρ(M, w, i). If e is a finite input sequence, define the preference order M ≤^ρ_{e,n} M′ among convergent methods to hold if and only if for each world w in complexity cell C_e(n), there exists world w′ in C_e(n) such that ρ(M, w) ≤ ρ(M′, w′). That amounts to saying that M does as well as M′ in terms of retractions, in the worst case, over worlds of complexity n that extend e. Now define: M <^ρ_{e,n} M′ iff M ≤^ρ_{e,n} M′ and not M′ ≤^ρ_{e,n} M; M ≤^ρ_e M′ iff M ≤^ρ_{e,n} M′, for each n; M ≺^ρ_e M′ iff M <^ρ_{e,n} M′, for each n such that C_e(n) is non-empty. Consider the comparison of M with alternative methods one might adopt when the last entry of finite input sequence e has just been received (and no theory has yet been chosen in response thereto). There is no point comparing one's method M in light of e with methods that did something different from M in the past along e, since the past cannot be changed. Accordingly, say that M is efficient in terms of retractions given e if and only if M is convergent and, for each convergent competitor M′ that produces the same outputs as M along e−, the relation M ≤^ρ_e M′ holds; and say that M is beaten in terms of retractions given e if and only if some such competitor M′ satisfies M′ ≺^ρ_e M. The concepts of efficiency and being beaten are relative to e. When such a concept holds for every e in F_K, say that it holds always, and when the concept holds at each e′ in F_K that extends e, say that it holds from e onward.
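The retraction loss and its worst-case comparison over a complexity cell can be sketched as follows (a discrete toy in which any change of answer counts as a retraction; names are illustrative).

```python
# Sketch of the retraction loss rho for finite, converged output sequences.

def retractions(outputs):
    """rho(M, w): stages at which the previous answer is dropped."""
    return sum(1 for a, b in zip(outputs, outputs[1:]) if a != b)

def worst_case(outputs_over_cell):
    """Worst-case retraction bound over the worlds of one complexity cell."""
    return max(retractions(o) for o in outputs_over_cell)

ockham_cell   = [["T0", "T0", "T1", "T1"]]   # one forced retraction
violator_cell = [["T1", "T0", "T1", "T1"]]   # forced back down, then up again
print(worst_case(ockham_cell), worst_case(violator_cell))   # 1 2
```

The comparison M ≤^ρ_{e,n} M′ then amounts to comparing such worst-case bounds cell by cell rather than world by world.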

Deterministic Ockham Efficiency Theorems
A stalwart, Ockham strategy M is guaranteed to converge to the truth as long as M does not return the uninformative answer K for eternity. But other strategies also converge to the truth, so it remains to explain why one should follow Ockham's razor now. The Ockham efficiency theorems answer that more difficult question.
Theorem 1 (deterministic Ockham efficiency theorem) Let the loss be retractions. Assume that question Q = (K, T) has no short skeptical paths, that each theory in T is correct for a unique effect set, and that method M converges to the truth and is logically consistent. Then the following are equivalent: 1. method M is always Ockham and stalwart; 2. method M is always efficient; 3. method M is always unbeaten.
Proof: Consequence of theorem 4 below.
The above theorem asserts that Ockham's razor and stalwartness are not merely sufficient for efficiency; they are both necessary. Furthermore, any method that is ever inefficient is also beaten at some time. Thus, convergent methods are cleanly partitioned into two classes: those that are efficient, Ockham, and stalwart, and those that are either not Ockham or not stalwart and are, therefore, beaten.
The main idea behind the proof is that nature is in a position to force an arbitrary, convergent method to produce the successive theories (T_{S_0}, . . . , T_{S_n}), with arbitrary time delays between the successive retractions, if there exists a skeptical path (S_0, . . . , S_n) in Q.
Lemma 1 (forcing deterministic changes of opinion) Let e be a finite input sequence of length l, and suppose that M converges to the truth in Q_e. Let (S_0, . . . , S_n) be a skeptical path in Q_e such that c_e(S_n) = n, and let natural number m be arbitrarily large. Then there exists world w in C_e(n) and stages of inquiry l = s_0 < . . . < s_{n+1} such that for each i from 0 to n, stage s_{i+1} occurs more than m stages after s_i and M(w|j) = T_{S_i}, at each stage j such that s_{i+1} − m ≤ j ≤ s_{i+1}.
Proof: To construct w, set e_0 = e and s_0 = l. For each i from 0 to n, do the following. Extend e_i with world w_i such that S_{w_i} = S_i. Since M converges to the truth, there exists a stage s such that for each stage j ≥ s, M(w_i|j) = T_{S_i}. Let s′ be the least such s. Let s_{i+1} = max(s′, s_i) + m. Set e_{i+1} = w_i|s_{i+1}. The desired world is w_n, which is in C_e(n), since S_{w_n} = S_n.
Any non-circular argument for the unique truth-conduciveness of Ockham's razor must address the awkward question of how one does worse at finding the truth by choosing a complex theory even if that theory happens to be true. The Ockham efficiency argument resolves the puzzle like this. Suppose that convergent M violates Ockham's razor at e by producing complex theory T_{S_n} of complexity n. Then there exists a skeptical path (S_0, . . . , S_n) in Q_e. Nature is then in a position to force M back to T_{S_0} and then up through T_{S_1}, . . . , T_{S_n}, by the retraction-forcing lemma, for a total of n + 1 retractions. A stalwart, Ockham method, on the other hand, would have incurred only n retractions by choosing T_{S_0} through T_{S_n} in ascending order. Therefore, the Ockham violator is beaten by each convergent, stalwart Ockham competitor (figure 62.b). Incidentally, the Ockham violator also traverses a needless, epistemic loop T_n, T_0, . . . , T_n, an embarrassment that cannot befall an Ockham method. A similar beating argument can be given for stalwartness. Non-stalwart methods are beaten, since they start out with an avoidable, extra retraction. Furthermore, the retraction-forcing lemma allows nature to force every convergent method through the ascending sequence T_{S_0}, T_{S_1}, . . . , T_{S_n}, so normal Ockham methods are efficient (figure 62.a). Thus, normal Ockham strategies are efficient and all non-Ockham or non-stalwart strategies are not just inefficient, but beaten as well. This sketch is suggestive but ignores some crucial cases; the details are spelled out in the proof of the more general theorem 4, which is provided in full detail in the appendix.
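The forcing argument can be simulated in miniature (a hypothetical setup with illustrative names): nature waits until the method has committed to the theory of the effects presented so far, then reveals the next effect on the skeptical path, extracting one retraction per step.

```python
# Toy simulation of the retraction-forcing idea; names illustrative.

def M(e):
    """A convergent method: conjecture the theory of the effects seen so far
    (theories identified with effect sets)."""
    S = set()
    for stage in e:
        S |= stage
    return frozenset(S)

def force(M, path_effects):
    """Present effects one at a time, waiting for M to commit at each step.
    The inner loop terminates because M converges to the truth."""
    e, outputs = [], []
    for k in range(len(path_effects) + 1):
        target = frozenset(path_effects[:k])
        while not outputs or outputs[-1] != target:
            e.append(target)            # keep presenting the current effects
            outputs.append(M(e))
    return outputs

out = force(M, ["e1", "e2"])
retr = sum(1 for a, b in zip(out, out[1:]) if a != b)
print(retr)   # 2 retractions along a skeptical path of length 3
```

With a method that violates Ockham's razor by starting at the theory of {"e1", "e2"}, the same demonic schedule would extract a third retraction, which is the substance of the beating argument.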
Theorem 1 does not imply that stalwart Ockham methods dominate alternative methods, in the sense of doing better in every world or even as well in every world: a violation of Ockham's razor can result in no retractions at all if nature is kind enough to refute all simpler theories immediately after the violation occurs. Nor are stalwart Ockham methods minimax solutions, in the usual sense that they achieve lower worst-case loss simpliciter: every method's overall worst-case loss is infinite if there are worlds of every empirical complexity, as in the case of discovering polynomial laws. The unique superiority of stalwart Ockham strategies emerges only when one considers a hybrid decision rule: dominance in terms of worst-case bounds over the cells of a complexity-based partition of possible worlds. The same idea is familiar in the theory of computational complexity (Garey and Johnson 1979). There, it is also the case that cumulative computational losses, such as the total number of steps of computation, are unbounded over all possible worlds (i.e., input strings). The idea in computational complexity theory is to partition input strings according to length, so that the worst-case computational time over each partition cell exists and is finite. That partition is not arbitrary, as it is expected that computational time rises, more or less, with input length. In the case of inquiry, inputs never cease, so we plausibly substitute empirical complexity for length. Again, it is expected that retractions rise with empirical complexity. Then we seek methods that do as well as an arbitrary, convergent method, in terms of worst-case bounds over every cell of the empirical complexity partition.
Theorem 1 provides a motive for staying on the stalwart, Ockham path, but does not motivate returning to the path after having once deviated from it. In other words, theorem 1 provides an unstable justification for Ockham's razor. For example, suppose that method M selects T_1 twice in a row before any effects are observed, and suppose that method O reverts to a stalwart, Ockham strategy at the second stage of inquiry. Then nature can still force M to retract in the future to T_0, but O has already performed that retraction, so reversion to Ockham's razor does not result in fewer retractions. However, the inveterate Ockham violator retracts later than necessary, and efficient convergence to the truth also demands that one retract as soon as possible, if one is going to retract at all. It is common in economic analysis to discount losses incurred later, which may suggest the opposite view that retractions should be delayed as long as possible. Epistemology suggests otherwise. If nature is in a position to force one to retract T in the future by presenting only true information, then one's belief that T does not constitute knowledge, even if T is true. 8 By a natural extension of that insight, more retractions prior to arriving at the truth imply greater distance from knowledge, so getting one's retractions over with earlier brings one closer to knowledge and reduces epistemic loss.
To make this idea precise, let γ(M, w, i) be a local loss function, which is a function that assigns some non-negative quantity to M in w at stage i (e.g., ρ(M, w, i) is a local loss function). Define the delay to accumulate quantity u of loss γ, where u is a non-negative real number, as: τ_γ(M, w, u) = the least stage j such that Σ_{i=0}^{j} γ(M, w, i) ≥ u, with the important proviso that the expression denotes 0 if there is no such stage j. In the deterministic case, ρ(M, w) is always a natural number. The time delay to the kth retraction is just τ_ρ(M, w, k). It remains to compare methods in terms of worst-case retraction times. It is not quite right to compare each method's delay to each retraction; for consider the output sequences σ = (T_0, T_1, T_2) and σ′ = (T_0, T_0, T_2). Sequence σ has an earlier elapsed time to the first retraction, but it still seems strictly worse than σ′; for the retraction delays in σ are at least as late as those in σ′ if one views the first retraction in σ as an "extra" retraction and ignores it. Ignoring extra retractions amounts to considering a local loss function γ such that γ(M, w, i) ≤ ρ(M, w, i), for each w and i. In that case, say that γ ≤ ρ. Accordingly, define M ≤^τ_{e,n} M′ to hold if and only if there exists local loss function γ ≤ ρ such that for each w in C_e(n) there exists w′ in C_e(n) such that: τ_ρ(M, w, u) ≤ τ_γ(M′, w′, u), for each non-negative real u. Define <^τ_{e,n}, ≤^τ_e and ≺^τ_e as was done for ρ. Now define efficiency and beating from e onward in terms of retraction times by substituting τ for ρ in the corresponding definitions provided in the preceding section.
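The delay loss, including the proviso that the delay is 0 when the total loss never reaches u, can be sketched as follows (illustrative names, finite sequences standing in for worlds).

```python
# Sketch of the time-delay loss tau for a finite sequence of per-stage losses.

def delay_to(losses, u):
    """tau_gamma(M, w, u): least stage j by which cumulative loss reaches u;
    0 if it never does (the proviso in the text)."""
    total = 0
    for j, g in enumerate(losses):
        total += g
        if total >= u:
            return j
    return 0

outputs = ["T0", "T1", "T2"]
rho = [0] + [1 if a != b else 0 for a, b in zip(outputs, outputs[1:])]
print(rho)                 # [0, 1, 1]: retractions at stages 1 and 2
print(delay_to(rho, 1))    # 1: the first retraction occurs at stage 1
print(delay_to(rho, 2))    # 2: the second retraction occurs at stage 2
print(delay_to(rho, 3))    # 0: there is no third retraction
```

Comparing (T_0, T_1, T_2) against (T_0, T_0, T_2) with a γ that zeroes out the first sequence's extra retraction reproduces the verdict in the text: the extra retraction does not purchase an advantage in retraction times.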
Theorem 2 (deterministic, stable Ockham efficiency theorem) Let the loss be retraction times. Assume that question Q e has no short skeptical paths and that method M converges to the truth. Then the following are equivalent: 1. method M is Ockham and stalwart from e onward; 2. method M is efficient from e onward; 3. method M is unbeaten from e onward.
Proof: Consequence of theorem 4 below.
Retraction may be viewed as a strategy for eliminating error, so it is of interest to check whether theorem 2 can be strengthened to include the total number of errors committed as a loss. Let ε(M, w, i) assume value 1 if M produces a theory incorrect of S_w at stage i and value 0 otherwise. Define the cumulative errors of M in w as ε(M, w) = Σ_{i=0}^∞ ε(M, w, i). Violating Ockham's razor at e also increases the worst-case error bound over complexity cell C_e(0). Why? We claim that any method that is Ockham from e onward never errs after e in any world in C_e(0), whereas any method that violates Ockham's razor at e errs at least once in some world in C_e(0). In every world w in C_e(0), there is some stage n_w at which T_w becomes the uniquely simplest theory compatible with experience, and moreover, there is no stage between e and n_w at which some other theory T′ ≠ T_w is uniquely simplest. Because every Ockham method refuses to answer anything other than the uniquely simplest theory (when it exists) after e, it follows that such methods commit no errors in any world in C_e(0). In contrast, if M violates Ockham's razor at e, then M returns some theory T that is not uniquely simplest at e. Hence, there is some theory T′ ≠ T such that c_e(T′) = 0, and it follows that M commits at least one error in every world in which T′ is true.
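The error loss and the contrast just drawn can be sketched in the toy setting (illustrative names; the uninformative answer, written "K" here, is correct of every effect set and so never counts as an error).

```python
# Sketch of the error loss epsilon for finite output sequences.

def errors(outputs, true_theory):
    """Count stages whose answer is neither uninformative nor true."""
    return sum(1 for a in outputs if a not in ("K", true_theory))

# In a complexity-0 world, a method that only ever produces the uniquely
# simplest theory (or suspends with "K") never errs, whereas a method that
# violates Ockham's razor at the start errs at least once:
ockham_out   = ["K", "T0", "T0"]
violator_out = ["T1", "T0", "T0"]
print(errors(ockham_out, "T0"), errors(violator_out, "T0"))   # 0 1
```
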
We focus on retractions and their times primarily because violating Ockham's razor at e yields more retractions in every non-empty complexity cell C_e(n), whereas the Ockham violator does worse in terms of errors only in C_e(0). The reason for the weaker result in the error case is, in a sense, trivial: the worst-case bound on total errors is infinite in every non-empty complexity cell C_e(n) other than C_e(0) for all convergent methods, including the stalwart, Ockham methods. To see why, recall that nature can force an arbitrary, convergent method M to converge to some theory T of complexity n and to produce it arbitrarily often before refuting T (by lemma 1). Thereafter, nature can extend the data to a world w of complexity n + 1 in which T is false, so M incurs arbitrarily many errors, in the worst case, in C_e(n + 1). Retractions and retraction times are not more important than errors; they are simply more sensitive than errors at exposing the untoward epistemic consequences of violating Ockham's razor.
Nonetheless, one may worry that retractions and errors trade off in an awkward manner, since avoiding retractions seems to promote dogmatism, whereas avoiding errors seems to motivate skeptical suspension of belief. Such tradeoffs are inevitable in some cases, but not in the worst cases that matter for the Ockham efficiency theorems. Consider, again, just the easy (Pareto) comparisons in which one method does as well as another with respect to every loss under consideration. Let L be some subset of the loss functions {ρ, ε, τ}. Then the worst-case Pareto order and worst-case Pareto dominance relations in L are defined as: M ≤^L_e M′ iff M ≤^λ_e M′ for each λ in L, and M ≺^L_e M′ iff, in addition, M ≺^λ_e M′ for some λ in L. Efficiency and beating may now be defined in terms of ≤^L_e and ≺^L_e, just as in the case of ρ. The following theorem says that the Ockham efficiency theorems are driven primarily by retractions or retraction times, but errors can go along peacefully for the ride as long as only easy loss comparisons are made.
Theorem 3 (Ockham efficiency with errors) Let the loss concept be ≤^L, where L ⊆ {ρ, ε, τ}. Assume that question Q_e has no short skeptical paths and that method M converges to the truth. Then the following are equivalent: 1. method M is Ockham and stalwart from e onward; 2. method M is efficient from e onward; 3. method M is unbeaten from e onward.
Proof: Consequence of theorem 4 below. 9

Stochastic Inquiry

The aim of the paper is to extend the preceding theorems to mixed strategies. As discussed above, the extension is of interest since the Ockham efficiency theorems are based on worst-case loss with respect to the cells of an empirical complexity partition and, in some games, stochastic (mixed) strategies can achieve better worst-case loss than can deterministic (pure) strategies. We begin by introducing a very general collection of stochastic strategies.
Recall that a deterministic method M returns an answer A when finite input sequence e is provided, so that p(M(e) = A) = 1. Now conceive of a method more generally as a random process that produces answers with various probabilities in response to e. Then one may think of M_e as a random variable, defined on a probability space (Ω, F, p), that assumes values in A_Q. A random variable is a function defined on Ω, so that M_e(ω) denotes a particular answer in A_Q. A method is then a collection {M_e : e is in F_K} of random variables assuming values in A_Q that are all defined on an underlying probability space (Ω, F, p). 10 In the special case in which p(M_e = A) is 0 or 1 for each e and answer A, say that M is a deterministic method or a pure strategy.
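A stochastic method can be sketched as assigning, for each input sequence, a probability distribution over answers, with pure strategies as the 0/1 special case. The snippet below (illustrative names) also previews the notion of retractions in chance from the introduction: the summed drops in each answer's probability across successive stages.

```python
# Sketch: stochastic methods as per-stage answer distributions.

def is_pure(dist, tol=1e-12):
    """A pure strategy assigns probability 0 or 1 to every answer."""
    return all(abs(p) < tol or abs(p - 1.0) < tol for p in dist.values())

def retractions_in_chance(dists):
    """Total drop in the probability of producing each answer, summed over
    successive stages."""
    total = 0.0
    for d0, d1 in zip(dists, dists[1:]):
        for ans in set(d0) | set(d1):
            total += max(0.0, d0.get(ans, 0.0) - d1.get(ans, 0.0))
    return total

stages = [{"T0": 1.0}, {"T0": 0.5, "T1": 0.5}, {"T1": 1.0}]
print(is_pure(stages[0]), is_pure(stages[1]))   # True False
print(retractions_in_chance(stages))            # 1.0: T0 drops by 0.5 twice
```

Note that a deterministic retraction from T_0 to T_1 in one step would also incur a loss of 1.0 on this measure, which is why mixing does not automatically beat pure strategies here.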
Let M be a method and let e in F K have length l. Then the random output sequence of M in response to e with respect to ω is the random sequence M[e](ω) = (M e|1 (ω), . . . , M e|l (ω)). Consider the situation of a scientist who is deciding whether to keep method M or to switch to some alternative method M ′ after e has been received. In the deterministic case, it does not really matter whether the decision is undertaken before M produces its deterministic response to e or after, since the scientist can predict perfectly, from the deterministic laws governing M, how M will respond to e. That is no longer the case for methods in general: the probability that M e = A may be fractional prior to the production of A but becomes 1 thereafter. However, the case of deciding after the production of A reduces to the problem of deciding before, because the former case can be modeled by replacing M e with a method that produces A in response to e deterministically. Therefore, without loss of generality, we consider only the case of deciding before the response is produced.
The methodological principles of interest must be generalized to apply to stochastic methods. Let e be in F K and let D be an event of nonzero probability. Say that M is logically consistent at e given D if and only if: Say that M is Ockham at e given D if and only if: Finally, say that M is stalwart at e given D if and only if: when T is Ockham at e and p(M e − = T ∧ D) > 0. This plausibly generalizes the deterministic version of stalwartness: given that you produced an answer before and it is still Ockham, keep it for sure.
The concepts pertaining to inquiry and efficiency must also be generalized. Say that M converges to the truth over K e given event D if and only if: Each of the above methodological properties is a relation of form Φ(M, e | D). In particular, one can consider Φ(M, e | M [e − ] = σ ), for some random output sequence σ of M along e − such that p(M [e − ] = σ ) > 0, in which case Φ is said to hold of M at (e, σ ). When Φ holds of M at each pair (e , σ ) such that e is in F K,e and σ is a random output sequence of M along e − such that p(M [e − ] = σ ) > 0, then say that Φ holds from (e, σ ) onward. When Φ holds from ((), ()) onward, say that Φ holds always. For example, one can speak of M always being stalwart or of M converging to the truth from (e, σ ) onward.
Turn next to epistemic losses. There are two ways to think about the loss of a stochastic method: as loss in chance or as expected loss. For example, T is retracted in chance at e if the probability that the method produces T drops at e. Define, respectively, the total errors in chance and retractions in chance at i in w given D such that p(D) > 0 to be: where x ∸ y = max(x − y, 0) denotes truncated subtraction. For γ̄ ranging over ρ̄, ε̄, define the total loss in chance to be:

The next problem is to compare two methods M, M ′ in terms of worst-case loss in chance or expected loss at e of length l. Each stochastic method has its own probability space, (Ω , F , p) and (Ω ′ , F ′ , p ′ ), respectively. Recall that M and M ′ are being compared when the last entry of e has been presented and M, M ′ have yet to produce their random outputs in response. Suppose that, as a matter of fact, both M and M ′ responded to e − by producing, with chances greater than zero, the same random trajectory σ of length l. Let γ be ρ or ε, and let γ̄ be the corresponding loss in chance, ρ̄ or ε̄. Then, as in the deterministic case, define M ≤ γ e,σ ,n M ′ (respectively, M ≤ γ̄ e,σ ,n M ′ ) to hold if and only if for each w in C e (n), there exists w ′ in C e (n) such that:

Methods can be compared in terms of expected retraction times just as in the deterministic case. Define the comparison M ≤ τ e,σ ,n M ′ to hold if and only if there exists random local loss function γ ≤ ρ such that for every world w in C e (n), there exists world w ′ in C e (n) such that for each k:

Comparing retraction times in chance is similar to comparing expected retraction times. Let γ, δ map methods, worlds, stages of inquiry, and measurable events to real numbers.
The only obvious difference from the definition for expected retraction times is the exemption of an arbitrarily small interval I of possible values for cumulative retractions in chance. The reason for the exemption is that stalwart, Ockham strategies can be forced by nature to retract fully at each step down a skeptical path, whereas some convergent methods can only be forced to perform 1 − ε retractions in chance at each step, for arbitrarily small ε.
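The bookkeeping behind retractions in chance can be made concrete. Assuming a finite answer set, the total retractions in chance along a world sum the drops in each answer's chance at each stage, using truncated subtraction x ∸ y = max(x − y, 0). The data and names below are illustrative assumptions; in particular, whether the uninformative answer is counted follows the paper's formal definition, which is elided here.

```python
def trunc_sub(x, y):
    """Truncated subtraction: x - y if positive, else 0."""
    return max(x - y, 0.0)

def retractions_in_chance(dists):
    """Total retractions in chance along a sequence of output
    distributions, dists[i] = {answer: p(M_{w|i} = answer)}.
    Each answer contributes the drop, if any, in its chance."""
    total = 0.0
    answers = set().union(*dists)
    for prev, cur in zip(dists, dists[1:]):
        total += sum(trunc_sub(prev.get(A, 0.0), cur.get(A, 0.0))
                     for A in answers)
    return total

# A theory held with chance 0.9 and then dropped to 0.1 yields a
# retraction in chance of 0.8 at that stage; rising chances contribute 0.
dists = [{"T0": 0.9, "K": 0.1},
         {"T0": 0.1, "T1": 0.8, "K": 0.1}]
print(round(retractions_in_chance(dists), 10))  # 0.8
```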

Stochastic Ockham Efficiency Theorem
Here is the main result.
Theorem 4 (stochastic Ockham efficiency theorem) Theorem 3 extends to stochastic methods and losses in chance when "from e onward" is replaced with "from (e, σ ) onward", for all (e, σ ) such that p(M [e − ] = σ ) > 0. The same is true for expected losses.
The proof of the theorem is presented in its entirety in the appendix. The basic idea is that nature can still force a random method to produce the successive theories along a skeptical path with arbitrarily high chance, if the method converges in probability to the truth. The following result entails lemma 1 as a special case and is nearly identical in phrasing and proof.
Lemma 2 (forcing changes of opinion in chance) Let e be a finite input sequence of length l, and suppose that M converges to the truth in Q e . Let p(D) > 0. Let (S 0 , . . . , S n ) be a skeptical path in Q e such that c e (S n ) = n, let ε > 0 be arbitrarily small, and let natural number m be arbitrarily large. Then there exists world w in C e (n) and stages of inquiry l = s 0 < . . . < s n+1 such that for each i from 0 to n, stage s i+1 occurs more than m stages after s i and p(M w| j = T S i | D) > 1 − ε at each stage j such that s i+1 − m ≤ j ≤ s i+1 .
Proof: To construct w, set e 0 = e and s 0 = l. For each i from 0 to n, do the following. Extend e i with world w i such that S w i = S i . Since M converges in probability to the truth, there exists a stage s such that p(M w i | j = T S i | D) > 1 − ε for each stage j ≥ s. Let s ′ be the least such s. Let s i+1 = max(s ′ , s i ) + m. Set e i+1 = w i |s i+1 . The desired world is w n , which is in C e (n), since S w n = S n .
Hence, expected retractions are forcible from convergent, stochastic methods much as they are from deterministic methods (lemma 5), and retractions in chance are a lower bound on expected retractions (lemma 4). On the other hand, it can be shown that a stochastic, stalwart, Ockham method incurs expected retractions only when its current theory is no longer uniquely simplest with respect to the data (lemma 8), so such a method incurs at most n expected retractions or retractions in chance after the end of e in C e (n). Violating Ockham's razor or stalwartness adds, in every nonempty complexity cell C e (n), extra retractions in chance (and expected retractions) that an Ockham method would not perform, as in the deterministic case (lemmas 6 and 7).
The worst-case errors of stochastic methods are closely analogous to those in the deterministic case. Ockham methods produce no expected errors or errors in chance in C e (0) (lemma 10), and all methods produce arbitrarily many expected errors or errors in chance, in the worst case, in each nonempty C e (n) such that n > 0 (lemma 11).
The retraction times of stochastic methods are a bit different from those of deterministic methods. Retraction times in chance are closely analogous to retraction times in the deterministic case, except that one must consider the times of fractional retractions in chance. The relevant lower bounds are provided by lemmas 15 and 16 and the upper bounds are provided by lemma 17. Expected retraction times are a bit different. For example, a stochastic method that produces fewer than n expected retractions may still have a nonzero expected time for retraction number m > n, if the mth retraction is very improbable. That disanalogy is actually exploited in the proof of theorem 4. To force expected retraction times to be arbitrarily late in C e (n), for n > 0, one may choose the delay time m in lemma 2 to be large enough to compensate for the fact that the n forced retractions occur only with chance 1 − nε (lemmas 13, 16). But the anomaly does not arise for stalwart, Ockham methods, which satisfy upper bounds agreeing with the deterministic case, so the logic of the Ockham efficiency argument still goes through (lemma 17).
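The anomaly for expected retraction times can be illustrated with a toy calculation (the probabilities and stage numbers below are made up for the sketch): a rare but very late mth retraction barely affects expected retractions, yet it makes the expected time of that retraction nonzero.

```python
# Hypothetical illustration of the expected-retraction-time disanalogy.
# Suppose n = 1: with chance 0.99 the method retracts exactly once, at
# stage 5; with chance 0.01 it retracts a second time, at stage 10**6.
p_second = 0.01      # chance of the improbable second retraction
t_second = 10**6     # its (very late) stage, if it occurs

# Fewer than 2 expected retractions (mathematically, 1.01) ...
expected_retractions = 1 * (1 - p_second) + 2 * p_second

# ... yet the expected time of the second retraction is far from zero
# (counting the time as 0 when the second retraction never occurs).
expected_time_of_second = p_second * t_second

print(expected_retractions, expected_time_of_second)
```

This is why forcing arguments for expected retraction times can choose the delay m large enough to swamp a small chance of failure, while stalwart, Ockham methods avoid the anomaly altogether.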

Conclusion and Future Directions
According to theorem 4, the situation with stochastic methods is essentially the same as in the deterministic case: obvious stochastic analogues of Ockham's razor and stalwartness are necessary and sufficient for efficiency and for being unbeaten, when losses include retractions, retraction times, and errors. Every deterministic method counts as a stochastic method, so deterministic, convergent, stalwart, Ockham methods are efficient over all convergent, stochastic methods. In that respect, the game of inquiry differs from "rock-paper-scissors" and many other games, in which mixed strategies improve worst-case performance. In fact, flipping a fair coin at each stage to decide between the uninformative answer K and the current Ockham answer T is a bad idea in terms of expected retractions: it is a violation of stalwartness that generates extra retractions in chance and expected retractions each time one does it, from the second flip onward. That resolves the main question posed in the introduction: whether deterministic, stalwart, Ockham strategies are still efficient in comparison with convergent, stochastic strategies. The Ockham efficiency argument survives with aplomb, whether expected losses or losses in chance are considered, and for a variety of Pareto combinations of epistemic losses including total retractions, total errors, and retraction times.
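The cost of the coin-flipping violation of stalwartness can be checked by simulation. The setup below is a toy (fixed two-answer set, no incoming data, illustrative names): conditional on having just produced the informative answer T, flipping to the uninformative answer K counts as a retraction, since content is dropped; moving from K to T does not.

```python
import random

def expected_retractions_coinflip(stages, trials=20000, seed=0):
    """Monte Carlo estimate of expected retractions for a method that,
    at each stage, flips a fair coin between the uninformative answer K
    and the current Ockham answer T. Since no new data arrive in this
    toy, T stays Ockham throughout, so a stalwart method would never
    retract; the coin flipper retracts whenever T is followed by K."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        prev = None
        for _ in range(stages):
            cur = rng.choice(["K", "T"])
            if prev == "T" and cur == "K":  # dropping T for K: a retraction
                total += 1
            prev = cur
    return total / trials

# Each of the 10 transitions retracts with chance p(T then K) = 1/4,
# so the estimate should hover near 2.5; a stalwart method incurs 0.
est = expected_retractions_coinflip(stages=11)
print(est)  # ≈ 2.5
```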
The second ambition mentioned in the introduction concerns statistical inference, in which outputs are stochastic due to randomness in the data rather than in the method. Let the question be whether the mean µ of a normal distribution of known variance is 0 or not. In statistical testing theory, one calls the theory T µ=0 that µ = 0 the null hypothesis. A statistical test at a given sample size N partitions the possible values of the sample mean X into those at which T µ=0 is accepted and those at which T µ=0 is rejected, and the test has significance α if the chance that it rejects T µ=0 is no greater than α, assuming that T µ=0 is true. It is a familiar fact that such a test does not converge to the true answer as sample size increases unless the significance level is tuned downward according to an appropriate schedule. However, there are many significance-level schedules that yield statistically consistent procedures. We propose that retraction efficiency can plausibly be used to tie the rate at which α is dropped to the rate at which the sample variance decreases.
Retractions in chance and, hence, expected retractions arise unavoidably, in the following way, in the problem of determining whether or not µ = 0. 11 Suppose that the chance that a statistical test M accepts T µ=0 at sample size N when µ = 0 exceeds 1 − ε/2, where ε > 0 is as small as you please. Then there is a sufficiently small r > 0 such that the chance that M accepts T µ=0 at sample size N given that µ = r still exceeds 1 − ε/2. But as sample size is increased, one reaches a sample size N ′ at which the test M "powers up", so that the chance that M rejects T µ=0 given that µ = r is greater than 1 − ε/2. We have forced the test into a retraction in chance of more than 1 − ε.
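This forcing argument can be checked numerically for a standard two-sided z-test with known variance 1. The significance level, the alternative r, and the sample sizes below are illustrative assumptions; the point is only that the chance of accepting the null at µ = r swings from near 1 − α to near 0 as the sample size grows, which is a retraction in chance.

```python
from statistics import NormalDist

# Two-sided z-test of the null hypothesis mu = 0 (known variance 1).
# Acceptance region at sample size n: |sample mean| <= c, where
# c = z_{1 - alpha/2} / sqrt(n).
Z = NormalDist()

def p_accept_null(mu, n, alpha=0.05):
    """Chance that the test accepts mu = 0 when the true mean is mu."""
    c = Z.inv_cdf(1 - alpha / 2) / n ** 0.5
    root_n = n ** 0.5
    return Z.cdf((c - mu) * root_n) - Z.cdf((-c - mu) * root_n)

r = 0.01                    # a small alternative mean (illustrative)
small_n, big_n = 100, 10**6

# At small N the test accepts mu = 0 with high chance even though mu = r ...
print(p_accept_null(r, small_n))  # close to 1 - alpha
# ... but at large N it "powers up" and rejects with high chance:
# a retraction in chance of nearly 1.
print(p_accept_null(r, big_n))    # close to 0
```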
The preceding argument is exactly analogous to the proofs of the stochastic Ockham efficiency theorems, in which it is shown that any consistent method accrues at least one expected retraction in complexity class one. If one assumes, as is natural, that C(0) contains just µ = 0 and C(1) contains all values of µ other than 0, then the number of forcible retractions in chance equals the complexity of the statistical hypotheses in question, just as in our model of inquiry. 12 Generalizing the efficiency theorems to statistical inference requires, therefore, only three further steps: (1) proving that methods that prefer simpler statistical hypotheses approximate the theoretical lower loss bounds, (2) proving that methods that violate Ockham's razor do not approximate those bounds, and (3) generalizing (1) and (2) to multiple retractions.
The first step, we conjecture, is straightforward for one-dimensional problems like determining whether the mean µ of a normally distributed random variable is zero or not, provided that losses are considered in chance. Expected retractions, by contrast, may be unbounded even for simple statistical tests, because there are values of µ at which the chance of accepting the null hypothesis hovers around 1/2 for arbitrarily many sample sizes. 13 Retractions in chance are more promising (and also agree with standard testing theory, in which power is an "in chance" concept). Suppose that statistical method M ignores the traditional logic of statistical testing and accepts the complex hypothesis that µ ≠ 0 with high chance 1 − α, contrary to the usual practice of favoring the null hypothesis. If µ is chosen to be small enough, then M is forced to accept that µ = 0 with arbitrarily high chance, if M converges in probability to the true answer. Thereafter, M can be forced back to the answer that µ ≠ 0 when µ = r, for r suitably near to 0. Thus, M incurs an extra retraction, in the worst case, of nearly 1 − α, both in C(0) and in C(1).
The second and third steps, in contrast, are significantly more difficult, because statistical methods that converge to the truth in probability cannot help but produce random "mixtures" of simple and complex answers. Therefore, efficiency and adherence to Ockham's razor and to stalwartness can only be approximate in statistical inference.

Acknowledgements
This work was supported generously by the National Science Foundation under grant 0750681. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We thank Teddy Seidenfeld for urging the extension of the result to stochastic methods. We thank Cosma Shalizi for commenting on an earlier draft of this paper at the Formal Epistemology Conference in 2009. We thank the anonymous referee for requesting clarification on the connections with game theory and for unusually detailed comments. We thank the editor, Branden Fitelson, for encouraging clarification and for providing the space for it.

Appendix -Comparison with Game Theory
The model of scientific inquiry described above might be represented in any number of ways as a game in the economist's sense. Thus, the reader might be interested in the relationship between our results and those typically found in game theory. We remark upon five important differences. 14 First, as stated in the introduction, the most general equilibrium existence theorems of game theory yield little information about what the equilibria are like. In contrast, our results pick out a particular, important class of strategies, namely the Ockham ones, as uniquely optimal. Some game-theoretic results do specify properties of the equilibria. For instance, Von Neumann's minimax theorem shows that, in equilibria for finite, two-person, zero-sum games, both players employ minimax strategies, i.e., strategies that minimize the maximum possible loss. Although that theorem appears especially relevant to our results, the worst-case loss vectors that we consider are relative to the cells of a complexity-based partition of worlds, not to all possible worlds. There are no minimax (simpliciter) actions in our model of inquiry (for either the scientist or Nature) and, as a result, Von Neumann's theorem is of little help.
Second, in our model of inquiry, the scientist's preferences cannot be represented by utilities. The chief difficulty is that the scientist's preferences involve lexicographic components: among all losses of inquiry, the scientist values eventually finding the truth highest and considers all other losses (e.g., minimization of errors and minimization of retractions) secondary. It is well known that, in games in which players' preferences contain lexicographic components, even the simplest theorems guaranteeing the existence of equilibria fail. 15 Moreover, our players' preferences are not representable as utilities because they are merely pre-ordered, not totally ordered. That feature immediately threatens the existence of Nash equilibria in even the simplest games: consider, for example, a one-person game in which the only player has two actions whose outcomes have incomparable value. Then there is no Nash equilibrium in the standard sense, as there is no action that is even weakly better than all others. One can show that, in competitive games in which players' preferences are represented by vectors of real numbers under the Pareto ordering (note that such preferences do not have lexicographic components), there are "weak" Nash equilibria, in the sense that there are strategy profiles from which no player has reason to deviate. 16 However, the equilibria guaranteed by such proofs are "weak" in the sense that a player may not prefer the equilibrium strategy profile to all others in which only his or her own action is changed; the player may have no preference whatsoever. In contrast, the result we obtain here is more analogous to a "strong" Nash equilibrium: the scientist prefers playing Ockham strategies to non-Ockham ones, and that preference is strict.

Third, both the scientist and "Nature" have access to infinitely many actions in our model of inquiry.
There are well-known results guaranteeing the existence of countably-additive equilibria in infinite games but, generally, such theorems also contain strong restrictions on the players' preference relations, in addition to assuming that they are representable by utilities. For instance, it is often assumed that players' utility functions are continuous or bounded functions with respect to an appropriate topology on the outcome space. 17 No such assumptions hold in our model: the scientist's losses are potentially unbounded (even within complexity classes), and the obvious topology to impose on our outcome space does not yield continuous preference relations. If one permits players to employ merely finitely-additive mixed strategies, one can drop these assumptions on preference relations (but not the assumption that they are representable by utilities) and obtain existence of equilibria in zero-sum games. 18 However, the randomized strategies considered here are countably additive, which makes our result even more surprising.
Fourth, in game theory, if one player is permitted to employ mixed strategies (or behavior strategies), it is typical to assume that all players are permitted to do so. The model of inquiry presented here does not permit the second player, "Nature", to employ mixed strategies. That raises the question: can one make sense of Nature employing "mixed strategies" and, if so, does doing so change the result stated here? We do think, in fact, that one can reasonably interpret Nature's mixed strategies as a scientist's prior probabilities over possible worlds, and one can prove the existence of (merely finitely-additive) equilibria in particular presentations of our model of inquiry when it is represented as a game. 19 However, the main result of this paper employs no such prior probabilities.
Fifth, and finally, the last major hurdle in representing our theorems as game-theoretic equilibria is the development of a more general theory of simplicity. The definition of simplicity stated in this paper is very narrow, allowing only for prior knowledge about which finite sets of effects might occur; knowledge about the timing and order of effects is not allowed for. But nothing prevents Nature from choosing a mixed strategy that implies knowledge about the timing or order of effects (recall that Nature's mixture is to be understood as the scientist's prior probability). Such knowledge may essentially alter the structure of the problem. For example, if Nature chooses a mixing distribution according to which effect a is always followed immediately by effect b, then the sequence a, b ought properly to be viewed as a single effect rather than as two separate effects. 20 But if simplicity is altered by Nature's choice of a mixing distribution, then so is Ockham's razor and, hence, what counts as an Ockham strategy for the scientist. Therefore, in order to say what it means for Ockham's razor to be a "best response" to Nature, it is necessary to define simplicity with sufficient generality to apply to every possible restriction of the set of worlds compatible with K to a narrower range of worlds. More general theories of simplicity than the one presented in this paper have been proposed and have been shown to support Ockham efficiency theorems (Kelly 2007d), but those concepts are still not general enough to cover all possible restrictions of W K . Of course, a general Ockham efficiency theorem based on a general concept of simplicity would be of considerable interest quite independently of this exploratory discussion of relations to game theory.

Proof of Deterministic Theorems
Proof of theorem 1. It is immediate from the definitions that 2 implies 3. To see that 1 implies 2, suppose that finite input sequence e is in F K and let O be a convergent method that is always Ockham and stalwart. Suppose that O retracts k times along e − and that w is an arbitrary world in complexity cell C e (n). Method O retracts after e in w only when the currently simplest theory is refuted by the data. Thus, ρ(O, w) ≤ k + n if O does not retract at e and ρ(O, w) ≤ k + n + 1 otherwise. Let M be an arbitrary, convergent method that agrees with O along e − . Consider the case in which O retracts at e. Since O is stalwart, the theory T S output jointly by O and by M at e − is not the uniquely simplest theory given e. Since w is in complexity cell C e (n), there exists a skeptical path π of length n + 1 in Q e . Since T S is not uniquely simplest and Q has no short skeptical paths, there exists S 0 in K e such that S is distinct from S 0 . Since there are no short skeptical paths in Q e , the existence of π implies the existence of a skeptical path (S 0 , . . . , S n ) of length n + 1 in K e . Since S e is a subset of S 0 , there exists world w 0 extending e such that S w 0 = S 0 . Let e 0 be the data presented by w 0 when M converges to T S 0 , so M retracts T S after the end of e in e 0 . Apply lemma 1 to obtain world w ′ in C e 0 (n) in which M retracts at least n times after e 0 . Then ρ(M, w ′ ) ≥ k + n + 1 ≥ ρ(O, w). In the alternative case in which O does not retract at e, the forcing lemma yields immediately that ρ(M, w ′ ) ≥ k + n ≥ ρ(O, w). Thus, O ≤ ρ e M. Since M is an arbitrary, convergent method that agrees with O along e − , we have that O is efficient given e in terms of retractions, and since e is an arbitrary input sequence in F K , it follows that O is always efficient.
For the proof that 3 implies 1, suppose that e is in F K and that M violates either Ockham's razor or stalwartness at e but not at any proper initial segment of e. Let O agree with M along e − and revert to a convergent, stalwart, Ockham strategy thereafter, and suppose that M (and hence O) retracts k times along e − . Consider the (hard) case in which M violates Ockham's razor at e and O retracts at e. The theory T S produced jointly by O and M at e − is uniquely simplest at e − but not at e, by the stalwartness of O at e and the fact that M is Ockham at e − . Since S is uniquely simplest at e − and each theory is correct for a unique effect set, it follows that S is a subset of every S ′ in K e − . That would still be the case at e if S e were a subset of S, so e must refute T S . Since M is logically consistent, it follows that the theory M produces at e differs from T S , so both M and O retract k + 1 times along e. Suppose that C e (n) is non-empty. Then there exists a skeptical path π of length n + 1 in Q e . Since T S is not uniquely simplest, there exists S 0 distinct from S in K e . Since there are no short skeptical paths in K e , the existence of π implies the existence of a skeptical path (S 0 , . . . , S n ) of length n + 1 in Q e . Since S e is a subset of S 0 , there exists world w 0 extending e such that S w 0 = S 0 . Let e 0 be the data presented by w 0 when M converges to T S 0 , so M retracts once more after the end of e in e 0 , for a total of k + 2 retractions along e 0 . The second retraction is the penalty incurred by M for violating Ockham's razor. Apply lemma 1 to obtain world w ′ in C e 0 (n) such that ρ(M, w ′ ) ≥ k + 2 + n. It has already been shown that ρ(O, w) ≤ k + 1 + n, for each w in C e (n). Therefore, O < ρ e M, so M is beaten at e in terms of retractions and, hence, is not always unbeaten.
The case in which O does not retract at e is easier, for one can simply drop the proof that M retracts at e from the argument for the alternative case, with the result that ρ(O, w) ≤ k + n and k + n + 1 ≤ ρ(M, w ′ ). The case in which M violates stalwartness at e is easier still, because M picks up a retraction at e that O avoids and the forcing argument prevents M from catching up with O later.
Proof of theorem 2. It is immediate from the definitions that 2 implies 3. To see that 1 implies 2, suppose that finite input sequence e of length l is in F K and let O be a convergent method that is always Ockham and stalwart. Suppose that O retracts at times r 1 , . . . , r k along e − and that w is an arbitrary world in complexity cell C e (n). Method O retracts after e in w only when the currently simplest theory is refuted by the data. Thus, there exist s 1 < . . . < s n such that l < s 1 and: τ(O, w) ≤ (r 1 , . . . , r k , l, s 1 , . . . , s n ). Let M be an arbitrary, convergent method that agrees with O along e − . Consider the case in which O retracts at e. As in the proof of theorem 1, there exists skeptical path (S 0 , . . . , S n ) of length n + 1 in K e . Since S e is a subset of S 0 , there exists world w 0 extending e such that S w 0 = S 0 . Let e 0 be the data presented by w 0 when M converges to T S 0 , so M retracts T S no sooner than stage l. Let m be the maximum of s i+1 − s i , for i ≤ n. Apply lemma 1 to obtain world w ′ in C e 0 (n) in which M retracts at least n times after l with time lag ≥ m between retractions. Then: τ(M, w ′ ) ≥ (r 1 , . . . , r k , l, s 1 , . . . , s n ) ≥ τ(O, w).
In the alternative case in which O does not retract at e, delete l from τ(O, w) and τ(M, w ). Since M is an arbitrary, convergent method that agrees with O along e − , we have that O is efficient given e in terms of timed retractions, and since e is an arbitrary input sequence in F K , it follows that O is always efficient.
The proof that 3 implies 1 is simpler than in the proof of theorem 1, because M can be beaten in terms of times rather than overall retractions. Suppose that e is in F K and that M violates either Ockham's razor or stalwartness at e, not necessarily for the first time. Let (r 1 , . . . , r k ) be the times at which M retracts along e − . Let O agree with M along e − and revert to a convergent, stalwart, Ockham strategy thereafter. Consider the hard case in which M violates Ockham's razor at e and O retracts at e. Since O is stalwart at e, the theory T S produced jointly by O and M at e − is not uniquely simplest at e. Suppose that C e (n) is non-empty. As in the proof of theorem 1, there exists a skeptical path (S 0 , . . . , S n ) of length n + 1 in Q e such that S 0 differs from S. Nature is free to present just the effects in S 0 until M retracts its answer at e in favor of T S 0 , which happens after e. This late retraction is the penalty incurred by M for violating Ockham's razor. Now lemma 1 yields a world w ′ in C e (n) such that: τ(M, w ′ ) ≥ (r 1 , . . . , r k , l + 1, s 1 , . . . , s n ). Since O retracts at most n times along arbitrary world w in C e (n), there exist s 1 , . . . , s n such that: τ(O, w) ≤ (r 1 , . . . , r k , l, s 1 , . . . , s n ) < (r 1 , . . . , r k , l + 1, s 1 , . . . , s n ) ≤ τ(M, w ′ ).
Thus, M ≰ τ e O. Since it has already been shown that O ≤ τ e M, it follows that O < τ e M. The other cases are easier because M also retracts more times than O in those cases.
Proof of theorem 3. When ρ suffices to show Ockham efficiency, so does τ, since lemma 1 yields retractions that come arbitrarily late in C e (n + 1). Thus, stalwart, Ockham strategies are efficient in terms of τ even though their retractions come arbitrarily late. Furthermore, non-Ockham or non-stalwart strategies are beaten in terms of retractions and, hence, in terms of retraction times. Extending the Pareto ordering to errors does no harm, since lemma 1 yields arbitrarily many errors in each non-empty complexity cell C e (n + 1) and Ockham methods produce no errors in C e (0).

Proof of Stochastic Theorem and Lemmas
The proof of theorem 4 breaks down naturally into two principal cases. Assume that e of length l is in F K , that M is a method, that σ is an output sequence of length l such that p(M [e − ] = σ ) > 0. In the defeat case, the last entry in σ is some informative answer T to Q that is not Ockham with respect to e (i.e., any justification for T derived from Ockham's razor is defeated by e). Thus, Ockham methods pick up a retraction at e in the defeat case and non-Ockham methods may fail to retract at e. The non-defeat case holds whenever the defeat case does not.
Proof of theorem 4: We begin by proving the case of theorem 4 that corresponds to the second clause of theorem 3. Assume that Q e has no short skeptical paths. First, we show that convergent methods that are stalwart and Ockham from (e, σ ) onward are efficient from (e, σ ) onward. Let stochastic method O be stalwart and Ockham from (e, σ ) onward. It is immediate that efficiency from (e, σ ) onward implies being unbeaten from (e, σ ) onward.
To show that being convergent and unbeaten from (e, σ ) onward implies being stalwart and Ockham from (e, σ ) onward, assume that M is convergent but violates either Ockham's razor or stalwartness at (e , σ ), where (i) e is in F K e , (ii) σ is an answer sequence extending σ , and (iii) both e and σ − have length l . Let O be a convergent method that is always stalwart and Ockham.
Consider first the case for expected losses, in which τ is in L , which is a subset of {ρ, ε, τ}.
Next consider the case for losses in chance, in which τ̄ is in L , which is a subset of {ρ̄, ε̄, τ̄}. Follow the preceding argument down to the invocation of lemma 16. The same lemma, in this case, provides a world w in C e (n) and an α > 0 such that either τ̄(M, w, k + 1 | M [e − ] = σ ) > l or τ̄(M, w, k + n + 1 + α | M [e − ] = σ ) > 0. By lemma 12, there exists ε > 0 such that the preceding inequalities hold for each v such that k + 1 − ε < v ≤ k + 1 or k + n + 1 + α − ε < v ≤ k + n + 1 + α, respectively. So, by lemma 17, there is no open interval I in the real numbers that witnesses M ≤ τ̄ e ,σ ,n O[σ /e − ].

Next, we prove the case of theorem 4 that corresponds to the first clause of theorem 3. Focus first on the case of expected losses. Note that "always" is the special case of "from (e, σ ) onward" in which e and σ are both the empty sequence. Therefore, the case in which τ is in L drops out as a special case of the preceding argument. For the case in which ρ is in L , it suffices to show that if every theory is correct of a unique effect set and if M ever violates Ockham's razor or stalwartness, then M is beaten in terms of ρ at the first violation of either principle. Suppose that M violates either Ockham's razor or stalwartness at (e, σ ), so that p(M [e − ] = σ ) > 0, and suppose that (e, σ ) is the first such violation, so that there are no proper subsequences e ′ and σ ′ of e and σ at which some violation occurs. Let O be a convergent, stalwart, Ockham method, and suppose that C e (n) is nonempty. If Ockham's razor is violated at (e, σ ), then M ≰ ρ e,σ ,n O[σ /e − ] by the defeat and non-defeat cases of lemmas 6 and 9. If, instead, stalwartness is violated at (e, σ ), then M ≰ ρ e,σ ,n O[σ /e − ] by lemmas 7 and 9; note that only the non-defeat case of lemma 9 applies in this case, due to lemma 7. The argument based on losses in chance is similar and appeals to the same lemmas.
Lemma 3 (forcing retractions in chance) Suppose that M converges to the truth in Q_e and that (S_0, . . . , S_n) is a skeptical path in K_e such that c_e(S_n) = n. Then for each ε > 0, there exists a world w in C_e(n) such that:

Proof: Let ε > 0. Using the skeptical path (S_0, . . . , S_n), apply lemma 2 to obtain a world w in C_e(n) and stages l = s_0 < . . . < s_{n+1} such that s_{i+1} − s_i ≥ m and p(M_{w|(s_i + 1)} = T_{S_i} | D) > 1 − ε/2n, for each i from 0 to n. It follows that M incurs more than 1 − ε/n retractions in chance from s_i + 1 to s_{i+1} in w, since the chance of T_{S_i} drops from more than 1 − ε/2n to less than ε/2n. Since there are at least n such drops, there are more than n − ε retractions in chance.
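The count at the end of the proof is just the following arithmetic; the LaTeX sketch below restates the proof's own bounds with no new assumptions:

```latex
% Each drop takes T_{S_i} from chance above 1 - \epsilon/2n to chance
% below \epsilon/2n, so each drop contributes more than
\left(1 - \frac{\epsilon}{2n}\right) - \frac{\epsilon}{2n} \;=\; 1 - \frac{\epsilon}{n}
% retractions in chance; over the n drops this totals more than
n\left(1 - \frac{\epsilon}{n}\right) \;=\; n - \epsilon .
```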
In all the lemmas that follow, assume that e of length l is in F_K, that M is a method, that σ is an output sequence of length l such that p(M[e⁻] = σ) > 0, and that p(D) > 0.
Lemma 4 (losses in chance that bound expected losses)

Proof: Let S be an arbitrary set of natural numbers.
Lemma 5 (retractions: lower bound) Suppose that Q_e has no short paths, that M converges to the truth in Q_e, and that C_e(n) is non-empty. Then for each ε′ > 0, there exists w in C_e(n) such that:

Proof: Let ε′ > 0. In the defeat case, the last entry T in σ is not Ockham at e. Hence, there exists S_0 in K_e such that c_e(S_0) = 0 and T ≠ T_{S_0}. Extend e with just effects from S_0 until some e′ is presented such that p(M_{e′} = T_{S_0} | M[e⁻] = σ) > 1 − ε′/2, which yields nearly one retraction in chance from l to the end of e′. Since there are no short paths, there exists a skeptical path (S_0, . . . , S_n) in K_{e′} such that c_{e′}(S_n) = n. Apply lemma 3 to (S_0, . . . , S_n) with e set to e′, ε set to ε′/2, and arbitrary m > 0 to obtain another n − ε′/2 retractions in chance after the end of e′, for a total of more than n + 1 − ε′ retractions in chance from l + 1 onward. The non-defeat case is easier: just apply lemma 3 directly to (S_0, . . . , S_n) to obtain n − ε′ retractions in chance. To obtain the results for expected retractions, apply lemma 4.
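The defeat-case total is simple addition. As a sketch, write ε′ for the tolerance fixed at the start of the proof; the two contributions then sum as follows, with no new assumptions:

```latex
% Retractions in chance from l + 1 onward, defeat case of lemma 5:
\underbrace{\left(1 - \tfrac{\varepsilon'}{2}\right)}_{\text{forced switch to } T_{S_0}}
\;+\;
\underbrace{\left(n - \tfrac{\varepsilon'}{2}\right)}_{\text{lemma 3 with tolerance } \varepsilon'/2}
\;=\; n + 1 - \varepsilon' .
```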
Lemma 6 (retractions: lower bound for Ockham violators) Suppose that Q_e has no short paths, that M converges to the truth in Q_e, and that C_e(n) is non-empty. Assume, further, that each theory is correct of a unique effect set, that M is logically consistent, and that M violates Ockham's razor for the first time at (e, σ). Then there exists w in C_e(n) such that:

Proof: Suppose that M violates Ockham's razor for the first time at (e, σ), so that for some T_S that is not Ockham at e, we have that p(M_e = T_S | M[e⁻] = σ) = α > 0. Consider the defeat case, in which the last entry T_{S′} of σ is not Ockham at e. Then there exists S_0 in K_e such that c_e(S_0) = 0 and T_{S_0} ≠ T_{S′}. Since each theory is true of at most one effect set and M was Ockham at e⁻ (e being the site of M's first Ockham violation) but is no longer Ockham at e, it follows that S_e is not a subset of S′. Since M is logically consistent, p(M_e = T_{S′} | M[e⁻] = σ) = 0. But since T_{S′} is the last entry in σ, we have that p(M_{e⁻} = T_{S′} | M[e⁻] = σ) = 1, so there is one full retraction in chance already at e. Since there are no short paths, there exists a skeptical path (S_0, . . . , S_n) such that c_e(S_n) = n. Choose 0 < ε′ < α and let α′ = α − ε′. Extend e with just the effects in S_0 until M produces T_{S_0} with chance at least 1 − ε′. That entails a retraction in chance of at least α′. Choose 0 < ε′′ < α′. The effects presented are still compatible with S_0, so one may apply lemma 3 to obtain w in which n − ε′′ more retractions in chance occur, for a total of n + 1 + α′ − ε′′ > n + 1 retractions in chance in w. The non-defeat case simply drops the argument for the first full retraction. For the expected-case results, apply lemma 4.

(b) for all j such that k < j ≤ n + k in the non-defeat case.
Proof: Let m > 0 be given. Consider the defeat case, in which the last entry T of σ is not Ockham at e. Hence, there exists S_0 in K_e such that c_e(S_0) = 0 and T ≠ T_{S_0}. Let p = p(M_e = T_{S_0} | M[e⁻] = σ). We now use p to construct a finite input sequence e′, which we use in turn to construct w in C_e(n) and γ ≤ ρ. If p = 1, then set e′ = e. If p < 1, then p(M[e⁻] = σ ∧ M_e ≠ T_{S_0}) > 0, and one can choose ε > 0 sufficiently small so that:

To see that such an ε exists, note that pl + (1 − p)(l + 1) > l when p < 1. Let w′ in C_e(0) be such that S_{w′} = S_0. As M is convergent in Q_e, there exists m′ > m/(1 − (n + 1)ε) such that:

Set e′ = w′|m′. Since C_e(n) is nonempty and Q_e has no short paths, there exists a skeptical path (S_0, . . . , S_n) in K_{e′} such that c_{e′}(S_n) = n. Apply lemma 2 to (S_0, . . . , S_n), ε, and e′ to obtain w in C_{e′}(n) and stages m′ = s_0 < . . . < s_{n+1} such that, for all 0 ≤ i ≤ n, one has s_{i+1} − s_i > m′ and p(M_{w|j} = T_{S_i} | M[e⁻] = σ) ≥ 1 − ε for each j such that s_{i+1} − m′ ≤ j ≤ s_{i+1}. Let U be the set of all ω in Ω such that M_{w|s_{i+1}}(ω) = T_{S_i} for each i from 0 to n. Let ω be in U. Then, since T ≠ T_{S_0} and T_{S_i} ≠ T_{S_{i+1}} for all 0 ≤ i < n, the random output sequence M[w|s_n](ω) has retractions at some positions r_0, . . . , r_n such that l < r_0 = m′ ≤ s_0 < r_1 ≤ s_1 < . . . < s_{n−1} < r_n ≤ s_n. Let γ be just like ρ except that, for each ω in U, the function γ(M, w, i, ω) has value 0 at each stage i between m′ + 1 and s_{n+1} along M[w|s_n](ω), except at the n + 1 stages r_0, . . . , r_n. Note that the retraction at stage r_j is the (k + j + 1)-th retraction of M along w, as M retracts k times along e⁻. Now, by construction of w and m′:

Exp_{D_i}(γ_{M,w,i} ≥ k + j + 1 | M[e⁻] = σ) > m′ · (1 − (n + 1)ε) > m,

so world w and γ satisfy condition 2a. The argument for 2(b) is similar but easier, since in the non-defeat case one may skip directly to the existence of (S_0, . . . , S_n) in the preceding argument.
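The final inequality rests on a union bound. The following sketch makes it explicit; it assumes, as the construction arranges, that on U the kept retractions occur no earlier than stage m′:

```latex
% U requires the n+1 events M_{w|s_{i+1}}(\omega) = T_{S_i}, each of
% chance at least 1 - \epsilon, so by the union bound:
p\big(U \mid M[e^-] = \sigma\big) \;\ge\; 1 - (n+1)\,\epsilon .
% Hence the expected stage of the (k+j+1)-th \gamma-retraction is at least
m' \cdot \big(1 - (n+1)\,\epsilon\big) \;>\; m,
% the last step because m' > m / (1 - (n+1)\epsilon).
```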
Lemma 14 (push) If γ is a local loss function in chance and γ(M, w) ≥ v and u < v, then:

Next, suppose that M violates stalwartness at e. Then, since stalwartness is violated, the last entry of σ is some T_S that is Ockham at e, so S is uniquely simplest at e and we are in the non-defeat case. Since there are no short paths and C_e(n) is nonempty, there exists a skeptical path (S = S_0, . . . , S_n) in Q_e such that c_e(S_n) = n. Choose ε such that 0 < ε < 1/2n. Apply lemma 2 to (S_0, . . . , S_n) to obtain w in C_e(n) and stages l = s_0 < . . .