MODEL THEORY AND MACHINE LEARNING

Abstract. About 25 years ago, it came to light that a single combinatorial property determines an important dividing line in both model theory (NIP) and machine learning (PAC-learnability). The following years saw a fruitful exchange of ideas between PAC-learning and the model theory of NIP structures. In this article, we point out a new and similar connection between model theory and machine learning, this time developing a correspondence between stability and learnability in various settings of online learning. In particular, this gives many new examples of mathematically interesting classes which are learnable in the online setting.


INTRODUCTION
The purpose of this note is to describe the connections between several notions of computational learning theory and model theory. The connection between probably approximately correct (PAC) learning and the non-independence property (NIP) is well-known and was originally noticed by Laskowski [8]. In the ensuing years, there have been numerous interactions between the combinatorics associated with PAC learning and model theory in the NIP setting. Below, we provide a quick introduction to the PAC-learning setting as well as learning in general. Our main purpose, however, is to explain a new connection between model theory and machine learning. Roughly speaking, our manuscript is similar to [8], but develops the connection between stability and online learning.
It is a remarkable fact that the combinatorial quantity of VC-dimension isolates both the main dividing line in PAC-learning and what is perhaps the second most prominent dividing line in model-theoretic classification theory (NIP/IP). This connection has been the subject of numerous works in recent years [5,6,7,11]. In the setting of online learning (described below), another combinatorial notion, the Littlestone dimension, isolates the dividing line between learnability and non-learnability of a concept class. Given how well-studied the connection between model theory and the combinatorics associated with machine learning is, it is surprising that it has not been noticed until now that this same combinatorial quantity isolates what is perhaps the most prominent dividing line in classification theory (stable/unstable).
Now we roughly describe the PAC setting, in part to contrast the setting with that of online learning. Given an infinite set X with a probability measure µ on X and a collection of measurable subsets of X, denoted by F, one attempts to "learn" a fixed but unknown F ∈ F by sampling from X. For some large n, n elements of X are randomly sampled, and the goal is to estimate the probability µ(F) by the proportion of elements of the sample which lie in F.
For some ε > 0 fixed ahead of time, we say that the sample estimates the set F ε-well if the proportion of elements of the sample which lie in F is within ε of µ(F). The class F is learnable if for any δ > 0 there is a large enough n such that the measure of the set of samples of size n (computed using the product measure µ^n) which estimate F ε-well is greater than 1 − δ. Roughly, for large enough sample size, we can get arbitrarily high likelihood that a sample estimates the true probability arbitrarily well. That is, for a large enough sample size, predictions are probably approximately correct. It turns out that there is a purely combinatorial characterization of F being PAC-learnable (which remarkably does not depend on the distribution µ); the collection F is PAC-learnable if and only if F has finite VC-dimension.
James Freitag was supported by NSF grant no. 1700095.
The connection to model theory is as follows: when X is taken to be M, a model of a first order theory T, and φ(x, y) is a formula in the language of T, we let F = {φ(M, a) | a ∈ M}. Then the VC-dimension of F is finite if and only if φ(x, y) is NIP.
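For a finite set system, the VC-dimension can be computed by brute force, which may help make the definition concrete. The following is only a sketch under stated assumptions: the function names `is_shattered` and `vc_dimension` and the two example families are illustrative and do not come from the sources cited here. The family of initial segments corresponds to the sets defined by x < a in a finite linear order.

```python
from itertools import combinations


def is_shattered(points, family):
    """True if every subset of `points` arises as (S intersect points) for some S in `family`."""
    traces = {s & points for s in family}
    return len(traces) == 2 ** len(points)


def vc_dimension(universe, family):
    """Largest size of a subset of `universe` shattered by `family` (brute force)."""
    universe = list(universe)
    family = [frozenset(s) for s in family]
    best = 0
    for k in range(1, len(universe) + 1):
        if any(is_shattered(frozenset(c), family) for c in combinations(universe, k)):
            best = k
        else:
            # Subsets of shattered sets are shattered, so no larger set can work.
            break
    return best


# Hypothetical examples on the universe {0, ..., 5}.
universe = range(6)
initial_segments = [set(range(k)) for k in range(7)]                   # {0, ..., k-1}
intervals = [set(range(i, j)) for i in range(7) for j in range(i, 7)]  # {i, ..., j-1}

print(vc_dimension(universe, initial_segments))
print(vc_dimension(universe, intervals))
```

The search is exponential in the size of the universe, so this is only meant for toy examples.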
In the most straightforward (and restrictive) setup of online learning, we are given an infinite set X (with no distribution) along with a collection F of subsets of X. The collection F is known to the learner. Fix some F ∈ F which is not known to the learner. Fixing some large n, there will be n rounds. In round i, an element x_i is selected, and the learner must predict the value of 1_F(x_i), that is, whether or not x_i is in the unknown set F. We call the value of the learner's prediction ŷ_i. The goal of online learning is to minimize the number of mistakes made during these predictions. In this setting, there is no assumption about how the elements x = (x_1, . . . , x_n) are chosen, and the choice of x_{i+1} is allowed to depend on the predictions made by the learner in the previous rounds. One seeks to minimize the number of mistakes over all possible sequences of samples. This setting of computational learning often arises when the data becomes available in sequential order or the data is chosen by a process which is assumed to be adversarial to the learner (a process or opponent seeking to make the number of mistakes large). Variations on how the samples are chosen are possible as well; for instance, a certain limited amount of randomness is often injected into how the elements x_i are chosen without moving the sampling back into the PAC context.
It turns out that the number of mistakes that the best deterministic algorithm makes (over all possible samples) can be bounded in terms of a combinatorial quantity associated with the collection F, the Littlestone dimension. When X is taken to be M, a model of a first order theory T, φ(x, y) is a formula in the language of T, and F = {φ(M, a) | a ∈ M}, the Littlestone dimension (also called thicket dimension) is precisely the Shelah 2-rank of φ(x, y), which is finite if and only if φ(x, y) is stable. A number of variants of this basic setup have much less restrictive assumptions (sometimes with a certain amount of randomness similar to the PAC setting) while also having the property that learnability is characterized by stability. In section 4 we will give an exposition of the various settings in which stability characterizes learnability.
It seems surprising to the authors that the connection pointed out in the previous paragraph has not been previously noticed, but the following quote of [15] offers something of an explanation:
"A reflection on the past two decades of research in learning theory reveals (in our somewhat biased view) an interesting difference between Statistical Learning Theory and Online Learning. In the former, the focus has been primarily on understanding complexity measures rather than algorithms... In contrast, Online Learning has been mainly centered around algorithms."
The dividing lines in model-theoretic classification theory are more naturally associated with combinatorial properties and the various complexity measures associated with PAC learning than with algorithms, and in the less restrictive online setups, the role of Littlestone dimension is perhaps somewhat more hidden than the role of VC-dimension in the PAC setup.
The correspondence between online learnability and stability is similar to the correspondence between PAC learnability and NIP, but it should be mentioned that the fields (online learning and stability theory) are in rather different positions than in the correspondence of PAC learning with NIP. At this point, stability theory has been extensively developed, while at the time of [8], the study of theories without the independence property was in its infancy and PAC learning was much more developed. Various notions from PAC learning eventually played a big role in the development of structural results for NIP structures. In the case of the correspondence between stability and online learning, there seems to be more potential for the application of model theoretic ideas in online learning. For instance, in the final sentence of [2], the authors mention that one of the main open questions in the theory is to close the gap between the lower bounds and upper bounds for the expected number of mistakes a learner makes in various online contexts, and that this question seems to have as a main obstacle a lack of interesting infinite concept classes with finite Littlestone dimension. Model theory offers a remedy for this obstacle; a great many mathematically interesting theories have been proven to be stable over the last forty plus years of classification theory, often with highly nontrivial proofs. So, following our discussion of online learning, we give some prominent examples of stable theories, giving various new examples of classes of finite Littlestone dimension.
Now we describe the organization of this manuscript. In section 2, we describe the setting of computational learning in very general terms. In section 3 we specialize to the PAC setting. In section 4, we specialize to the setting of online learning before describing several variants. In the final section, we survey some stable theories, and use the connection pointed out earlier in the paper to give many new examples of classes with finite Littlestone dimension.
1.1. Acknowledgements. The authors would like to thank Siddharth Bhaskar, Alex Kruckman, Dimitrios Diochnos, Dave Marker, Lev Reyzin, Dhruv Mubayi, Maryanthe Malliaris, and Gyorgy Turan for useful suggestions and conversations during the preparation of this article.

MACHINE LEARNING GENERALITIES
In this section, we describe the generalities of machine learning in quite a general setup, while mentioning the cases of particular interest to us. Let Y be a set, which we will call the set of labels. Let Y′ be another set, which we will refer to as the predictions. Fix a function L : Y × Y′ → R, which we call the loss function. The example of main interest to us is Y = Y′ = {0, 1} with L(y, y′) = |y − y′|. Another common example occurs when Y = Y′ = I ⊆ R, with I a bounded interval. In this case, a common loss function is given by L(y, y′) = (y − y′)^2. Settings in which Y, Y′ ⊂ R are sometimes called margin-based. These settings are less natural to connect directly to model theory, though it might make sense to study margin-based machine learning in the context of continuous model theory [20].
Let X be another set, which we call the set of examples (also sometimes called inputs or instances). A concept is a map c : X → Y. In the example given above with Y = {0, 1}, a concept is simply a subset of X. A concept class C is a collection of concepts.
Fix some concept c. The learner will make a series of predictions about a sample of inputs from X by selecting a prediction ŷ_i for the label of each element x_i from the sample. The learner incurs a loss for each element x_i of the sample, by evaluating L(c(x_i), ŷ_i). If the elements of the sample are indexed by the set I, then the total loss incurred is given by
\[ \sum_{i \in I} L(c(x_i), \hat{y}_i). \]
The goal of the learner is always the same: minimize the total loss coming from making predictions about a series of elements of X. Besides the objects described above, the differences in various settings of learning theory are derived from the assumptions about what data the learner has available and how the elements of the sample are chosen.

PAC-LEARNING AND NIP
In this section, we will quickly explain the connection between PAC learning and NIP. Our presentation essentially follows [6]. Let µ be a probability measure on X such that each element of C is measurable. We will think of the learner as having complete knowledge of the elements of C, and the elements for a sample being drawn randomly with respect to the distribution given by µ.
Here one should think that G is a function being used to generate predictions, while the error is the probability that the next prediction is incorrect. We say that C is probably approximately correct learnable (PAC-learnable) if there is a G : C_fin → 2^X such that for all ε > 0 and all δ > 0, there is N_{ε,δ} ∈ N such that for all f ∈ C, and all µ on X such that all elements of C are measurable,
\[ \mu^{N_{\varepsilon,\delta}}\bigl(\{\bar a \in X^{N_{\varepsilon,\delta}} : \text{the error of } G(f|_{\bar a}) \text{ exceeds } \varepsilon\}\bigr) < \delta, \]
where µ^{N_{ε,δ}} is the product measure. That is, the probability that the error is high (bigger than ε) is small (less than δ). Supposing that the class C is PAC-learnable, there is a minimal N_{ε,δ} for which the inequality holds, which is called the sample complexity.
The following theorem establishes the connection between VC-dimension and PAC-learnability:
Theorem 3.1. Let C be a concept class on X. Then the following are equivalent:
(1) C has finite VC-dimension.
(2) C is PAC-learnable, and
In fact, even more is true: if C is PAC-learnable with sample complexity N_{ε,δ}, then one can show that the expected value of the function ā
In the years since Laskowski's paper [8], connections between VC theory and NIP have developed extensively, with important notions from VC theory adapted to the model-theoretic setting and vice versa [5,6,7,11].

ONLINE LEARNING AND STABILITY
The initial setting of online learning which we describe is due to Littlestone [9]; the particular setting received relatively little attention, perhaps due to the very strong assumptions ([9] is in fact famous for several other contributions). Littlestone's work was generalized in various ways in the ensuing years, with the assumptions being significantly weakened. We will begin with the original setup of [9], and eventually describe two settings laid out in [2]. First, we set up some of the combinatorial notions pertinent in each of the settings we consider.
The next several definitions follow the notation and terminology of Bhaskar [3].

Definition 4.1.
A binary element tree of height h, denoted by T_h, is a rooted complete binary tree of height h whose non-leaf vertices are labeled by elements of the set X and whose leaves are labeled by elements of C (see Figure 1).
For the following definitions, fix a binary element tree of height h.

Definition 4.2.
A vertex v_1 is below a vertex v_2 if v_2 lies on the (unique) path from v_1 to the root of the tree. We say that v_1 is left-below v_2 if v_1 is below v_2 and the first edge along the path from v_2 to v_1 goes down and to the left.
Figure 1. A binary element tree of height three. Here a_i ∈ X and X_i ∈ F. The leaf labeled with X_4 is well-labeled if and only if a_1 ∈ X_4 and a_2, a_4 ∉ X_4. For all other a_i, there is no requirement about membership in X_4.
Definition 4.4. The thicket shatter function ρ_F : Z_{≥0} → Z_{≥0} is defined by letting ρ_F(n) be the maximum number of well-labeled leaves on a binary element tree of height n, T_n, whose leaves are labeled with elements of F. The thicket dimension Ldim(F) is the maximum integer n such that ρ_F(n) = 2^n.
Thicket dimension has appeared in several other contexts under different names; in fact Bhaskar [3] was aware of the terminology and definitions of [18], which we reproduce next:
Definition 4.5. Let M be a monster model of a complete L-theory. Fix a consistent partial type π(x) and a partitioned formula φ(x; y). Then the ordinal R(π, φ, 2), called the Shelah 2-rank, is defined as follows:
• R(π, φ, 2) ≥ 0.
• For any limit ordinal λ, R(π, φ, 2) ≥ λ if R(π, φ, 2) ≥ α for all α < λ.
• For any ordinal α, R(π, φ, 2) ≥ α + 1 if there is a tuple b such that R(π ∪ {φ(x; b)}, φ, 2) ≥ α and R(π ∪ {¬φ(x; b)}, φ, 2) ≥ α.
In general, R(π, ∆, 2) can also be defined for a finite collection of formulas ∆, but this case can be shown to reduce to the case of a single formula. The formula φ(x, y) is stable if and only if R(∅, φ, 2) is finite [18]; a theory is stable if every formula is stable. It is reasonably clear that R(π, φ, 2) is the thicket dimension of the set system on M^{|y|} given by the collection of sets {φ(b, M) | b ∈ π(M)}; for more details, see [3].
The thicket dimension first appeared in the context of learning theory in [9]; there the quantity came to be called the Littlestone dimension [2].
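For a finite set system, the Littlestone dimension can be computed directly. The sketch below does not build binary element trees; it assumes the standard recursive characterization instead: a nonempty class has dimension at least d + 1 exactly when some element x splits it into the sets containing x and the sets omitting x, both of dimension at least d. The name `littlestone_dimension` and the convention that the empty class has dimension −1 are illustrative choices, not taken from the text.

```python
from functools import lru_cache


def littlestone_dimension(universe, family):
    """Littlestone (thicket) dimension of a finite set system, by recursion.

    Convention assumed here: the empty family has dimension -1.
    """
    points = tuple(universe)

    @lru_cache(maxsize=None)
    def ldim(fam):
        if not fam:
            return -1
        best = 0
        for x in points:
            inside = frozenset(s for s in fam if x in s)
            outside = fam - inside
            # x can label the root of a tree only if it genuinely splits the family.
            if inside and outside:
                best = max(best, 1 + min(ldim(inside), ldim(outside)))
        return best

    return ldim(frozenset(frozenset(s) for s in family))


# Hypothetical examples on the universe {0, ..., 6}.
initial_segments = [set(range(k)) for k in range(8)]   # sets {0, ..., k-1}
singletons = [{i} for i in range(7)]

print(littlestone_dimension(range(7), initial_segments))
print(littlestone_dimension(range(7), singletons))
```

On initial segments of a finite linear order the value grows with the length of the order (binary search produces a deep tree), whereas the VC-dimension of the same family stays at 1; this is the finite shadow of a formula that is NIP but unstable.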
4.1. The realizable case. Fix a set system C on a set X. Assume that Y = Y′ = {0, 1} and the loss function for a prediction ŷ and concept (that is, a set) X on input x is given by |ŷ − 1_X(x)|. Over all possible algorithms, we seek to minimize our loss, that is, the number of mistakes we make over n rounds of predictions. In the realizable case, we assume that X ∈ C, so that the true concept is among the set of concepts C accessible to the learner. There are no assumptions on the choices of the instances x_t. The goal is to minimize the worst case number of mistakes made by our predictions over all possible samples of the instances and choice of the concept. So, we seek to bound
\[ \sup_{X \in C,\; x_1, \ldots, x_n} \; \sum_{t=1}^{n} |\hat{y}_t - 1_X(x_t)|, \]
where ŷ_t is chosen by some deterministic algorithm.
For applications and for the purposes of discussing the bounds, one often views the entity selecting the instances x as antagonistic to the learner, and in our current simplified setting, bounding the worst case number of mistakes bounds the actual number of mistakes made when the antagonistic sampling entity has perfect information about the prediction process.
Theorem 4.6. [9] The worst case number of mistakes of any deterministic algorithm in the online learning setting with concept class C is at least the Littlestone dimension of C, and there is an algorithm that makes at most this many mistakes.
Remark 4.7. The algorithm which minimizes the number of worst-case mistakes in the above setting is referred to as the Standard Optimal Algorithm (SOA), and we describe it briefly here. Begin with V_0 = C. At each stage, the learner inductively defines V_i. At stage t, the learner receives x_t, and sets, for r = 0, 1,
\[ V_t^{(r)} = \{ Y \in V_{t-1} : 1_Y(x_t) = r \}. \]
The learner predicts ŷ_t = r which maximizes the Littlestone dimension of V_t^{(r)} (ties are predicted in some fixed manner, say ŷ_t = 0 in the case of a tie). Then the learner gets the value of 1_X(x_t) and realizes whether a mistake has been made. At this point, set
\[ V_t = V_t^{(1_X(x_t))}. \]
The essential point here is that if a mistake is made, it must be the case that the Littlestone dimension of V_t is strictly less than the Littlestone dimension of V_{t−1} (proving this is an easy exercise). Of course, this bounds the total number of mistakes which the algorithm can ever make under any choice of x by the Littlestone dimension.
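The procedure in Remark 4.7 can be sketched for a finite concept class as follows. This is an illustrative implementation under assumptions: the version space at stage t is split by membership of x_t, the prediction is the label whose half has larger Littlestone dimension (ties go to 0), and the Littlestone dimension is computed by the standard recursion with the empty class assigned −1. The names `ldim` and `soa_mistakes` are hypothetical.

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def ldim(family, points):
    """Littlestone dimension of a frozenset of frozensets; -1 for the empty family."""
    if not family:
        return -1
    best = 0
    for x in points:
        inside = frozenset(s for s in family if x in s)
        outside = family - inside
        if inside and outside:
            best = max(best, 1 + min(ldim(inside, points), ldim(outside, points)))
    return best


def soa_mistakes(family, points, target, instances):
    """Run the SOA against a fixed target and instance sequence; return the mistake count."""
    points = tuple(points)
    target = frozenset(target)
    version = frozenset(frozenset(s) for s in family)  # V_0 = C
    mistakes = 0
    for x in instances:
        split = {
            1: frozenset(s for s in version if x in s),      # concepts containing x
            0: frozenset(s for s in version if x not in s),  # concepts omitting x
        }
        # Predict the label whose half has larger Littlestone dimension; ties go to 0.
        prediction = 1 if ldim(split[1], points) > ldim(split[0], points) else 0
        truth = 1 if x in target else 0
        mistakes += int(prediction != truth)
        version = split[truth]  # keep only the concepts consistent with the revealed label
    return mistakes


# Hypothetical example: initial segments of {0, 1, 2}, every target, every ordering.
points = (0, 1, 2)
thresholds = [frozenset(range(k)) for k in range(4)]
worst = max(
    soa_mistakes(thresholds, points, target, order)
    for target in thresholds
    for order in permutations(points)
)
print(worst, ldim(frozenset(thresholds), points))
```

Because the learner is deterministic, taking the maximum over all targets and orderings simulates an adversary with full knowledge of the prediction rule, so in this toy case the worst-case mistake count can be compared directly with the Littlestone dimension.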

4.2. Learning from experts.
The case in which we assume that the true concept X lies in the class C accessible to the learner is often referred to as the realizable case of online learning. For various applications, this assumption is too strong (as are other assumptions from the previous subsection which we will deal with in later sections). In this section, we will explain a context of online learning which removes the realizability assumption.
The goal again is to minimize mistakes, but here, the minimization will be relative to a particular class of {0, 1}-valued functions, which we will call H. That is, we wish to minimize, for any sampling of instances, x = (x_1, . . . , x_T), the difference between the number of mistakes made by the learner and the minimal number of mistakes made by any of the functions in H. So, in this case, the loss function is taken to be
\[ \sum_{t=1}^{T} |\hat{y}_t - y_t| \;-\; \min_{h \in H} \sum_{t=1}^{T} |h(x_t) - y_t|, \]
where y_t denotes the actual label of x_t. Here one often thinks intuitively that the functions in H are experts making predictions, and the learner's job is to choose which expert's prediction to believe.
Littlestone and Warmuth [10] consider this problem in the case that H is finite via a probabilistic weighted majority algorithm. We will now describe their algorithm. At the outset, each of the N many experts {f_i}_{i=1}^N = H is assigned weight 1, and the weight of expert i at stage t will be denoted by w_i^t. We fix the learning rate η > 0, which dictates how much we discount the weight of an expert for providing incorrect advice. At each stage, the learner receives the expert advice, (f_1(x_t), . . . , f_N(x_t)), a tuple in {0, 1}^N. The learner predicts 1 with probability
\[ \frac{\sum_{i \,:\, f_i(x_t) = 1} w_i^t}{\sum_{i=1}^{N} w_i^t}. \]
Then once the actual value y_t is revealed, the weights are updated via
\[ w_i^{t+1} = w_i^t \, e^{-\eta\, |f_i(x_t) - y_t|}. \]
That is, those experts who were wrong see their weight drop by a factor of e^{−η}.
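A minimal sketch of the weighted majority procedure just described is given below. It assumes that the learner predicts 1 with probability equal to the fraction of the total weight carried by the experts advising 1, and that an expert's weight is multiplied by e^{−η} each time it errs; the function name, argument layout, and seeded random generator are illustrative choices rather than part of [10].

```python
import math
import random


def weighted_majority(expert_advice, labels, eta, rng=None):
    """Randomized weighted majority over N experts.

    expert_advice[t][i] in {0, 1} is the advice of expert i in round t,
    labels[t] in {0, 1} is the revealed label. Returns (predictions, final_weights).
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch reproducible
    weights = [1.0] * len(expert_advice[0])
    predictions = []
    for advice, y in zip(expert_advice, labels):
        total = sum(weights)
        # Probability of predicting 1: weight fraction of the experts advising 1.
        p_one = sum(w for w, a in zip(weights, advice) if a == 1) / total
        predictions.append(1 if rng.random() < p_one else 0)
        # Experts that were wrong are discounted by a factor exp(-eta).
        weights = [w * math.exp(-eta) if a != y else w for w, a in zip(weights, advice)]
    return predictions, weights


# Hypothetical data: expert 0 is always right, expert 1 always wrong, expert 2 always says 0.
labels = [1, 0, 1, 1]
advice = [[y, 1 - y, 0] for y in labels]
predictions, weights = weighted_majority(advice, labels, eta=0.5)
print(predictions, weights)
```

After a few rounds most of the weight sits on the expert with the fewest errors, which is why the learner's predictions track that expert with increasing probability.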
The expected value of the loss function of their algorithm with a sample of size T is
Here, the assumption that H is finite is often too strong for applications; however, [2] generalizes the setup to the case in which H is infinite, but of finite Littlestone dimension, proving:
Theorem 4.8. There is an algorithm such that for all h ∈ H and any sequence of instances x = (x_1, . . . , x_T),
In [2] it is also shown that no algorithm (even allowing randomization) can achieve an expected bound better than √(Ldim(H) T / 8). Closing the gap between the lower and upper bounds for the loss function (sometimes called regret in this context) is one of the main open problems mentioned in [2], where the authors remark that there are few known interesting examples of infinite classes with finite Littlestone dimension.
4.3. Bounded stochastic noise. Suppose that we work in the general setup from the previous section (again, not assuming realizability), but with a difference in the way we generate labels and measure mistakes. Suppose that there is a function h ∈ H such that the labels y_1, . . . , y_T are independent {0, 1}-valued random variables with the property that for all t, Pr(h(x_t) ≠ y_t) ≤ γ with γ ∈ (0, 1/2). This value γ will be called the noise rate. In this setting, one seeks to minimize the difference between the predictions and the output of the noisy function on the samples:
Note here that there are two sources of randomness: the choices of the algorithm may be randomized and the labels y_t are random variables. The expectation is taken with respect to both of these.
Theorem 4.9. For any concept class H, and any γ ∈ [0, 1/2), there is an algorithm (possibly randomized) so that for any h ∈ H, and a sequence of examples (x_1, y_1), . . . , (x_T, y_T) with each y_t a random variable as described above,
That is, the expected number of mistakes grows only logarithmically in the sample size. In [2], the authors give an example of a class H which shows that the left hand side of the inequality in the theorem is bounded below by Ω(Ldim(H) · ln(T)).

STABILITY THEORY
In this section, we use stability theory to point out various mathematically interesting examples of classes which have finite Littlestone dimension.We will assume some basic familiarity with first order logic, but we provide some reminders for the non-model theorist for whom this section is written.
Fix some complete theory T in a language L and let M be a monster model of T. The non-model theorist can loosely assume that M is a very large structure with the following property: for any small subset A ⊂ M (say of cardinality at most κ) and any tuple c in any model of T containing A, there is some b ∈ M such that tp(c/A) = tp(b/A). Here tp(c/A) denotes the collection of all first order formulas in the language L with parameters from A which are satisfied by c.
For n ∈ N, the space of types of n-tuples of M over some subset A ⊂ M is denoted by S_n(A). It comes naturally equipped with a topology in which the basic open sets correspond to first order formulas with parameters in A. Rather than considering all formulas, sometimes it is natural to restrict to the φ-type of a tuple, denoted tp_φ(c/A), the collection of instances of φ with parameters in A which hold of c. When φ(x; y) is a formula, the space of φ-types over A (treating the variables y as parameters) is denoted by S_φ(A).
The theory T is called κ-stable if for every set A ⊆ M with |A| ≤ κ, we have |S_n(A)| ≤ κ for all n ∈ N. The theory is stable if it is κ-stable for some κ ≥ |T|. Part of the utility of the notion is that it can be characterized in several disparate ways (this is not an exhaustive list):
Fact 5.1. [18] The following conditions are equivalent:
(1) T is κ-stable for some κ.
(2) For any countable set A ⊂ M, S_φ(A) is countable.
When κ in the first condition of the above definition is taken to be ℵ_0, the theory is (somewhat enigmatically) called ω-stable. Not every stable theory is ω-stable, even when making strong assumptions about various aspects of the language or structure. For instance, the theory of the integers where the language consists of the additive group operation as a binary function is stable, but not ω-stable.
Stability is one of the dividing lines (probably the most prominent one) which in certain contexts, model-theorists view as the border between "tame" and "wild" structures; stability allows for the development of various structural results, which are (often provably) impossible in the case of unstable theories.Stability has various non-obvious interactions with algebraic structure, and understanding these interactions has been the subject of a huge amount of model theoretic work over the past fifty years (for instance, there is a deep structure theory of stable groups [14]).
Consider the concept class C_φ on M^{|y|} given by the collection of sets {φ(b, M) | b ∈ π(M)}. The theory T is stable precisely if each concept class of this form has finite Littlestone dimension (see section 4 for an explanation).
We will elaborate on condition (4). Given a class C_φ, there is a natural bipartite graph G_φ associated with it. The sets of vertices consist of 1) the elements of the underlying set and 2) the concepts, with an edge between an element and a concept if and only if the element is in the concept. Finite Littlestone dimension of the concept class C_φ is equivalent to there being an upper bound on the size of any half-graph which appears as an induced subgraph of G_φ.
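To illustrate the half-graph condition on a finite set system, the following sketch searches for the longest sequence of elements a_1, . . . , a_k and concepts C_1, . . . , C_k with a_i ∈ C_j if and only if i ≤ j. This indexing convention is an assumption modeled on the order property, and the function name `max_half_graph` is hypothetical. The search is exhaustive and only suitable for very small examples.

```python
def max_half_graph(universe, family):
    """Largest k with elements a_1..a_k and sets C_1..C_k such that a_i in C_j iff i <= j."""
    points = list(universe)
    sets = [frozenset(s) for s in family]
    best = 0

    def extend(chosen_points, chosen_sets):
        nonlocal best
        best = max(best, len(chosen_points))
        for a in points:
            for c in sets:
                # The new pair gets the largest index: the new set must contain every
                # chosen element (including a), and a must avoid every earlier set.
                if (a in c
                        and all(p in c for p in chosen_points)
                        and all(a not in s for s in chosen_sets)):
                    extend(chosen_points + [a], chosen_sets + [c])

    extend([], [])
    return best


# Hypothetical examples on the universe {0, ..., 4}.
initial_segments = [set(range(k)) for k in range(6)]   # sets {0, ..., k-1}
singletons = [{i} for i in range(5)]

print(max_half_graph(range(5), initial_segments))
print(max_half_graph(range(5), singletons))
```

For initial segments of a linear order the half-graphs are as long as the order itself, matching the fact that the formula x < y has the order property, while for singletons (the formula x = y) no half-graph of size two exists.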

Examples of notable stable theories.
We now make a list (very far from comprehensive) of some notable stable theories and offer some explanation of the set systems (families of definable sets) which arise in the various settings.From our list, many mathematically interesting classes C φ with finite Littlestone dimension can be obtained.
(1) ACF, the theory of algebraically closed fields. By quantifier elimination for algebraically closed fields, the concept classes which appear as C_φ in the theory of algebraically closed fields are precisely the uniform families of affine constructible sets. That is, when f : V → W is a rational map (everything defined over some fixed algebraically closed field), the corresponding family of constructible sets is the collection of fibers of the function f. More concretely, one can think of such a family as being given by the solution sets of families of polynomial equations and inequations in x with parameters a, where x is a tuple of indeterminates and a is a tuple which varies over some constructible subset of A^{|a|}.
(2) DCF_0, the theory of differentially closed fields of characteristic zero, was first investigated by Robinson [16], and Blum [4] gave an elegant axiomatization from which it was straightforward to notice that the theory is stable. See [12] for a more comprehensive discussion of DCF_0, as we will be brief here. Differentially closed fields are universal domains for algebraic differential equations; that is, if a system of equations has a solution in some field of functions, it already has a solution in the differential closure of the field generated by the coefficients of the equations. By quantifier elimination for differentially closed fields, the concept classes which appear as C_φ in the theory of differentially closed fields are precisely the uniform families of constructible sets in the Kolchin topology (boolean combinations of the zero sets of algebraic differential equations). That is, when f : V → W is a differential rational map between affine constructible sets V, W in the Kolchin topology (everything defined over some fixed differentially closed field), the corresponding family of constructible sets is the collection of fibers of the function f. Such a family is alternatively given by a collection of differential equations and inequations in x with parameters a, where x is a tuple of indeterminates from M |= DCF_0 and a ∈ M is a tuple which varies over some Kolchin-constructible subset of A^{|a|}.
(3) The theory of separably closed fields of characteristic p ≠ 0 and fixed degree of imperfection e ∈ N (which we will describe here) is complete and was shown to be stable by Wood [19]. When a field F of characteristic p has no proper separable algebraic extensions, we say F is separably closed. A set B ⊆ F is a p-basis of F if the collection of products of powers of elements of B, of degree at most p − 1 in each element, forms a basis for F as an F^p-vector space. The cardinality of such a set B is called the degree of imperfection of F (which we assume to be finite). Now let {a_1, . . . , a_e} be a p-basis of F, and let {m_1, . . . , m_{p^e}} be the collection of monomials in {a_1, . . . , a_e} of degree at most p − 1 in each element. Every element x of F can be written uniquely in the form
\[ x = \sum_{i=1}^{p^e} x_i^p \, m_i, \]
where x_i ∈ F. For each element x_i in the above sum, we can repeat the process, writing
\[ x_i = \sum_{j=1}^{p^e} x_{(i,j)}^p \, m_j. \]
Naturally, one can continue to iterate this process, defining x_σ for any σ a finite tuple of elements from {1, . . . , p^e}. Let λ_σ be the unary function x → x_σ. Let L_{p,e} be the language {+, −, ·, ^{−1}, 0, 1} ∪ {a_1, . . . , a_e} ∪ {λ_σ : σ ∈ (p^e)^{<ω}}. The theory of separably closed fields of characteristic p with degree of imperfection e eliminates quantifiers in the language L_{p,e}. So, in one variable, definable sets correspond to boolean combinations of the zero sets of ideals in F[x, λ_σ(x)]_{σ ∈ (p^e)^{≤n}}, for some n.
(4) Let X be a compact complex manifold. Consider the structure A(X) where the basic relations are the complex analytic subsets of X^n for any n ∈ N; we call a subset A ⊆ X^n complex analytic if for any point p ∈ X^n there is a neighborhood U of p such that A ∩ U is the zero set of finitely many holomorphic functions on U. The model theory of compact complex manifolds began with Zilber's observation [21] that if one adds as a relation all complex analytic subsets of X^n for all n, then the induced structure is stable. For an overview of the model theory of compact complex manifolds, see [13].
(5) Let R be a ring and L_R be the language of right R-modules, consisting of a symbol for addition and a unary function f_r for each r ∈ R, which is interpreted as scalar multiplication by r. Let T be any complete theory of right R-modules in the language L_R. By a result of Baur [1], every formula φ(x) is equivalent to a boolean combination of positive primitive formulas, that is, formulas of the form ∃y ψ(x, y), where ψ is a conjunction of atomic formulas. In particular, every definable subset of an R-module M is a boolean combination of cosets of positive primitive definable subgroups of M. An abelian group can be viewed as a Z-module, and from this characterization of definable sets, it is not hard to show that every abelian group has a stable theory in the language of groups.
(6) The theory of the nonabelian free group T_{fg} in the language of groups was shown to be stable by Sela [17] (Sela shows the same for any torsion-free hyperbolic group). Every formula in the language of groups is, modulo the theory of the free group, equivalent to a ∀∃-formula. The strategy of the proof is complicated and is developed by Sela over a series of seven previous papers; see [17] for complete references.

Definition 4.3.
Here v_1 is left-below v_2 when the first edge along the path from v_2 to v_1 goes down and to the left. The notion of right-below is defined analogously. When a vertex labeled by b is left-below a vertex labeled by a, we write a <_L b. Similarly, when a vertex labeled by b is right-below a vertex labeled by a, we write a <_R b. A leaf labeled by Y ∈ C is said to be well-labeled if for each vertex above Y, say labeled by a, a ∈ Y if and only if a <_L Y.
Fact 5.1 (continued).
(3) Every formula φ(x; y) has finite Shelah 2-rank; that is, R(∅, φ, 2) is a finite ordinal (recall that Shelah 2-rank is equal to Littlestone dimension).
(4) No formula φ(x; y) has the order property. A formula φ(x; y) has the order property if there are tuples (a_1, b_1), (a_2, b_2), . . . from M so that M |= φ(a_i; b_j) if and only if i ≤ j.