Approximation by DNF: Examples and Counterexamples



1 Introduction

1.1 Definitions
This paper is concerned with approximating boolean functions f : {0,1}^n → {0,1} by DNF formulas of small size. Let us first give the requisite definitions.
Circuits: We will consider single-output circuits composed of unbounded fan-in AND and OR gates over the input literals (inputs and negated inputs). The size of a circuit is the number of AND and OR gates it contains, and the depth of the circuit is the number of AND and OR gates on the longest path from an input bit to the output gate. We will also make the not completely standard definition that the width of a circuit is the maximum, over all AND and OR gates, of the number of literals feeding into the gate.
We will only be concerned with constant-depth circuits in this paper, and we will be particularly interested in depth 2. We assume circuits of depth 2 are always given by an OR of ANDs of literals, in which case they are DNFs, or by an AND of ORs of literals, in which case they are CNFs. The ANDs in a DNF are called its terms and the ORs in a CNF are called its clauses.
Finally, we will often identify a circuit over n input bits with the boolean function {0,1}^n → {0,1} that it computes.
Approximation: Given two functions f, g : {0,1}^n → {0,1}, we will say that f ε-approximates g, or f is an ε-approximator for g, if the fraction of inputs in {0,1}^n on which they disagree is at most ε. We will also write this as Pr_x[f(x) ≠ g(x)] ≤ ε, with the convention that boldface letters are random variables, and that they are drawn from the uniform distribution on {0,1}^n unless otherwise specified.
We will later need the following well-known observation, showing that small-size circuits are well approximated by small-width circuits:

Observation 1.1 If C is a circuit of size s, then for every ε > 0 there is a "simplification" C′ of C that ε-approximates C and has width at most log(s/ε). By "simplification" we mean that C′ is obtained from C by replacing some of its gates with constants, so that C′ has size and depth no more than those of C, and C′ is a DNF (respectively, CNF) if C is.
Proof: Consider any gate in C that is connected to at least log(s/ε) input literals. If such a gate is an AND gate, replace it with a 0; if it is an OR gate, replace it with a 1. This gives us C′, which clearly has width at most log(s/ε). Now on a uniformly random input, the probability that a particular replacement affects C's computation is at most 2^{−log(s/ε)} = ε/s. Since C has at most s gates, the probability that any replacement affects its computation is at most ε, by the union bound. Thus C′ is an ε-approximator for C. □
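The simplification of Observation 1.1 is easy to carry out on an explicit DNF representation. The sketch below is ours, not from the paper: terms are sets of signed 1-indexed literals (+i for x_i, −i for its negation), and the example DNF is illustrative.

```python
import itertools
import math

def simplify_dnf(terms, eps):
    """Observation 1.1 for DNFs: replace any AND wider than log2(s/eps)
    by the constant 0, i.e. simply drop that term."""
    s = len(terms)
    width_cap = math.log2(s / eps)
    return [t for t in terms if len(t) <= width_cap]

def eval_dnf(terms, x):
    # a term is a set of signed literals: +i means x_i, -i means NOT x_i
    return any(all((x[abs(l) - 1] == 1) if l > 0 else (x[abs(l) - 1] == 0)
                   for l in t) for t in terms)

# example DNF on 6 variables with one over-wide term
terms = [{1, 2}, {3, -4}, {-1, 3, 4, 5, 6}]
simple = simplify_dnf(terms, eps=0.25)   # width cap = log2(3/0.25) ~ 3.58

# measure the exact fraction of disagreement by enumeration
n = 6
disagree = sum(eval_dnf(terms, x) != eval_dnf(simple, x)
               for x in itertools.product([0, 1], repeat=n)) / 2 ** n
```

Here the wide term fires only on the 2 inputs with x_1 = 0 and x_3 = x_4 = x_5 = x_6 = 1, so the simplification disagrees on a 2/64 fraction, well under ε = 0.25.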

1.2 Approximation by DNF
DNF formulas are one of the simplest and most natural representation classes for boolean functions. Although every function can be computed by a DNF, some functions on n bits may require DNFs of size Ω(2^n). The natural question we pursue in this paper is whether this size can be significantly reduced for a given function if we are only required to ε-approximate it, for some small constant ε. Positive results along these lines would have interesting applications in several research areas, including computational learning theory and the study of threshold phenomena in random graphs; these topics will be discussed in Sections 1.3 and 1.4, respectively. However there do not seem to be many results on either upper or lower bounds for approximation by DNF in the literature.
A notable conjecture in this area was made 8 years ago by Benjamini, Kalai, and Schramm [BKS99] (published again in [Kal00, KS05]). To describe this conjecture, which we call the BKS Conjecture, we need to recall the notion of total influence [KKL88, LMN93]. The influence of the ith coordinate on f is Inf_i(f) := Pr_x[f(x) ≠ f(σ_i x)], where σ_i x denotes x with its ith bit flipped. The total influence (or average sensitivity) of f is I(f) := Σ_{i=1}^{n} Inf_i(f) = E_x[#{y ∼ x : f(y) ≠ f(x)}], where the notation y ∼ x means that the Hamming distance between y and x is 1.
The total influence is an important measure of the complexity of a function, used frequently in learning theory, threshold phenomena, and complexity theory. One important result to note is that constant-depth circuits of small size have small total influence:

Theorem 1.3 Let f be computed by a circuit of depth d and size s. Then I(f) ≤ O(log^{d−1} s).

This was first proved by Boppana [Bop97], tightening an argument made by Linial, Mansour, and Nisan [LMN93] based on Håstad's Switching Lemma [Hås86]. Note that the d = 2 case of this theorem is quite easy, building on the simple result that I(f) ≤ 2w for any f computable by a DNF of width w.
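Both quantities here are easy to compute by brute force on small examples; the sketch below (the toy functions are ours) checks I(Maj_3) = 3/2 and the bound I(f) ≤ 2w for a width-2 DNF.

```python
import itertools

def total_influence(f, n):
    """I(f) = sum_i Pr_x[f(x) != f(sigma_i x)], by full enumeration."""
    I = 0.0
    for i in range(n):
        flips = sum(f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:])
                    for x in itertools.product((0, 1), repeat=n))
        I += flips / 2 ** n
    return I

def maj3(x):
    return 1 if sum(x) >= 2 else 0

# Maj on 3 bits: each coordinate has influence 1/2, so I = 3/2
I_maj = total_influence(maj3, 3)

# a width-2 DNF on 4 bits: (x0 AND x1) OR (x2 AND NOT x3); check I <= 2w = 4
def dnf(x):
    return 1 if (x[0] and x[1]) or (x[2] and not x[3]) else 0

I_dnf = total_influence(dnf, 4)
```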
We can now state Benjamini, Kalai, and Schramm's conjecture, which essentially asserts a converse to Theorem 1.3 for monotone functions:

BKS Conjecture There is a function K = K(ε) < ∞ such that the following holds: every monotone boolean function f can be ε-approximated by a circuit of depth d and size s, for some d and s satisfying s ≤ exp((K · I(f))^{1/(d−1)}).

Observation 1.1 implies that the BKS Conjecture could also add the condition that the width is at most (K · I(f))^{1/(d−1)}, without loss of generality.
If this conjecture were true it would be an important characterization of monotone functions with small total influence; if it were further true with d fixed to 2 it would yield very interesting positive results for approximation by DNF (or CNF).

1.3 Approximating Majority by DNF
Suppose the BKS Conjecture were true even with d fixed to 2. This would imply that for every constant ε > 0, every monotone function f could be ε-approximated by a DNF or CNF of size exp(O(I(f))). Using Observation 1.1, we could further make the width of the approximator O(I(f)). One reason to hope that this is true is that it is true, even for non-monotone functions, if one allows a more powerful class of depth-2 circuits:

Definition 1.4 A TOP ("threshold of parities" [Jac95]) is a depth-2 circuit with Parity gates at the lower level and a Majority gate on top.
Proposition 1.5 For all ε > 0, every boolean function f is ε-approximated by a TOP of width O(I(f)/ε).

This proposition was shown in [KKL88, LMN93] by relating the total influence of a function to its Fourier spectrum.
TOP circuits arise frequently as the hypothesis class in many uniform-distribution learning algorithms. Examples include Linial, Mansour, and Nisan's algorithm for learning depth-d size-s circuits [LMN93], Jackson's Harmonic Sieve algorithm for learning polynomial-size DNFs [Jac95], Bshouty and Tamon's algorithm for learning monotone functions [BT96], and O'Donnell and Servedio's algorithm for learning monotone polynomial-size decision trees [OS06]. (Incidentally, except for Jackson's algorithm, all of these proceed by proving upper bounds on total influence.) An open question in learning theory is whether these algorithms (especially Jackson's DNF algorithm) can be made to use the simpler class of DNFs as their hypothesis class.
This suggests the idea of trying to approximate TOPs by DNFs. By Proposition 1.5, approximating TOPs by DNFs could also be considered a way of attacking the BKS Conjecture. Now the Parities in a TOP could be converted to DNFs or CNFs of no greater width. But how to approximate the Majority gate by a small DNF or CNF is an interesting question. We solve the problem of ε-approximating Majority by DNFs in Sections 2 and 3. Unfortunately, the size necessary is too large to give good approximations of TOPs.
The question of computing Majority by small circuits has a long and interesting history. Significant work has gone into computing Majority with small circuits of various sorts [PPZ92, AKS83, HMP06, Bop86, Val84]. Some of this work involves the subproblem of constructing small circuits for "approximate-Majority", i.e., circuits that correctly compute Majority whenever the number of 1's in the input is at least a 2/3 fraction or at most a 1/3 fraction. Note that this notion of approximation is not at all the same as our notion. Constructions of constant-depth circuits for this "approximate-Majority" have had important consequences for complexity theory [Ajt83, Ajt93, Vio05]. It seems, however, that no paper has previously investigated the existence of small constant-depth circuits for Majority that are ε-approximators in our sense.
Our result on this topic is the following, which follows from the main results proved in Sections 2 and 3:

Theorem 1.6 For every constant 0 < ε < 1/2, the Majority function on n bits can be ε-approximated by a DNF of size exp(O(√n)), and this is best possible up to the constant in the exponent.
Note that the following fact is well known:

Proposition 1.7 I(Maj_n) = Θ(√n).

Thus Theorem 1.6 shows that the BKS Conjecture with d fixed to 2 is true for the Majority function.
Our proof of the upper bound in Theorem 1.6 is by the probabilistic method; we essentially use the random DNF construction of Talagrand [Tal96]. Our proof of the lower bound in Theorem 1.6 uses the Kruskal-Katona Theorem to show that even ε-approximators for Majority must have total influence Ω(√n); the lower bound then follows from Theorem 1.3:

Theorem 1.8 Let 0 ≤ ε < 1/2 be a constant and let f be an ε-approximator for Maj_n. Then any depth-d circuit computing f requires size exp(Ω(n^{1/(2d−2)})).
For a discussion of why switching lemmas do not seem to provide any help in proving lower bounds on ε-approximating DNFs, please see Appendix B.

1.4 Threshold phenomena and the BKS Conjecture
One of the main motivations behind the BKS Conjecture is to provide general conditions under which a monotone function has large total influence. Benjamini, Kalai, and Schramm made their conjecture in the context of problems about threshold phenomena and noise sensitivity in random graphs. There, proving lower bounds on total influence is important, as the total influence relates to certain "critical exponents" in percolation problems, and it also captures the sharpness of "thresholds" for graph properties.
To understand the connection to threshold phenomena, consider the Erdős-Rényi random graph model on v vertices, and write n = (v choose 2). Now a boolean string in {0,1}^n can be identified with a graph, and a boolean function f : {0,1}^n → {0,1} can be identified with a collection of graphs. We say that f is a graph property if it is closed under permutations of the v vertices. Suppose f is a nontrivial monotone graph property (i.e., f is a monotone function that is not constantly 0 or 1). Then as we increase the edge probability p from 0 to 1, the probability that a random graph from the p-biased distribution on {0,1}^n satisfies f increases continuously from 0 to 1. Hence there will be a critical probability p* where the probability of a random graph satisfying f is 1/2. It is of great interest to understand how rapidly the probability of satisfying f jumps from near 0 to near 1 in the interval around p*. The Russo-Margulis Lemma [Mar74, Rus78] shows that (d/dp) Pr_{x∼p}[f(x) = 1] = I_p(f), for an appropriate p-biased definition of total influence I_p. It follows that a graph property having a "sharp" threshold corresponds to it having large total influence.
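The Russo-Margulis identity can be checked numerically on a toy function. Below, for Maj on 3 bits, the derivative of μ(p) = Pr_p[Maj = 1] matches the p-biased total influence Σ_i Pr_p[f(x) ≠ f(σ_i x)]; this sketch assumes the unweighted form of the lemma for monotone f.

```python
import itertools

def maj3(x):
    return 1 if sum(x) >= 2 else 0

def p_weight(x, p):
    # probability of string x under the p-biased product distribution
    return (p ** sum(x)) * ((1 - p) ** (len(x) - sum(x)))

def mu(p):
    # mu(p) = Pr_p[Maj3 = 1]; in closed form this is 3p^2 - 2p^3
    return sum(p_weight(x, p) * maj3(x)
               for x in itertools.product((0, 1), repeat=3))

def I_p(p):
    # p-biased total influence: sum over i of Pr_p[flipping bit i changes f]
    total = 0.0
    for i in range(3):
        for x in itertools.product((0, 1), repeat=3):
            if maj3(x) != maj3(x[:i] + (1 - x[i],) + x[i + 1:]):
                total += p_weight(x, p)
    return total

p = 0.3
deriv = (mu(p + 1e-6) - mu(p - 1e-6)) / 2e-6   # numeric d/dp at p = 0.3
```

At p = 0.3 both sides equal 6p(1 − p) = 1.26, matching the Russo-Margulis Lemma.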
A celebrated theorem of Friedgut [Fri99] provides a version of the depth-2 BKS Conjecture in the context of graph properties with p* = o(1):

Friedgut's Theorem There is a function K = K(C, ε) < ∞ such that the following holds: if f is a monotone graph property with critical probability p* = o(1) and I_{p*}(f) ≤ C, then f can be ε-approximated (with respect to the p*-biased distribution) by a DNF of width at most K.

This result has been used with great success to show that various natural graph properties, and also random k-SAT problems, have sharp thresholds (see, e.g., [Fri05]); one proves this essentially by showing that the property cannot be well approximated by a small-width DNF.
The relationship between sharp thresholds and large total influence continues to hold in the context of general monotone boolean functions (i.e., not necessarily graph properties). Indeed, there has been significant interest in trying to extend Friedgut's Theorem to the general, no-symmetry case. The BKS Conjecture is one proposal for such an extension (in the case of p* = 1/2). It is weaker than Friedgut's Theorem in that it allows for approximating circuits of depth greater than 2. However the BKS Conjecture's size/width bound for d = 2 is very strong, essentially matching Friedgut's Theorem: it states that in the d = 2 case, K may be taken to have a linear dependence on I(f).
Some partial progress has been made towards proving Friedgut's Theorem in the case of general monotone boolean functions. In an appendix to Friedgut's paper, Bourgain [Bou99] proved a structural result about boolean functions with small p-biased total influence; he used this to show that when f is monotone and p* = o(1), there is a term of width O(C) that has exp(−O(C^2))-correlation with f. Friedgut himself later showed [Fri98] that his theorem can be extended to general functions, even non-monotone ones (assuming p* is bounded away from 0 and 1), at the expense of taking K(C, ε) = exp(O(C/ε)).
However it turns out that these generalizations cannot be taken too far: our main result in Section 4 is that the BKS Conjecture is false. Specifically, we show:

Theorem 1.9 There is a monotone function F : {0,1}^n → {0,1} with total influence I(F) ≤ O(log n) satisfying the following: any DNF or CNF that .01-approximates F requires width Ω(n/log n) and hence size 2^{Ω(n/log n)}; and any circuit that .01-approximates F requires size Ω(n/log n).
This rules out the BKS Conjecture. In particular, it shows that Friedgut's Theorem cannot be proved for general monotone functions (in the p* = 1/2 case) unless one takes K(C, .01) ≥ exp(Ω(C)). We remark that the function F used in the theorem is computed by a polynomial-size, depth-3 circuit.

2 Approximating Majority
In this section we give a construction of a DNF of size 2^{O(√n/ε)} that ε-approximates Majority on n bits. In the next section we will show this result is optimal up to the constant in the exponent. Our construction is by the probabilistic method, inspired by the random DNF construction of Talagrand [Tal96]:

Theorem 2.1 For any 0 < ε ≤ 1/2, there is a DNF of width w = (1/ε)√n and size (ln 2)2^w which is an O(ε)-approximator for Maj_n.
Proof: Let D be a randomly chosen DNF with (ln 2)2^w terms, where each term is an AND of w variables chosen independently with replacement. It suffices to show that

Pr_{D,x}[D(x) ≠ Maj(x)] ≤ O(ε),   (2)

because then a particular D must exist which is an O(ε)-approximator for Maj_n. Given a string x ∈ {0,1}^n, let us write the fraction of 1's in the string as (1/2)(1 + t/√n); the distribution of t is close to standard normal, by the Central Limit Theorem. We have Maj(x) = 1 iff t > 0, and furthermore, by construction, Pr_D[D(x) = 1] depends only on t. Indeed,

Pr_D[D(x) = 0] = (1 − 2^{−w}(1 + t/√n)^w)^{(ln 2)2^w}.

(We chose the size to be (ln 2)2^w so that this quantity would go to 1/2 as t goes to 0.) So to show (2), it suffices to show that

E_t[Pr_D[D(x) = 0] · 1{t > 0}] ≤ O(ε)   (3)

and

E_t[Pr_D[D(x) = 1] · 1{t < 0}] ≤ O(ε).   (4)

For (3) we use (1 − y)^r ≤ exp(−yr) (for 0 ≤ y ≤ 1, r > 0) to get Pr_D[D(x) = 0] ≤ exp(−(ln 2)(1 + t/√n)^w) ≤ exp(−(ln 2)(1 + t/ε)), where we also are using w = (1/ε)√n and the assumption t > 0. For (4) we use 1 − (1 − y)^r ≤ yr to get Pr_D[D(x) = 1] ≤ (ln 2)(1 + t/√n)^w ≤ (ln 2)exp(t/ε), where we used the assumption t < 0 and (1 − y)^r ≤ exp(−yr) again. Hence to prove (3) and (4), it remains to show

E_t[exp(−(ln 2)t/ε) · 1{t > 0}] ≤ O(ε) and E_t[exp(t/ε) · 1{t < 0}] ≤ O(ε).   (5)

It is relatively easy to check that this holds when t is truly normally distributed. With its actual rescaled binomial distribution, we use a pointwise Gaussian bound on the probability mass of t, which follows from the Berry-Esseen Central Limit Theorem; summing the resulting geometric-type series over the lattice of possible values of t proves (5). □
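The construction can be sanity-checked by Monte Carlo simulation. The parameters below (n = 25, ε = 0.5, fixed seed) are ours, chosen small for speed, and the agreement threshold in the check is deliberately loose.

```python
import math
import random

random.seed(0)
n, eps = 25, 0.5
w = round(math.sqrt(n) / eps)            # width (1/eps) * sqrt(n) = 10
size = round(math.log(2) * 2 ** w)       # (ln 2) * 2^w terms

# each term: w variable indices chosen independently with replacement
terms = [[random.randrange(n) for _ in range(w)] for _ in range(size)]

def D(x):
    # the random monotone DNF: OR over terms of the AND of their variables
    return any(all(x[i] for i in t) for t in terms)

def maj(x):
    return sum(x) > n // 2

# estimate the agreement of D with Maj on uniformly random inputs
samples = 1000
agree = sum(D(x) == maj(x)
            for x in ([random.randrange(2) for _ in range(n)]
                      for _ in range(samples))) / samples
```

With these tiny parameters the empirical agreement is already well above chance, consistent with the O(ε) error guarantee kicking in as n grows.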

3 A Lower Bound for Majority, via Total Influence
The main result in this section shows that corrupting the Majority function, Maj_n, even on a large fraction of strings cannot decrease its total influence very much:

Theorem 3.1 Let f : {0,1}^n → {0,1} be an ε-approximator for Maj_n, for some constant ε < 1/2. Then I(f) ≥ Ω(√n).

As mentioned in Proposition 1.7, the total influence of Maj_n is Θ(√n). Thus Boppana's relation, Theorem 1.3, implies the following:

Corollary 3.2 For any constant ε < 1/2, every ε-approximator for Maj_n with depth d requires size at least exp(Ω(n^{1/(2d−2)})).
In particular, any ε-approximating DNF for Majority requires size at least exp(Ω(√n)).
This matches the upper bound we proved in Theorem 2.1, up to the constant in the exponent.
The remainder of this section is devoted to proving Theorem 3.1. In Appendix A we include an alternate proof of the following much weaker statement: if ε > 0 is sufficiently small, then I(f) ≥ Ω(√n) for all f that ε-approximate Majority.
The first basic fact we will need is that we can assume without loss of generality that the approximators f are monotone:

Lemma 3.3 Let g be a monotone function and let f be an ε-approximator for g. Then there is a monotone function f′ that ε-approximates g and satisfies I(f′) ≤ I(f).
Proof: Recall the combinatorial shifting operators κ_1, ..., κ_n introduced by Kleitman [Kle66]; the operator κ_i applied to f yields the function given by (κ_i f)(x) = f(x) ∨ f(σ_i x) if x_i = 1, and (κ_i f)(x) = f(x) ∧ f(σ_i x) if x_i = 0, where σ_i x denotes the string x with the ith coordinate flipped. Let f′ = κ_1 κ_2 ··· κ_n f; it is well known that this makes f′ a monotone function. The fact that I(f′) ≤ I(f) follows because the κ_i operators never increase total influence [BOL90]. Finally, it is easy to see that the κ_i operators can only improve approximation with respect to a monotone function g; this shows that f′ is an ε-approximation for g. □

The second basic fact we'll need involves the following definition:

Definition 3.4 Given f : {0,1}^n → {0,1}, we define C(f) to be the expected number of "correct" bits in a random string x; i.e., C(f) = E_x[#{i : x_i = f(x)}].

Lemma 3.5 If f is monotone, then C(f) = n/2 + I(f)/2.

Proof: It suffices to show that Pr_x[x_i = f(x)] = 1/2 + Inf_i(f)/2 for each i. When x is chosen randomly, there is an Inf_i(f) chance that the bit x_i is influential for f, i.e., f(x) ≠ f(σ_i x). In this case x_i is correct with probability 1; this is because f is monotone, so f(x) = x_i whenever x_i is influential. With probability 1 − Inf_i(f) the bit x_i is not influential for f; in this case x_i is correct with probability 1/2, since exactly one of the two possibilities for x_i agrees with f's constant value on the pair {x, σ_i x}. Thus the overall probability that x_i is correct is Inf_i(f) · 1 + (1 − Inf_i(f)) · 1/2 = 1/2 + Inf_i(f)/2. □

We will now prove Theorem 3.1 under an assumption, namely, that f only disagrees with Maj on strings where Maj is 0; i.e., f ≥ Maj. (Recall that we are assuming f is monotone.) We will show later that we can get rid of this assumption.
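The identity C(f) = n/2 + I(f)/2 for monotone f can be verified exhaustively on small examples; the toy functions below are ours.

```python
import itertools

def flip(x, i):
    return x[:i] + (1 - x[i],) + x[i + 1:]

def C(f, n):
    """Expected number of 'correct' bits: E_x #{i : x_i = f(x)}."""
    return sum(sum(x[i] == f(x) for i in range(n))
               for x in itertools.product((0, 1), repeat=n)) / 2 ** n

def I(f, n):
    """Total influence by enumeration over all (x, i) pairs."""
    return sum(f(x) != f(flip(x, i))
               for x in itertools.product((0, 1), repeat=n)
               for i in range(n)) / 2 ** n

maj3 = lambda x: 1 if sum(x) >= 2 else 0   # monotone, n = 3
and2 = lambda x: x[0] & x[1]               # monotone, n = 2
```

For Maj_3, C = 3/2 + (3/2)/2 = 9/4; for AND on 2 bits, C = 1 + 1/2 = 3/2.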
We will define some functions. Define M_0 := Maj. For an integer 0 ≤ t ≤ 2^{n−1}, M_t will agree with M_{t−1} on all strings except one: the "largest" string in M_{t−1}^{−1}(0), where "largest" is with respect to the ordering of strings interpreted as binary integers. We will view these functions as a process, where strings are being added to M_t^{−1}(1) as t increases. We will refer to the unique string x such that M_t(x) = 1 and M_{t−1}(x) = 0 as the string added at time t. For example, if n = 5, the first few strings added are 11000, 10100, 10010, 10001, 10000, 01100, etc. We also define w_{j,t} to be the "largest" string x with |x| = j such that M_t(x) = 0; that is, w_{j,t} is the next string of Hamming weight j whose value will become 1 as t increases.
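Since every superset of a string is larger as a binary number, the "largest remaining string" rule amounts to sorting the strings in Maj^{−1}(0) in decreasing numeric order. A quick sketch (ours) reproduces the n = 5 example above:

```python
from itertools import product

n = 5
# strings on which Maj is 0, i.e. Hamming weight <= (n-1)/2
zeros = [''.join(map(str, x)) for x in product((0, 1), repeat=n)
         if sum(x) <= (n - 1) // 2]

# the process adds the numerically largest remaining string at each step
added_order = sorted(zeros, key=lambda s: int(s, 2), reverse=True)
```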
We will show the following:

Theorem 3.6 If f is a monotone function with f ≥ Maj that disagrees with Maj on exactly t strings, then I(f) ≥ I(M_t).

Fix t, and let M := M_t.
For any function g : {0,1}^n → {0,1}, let X_j(g) be the set of strings x such that |x| = j, g(x) = 1, and Maj(x) = 0. Define X(g) as the vector (|X_0(g)|, |X_1(g)|, ..., |X_{(n−1)/2}(g)|). Note that since we are assuming that f differs from Maj on exactly t strings, all of Hamming weight at most (n − 1)/2, the sum of the entries of X(M) equals the sum of the entries of X(f), or equivalently, the sum of the entries of X(M) − X(f) is 0.
Claim 3.7 The vector X(M)−X(f ) has all its nonnegative entries preceding all its negative entries.
Proposition 3.8 Claim 3.7 implies Theorem 3.6.

Proof: Suppose that the claim is true. Assuming f ≥ Maj, we have C(M) − C(f) = 2^{−n} Σ_j (2j − n)(|X_j(M)| − |X_j(f)|). The sum of the entries of X(M) − X(f) is 0, as M disagrees with Maj on t strings and f disagrees with Maj on exactly t strings. In this weighted sum, the weight 2j − n on entry j increases with j, so if all the nonnegative entries of X(M) − X(f) come first (getting lower weights) and the sum of the entries of X(M) − X(f) is 0, then C(M) − C(f) ≤ 0. Thus C(M) ≤ C(f), and by Lemma 3.5, I(M) ≤ I(f). □

Proposition 3.9 Claim 3.7 is true if f ≥ Maj.
Given a set A of strings of Hamming weight j, its upper shadow ∂_u A is the set of strings of Hamming weight j + 1 that can be obtained by flipping a single 0 of some string in A to a 1. We can now state the Kruskal-Katona theorem.
Theorem 3.10 Suppose A is a set of strings of Hamming weight j, and B is the set of the first |A| strings of Hamming weight j in lexicographic order.Then |∂ u A| ≥ |∂ u B|.
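Theorem 3.10 can be checked exhaustively on a tiny level of the cube. In the sketch below (ours) we take n = 5, j = 2 and order the weight-j strings "largest first" as binary numbers, which appears to be the lexicographic convention the paper uses (it matches the order in which the process M_t adds strings). This is a brute-force sanity check, not a proof.

```python
from itertools import combinations, product

n, j = 5, 2
level = [x for x in product((0, 1), repeat=n) if sum(x) == j]

# "first" weight-j strings: largest as binary numbers, matching M_t's order
level_ord = sorted(level, key=lambda x: int(''.join(map(str, x)), 2),
                   reverse=True)

def upper_shadow(A):
    # weight-(j+1) strings obtained by flipping a single 0 of some x in A
    return {x[:i] + (1,) + x[i + 1:] for x in A for i in range(n)
            if x[i] == 0}

# check |shadow(A)| >= |shadow(initial segment of size |A|)| for ALL subsets
ok = all(len(upper_shadow(set(A))) >= len(upper_shadow(set(level_ord[:k])))
         for k in range(len(level) + 1)
         for A in combinations(level, k))
```

There are only 2^10 = 1024 subsets of the 10 weight-2 strings, so the full check is instant.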
We will require a few claims.
Claim 3.11 For any t and j, X_j(M_t) consists of the first |X_j(M_t)| strings of Hamming weight j in lexicographic order.
This claim follows from the fact that lexicographic order is a suborder of the order we get when we order all binary strings by comparing them as numbers in decreasing order.

Claim 3.12 Let u be the string that is added at time t + 1, and suppose |u| = j + 1. Then u ≥ w_{j,t}.
Proof: First compare u and w_{j,t} as binary numbers. If u were less than w_{j,t}, then M_t and M_{t+1} would not disagree on u, since, as a number, u would not be the largest string on which M_t is 0. So u must be greater than w_{j,t} as a number. Now suppose the claim is false. Then there is a bit k such that u_k = 0 and (w_{j,t})_k = 1. If we change the least significant 0 bit of w_{j,t} from 0 to 1, the resulting string has Hamming weight j + 1 and is still, as a number, less than u. But M_t is a monotone function, so M_t is 1 on this string. But in our process we derive {M_t} by always flipping the output of the "largest" string, a contradiction. □

Define W_j(M) := X_j(M) ∪ {w_{j,t}}; by Claim 3.11, W_j(M) consists of the first |X_j(M)| + 1 strings of Hamming weight j in lexicographic order.

Claim 3.13 For any j, ∂_u(W_j(M)) ⊇ X_{j+1}(M_t), with equality only if the string to be added at time t + 1 is w_{j,t}.
Proof: Take any x in X_{j+1}(M_t). There exists t′ ≤ t such that x was added at time t′. By Claim 3.12, x ≥ w_{j,t′}. If w_{j,t′} = w_{j,t}, then x is in the upper shadow of w_{j,t}, and we are done. Otherwise, it must be the case that w_{j,t′} is in X_j(M_t), and thus x is in the upper shadow of X_j(M_t). Either way, x is in the upper shadow of W_j(M). Equality occurs only if all the strings y with y ≥ w_{j,t} have already been added, in which case the string added at time t + 1 will be w_{j,t}. □

Given these claims, we can now finish the proof of Theorem 3.6. The theorem is obvious when M = f, so we assume M ≠ f. Suppose for contradiction that Claim 3.7 is false. As the sum of the entries of X(M) − X(f) is 0, there must exist some j such that

|X_j(f)| > |X_j(M)| and |X_{j+1}(f)| < |X_{j+1}(M)|.   (7)

As f is monotone, X_{j+1}(f) ⊇ ∂_u(X_j(f)), and so |X_{j+1}(f)| ≥ |∂_u(X_j(f))|. By the Kruskal-Katona theorem, |∂_u(X_j(f))| is at least as large as the size of the upper shadow of the first |X_j(f)| strings in lexicographic order. Consider X_j(M). By Claim 3.11, the set of strings X_j(M) is precisely the set of the first |X_j(M)| strings in lexicographic order. So since |X_j(f)| > |X_j(M)|, the size of the upper shadow of the first |X_j(f)| strings in lexicographic order is at least as large as the size of the upper shadow of the first |X_j(M)| + 1 strings in lexicographic order. But the first |X_j(M)| + 1 strings in lexicographic order are precisely W_j(M) by definition, so |X_{j+1}(f)| ≥ |∂_u(W_j(M))|. By Claim 3.13, ∂_u(W_j(M)) contains X_{j+1}(M), and thus |X_{j+1}(f)| ≥ |X_{j+1}(M)|. Along with (7), this is a contradiction.

We now analyze the total influence of g_k and h_k. For any function g, let s(g, x) = |{y ∼ x : g(x) ≠ g(y)}|, so that I(g) = E_x[s(g, x)]. In the case of Maj, s(Maj, x) is nonzero exactly for the strings x of Hamming weight (n ± 1)/2. We will assume s(g_k, x) is nonzero only for strings x where x_1 x_2 ··· x_k = 0. Among the assignments to x_1, x_2, ..., x_k whose AND is 0, the one minimizing the probability of the string being sensitive is the all-zeros assignment. Thus we can lower-bound E_x[s(g_k, x)] by considering only strings with x_1 = x_2 = ··· = x_k = 0.
It follows from the above and Claim 3.7 that if f ≥ Maj and f is ε-close to Maj with ε < 1/4, then I(f) ≥ Ω(√n). In Proposition 3.9, as well as in most of this section, we have assumed that f ≥ Maj. Here we remove this assumption. First note that, by the symmetry of 1 and 0 (and the Kruskal-Katona theorem applied to an appropriate order), we could have proved everything in this section in a very similar way assuming instead that f ≤ Maj, i.e., that f is 0 wherever Maj is 0.
Proof: Suppose that f is ε-close to Maj. Write ε = ε_0 + ε_1, where ε_0 is the fraction of strings where f is 1 and Maj is 0, and ε_1 is the fraction of strings where f is 0 and Maj is 1. We may think of building f as a process, starting with Maj and flipping the output of one string at a time.

4 The BKS Conjecture is False

Recall that Tribes_n denotes the function of Ben-Or and Linial on n = b2^b bits: the OR of 2^b disjoint ANDs ("tribes") of b bits each. We index the bits as (x_{i,j})_{i∈I,j∈J}, where I indexes the 2^b tribes and J the b bits within a tribe, and we write Tribes†_n for the dual function, the AND of 2^b disjoint ORs.

Suppose we attempt to approximate Tribes_n with some CNF C. We view C as being an AND of ORs, where each OR's wires may pass through a NOT gate before being wired to an input gate x_{i,j}. Now further suppose we introduce additional "input gates" y_i, where each y_i is always equal to ⋀_{j∈J} x_{i,j}, and we allow the circuit C to access the y_i gates if it wants. Our main lemma uses the fact that Tribes_n depends only on the y_i's to show that C can be taken to depend only on the y_i's as well:

Lemma 4.1 Suppose Tribes_n is ε-approximated by a CNF C of size s and width w over the variables (x_{i,j})_{i∈I,j∈J}. Then there is another CNF C′ of size at most s and width at most w, over the variables (y_i)_{i∈I} only, that also ε-approximates Tribes_n.
Proof: Given C over the input gates x_{i,j}, imagine that every wire going to an input gate x_{i,j} is instead rewired to access x_{i,j} ∨ y_i. Call the resulting circuit C_1. We claim that C_1 and C compute the same function of x. The reason is that on any input x where y_i = 0, the rewiring to x_{i,j} ∨ y_i has no effect; and on any input x where y_i = 1, the rewiring to x_{i,j} ∨ y_i converts x_{i,j} to 1, but that still has no effect since y_i = 1 ⇒ x_{i,j} = 1. Since C was an ε-approximator for Tribes_n, we have Pr_x[C_1(x) ≠ Tribes_n(x)] ≤ ε.

Now picking x uniformly at random induces the 2^{−b}-biased product distribution on y ∈ {0,1}^I. We can get the same distribution on (x, y) by picking y first and then picking x conditioned on y; i.e., for each i ∈ I: if y_i = 1 then all x_{i,j}'s are chosen to be 1; if y_i = 0 then the substring (x_{i,j})_{j∈J} is chosen uniformly from {0,1}^J \ {(1, 1, ..., 1)}. In view of this, and using the fact that Tribes_n depends only on y, we have Pr_{y, x|y}[C_1(x) ≠ Tribes_n(y)] ≤ ε.

We next introduce new input gates (z_{i,j})_{i∈I,j∈J} that take on random values, completely independent of the x_{i,j}'s and the y_i's. Each substring (z_{i,j})_{j∈J} will be uniform on {0,1}^J \ {(1, 1, ..., 1)}; i.e., it will have the same distribution as (x_{i,j})_{j∈J} | y_i = 0. Now let the circuit C_2 be the same as C_1 except with all accesses to the x_{i,j}'s replaced by accesses to the corresponding z_{i,j}'s.
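The rewiring claim, that C_1 computes the same function as C whenever y is consistent with x, can be verified exhaustively on a tiny instance. The CNF below is an arbitrary example of ours, with each literal encoded as (tribe i, coordinate j, negated).

```python
from itertools import product

I, J = 2, 2   # tiny instance: 2 tribes of 2 variables each

# an arbitrary small CNF over the x_{i,j}: an AND of ORs of literals
cnf = [[(0, 0, False), (1, 1, True)],
       [(0, 1, False), (1, 0, False)]]

def eval_cnf(clauses, lit_value):
    return all(any(lit_value(i, j, neg) for (i, j, neg) in c)
               for c in clauses)

def C(x):
    # original circuit: NOTs applied directly to the x_{i,j}
    return eval_cnf(cnf, lambda i, j, neg: (not x[i][j]) if neg else x[i][j])

def C1(x, y):
    # rewired circuit: each access to x_{i,j} becomes (x_{i,j} OR y_i),
    # with any NOT applied after the rewiring
    def lit(i, j, neg):
        v = x[i][j] or y[i]
        return (not v) if neg else v
    return eval_cnf(cnf, lit)

# on consistent inputs (y_i = AND_j x_{i,j}), C1 agrees with C everywhere
same = all(C(x) == C1(x, tuple(all(row) for row in x))
           for x in product(product((0, 1), repeat=J), repeat=I))
```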
We claim that for every string y ∈ {0,1}^I, the distributions C_1(x|y, y) and C_2(z, y) are identical. The reason is that for each i ∈ I such that y_i = 1, the (x_{i,j})_{j∈J} and (z_{i,j})_{j∈J} values are irrelevant, since C_1 only accesses x_{i,j} via x_{i,j} ∨ y_i, and the same is true of C_2 and z_{i,j}. On the other hand, for each i ∈ I such that y_i = 0, the (x_{i,j})_{j∈J} and (z_{i,j})_{j∈J} values are identically distributed.
In light of this, we conclude Pr_{z,y}[C_2(z, y) ≠ Tribes_n(y)] ≤ ε, which can be switched to E_z[Pr_y[C_2(z, y) ≠ Tribes_n(y)]] ≤ ε. Since z and y are independent, we can conclude there must be a particular setting z* such that Pr_y[C_2(z*, y) ≠ Tribes_n(y)] ≤ ε. We may now take C′ to be the circuit over the y gates only, gotten by fixing the input z* in C_2. It is easy to check that C′ has width at most w and size at most s. □

We can now use Lemma 4.1 to show that Tribes_n has no good CNF approximator of width much smaller than n/log n:

Theorem 4.2 Any CNF that .2-approximates Tribes_n must have width at least (1/3)2^b = Ω(n/log n).
Proof: Let C be a CNF of width w that .2-approximates Tribes_n over the variables (x_{i,j})_{i∈I,j∈J}. Using Lemma 4.1, convert it to a CNF C′ over the variables (y_i)_{i∈I} that .2-approximates Tribes_n. We may assume that no clause of C′ includes both y_i and ¬y_i for some i. We now consider two cases.
Case 1: Every clause of C′ contains at least one negated variable. Then C′ evaluates to 1 on the all-zeros string y = 0. But Tribes_n is 0 precisely when y = 0, and Pr[y = 0] = (1 − 2^{−b})^{2^b} ≈ 1/e > .2. So C′ errs with probability more than .2, a contradiction.

Case 2: C′ has at least one clause in which all y_i's appear unnegated. Suppose this clause has width w′ ≤ w. Since each y_i is true only with probability 2^{−b}, this clause is true with probability at most w·2^{−b}, by the union bound. And whenever this clause is false, C′ is false. Hence if w < (1/3)2^b, then Pr_y[C′(y) = 1] ≤ 1/3, whereas Pr_y[Tribes_n(y) = 1] approaches 1 − 1/e ≈ .63; so C′ errs with probability more than .2, a contradiction. □

By the duality of DNFs and CNFs, we immediately obtain:

Corollary 4.3 Any DNF that .2-approximates Tribes†_n must have width at least (1/3)2^b = Ω(n/log n).

As an aside, we can now show that the idea of approximating TOPs by DNFs discussed in Section 1.3 cannot work. Since Tribes†_n is computable by a polynomial-size CNF, Jackson's Harmonic Sieve learning algorithm [Jac95] can produce a polynomial-size, O(log n)-width TOP ε-approximator for it, for any constant ε > 0. But one can never convert this to even a .2-approximating DNF of size smaller than 2^{Ω(n/log n)}, by Corollary 4.3 combined with Observation 1.1.
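The probabilities driving the case analysis are easy to tabulate. For b = 4 (so n = 64), Pr[Tribes = 1] is already close to 1 − 1/e ≈ .632, while an all-unnegated clause of width under (1/3)2^b is true with probability at most 1/3, leaving an error gap above .2. A numeric sketch:

```python
b = 4   # tribe size; Tribes has 2^b tribes on n = b * 2^b = 64 bits

# Pr[Tribes = 1] = 1 - Pr[every tribe fails] = 1 - (1 - 2^-b)^(2^b)
p_tribes = 1 - (1 - 2 ** -b) ** (2 ** b)

# union bound for an all-unnegated clause of width w <= (1/3) * 2^b:
# each y_i is true with probability 2^-b
clause_bound = ((2 ** b) // 3) * 2 ** -b

gap = p_tribes - clause_bound   # lower bound on the CNF's error in Case 2
```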
We now define the function that contradicts the BKS Conjecture:

Definition 4.4 Let n be of the form b2^{b+1}. We define F_n : {0,1}^n → {0,1} to be the OR of Tribes_{n/2} and Tribes†_{n/2}, on disjoint sets of bits.
Proposition 4.5 F_n is a monotone function computable by a depth-3 read-once formula, and I(F_n) = O(log n).

Theorem 4.6 Any depth-2 circuit that .04-approximates F_n must have size at least 2^{Ω(n/log n)}.
Proof: Suppose D is a DNF of size s that .04-approximates F_n. By Observation 1.1, we can replace it with a DNF D′ of width at most log(100s) which (.04 + 1/100) = .05-approximates F_n.
Consider choosing x ∈ {0,1}^{n/2} uniformly at random from the set of strings that make Tribes_{n/2} false, and also choosing y ∈ {0,1}^{n/2} independently and uniformly at random. Since at least 1/4 of all strings make Tribes_{n/2} false (close to 1/e, in fact), this distribution is uniform on some subset of {0,1}^n of fractional size at least 1/4. Since D′ errs in computing F_n on at most a .05 fraction of strings, we conclude that Pr[D′(x, y) ≠ F_n(x, y)] ≤ 4 · .05 = .2. On such inputs F_n(x, y) = Tribes†_{n/2}(y), so there is a fixed setting of x under which D′ becomes a .2-approximating DNF for Tribes†_{n/2} of width at most log(100s). Corollary 4.3 then gives log(100s) ≥ Ω(n/log n), i.e., s ≥ 2^{Ω(n/log n)}.
A very similar argument, restricting to the inputs to F_n where the Tribes†_{n/2} part is 0 and then using Theorem 4.2, shows that any CNF that is a .04-approximator for F_n must have size at least 2^{Ω(n/log n)}. This completes the proof. □

Theorem 4.6 already implies that the BKS Conjecture cannot hold with d always equal to 2. To completely falsify the conjecture, we need the following additional observations:

Proposition 4.7 Any function f : {0,1}^n → {0,1} that .02-approximates F_n must depend on at least Ω(n) input bits.
Proof: It is very well known (see [DF06] for a written proof) that there is an explicit ε > 0 (and certainly ε = .1 is achievable) such that any function g : {0,1}^{n/2} → {0,1} that ε-approximates Tribes_{n/2} must depend on at least Ω(n) of its input bits. Now an argument very similar to the one used in the proof of Theorem 4.6 shows that if f is a .02-approximator for F_n, then some restriction of f must be a δ-approximator for Tribes_{n/2} with δ ≤ 4 · .02 < .1. Since this restriction must depend on at least Ω(n) input bits, we conclude that f must also depend on at least this many input bits. □

Proposition 4.8 Any circuit that .01-approximates F_n must have size at least Ω(n/log n).
Proof: Suppose the circuit C has size s and is a .01-approximator for F_n. By Observation 1.1, there is another circuit C′ of size at most s and width at most log(100s) that .01-approximates C; this C′ is thus a (.01 + .01) = .02-approximator for F_n. But C′ depends on at most size × width = s log(100s) literals. Hence s log(100s) ≥ Ω(n) by Proposition 4.7, and so s ≥ Ω(n/log n). □

Finally, we've established:

Theorem 4.9 The BKS Conjecture is false.
Proof: We use the function F_n, which is monotone and has I(F_n) = O(log n). The BKS Conjecture implies that there is some universal constant K = K(.01) < ∞ such that the following holds: there is a circuit C that .01-approximates F_n and has depth d and size s, for some d and s satisfying s ≤ exp((K · I(F_n))^{1/(d−1)}) = exp(O(log^{1/(d−1)} n)). Now d can't be 2, since this would imply s ≤ poly(n), and we know from Theorem 4.6 that there is no circuit .01-approximating F_n of depth 2 and size 2^{o(n/log n)}. But d ≥ 3 is also impossible, since this would imply s ≤ exp(O(√(log n))), and we know from Proposition 4.8 that there is no circuit .01-approximating F_n of size o(n/log n). □
