Informative interventions

Causal discovery programs have met with considerable enthusiasm in the AI and data mining communities. Amongst philosophers they have met a more mixed response, with numerous skeptics pointing out weaknesses in their assumptions. Some criticize the reliance upon faithfulness (the idea that every causal connection will result in probabilistic dependence), since the true model may in fact be unfaithful (Cartwright, 2001). Despite a common, self-imposed restriction to observational data in causal discovery, the intervention account of causality (Pearl, 2000; Spirtes et al., 2000) suggests that the inclusion of intervention data may alleviate this concern. Korb and Nyberg (2006) established that, for linear networks, even underwhelming interventions (that never overwhelm other influences) have sufficient power to overcome unfaithfulness and go beyond the limits of observational data to identify the true model. Here we extend those results to discrete networks, which present the added difficulty that they can be unfaithful along a single path (as noted, e.g., by Hitchcock, 2001). In doing so, we illustrate both unfaithful chains and unfaithful collisions, give mathematical criteria for such interactions, make some recommendations for diagnosing unfaithfulness and designing informative interventions, and finally, demonstrate the power of both one and N − 1 underwhelming interventions.


Introduction
Recently many philosophers of science have turned their attention to Bayesian networks as a tool for reasoning with and about causality (e.g., Woodward, 2003; Hitchcock, 2001). This interest is largely due to the development of causal discovery algorithms, which have made substantial progress over the last decade, as documented at the Uncertainty in AI conferences. These technological advances make it more pressing to develop a convincing causal interpretation of the learned models. Causal discovery is meant to discover causal models, whereas the standard semantics for Bayesian networks portray them simply as a way to represent probability distributions which may yield nice computational properties. So a causal interpretation of Bayesian networks requires making sense of the relation between probability and causality.
A fair number of skeptics have arisen to voice doubts about the assumptions behind causal discovery. These include general doubts about the possibility of automating discovery (Humphreys and Freedman, 1996), doubts about the Common Cause Principle or its Bayesian net generalization in the causal Markov condition (Cartwright, 1999), and doubts about the causal faithfulness assumption (Cartwright, 2001). The faithfulness assumption is (roughly) that causally connected variables must always be probabilistically dependent. The most infamous counterexample is that of Germund Hesslow (1976), Figure 1.1(a). Here the contraceptive pill directly causes thrombosis, but it also indirectly prevents it, by preventing pregnancy (which itself causes thrombosis). Its two effects on thrombosis exactly cancel out ex hypothesi, which leaves the pill and thrombosis marginally independent, despite the fact that they are twice causally connected. In the case of linear causal models, we have previously shown that such failures of faithfulness can be overcome by causal discovery algorithms when they are provided with (somewhat idealized) interventional data (Korb and Nyberg, 2006). Here we generalize those results to discrete causal Bayesian networks, where new problems arise due to failures of faithfulness along single causal paths (as noted, e.g., by Hitchcock, 2001).
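To make the neutral Hesslow case concrete, the following minimal sketch parameterizes a binary pill/pregnancy/thrombosis network so that the two paths cancel exactly. All of the probabilities are hypothetical values chosen for illustration (they are not taken from Hesslow or from any study), and the names `P_PREG`, `P_THROMB` and `p_thromb_given_pill` are our own.

```python
from fractions import Fraction as F

# Hypothetical parameters, chosen so that the two paths cancel exactly.
P_PREG = {0: F(1, 2), 1: F(1, 10)}              # P(pregnant | pill)
P_THROMB = {(0, 0): F(1, 10), (0, 1): F(1, 2),  # P(thrombosis | pill, pregnant)
            (1, 0): F(1, 4), (1, 1): F(3, 4)}


def p_thromb_given_pill(pill):
    """Marginalize pregnancy out of P(thrombosis | pill, pregnant)."""
    p_preg = P_PREG[pill]
    return (1 - p_preg) * P_THROMB[(pill, 0)] + p_preg * P_THROMB[(pill, 1)]


# The pill raises the risk within each pregnancy stratum (the direct path) ...
direct_effect_in_each_stratum = all(
    P_THROMB[(1, preg)] > P_THROMB[(0, preg)] for preg in (0, 1)
)
# ... yet the marginal risk is the same with and without the pill.
marginal = {pill: p_thromb_given_pill(pill) for pill in (0, 1)}
print(direct_effect_in_each_stratum, marginal[0] == marginal[1])
```

Exact rational arithmetic is used so that the cancellation is an identity rather than a floating-point coincidence.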

Assumptions
We shall make the following assumptions about the causal models and data. These assumptions are all common in the literature, although not made universally, and are not true in all situations.
The real system under study is assumed to be a stable process, characterizable by a dag R (for reality) over the N measured variables, which gives rise to a joint probability distribution over the variables, $P_R$. It is unperturbed by measurement, and hence the measurements of $P_R$ are i.i.d. (independent and identically distributed). Moreover, the original relationships remain intact despite the introduction of interventions, which is a considerably stronger idealization.¹
2. The N variables are discrete and OK.
The same N measured variables are included in R and every candidate dag; they are all multinomial, taking discrete states;² and there are no defects in their specification. For example, there is no significant missing variable that plays the role of a latent common cause (they are causally sufficient). There are no significant missing states in the specification of each variable (they are fully specified).³
3. The measurements are OK.
Each datum in the sample is the result of observing some instance of the system. For each observed variable there is a measured value, and all such values in each datum can be recorded as an N-dimensional vector. The observed frequencies thus reflect $P_R$. For accurate estimation of $P_R$ it is not necessary for every variable to be measured in each instance, provided that there is no systematic bias in which measurements are missing that results in mis-estimation of $P_R$.
4. Other information is excluded.
Sources of information about R other than the measurement vectors are excluded: prior knowledge, the timing of each datum or measurement, etc.⁴
5. The sample accurately represents $P_R$.
In real experiments the sample size is always limited, and hence it is frequently difficult to estimate probabilistic dependencies accurately enough to avoid inferential errors. However, we set this problem aside by assuming that the sample represents $P_R$ sufficiently accurately to make our evaluations of probabilistic dependency correct.⁵ Hence, we effectively have unfettered access to $P_R$.
6. R is highly stochastic.
All the original relationships between variables are assumed to be indeterministic. X causally influences Y just in case changing the probability distribution over X causes the distribution over Y to change.⁶ Furthermore, under any given conditions specifying the values of some variables,⁷ there is some non-degenerate probability distribution over the states of each remaining variable. Not only does this distribution not determine a single value for any remaining variable, we are assuming that every state of the remaining variables has a finite probability, however small ($P_R$ is everywhere positive).⁸
7. Models must be Markov.
Roughly, this means that whenever two variables are probabilistically dependent, they must be causally connected. Models that are not Markov leave probabilistic dependencies unexplained, whereas the goal of causal modeling is to explain those dependencies causally. Probabilistic dependence between X and Y conditional upon some set of variables S is denoted $X \not\perp\!\!\!\perp Y \mid S$, which for discrete dags abbreviates $\exists i \exists j \exists k [P(x_i \mid y_j, s_k) \neq P(x_i \mid s_k)]$ (and is in contrast to independence, denoted $X \perp\!\!\!\perp Y \mid S$, abbreviating $\forall i \forall j \forall k [P(x_i \mid y_j, s_k) = P(x_i \mid s_k)]$). The corresponding graphical relationship is d-connection, denoted $X \not\perp Y \mid S$ (in contrast to d-separation, denoted $X \perp Y \mid S$). So the Markov property can be expressed as $\forall S [X \not\perp\!\!\!\perp Y \mid S \text{ entails } X \not\perp Y \mid S]$.⁹
8. Models must be minimal.
We require that our candidate dags are minimal Markov models. That is, if there is any proper subgraph of some causal model M which is also Markov, then the subgraph is to be preferred, as the additional paths in M are doing no useful explanatory work.¹⁰ Minimality entails that each individual arc be effective; that is, it makes some positive contribution to the dependency relationship between parent and child.¹¹ We call causal models admissible if and only if they are (a) Markov with respect to $P_R$, (b) parameterizable to fit $P_R$,¹² and (c) minimal. Thus they fit and explain the data qualitatively, quantitatively, and without redundancy.
However, unlike much of the literature, we shall not assume:
9. R, and our models, must be faithful.
Roughly, this means that whenever two variables are causally connected, they must be probabilistically dependent. More formally, a causal model is faithful if and only if, for every conditioning set S, d-connection between X and Y given S entails probabilistic dependence between X and Y given S. Logically, this is the converse of the Markov property. It amounts to the assumption that the path structure displays a kind of maximal dependency: every d-connection, however indirect, must exhibit dependency under all possible conditioning sets. Obviously this is a convenient assumption for discovery!
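The dependence relation that the Markov and faithfulness conditions both rely upon can be tested directly once a joint distribution is available. The sketch below is only an illustration: it assumes the joint is supplied as a dictionary from `(x, y, s)` value triples to exact probabilities, and the function name `dependent` and the example chain parameters are hypothetical. Marginal dependence is obtained by collapsing the conditioning value to a constant.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product


def dependent(joint):
    """True iff P(x | y, s) != P(x | s) for some values x, y, s.

    `joint` maps (x, y, s) triples to exact probabilities (Fractions).
    The comparison is cross-multiplied to avoid division.
    """
    p_s, p_xs, p_ys = defaultdict(F), defaultdict(F), defaultdict(F)
    for (x, y, s), p in joint.items():
        p_s[s] += p
        p_xs[x, s] += p
        p_ys[y, s] += p
    xs = {x for x, _, _ in joint}
    ys = {y for _, y, _ in joint}
    for x, y, s in product(xs, ys, list(p_s)):
        if p_ys[y, s] == 0:
            continue  # P(x | y, s) is undefined; cannot occur if P is positive
        if joint.get((x, y, s), 0) * p_s[s] != p_xs[x, s] * p_ys[y, s]:
            return True
    return False


def bern(p, value):
    """Probability that a binary variable with P(1) = p takes `value`."""
    return p if value == 1 else 1 - p


# Hypothetical chain X -> S -> Y over binary variables.
q = F(3, 4)
chain = {(x, y, s): F(1, 2) * bern(q if x else 1 - q, s) * bern(q if s else 1 - q, y)
         for x, y, s in product((0, 1), repeat=3)}

# Collapse S to a constant to test marginal dependence between X and Y.
marginal = defaultdict(F)
for (x, y, s), p in chain.items():
    marginal[x, y, ()] += p

print(dependent(marginal), dependent(chain))
```

In this chain X and Y are marginally dependent, but conditioning on the middle variable screens them off.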
All algorithms must evaluate candidate models according to how well they fit the data from $P_R$ (and satisfy any other adopted criteria, such as minimality). The Verma-Pearl algorithm can be expressed, in a simplified form, as three rules for constructing a model in a piecewise fashion from specific features of $P_R$ (and thereby implicitly eliminating incompatible alternatives in the process).
Step II is justified by an important asymmetry between the four possible subdags with the uncovered skeleton X − Y − Z. The collision X → Y ← Z can induce conditional dependence under any conditions that specify Y, but cannot induce marginal dependence when nothing is known about Y. In contrast, the chains X → Y → Z and X ← Y ← Z and also the common cause X ← Y → Z can all induce marginal dependence when nothing is known about Y, but cannot induce dependence conditional on Y (in such cases, knowing Y is said to "screen off" X from Z). Hence, uncovered collisions are easier to identify,¹³ because they have a unique probabilistic signature. (In describing when two-arc paths can exhibit dependence, we have thereby defined when they are d-connecting. Longer paths are d-connecting just in case they are a sequence of d-connecting two-paths and do not use the same node twice.) These two steps are then followed by a Step III, checking for any arc directions that are forced by further considerations, such as avoiding the introduction of cycles or any uncovered collisions not already identified in Step II.
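The asymmetry between collisions and the other uncovered two-arc paths can be written as a small classification rule. The sketch below is a deliberate simplification rather than the Verma-Pearl rule itself: it consults only the empty conditioning set and the set containing the middle node, and it assumes a hypothetical oracle `dependent(a, b, given)` that reports dependence in $P_R$.

```python
def classify_uncovered_triple(x, y, z, dependent):
    """Classify the uncovered triple x - y - z by its probabilistic signature.

    `dependent(a, b, given)` is an assumed oracle: it returns True when a and b
    are dependent given the set of variables `given`.
    """
    marginal = dependent(x, z, frozenset())
    given_y = dependent(x, z, frozenset({y}))
    if given_y and not marginal:
        return "collision"       # x -> y <- z
    if marginal and not given_y:
        return "non-collision"   # a chain or a common cause: y screens off x from z
    return "unclassified"        # neither signature appears, e.g. an unfaithful path


# Toy oracle: the dependence facts of a faithful collision X -> Y <- Z.
COLLISION_FACTS = {
    ("X", "Z", frozenset()): False,
    ("X", "Z", frozenset({"Y"})): True,
}
result = classify_uncovered_triple(
    "X", "Y", "Z", lambda a, b, given: COLLISION_FACTS[(a, b, given)]
)
print(result)
```

The "unclassified" branch is where unfaithful structures end up: a collision that shows no conditional dependence has no signature for such a rule to find.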
All steps implicitly assume faithfulness. In Step I, causal arcs are added only if there is always dependence; so any arc that sometimes shows no dependence cannot be discovered. In Step II, uncovered collisions are added only if there is always a conditional dependency; so any collision that sometimes (or even always!) shows no such dependency will not be identified. In Step III, a remaining structure like X → Y − Z must be oriented as X → Y → Z, simply because insufficient conditional dependency was discovered in Step II to orient it as X → Y ← Z. Thus, an unfaithful collision could be misidentified as another structure.
¹³ By identification we mean throughout determining arc structure and arc orientations.
Causal discovery algorithms naturally divide into two classes: "constraint-based" learners, which explicitly test for the dependencies sought in the Verma-Pearl algorithm (e.g., IC, PC); and "metric learners", which generate a Bayesian or information-theoretic metric that scores models given the data (e.g., K2, BDe/BGe, CaMML). Despite this difference, the metric learners rely upon the probabilistic dependencies exhibited in $P_R$ just as much as the constraint learners. The real difference between the two approaches is that the constraint learners test for each dependency in isolation, whereas the metric learners score the models based upon their ability to represent the total pattern of dependencies. All of these learners ultimately rely upon faithfulness to some extent in preferring one model over another. In particular, when other things are equal, they will prefer a faithful candidate to any unfaithful alternative.¹⁴

Statistical Indistinguishability
The Verma-Pearl algorithm leads directly to a theory of the observational equivalence, or statistical indistinguishability, of models. There are several different types of indistinguishability of concern here.
That is, any probability distribution representable by $M_1$, via some parameterization $\theta_1$, is also representable by $M_2$ via some other parameterization, and vice versa.
¹⁴ Not all unfaithfulness leads these algorithms into error, since in selecting a model, they do not rely upon every dependence entailed by faithfulness. For example, the Verma-Pearl algorithm does not rely upon dependencies such as those between X and Z in the structure X → Y → Z. So even if this chain dependency is missing, the algorithm may still recover the direct connections individually, without even noticing the infidelity. Nonetheless, there are cases that lead all these algorithms astray, as we shall discuss in §5.
This is not a very interesting property. If two models are weakly indistinguishable, but the probability distribution(s) that they can both represent do not include $P_R$, then this indistinguishability will not create any difficulty for causal discovery. More interesting is a special case of weak indistinguishability:
where $P_R$ is the real probability distribution, when we also say $M_1$ and $M_2$ are admissible.
This case is more interesting because it captures what we are really after: all the models that can explain reality.
The two main results in the theory of strong indistinguishability are:
1. (Verma and Pearl, 1990) Any two causal models that are in the same pattern, i.e., which have the same skeleton (undirected arcs) and the same uncovered collisions, are strongly statistically indistinguishable.
2. (Chickering, 1995) If $M_1$ and $M_2$ are strongly statistically indistinguishable, then they have the same maximum likelihoods relative to any joint samples, where the maximum is taken over the parameterizations $\theta_i$ of each $M_i$.
Another consequence of this theory is that Chickering transformations maintain (or expand) the representational power of models. One model is a Chickering transformation of another if it reverses one arc and either (a) neither introduces nor destroys an uncovered collision (in which case the two are in the same pattern) or else (b) the introduced (or eliminated) collision is then covered by adding a direct connection between the parents of the (new or old) collider. In the latter case, the new model's representational power is strictly greater than that of the original; we can call the two models asymmetrically indistinguishable, as all of the first model's distributions are representable by the second, but not vice versa. Any admissible candidate model can be transformed into other Markov models in this fashion: into all the equivalent models within its pattern, together with all chains of supermodels initiated by each equivalent model. Some of these alternative Markov models may not be minimal.¹⁵ However, there will always be some minimal Markov alternatives that can be parameterized to fit $P_R$. Thus, even if minimality is assumed, the problem of underdetermination for dags is universal.
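The pattern criterion of Verma and Pearl (1990) is purely graphical, so it is easy to check mechanically. The following sketch assumes that a dag is given as a set of `(parent, child)` pairs with string node names; the function names are our own and are not drawn from any library.

```python
from itertools import combinations


def skeleton(arcs):
    """The undirected arcs of a dag given as a set of (parent, child) pairs."""
    return {frozenset(arc) for arc in arcs}


def uncovered_collisions(arcs):
    """Triples (a, child, b) where a -> child <- b and a, b are not adjacent."""
    skel = skeleton(arcs)
    parents = {}
    for parent, child in arcs:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def same_pattern(arcs1, arcs2):
    """Same skeleton and same uncovered collisions."""
    return (skeleton(arcs1) == skeleton(arcs2)
            and uncovered_collisions(arcs1) == uncovered_collisions(arcs2))


chain = {("X", "Y"), ("Y", "Z")}
reverse_chain = {("Y", "X"), ("Z", "Y")}
common_cause = {("Y", "X"), ("Y", "Z")}
collision = {("X", "Y"), ("Z", "Y")}

print(same_pattern(chain, reverse_chain), same_pattern(chain, common_cause),
      same_pattern(chain, collision))
```

The two chains and the common cause share a pattern, while the collision stands alone; adding the arc X → Z to the collision covers it, so that it no longer counts as an uncovered collision.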

Cancelling Paths
The other problem with faith-based discovery is that it isn't always reliable. There are a number of ways in which true models can lack faith and mislead discovery algorithms. One kind of unfaithfulness is where multiple paths exactly cancel each other out, leaving the end-points of the paths d-connected yet marginally independent. Another kind of unfaithfulness occurs along a single path, which we will discuss in §8-§10.
¹⁵ None of these definitions or results concerning indistinguishability assume that the models must be minimal or faithful to every distribution. Spirtes et al. (2000) define various further sub-types of indistinguishability, where such additional properties are required.
The neutral Hesslow case of Figure 1.1(a) is the paradigmatic example of cancelling paths. The alternative graph of Figure 1.1(b) (however intuitively implausible) is a faithful impostor, having all and only the probabilistic dependencies implied by the neutral Hesslow model. Indeed, the impostor is the candidate model that a causal discovery algorithm (lacking such intuition) will normally find, given data generated by the neutral Hesslow model.
Skeptics of causal discovery naturally conclude that algorithms prone to learn such aberrant impostor models as Figure 1.1(b) must be defective. Previously, proponents have responded by arguing that Hesslow cases are not a significant problem, because they have a vanishing probability. If model parameters are selected by nature from a non-trivial uniform distribution over real numbers, then the probability that the influences determined by those parameters should exactly balance out is zero. In a slogan: measure zero implies probability zero (Spirtes et al., 2000).
Undeterred, skeptics have replied by giving numerous real examples of cancelling path unfaithfulness. These are intended to illustrate that this theoretical problem does arise in practice with a significant frequency. Redundant alleles screen off their primary alleles (Steel, 2004). Evolution often also results in systems in equilibrium, where various factors balance each other. Similarly, governments sometimes regulate by introducing factors that balance others (Hoover, 2001). Thus, whether by evolutionary "design" or "intelligent" design, reality defies the slogan.
Thus, there are circumstances in which path effects are determined in ways that are quite unlike sampling from a uniform probability distribution, and where balanced effects can be highly probable. Furthermore, the path effects need not exactly match to cause trouble: it is sufficient for them to be roughly equal, so that the available data is insufficient to reveal any net dependency. Although in principle one can sample to the "limit", this is a principle never applied in practice! We do have some sympathy for the Measure Zero Defence, since there seem to be many circumstances where its assumptions are close to the truth. However, we accept that it does not always apply, and so a blanket assumption of faithfulness is wrong.
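The contrast between exact and approximate cancellation can be illustrated by simulation. The sketch below assumes, purely for illustration, that every conditional probability in a binary Hesslow-style model is drawn independently and uniformly from [0, 1]; the tolerance of 0.01 and the function name `net_effect` are likewise our own choices. It counts how often the net effect of the pill on thrombosis is exactly zero and how often it is merely small.

```python
import random


def net_effect(rng):
    """Draw a random parameterization; return P(T | pill=1) - P(T | pill=0)."""
    preg = [rng.random(), rng.random()]                          # P(pregnant | pill)
    thromb = [[rng.random(), rng.random()] for _ in range(2)]    # P(T | pill, pregnant)

    def marginal(pill):
        return (1 - preg[pill]) * thromb[pill][0] + preg[pill] * thromb[pill][1]

    return marginal(1) - marginal(0)


rng = random.Random(0)                       # fixed seed for repeatability
effects = [net_effect(rng) for _ in range(20_000)]
exact = sum(e == 0.0 for e in effects)       # exact cancellations
near = sum(abs(e) < 0.01 for e in effects)   # cancellations too close to detect easily
print(exact, near)
```

Under these sampling assumptions exact cancellation is not expected to occur, whereas a non-negligible share of draws falls within the tolerance, which is the case that matters for finite samples.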
Fortunately, there is a further response available to the problem of cancelling path unfaithfulness. If we can get data from suitable interventions upon the original variables, and incorporate them into our learning in an appropriate way, then this will be sufficient to expose the unfaithful impostor. Moreover, such interventions can allow causal discovery algorithms to go beyond identifying the correct pattern, to identifying the uniquely correct dag, potentially eliminating the underdetermination problem altogether.
In summary, we concede that critics of causal discovery algorithms are correct: when such algorithms rely only upon observational data and assumptions like faithfulness, then they may go wrong. However, we argue that algorithms are not always limited in this way, and hence they are not inherently doomed to making such errors. A more general assessment of the power of causal discovery algorithms must take into account the added power of intervention.

Intervention as Augmentation
Nominally, we distinguish the original model, R (e.g., the neutral Hesslow model), from the model under intervention, R′. Each intervention can be represented by augmenting the original model with a new intervention node affecting an original (target) node.¹⁶ Each new intervention variable is itself uncaused: it is not the child of any other node in the augmented model. Its direct impact in the model is limited to the targeted node itself: it has only one child. This somewhat idealized form of intervention is what we call augmentation. In order to consider the upper limit of what intervention can achieve, it will also be useful to consider full augmentation, in which every original node is augmented. For example, the two Hesslow models of Figure 1.1, under full augmentation, become Figure 1.2. Each intervention node satisfies three conditions:
(a) it has at least two states (as any variable must);
(b) at least one of these states is effective (changing the probability distribution of the target node, in comparison to no intervention, and also in comparison to the other state);
(c) at least one of these states is underwhelming (permitting the original parents to retain some influence over the probability distribution of the target node).
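Conditions (b) and (c) can be checked against a concrete augmented model. The sketch below augments a binary neutral Hesslow model with a single two-state intervention node on pregnancy. All of the numbers are hypothetical, the Off state is assumed to leave the original relationship untouched, and the names `is_effective`, `is_underwhelming` and `p_thromb` are ours.

```python
from fractions import Fraction as F

# P(pregnant | intervention state, pill): hypothetical values. "off" reproduces
# the original relationship; "on" halves the chance of pregnancy.
P_PREG = {("off", 0): F(1, 2), ("off", 1): F(1, 10),
          ("on", 0): F(1, 4), ("on", 1): F(1, 20)}
# P(thrombosis | pill, pregnant): unchanged by the intervention.
P_THROMB = {(0, 0): F(1, 10), (0, 1): F(1, 2),
            (1, 0): F(1, 4), (1, 1): F(3, 4)}


def is_effective():
    """Condition (b): On changes the target's distribution relative to Off."""
    return any(P_PREG["on", pill] != P_PREG["off", pill] for pill in (0, 1))


def is_underwhelming():
    """Condition (c): under On, the original parent (pill) still matters."""
    return P_PREG["on", 0] != P_PREG["on", 1]


def p_thromb(state, pill):
    """P(thrombosis | intervention state, pill), marginalizing pregnancy."""
    p_preg = P_PREG[state, pill]
    return (1 - p_preg) * P_THROMB[pill, 0] + p_preg * P_THROMB[pill, 1]


cancels_when_off = p_thromb("off", 0) == p_thromb("off", 1)
exposed_when_on = p_thromb("on", 0) != p_thromb("on", 1)
print(is_effective(), is_underwhelming(), cancels_when_off, exposed_when_on)
```

With these numbers the pill and thrombosis remain independent while the intervention is Off, but become dependent once it is On, even though the intervention never overrides the pill's influence on pregnancy.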
These are all the essential features of our interventions, but to avoid confusion we shall now note some optional features, and the relationships our scheme has to a number of alternatives. Interventions can be:
When the intervention node is in a state that overwhelms the target (overpowering all other influences on it), it clearly satisfies condition (b), and this is simply an extreme case of influence. However, we do also assume that the intervention node has another state that is not overwhelming, in condition (c). Pearl (2000) usually assumes the same thing, but also that the underwhelming state is completely ineffective (say, Off). Once again, in our framework a complete lack of influence is simply an extreme case. In Pearl's "do-calculus" (Pearl, 2000) when an "atomic" intervention is performed, it is overwhelming and deterministic: it solely determines a specific value of the target node. Where such an intervention is coupled with a completely ineffective Off state, we shall call it a setting intervention. A so-called "randomizing" intervention is also overwhelming, but determines a specific (non-degenerate) probability distribution instead of a specific value. Where such an intervention is coupled with a completely ineffective Off state, and the random distribution is positive for all values of the target, we shall call it a scrambling intervention.
¹⁶ Modeling interventions by augmenting models with new intervention variables is mentioned by Pearl (2000), more fully developed by Korb et al. (2004), and previously used by Korb and Nyberg (2006). Here we are simply making some summary observations.

Underwhelming.
Most interventions are underwhelming (Korb et al., 2004). Where an overwhelming intervention is possible it is usually too costly, having an underwhelming alternative that is informative and cheaper (in the general sense of decision theory, where the cost may be financial, ethical, or even express physical impossibility). Pearl acknowledges this in his discussion of "imperfect compliance" (Pearl, 2000): in medical trials subjects are often randomly assigned to either the treatment or control group, but not all subjects comply with the treatment assigned to them. Pearl represents this situation in the same way that we do: an intervention node represents the random assignment, but this has an underwhelming influence on the target node, which represents the treatment actually received.¹⁷ More generally, any situation where the experimenter does not have complete control over the variable of interest, but only some experimental "instrument" that influences it, can be represented in the same way. We regard this situation as typical of intervention, and our augmentation scheme reflects this (since even an underwhelming intervention can satisfy condition (b)).
Overwhelming interventions effectively sever other parental arcs when they are On (and hence are often called "edge breaking", "surgical", etc.). However, if an intervention node includes any state (such as Off) which allows another parent to influence the target, then this other parental arc must still be included in the graph. So when both extreme Pearlean states are represented in the one intervention variable, then other parental arcs cannot be removed. Where no intervention state is overwhelming, then of course the intervention is never arc-cutting.

Non-randomized.
It is common to distinguish between intervention variables that are formally randomized (e.g., assignments in medical trials determined by a supposedly randomizing device, such as the roll of a die), those that are informally randomized (e.g., states determined by some complex combination of physical processes that can reasonably be treated as random, i.e., close enough to some kind of roll of a die), and those that are determined by human decisions (e.g., an experimenter's decision to choose On or Off, which is supposedly not like any kind of roll of a die). A correlated distinction is between human interventions (e.g., Pearl's "policy" variables) and naturally occurring ones (e.g., the "instrumental" variables in statistics). Any of these scenarios can be represented by augmentation; and these distinctions are unnecessary to our discussion. This is because the probability relationships of interest here can all be expressed as conditional probabilities given the values of all intervention variables (e.g., $P(X \mid I_X = \mathit{Off}) \neq P(X \mid I_X = \mathit{On})$). Hence, whether or not there is a well-defined probability distribution over these intervention values is immaterial. For convenience, we shall suppose that there is one, so that we can have well-defined probability distributions without specifying the state of every intervention variable.¹⁹
Epistemically, we do not assume that the effects of an intervention must be known in advance, or that its hypothetical effects must be known on the counterfactual supposition that another dag is true (except where we explicitly specify otherwise). Some knowledge of this kind is usually available and useful, in order to design informative and cost-effective interventions. However, the more important requirement for accurate statistical analysis of the data is correctly representing the effects of an intervention after it has been performed and the data collected.
The key aspects of this representation are graphical: we assume the experimenter knows that each intervention was an augmentation.²⁰ The quantitative effect of each intervention arc can then be measured directly from the data: so here we assume only that the experimenter discovers that, after all, intervention conditions (a), (b) and (c) were satisfied. In this respect, our epistemic requirements are relatively easy to satisfy. Consequently, the inferences we shall discuss have a broad application.
Korb and Nyberg (2006) showed that under full augmentation the true dag can be uniquely identified (because it is uniquely admissible) and the problem of underdetermination disappears. Our proofs, however, depended upon the fact that in linear dags individual paths are always faithful. We now proceed to explain how these results can be applied to discrete networks. In such networks a second kind of unfaithfulness is possible: unfaithfulness along a single path. To describe lesser degrees of dependence, especially along single paths, we must introduce some additional terminology.

Locality of Dependence
Suppose we have specified a path from one variable to another, e.g., the path from Y to Z formed by Y ← X ← W → Z. For convenience, we shall then call the first variable in such a path (here Y) the head, and the last variable in the path (here Z) the tail, regardless of the direction of causation.²¹ Now, faithfulness is a property that is usually attributed to entire dags. However, it is defined in terms of individual d-connections, so it is straightforward to relativize faithfulness to individual paths:
DEFINITION 4 (Path faithfulness).
A path is faithful if and only if for each conditioning set S that makes it a d-connection, the head is dependent upon the tail for some conditioning values $S = s_k$.
Hence, we can talk about dags that are not fully faithful, but have some faithful paths. More generally, any subdag is faithful if and only if all its paths are faithful. Such local dependencies can be enough to make causal discovery possible. For example, if an uncovered collision is faithful, then it may be identifiable even if the dag is unfaithful elsewhere.²²

Conditions of Dependence
In linear dags, the particular conditioning values $x_i$, $y_j$ and $s_k$ of the variables X, Y and S don't matter much: if there is a dependency for one combination of values, then there will be a dependency for all such combinations. In nonlinear dags, the particular values of the variables can matter a great deal: there may be a dependency for one combination of values, but not for others. Faithfulness requires only that for each d-connecting conditioning set, there is one combination of conditioning values for which there is dependence. However, this suggests two other dependency properties, one logically stronger and one logically weaker.

DEFINITION 5 (Path saintliness).
A path is saintly if and only if for each conditioning set S that makes it a d-connection, the head is dependent upon the tail for all conditioning values $S = s_k$, i.e., $\forall S \forall i \forall j \forall k [P(x_i \mid y_j, s_k) \neq P(x_i \mid s_k)]$.
DEFINITION 6 (Path relevance).
A path is relevant if and only if for some conditioning set S that makes it a d-connection, the head is dependent upon the tail for some conditioning values $S = s_k$.
So in linear dags, any faithful path is also saintly; whereas in nonlinear dags, a path may be faithful without being saintly (but not conversely). Similarly, a path may be relevant without being faithful (but not conversely), e.g., the direct connection between the pill and thrombosis in the neutral Hesslow case.
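The difference between dependence for all conditioning values and dependence for only some can be seen in a small discrete example. The sketch below considers a single conditioning variable S rather than every d-connecting set, so it illustrates only the inner quantifier of the definitions; the probabilities and the names `dependent_at` and `classify` are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical P(x = 1 | y, s) for a binary head X, binary tail Y and a binary
# conditioning variable S, keyed by (y, s).
P_X1 = {(0, 0): F(1, 5), (1, 0): F(4, 5),   # S = 0: X depends on Y
        (0, 1): F(1, 2), (1, 1): F(1, 2)}   # S = 1: X ignores Y


def dependent_at(s):
    """Dependence given S = s; for binary X and Y with positive probabilities
    this reduces to P(x | y=0, s) != P(x | y=1, s)."""
    return P_X1[0, s] != P_X1[1, s]


def classify(values):
    """Report whether dependence holds for all, some or none of the values of S."""
    hits = [dependent_at(s) for s in values]
    if all(hits):
        return "all"    # what saintliness demands of this conditioning set
    if any(hits):
        return "some"   # enough for faithfulness, but not for saintliness
    return "none"


print(classify((0, 1)))
```

Here the head depends on the tail only when S = 0, so for this conditioning set the connection meets the requirement of faithfulness but not that of saintliness.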

Contributions to Dependence
These definitions are useful for describing dependencies that exist only in limited locations or under limited conditions. However, they leave out something important about the relationship between d-connection and dependence. They only describe when dependence happens to coincide with d-connection. They do not require that the path itself actually contributes to this dependence relation, i.e., that the specified string of causal relationships actually helps to make the head and tail dependent.

DEFINITION 7 (Path effectiveness).
A path is effective under a conditioning set S if and only if it contributes to the dependency relationship between its head and its tail.
Naturally, a path can only be effective under a conditioning set that makes it a d-connection. Given any such d-connecting set S, we can further specify the conditions $S = s_k$ under which it is effective. More abstractly, we can say that the path is always effective (∀k), sometimes effective (∃k), or never effective (∄k), according to how its contribution depends upon the specific conditioning variable values. By default, "an effective path" shall mean "a sometimes effective path, under some d-connecting set S". This makes effectiveness analogous to relevance.
To illustrate the difference between effectiveness and relevance, consider the Hesslow isomorph of Figure 1.3 (i.e., a dag structure isomorphic to the Hesslow case, although the variables and their effects may be quite different). Suppose that the individual arcs X → Y and Y → Z are effective, yet the indirect path X → Y → Z never actually has any effect.²³ Nevertheless, X and Z will still be marginally dependent, due to the effect of the direct connection X → Z. So by definition, the indirect path qualifies as faithful, and therefore relevant. However, it actually makes no contribution to the dependency relationship, and its faithfulness is merely coincidental (in the sense that it depends entirely on the existence of other paths).²⁴
In contrast, consider the original neutral Hesslow case (in which the two paths exactly cancel). The indirect path clearly makes a difference to the dependency relationship between the pill and thrombosis. Marginally, both paths are contributing to the outcome, but the contribution of the indirect path has the unusual consequence of turning dependence into independence, rather than the other way around. So by definition, the indirect path does not even qualify as relevant, despite being effective.
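A discrete chain whose arcs are individually effective, but which transmits no dependence from end to end, is easy to construct. The sketch below shows only the indirect chain X → Y → Z, without the direct arc X → Z of the Hesslow isomorph, and its parameters are hypothetical: X merely shifts probability between two states of Y that Z does not distinguish.

```python
from fractions import Fraction as F

# P(Y | X) over three states of Y: X moves mass between y0 and y1 only.
P_Y = {0: (F(1, 2), F(1, 4), F(1, 4)),
       1: (F(1, 4), F(1, 2), F(1, 4))}
# P(Z = 1 | Y): Z distinguishes y2 from the rest, but not y0 from y1.
P_Z1 = (F(1, 5), F(1, 5), F(4, 5))


def p_z1_given_x(x):
    """P(Z = 1 | X = x), marginalizing over Y."""
    return sum(p_y * p_z for p_y, p_z in zip(P_Y[x], P_Z1))


arc_xy_effective = P_Y[0] != P_Y[1]           # X changes the distribution of Y
arc_yz_effective = len(set(P_Z1)) > 1         # Y changes the distribution of Z
chain_effective = p_z1_given_x(0) != p_z1_given_x(1)
print(arc_xy_effective, arc_yz_effective, chain_effective)
```

Both arcs are effective, yet the chain as a whole is not. This kind of single-path unfaithfulness cannot arise when the path effect is simply a product of nonzero coefficients.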
The basic idea of effective paths is a natural one; and the neutral Hesslow case in particular has elicited similar remarks from Pearl (2000), Hitchcock (2001) and Woodward (2003).²⁵ Moreover, effectiveness is an important property for causal discovery. An effective path makes a difference to the pattern of dependence, a difference which constitutes evidence that the path exists, and so facilitates our discovery of it. Ineffective paths make no such difference, and hence provide no evidence that they exist (even if they are, coincidentally, faithful). Thus, in exploring weaker dependence relations than global faithfulness, it turns out that the minimal requirement for causal discovery is effectiveness over selected paths. For example, if we can condition to make a path the sole possible d-connection between two variables, then not only will any effectiveness in the path induce relevance, but furthermore this relevance must be explained by precisely this d-connection (even if the path is not faithful). Such local relevance is a logically weaker requirement than global faithfulness, and hence it is more likely to be true.

Measuring Contributions
Although the concept of effectiveness is natural, popular and important, it does present a difficulty: how do we detect (and preferably measure) effectiveness? The concepts of saintliness, faithfulness, and relevance present no such difficulty, since they refer only to the net dependency between the head and the tail, which is relatively easy to detect. But where there is more than one d-connection between the head and the tail, detecting the contribution of each path to this net dependency is less straightforward. Without some specification of how to detect (and preferably measure) effectiveness, it is only a vague theoretical notion rather than a precise operational one.
In linear dags, calculating the effect of a d-connecting path is always straightforward: multiplying the path coefficients of all the individual arcs on the path gives a measure of its total effect. There are certainly some simple non-linear cases in which detecting effectiveness is also straightforward. If we already know the neutral Hesslow structure, and we condition upon pregnancy, then the direct path from the pill to thrombosis becomes the sole d-connection. Therefore, any dependency between these two variables must be due to the contribution made by this path. Hypothetically, if this path were ineffective, then the two variables would be independent. Since in fact they are dependent, clearly this path is effective. The same reasoning applies to any other sole d-connection.26
As we have already noted, the same neutral Hesslow structure also provides an easy case for detecting the effect of a path that is not the sole d-connection. When we marginalize over pregnancy, how exactly do we know that the indirect path must be effective? We know that taking the pill increases the risk of thrombosis, both in women who are pregnant and in women who are not. We do not know what proportion of women are pregnant, or how much the odds of thrombosis increase in each condition. Nevertheless, we realize that the marginal direct effect of taking the pill must be to increase the risk of thrombosis. Hypothetically, if the indirect path were ineffective, then we would observe such a marginal increase in risk. Since in fact these variables are marginally independent, we know that the indirect path must be effective (and reducing the risk). Now, the same line of reasoning can certainly be applied to non-cancelling Hesslow structures. However, this requires quantitatively measuring the conditional dependencies, and hence correctly calculating the quantitative marginal direct effect. Then if the observed marginal net effect is different to this calculation, the indirect path must be effective. Similar reasoning can be extended to other dag structures.27 But there are some more complicated cases in which measuring effectiveness is more difficult.28
Fortunately, for our present purposes we do not need to give a completely general rule for measuring effectiveness for any path in any dag under any conditions. It is sufficient that effectiveness can clearly be detected in some simple cases, which include all sole d-connections.29 It would certainly be nice to be able to calculate the total effect of a path in a piecewise fashion from component interactions (in an analogous fashion to multiplying linear path coefficients), as an alternative to observing the resulting dependency of the head on the tail. In §9, we will look more closely at the discrete-state interactions that make paths effective, and present some calculations that can be applied in this way.30
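The linear case mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficient values are hypothetical, and summing the path effects to obtain the net effect is the usual path-analysis convention for linear models, which the text does not spell out.

```python
from math import prod

def path_effect(coefficients):
    """Effect of one path in a linear dag: the product of its arc coefficients."""
    return prod(coefficients)

# Hypothetical coefficients for a Hesslow-style structure (assumed values).
direct = path_effect([0.3])           # pill -> thrombosis
indirect = path_effect([0.6, -0.5])   # pill -> pregnancy -> thrombosis

# Assumed convention: the net effect is the sum over the d-connecting paths.
net = direct + indirect
```

With these values each path is individually effective, yet the net effect vanishes (up to floating-point rounding), which is the linear analogue of the neutral Hesslow case.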

Ineffective Paths
There is another kind of unfaithfulness that occurs only in non-linear dags: unfaithfulness along a single path. Consider an example of Richard Neapolitan (unpublished by him because of the extreme delicacy of his publisher): finesteride, in some studies, has been found to reduce testosterone in rats; further, it is known that a sufficient reduction of testosterone results in erectile dysfunction; however, finesteride treatments had no effect upon erectile dysfunction, because the reduction in testosterone did not pass an effective threshold. Threshold effects are a problem impossible to encounter in linear models, but common elsewhere.
If ineffective chains occur between original nodes, then standard algorithms may still be able to recover the individual arcs (provided that these arcs themselves show sufficient relevance). Moreover, standard faith-based algorithms don't rely upon chain dependencies to orient these arcs, so it would seem that chain irrelevance doesn't matter to them. Similarly, intervention arcs on parents create causal chains, and if these chains are relevant then the resulting dependencies help to confirm that causal chains are present. However, since we can instead rely upon intervention collisions in intervention-based discovery, it would seem that these ineffective chains need not matter much to intervention-based discovery either.
However, the problem of ineffective paths is not confined to chains: it can occur in collisions too. The possibility of ineffective collisions has not been widely appreciated,31 but we can demonstrate it by elaborating upon Neapolitan's finesteride case, as illustrated in Figure 1.4. Suppose that testosterone comes in four levels: high, normal, low, and absent. Both high and normal testosterone result in normal erectile function; low testosterone results in reduced erectile function; and no testosterone results in no erectile function at all. Finesteride only affects testosterone by preventing abnormally high levels from accumulating; thus, it reduces the frequency of high levels and increases the frequency of normal levels, but consequently it has no knock-on effect on erectile function. In contrast, there is a genetic hurdle that only trips up subjects who are already testosterone-challenged: a genetic defect will reduce abnormally low levels to complete absence, and consequently this defect will reduce erectile function. Now, the presence of finesteride is not only irrelevant to erectile function, it isn't relevant to the genetic defect either! For example, if we know that a subject has low testosterone levels, then this reduces the probability that they have the genetic defect; but subsequently discovering that finesteride is absent (or present) doesn't make any difference to this probability. Indeed, regardless of the testosterone level, finesteride never gives us any useful information about the genetic defect. So we have a collision that is always ineffective, and hence always irrelevant: it never exhibits any conditional dependence.
29 In our further work, we will present a more general analysis.
30 We leave open the possibility that there are other ways to measure effectiveness, two of which have already been suggested. Woodward (2003) proposes that, in principle, paths can be isolated by interventions. Alternative paths are blocked by setting variables to fixed values, and then one end of the path in question is wiggled to see what happens at the other. Pearl (2000) has suggested that since single arcs cannot be blocked, we could contrast their positive impact with a hypothetical scenario in which the path has been "deactivated". However, since this is not an observed condition, we need to decide how to adjust the contingency tables appropriately. In contrast to the reasoning in our easy cases, both these suggestions rely on counterfactual scenarios for which we have no immediate data.
31 Intransitivity, as discussed by Hitchcock (2001), is restricted to unfaithful causal chains.

[Figure 1.4: the elaborated finesteride case.]
What problems can such ineffective uncovered collisions create for causal discovery algorithms? If they occur between original nodes, then standard algorithms may once again be able to recover the correct skeleton by recovering the individual arcs (provided these arcs themselves carry sufficient relevance). However, since such collisions do not display any conditional dependencies, standard faith-based algorithms will not identify them as collisions. Moreover, they may make erroneous inferences based on the assumption that the structure is not a collision. For example, the Verma-Pearl Step III might be applied to orient the arcs as a chain. Similarly, ineffective collisions will create problems for intervention-based discovery, since it relies upon relevant intervention collisions.
The problem of ineffective collisions is exacerbated by the existence of ineffective chains and common causes. If some skeleton X − Y − Z shows neither conditional nor unconditional dependence, then it could be any ineffective structure: a chain, a common cause or a collision. So it would seem that we cannot make any valid inference from the absence of dependence! Similarly, it is possible for an intervention on X − Y to be entirely uninformative, if the structure I_X → X − Y could either be an ineffective chain or an ineffective collision.

Ineffective Interactions
Having identified and defined the problem of ineffective arcs, chains and collisions, we proceed to describe more specifically what must happen to the states of their variables in order to create these dependency failures. This yields a better understanding of these difficulties and new criteria for when they will occur, which provide new dependency tests for identifying them.

Likelihoods
Let us begin with the effects of single arcs. Suppose the true structure is X → Y → Z. The probability distribution over this dag can be encoded by specifying an independent or "prior" distribution for X, namely P(X), a conditional or "likelihood" distribution for Y, namely P(Y|X), and a similar likelihood for Z, namely P(Z|Y). Thus the probability distribution for each child depends upon the value of its parents, but not vice versa. The asymmetry of this encoding (according to arc direction) reflects the asymmetry between causes and effects. In particular, we can give a propensity interpretation to the likelihoods (cf. Gillies, 2002): P(Y|X) represents the propensity of various specific values of X to cause specific values of Y. The frequency with which each Y-value occurs depends upon both the prior distribution over X and these propensities, i.e., P(Y) = Σ_x P(x) P(Y|x).
If this form of encoding is sufficient to specify P_R, then the dag has the Markov property. Hence, this formula also entails "screening off": for example, information about a variable's grandparents may be useful (e.g., information about X may tell us something about Z) but it ceases to be useful when the specific values of the parents are known (e.g., the value of Y). Another corollary of this formula is that the probability distribution over each variable (say P(Y)) must be just a weighted average, namely the sum of the prior probability of each state of the parent (P(X = x_0)) multiplied by the associated likelihood distribution over the child (P(Y|X = x_0)). Despite the fact that children have no causal influence over their parents, information about a variable's children (say Z = z_0) may be useful diagnostically to infer P(Y), using Bayes' theorem: P(y_0|z_0) = P(y_0) P(z_0|y_0) / P(z_0).
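The encoding just described can be made concrete with a small sketch. All numbers below are hypothetical, and the helper names `push_forward` and `posterior_parent` are introduced here purely for illustration; the sketch shows the weighted-average corollary and the diagnostic use of Bayes' theorem for a binary chain X → Y → Z.

```python
# Hypothetical parameters for a binary chain X -> Y -> Z (assumed values).
p_x = [0.7, 0.3]                          # prior P(X)
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]    # row i is P(Y | X = x_i)
p_z_given_y = [[0.6, 0.4], [0.1, 0.9]]    # row j is P(Z | Y = y_j)

def push_forward(parent_dist, likelihood):
    """Child distribution as a weighted average of the likelihood rows."""
    n_child = len(likelihood[0])
    return [
        sum(parent_dist[i] * likelihood[i][k] for i in range(len(parent_dist)))
        for k in range(n_child)
    ]

def posterior_parent(parent_dist, likelihood, child_state):
    """Bayes' theorem: P(parent | child = child_state)."""
    joint = [parent_dist[i] * likelihood[i][child_state]
             for i in range(len(parent_dist))]
    total = sum(joint)
    return [j / total for j in joint]

p_y = push_forward(p_x, p_y_given_x)      # P(Y)
p_z = push_forward(p_y, p_z_given_y)      # P(Z), using only P(Y) and P(Z | Y)
p_y_given_z0 = posterior_parent(p_y, p_z_given_y, 0)  # diagnostic inference
```

Note that `p_z` is computed from `p_y` alone; once the distribution over Y is fixed, the prior over X plays no further role, which is the screening-off property in miniature.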
Given the likelihoods, there is no mystery about how to detect the effectiveness of a single arc such as X → Y. Its effect consists in how it changes the likelihood ratios at Y. It has no effect if it makes no difference to these ratios. We do not need to imagine how this contrasting non-effect situation might be realized; we only need to inspect the likelihood ratios to see that they are not constant for all values of X.
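One way to operationalize this inspection is sketched below. The function name and tolerance are assumptions made for illustration; the check treats an arc as effective when the likelihood distribution over the child differs for at least two parent values, which is equivalent to the likelihood ratios not all being equal to one.

```python
def arc_is_effective(likelihood, tol=1e-12):
    """True if P(Y | X = x) differs for at least two values of X.

    `likelihood` is a list of rows, row i being P(Y | X = x_i).
    """
    first = likelihood[0]
    return any(
        abs(p - q) > tol
        for row in likelihood[1:]
        for p, q in zip(row, first)
    )

# Hypothetical tables (assumed values).
effective_arc = [[0.9, 0.1], [0.2, 0.8]]
ineffective_arc = [[0.5, 0.5], [0.5, 0.5]]
```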

Tail Equivalence
Consider an ineffective chain, say X → Y → Z, which for simplicity is disconnected (i.e., not connected to any other nodes). Because this is a chain, X cannot affect Z directly; it can only do so indirectly by affecting Y. Indeed, any particular probability distribution over the states of Y entails a probability distribution over the states of Z. Because both individual arcs are effective, we know that some changes in the value of X do cause changes in the Y-distribution, and similarly some changes here do cause changes in the Z-distribution. But not all distinct Y-distributions necessarily result in unique Z-distributions: some may produce exactly the same Z-distribution; so there may be distinct Y-distributions that are Z-equivalent. The point is: no X-change will result in any Z-change if and only if every effective X-change merely replaces one Y-distribution with another that is Z-equivalent. In contrast, if X changes do result in Z-non-equivalent distributions over Y, then Z must be dependent on X. Roughly, if X wiggles Y in the right way, then we must see Z move.
In the finesteride example, there are levels of testosterone (high and normal) that are equivalent with respect to erectile function (tail-equivalent states). Finesteride forms an ineffective chain simply because it only "transfers" some probability from one such state to another, leaving the total probability of this tail-equivalence class unchanged. However, tail-equivalent distributions can be produced even if there are no tail-equivalent states.
Consider a primary school in which students are placed in a fourth-grade class simply according to their ages. Invariably, a few "gifted" students are already performing at a fifth-grade level, while a few "deprived" students are still performing at a third-grade level. At the end of each year, the students all sit a nationwide test, which each student may either pass or fail. The ability of a student affects their chance of passing: gifted students usually pass, deprived students usually fail, and the remainder (let us say) have a 50% pass rate. Hence, the number of students in each category (the distribution of abilities) clearly affects the overall pass rate of the class (the distribution of results). Now, suppose that a new school principal decides to intervene, by reclassifying the new year's prospective fourth-grade students according to their abilities. Gifted students will skip up to the fifth grade; deprived students will repeat third grade. Clearly this new classification regime will affect the distribution of abilities. However, it may turn out that the overall pass rate of the class remains the same, simply because the loss of gifted students is exactly balanced by the loss of deprived ones. So we have an ineffective chain in which the intermediate variable has no tail-equivalent states, but does have tail-equivalent distributions.
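The school example can be checked numerically. The sketch below uses hypothetical figures throughout: the text fixes only the 50% pass rate for the remainder, so the pass rates for gifted and deprived students and both ability distributions are assumptions chosen to make the balance exact.

```python
# Hypothetical pass rates P(pass | ability); only the 0.5 comes from the text.
pass_rate = {"gifted": 0.9, "normal": 0.5, "deprived": 0.1}

# Hypothetical ability distributions in the fourth-grade class.
age_based = {"gifted": 0.1, "normal": 0.8, "deprived": 0.1}
ability_based = {"gifted": 0.0, "normal": 1.0, "deprived": 0.0}

def overall_pass_rate(ability_dist):
    """P(pass) as the weighted average of the per-ability pass rates."""
    return sum(ability_dist[a] * pass_rate[a] for a in pass_rate)

rate_before = overall_pass_rate(age_based)
rate_after = overall_pass_rate(ability_based)
```

No two ability levels share a pass rate, so there are no tail-equivalent states; the two ability distributions nevertheless yield the same overall pass rate, so they are tail-equivalent distributions.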
For any particular value of Y, say y_j, there is some probability distribution over all the states of Z, P(Z|y_j). Given some more diffuse probability distribution over the states of Y, P_i(Y), the resulting probability distribution over Z is just the weighted average of the probability distributions over Z that result from particular values of Y. Thus:
$$P_i(Z) = \sum_j P_i(y_j)\, P(Z \mid y_j)$$

QED
This theorem is a logical criterion, which relates the properties of being a chain, being disconnected, and being effective to substituting tail-equivalent distributions.
It can be used to generate various tests: if some of these properties can be measured, while others are unknown, then applying the criterion can be informative.
We demonstrate one such test in §10. This result can easily be extended to common causes, or reversed chains. They exhibit the same dependency pattern, produced by similar interactions; and hence the tail-equivalence criterion applies to them too.

Absolute Non-Interaction
Consider a disconnected ineffective collision, say X → Z ← Y. Because this is an uncovered collision, X and Y do not affect each other directly, so X is marginally independent of Y. However, X does affect Z. Suppose for argument's sake that we are interested in the probability of some particular state of X, say x_0. We discover that Z = z_0 and so update our prior probability P(x_0) to P(x_0|z_0). Now, would it make any difference if we also discovered that Y = y_0? In this instance, by definition X is conditionally independent of Y if and only if P(x_0|z_0, y_0) = P(x_0|z_0). So the conditional dependency formula is expressed in terms of whether the probability of one parent changes, given information about the other parent (and assuming that the value of the collider is known). But for disconnected uncovered collisions, this can be reformulated in terms of the relative likelihood of the collider value, to see whether one parent affects the likelihood given the other.
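DEFINITION 9 (Absolute non-interaction). A collision X → Z ← Y is absolutely non-interactive if and only if
$$\frac{P(z_k \mid x_i, y_j)}{P(z_k \mid \neg x_i, y_j)} = \frac{P(z_k \mid x_i, \neg y_j)}{P(z_k \mid \neg x_i, \neg y_j)} \tag{1.1}$$
for all values x_i, y_j, z_k.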
THEOREM 1.3. Disconnected collisions show no conditional dependence if and only if they are absolutely non-interactive.
Proof. Suppose there is some disconnected collision X → Z ← Y. Let the probability of x_0 be specified in terms of the odds ratio P(x_0)/P(¬x_0). Now, when we learn that z_0, we can update the probability of x_0 using the odds-likelihood formulation of Bayes' theorem:
$$\frac{P(x_0 \mid z_0)}{P(\neg x_0 \mid z_0)} = \frac{P(z_0 \mid x_0)}{P(z_0 \mid \neg x_0)} \cdot \frac{P(x_0)}{P(\neg x_0)}$$
But the updating likelihood ratio, expressing the probability that x_0 will result in z_0, can be subdivided into two cases: one where y_0, and the other where ¬y_0:
$$\frac{P(z_0 \mid x_0)}{P(z_0 \mid \neg x_0)} = \frac{P(z_0 \mid x_0, y_0)\,P(y_0) + P(z_0 \mid x_0, \neg y_0)\,P(\neg y_0)}{P(z_0 \mid \neg x_0, y_0)\,P(y_0) + P(z_0 \mid \neg x_0, \neg y_0)\,P(\neg y_0)} \tag{1.2}$$
If we learn that y_0, then P(y_0) = 1 and P(¬y_0) = 0, so this simplifies to P(z_0|x_0, y_0)/P(z_0|¬x_0, y_0). Whereas if we learn that ¬y_0, then P(y_0) = 0 and P(¬y_0) = 1, so this simplifies to P(z_0|x_0, ¬y_0)/P(z_0|¬x_0, ¬y_0). So learning y_0 is irrelevant just in case these two likelihood ratios are equal. Learning anything about the value of Y is always irrelevant just in case these ratios are equal for all values z_k, x_i, y_j. QED
Part of the point of this reformulation is to shed light on what kind of interaction can produce an ineffective collision, by interpreting the formula as follows. x_0 has a certain tendency to bring about z_0 (relative to other states of X) which can be expressed as P(z_0|x_0)/P(z_0|¬x_0). The critical issue is whether or not y_0 affects this tendency. On the left-hand side we have the tendency given y_0, and on the right-hand side we have the tendency given ¬y_0. If these are the same, then y_0 does not affect how x_0 affects z_0. If this is true for all the states of X, Y and Z, then we can say that the two parents affect the child in completely independent ways: one has no effect on the effect of the other.
Now, absolute interaction must be carefully distinguished from relative interaction. In linear dags, for example, changes in each parent increase or decrease the value of the child by the same proportion, regardless of the effect of the other parent. Thus, linear dags are usually said to be non-interactive. However, this is only relative non-interaction: the increase or decrease is relative to a value which is due to the other parent. One parent does not simply raise the probability of certain child values, regardless of the other parent. Rather, the values that become more probable depend upon the combined effect of both parents. Our variety of non-interaction, in contrast, is absolute. These two properties are mutually exclusive. By our standard, linear or quasi-linear dags are absolutely interactive, and hence will always display conditional dependency.
As before, this theorem specifies a logical criterion that relates the properties of being a collision, being disconnected, being effective, and being absolutely non-interactive. It can be used to generate various tests, and we demonstrate one such test in §10.
In the finesteride example, there are levels of testosterone (high and normal) that are equivalent with respect to the genetic factor (tail-equivalent states). Finesteride forms an ineffective collision simply because it only "transfers" some probability from one such state to another. However, absolute non-interactions can be produced even if there are no such states.
Consider the risk of having a heart attack (i.e., a condition where some obstruction, such as a blood clot, prevents the heart muscle from receiving enough blood to function properly). Exercise is known to reduce this risk, by improving the fitness of the heart (i.e., increasing its size and blood supply). Low-dose aspirin is also known to reduce this risk, by reducing clotting. But the benefit of exercise may well be completely independent of the benefit of aspirin. Specifically, suppose that exercise halves the probability of a heart attack, whether a patient takes aspirin or not. Similarly, aspirin may reduce the probability of a heart attack by one third, whether a patient exercises or not. This is illustrated in Table 1. Thus the two effects do not interact in any way whatsoever; the state of one is absolutely irrelevant to the effect of the other. Now suppose we are informed that some unknown patient, call him Jones, has died of a heart attack. Like good Bayesians, we update our probabilities about Jones, calculating that it is now less probable that he exercised or took aspirin. But then we are told that Jones did, after all, take aspirin. Does this change the odds that he exercised? Not at all!32 Moreover, the same would apply if we knew that Jones was alive: Jones' pill-popping (or lack of it) would not help us to predict his exercise regime. So we have an always-ineffective collision, despite having no tail-equivalent states.
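The update for Jones's fatal heart attack can be sketched as follows. The baseline risk, the prior probability of exercise, and the proportion of aspirin takers are all hypothetical; the only figures taken from the text are that exercise halves the risk and aspirin reduces it by one third. The sketch also assumes, as for any uncovered collision, that exercise and aspirin are marginally independent.

```python
BASELINE = 0.3  # hypothetical P(heart attack | no exercise, no aspirin)

def p_attack(exercise, aspirin):
    """P(heart attack | exercise, aspirin) with independent multiplicative benefits."""
    risk = BASELINE
    if exercise:
        risk *= 0.5        # exercise halves the risk
    if aspirin:
        risk *= 2 / 3      # aspirin reduces the risk by one third
    return risk

def posterior_odds_exercised(prior_exercise, aspirin):
    """Odds that Jones exercised, given a heart attack and known aspirin status."""
    prior_odds = prior_exercise / (1 - prior_exercise)
    likelihood_ratio = p_attack(True, aspirin) / p_attack(False, aspirin)
    return likelihood_ratio * prior_odds

def posterior_odds_exercised_aspirin_unknown(prior_exercise, p_aspirin):
    """Same odds when aspirin status is unknown and marginalized out."""
    prior_odds = prior_exercise / (1 - prior_exercise)
    numerator = (p_attack(True, True) * p_aspirin
                 + p_attack(True, False) * (1 - p_aspirin))
    denominator = (p_attack(False, True) * p_aspirin
                   + p_attack(False, False) * (1 - p_aspirin))
    return (numerator / denominator) * prior_odds

odds_unknown = posterior_odds_exercised_aspirin_unknown(0.4, p_aspirin=0.25)
odds_aspirin = posterior_odds_exercised(0.4, aspirin=True)
odds_no_aspirin = posterior_odds_exercised(0.4, aspirin=False)
```

All three odds coincide (up to rounding): the likelihood ratio for exercise is one half whatever we know about aspirin, so learning that Jones took aspirin leaves the odds that he exercised unchanged.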

Dag Context
These results for two-arc chains and collisions can easily be extended to paths of any length, since longer paths are concatenations of two-arc interactions. The key idea is that independence is always transitive (unlike dependence). For example, if any two-arc component is completely ineffective in connecting its own head and tail, then there cannot be any dependence between the head and tail of the longer path that includes it. Furthermore, even if every two-arc component is effective in connecting its own head and tail, if the head of the longer path produces only tail-equivalent distributions over any intermediate variable (apart from a collider), then the longer path will be ineffective.
Although extending paths is unproblematic, placing them in a broader context can create complications. If a path is connected to other, peripheral variables, then they can affect the distribution of path variables, or indeed the effect that one path variable has upon another. Consequently, we must be careful to distinguish the effect of the path under some conditions S = s_j from its effect under other conditions S = s_k. Peripheral variables can also participate in alternative paths between path variables, making it harder to distinguish the effect of one path from another. We shall define conditions under which this does not occur:

DEFINITION 10 (Path isolation).
A path is isolated for some conditioning set S that makes it a d-connection, if and only if there are no other d-connections between any pair of path variables (outside the path itself).
32 In the odds-likelihood form of Bayes' theorem we have posterior odds = λ × prior odds, where λ = 1/2 is the likelihood ratio. There is good evidence that most humans are bad at reasoning with probability, and hence there is a tempting fallacy lurking here. A heart attack is already unlikely with aspirin, and extremely unlikely with both aspirin and exercise. So when we learn that Jones took aspirin, it is tempting to conclude that Jones is now less likely to have exercised, since then his unfortunate end would be an extremely improbable event. However, it is the relative odds that matter here, not (in itself) the absolute improbability of Jones' misfortune.
An isolated path must be the sole d-connection between its head and its tail (but not conversely). The preceding results for disconnected chains and collisions hold for any isolated path, even if it is connected to peripheral variables. Hence these results provide necessary and sufficient conditions for both isolated dependence and isolated effectiveness.
We are particularly interested in the possible effects of peripheral variables on the effectiveness of an isolated collision, say X → Z ← Y, and make the following summary observations.
1. The existence of collider dependence is not affected by changing the (non-degenerate) probability distributions over the colliding parents. The conditional dependence between X and Y depends only upon the likelihood ratios at Z, and hence not upon the distribution over Y, for example. The only proviso (which we have already assumed) is that all states of Y must occur with some positive probability, so that the relevant likelihood ratios are well-defined and measurable.
2. The existence of collider dependence is not affected by changing the probability distributions over the relatives of colliding parents. This is a corollary of the preceding point, since such changes can only affect the probability distributions over the colliding parents, and only by substituting one non-degenerate probability distribution for another.
3. The existence of collider dependence is not affected by any collider children. The conditional dependence between X and Y relies upon conditioning on Z, so under such circumstances children of Z are always screened off.
4. The existence of collider dependence can be affected by other parents of a collider. Suppose that there is some variable W that is a parent of Z, and is otherwise disconnected from the collision. Then the likelihood at Z is a function of all three parents, so W can interfere with how X interacts with Y, and hence affect the conditional dependency between them. In the heart attack case, for example, a patient's sex could make a difference to the interaction between exercise and aspirin: their effects might still be independent for men like Jones, but mutually reinforcing for women. So in such cases we must distinguish the dependence of X on Y conditional upon Z but marginalizing over W from more specific dependencies, obtained by conditioning on various alternative values of W.
We shall define conditions under which this kind of parental interference does not occur:
DEFINITION 11 (Fundamental conditions). Fundamental conditions for a collision X → Z ← Y are any conditions S which marginalize over X and Y, but condition upon Z and any other parents of Z.
5. Under any fundamental conditions S = s_k, the existence of collider dependence is not affected by the probability distribution over any other variable in the dag. Fundamental conditions specify the values of the collider's other parents (e.g., W), so they screen off the influence of any other relatives of these parents. Together with the preceding points, it follows that the influence of all other variables in the dag is screened off. The fundamental likelihood ratios of a collider are an intrinsic property of the parameterization at the collider, and hence a relatively stable target for causal discovery.
In some discovery contexts, we may have limited or no information about what variables may be other parents of the collider. We define the following kind of conditioning for such cases:
DEFINITION 12 (Parental conditions). Parental conditions for a collision X → Z ← Y are any conditions S which marginalize over X and Y, but condition upon Z and any other possible parents of Z.
Obviously parental conditions are always fundamental conditions. Parental conditions may involve some superfluous conditioning upon additional nodes, but this cannot affect the existence of the corresponding fundamental effect.
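Observation 4 and the role of fundamental conditions can be illustrated with a sketch that extends the heart attack case by a third parent, sex. The parameters are hypothetical, including the extra halving that makes the two benefits mutually reinforcing for women, and the marginal calculation assumes that sex is independent of exercise and aspirin (i.e., otherwise disconnected from the collision).

```python
def p_attack(exercise, aspirin, female):
    """Hypothetical P(heart attack | exercise, aspirin, sex)."""
    risk = 0.3
    if exercise:
        risk *= 0.5
    if aspirin:
        risk *= 2 / 3
    if female and exercise and aspirin:
        risk *= 0.5   # assumed reinforcement, present only for women
    return risk

def exercise_likelihood_ratio(aspirin, female):
    """Likelihood ratio for exercise, conditioning on the other parent W (sex)."""
    return p_attack(True, aspirin, female) / p_attack(False, aspirin, female)

def marginal_likelihood_ratio(aspirin, p_female):
    """Likelihood ratio for exercise when W is marginalized out."""
    def averaged(exercise):
        return (p_female * p_attack(exercise, aspirin, True)
                + (1 - p_female) * p_attack(exercise, aspirin, False))
    return averaged(True) / averaged(False)

men = [exercise_likelihood_ratio(a, female=False) for a in (True, False)]
women = [exercise_likelihood_ratio(a, female=True) for a in (True, False)]
pooled = [marginal_likelihood_ratio(a, p_female=0.5) for a in (True, False)]
```

Conditioning on sex separates a non-interactive stratum (men, equal ratios) from an interactive one (women, unequal ratios), whereas the pooled ratios blend the two.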

Diagnosis
What can we do about unfaithful paths? In the next section, we shall show how unfaithfulness in the original dag can be overcome with effective interventions, which create informative new dependencies. Of course, if the interventions themselves are ineffective, then no statistical trick can make all of them informative. However, in some cases it is nevertheless possible to make informative inferences from a lack of dependence. Our criteria for ineffective chain and collision interactions can help us to diagnose which kind of ineffective relationship is present, whether it occurs in the intervention or in the original dag.
Suppose that an intervention arc I_X → X forms an ineffective structure I_X → X − Y, but that both individual arcs are always effective and there are no paths cancelling X − Y, and hence there is no difficulty in recovering the correct skeleton from the data. In that case, the absence of both conditional and unconditional dependency means that we cannot immediately orient the original arc, as noted earlier, since this is consistent with both an ineffective collision and an ineffective chain. However, closer examination of the effects may expose the difference.
The data will show what effects I_X has on the probability distributions over the states of X, and also which such X-distributions are Y-equivalent. If the structure is in fact an ineffective chain, then the data must show that I_X makes no Y-relevant changes to X (the tail-equivalence test). Otherwise, the structure must instead be an ineffective collision. For example, suppose that administering low-dose aspirin to patients is the intervention, and the orientation of the exercise–heart attack link is unknown. As discussed, there is no conditional dependence (since it is a collision that is absolutely non-interactive), and neither is there any unconditional dependence (simply because it is a collision). But applying the tail-equivalence criterion shows that it cannot be an ineffective chain: aspirin lowers the frequency of heart attacks, and this in turn is correlated to exercise, so if increasing the risk of heart attacks really reduced exercise, then we ought to see a marginal dependence. We don't; so it's not a chain, it's a collision.
Similarly, the data will show whether or not the effect of I_X on X ever makes a difference to the likelihood of a particular X-state occurring together with a particular Y-state. If the structure I_X → X − Y is in fact an ineffective collision, then the data must show that I_X has no such effect (the absolute non-interaction test). Otherwise, the structure must instead be an ineffective chain. For example, suppose that reclassifying fourth-grade students is the intervention, and the orientation of the ability–pass rate link is unknown. As discussed, there is no marginal dependence (since it is a chain that creates only tail-equivalent distributions), and neither is there any conditional dependence (simply because it is a chain). But applying the absolute non-interaction criterion shows that it cannot be an ineffective collision. Under the age-based classification, if a student passes the end-of-year test, then this results in a negligible likelihood that the student is deprived, and slightly higher likelihoods that the student is normal or gifted. In contrast, under the ability-based classification, if a student passes the end-of-year test, then this results in negligible likelihoods that the student is either deprived or gifted, since virtually all students have normal ability in this scenario. So, if passing the test really increased a student's ability, then this effect would be absolutely interacting with reclassification, and we ought to see a conditional dependence between them. We don't; so it's not a collision, it's a chain.33
This additional pair of tests will still not be conclusive in all cases. There are some structures, involving equivalent states, which will pass both the tail-equivalence and the absolute non-interaction tests, so that the ambiguity between a chain and a collision remains. For example, suppose that administering finesteride is the intervention, and the orientation of the testosterone–genetic defect link is unknown. This recalcitrant structure shows no dependence, and is consistent with either explanation. However, combining the usual dependency tests with these additional unfaithfulness tests does provide a more powerful tool for orienting arcs in a dag that may be unfaithful. These new tests can be applied to other unfaithful interactions in the dag, not only those resulting from interventions. Moreover, even if an algorithm takes faithfulness as a working assumption, it can still include an "unfaithfulness module" that can be employed where unfaithfulness is demonstrable. Wherever there is an isolated two-path where arcs are unoriented and they display neither marginal nor conditional dependency, then this is a demonstrably unfaithful structure that invites closer inspection.
There are general types of case where the tail-equivalence test cannot possibly be satisfied, and hence we can readily identify the true relationship. Where X is a binary variable and both arcs are effective, ineffective chains can never occur; so given an ineffective relationship at X, there is no need to inspect the correlations in order to diagnose an ineffective collision (e.g., in our heart attack case). Regardless of how many states X has, if the intervention sets each state of X exhaustively, then once again ineffective chains can never occur. Similarly, where the intervention is overwhelming and sets a sufficient number of sufficiently different probability distributions over the states of X, a little linear algebra shows that Y must be affected regardless of its precise chain dependence upon X. Interventions will also form effective chains if they are quasi-linear.
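The linear-algebra point can be illustrated with a minimal sketch. It assumes a hypothetical three-state ability variable X (deprived, normal, gifted) and a binary pass variable Y, with invented conditional probabilities that are not taken from the classification example. The marginal over Y is a linear function of the distribution over X, so two distributions over X are Y-equivalent exactly when the conditional probability table maps them to the same distribution over Y. The function names are illustrative only.

```python
from fractions import Fraction as F

def y_marginal(cpt, p_x):
    """Return P(Y) given rows cpt[x] = P(Y | X = x) and a distribution p_x over X."""
    n_y = len(cpt[0])
    return [sum(p_x[x] * cpt[x][y] for x in range(len(p_x))) for y in range(n_y)]

def is_y_equivalent(cpt, p_on, p_off):
    """True if the On and Off distributions over X induce the same P(Y)."""
    return y_marginal(cpt, p_on) == y_marginal(cpt, p_off)

# Hypothetical numbers: P(fail, pass | ability) for deprived, normal, gifted.
cpt = [[F(4, 5), F(1, 5)],
       [F(2, 5), F(3, 5)],
       [F(0), F(1)]]

p_off = [F(1, 3), F(1, 3), F(1, 3)]   # intervention Off
p_on_normal = [F(0), F(1), F(0)]      # On: every student set to "normal"
p_on_gifted = [F(0), F(0), F(1)]      # On: every student set to "gifted"

# The first setting happens to be Y-equivalent to Off (an ineffective chain).
ineffective = is_y_equivalent(cpt, p_on_normal, p_off)
# A second, sufficiently different setting breaks the equivalence.
effective = not is_y_equivalent(cpt, p_on_gifted, p_off)
```

With these invented numbers a single setting leaves Y untouched, whereas a second, different setting cannot also do so unless Y does not depend on X at all; this is the sense in which exhaustive or sufficiently varied settings force an effect on Y.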
Scrambling interventions are an interesting special case. Where they form intervention chains, these are very likely to be effective. However, they will not necessarily be so, since through bad luck the On distribution imposed on X may be Y-equivalent to the Off distribution. Nonetheless, even this does not usually impede discovery, because X remains correlated with Y under intervention, and this is different from the dependency pattern that such interventions induce in collisions.
Scrambling interventions have the notable advantage that they always ensure an effective collision. Whatever effect Y may have on X, this is obviously not independent of I_X, since I_X completely overrides Y when it is On, destroying any correlation between X and Y. Similarly, exhaustive setting interventions must create an effective collision, as must quasi-linear interventions.
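A small exact calculation can illustrate why a scrambling intervention yields an effective collision. The sketch assumes a hypothetical binary arc Y → X and a scrambling intervention that, when On, replaces the distribution of X with a fixed one that ignores Y. All numbers and names are invented for illustration.

```python
from fractions import Fraction as F

P_Y1 = F(1, 2)                            # hypothetical P(Y = 1)
P_X1_GIVEN_Y = {0: F(1, 5), 1: F(4, 5)}   # hypothetical P(X = 1 | Y) when Off
P_X1_SCRAMBLED = F(1, 2)                  # hypothetical P(X = 1) when On

def p_x1(y, on):
    """P(X = 1 | Y = y, I_X): the On state overrides Y completely."""
    return P_X1_SCRAMBLED if on else P_X1_GIVEN_Y[y]

def posterior_y1_given_x1(on):
    """P(Y = 1 | X = 1, I_X) by Bayes' rule; I_X is independent of Y a priori."""
    joint_y1 = P_Y1 * p_x1(1, on)
    joint_y0 = (1 - P_Y1) * p_x1(0, on)
    return joint_y1 / (joint_y1 + joint_y0)

post_off = posterior_y1_given_x1(on=False)
post_on = posterior_y1_given_x1(on=True)

# I_X and Y are dependent conditional on X whenever these posteriors differ.
collision_is_effective = post_off != post_on
```

When the intervention is On, X carries no information about Y, so the posterior collapses to the prior; when it is Off, the arc Y → X makes X informative. The two posteriors therefore differ whenever that arc is effective, which is the conditional dependence that marks the collision.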

Design
Consideration of our mathematical criteria and their relationship to these cases naturally gives rise to corresponding recommendations for appropriate experimental design, in order to avoid ineffective interventions from the outset. Exhaustive setting, scrambling, and quasi-linear interventions are clearly powerful. More delicately, to orient some known arc X − Y we could try to influence X in a Y-relevant way, which should therefore affect Y if the arc is X → Y.
However, we are reluctant to trumpet such recommendations, for a number of reasons. First, we must already know the likely effects of an intervention upon the target variable in order to decide in advance whether it is likely to be informative. Second, to design delicate, customized interventions it is also necessary to have prior information (either from observational data, previous experiments, or theoretical knowledge) about the relationship between original variables such as X and Y. Third, we must consider the relative costs of various designs. In real experimental situations, there are usually many other factors that determine the most cost-effective design, so it would be foolish to design solely to guarantee effectiveness. For example, scrambling interventions are attractive, but as already noted, many medical trials are not of this kind, and for very good reasons. Fourth, the cheapest strategy may well be to intervene first, and if the intervention is insufficiently informative, to try something else (the venerable "suck it and see" strategy). Fifth, we hope (perhaps optimistically) that experimental scientists generally manage to avoid uninformative interventions (certainly, they generally avoid publishing uninformative results!). If so, then the potential application of our analysis is primarily to automated causal discovery algorithms, rather than to scientists' non-automated ones.
In this paper, our modest aim has been to identify some valid discovery inferences that any algorithm may make. We have not yet attempted to implement these inferences in any particular algorithm. Refining causal discovery algorithms for applications where faithfulness is uncertain, so that they can diagnose and deal with unfaithfulness most appropriately, is a remaining challenge. It is clear that existing algorithms are not optimal for this purpose. For example, the Verma-Pearl algorithm would orient an ineffective intervention collision incorrectly as an ineffective intervention chain, and not apply the tail-equivalence test to see if the latter possibility could be eliminated. Nonetheless, existing algorithms are an adequate tool for overcoming unfaithfulness under certain conditions. Where the possibility of unfaithfulness arises, it is sufficient to gather additional data from appropriate, effective interventions, and represent such interventions correctly through augmentation. All existing algorithms will then be able to identify the correct dag in the usual way. Similar comments apply to the job of orienting arcs that would otherwise be left undirected.

Discovery by Intervention
Having identified the criteria for effective chains and collisions, we can now explain how discovery can be done using interventions on discrete networks. Regardless of how it is represented, intervention aids discovery by creating a new pattern of dependency in P_R, which is consistent with some candidate dags but not with others. When intervention is represented by augmentation, the addition of the intervention arc creates new paths that create new possible dependencies. As noted earlier, we assume that the experimenter knows the graphical properties of the intervention (after the experiment, if not before). So whichever candidate dag is correct, the new paths all begin in the same way: with the intervention arc. However, some paths must then differ according to the different structures of the competing dags, and hopefully such paths will be effective and result in idiosyncratic dependencies.34

Interventions with faith but no structure
Consider discovery scenarios where we assume faithfulness, and as it turns out, R′ is in fact faithful. Suppose that we have no observational data or other information about the structure of the dag prior to the experimental intervention.
Consider any pair of nodes X and Y. There are only three possibilities: either (a) X ← Y, or (b) X → Y, or (c) they are unconnected, X : Y. Suppose we intervene with I_X → X.
LEMMA 1.4. If we can correctly assume that R′ is faithful, then each intervention necessarily identifies all the original parents of its target node.
Proof. If the intervention node is connected to Y by a collision (I_X → X ← Y), then this faithful collision can be discovered in the usual way. Specifically, the direct connection between X and Y will exhibit dependence under all conditions and is the only structure that can account for this dependency. The intervention arc is known a priori. Its collision will exhibit dependence under all conditions including X, and this will orient the arc X ← Y. QED
LEMMA 1.5. If we can correctly assume that R′ is faithful, then each intervention necessarily identifies all the original children of its target node.
Proof. If the intervention node is connected to Y in a chain (I_X → X → Y), then this faithful chain can be discovered in the usual way. Specifically, the direct connection between X and Y can be discovered as in Lemma 1.4; the chain will exhibit marginal dependence under all conditions excluding X, and this will orient the arc X → Y. QED
LEMMA 1.6. If we can correctly assume that R′ is faithful, then each intervention necessarily identifies all the original disconnections of its target node.
Proof. If the intervention node is not connected to Y by any two-arc path (I_X → X : Y), then this disconnection can be discovered in the usual way. Specifically, the universal absence of dependency between X and Y can only be explained by a disconnection in a faithful dag. (The intervention arc is not required in this case.) QED
THEOREM 1.7. If we can correctly assume that R′ is faithful, then each intervention necessarily identifies all the original parents, children, and disconnections of its target node.
Proof. This follows from Lemmas 1.4, 1.5, and 1.6. QED
Of course, even without the intervention we can always identify the correct pattern, which includes all these direct connections, simply by gathering enough observational data. The point of intervention is to ensure that these connections are correctly oriented. How many more arcs might an intervention orient? This depends in a complex way on the structure of the dag. We shall just point out one basic case. If the pattern is X − Y − Z, then intervening on Y will always orient both arcs. However, intervening on X (or alternatively, Z) may or may not achieve the same outcome. If X → Y, then the two alternative orientations of Y − Z will display different dependencies; but if X ← Y then they will not.35
LEMMA 1.8. If we can correctly assume that R′ is faithful, then for any K-sized subset of the N variables, K interventions are sufficient to identify all arcs connected to the K variables.
Proof. Suppose that one intervention is performed upon each of the K variables. The K intervention arcs are known a priori. All arcs to and from each of the K variables are identified by Theorem 1.7. QED
A corollary is that N interventions can identify R′.36 Under our augmentation assumptions each of the interventions is independent of the others. Thus, data from full augmentation can be sampled in one large experiment, in which all interventions are varied simultaneously but independently. However, the same result obviously follows from performing all these interventions in any sequence of subsets, since each intervention is independently informative.
THEOREM 1.9. If we can correctly assume that R′ is faithful, then N − 1 interventions are sufficient (and in the worst case necessary) to identify R′.
Proof. Suppose that by Lemma 1.8, N − 1 interventions identify all arcs connected to the N − 1 nodes. Any arc connected to the Nth variable must also be connected to one of the other N − 1 nodes, and thereby can be identified. So for N variables, N − 1 interventions are always sufficient. Suppose that we only intervene upon N − 2 variables. In the worst case, there will be a direct connection between the two remaining variables that can be oriented in either direction without changing any uncovered collisions. (For example, in a fully connected dag any two variables will be such a case.) So in the worst case N − 1 interventions will be necessary. QED
In Theorem 1.9, we follow Eberhardt et al. (2005; 2006), who proved that N − 1 interventions are sufficient when using scrambling interventions and assuming faithfulness. Their result remains valid in our framework as a special case.37
35 Suppose that we wish to employ a sequence of interventions, in order to orient all the arcs of a faithful dag. How do we minimize the number of interventions? Murphy (2001) has presented a general Bayesian, decision-theoretic approach for choosing a particular scrambling intervention to maximize the amount of information gained. But what is the best simple heuristic for using our interventions? Our discussion certainly indicates that it is better to perform one intervention at a time, so that the next intervention can be selected on the basis of the information previously gained (although such sequential intervention will often not minimize the expected experimental costs). Furthermore, intervening on the node participating in the most undirected arcs is attractive, as is intervening in the middle of chains.
36 This result is similar to those of Spirtes et al. (2000) for so-called "rigid indistinguishability", but applied to the context of underwhelming interventions.
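The counting argument behind Theorem 1.9 can be sketched in a few lines. The sketch does not perform any statistical tests; it simply assumes what Theorem 1.7 guarantees under faithfulness, namely that an intervention reveals every parent and child of its target. The four-variable dag and the function name are hypothetical.

```python
def arcs_identified(dag, targets):
    """Arcs with at least one endpoint among the intervened targets.

    Assumes, as Theorem 1.7 guarantees under faithfulness, that each
    intervention reveals every parent and child of its target node.
    """
    return {(parent, child) for (parent, child) in dag
            if parent in targets or child in targets}

# Hypothetical fully connected dag on four variables, arcs as (parent, child).
full_dag = {("A", "B"), ("A", "C"), ("A", "D"),
            ("B", "C"), ("B", "D"), ("C", "D")}

with_n_minus_1 = arcs_identified(full_dag, {"A", "B", "C"})   # N - 1 targets
with_n_minus_2 = arcs_identified(full_dag, {"A", "B"})        # N - 2 targets
unresolved = full_dag - with_n_minus_2
```

With three targets every arc touches an intervened node, so the whole dag is recovered. With two targets the arc between the two untouched variables is the only one left unresolved, matching the worst case described in the proof.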

Interventions with structure
Suppose that we do not assume faithfulness, and R′ may in fact be unfaithful (which may include our intervention paths). In particular, suppose that there is some structure in the dag that is isomorphic to the neutral Hesslow case, as in […]
Proof. It is true that the paths from I_X to Z differ in the two cases. However, I_X is only connected to Z by a chain through X, and therefore any dependency I_X has on Z relies upon its effect on X, which must then also be dependent on Z. But by assumption the dependencies between X and Z do not differ between these two structures when X varies, whether we condition upon Y or not. So I_X cannot be dependent on Z after all. QED
LEMMA 1.11. Intervening on the pregnancy variable (Y) in a Hesslow isomorph can create one informative relevant path, whichever dag is true.
Proof. If the neutral Hesslow dag is true, then intervening on Y creates an intervention chain I_Y → Y → Z. By assumption, the connection Y → Z is effective and produces a corresponding dependency. So if I_Y ever affects Y in the right kind of way, it can in turn affect Z and display marginal dependence. If the impostor were true, such marginal dependence would be inexplicable. So this would be an informative dependency.
If the impostor is true, then it can exhibit dependence through the structure I_Y → Y ← Z, when conditioning upon {X, Y}, which is inexplicable in the neutral Hesslow model. So this would be an informative dependency. QED
Notice that to distinguish between the two models, it is not necessary for the intervention paths to be faithful. All that is necessary is for them to be relevant: to display a dependency under some conditions. Any such dependency cannot be produced by the other structure. This point applies also to the dependencies considered in Lemma 1.12.
LEMMA 1.12. Intervening on the thrombosis variable (Z) in a Hesslow isomorph can create one or two informative relevant paths, depending upon which dag is true.
Proof. Once again, the contrasting orientation of Y − Z produces a contrasting chain and collision, with the possibility of a revealing dependency in either case.
Moreover, if the Hesslow dag is true, then the collision X → Z ← I_Z may also reveal itself. It is true that before we intervene, the direct effect of X → Z is cancelled by the indirect path. However, it is nonetheless possible that intervening on Z will change the effect of X → Z without a counterbalancing change in Y → Z, thus creating a dependency between I_Z and X conditional upon Z (and also unbalancing the two paths). If the impostor were true, such conditional dependence would be inexplicable.
In contrast, if the impostor is true then the disconnection between I_Z and X conditional upon Z cannot be informative, since the resulting lack of dependency could also be produced by the Hesslow model. QED
THEOREM 1.13. Intervention on either the pregnancy (Y) or thrombosis (Z) variables (but not the pill (X) variable) can distinguish between a Hesslow isomorph and its impostor, by creating any one of five informative relevant paths.
Proof. This follows from Lemmas 1.10, 1.11, and 1.12. QED
In Korb and Nyberg (2006) we were content to point out one of these path differences for the linear case. However, in the discrete case any of these intervention paths may be completely ineffective, and therefore fail to display any informative dependency. So it is more appropriate to list all the alternative dependency tests, only one of which need be successful.
Epistemically, if we are assuming faithfulness without question, then we may not be motivated to perform any of these interventions, since we will believe in the impostor. However, if we correctly doubt the faithfulness assumption, then we know that alternative models such as Hesslow cancellation are possible, and we may then be motivated to perform such interventions. Alternatively, we may perform them for other reasons (e.g., an intervention on thrombosis because it is intrinsically undesirable, or because it is connected to other arcs that we wish to orient); or we may simply happen to record a new relationship with an instrumental variable that performs the function of a deliberate intervention. In any case, the possibility of distinguishing between the two structures using interventional data is a welcome advance on the impossibility of doing so using any amount of observational data over the original variables.
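The contrast between intervening on the pill and on pregnancy can be checked by exact enumeration in a small discrete model. The sketch below assumes a binary Hesslow structure X → Y → Z with X → Z, with invented probabilities chosen so that the two paths from X to Z cancel exactly. Both interventions are underwhelming and are also invented: I_X shifts the chance of taking the pill, and I_Y halves the chance of pregnancy whatever the value of X.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical parameters, chosen so that P(Z = 1 | X) is 1/5 for both X.
P_Y1 = {0: F(1, 2), 1: F(1, 10)}              # P(Y = 1 | X)
P_Z1 = {(0, 0): F(1, 10), (0, 1): F(3, 10),   # P(Z = 1 | X, Y)
        (1, 0): F(9, 50), (1, 1): F(19, 50)}

def bern(p1, value):
    """Probability of a binary value given the probability of 1."""
    return p1 if value == 1 else 1 - p1

def p_z1_given(intervene_on):
    """Return {i: P(Z = 1 | I = i)} for an intervention on 'X' or on 'Y'."""
    out = {}
    for i in (0, 1):
        total = F(0)
        for x, y in product((0, 1), repeat=2):
            # I_X shifts P(X = 1) from 1/2 to 4/5 (hypothetical).
            px1 = F(4, 5) if (intervene_on == "X" and i == 1) else F(1, 2)
            # I_Y halves P(Y = 1 | X) without overriding X (hypothetical).
            py1 = P_Y1[x] / 2 if (intervene_on == "Y" and i == 1) else P_Y1[x]
            total += bern(px1, x) * bern(py1, y) * P_Z1[(x, y)]
        out[i] = total
    return out

z_given_ix = p_z1_given("X")   # Lemma 1.10: no dependence on I_X
z_given_iy = p_z1_given("Y")   # Lemma 1.11: marginal dependence on I_Y
```

Under these assumptions the intervention on X leaves the probability of thrombosis unchanged, because the cancellation holds for every value of X, whereas the intervention on Y unbalances the indirect path and produces a marginal dependence between I_Y and Z. The second dependence is the kind that a collision at Y could not produce marginally.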
For greater generality, consider any discovery scenario where we are trying to distinguish between two admissible candidate dags, presumably selected through observational data or other information.
THEOREM 1.14. If we can correctly assume that R′ is one of two admissible non-identical dags, then for each dag, there is at least one identifiable intervention that will create a unique collision, if and only if this dag is true.
Proof. Consider any pair of dags, say M_1 and M_2, where M_1 ≠ M_2. M_1 must have at least one unique directed arc that is not present in M_2. Otherwise M_2 contains M_1 as a Markov subdag, and is therefore not minimal, contrary to hypothesis. If and only if M_1 is true, then intervening on the child of such a unique arc will create a unique collision with it. Similarly, M_2 must have at least one unique directed arc that is not present in M_1. If and only if M_2 is true, then intervening on the child of such a unique arc will create a unique collision with it. So for each dag, there is at least one identifiable intervention that will create a unique collision, if and only if this dag is true. QED
COROLLARY 1.15. If we can correctly assume that R′ is faithful and is one of two admissible non-identical dags, then there are at least two interventions that will identify R′.
Proof. This is a corollary of combining Theorem 1.14 with Lemma 1.4. Identifying a unique intervention collision will identify one dag, whereas performing this intervention but failing to identify the unique collision will identify the other dag. QED
Even if we cannot correctly assume that R′ is faithful, if we can correctly assume that the intervention collision will be fundamentally effective, then we can instead combine Theorem 1.14 with Lemma ?? below, to derive the same result.
However, suppose that although any intervention collisions are in fact fundamentally effective, we do not assume that this is so: we simply examine the data for observed dependencies. In that case failing to detect a unique intervention collision dependency will not in itself be informative, since this could be explained by an ineffective intervention collision. However, in that case performing both of the alternative interventions must create a unique dependency in at least one case.
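The construction in the proof of Theorem 1.14 amounts to a set difference over directed arcs. The following sketch is illustrative only: it assumes dags are given as sets of (parent, child) pairs, and it takes the impostor of the Hesslow case to be X → Y ← Z, which is our reading of the earlier lemmas rather than something stated explicitly there.

```python
def unique_collision_targets(m1, m2):
    """Children of the arcs that appear in m1 but not in m2.

    Intervening on any such child creates a collision with the unique arc
    if and only if m1 is the true dag.
    """
    return {child for (parent, child) in m1 - m2}

# Arcs as (parent, child). The impostor's structure is an assumption.
hesslow = {("X", "Y"), ("X", "Z"), ("Y", "Z")}
impostor = {("X", "Y"), ("Z", "Y")}

targets_if_hesslow = unique_collision_targets(hesslow, impostor)
targets_if_impostor = unique_collision_targets(impostor, hesslow)
```

For this pair the candidate targets are Z when the Hesslow dag is true and Y when the impostor is true, which agrees with the variables singled out in Theorem 1.13.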

Faithful interventions without original faith or structure
Consider discovery scenarios where we do not assume faithfulness in the original dag and R may in fact be unfaithful; however, we can correctly assume faithfulness in our intervention two-paths. Suppose that we have no observational data or other information about the structure of the dag prior to the experimental intervention.
Consider any pair of nodes X and Y. There are only three possibilities: either (a) X ← Y, or (b) X → Y, or (c) they are unconnected, X : Y. Suppose we intervene with I_X → X.
LEMMA 1.16. Under any parental conditions, an intervention collision is the sole possible d-connection between the head and the tail.
Proof. Suppose there is some intervention collision I_X → X ← Y, and consider any possible d-connection between I_X and Y under any parental conditions. Since I_X is an intervention node, any d-connection between I_X and Y must begin with I_X → X. It cannot then proceed via children of X, since such paths are blocked. But it cannot then proceed via other parents of X besides Y, since such paths are also blocked. So any parental conditioning makes an intervention collision the sole possible d-connection between I_X and Y. QED
LEMMA 1.17. If R may be unfaithful but all intervention two-paths are faithful, then each intervention necessarily identifies all the original parents of its target node.
Proof. Varying the proof given above for Lemma 1.4, a faithful intervention collision can be discovered directly, without first discovering the direct connection between X and Y. The collision exhibits dependence under all conditions including X. These include parental conditions. But parental conditions make an intervention collision the sole possible d-connection by Lemma 1.16, so the dependence under these conditions can only be explained by the intervention collision itself, and hence each faithful intervention collision can necessarily be identified. Thus, if R may be unfaithful but all intervention two-paths are faithful, then each intervention necessarily identifies all the original parents of its target node. QED
Moreover, the remaining analogues of Theorem 1.7 and Theorem 1.9 and their Lemmas all follow without any alteration. So it is not necessary to assume that R′ is faithful: if we merely assume that our intervention two-paths are faithful, then we get all these results anyway. This is useful in epistemic contexts where we are using interventions that we are confident will create faithful two-paths, such as scrambling interventions, to discover a dag that may well be unfaithful.

Effective interventions without faith or structure
Suppose now that we correctly reject the assumption of faithfulness, because R′ may in fact be unfaithful (which may include our intervention paths). Moreover, we have no observational data or other information about the structure of the dag prior to the experimental intervention.
Consider any pair of nodes X and Y. There are only three possibilities: either (a) X ← Y, or (b) X → Y, or (c) they are unconnected, X : Y. Suppose we intervene with I_X → X.
THEOREM 1.18. Fundamentally effective intervention collisions identify the colliding original parent of the target node.
Proof. Suppose that an intervention collision is fundamentally effective under some conditions S = s_j. Then some parental conditions S = s_k will satisfy such fundamental conditions. Furthermore, such parental conditioning makes the intervention collision the sole possible d-connection (from Lemma 1.16). Hence, under such parental conditioning there is a conditional dependency between the parents, which can only be explained by the intervention collision. By exhaustively examining all parental conditions, we can ensure that we examine S = s_k and hence identify the intervention collision. So fundamentally effective intervention collisions identify the colliding original parent of the target node. QED
Using exhaustive parental conditioning is a very simple procedure for detecting intervention collisions.38 However, in the worst case it involves conditioning upon all the other nodes! Such extreme conditioning would, for large N, require very large data sets to estimate any dependency. However, we are driven to these extremes only by trying to guarantee discovery of an extremely sensitive dependency, in a situation of complete ignorance. Realistically, there are many degrees of dependency between complete faithfulness and unfaithfulness. The larger and more robust the dependency, the easier it will be to discover, and the less data will be required.
38 There is no equally simple procedure for detecting intervention chains. Possible children of the target node present a conditioning dilemma: if they begin a chain towards Y, then conditioning upon them will successfully block this path, but if they begin a collision towards Y, then conditioning upon them will actually d-connect an otherwise blocked path.
LEMMA 1.19. If R′ may be unfaithful and N > 2, then each effective intervention chain does not necessarily identify the corresponding original child of its target node.
Proof. This follows from Lemma 1.10: if there is a Hesslow structure and we intervene upon the pill variable, then we cannot necessarily identify the connection between the pill and thrombosis. (Although we could in this case identify the other child.) QED
LEMMA 1.20. If R′ may be unfaithful and N > 2, then each effective intervention does not necessarily identify every corresponding original disconnection of its target node.
Proof. Again, this follows from the Hesslow case: if we appear to be intervening upon the parent of an uncovered collider, then it is also possible that this is a neutral Hesslow structure, and the parents are in fact connected. QED
THEOREM 1.21. If we cannot assume that R′ is faithful, then each intervention nevertheless identifies all fundamentally effective collisions at the target node; if also N > 2, interventions will not necessarily identify all effective chains or disconnections.
Proof. This follows from Theorem 1.18 and Lemmas 1.19 and 1.20. QED
LEMMA 1.22. If we cannot assume that R′ is faithful, then nevertheless for any K-sized subset of the N variables, K interventions are sufficient to identify all parents of these K variables, provided that all their collisions are fundamentally effective.
Proof. This proof is analogous to that of Lemma 1.8, but Lemma 1.22 can only guarantee that the parents of each of the K variables are identified, since this is all that Theorem 1.21 promises. QED
THEOREM 1.23. If we cannot assume that R′ is faithful, then nevertheless N − 1 interventions are sufficient (and in the worst case necessary) to identify R′, provided that all their collisions are fundamentally effective.
Proof. Even if no children are identified, N such interventions are clearly sufficient by Lemma 1.22, since every original arc must connect a parent of one of the original N nodes. Now suppose that we do not intervene upon the Nth node. Only parents of this node remain unidentified by the N − 1 interventions. Such direct connections are effective, and there are no admissible models lacking these connections, since such models would require a reversed arc that is inconsistent with the N − 1 interventions. Therefore, such direct connections must be included in the dag in order for it to be Markov. They must then be oriented toward the Nth node, since the Nth node is already ruled out as a parent for the other nodes. QED
It is interesting that the use of both scrambling interventions and the assumption of faithfulness does not (in itself) reduce the number of interventions that are necessary. But either of these assumptions does guarantee effective collisions, which cannot be said for underwhelming interventions in unfaithful dags.
As our discussion indicates, there are some complexities involved in dealing with unfaithfulness and underwhelming interventions, which do not arise when simply assuming faithfulness and/or overwhelming interventions. However, the former approach is more powerful, because it is more general and rests upon logically weaker assumptions that are more likely to be true.
If interventions are to be informative, then logically they must be informative either by changing dependency or by failing to do so. We have considered some inferences of both kinds. When changing dependency, the mere fact that the intervention arc affects the target node is not, in itself, informative about the rest of the dag. So it is usually necessary for intervention nodes to create some two-arc dependencies, arising from either intervention chains or collisions. By considering effective intervention collisions, we have minimized the amount of dependency that we need to assume for discovery. Of course, it is always possible for interventions to be entirely uninformative! We claim only that interventions can distinguish between models that are otherwise statistically indistinguishable, not that every intervention must do so.
In summary, whether the models are linear or discrete, effective augmentation is in principle sufficient to identify the entire dag. This eliminates any ambiguity, whether it results from unfaithfulness or not, and regardless of the location or kind of unfaithfulness involved.

Conclusions
Our central point is that causal discovery algorithms should rely less upon the faithfulness assumption and more upon intervention. Perhaps it is an obvious platitude that intervention is a powerful tool for discovering causal relationships. There is certainly no need to tell experimental scientists! However, this has yet to be fully incorporated into either the causal discovery algorithms or their philosophical assessments. Algorithms are more powerful when they can use intervention.
To draw some specific conclusions:
1. Intervention data should be fully utilized in automated causal discovery procedures. There has been a disproportionate emphasis on analysing observational data.
2. Interventions can generally be represented by augmenting the dag in some appropriate way, including specifying the directions of intervention arcs.
3. The inferences this makes possible can overcome some of the limitations of faith-based discovery, such as Hesslow cancellation and the restriction to pattern inference, even for discrete dags.
4. Under our assumptions, one intervention can identify all its fundamentally effective collisions, and N − 1 such interventions can identify the true dag, eliminating underdetermination altogether.
5. The capacity of automated causal discovery algorithms should be reassessed accordingly. For example, the skeptical view that they rely irrevocably on the dubious assumption of faithfulness is incorrect.
6. In discrete dags, causal discovery does not require dependency under all possible conditioning sets (faithfulness), because it can succeed with less (sufficient relevance, through effectiveness).
Figure 1.4. Finasteride as a doubly ineffective intervention
Figure 1.5. A neutral Hesslow case and its faithful impostor
Step I: Put an undirected link between any two variables X and Y if and only if X and Y are dependent conditional on S, for every set of variables S s.t. X, Y ∉ S.
Step II: For every undirected structure X − Y − Z, orient the arcs X → Y ← Z if and only if X and Z are dependent conditional on S, for every S s.t. X, Z ∉ S and Y ∈ S. I.e., there is an uncovered collision if and only if the end variables X and Z are dependent under every conditioning set that includes the middle variable Y.
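Steps I and II can be sketched as a brute-force procedure over all conditioning sets. The sketch assumes a hypothetical oracle dependent(x, y, S) that reports whether x and y are dependent conditional on the set S; in practice this would be a statistical test on data. It is an illustration of the two steps as stated here, not an efficient or complete discovery algorithm, and it skips covered structures on our reading of "uncovered collision".

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a frozenset, including the empty set."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def two_step_pattern(variables, dependent):
    """Return (links, collisions) found by Steps I and II.

    dependent(x, y, S) is an assumed oracle for conditional dependence.
    """
    variables = list(variables)
    # Step I: link x and y iff they are dependent given every S excluding both.
    links = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        if all(dependent(x, y, s) for s in subsets(others)):
            links.add(frozenset((x, y)))
    # Step II: orient x -> y <- z iff x and z are dependent given every S
    # that contains y and excludes x and z.
    collisions = set()
    for y in variables:
        neighbours = [v for v in variables if frozenset((v, y)) in links]
        for x, z in combinations(neighbours, 2):
            if frozenset((x, z)) in links:
                continue  # covered structure, not an uncovered collision
            others = [v for v in variables if v not in (x, y, z)]
            if all(dependent(x, z, s | {y}) for s in subsets(others)):
                collisions.add((x, y, z))
    return links, collisions

def collision_oracle(x, y, s):
    """Hypothetical oracle for a faithful collision A -> C <- B."""
    if {x, y} == {"A", "B"}:
        return "C" in s   # the parents are dependent only given the collider
    return True

links, collisions = two_step_pattern(["A", "B", "C"], collision_oracle)
```

On the hypothetical faithful collision the procedure links A and B to C, leaves A and B unlinked, and orients the collision at C. An ineffective collision would fail the "every conditioning set" requirement of Step II, which is why the additional tests discussed in this section are needed.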