Network Structure Learning Under Uncertain Interventions

Abstract Gaussian Directed Acyclic Graphs (DAGs) represent a powerful tool for learning the network of dependencies among variables, a task which is of primary interest in many fields and specifically in biology. Different DAGs may encode equivalent conditional independence structures, implying limited ability, with observational data, to identify causal relations. In many contexts however, measurements are collected under heterogeneous settings where variables are subject to exogenous interventions. Interventional data can improve the structure learning process whenever the targets of an intervention are known. However, these are often uncertain or completely unknown, as in the context of drug target discovery. We propose a Bayesian method for learning dependence structures and intervention targets from data subject to interventions on unknown variables of the system. Selected features of our approach include a DAG-Wishart prior on the DAG parameters, and the use of variable selection priors to express uncertainty on the targets. We provide theoretical results on the correct asymptotic identification of intervention targets and derive sufficient conditions for Bayes factor and posterior ratio consistency of the graph structure. Our method is applied in simulations and real-data world settings, to analyze perturbed protein data and assess antiepileptic drug therapies. Details of the MCMC algorithm and proofs of propositions are provided in the supplementary materials, together with more extensive results on simulations and applied studies. Supplementary materials for this article are available online.


Motivation and Framework
Graphical models based on directed networks have been widely employed to understand dependence relations between variables, a crucial problem in many scientific areas, especially in biology (Friedman 2004;Shojaie and Michailidis 2009).Typically, the network structure is inferred under the assumption that multivariate data have been generated by a stable system.More realistically however, measurements can be heterogeneous, meaning that modifications in the generating mechanism, for example, due to exogenous interventions, have occurred.
An instance is genomic medicine, where interactions between genes provide insights on the genesis and progression of diseases, whose occurrence is reflected by aberrations in the gene-network functioning.In this setting, drug therapies capable of gene-inhibition can be applied to regulate and restore dependencies in the gene-network structure.However, the effect of drug treatments at gene level can be uncertain or completely unknown (Marton et al. 1998;Paananen and Fortino 2019), so that discovering the targets of an intervention or therapy becomes of interest in itself.Drug target discovery is also essential for the development of personalized treatments, to identify genes that are affected by drugs and in turn evaluate patients' response to therapies; see Rawat et al. (2020) for a recent discussion.
In this article we consider multivariate data generated from a system subject to unknown interventions, and we propose a novel method for learning their dependence structure and the effects of interventions.We represent the data generating mechanism through a Directed Acyclic Graph (DAG) which allows for a factorization of the joint distribution in terms of "parent-child" relations between nodes (variables).An intervention modifies the original DAG structure by dropping the dependence of each intervened node (the intervention target) from its parents.Deterministic interventions assume that each intervened variable is set equal to a constant level, an assumption reasonable in some contexts, such as gene-knockout experiments; by converse, stochastic interventions (Korb et al. 2004), that we adopt in the current paper, are more general and replace the conditional distribution of the intervened node with that of a new random variable, independent from all parent nodes.

Related Works
The problem of learning DAGs from interventional data has received some interest over the last years.In particular, some methodologies for structure learning of DAGs given interventions with known targets have been developed; see for instance Hauser and Bühlmann (2015) and Castelletti and Consonni (2019) for a frequentist and Bayesian approach, respectively.When unknown targets are allowed, Eaton and Murphy (2007) apply the dynamic programming algorithm of Koivisto and Sood (2004) on a graph augmented with interventional nodes, to estimate edge inclusion probabilities and interventions.Their method is implemented for categorical data, using the Bayesian-Dirichlet score of Heckerman and Geiger (1995), although it could be adapted to the Gaussian case.The authors show in simulations that their method can correctly recover both the intervention targets and the graph structure under specific settings.Importantly however, the augmentation with interventional nodes increases the graph size, and causes the method to be practically feasible only up to 20 nodes.Higher dimensions compel to severely restrictive assumptions on the interventions: (i) in terms of the number of intervened nodes, thus, ruling out interventions with a diffuse impact, as in our application of Section 6.1, or (ii) in terms of interventions forced to act on distinct nodes, thus, excluding cases where different but related drug therapies can partially share effects on the same node, as in the analysis of antiepileptic therapies Valporate and Carbamazepine, discussed in Section 6.2.
A similar idea of graph augmentation to include hidden nodes representing intervention targets has been proposed in Zhang et al. (2017): they apply constraint-based methods, as the PC algorithm of Spirtes et al. (2000), on the augmented graph to identify the network skeleton and v-structures, and then recover arrow directions through invariance considerations.Specifically, the distribution of a target node, conditional on its causal parents, should not change when interventions affect other nodes; this idea was first adopted in Peters, Bühlmann, and Meinshausen (2016), who propose a method that estimates causal effects, and that can be iterated with the purpose of network learning under uncertain interventions.The PC algorithm is also used by He and Geng (2016) who first recover from a collection of datasets group-specific network structures, then pooled together to infer the causal graph.The methods of Zhang et al. (2017) and He and Geng (2016) are both substantially different from our framework as they are based on multiple hypothesis tests and do not provide any uncertainty of the estimated graph.In addition, He and Geng (2016) do not perform target estimation, while Zhang et al. (2017) recover the manipulated variables, but without identifying the interventional settings they refer to.More recently, Ke et al. (2019) propose an optimization-based Bayesian method for network learning, limited to categorical interventional and observational data.Finally, Squires, Wang, and Uhler (2020) suggest a greedy search algorithm for causal structure learning with unknown interventions and target estimation, in the presence of multiple datasets, one of which has to be purely observational.

Contribution and Structure of the Article
In this article we propose a Bayesian methodology for joint structure learning of Gaussian DAGs and intervention targets that extend the literature along the following directions: (i) we build a new modelling framework where observational data are not strictly required, the excessive reliance on multiple tests is avoided, and unknown interventions are represented as indicator vector parameters, rather than auxiliary nodes that increase the graph dimension; (ii) we demonstrate theoretically, and validate empirically, the correct asymptotic identification of the targets and of the equivalence class of the true DAG; (iii) we propose a novel MCMC algorithm for joint posterior analysis over the space of graphs and interventions, without resorting to optimization routines.In addition, we emphasize that our method is practically feasible on graphs of dimension larger than those studied so far in the Bayesian literature, without imposing restrictive assumptions on the structure of the interventions.Finally, differently from other Bayesian approaches for DAG structure learning, our method revolves around arbitrary DAGs, that is, with completely unknown ordering of the nodes; see Ni, Stingo, andBaladandayuthapani (2017, 2019) for a comparison.
The rest of the article is organized as follows.In Section 2 we briefly summarize the main concepts about DAGs and interventions, introduce our Gaussian DAG-model, and priors on DAGs, intervention targets and DAG-parameters.Asymptotic theoretical properties of target identification and graph learning are described in Section 3, whilst in Section 4 we develop the MCMC scheme for posterior inference on DAGs and targets.Simulation studies to assess the performance of our method are conducted in Section 5, while Section 6 presents applications to real data and comparisons with alternative methods available in the literature.Finally, Section 7 contains a brief discussion together with possible extensions of our methodology.Further details on our MCMC algorithm, proofs of propositions, and more in-depth simulation and real-world studies are provided in the supplementary materials.

Directed Acyclic Graphs and Interventions
Let D = (V, E) be a Directed Acyclic Graph (DAG), where V = {1, . . ., q} is a set of nodes and E ⊆ V × V a set of edges.If (u, v) ∈ E, then (v, u) / ∈ E and we say that u is a parent of node v and v is a child of u.The set of all parents of v in D is denoted by pa D (v).Consider a collection of q random variables, X 1 , . . ., X q .We assume that the joint distribution f factorizes according to D as ( 1 ) If (1) holds, f (x 1 , . . ., x q | D) is said to obey the Markov property of D. In our context of interventional data, Equation ( 1) is also called observational or preinterventional distribution.
An intervention on the node j ∈ V is defined as the action of setting X j to the value of a random variable U j having density f (u j ).A joint intervention on I ⊆ V fix, for each j ∈ I, X j to U j , with {U j } j∈I mutually independent.I is called an intervention target and the do-operator do{X j = U j } j∈I (Pearl 2000) is used to denote such an intervention.As a consequence of the intervention, the original dependence in D of node j from its parents pa D (j) is dropped, and the intervention DAG of D is defined, following Hauser and Bühlmann (2012), as D I = (V, E I ), where ∈ I}.See also Figure 1 for an example of DAG and related intervention DAG.The intervention on I replaces the conditional density of each node j ∈ I in (1), that is f (x j | x pa D (j) ), with f (x j ).Therefore, the postintervention distribution of X 1 , . . ., X q given the operator do{X j = U j } j∈I is obtained from (1) using the truncated factorization (2) When there are no interventions, Equation (2) reduces to Equation (1).Multiple independent interventions form a family of intervention targets I = {I 1 , . . ., I K }, where each I k ⊆ V and index k refers to the kth intervention.

Gaussian DAG-Models
We assume (X 1 , . . ., X q ) | , D ∼ N q (0, ) , where ∈ C D , the space of all covariance matrices Markov w.r.t.D. A Gaussian DAG-model can be equivalently written as a linear Structural Equation Model (SEM) of the form L (X 1 , . . ., X q ) = ε, where L is a q × q matrix with unit diagonal elements and ε ∼ N q (0, D), D = diag(σ 2 1 , . . ., σ 2 q ); see Kaplan (2009).Accordingly, the decomposition = L − DL −1 holds.For each (u, v)-element of L and u = v we have L u,v = 0 if and only if u ∈ pa D (v), and it holds that where ϕ is the Gaussian density function, ≺ j ] = pa D (j) × j and L A×B denotes the sub-matrix of L with elements belonging to rows and columns indexed by A and B, respectively.The set L ≺j ] , σ 2 j q j=1 represents the collection of observational parameters (because they index the observational model).
For any intervention target I ⊆ V, we further assume that each interventional density f in (2) is zero-mean Gaussian f (u j ) = ϕ(u j | 0, φ j ), j ∈ I, with U j ⊥ ⊥U j for each j = j .With this specific choice of interventional density, we can now replace the conditioning event do{X j = U j } j∈I with the parameters I, = {φ j } j∈I , where is the collection of interventional parameters.The postintervention distribution of (X 1 , . . ., X q ) thus becomes We now split the available observations of X 1 , . . ., X q among K datasets, as arising from the family of K independent interventions: each n (k) × q dataset X (k) consists of a collection of (rows of the data matrix X (k) ) associated to intervention target I k , for i = 1, . . ., n (k) .Accordingly, the postintervention distribution related to intervention k is f x (k) , I k , D , where j } j∈I are the interventional parameters associated to the k-th intervention.Given the collection of datasets X = X (1) , . . ., X (K) , the likelihood function is finally where θ = D, L, (1) , . . ., (K) is the collection of all DAGdependent (observational and interventional) parameters.

Prior on DAG Parameter θ
Conditionally on DAG D and the collection of targets I 1 , . . ., I K , we first assign a prior to the observational parameters (D, L).
Recall that = L − DL −1 , where is the covariance matrix of a multivariate Gaussian random variable Markov w.r.t.DAG D. We assign (D, L) a DAG-Wishart prior with hyperparameter U (a q × q positive definite matrix) and shape hyperparameter a D = (a D 1 , . . ., a D q ) ; see Ben-David et al. (2015) and Cao, Khare, and Ghosh (2019).Also, a standard choice, hereinafter adopted, is U = gI q (g > 0).The DAG-Wishart distribution induces a reparameterization of in terms of the node- , independent across j = 1, . . ., q, and with distribution where ≺ j ] = pa D (j) × j and I-Ga(a, b) stands for an Inverse-Gamma distribution with shape a > 1 and rate b > 0 having expectation b/(a − 1).From ( 6), the prior on the observational parameters (D, L) is given by p(D, L) = q j=1 p L ≺j ] | σ 2 j p σ 2 j .Hyperparameters a D j are specific to each DAG model, and it can be shown that the default choice (hereinafter adopted) a D j = a + |pa D (j)| − q + 1 (a > q − 1) guarantees compatibility among prior distributions for Markov equivalent DAGs; see Peluso and Consonni (2020).In particular, we set a = q, the minimum integer value that guarantees a proper prior distribution, regardless of the specific D.
Consider now the interventional parameters Because each φ (k) j corresponds to an unconditional variance in a postintervention distribution where each node j ∈ I k has no parents, we can set independently, where a = g/2, following the same elicitation procedure leading to (6).The prior on the collection of interventional parameters is therefore, p( (1) , . . ., (K) , leading to a conditional prior on θ of the form (8)

Prior on Targets I 1 , . . . , I K
Consider now the collection of targets I 1 , . . ., I K , where For convenience, we represent each target I k as an indicator vector , and 0 otherwise.Conditionally on a prior probability π k (j) ∈ (0, 1), we can assign a prior to where π k = (π k (1), . . ., π k (q)) .Assuming prior independence among intervention targets, we then set p(I 1 , . . ., In addition we assign, for j = 1, . . ., q and k = 1, . where corresponds to the number of intervened nodes under intervention k.Expression (10) resembles the multiplicity correction prior introduced in Scott and Berger (2010) for variable selection.

Prior on DAG D
For a given DAG D = (V, E), let S D be the 0-1 adjacency matrix of its skeleton (the underlying undirected graph obtained after removing the orientation of its edges), such that for each ) ∈ E, and 0 otherwise.Given some prior probability of inclusion η ∈ (0, 1), we assume (equivalently in its skeleton) and q(q − 1)/2 corresponds to the maximum number of edges in a DAG with q nodes.Finally we set p(D) ∝ p(S D ), for any D ∈ S q , where S q is the space of all DAGs on q nodes.Such a prior only depends on the number of edges in the graph and can easily reflect prior knowledge of sparsity (Castelletti et al. 2018).Other priors, specific for DAGs and based on the number of compatible perfect orderings of the vertices, are also present in the literature (Friedman and Koller 2003;Kuipers and Moffa 2017).

Theoretical Properties of Target and Graph Learning
In the present section we investigate the asymptotic behavior of the false positive and false negative rates associated to the estimation of the intervention targets, and the correct asymptotic identification of the graphical structure.The dependence on a given graph D is assumed and omitted in the first part of this section, and later reinstated when we discuss graph learning.Let I 01 , . . ., I 0k be the true unknown intervention targets, (1) 0 , . . ., (K) 0 the true interventional parameters, and (D 0 , L 0 ) the true observational parameters corresponding to the Cholesky decomposition of the variance (precision) matrix 0 ( 0 ).For a given node j and dataset k, and with θ = D, L, (1) , . . ., (K) , we first define the posterior log-odds of an intervention on node j in dataset k as k) , D, L, (k) , where logit(A) = ln(P(A)/P( Ā)) for some event A and its complement Ā.The posterior conditional probability of an intervention on node j in dataset k is where ϕ p is the density function of a p-variate Gaussian r.v. and X (k)  j denotes column indexed by j in dataset X (k) .For the complement event we have instead We further remove the dependence from θ by considering the conditional expectation where the conditioning event is A k j = j ∈ I 0k , therefore, interpreted as the posterior expected log-odds of an intervention on node j in dataset k, given that j is indeed a true intervened node (target) in the dataset.In the following proposition we show that the posterior expected log-odds γ (k) j (X | A k j ) correctly diverges when the intervention has truly occurred, and the asymptotic normality of the scaled log-odds. where → +∞.
Proof.See supplementary materials.
The proposition states that, if j ∈ I k , that is, node j is a target under intervention k, this will be detected with sample size large enough and for any given graph D, therefore, with a false negative rate eventually zero.The scaled log-odds of the (correct) target classification has asymptotic Gaussian distribution.Note that the mean of the asymptotic distribution increases when a D j is large.This is typical of a node with many parents in a large graph, which makes easier the identification of an intervention, because the latter will suppress many dependence relations.We refer the reader to the supplementary materials for a more extensive discussion of the proposition.
In the opposite case of no intervention, we analyze the behavior of where the conditioning event is Āk j = j / ∈ I 0k , therefore, interpreted as the posterior expected log-odds of no intervention on node j in dataset k, given that j is not an unknown intervened node.With the following result we show conditions for which this quantity diverges when there is no intervention, and its asymptotic normality. where and ψ is the digamma a.s.
Proof.See supplementary materials.
Proposition 3.2 tells that a true negative case of no intervention, that is, j / ∈ I k , will be eventually detected with sample size large enough.Note that the mean of the asymptotic distribution is closer to zero when node j is independent from any other node.Intuitively, it is more difficult to understand the absence of an intervention since there are no parent-child relations that are removed by the intervention on node j, which makes the intervention effect less apparent.We refer the reader to the supplementary materials for a more extensive discussion.In the following sections we develop and implement an MCMC algorithm that empirically confirms the correct detection of nodes which are targeted by interventions.
The above discussion focuses on the correct identification of interventions for a given DAG.We now prove model selection consistency of the true DAG observational equivalence class; the latter, combined with consistent estimation of targets, allows to identify the group-specific intervention graphs.Data within group k is used to find interventions specific to that group, and observational data from all groups are combined to make inference on the underlying observational DAG structure.In Section 3 of supplementary materials, we first extend the conjugacy result on the DAG-Wishart prior of Ben-David et al. (2015) to interventional Gaussian multivariate data from multiple groups; then we prove, following Cao, Khare, and Ghosh (2019) and Peluso and Consonni (2020), its Bayes factor and posterior ratio consistency outside [D 0 ], the equivalence class of the true DAG, and its asymptotic compatibility within [D 0 ].
We have Bayes factor consistency if, for all D = D 0 , the Bayes factor whenever D 0 is the true DAG generating X, where P → denotes convergence in probability, P is the probability measure under the true DAG D 0 , and m(X | D, I 1 , . . ., I k ) is the marginal (or integrated) likelihood.We have posterior ratio consistency if, with D 0 being the true DAG, it holds that max For each j ∈ V, let ñj = k:j∈I k n (k) and n * j = k:j / ∈I k n (k) be the number of observations among groups k = 1, . . ., K such that node j is, respectively, intervened and not. ) for all j ∈ V, and (c) for all j = l ∈ V there exists a k such that j / ∈ I k and l / ∈ I k hold, then as n → ∞, Proof.See supplementary materials.
Proposition 3.3 shows that posterior ratio and Bayes factor consistency under DAG-Wishart prior holds outside the Markov equivalence class of the true generating DAG D 0 .On the other hand, the posterior ratio tends to the prior ratio (Bayes factor equal to one) within the true equivalence class.This result is coherent with Peluso and Consonni (2020), in the context of a single-group observational dataset.We refer the reader to the supplementary materials for a discussion of the assumptions underlying the result.

MCMC Scheme and Posterior Inference
We construct a collapsed Metropolis-Hastings sampler (Metropolis et al. 1953) on the space of DAGs and intervention targets to approximate the marginal posterior The full conditional distribution of To update DAG D we implement a Metropolis-Hastings step where, given the current DAG, a new DAG D is proposed from as suitable proposal q( D | D) and accepted with probability Conditionally on DAG D we update the K targets I 1 , . . ., I K (equivalently, the indicator vectors h 1 , . . ., h K introduced in Section 2.4) sequentially.For a given k, the full conditional of Update of I k conditionally on {I s } s =k and DAG D is again performed through a Metropolis Hastings step, where a new target I k proposed from q( I k | I k ) is accepted with probability We refer the reader to Section 4 of supplementary materials for full details.The output is a collection of DAGs D (s) S s=1 and targets I (s)  1 , . . ., I (s) K S s=1 approximately sampled from the posterior (12), where S is the number of finally kept MCMC iterations.Given this output, we can estimate, for each node j and target I k , the posterior probability of intervention a measure of evidence that intervention k acts on node j.In addition, an approximate marginal posterior distribution over the DAG space is available, with each DAG posterior probability estimated as p(D | X) = 1 S S s=1 1 D (s) = D , for D ∈ S q , the set of all DAGs on q nodes.Furthermore, we can estimate with the posterior probability of inclusion of each directed edge u → v, where

Simulated Settings
To assess the performance of our method, we construct various simulated settings by varying the number of variables q ∈ {20, 40}, the sample size of each dataset X (k)    4).Next, for each k = 1 . . ., K, n (k) iid interventional data collected in the n (k) ×q data matrix X (k) are generated as in (4).Each dataset is therefore, a collection of K interventional data matrices X (1) , . . ., X (K) .
For comparison we include the Unknown Target Interventional Greedy Sparsest Permutation algorithm of Squires, Wang, and Uhler (2020), with significance level α ∈ {0.1%, 0.001%} (IGSP 0.1% and IGSP 0.001%, respectively), as recommended in the original paper.We also include Algorithm 1 of He and Geng (2016), implemented at significance level 0.05.As a further benchmark, we also construct a baseline node-wise regression approach, by adapting to our interventional setting the twostage adaptive lasso method of Han et al. (2016); we call this benchmark Node-wise.Finally, we include the Greedy Interventional Equivalence Search (GIES) method of Hauser and Bühlmann (2012), implemented using the Extended Bayesian Information Criterion (EBIC, Foygel and Drton 2010) with tuning coefficient γ ∈ {0.5, 1} (GIES 0.5 and GIES 1, respectively) as also recommended in the original paper.The method of GIES was developed for known intervention targets that we input, as in an oracle setting, using the true intervened nodes.All methods can be adopted for DAG structure learning, whilst only IGSP and Node-wise can perform target estimation.Finally, notice that IGSP requires n (k) > q, so that results of IGPS for n (k) ∈ {10, 20} are missing.Further details on benchmarks, together with convergence diagnostics and more extensive simulation settings are provided in supplementary materials.

Results
We fix the number of MCMC iterations S ∈ {25000, 50000} for q ∈ {20, 40}, respectively.In addition we set g = 1/n in the Inverse Gamma priors ( 6) and ( 7), and η = 1/q in the prior on DAG (Section 2.5) which corresponds to a prior probability of edge inclusion smaller than the expected level of sparsity, as commonly recommended; see for instance Barbieri and Berger (2004).Finally we fix a k = 1/q, b k = 1 for each k = 1, . . ., K in the Beta prior leading to (10).Other scenarios not reported for brevity show that the results are quite insensitive to these hyperparameter choices, especially for large sample sizes; for instance when we fix η = 2/q there is a negligible change for n (k) = {10, 20} and no change is observed for higher sample sizes.
We start by evaluating the performance of the methods in identifying the intervention targets.With regard to our Bayesian method, we first compute for each simulation, the posterior probabilities p j∈I k in (15).Next, by fixing a threshold of inclusion z = 0.5 we provide a target estimate I k , k = 1, . . ., K, by including in I k all nodes j such that p j∈I k > z.Estimated targets I k 's are compared with the true targets by computing the false positive and false negative rates, respectively, defined as similarly for the other frequentist methods under evaluation.
Results for q = 40 setting are summarized in the box-plots of Figure 2, whilst results for q = 20 are reported in the supplementary materials.This reports the distribution of FPR and FNR constructed across the N = 40 simulations for three methods under evaluation and increasing sample sizes n (k)  under scenarios Sparse and Diffuse.With regard to our method (Bayes) we notice that coherently with the theoretical results of Section 3, both sources of error vanish as sample size increases.This tendency is more evident for FNR that rapidly goes to zero already at moderate sample sizes, for example, n (k) = 20.
It is clear the outperformance of our proposal, relative to the benchmarks, with Node-wise performing equally well only in terms of FNR.
We then evaluate the overall performance of each methodology in recovering the DAG structure.With regard to our method, we provide a DAG estimate by computing first p u→v , the (estimated) posterior probability of inclusion, for each edge (u, v) as in ( 16); then we fix a threshold for edge inclusion z = 0.5 and obtain an estimate D by including all edges such that p u→v > 0.5, as in the median probability model proposed by Barbieri and Berger (2004) in a linear regression framework.We compare D with the true DAG by measuring the Structural Hamming Distance (SHD, Tsamardinos, Brown, and Aliferis 2006) between the two graphs; similarly for each DAG estimate directly outputted by the other methods; lower values of SHD correspond to better performances.Results for q = 40 are summarized in the box-plots of Figure 3, where each plot Sparse (first column) and Diffuse (second column) for number of nodes q = 40 and increasing sample sizes n (k) .Methods under comparison are: our Bayesian methodology (Bayes), the Unknown Target Interventional Greedy Sparsest Permutation algorithm implemented at significance level α ∈ {0.1%, 0.001%} (IGSP 0.1% and IGSP 0.001%) and node-wise regression (Node-wise).k) under Sparse and Diffuse Scenarios.Methods under comparison are: our Bayesian methodology (Bayes), the Unknown Target Interventional Greedy Sparsest Permutation algorithm implemented at significance level α ∈ {0.1%, 0.001%} (IGSP 0.1%, IGSP 0.001%), node-wise regression (Node-wise), Algorithm 1 of He and Geng (2016) and the Greedy Interventional Equivalence Search method with tuning coefficient γ ∈ {0.5, 1} (GIES 0.5, GIES 1).
reports the distribution of SHD across the N = 40 simulated datasets for the various methods and increasing sample sizes n (k) ∈ {10, . . ., 500} under Scenarios Sparse and Diffuse.It is clear the tendency of a better and better recovery of the true graphical structure as we increase the amount of available data, and an overall better performance of Bayes relative to all the benchmarks.The only exception is GIES 0.5, implemented with knowledge of the true targets, that outperforms our Bayesian method in few settings characterized by small sample sizes, where indeed target identification was more difficult for our method.However, it performs worse than Bayes as n (k)  increases, especially under Scenario Sparse.

Protein Signaling Data
In the current section we apply our methodology to the protein signaling data presented in Sachs et al. (2005).The dataset, provided as a supplementary material to the original paper, collects simultaneous measurements of multiple phosphorylated proteins and phospholipid components in individual primary human immune system cells.Measurements of q = 11 phosphorylated proteins and phospholipids were collected after a series of stimulatory cues and inhibitory interventions obtained from the administration of distinct reagents.Each reagent induces a perturbation of the proteins' pathway since it affects either one of the signaling molecules directly or some (unmeasured) receptor enzymes.More specifically, seven datasets are associated to known interventions, while other two (reagents CD3/CD28 and ICAM-2) refer to general (unknown) perturbations; see also Table 1 in Sachs et al. (2005) and our Table 1.The same dataset was analyzed by Castelletti and Consonni (2019) who implemented their OBIES method on the collection of measurements associated to known targets to learn the structure of an interventional essential graph.
We include in our analysis all nine datasets, by assuming known intervention targets for the first seven, whilst considering the last two as characterized by uncertain interventions that we learn with our methodology.We run S = 25,000 iterations of our MCMC algorithm to approximate the posterior distribution over the space of DAGs and intervention targets; we fix g = 1/n in the priors on DAG parameters ( 6) and ( 7), a prior probability of edge inclusion η = 1/q (Section 2.5) and a k = 1/q, b k = 1 (k = 1, . . ., K) in the Beta prior leading to (10).
Results are summarized in Figure 4, with the heat map of posterior probabilities of intervention for each node and each of the two uncertain intervention groups CD3/CD28 and ICAM-2.Black dots for the first seven interventions represent the targets that were assumed to be known.Our findings are coherent with biological literature establishing that both reagents ICAM-2 and CD3/C28 are capable of activating enzyme ZAP70 and in turn signaling nodes Mek, PLC and PKC among others.These are indeed selected as promising targets under at least one of the two interventions.We also report in the right panel of Figure 4 a DAG estimate obtained by including those edges  whose posterior probability of inclusion exceeds 0.5: dark and light gray circles identify proteins whose posterior probability of intervention computed under CD3/C28 and ICAM-2, respectively, is larger than 0.5, and therefore, represent plausible intervention targets for the two reagents.We stress that a higher cardinality of the estimated intervention set in the two datasets with unknown targets is not unexpected, since Sachs et al. (2005) refer to general perturbations that overall stimulated the cell, against specific perturbations acting on defined set of molecules in the remaining datasets with known targets.We refer to Sections 6 and 7 of supplementary materials for sensitivity analyses and detailed comparisons with alternative methods.
In particular, He & Geng does not provide target estimates, whilst we share 3 out of 4 targets with Node-wise in the dataset CD3/C28 and 1 out of 2 with IGPS in the dataset ICAM-2.Also, Node-wise and He & Geng estimate the same graph skeleton of our method, whilst IGPS, that wrongly assumes one dataset to be purely observational, misses two links.The benchmark Nodewise also share many edge orientations; on the other hand, with He & Geng many edges remain unoriented.

Gene Expression Profiles Under Antiepileptic Drug Therapies
In this section we consider a gene expression dataset relative to patients affected by epilepsy.In the following we consider four groups of subjects (Drugnaïve, Valporate, Phenytoin, and Carbamazepine) to evaluate the effect of each drug therapy, relative to the untreated patients (Drug-naïve group).We include in our analysis 100 genes that were most differentially expressed between healthy and unhealthy drug-naïve patients as resulting from limma (Linear Models for Microarray and RNA-Seq Data) tests (Smyth 2005) and therefore, represent suitable candidate targets to evaluate patients' response to each drug therapy.Our algorithm is run for number of iterations S = 1,200,000 by fixing prior hyperparameters as in Section 6.1.Because our interest lies in discovering drug-induced effects (interventions) on patients who received one of the drug therapies as opposed to unhealthy, yet untreated, patients, we consider drug-naïve individuals as a ground (reference) group and fix the corresponding intervention target as the empty set.
We first compute, for each node/gene v ∈ {1, . . ., 100} and each intervention target I 2 , I 3 , I 4 corresponding to one of the three drug therapies, the posterior probabilities of intervention; see Equation ( 15).Results reported in the supplementary materials show that there are few genes exhibiting a high posterior probability of intervention under some of the treatments.Specifically, only six genes, that are reported in the submap of Figure 5, are associated with probabilities of intervention exceeding 0.5.In addition, we show in the right panel of Figure 5 the estimated (median probability) subgraph of these genes, including parent and child nodes.This DAG estimate can help understanding how gene dependencies modify in force of an intervention after one of the drug therapies is administrated.The implementation of alternative methods (Section 5.1) specifically designed for DAG and target learning under uncertain interventions revealed several difficulties that are specific to this kind of data.In particular, while IGSP cannot be applied since n (k) < q, the baseline Node-wise method analyses separately the K datasets and does not identify any dependence relation between genes, even when implemented for different values of the tuning parameter.Finally, the approach of He and Geng (2016), shown to underperform in simulation, does not provide target estimates and outputs a DAG with a huge number of links, even for values of tuning parameter α encouraging sparsity.We refer the reader to Sections 6 and 7 of supplementary materials for sensitivity analyses and further illustrations of the results.

Discussion
We have developed a Bayesian statistical methodology for simultaneous learning of network-based structure dependencies and of intervention targets.Our proposal is useful in those contexts where the data are subject to interventions, but each intervention affects unknown variables in the network, as in diffuse protein perturbations or drug target discovery.We implement a novel MCMC sampler to approximate the posterior distribution on the space of DAGs and intervention targets.When applied to genomic data collected under various drug treatments, our method provides insights on the dependence relations between genes and the effect of distinct drug therapies.We also provide theoretical guarantees of the methodology, by looking at the consistency in recovering the true intervention targets and graph.Our theoretical results are supported by rigorous simulation studies, showing an overall outperformance of our method with respect to alternative approaches, in terms of target and structure learning.
Since changes in brain networks are known to be involved in stimulus-response associations (Boettiger and D'Esposito 2005), a potential field of application of the developed methodology is on the joint learning of dependent changes in functional Magnetic Resonance Imaging (fMRI) activations within brain regions and functional connectivities between regions; see for instance Warnick et al. (2018).The human brain is an oriented network system of brain regions involving directional connectivity, and the mainstream statistical approach relies on the theory of random networks (Simpson, Bowman, and Laurienti 2013).Still, many statistical issues remain unaddressed, with the study of complex dependencies among brain regions a fertile area of methodological development, often based on simplistic inferential frameworks.For instance, in the network construction process from raw fMRI data up to the adjacency matrix, methods for estimating functional connectivities between network nodes typically rely on association measures, whilst modeling methods remain relatively limited for brain network estimation.After estimating a functional brain network, the following step often involves various methods of crude thresholding of the connectivity matrix (Telesford et al. 2011), to remove weak connections and produce an adjacency matrix which notes the presence or absence of a functional connection between any two nodes.We conjecture that an implementation of our methodology directly on fMRI data can lead to an alternative reliable estimation of the adjacency matrix characterizing the brain network and of the activated target areas under stimuli, avoiding the arbitrary steps involved in the process of network construction.
Our approach to structure learning of DAGs and targets relies on the do-calculus theory and the allied notion of intervention DAG (Pearl 2000).Accordingly, we are able to recover interventions which modify the DAG Markov property, by destroying parent-child dependencies for each intervened variable.Alternative definitions of intervention are available in the literature.Among these, Kocaoglu et al. (2019) consider from a theoretical perspective the case of soft interventions, where parent-child dependencies are "modified" but yet preserved after intervention, and develop graphical criteria to represent the post-intervention DAG Markov property and characterize DAGs Markov equivalence.A Bayesian methodology for structure learning under soft interventions is possible, following the lines of Castelletti and Consonni (2019), and an extension to

Figure 1 .
Figure 1.A DAG D and the corresponding intervention DAG D I for the target I = {3, 5}.
Proposition 3.3.Let D 0 be the true DAG.Assume (D, L) | D follows a DAG-Wishart distribution with hyperparameters U and a D as in Equation (8), and consider the likelihood function in Equation (5).If (a) a D j = a + |pa D (j)| − q + 1, (b) ñj = o(n * j

Figure 4 .
Figure 4. Sachs data.Heat map with estimated posterior probabilities of intervention computed under each of the nine interventions for each node u (u = Raf, . . ., JNK), (left side).Estimated DAG with dark (light) gray circles representing nodes whose posterior probability of intervention under reagent CD3/C28 (ICAM-2) exceeds 0.5 (right side).

Carbamazepine.
The aim of the original study was to identify mRNA expression biomarkers associated with the disease and the antiepileptic drug response.Results inRawat et al.  (2020)  revealed that patients showing differential response to antiepileptic monotherapies were also characterized by differential blood gene expression profiles.

Figure 5 .
Figure 5. Epilepsy data.Left panel: Heat map with estimated posterior probabilities of intervention computed under each intervention (drug treatment) for selected nodes (genes 1654552, . . ., 3192316).Right panel: Estimated (preintervention) sub-graph of six selected genes, parent and child nodes, with gray circles representing intervention targets.
with increasing values n(k)∈ {10, 20, 50, 100, 200, 500}, for a number of interventions (datasets) K = 4.A family of intervention targets I 1 , . . ., I K is generated under two scenarios resembling different degrees of "sparsity" in the targets.Scenario Sparse is characterized by a moderate number of interventions, with each target I k obtained by drawing without replacement s ∈ {2, 4} nodes, respectively, for q = 20 or q = 40 .On the other hand, in Scenario Diffuse we assign each node to one of the K targets.As a consequence, each variable is involved in one of the K interventions, with an overall larger number of simultaneous interventions (sizes of the targets I k ).Under each scenario defined by (q, n (k) ) and settings Sparse and Diffuse, we perform 40 simulations, each corresponding to a true DAG D, family of targets {I 1 , . . ., I K } and resulting in a (multiple with K = 4 groups) dataset.For each simulation we first generate a topologically ordered DAG D with probability of edge inclusion p edge = 2/q.Given DAG D and family targets I 1 , . . ., I K , we then generate parameters D, L and (k) = φ

Table 1 .
Sachs data.Intervention targets and sample sizes for each of the nine administrated reagents giving rise to the collection of nine datasets.Symbol ?indicates an unknown target for the corresponding reagent.