Cyclic structural causal model with unobserved confounder effect

Abstract This study examined cyclic non linear structural equation models (SEMs) with an unobserved confounder. The determination of causal direction, identifying whether X causes Y, Y causes X, or X and Y affect each other, is a fundamental issue in science and technology. Unobserved confounding factors are so-called third variables which influence both X and Y. Inclusion of unobserved confounding factors in the model, if they exist, is highly important, as unnoticed confounding variables can result not only in biased estimates but in pseudo-correlation between X and Y, which can distort causal inference. In this article, a cyclic non linear SEM with confounding factors was developed and the causal graph identifiability of the model was proved. The model’s identifiability was confirmed by simulation studies involving synthetic datasets and by application to a real dataset, namely, a Census Income (KDD) dataset containing weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. The model effectively described the relationship between the variables age and wages, and no strong confounding factors effect was detected.


Introduction
Determination of causal direction, identifying whether X causes Y, Y causes X, or X and Y mutually affect each other, is a fundamental issue in science and technology. Many applied researchers have explored empirical methods to study causal direction, and many methodological researchers have made efforts to develop appropriate methodology (Spirtes, Meek, and Richardson 1995;Richardson 1996;Kano and Shimizu 2003;Shimizu et al. 2006;Hoyer et al. 2009;Mooij et al. 2011;Nagase and Kano 2017). However, to date, no useful statistical methods that can reasonably handle cyclic relations have been reported.
One of the issues that makes causal inference difficult is existence of unobserved confounding factorsa set of so-called third variables influencing both X and Y: Unnoticed confounding variables can result in biased estimates and pseudo-correlation between X and Y and thus they can distort causal inference. The issue of confounding factors has been discussed in Hoyer et al. (2008), Tashiro et al. (2012), and Chen and Chan (2013), under acyclic and non Gaussian error setting. Lacerda et al. (2012) and Mooij et al. (2011) have discussed cyclic models without confounders. In this article, we develop a cyclic non linear structural equation model (SEM) with confounding factors under Gaussian error setting and prove the identifiability of the model. Here, unobserved confounders are expressed as one or more covariances in a variance-covariance matrix of the error terms in the model. Although it is highly important to develop cyclic models with confounders in practice, it appears no study for the identifiability issue of such models possibly because of its difficulty. To this end, we discuss the model identifiability in Section 2. In Section 3, we show the effectiveness of our method by analyzing several synthetic datasets, and the method is then applied to a real dataset. Conclusions are presented in the final section.

Model structure
Let ðy, xÞ be random variables involved in the model. A cyclic SEM has the following form: x, y f 0 ðxÞg 0 ðyÞ j j< 1: The condition max x, y f 0 ðxÞg 0 ðyÞ j j< 1 is necessary, as the model assumes that the set of random variables ðy, xÞ is yielded in the equilibrium state in the corresponding recur- where ðy t , x t Þ is the t th sequence point, often denoting time. Suppose that i) the error term ðe y , e x Þ does not change over time, and ii) lim t!1 jy t j < 1 and lim t!1 jx t j < 1, and iii) lim t!1 ðy t , x t Þ exists, given ðe y , e x Þ: Note that the domain of ðy, xÞ is defined as the one satisfying max x, y f 0 ðxÞg 0 ðyÞ j j< 1 given the functions f and g: The Jacobi matrix associated with the transformation in Equation (1) is given as @e y @y @e y @x @e x @y @e x @x 2 6 6 6 4 3 7 7 The Jacobian is given as J y, x ð Þ :¼ @e y @y @e y @x @e x @y See Mooij et al. (2011, formula (4)).

Causal graph identifiability
Causal graph identifiability for the no-confounder model, i.e. e y ?e x , has been proved in Mooij et al. (2011). It is extended here to show that the same conclusion holds true even if the model permits a covariance between errors.
Theorem 1. Two causal graphs G M andG M corresponding to the following models M : do not necessarily coincide, if and only if i) f , g,f ,g are all affine or ii) f , g,f ,g have the following special form: f x ð Þ ¼ C x þ D, g y ð Þ ¼ ay À bg y ð Þ þ c,f x ð Þ ¼D, andg ðyÞ satisfies the ordinary differential equation (ODE) below: Cr XY Àr YY , and c is a constant.
Proof. We show here only a sketch of the proof and the main difference from the noconfounder case. Let / x ð Þ :¼ f 0 ðxÞ, w y ð Þ :¼ g 0 ðyÞ,/ x ð Þ :¼f 0 ðxÞ andw y ð Þ :¼g 0 ðyÞ: The second order derivative of the log probability density function of Equation (2) for model M is expressible as D 00 M ¼ r XX / À r XY /w À r XY þ r YY w À / 0 w 0 /wÀ1 ð Þ 2 , and let D 00 M be the log density function for modelM: As in the case of Mooij et al. (2011), the following two cases in both of which the causal graphs ofM do not coincide with M are considered: i)M has zero arrows (f 0 ¼g 0 ¼ 0), and ii)M has one arrow (4-2) respectively. Polyanin and Zaitsev (2004: Section 4.3. Solution of Functional Differential Equations by Differentiation) has provided a method to solve the following form of ODE: where the functions U i x ð Þ and W i y ð Þ depend only on x and y, respectively. The i th term U i x ð ÞW i y ð Þ will vanish, if we divide Equation (5) by U i x ð Þ and differentiate it by x (or divide Equation (5) by W i y ð Þ and differentiate it by y, equivalently). We can repeat the procedure until only two terms remain, sayÛ 1 x ð ÞŴ 1 y ð Þ þÛ 2 x ð ÞŴ 2 y ð Þ ¼ 0, and this has possible solutions: The form of ODE described in Equation (5) can be obtained by multiplying the denominator in the last term /w À 1 ð Þ 2 by both sides of Equation (4). Here, we will show an example for the one arrow case: Equation (4-2), which has very different trajectory from what we see in the no-confounder case. After a certain set of procedures (the repeating divisions and differentials. For details, see the supplemental material), a combination of the final two terms can be obtained with the assumption / 0 6 ¼ 0 and w 0 6 ¼ 0 as follows: and d), we indicate w 0 ¼ 0, this violates the assumption w 0 6 ¼ 0: Similarly, sinceÛ 1 ¼ 6/ 0 ,Û 1 x ð Þ ¼ 0 in c) indicates / 0 ¼ 0, then this violates the assumption / 0 6 ¼ 0: Then possible solution can be obtained only as 1 w , and assume r XY 6 ¼ 0, the ODE can be expressed as r XX n À r XY ¼ Cnn 0 and it is solvable. Then one candidate solution of w can be obtained as (see Equation (2) in the supplement material): (6) where C and D 0 are constants and the LambertW function is defined as the inverse function of h W ð Þ ¼ We W : Next step is to check if the candidate solution in Equation (6) is really a solution. Utilize the equality r XX w 2 À r XY w 3 ¼ ÀCw 0 , we can replace r XY w 3 with r XX w 2 À Cw 0 , and after repeating another set of procedures with the assumption / 0 6 ¼ 0 and w 0 6 ¼ 0, the following another combination of the final two terms is obtained: The candidate solution in Equation (6) should also be a solution for this combination, if it is really a solution. The only possible solution here is obtained from d) andŴ 2 ¼ 0 indicates that: where D 1 and D 2 are constants. Since the right-hand sides of Equation (7) and (8) are identical, we can connect those equations as follows: Insert the candidate solution of w in Equation (6) into (9) yields: However, the coefficient of the 3rd order term of the Lambert W function in Equation (10) is 5 2 r XY r XX 6 ¼ 0 and the equality in Equation (9) does not hold, then it turns out that the candidate solution in Equation (6) is not a solution.
After applying similar procedures to all of the combinations, the only two solutions are obtained: i) the all-affine case; ii) the case in which f is affine,f is a constant, g y ð Þ ¼ ay À bg y ð Þ þ c, andg satisfies Equation (3).
It is also confirmed that i) ifg is assumed to be affine, Equation (3) also supports the assumption, and indicates that g is affine as well, and ii) when r XY ¼r XY ¼ 0, Equation (3) is equivalent to Equation (6) in Mooij et al. (2011).

Simulation study
In this article, we only present results of a two-variable case for simplicity, whereas we have confirmed that similar simulations to those described below were feasible for cases involving more than three variables, and in fact, up to seven-variable cases in our computing environment were possible within a realistic time frame.

Quadratic polynomial models
Simulation studies were performed to practically confirm the identifiability of the model proved in the previous section. Datasets with n ¼ 1600, 400, 100, 25 observations, distributed from the following SEM, were generated. A normal distribution was chosen so that nearly all of the generated points ðX, YÞ satisfy the condition max x, y f 0 ðxÞg 0 ðyÞ j j< 1, and the coefficients were randomly generated: y ¼ À0:362x þ 0:163x 2 þ e y x ¼ 0:083y À 0:352y 2 þ e x , R ¼ :607 2 :184 :184 :607 2

! &
Data points that did not satisfy the condition were eliminated. The non linear optimization using augmented Lagrange method (Ye 1987) was used to maximize the loglikelihood function P n i¼1 logq y i , x i ; f , g, R ð Þ , where n is the number of observations and q y i , x i ; f , g, R ð Þis the pdf, see Equation (2), of the i th observation, with the constraint max i21, :::, n f 0 ðx i Þg 0 ðy i Þ j j< 1: Notice that f(x) and g(y) include parameters to be estimated.
All of the datasets were analyzed in R version 3.5.2 (R Core Team, 2018, Vienna, Austria) on macOS version 10.13.6 (Cupertino, CA, USA) using the package 'Rsolnp' version 1.16 (Ghalanos and Theussl 2015). The R scripts used for all the analyses are provided as supplemental material. As shown in Table 1 and Figure 1, all of the parameters are nearly completely recovered with n ¼ 1600 observations, supporting the fact that the model is identifiable. On the other hand, the result with n ¼ 25 observations barely recovers the causal structure. These results indicate that 100 or more observations are necessary to acquire reliable estimates in this setting.

Models with sine and cosine functions
We next show that the causal structure can be completely recovered if linear combinations of known non linear functions are assumed for the functions f ðxÞ and gðyÞ: In order to allow the functions to describe a wide variety of shapes and to easily meet the condition max x, y f 0 ðxÞg 0 ðyÞ j j< 1, a model with the following structure is considered; we call it the 'sin-cos model.' Complexity parameters p and q, which specify the degree of expansion for both the sine and cosine functions, are introduced: A dataset with n ¼ 300 observations distributed from the following SEM was generated. The coefficients in function f were randomly generated, as in the case of the quadratic model. Causal direction in this setting is considered to be X to Y since the magnitude of the coefficients in g is set to a value small enough to be ignored. It becomes apparent from the simulation that the cyclic model can handle a situation where the causal structure is close to acyclic and, in fact, can be considered, in practical terms, to be acyclic. y ¼ À1:207 sin x þ 0:277 cos x þ 0:542 sin 2x À 1:173 cos 2x þ e y x ¼ 0:1 sin y þ 0:1 cos y þ e x , R ¼ 1 :5 :5 1

! &
The complexity parameters in the true setting are p ¼ 2 and q ¼ 1: In general, one cannot know the complexity parameters beforehand; they need to be specified in the analysis. One idea is to compare AIC values; the other is to use BIC, as suggested in Haughton (1988) for a greedy search of DAGs. As shown in Table 2, the model with p ¼ 2, q ¼ 1 gives the lowest AIC value (817.45), whereas the model with p ¼ 2, q ¼ 0 gives the lowest BIC score (851.14). Using the best model according to BIC, all of the parameters are recovered with very similar values to the true ones, as shown in Table 3 and Figure 2. Thus, the causal structure, including the causal direction and the  magnitude of direct causal effect, can be simultaneously specified in the cyclic SEMs in this experiment.

Real-world data with trivial causal structure
Real-world data containing cause-effect pairs were obtained from a public database compiled by Mooij et al. (2009). For our demonstration, we chose the Census Income (KDD) dataset, which contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. The dataset has two variables: X, age; and Y, hourly wage. Since the variable Y is strongly skewed to lower wages, it is log-transformed so thatỸ :¼ logY: The dataset includes 5000 observations; however, for computational efficiency, we randomly sampled 1000 observations among them. The variable X is scale-changed, so thatX :¼ ðX À 15Þ=64: We confirmed that the random selection did not have a significant impact on results, as the estimated shapes of the functions are nearly the same, while the selected complexity parameters   were slightly different. Periodic-range-adjusting-parameters T x and T y were also introduced. We applied both the quadratic model and the sin-cos model to the data. As in the case of the previous synthetic examples, we explored all combinations of the complexity parameters up to p þ q 6, ultimately selecting the model with p ¼ 5, q ¼ 1 (AIC ¼ 429:13) and the model with p ¼ 3, q ¼ 1 (BIC ¼ 490:98Þ for the sin-cos model, as shown in  The underline represents minimum value in AIC or BIC.   where T x ¼ maxxÀminx 2p and T y ¼ maxỹÀminỹ 2p , respectively. The estimates of the coefficients a 1 , :::, a 10 are shown in the first row (AIC) and the second row (BIC) of Table 5. The curve of the estimated functions f ðxÞ and gðỹÞ for the quadratic model and the sin-cos model (best in AIC and BIC) are plotted in the left panel of Figure 3 and the middle and right panel of Figure 3, respectively. Note the scale is back-transformed in the figure for better interpretation. The results are similar, and one can see that age X affects the hourly wage Y, i.e. it shows that the highest hourly wage is for ages around 40-50; on the other hand, the hourly wage Y has almost no explicit impact on age X:

Discussion
Graph identifiability and model identifiability are different. The model identifiability requires identifiability of parameter values in addition to graph identifiability. Linear cyclic SEM models with two variables under normality are known not to have the both identifiabilities. In this article, we examined basic properties of cyclic non linear SEMs and their graph identifiability. Simulations and analysis of a real-world dataset were made to confirm identifiability of parameter values, feasibility, and interpretability for several models for f ðxÞ and gðyÞ: A model with sine and cosine functions may have wider applicability in practical situations, since the model can cover versatile shapes of functions as the values of the complexity parameters p and/or q increase. One can choose the complexity parameters by information criteria, such as AIC or BIC (using BIC is perhaps better). Importantly, the model proposed here allows users to assess whether unobserved confounders exist. Although it requires additional computational power, the model can handle more than two variables, assuming an error covariance between every pair of variables. Based on our simulation studies, all of the parameters appear to be identifiable. Our approach is quite promising and further development is expected.