High-Dimensional Mediation Analysis for Selecting DNA Methylation Loci Mediating Childhood Trauma and Cortisol Stress Reactivity

Abstract Childhood trauma tends to influence cortisol stress reactivity through the mediating effects of DNA methylation. Houtepen et al. conducted a study to investigate the role of DNA methylation in cortisol stress reactivity and its relationship with childhood trauma. The study collected a dataset consisting of 385,882 DNA methylation loci, cortisol stress reactivity, one-dimensional score on a childhood trauma questionnaire and several covariates for 85 healthy individuals. Of great scientific interest is to identify the active mediating loci out of the 385,882 ones. Houtepen et al. conducted 385,882 linear mediation analyses, in each of which one locus was considered, and identified three active mediating loci. More recently, van Kesteren and Oberski proposed a coordinate-wise mediation filter (CMF) and applied it to the same dataset. They identified five active mediating loci. Unfortunately, the three loci identified by Houtepen et al. are completely different from the five loci identified by van Kesteren and Oberski, probably because both Houtepen et al. and van Kesteren and Oberski did not consider all loci jointly in their analyses. The high dimensional DNA methylation loci indeed necessitate new techniques for identifying active mediating loci and testing the direct and indirect effects of the early life traumatic stress on later cortisol alteration. Motivated by the contradictory results in the aforementioned two scientific works, we develop a new estimating and testing procedure, and apply it to the same dataset as that analyzed by the two works. We identify three new loci: cg19230917, cg06422529 and cg03199124, and their effect sizes and p-values are 321.196 (p-value = 0.035965), 418.173 (p-value = 0.000234) and 471.865 (p-value = 0.001691), respectively. These three loci possess both reasonably neurobiological interpretations and statistically significant effects via our proposed tests. Based on our new procedure, we further confirm that the childhood trauma does not have significant direct effects on cortisol change—it only indirectly affects cortisol through DNA methylation, and the indirect effect is negative. Supplementary materials for this article are available online.


Introduction
Childhood trauma plays a pivotal role in the development of psychiatric disorders across life span (Burke et al. 2005;Petrowski et al. 2013). Its persistently detrimental influence is typically realized via altering neuroendocrine substances like cortisol (Heim et al. 2000;Carpenter et al. 2007). Ever since the pilot study conducted by Luecken (1998), researchers thereof have sought for the mechanism relating cortisol change to various circumstances of childhood trauma, such as maltreatment (Carpenter et al. 2007), physical abuse (Heim et al. 2000;Bremner et al. 2003;Elzinga et al. 2010;Carpenter et al. 2011), early parental loss (Luecken 1998;Kraft and Luecken 2009), separation experience (Pesonen et al. 2010), among others.
On finding such relations, the aforementioned works nevertheless have not reached a concordant solution. This pushes through deeper exploration toward epigenetic alteration involved in the traumatic stress. Convincingly demonstrated by preclinical studies, childhood trauma tends to influence neuroendocrine system in adulthood via altering DNA methylation patterns. McGowan et al. (2009) studied the epigenetic regulation of glucocorticoid receptor (NR3C1) in human brain associated with childhood abuse. Perround et al. (2011) showed that early life adverse events may permanently impact on the Hypothalamus-Pituitary-Adrenal axis (HPAA) though epigenetic modifications of NR3C1. Edelman et al. (2012) demonstrated epigenetic changes at the GR exon 1F correlate with HPAA reactivity measured by total cortisol (area under curve). See Vinkers et al. (2015) for a comparative review of literature regarding trauma-induced changes in DNA methylation in humans.
These works mainly concentrated on single-layer linear models, where effects of early life trauma on DNA methylation and effects of DNA methylation on HPAA or cortisol alteration are separately evaluated. However, DNA methylation ought to play a bridging role in the relation between childhood trauma and cortisol stress reactivity. In addition, all of their scientific findings are based on epigenetic modifications of a single gene. In theory, this is unlikely to be the case and would result in estimation bias. In sight of such issues, Houtepen et al. (2016) conducted a genome-wide mediation analysis and identified a locus on the KITLG gene that mediates the relationship between childhood trauma and cortisol stress reactivity. Although starting at 385,882 DNA methylation loci, only the top three loci were selected for further investigation by the QQ plot of the p-values obtained from individual significance tests, with a total discard of the dependence structure and joint effects of DNA methylation. To account for the between-loci dependency, van Kesteren and Oberski (2019) proposed an embedding algorithm called coordinate-wise mediation filter (CMF), which consists of an inner loop and outer loop. A key strategy of CMF to address dependency is the use of residuals and projection when detecting loci in the inner loop. This CMF algorithm targets dichotomous decisions-whether each of the DNA methylation locus should be recognized as a mediator, while offers no guarantee of either statistical significance or model fits. Interestingly, van Kesteren and Oberski (2019) identified completely different DNA methylation loci from Houtepen et al. (2016), based on the same dataset but using the CMF algorithm. In response to this contradiction, we in this article carry out an in-depth mediation analysis for a thorough understanding of how early life trauma affects cortisol stress reactivity in adulthood via DNA methylation.
From a statistical point of view, this is a high dimensional mediation problem, with DNA methylation loci being potential mediators, the vast majority of which though are supposed to be inactive. Nothwithstanding no shortage of strategies dealing with high dimensional mediators, including those in Houtepen et al. (2016) and van Kesteren and Oberski (2019), most existing literature rely on the marginal screening or penalized regression for sparse estimation. See for instance Preacher and Hayes (2008), Zhang et al. (2016), Serang et al. (2017), and so forth. A pitfall of using these dimension reduction techniques in each or either layer of mediation models lies in the pertinent difference between penalizing paths and finding actual mediators. That is, they choose paths instead of mediators. As a potential insight to break through this obstacle, Zhou, Wang, and Zhao (2020) proposed a debiased Lasso method that can integrate the two layers of high dimensional mediation models, and they also developed significance tests for both direct and indirect effects. However, the method proposed in Zhou, Wang, and Zhao (2020) involves high dimensional matrix estimation and operation, which might bring about a huge computational burden. In addition, the procedure penalizes all parameters, and the debiased step relies on the entire covariance matrix. This leads to inevitable efficiency loss of the estimators. More recently, Guo et al. (2021) observed that despite of high dimensional mediators, the direct and indirect effects are both low dimensional, with sum being the total effect. They thereby proposed a partial penalized approach for estimating the direct effects, which avoids high dimensional matrix estimation and the debiased step, and thus, enhances efficiency of proposed estimators. In spite of the plausible theory and efficient algorithms, Guo et al. (2021) have not yet explicitly elucidated the method with potential confounders, which typically should be considered when studying traumatic effects on cortisol alteration via DNA methylation, as in the literature (Houtepen et al. 2016;van Kesteren and Oberski 2019). Therefore, we in this article extend the work of Guo et al. (2021) to the models with confounders. Then we use our new procedure to study the mediating role of DNA methylation relating childhood trauma and cortisol stress reactivity, with several clinical variables involved as confounders. We further develop relevant tests for the direct and indirect effects of the early life trauma on cortisol stress reactivity.
Aside from the eight DNA methylation loci detected by Houtepen et al. (2016) and van Kesteren and Oberski (2019), we identify three additional loci on the RAB5IF gene (cg19230917), the CPQ gene (cg06422529) and the AGPAT1 gene (cg03199124) as mediators. We look through existing literature, and find reasonably neurobiological interpretations toward these three genes, with details referred to Section 3. Thus, our findings point out a potential direction for deeper neurobiological and epigenetic investigation of the connection between traumatic stress and cortisol alteration. From statistical point of view, we perform several statistical tests, and the results are also in support of the selected genes. According to the tests for the direct and indirect effects proposed in this article, the childhood trauma influences cortisol reactivity only through DNA methylation, since the indirect effect is negatively significant, yet the direct effect is not significant. In the full model with all detected loci, those from the newly identified genes are all significant, while the KITLG gene (cg27512205) selected by Houtepen et al. (2016), the HNRNPF gene (cg12500973) and the ZSCAN30 gene (cg16657538) selected by van Kesteren and Oberski (2019) are no longer significant. However, models with only the genes in Houtepen et al. (2016) yield a contradictory conclusion that KITLG is significant.
In Section 2, we introduce the statistical formulation of the high dimensional mediation problem, including the mediation models with confounders involved, the estimation for direct and indirect effects, and the tests of significance of indirect and direct effects. The detailed analysis is presented and discussed in Section 3. We also conduct a thorough simulation study to validate the finite sample performance of the proposed procedure in Section 4. A brief summary and conclusion are provided in Section 5.

Statistical Formulation: High Dimensional Linear Mediation Models with Confounders
In this section, we introduce the high dimensional mediation models with confounders involved, as the statistical formulation associating childhood trauma with cortisol stress reactivity via altering DNA methylation. Then we extend the partial penalization technique in Guo et al. (2021) to these models, for estimating and testing the direct and indirect traumatic effects. Let y be the response variable, m consist of p-dimensional mediators, x consist of q-dimensional exposure variables, and z consist of d-dimensional confounding variables. In our study, y is designated as the cortisol stress reactivity, x is childhood trauma, and elements in m correspond to DNA methylation loci that potentially mediate relations between trauma and cortisol. Moreover, we take several clinical variables as confounders in z, with detailed descriptions in Section 3. Consider linear mediation models where ε 1 is a random error with Eε 1 = 0 and var(ε 1 ) = σ 2 1 and ε 2 is a random error vector with E(ε 2 ) = 0 and cov(ε 2 ) = * . Assume that ε 1 is independent of m, x and z, and ε 2 is independent of x and z. Furthermore, assume that ε 1 and ε 2 are independent.
Motivated by the real data analysis in Section 3, it is assumed throughout this article that q and d have fixed and finite dimensions, while p is high dimensional. Plugging (2.2) into (2.1), we obtain where α 1 and β = 1 α 0 are called the direct and indirect effect of exposure x in mediation literature, respectively, and γ x = α 1 + β is called the total effect of x. Often of primary interest from mediation point of view is to estimate and test α 1 and β. And these two parameters possess their own interpretations as natural indirect effect and natural direct effect in causal inference.

Natural Direct and Natural Indirect Effects
We link the parameters α 1 and β with natural direct and natural indirect effects on a causal diagram. Let y(x, m) denote the potential outcome that would have been observed had x and m been set to x and m, respectively, and m(x) denote the potential mediator that would have been observed had x been set to x. Following Imai, Keele, and Tingley (2010), Vanderweele and Vansteelandt (2014), and others, for x = x 1 versus x 0 , the natural direct effect is defined as while the indirect effect is defined as The total effect is then naturally defined as the sum of natural direct and indirect effects Furthermore, the independence assumptions of random errors in the mediation models (2.1) and (2.2) ensure the following sequential ignorability conditions (Imai, Keele, and Tingley 2010;Vanderweele and Vansteelandt 2014;Huang 2019;Zhou, Wang, and Zhao 2020).
(A1) x⊥ ⊥y(x, m)|z: that is, no unmeasured confounders between the exposure and outcome.
(A2) m⊥ ⊥y(x, m)|(x, z): no unmeasured confounders between the mediators and outcome. (A3) x⊥ ⊥m(x)|z: no unmeasured confounders between the exposure and mediator. (A4) m(x)⊥ ⊥y(x, m)|z: no exposure-dependent confounders between the mediators and outcome, wherex is the realization of exposure at a different value from x.
Under these sequential ignorability conditions, Vanderweele and Vansteelandt (2014) showed that Thus, α 1 can be interpreted as the average natural direct effect, and β = 1 α 0 can be interpreted as the average natural indirect effect.

Identifying Active Mediators
In this section, we introduce the procedure of identifying active mediators in the mediation models (2.1) and (2.2), and estimating the direct effect α 1 and indirect effect β that can get around high dimensional matrix estimation. Suppose that {m i , x i , z i , y i }, i = 1, . . . , n is a random sample from (2.1) and (2.2). Let y = (y 1 , . . . , Despite high dimensionality of m, model (2.3) is indeed a fixed-dimensional model. Therefore, the coefficient of x, or say the total effect γ x = α 1 + β, could be naturally estimated via the ordinary least squared estimator in model (2.3), that is, where I q is q×q dimensional identity matrix, and O q×d is a q×d zero matrix.
Another key observation is that x and z in model (2.1) are both fixed dimensional, thus, we opt not to impose sparsity on α 1 and α 2 . On the other hand, sparsity on α 0 , the coefficient associated with the high dimensional mediator m, could be naturally and reasonably assumed, as so in most existing highdimensional literature. Therefore, following Guo et al. (2021), we apply the partial penalized least squared method to fit model (2.1) by only penalizing α 0 . That is, the objective function subjected to minimization is where α 0j is the jth element in α 0 , and p λ (·) is a penalty function with tuning parameter λ. Throughout this article, we will use the SCAD penalty, whose first-order derivative is and set a = 3.7 as suggested by Fan and Li (2001). The tuning parameter λ is selected by HBIC (Wang, Kim, and Li 2013).
A numerical algorithm to solve this penalized least squares problem is given in Section S.4 in the supplementary material of this article. Denote the corresponding estimates to be α 0 , α 1 and α 2 . Further note that β = γ x − α 1 . Then we can estimate the indirect effect β by β = γ x − α 1 . Let A = {j : α 0j = 0} and A = {j : α 0j = 0}. Under suitable conditions, we could obtain oracle property for our proposed estimates. That is, with probability tending to 1, A = A. Note that all truly active mediators are included in A. This then implies that we can identify all truly potential mediators. We will empirically evaluate the performance of the partial penalized least squared method in (2.5) in the simulation section.

Test of Direct Effect and Indirect Effect
In terms of further statistical inference, penalizing α 0 gains efficiency when estimating the coefficients, and hence, enhances power toward the subsequential tests. Meanwhile, not penalizing α 1 and α 2 renders their respective estimators unbiasedness, thus, there is no need for conducting any of the debiased, desparsified or decorrelated procedures (Zhang and Zhang 2014;Van de Geer et al. 2014;Ning and Liu 2017;Zhou, Wang, and Zhao 2020), which admittedly correct estimation biases brought by ordinary regularization methods yet sacrifice efficiency.
To develop tests for the direct effect α 1 and indirect effect β, we first derive the asymptotic distributions of their estimators α 1 and β.
Thus, models (2.1) and (2.2) can be rewritten as which coincide with the causal mediation models considered in Guo et al. (2021). Thus, incorporating the results in Corollary 1 of Guo et al. (2021), the asymptotic distribution of α 1 and β can be obtained in a similar fashion. Specifically, define m A to be the subvector of m corresponding to A = {j : α 0j = 0}. And The asymptotic covariance matrices in (2.7) and (2.8) could be estimated in the same routine as Guo et al. (2021), by replacing quantities related to x in their work with those related to w in this article. These estimates lay the foundation of subsequential tests.
For testing the direct effect α 1 with hypotheses we modify the F-type lack-of-fit test proposed by Guo et al. (2021) by incorporating confounding effects. In model (2.1), denote RSS f to be the residual sum of squares (RSS) in the full model fitted by the partial penalized least squares method (2.5), and RSS r to be the RSS in the reduced model with x deleted from (2.1), obtained by the same partial penalized regression yet with objective function The test statistic thereby is defined as which follows χ 2 (q), the chi-squared distribution with degrees of freedom q, under the null hypothesis. And it possesses local power for local alternatives which converge to the null at the rate of n −1/2 . For testing the indirect effect β with hypotheses we construct the Wald test statistic with the estimated covariance matrices, namely, WW . The hat versions are the sample counterparts of the covariance matrices. The limiting null distribution of S is χ 2 (q), and the statistic can also detect the local effects with root-n convergence rate, as discussed in Guo et al. (2021). In addition, the Wald test for H 0β is based on the asymptotical normality of β. One may construct a Wald test for individual mediation effect β j or a subvector of β based on their marginal asymptotical normality.

A Case Study: Exploration of Mediating Effects of DNA Methylation between Childhood Trauma and Cortisol Stress Reactivity
This section is devoted to an empirical analysis of the same dataset as that in Houtepen et al. (2016) and van Kesteren and Oberski (2019), for studying how DNA methylation plays a role in the regulation of human stress reactivity. More specifically, Houtepen et al. (2016) aimed to provide an unbiased investigation of the role of DNA methylation in cortisol stress reactivity and its relationship with childhood trauma. The data can be downloaded from the following website: https://www.ebi.ac.uk/ arrayexpress/experiments/E-GEOD-77445, and the dataset consists of 385,882 DNA methylation loci and various variables for 85 people. R markdown file for this analysis is available at GitHub: https://github.com/zengmudong/High-dimensionalmediation-analysis Houtepen et al. (2016) performed a genome-wide DNA methylation analysis for cortisol stress reactivity in healthy individuals. Since the number of DNA methylation loci is much greater than the sample size, Houtepen et al. (2016) first ran 385,882 linear regression models-response being cortisol stress reactivity, predictors being each out of the 385,882 DNA methylation loci, respectively, and confounders being several clinical variables. They reported 22,425 loci with p-values less than 0.05, while no statistically significant loci at level 0.05 after adjustment for multiple testing. The authors then selected three loci that stood out in the p-value distribution of the genomewide cortisol stress reactivity analysis. The three loci are cg27512205 (denoted by m 1 ), cg05608730 (m 2 ) and cg26179948 (m 3 ), based on which the authors further conducted a mediation analysis, and identified a locus on the KITLG gene (cg27512205) that is not only associated to cortisol stress reactivity, but also partly mediates the relationship between childhood trauma and cortisol stress reactivity. More importantly, they replicated the analysis using two independent samples from the whole blood and buccal (cross-tissue) DNA, respectively, and concluded that the KITLG locus is indeed a mediator.
More recently, van Kesteren and Oberski (2019) proposed a coordinate-wise mediation filter (CMF), which aims to improve the marginal screening method for linear mediation models with high-dimensional mediators. They further applied CMF for an empirical analysis of the same dataset as Houtepen et al. (2016), and identified five loci as the mediators. The five loci are cg16657538 (m 4 ), cg25626453 (m 5 ), cg02309301 (m 6 ), cg13136721(m 7 ), and cg12500973(m 8 ), which are completely different from the three loci identified by Houtepen et al. (2016). This contradiction motivates us to conduct a further analysis using the new procedure for studying the mediating role of DNA methylation that relates childhood trauma and cortisol alteration.

Mediation Analysis via the Proposed Procedures
In our analysis, the exposure variable (x) is a one-dimensional score on a childhood trauma questionnaire, and the outcome y is the increased area under the curve (iAUC) in cortisol after a stress test. We consider 385,882 DNA methylation loci in the blood as potential mediators in m. Following van Kesteren and Oberski (2019), we first carry out a screening step to retain the top 1000 potential mediators by ranking the absolute value of the product of two correlations-the correlation between x and each element of m, and between y and each element of m. This indeed is a marginal screening procedure based on Pearson correlation proposed by Fan and Lv (2008). They showed that for linear models, under some regularity conditions, the screening procedure possesses a sure screening property. We also include the eight loci identified by Houtepen et al. (2016) and van Kesteren and Oberski (2019) as domain knowledge and for comparison purpose. Furthermore, eight clinical variables are involved, including age (Z 1 ), sex (Z 2 ), B cell proportion (Z 3 ), CD4 T cell proportion (Z 4 ), CD8 T cell proportion (Z 5 ), Monocytes cell proportion (Z 6 ), Granulocytes cell proportion (Z 7 ) and Natural Killer cell proportion (Z 8 ), as confounding variables. This leads to the linear mediation models (2.1) and (2.2), where x (with dimension q = 1) and y are defined above; the confounder vector z is z = (Z 0 , Z 1 , . . . , Z 8 ) T , with Z 0 ≡ 1 to include an intercept in the model.
We apply the proposed procedure to analyze the data. In the partial penalized least squares approach, we first select the tuning parameter λ by HBIC, and λ = 60.8163. The eight loci m 1 , . . . , m 8 are treated as domain knowledge and are not penalized. Aside from them, our proposed method selects three additional loci cg19230917(m 9 ), cg06422529(m 10 ), and cg03199124(m 11 ). The estimated coefficients α 0 , α 1 , and α 2 ,  Tables 1  and 2, the newly identified loci m 9 , m 10 , and m 11 should be considered as mediators since their coefficients are significant, and the coefficients of exposure variable on these loci are also significant at level 0.05. Table 3 presents the results for testing the direct and indirect effects. The indirect effect is significant with p-value 0.0016, while the direct effect is insignificant since the pvalue is 0.7643. Further note that the estimate of the indirect effect equals −17.3726 < 0. This implies that childhood trauma influences the cortisol stress reactivity only through the mediation mechanism of the DNA methylation, and the indirect effect is significantly negative. Table 4 lists the 11 DNA methylation loci together with the genes to which they belong. It also provides some field knowledge of these genes dug out from existing research, according to which, the newly identified genes m 9 , m 10 , and m 11 have particularly interesting biological and genetical interpretations. The locus m 9 corresponds to the RAB5IF gene (cg19230917). This gene modulates cell endocytosis process by which cells engulf substances, such as hormones, from outside into the cell (Ravikumar et al. 2008). Cortisol is a steroid hormone produced by the adrenal glands, and it may signal the cells through receptor for endocytosis. Thus, the RAB5IF gene likely plays a mediator rule that transmits the epigenetic alterations evoked by the traumatic stress. m 10 belongs to the CPQ gene (cg06422529), which is shown by Hauptmann et al. (2017) to function in thyroid and tumor development. Peter (2011) testified this gene as a pathway from stress to cortisol level change in fish. A further neurobiological exploration is worthwhile to find out whether it has similar mediating effect in human body. m 11 is located in AGPAT1 gene (cg03199124), which is involved in signal transduction and lipid biosynthesis for creating and storing body fat (Aguado and Campbell 1998). Some existing literature (Gonzalez-Bono et al. 2002;Kuo et al. 2007;Aschbacher et al. 2013) investigated the associations between physical stress like trauma and fat tissue biosynthesis. Vicennati et al. (2009) conducted a retrospective study and showed that women weight gain caused by trauma stress is accompanied by abnormal hormonal level such as cortisol. Our study which finds gene AGPAT1 as a mediator relating trauma stress and cortisol level therefore, may provide clues for such stress pathophysiological mechanism research. In summary, there is a reasonable conjecture that the identified loci, or their located genes, regulate neurobiological pathways and mediate the cortisol stress reactivity in response to childhood trauma, as also indicated by Table 3.

Some Comparisons
It is worth to compare our results with those in Houtepen et al. (2016) (Aguado and Campbell 1998) with (2.1) and (2.2) where m is taken to be m (1) , and models in van Kesteren and Oberski (2019) correspond to those with m (2) . We further consider the mediation models with m (3) , which merges m (1) and m (2) . The estimated regression coefficients α j 's in model (2.1) are listed in Table 5. The estimated 1 and 2 and their values coincide with those in Table 2 because regressing the multiple responses m over the exposure variable and confounding variables in linear model (2.2) coincides with regressing individual mediator m j over the exposure variable and confounding variables. Tables 1 and 5 both suggest that the direct effect of childhood trauma, or say the coefficient of the exposure variable x, is not significant in model (2.1). All confounding variables except for Z 2 (i.e., sex) are not significant. Mediators m 5 and m 6 are not significant based on all belonging models under investigation. Furthermore, comparing Tables 1 and 5, we observe that the effect of mediator m 1 may change from significance to insignificance at level 0.05, after inclusion of other mediators into the model. The reversal of test results for mediator m 1 , as well as the insignificance of m 5 and m 6 , motivates us to explore the relationship among all the identified mediators. Their pairwise correlations, partial correlations given x and z, and several multiple regression models all reflect certain degree of association among mediators, which further explain their insignificance given other mediators included in the model. We put the detailed discussion in the supplementary material to save space.
The estimated direct and indirect effects for these three models are presented in Table 6, together with corresponding significance tests. This table indicates that the direct effect is not significant and indirect effect is significant for models with mediators m (1) and m (3) , while both direct and indirect effects are not significant for model with mediator m (2) .

Simulation Studies
We in this section conduct Monte Carlo simulation studies to investigate the finite sample performances of the statistical procedure described in Section 2, and compare it with the oracle tests that know the true mediator set A, with statistics S O and T O , and those in Zhou, Wang, and Zhao (2020), with statistics denoted by S Z and T Z . The results are based on 500 replications and significance level 0.05.
We set up the experiments to mimic the real data analyzed in Section 3 to the utmost. The sample size is taken to be the same, the dimension of potential mediators is 1000, corresponding to the 1000 candidate DNA methylation loci, and the exposure variable x and confounder z are directly extracted from the dataset. Meanwhile, m is generated via model (2.2), since it needs to be considered as random according to the mechanism of mediation models. Then y is accordingly generated from model (2.1). To accomplish this, we first draw Gaussian noise ε 1 ∼ N(0, σ 2 1 ) in model (2.1), where σ 2 1 is the estimated σ 2 1 in Section 3. The multivariate noise in model (2.2) is generated from ε 2 ∼ N(0, * ), where * is taken to be autoregressive covariance matrix. That is, the (i, j)-element of * equals ρ |i−j| , and ρ = 0.5. The true mediators in A are designed in accordance with the 11 detected loci, m 1 , . . . , m 11 , from the real data. Their associated coefficients α 0 in model (2.1) is taken to be (1.0, 0.9, 0.8, −0.9, −0.8, −0.7, 0.6, 0.5, 0.4, 0.3, 0.2), with signs of elements consistent with those in α 0 estimated in Section 3. Moreover, the direct effect α 1 = c 2 , where c 2 = 0, 0.1, . . . , 1.0, to capture the size and power curve of the test for direct effect. For generating the indirect effect, the true value of

Simulation Studies Without Confounding Variables
We first consider models without confounding variable z. That is, α 2 and 2 are both taken zero in model (2.1) and (2.2). We evaluate the indirect effect tests, by fixing the direct effect α 1 = c 2 = 0.5. The left panel of Figure 1 depicts the size (c 1 = 0) and power (c 1 = ±0.1, . . . , ±1.0) for the three tests with statistics S,  S O and S Z . From this figure, powers of all three tests increase as |c 1 | increases, and sizes are well controlled. Our proposed test S performs as well as the oracle test S O , and is more powerful than S Z . For instance, when c 1 = 0.4, the empirical powers of S and S O are 0.63, while that of S Z is 0.23. We also consider testing direct effect α 1 by holding c 1 = 0.5, corresponding to true value of indirect effect β = −0.7989. Similarly, the right panel of Figure 1 shows the empirical size (c 2 = 0) and power (c 2 = ±0.1, . . . , ±1.0) for the proposed test T, the oracle one T O , and T Z proposed by Zhou, Wang, and Zhao (2020). The powers of all three tests increase as the value of |c 2 | increases. T performs closely with T O , and is more powerful than T Z when c 2 is positive. For instance, when c 2 = 0.2, the empirical power of T and T O can reach 0.78, while the empirical power of T Z test is 0.06.
Moreover, we investigate the performances of the estimators of direct effect α 1 and indirect effect β in terms of bias and standard deviation. The results are reported in Table 7. From this table, the biases of our proposed estimators α 1 , β and oracle ones α O 1 , β O are very small, while the biases of α Z 1 are very large. This in turn results in low power of S Z and T Z . Table 8 depicts the sample standard deviations of the estimates α 1 and β over 500 replications in the column with label "std, " which can be regarded the true value of standard error of the estimates. These sample standard deviations are also shown in parentheses of Table 7. In the column with label "se(std)" in Table 8, we report the sample average and sample standard deviation of the 500 estimates of standard errors of α 1 and β based on the asymptotic covariance matrix formulas in (2.7) and (2.8). Note that the R package "freebird" in Zhou, Wang, and Zhao (2020) does not provide the estimated standard error of α 1 . The difference between the column "std" and "se(std)" can be used to gauge the performance of the standard error formula based on the asymptotical covariance matrices. From Table 8, both the new method and the oracle have smaller difference than the method proposed by Zhou, Wang, and Zhao (2020).

Simulation Studies with Confounding Variables
We next examine the performances of the proposed methods for the models with confounding variables. In our simulation, we set the associated coefficients α 2 and 2 to be those estimated from the real data. Figure 2 shows the empirical sizes and powers of the tests S, S O , and S Z for indirect effect, and the tests T, T O , and T Z for direct effect. The left panel assesses the performance of tests for  To understand in depth the abnormal behavior of the power curves of Zhou, Wang, and Zhao (2020)'s tests, we investigate the performance of estimated direct effect α 1 and indirect effect β in terms of bias and standard deviation, as reported in Table 9. The biases of Zhou, Wang, and Zhao (2020)'s estimates are fairly large compared to the proposed estimators α 1 , β and oracle ones α O 1 , β O . When holding c 2 = 0.5, the bias of α Z 1 increases dramatically as c 1 increases. Similar phenomenon occurs when holding c 1 = 0.5, where the bias of β Z changes substantially. The bias of estimated α Z 1 and β Z results in the shifted curves shown in Figure 2. The large bias of Zhou, Wang, and Zhao (2020)'s estimates and the low power of their tests are possibly due to the penalization on direct effect in the scaled lasso, as also pointed out in Zhou, Wang, and Zhao (2020) (see the discussion in their sec. 7). The penalization on direct effects would make sense when the direct effects are expected to small. This is another main merit of applying partial penalized method toward this problem setting. We explore Zhou, Wang, and Zhao (2020)'s method more to understand its inferior performance. Note that the proposed method does not penalize the direct effect α 1 , while Zhou, Wang, and Zhao (2020)'s method does penalize the direct effect in fitting scaled lasso (Sun and Zhang 2012). This leads to a larger estimated error variance σ 2 1 than the proposed method and the oracle estimator. Figure 3 compares the estimated σ 1 of the proposed estimate, oracle estimate and Zhou, Wang, and Zhao (2020) when confounding variables are involved in the mediation model. From Figure 3, we can observe that when c 1 or c 2 changes from negative to positive, Zhou, Wang, and Zhao (2020)'s estimated σ 1 dramatically increases, while the proposed method and oracle estimate do not. The increasing trend of variance estimation results in large biases of initial estimates used in Zhou, Wang, and Zhao (2020), making the debiased step   more challenging. In addition, as c 1 or c 2 increases, estimating through D − UU ≤ τ , where UU = uu T , u = (m T , w T ) T , and D and are defined following Zhou, Wang, and Zhao (2020), requires larger tuning parameter τ , corresponding to more penalization on parameters and hence, further biases as well. Moreover, the power loss in the debiased step is attributed in part to multicollinearity, which also increases with c 1 and c 2 , and when more confounders are involved.
As in Table 8, we report the empirical standard deviation of the 500 estimates and the average of 500 estimated standard errors over the 500 replications in Table 10 to examine the accuracy of variance estimation. For the new method and oracle, the standard errors estimated by Monte Carlo simulations are close to those calculated from formulas; while the empirical standard deviation and the average standard error of Zhou, Wang, and Zhao (2020) have a large gap.

Conclusion
Early life trauma plays a critical role in developing psychiatric disorders, typically via altering certain neuroendocrine substances like cortisol. Various researches thus, have been conducted to understand the mechanism relating cortisol change to different circumstances of early life trauma. Along with such prolific research output, scientists gradually realized the bridging role of DNA methylation toward the relation between childhood trauma and cortisol stress reactivity. On genome-wide level, Houtepen et al. (2016) conducted a study to investigate how DNA methylation affects cortisol stress reactivity and its relationship with childhood trauma. They identified three top-rated DNA methylation loci by ranking the p-values in an increasing order, one of which, on the KITLG gene (cg27512205), was shown not only to associate with cortisol change, but also partly mediate the relationship between childhood trauma and cortisol stress reactivity. Another study by van Kesteren and Oberski (2019), however, yielded a completely different set of loci, based on the same dataset while using their proposed CMF algorithm.
Motivated by such contradictory results in Houtepen et al. (2016) and van Kesteren and Oberski (2019), in which the authors did not consider the potentially active mediating loci jointly, we propose a partial penalized least squared method for linear mediation models with high-dimensional mediators in the presence of confounders. We further develop relevant tests for the direct and indirect effects in such high-dimensional linear mediation models. Simulation studies validate the capability of the proposed approach for efficiently estimating and testing the direct and indirect effects, and the numerical comparisons also imply that the proposed procedure outperforms the debiased method advocated by Zhou, Wang, and Zhao (2020).
We use this partial penalized least squares method and testing procedures to investigate the high dimensional mediating effects of DNA methylation loci to relate childhood trauma and cortisol stress reactivity, with confounding variables involved. For comparison purpose, we analyze the same dataset as Houtepen et al. (2016) and van Kesteren and Oberski (2019). We choose to include the eight DNA methylation loci discovered by these two papers in the model as domain knowledge. Using the proposed approach, we identified three new loci, on the RAB5IF gene (cg19230917), the CPQ gene (cg06422529) and the AGPAT1 gene (cg03199124), respectively, that actively play the mediator role. We compare our findings with Houtepen et al. (2016) and van Kesteren and Oberski (2019) from statistical perspectives, where tests and related analyses are all in favor of the three newly identified loci. Furthermore, we estimate and test the direct and indirect effects for childhood trauma on cortisol change, and conclude that the early life trauma affects cortisol only indirectly through DNA methylation and the indirect effect is negative, while the direct effect is insignificant.
From domain knowledge in existing literature, we also provide biological and genetical interpretations for the three selected loci and their belonging genes. The RAB5IF gene takes charge of cell endocytosis process, by which cells engulf substances like cortisol, thus, reasonably serves as a mediator which transmits the hormone change brought by the traumatic stress. As to the CPQ gene, previous research has verified it as a pathway from stress to cortisol change in fish. Thus, incorporating our findings, an neurobiological exploration toward its role in human is worthwhile. The AGPAT1 gene, on the other hand, was shown to control fat tissue biosyn-thesis; while some retrospective studies demonstrated that fat biosynthesis and storage caused by trauma stress is accompanied with abnormal hormonal level such as cortisol. Therefore, our findings may offer potential clues for such pathophysiological mechanism research. In short, statistical tests and scientific interpretations both show convincing evidence that the newly identified three DNA methylation loci, or their located genes, should be considered as active mediators that relate childhood trauma and cortisol stress reactivity.

Supplementary Materials
The supplementary materials consist of detailed explanations of confounders and relationship among the mediators for the empirical analysis in Section 3, and additional numerical comparison with the global test (Djordjilović et al. 2019).