A General Framework for Identifying Hierarchical Interactions and Its Application to Genomics Data

Abstract The analysis of hierarchical interactions has long been a challenging problem due to the large number of candidate main effects and interaction effects, and the need for accommodating the “main effects, interactions” hierarchy. The two-stage analysis methods enjoy simplicity and low computational cost, but contradict the fact that the outcome of interest is attributable to the joint effects of multiple main factors and their interactions. The existing joint analysis methods can accurately describe the underlying data generating process, but suffer from prohibitively high computational cost. And it is not straightforward to extend their optimization algorithms to general loss functions. To address this need, we develop a new computational method that is much faster than the existing joint analysis methods and rivals the runtimes of two-stage analysis. The proposed method, , adopts the framework of the forward and backward stagewise algorithm and enjoys computational efficiency and broad applicability. To accommodate hierarchy without imposing additional constraints, it has newly developed forward and backward steps. It naturally accommodates the strong and weak hierarchy, and makes optimization much simpler and faster than in the existing studies. Optimality of sequences is investigated theoretically. Simulations show that it outperforms the existing methods. The analysis of TCGA data on melanoma demonstrates its competitive practical performance. Supplementary materials for this article are available online.


Introduction
There are numerous situations in which interactions have been shown to play an important role beyond main effects. In human disease genetics, it has been suggested that interactions, in particular gene-environment interactions (Hunter 2005) and genegene interactions (Cordell 2009), have important implications for disease biology, new ways of classifying/defining diseases, and more informed clinical decision-making (Claussnitzer et al. 2020).
The existing studies on hierarchical interactions, although quite successful, still have limitations. A family of methods conducts two-stage analysis and identifies main effects in the first stage and corresponding interactions under the hierarchy constraint in the second stage (Kong et al. 2017;Hao et al. 2018;Wang et al. 2021). Among the existing methods, RAMP (Hao et al. 2018) is perhaps the most representative. It performs forward selection with main effects and allows interactions into the model once main effects have been selected. Although enjoying simplicity and low computational cost, such methods contradict the fact that the outcome of interest is attributable to the joint effects of multiple main effects and their interactions.
Accordingly, many of the recent studies have focused on joint analysis that accommodates the joint effects of all main effects and their interactions in a single model and can more accurately describe the underlying data-generating processes  (Wu et al. 2018). In joint analysis, the number of parameters is usually much larger than sample size. In addition, among all candidate main effects and interactions, we expect only a small subset to be associated with the outcome of interest. This poses a variable selection problem. Subsetting methods, such as the feasible solutions method (Miller 1984), can select interactions with satisfactory statistical properties when the signalnoise ratio is high (Mazumder et al. 2022). However, genomics data analysis usually faces low signals but high noises. For such data, regularization methods can lead to more stable selection (Bühlmann and van de Geer 2011). Multiple approaches have been developed based on the Lasso penalty with a hierarchy constraint. Bien, Taylor, and Tibshirani (2013) proposed hier-Net, an approach that applies specifically designed constraints to achieve sparsity and hierarchy. Lim and Hastie (2015) examined hierarchical interactions for categorical variables. Yan and Bien (2017) and She et al. (2018) extended group penalties for learning hierarchical interactions.
The sparsity-inducing penalty along with hierarchy constraints makes optimization in joint analysis extremely difficult and incurs prohibitively high computational cost, posing challenges to practical applications (Wu et al. 2018). To this end, some novel algorithms, such as FAMILY (Haris et al. 2016) and hierScale (Hazimeh and Mazumder 2020), have been developed under the convex regularization framework. Although they are efficient and hierScale can be easily extended to different loss functions, their extensions to weak hierarchy is not straightforward.
In this article, we propose a new method for identifying hierarchical interactions that can overcome the aforementioned shortcomings. Our method builds upon the framework of the forward and backward stagewise algorithm (Zhao et al. 2009;Shi et al. 2018), and embraces their simplicity, computational efficiency, and parallelizable strategy. We refer to it as hierarchical forward and backward stagewise (HierFabs). With the completely newly developed forward and backward steps to accommodate hierarchy, it is able to detect important main effects and interactions under both the strong and weak hierarchy. Advancing from the existing studies, HierFabs accommodates multiple types of interaction models with differentiable loss functions under a unified framework. Importantly, it is scalable to data with tens of thousands of interactions. We establish the approximate optimality of HierFabs solutions. As an application, we examine gene-environment (GE) interactions under the Cox model and gene-gene (GG) interactions under the linear regression model. We illustrate benefits of HierFabs through extensive simulations and an in-depth analysis of a skin cutaneous melanoma dataset with measurements on gene expressions and environmental variables. We develop the R package HierFabs with the computational core written in C. The code and manual are publicly available at https://github.com/ XiaoZhangryy/HierFabs.

Problem Formulation
As a motivation for our proposal, first consider GE interactions, and discussions on GG interactions are provided latter. Denote y as the outcome of interest, which can be binary, continuous, categorical, or censored survival. Denote X = (X 1 , . . . , X p ) as the p genetic measurements, and Z = (Z 1 , . . . , Z q ) as the q environmental/clinical risk factors. For simplicity, assume that each column of X and Z is standardized to have mean 0 and variance 1. The joint regression model is Our goal is to select important main and interaction effects and estimate the unknown parameters under the hierarchy constraints. Let L(θ) denote the chosen loss function such as the negative log-likelihood. The strong and weak hierarchy constraints are, strong: That is, if an interaction is included in the model, then both of the corresponding main effects must be present under the strong hierarchy, while at least one of the corresponding main effects must be present under the weak hierarchy. Such constraints have been widely adopted (Haris et al. 2016) to improve statistical power (Hao et al. 2018), facilitate model interpretation (Cordell 2009), and simplify experimental design (Bien, Taylor, and Tibshirani 2013). We consider the following optimization problem, where Here = s for the strong hierarchy and = w for the weak hierarchy. Note that if there is no constraint, minimizing Q(θ ) with respect to θ is the Lasso problem. The tuning parameter λ balances the goodness of fit and model complexity. As an alternative to the 1 penalty, other types of penalties can also be used in (2), such as elastic net (Zou and Hastie 2005), group Lasso (Yuan and Lin 2006), and other convex penalties. We adopt Lasso here because of its computational simplicity and satisfactory numerical performance. In addition, adopting Lasso facilitates comparison with the existing Lasso-based methods such as hierNet and RAMP. While the constraint θ ∈ enforces hierarchy, the conditions are not convex, which makes the optimization in (1) difficult (Bien, Taylor, and Tibshirani 2013). In this article, we propose the hierarchical forward and backward stagewise (HierFabs) algorithm to address this problem. For flexibility, the proposed approach does not make distributional assumptions on y and can accommodate loss functions other than likelihood. The only assumption needed by HierFabs is the differentiability of L(θ).

The HierFabs Algorithm
The HierFabs builds upon the forward and backward stagewise (Fabs) algorithm (Shi et al. 2018), which produces an approximate solution path to the Lasso problem with statistical guarantees. After initialization, the Fabs algorithm adopts a backward step which moves one coordinate toward zero if it can reduce the objective function L(θ). Otherwise, it takes a forward step which updates the coordinate with the largest derivative toward the minus gradient direction and updates the tuning parameter when necessary. This algorithm has a simply strategy, rendering it a great potential for accommodating hierarchy constraints.
Similar to the Fabs algorithm, HierFabs computes a decreasing sequence of λ values automatically, starting with the value λ 0 under which θ = 0 and ending with a prefixed value λ min . Following Friedman et al. (2010), we set λ min = 10 −4 λ 0 . To accommodate the hierarchy constraints, HierFabs has completely newly developed forward and backward steps.
Before introducing HierFabs, we first define some notations which are used to derive the forward and backward steps in HierFabs. Let P, Q be the index sets of X and Z, respectively. Denote M = P ∪ Q as the index set of all main effects. For set A ⊆ P and set B ⊆ Q, define a mapping I(A, B) that maps A×B to the index set of their interactions. For example, the collection of all interactions corresponding to Z 1 , that is, {X Z 1 : ∈ P}, has the index set I(P, {1}).

The Forward Step
Let θ t denote the solution generated at step t. If a forward step is taken, Fabs searches for the optimal coordinate that has the largest absolute gradient of L evaluated at θ t across all coordinates, and then updates the current solution at the selected coordinate with a fixed step size. For the hierarchical interaction problem in (1), a natural extension of Fabs' forward step for solutions that retain the hierarchical structures is to search the optimal coordinate from a candidate set which guarantees that the hierarchy constraints are satisfied. Let H t+1 denote such a forward candidate set at step t + 1, and we introduce its three types of elements in detail.
First, an update of the main effects will never break the hierarchical structures. So, M ⊂ H t+1 .
Second, H t+1 contains interactions whose one or both corresponding main effects are already in the active set by definition. Let C t+1 be the set of these candidate interactions. Define P t = {k : α t k = 0} and Q t = {l : β t l = 0} as the active sets at step t for X and Z, respectively. Since C t+1 is different under the strong and weak hierarchy, we use superscript "s" and "w" to distinguish them. Specifically, we have where P c t , Q c t are the complement of P t , Q t , respectively. Note that C s t+1 and C w t+1 search interactions based on the identified main effects in θ t .
Some true main effects may not be strong enough to be identified in the early steps, which obstructs identification of their interactions. For identifying important interactions that have not been included in C t+1 and maintaining the hierarchy structures, we define the candidate sets of main-interaction pairs: Intuitively, HierFabs tries to pick up a strong interaction by pairing a main effect for it, and a forward update will be conducted on that pair to avoid a violation of the hierarchy constraints. This is a significant advantage of our method over the existing ones: for the strong hierarchy, we will not omit an interaction who has one of its corresponding main effects pruned, and for the weak hierarchy, strong interactions with two weak main effects will not be omitted.
In summary, we have the forward candidate sets Similar to Fabs, hierFabs updates the optimal candidate which has the largest absolute gradient, and leaves the other coefficients unchanged. Note that an update of a maininteraction pair in G t+1 will affect two coefficients, while an update of M and C t+1 will affect only one. To compare gradients of the three types of elements fairly, we use the average gradient for each main-interaction pair. The optimal S * is where |S| is the size of S, |S| = 2 for S ∈ G t+1 , and |S| = 1 for M ∪ C t+1 . At last, an update of some active main effect may violate the hierarchy constraints if it satisfies the following two conditions: (i) The first derivative of this coordinate has the same sign as its coefficient, and the absolute value of its coefficient is equal to the step size. That is, this main effect will vanish after updating a step size ε; (ii) For the strong hierarchy, this main effect has an interaction in the active interaction set; and for the weak hierarchy, this main effect has an interaction whose other corresponding main effect is not in the active set. We collect main effects satisfying the two conditions in F s t and F w t under the strong and weak hierarchy, respectively. To avoid these violations, we adopt a new step size ε = 2ε if the forward update is implemented to the above main effects, otherwise, ε = ε. The choice of ε will be described in Section 2.5. The same as in Fabs, the forward step iŝ where 1 k denotes the length-d vector with the kth component being 1 and the rest being 0. Obviously, this strategy avoids shrinking one main effect of an interaction to zero and retains the hierarchical structures.

The Backward
Step A backward step in Fabs moves the estimate along one coordinate in the active set toward zero if this move can lead to a decrease of the objective function. It may break the hierarchy if a main term is removed while its corresponding interactions are still in the active set. In HierFabs, for = 1, . . . , |A t |, we calculate: and let d 1 ≤ · · · ≤ d |A t | . We search for the optimal backward coordinate as follows. If there is no interaction in the active set A t , then coordinate 1 is optimal. If there are interactions, a backward step can be taken on any active interaction, or on those active main effects that do not belong to F s t or F w t . We choose the smallest k ∈ {1, . . . , |A t |} that will not break the hierarchy, and let the optimal coordinate be k . Then, a backward move can be applied to the k th coefficient: The algorithm is presented in Algorithm 1.

Algorithm 1 Hierarchical forward and backward stagewise
Step 1 (Initialization) With a small step size ε(> 0), set t = 0, and take an initial forward step: Step 2 (Backward and Forward steps) 2.1 Determine the backward direction that minimizes the empirical loss: if backward on k violates the hierarchy, then k = k + 1; else, k = −sign(θ t k )1 k and break.

If
Otherwise, force a forward step, and relax λ if necessary: Here H t = M ∪ C s t ∪ G s t for the strong hierarchy and H t = M ∪ C w t ∪ G w t for the weak hierarchy.
Step 3 (Iteration) Update t = t + 1, and repeat Steps 2 and 3 until λ t ≤ λ min , where λ min is the lower bound of the tuning parameter.

GG Interactions
The joint regression model is We include the quadratic terms X 2 k in this article, but everything carries over if we restrict to γ kk = 0. We provide this as an option in the HierFabs package.
The HierFabs algorithm for GG interactions can be derived following that for GE interactions. The forward and backward steps are taken similarly, with modifications to the forward candidate set corresponding to GG interactions. Specifically, the elements of the forward candidate set H t+1 are as follows: the index set of main effects M = P, the set of these candidate interactions C s t+1 = I(P t , P t ), C w t+1 = I(P t , M), and the candidate sets of main-interaction pairs In HierFabs, calculating the gradient ∇L is the most computationally intensive operation. However, the optimal forward and backward directions are selected from F t and A t , respectively. Therefore, the number of gradients required is much smaller than the dimensionality of θ. Note that with the forward candidate set, the proposed algorithm guarantees that the solution satisfies the strong and weak hierarchy without imposing additional constraints. Furthermore, the update equation for θ is universal, and hence HierFabs can accommodate GG and GE interaction models with any differentiable loss functions. For the existing methods that are applicable to general loss functions, such as hierScale (Hazimeh and Mazumder 2020), it is not straightforward to extend to the weak hierarchy. The proposed approach has computational advantages and improved generality.

Toy Examples
To gain some insight into the candidate set of HierFabs, we look at two toy examples.
Example 1. Let p = 3 and q = 2. In Figure 1, we show the current active set in the left panel, and the candidate set under each hierarchy in the right panel.
For the strong hierarchy, we assume that the current active set includes X 1 , X 3 , Z 1 , and X 1 Z 1 , which are denoted by circles in Figure 1(a 1 ). The elements in the forward candidate set are shown in the rest of Figure 1: (a 2 ) M; (a 3 ) C s t+1 , including X 1 Z 1 and X 3 Z 1 ; (a 4 ) G s t+1 , framed by dashed lines, that is, (X 2 , X 2 Z 1 ), (Z 2 , X 1 Z 2 ), and (Z 2 , X 3 Z 2 ).
For the weak hierarchy, an additional interaction term, X 1 Z 2 , is assumed to be in the current active set (as shown in Figure 1(b 1 )). The elements in the forward candidate set are shown in the rest of Figure 1: Here we assume that X 1 ∈ F s t and F w t , which is denoted by circles with slanting lines. HierFabs will update the coefficient of X 1 with step size 2ε.
We set stepsize ε = 0.01. To simplify calculation, we demonstrate the iteration process with population data and gradient ∇L(θ ) = (θ − θ 0 ). See supplementary C.3 for further details and R code for a simulated dataset. At the initial step, ∇L(0)= − θ 0 =(−0.1009, −0.2539, 0.3343) T . Although X 1 X 2 has the largest absolute gradient, HierFabs updates X 2 instead due to the strong hierarchy. After seven forward steps, we getθ 7 = (0.01, 0.07, −0.01) T . Note that α 7 1 has a wrong sign at this time. As the iteration progresses, HierFabs improves estimation for α 2 and γ 12 . At the 36th iteration,θ 36 = (0.01, 0.12, −0.24) T , and the gradient is (0.0502, −0.0467, 0.0497) T . Now ∇ 1 L has the largest magnitude, and so X 1 is in F s t . A forward step with stepsize ε = 0.01 will pushα 1 to zero and break the strong hierarchy. In contrast, F s t adopted in HierFabs updatesα 1 from 0.01 to −0.01 directly, and the violation of the strong hierarchy can be avoided. Finally, the iteration stops at t = 60 with estimate (−0.11, 0.20, −0.30) T . The full sequence of solutions, the objective function Q and λ at each iteration in Example 2 are shown in Figure 2.

Theoretical Properties
Theorem 1. Assume that the second-order derivatives of L(·) are bounded from above, and that ε is small enough. If step t + 1 is a forward step and λ t+1 < λ t , then for all S ∈ H t , Proof is provided in supplementary A. Theorem 1 guarantees that the HierFabs solution is local optimal in the sense that any perturbation with step ε over a feasible direction that satisfies the hierarchy will increase the objective function up to a O(ε 2 ) term. We note that H t is a subset of all the feasible directions. As such, it is not clear whether the obtained solution is the global optimal or a local one for the nonconvex problem (1), even when ε → 0. However, the assumption on L is weak, and so our method can be applied to a large class of objective functions with the hierarchy constraints, such as the least squared loss and Cox partial likelihood function considered in the next Section. Some nonconvex losses, such as the smoothed rank-based loss (Song et al. 2007;Lin and Peng 2013), can also be solved.
From a statistical point of view, a smaller stepsize leads to a closer approximation to the minimization coefficients and also a finer grid for λ. From a computational point of view, the complexity of HierFabs is similar to that of Fabs, and roughly O( 1 ε ) steps are needed (Zhao and Yu 2007) to produce the whole sequence. More precise computational complexity under both the weak and strong hierarchy is presented in supplementary B. In general, a smaller stepsize demands more computational power. Furthermore, as shown in simulation, a smaller stepsize does not guarantee better prediction performance and sparsity, due to the smaller regularization effect (Zhao and Yu 2007). Following Fabs, we explore moderate stepsizes (ε = 0.02, 0.01, 0.005), which can achieve a balance between the two coupled tasks. They have similar performance in simulation. In the real data application, we use ε = 0.01.
In the implementation of HierFabs, choosing of a good initial value is important. We use zero as the initial value, as it satisfies the hierarchy constraints and is the global optimal solution for λ = ∞. For all the following steps,θ t+1 is updated by a backward or forward step fromθ t . With this warm start strategy, HierFabs provides a sequence of solutions for problem (1). The selection of the optimal t orθ t along the whole sequence will be discussed in the next section.

Simulation
Two models are considered, namely the Cox model for GE interactions (Cox-GE) and linear model for GG interactions (LM-GG). With each model, we vary the ratio of main effect signal level to interaction effect signal level, where the signal level is defined as the L 2 norm of the corresponding coefficients. Let SL m and SL i denote the signal levels of main and interaction effects, respectively. We fix SL m = 2 and consider ratio = SL i /SL m = 0.5, 1, 2. For a specific ratio, the nonzero elements in θ are generated from uniform (0.5, 2) (with random signs) and then normalized according to SL m and SL i .
We compare HierFabs (ε = 0.02, 0.01, 0.005) with relevant alternatives. Specifically, for Cox-GE, we consider the following alternatives: (a) marginal analysis (Margin). This approach conducts the Cox regression of one gene, all environmental factors, and their interactions at a time. The p-values of the main G effect and interactions are adjusted using the Bonferroni approach. This approach has been commonly adopted in published studies. (b) Fabs, which is the non-hierarchy counterpart of the proposed approach. It produces estimation for the Lasso penalized Cox regression and can be equivalently produced by glmnet. Comparing with this approach can reveal the value of respecting the hierarchical structures. For LM-GG, hierNet (Bien, Taylor, and Tibshirani 2013) and RAMP (Hao et al. 2018) are considered for their capability to respect both the weak and strong hierarchy. FAMILY (Haris et al. 2016) and hierScale (Hazimeh and Mazumder 2020) are two fast convex optimization approaches for detecting interactions under the strong hierarchy. As they can not respect the weak hierarchy, we only include their results under the strong hierarchy. In addition, the oracle estimator is considered for both models. The tuning parameters of all methods except FAMILY are selected by EBIC with γ = 1 for LM-GG (Chen and Chen 2008) and γ = 1 − log(n) 2 log(d) for Cox-GE (Luo et al. 2015), where d is the length of θ . The tuning parameter of FAMILY is selected by cross-validation as suggested by Haris et al. (2016).
In identification evaluation, measures include the Matthews correlation coefficient (MCC), true positive rate (TPR), and model size (MS) for θ , main effects, and interactions, respectively. MCC measures the overall accuracy of identification for true positives, false negatives, true negatives, and false positives, with a larger value indicating overall better identification (Baldi et al. 2000;Chicco and Jurman 2020). The definition of MCC is provided in supplementary C. Estimation performance is assessed using estimation error (ERR) defined as θ −θ 0 2 θ 0 2 , wherê θ and θ 0 are the estimated and true values of θ . For evaluating prediction performance, independent testing data is generated. We adopt R squares (R 2 ) calculated on the independent testing data. For censored survival data, generalized R 2 (Cox and Snell

Cox-GE
with (n, p, q) = (400, 500, 10). X and Z have multivariate normal distributions with zero mean and AR covariance matrix ρ |i−j| , where ρ = 0.3, 0.7. The error terms satisfy an exponential distribution with parameter 1, corresponding to the Cox model. Six main effects, two from environmental risk factors Z and four from genes X, are associated with the response. And six important interactions are selected to respect the strong and weak hierarchy, respectively. The censoring rate is about 20%.
Results based on 500 replicates for TPR, MCC, MS for the main effects (α, β), interactions (γ ), and all the effects combined (θ ) are provided in Supplementary Table S3-S8. Results for estimation error ERR and generalized R 2 are also included. It can be seen that a smaller ε usually leads to a comparable or slightly larger R 2 , but overall, HierFabs has quite similar performance under different ε values. For a clear presentation, we also summarize MCC for θ , ERR, and R 2 in Figure 3. And the results of HierFabs with ε = 0.01 are shown. Across all simulation scenarios with different ratio and ρ, HierFabs is observed to have larger MCC compared to the Margin approach and Fabs. Specifically, it can more accurately identify true interactions compared with Fabs under both the strong and weak hierarchy. For true gene effects, HierFabs is slightly inferior to Fabs. And both methods can significantly outperform Margin. For example, in supplementary Table S6 under the weak hierarchy and (ratio, ρ) = (0.5, 0.7), the proposed approach with ε = 0.01 has (MCC, MCC x , MCC i )=(0.82,  Here subscripts x and i indicate main G and interaction effects, respectively. For estimation and prediction, we compare our method with Fabs and Oracle, as the Margin approach does not generate estimation and prediction. As it can be seen in Figure 3, HierFabs with ε = 0.01 has a smaller ERR and larger R 2 than Fabs. As expected, both methods have inferior performance compared to Oracle.

LM-GG
Consider the model: with (n, p) = (200, 300). The design matrix X is generated similarly as for Cox-GE. The random error has a standard normal distribution. Five main effects and five interactions are associated with the response. γ kk , k = 1, . . . , p is set as 0 to make sure that the model is estimable with hierScale, which does not accommodate square terms.
Results based on 500 replicates are provided in supplementary Table S9-S14. Results on ERR and prediction R 2 are also included. We also summarize MCC for θ , ERR, and R 2 in Figure 4. And the results of HierFabs with ε = 0.01 are shown.
When the main effects have a larger signal level than the interaction effects (ratio = 0.5), performance of all methods except for FAMILY are comparable, with similar MCC, ERR, and R 2 values. As the signal level of interaction effects gets comparable to (ratio = 1) that of main effects, HierFabs out- performs the other methods. The advantage of the proposed method gets more prominent when interaction effects have a larger signal level than main effects (ratio = 2). As discussed in Section 2, when important interaction effects have small corresponding main effects, HierFabs can pick up these interactions in the forward step by pairing one (weak hierarchy) or both (strong hierarchy) of their corresponding main effects. In contrast, the other methods may have difficulty identifying them. For example, in supplementary  Table S27 in supplementary C.2. It can be seen that HierFabs has comparable CPU time to RAMP and hierScale, and is much faster than FAMILY and hierNet. The computation time of HierFabs is inversely proportional to its stepsize. When ε = 0.02, HierFabs is even faster than the two-stage method RAMP under both the strong hierarchy and weak hierarchy. Similar superiority of the proposed method is observed with ρ = 0.7. Overall, the proposed method is observed to be highly computational efficient.

Real Data Application
We demonstrate the practical application of HierFabs to hierarchical interaction identification with genomics data using the skin cutaneous melanoma (SKCM) data downloaded from The Cancer Genome Atlas (https://tcga-data.nci.nih.gov). SKCM is one of most aggressive types of cancer and its incidence, mortality, and disease burden have been increasing (Xiong et al. 2019). The data contains disease outcomes, environmental factors, and high-dimensional gene expressions. The goal of analysis is to identify interactions that are associated with the prognosis of SKCM. First, we consider the identification of GE interactions associated with a right censored outcome. Additionally, we consider the identification of GG interactions associated with a continuous outcome.

GE Interaction Analysis under the Weak Hierarchy
The outcome of interest is overall survival. After removing samples with missing survival time and genes with minimal expression variations, we obtain 17,944 gene expressions on 253 patients. For gene expression measurements, the top 2000 are screened out for downstream analysis (Jiang et al. 2016). Each gene expression is standardized to have mean zero and standard deviation one. We consider 14 environmental factors: gender, age at diagnosis, tumor status, Breslow thickness at diagnosis, Clark level at diagnosis, primary melanoma tumor ulceration, AJCC tumor pathologic stage, AJCC nodes pathologic stage, new tumor event indicator, percent of lymphocyte in ltration, percent of monocyte in ltration, percent of necrosis, percent of stromal cells, percent of tumor cells, and percent tumor nuclei.
The estimation results are provided in Table 2. To evaluate the selection stability, we conduct 100 random splits. In each split, we randomly select 75% as the training set and the rest as the testing set. The frequency of identifications based on the training set are summarized. HierFabs identifies 3 environmental main effects, 1 gene main effect, and 19 interaction effects. Among all the environmental factors, tumor status has the largest main effect (0.9219) and the largest identification frequency. It is the most important prognostic factor in evaluating primary SKCM tumors and recurrence (Adler et al. 2018). Breslow's thickness and age at diagnosis have positive effects, which confirm the previous findings (Dickson and Gershenwald 2011;Enninga et al. 2017). Gene CREG1 has a main effect and an interaction effect with new tumor event indicator. Previous studies have found that up-regulated CREG1 in ABCB5enriched cutaneous melanoma circulating tumor cell fractions was involved in melanoma metastasis (Aya-Bonilla et al. 2019). Most of the identified interactions are between genes and tumor status. It is difficult to conclude whether the identified interactions are meaningful. However, there are more results on the functionalities of genes, which may provide some insights into the identified effects. The detailed discussions on these genes are provided in supplementary D.1.
To evaluate prediction performance, C-statistic is calculated using R package survAUC (Uno et al. 2011). The same to the AUC, it ranges from 0.5 to 1 and a larger value means better prediction. Based on the aforementioned random splits, we apply the proposed method to the training set, obtain estimated coefficients, and predict event risk in the testing set. The average C-statistic is 0.758, 0.732 for HierFabs and Fabs, respectively. Comparing these AUCs using the paired samples Wilcoxon test suggests that the differences are significant with p-values 3.948 × 10 −5 .

GG Interaction Analysis under the Weak Hierarchy
Breslow's thickness is a continuous variable that has been suggested as a clinicopathologic feature of cutaneous melanoma (Dickson and Gershenwald 2011). In Section 4.1, we find that it has an important environmental main effect on overall survival of SKCM samples. Here we treat (log-transformed) Breslow's thickness as the outcome of interest. Data is available on 361 samples. We conduct a similar prescreening by the p-value of a marginal linear model, and the top 2000 genes are selected for downstream analysis.
To identify GG interactions under the weak hierarchy, we need to fit a high dimensional linear model with 2,003,000 covariates. We report the estimates of HierFabs in Table 3. The identification frequency is also examined based on the same random splitting strategy adopted in Section 4.1. The average numbers of identified main and interaction effects are (8.2, 11.1), (55.9, 55.9), and (2.1, 241.2) for HierFabs, hierNet, and RAMP, respectively. The average numbers of overlaps among the identified effects across the three methods are (0.96, 3.51). We refer to supplementary D.2 for more details on the biological implications of genes identified by our method.
We compare prediction performance of HierFabs with hier-Net and RAMP based on the 100 replicates. The average MSE and number of identified effects are (0.67, 19.28), (1.22, 111.74) and (3.86, 243.26) for HierFabs, hierNet, and RAMP, respectively. The proposed method has the best prediction accuracy with the smallest model size, which suggest its competitive performance.

Conclusion
The joint analysis of interaction under hierarchy is computationally challenging because of the high data dimensionality and presence of hierarchy constraints. Although there are a few recent methods, optimization usually incurs prohibitively high computational cost, posing challenges to high-dimensional genomics applications of these methods. In this study, we have developed a new general method HierFabs. The proposed method takes advantage of the forward and backward stagewise algorithm to handle high dimensionality efficiently. To ensure hierarchy, completely newly forward and backward steps have been developed. Without additional hierarchy constraints (which are common with the existing alternatives), the proposed method has a significant computational advantage and is adaptable to both the strong and weak hierarchical structures. Advancing from the existing studies, the proposed method is relatively independent of the data and model forms. It has the only requirement that the loss function is differentiable. The optimality of the HierFabs sequences has been investigated theoretically and numerically. We have also demonstrated its application to genomics data: GE interaction analysis under the Cox model and GG interaction analysis under the linear model. Extensive numerical studies have shown that our method has superior identification and prediction performance while maintaining its computational advantage.
The proposed method can be extended in multiple directions. For practical applications, it may be of interest to extend our general framework to the identification of high-order interactions. Additionally, HierFabs analyzes linear forms of interaction effects. It can be extended to achieve group-sparsity with hierarchical structures and has the potential of further identifying important interaction effects with functional forms, which can provide better insights into predictor-outcome relationships in genomics data. Lastly, we have not been able to fully establish statistical properties of HierFabs and postpone such research to the future.

Supplementary Materials
Supplementary: A PDF file that contains proofs, supplemental figures and tables, and additional results of real data analysis referenced within the main text. (pdf) R-package for HierFabs implementation: The R-package HierFabs implementing our algorithm is available at https://github.com/XiaoZhan gryy/HierFabs. The code to reproduce all analysis presented in the article is also included. (GNU zipped tar file)