Robust Signal Recovery for High-Dimensional Linear Log-Contrast Models with Compositional Covariates

Abstract In this article, we propose a robust signal recovery method for high-dimensional linear log-contrast models in which the error distribution may be heavy-tailed and asymmetric. The proposed method is built on the Huber loss with $\ell_1$ penalization. We establish the $\ell_1$ and $\ell_2$ consistency of the resulting estimator. Under conditions analogous to the irrepresentability condition and the minimum signal strength condition, we prove that the signed support of the slope parameter vector can be recovered with high probability. The finite-sample behavior of the proposed method is evaluated through simulation studies, and applications to a GDP satisfaction dataset and an HIV microbiome dataset are provided.


Introduction
Compositional data, which represent the proportions or fractions of a composition, are commonly encountered in economics. For instance, in consumer demand research, the budget shares of commodities are the elements of a composition and sum to unity. Other examples include aggregate output, stockholders' portfolio compositions, etc. The unit-sum structure of compositional data renders standard statistical methods for unconstrained data inappropriate or inapplicable (Aitchison 2003). To describe the relationship between the response and the compositional covariates, Aitchison and Bacon-Shone (1982) proposed a linear log-contrast model linking the response to the log of the compositional covariates. Since the seminal work of Aitchison and Bacon-Shone (1982), a great deal of work has been done on the analysis of compositional data in low-dimensional settings; see Aitchison (2003) and the references therein. With developments in information technology and statistical science, we often encounter large compositional datasets in which the dimensionality of the features can be of the same order as, or substantially higher than, the number of observations. The aim of this work is to conduct signal recovery for high-dimensional linear log-contrast models, focusing on support recovery and consistent estimation of the slope parameter vector. A great deal of work on signal recovery has been done for the analysis of high-dimensional unconstrained data. Support recovery methods were studied by Meinshausen and Bühlmann (2006), Zhao and Yu (2006), and Wainwright (2009, 2010), among others. In addition, novel findings on consistent estimation of regression parameters in high-dimensional models were reported in Zhang (2010), Zhang and Zhang (2012), and Plan and Vershynin (2016), among others.
Recently, there have been significant developments in signal recovery methods for high-dimensional compositional data (Lin et al. 2014; Cao, Lin, and Li 2019; Cao, Zhang, and Li 2020). For example, Lin et al. (2014) conducted signal recovery for a linear log-contrast model based on $\ell_1$-regularized least squares. Cao, Lin, and Li (2019) provided a covariance estimation method for compositional data via composition-adjusted thresholding. Cao, Zhang, and Li (2020) took a multi-sample approach to the estimation of bacterial abundances.
Heavy-tailed data arise frequently in many scientific disciplines, such as microarray experiments (Purdom and Holmes 2005), neuroimaging (Eklund, Nichols, and Knutsson 2016) and finance (Cont 2001). Ignoring this characteristic of the data can lead to improper results (Fan, Wang, and Zhu 2016; Fan, Li, and Wang 2017). The Huber loss (Huber 1964) is a well-known robust criterion for parameter estimation. Under fixed or low-dimensional settings, theoretical properties of Huber estimators have been thoroughly studied (Huber 1973; Yohai and Maronna 1979; Portnoy 1985; He and Shao 1996, 2000; Zhou et al. 2018).
Robust methods for high-dimensional linear models based on the Huber loss were presented in Fan, Li, and Wang (2017), Loh (2018), Fan, Guo, and Jiang (2019), Chen and Zhou (2020), Sun, Zhou, and Fan (2020), and Wang et al. (2021), among others. For asymmetric error distributions, Fan, Li, and Wang (2017) established the $\ell_2$ consistency of Huber-type estimators. In the presence of symmetric errors, Loh (2018) and Fan, Guo, and Jiang (2019) developed statistical methods for choosing the robustification parameter in the Huber loss adaptively. Under asymmetric errors, data-driven robustification parameter selection procedures were reported in Wang et al. (2021) and Sun, Zhou, and Fan (2020). Chen and Zhou (2020) studied the construction of confidence sets and large-scale simultaneous hypothesis testing via the multiplier bootstrap method. However, to the best of our knowledge, there is no robust signal recovery method tailored to high-dimensional compositional data.
In this article, we propose a robust signal recovery method based on the Huber loss for a high-dimensional linear log-contrast model, allowing for a heavy-tailed and asymmetric error distribution. The $\ell_1$ and $\ell_2$ consistency of the resulting estimator is established. The signed support of the slope parameter vector can be recovered under conditions similar to the irrepresentability condition and the minimum signal strength condition (Zhao and Yu 2006; Wainwright 2009).
The remainder of the article is organized as follows. In Section 2, we introduce the linear log-contrast model and our signal recovery method. The associated optimization algorithm and theoretical results are reported in Section 3. In Section 4, we conduct simulation studies to evaluate the finite-sample properties of the proposed method. Applications to a GDP satisfaction dataset and an HIV microbiome dataset are presented in Section 5. A few closing remarks are provided in Section 6. All proofs are deferred to the supplementary materials.

Model
Let $Y$ be a response variable and $X$ be a $p$-vector of covariates. The nature of a composition implies that $X$ lies in the $(p-1)$-dimensional positive simplex $\mathcal{S}^{p-1} = \{(x_1, \ldots, x_p) : x_j > 0,\ j = 1, \ldots, p,\ \text{and}\ \sum_{j=1}^{p} x_j = 1\}$. Due to the unit-sum constraint, the components of a composition cannot vary freely, contrary to the unconstrained case. To ensure identifiability, traditional methods require the omission of certain components, which may result in difficulties in interpreting the slope parameters. To tackle this problem, Aitchison and Bacon-Shone (1982) introduced the linear log-contrast model
$$Y = \mu^* + Z_p^\top \beta^*_{\backslash p} + \varepsilon, \qquad (1)$$
where $Z_p = (\log(x_1/x_p), \ldots, \log(x_{p-1}/x_p))^\top$ is a $(p-1)$-dimensional log-ratio vector with the $p$th component as the reference component, $\mu^*$ is an intercept, $\beta^*_{\backslash p} = (\beta^*_1, \ldots, \beta^*_{p-1})^\top$ is a vector of slope parameters, and $\varepsilon$ is a zero-mean error term. By letting $\beta^*_p = -\sum_{j=1}^{p-1} \beta^*_j$, model (1) can be recast into the symmetric form
$$Y = \mu^* + Z^\top \beta^* + \varepsilon \quad \text{subject to} \quad \sum_{j=1}^{p} \beta^*_j = 0, \qquad (2)$$
where $Z = (\log x_1, \ldots, \log x_p)^\top$. Models (1) and (2) have two advantages: (a) the log-ratio transformation in (1) allows the components of $Z_p$ to vary freely; (b) the estimator of $\beta^*$ is invariant to the choice of the reference component. Thus, the difficulties in interpreting the slope parameters can be avoided. The observed data $(X_i, Y_i)$ $(i = 1, \ldots, n)$ are independent and identically distributed copies of $(X, Y)$. Throughout the article, we focus on the situation where $p$ is of the same order as, or substantially larger than, $n$. Our objectives are to recover the support of $\beta^*$ and to obtain a consistent estimator of $\beta^*$ under model (2).
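The equivalence of forms (1) and (2) rests entirely on the zero-sum constraint and can be checked numerically. A minimal sketch in Python on simulated data (variable names and the Dirichlet design are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 5
# Compositional covariates: each row of X lies in the simplex S^{p-1}.
X = rng.dirichlet(np.ones(p), size=n)

# Slope vector satisfying the zero-sum constraint of model (2).
beta = np.array([1.0, -0.5, 0.8, -1.3, 0.0])
assert abs(beta.sum()) < 1e-12

mu = 2.0
Z = np.log(X)                       # symmetric form (2): Y = mu + Z beta + eps
Y2 = mu + Z @ beta

# Reference-component form (1): log-ratios w.r.t. the p-th component.
Zp = np.log(X[:, :-1] / X[:, [-1]])
Y1 = mu + Zp @ beta[:-1]

# The two parameterizations agree because sum(beta) = 0.
print(np.allclose(Y1, Y2))  # True

# Scale invariance: rescaling each composition by an arbitrary positive
# constant t_i leaves Z @ beta unchanged, again because sum(beta) = 0.
t = rng.uniform(0.5, 2.0, size=n)
print(np.allclose(Z @ beta, np.log(X * t[:, None]) @ beta))  # True
```

The last check also previews the scale invariance property discussed below.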
Remark 1. Since our method is not based on least squares, the intercept term in model (2) cannot simply be eliminated by centering the response and the predictors. Moreover, in contrast to Lin et al. (2014), we allow the error distribution to be asymmetric and heavy-tailed.

Estimation Method
To account for different magnitudes of the errors, the Huber loss is considered for robust estimation:
$$\ell_\tau(u) = \begin{cases} u^2/2, & |u| \le \tau, \\ \tau|u| - \tau^2/2, & |u| > \tau, \end{cases}$$
where $\tau > 0$ is a robustification parameter. Given any $\tau > 0$, define $\theta^*_\tau = \arg\min_{\theta} E\,\ell_\tau(Y - W^\top \theta)$, where $W = (1, Z^\top)^\top$ and $\theta = (\mu, \beta^\top)^\top$. When the error distribution is not symmetric, the Huber regression may yield a biased estimator of $\theta$; however, Proposition 1 in Section 3 indicates that, under mild conditions, the bias occurs only in the estimation of the intercept, not in the slope parameters. As a result, it is possible to conduct signal recovery for $\beta^*$ via the Huber loss. For model (2), based on the Huber loss with $\ell_1$ penalization, we define the proposed estimator as the solution to the following constrained convex optimization problem:
$$\hat\theta_\tau = \arg\min_{\theta = (\mu, \beta^\top)^\top} \frac{1}{n} \sum_{i=1}^{n} \ell_\tau(Y_i - W_i^\top \theta) + \lambda_n \|\beta\|_1 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j = 0, \qquad (4)$$
where $\lambda_n$ is a tuning parameter and $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$. Define $X^* = (X_1, \ldots, X_n)^\top$. The resulting estimator in (4) possesses the following properties (Lin et al. 2014): (i) Scale invariance: the estimator is unchanged under the transformation $X^* \to TX^*$ for an arbitrary diagonal matrix $T = \operatorname{diag}(t_1, \ldots, t_n)$ with all $t_i > 0$. (ii) Permutation invariance: the estimator is invariant under any permutation $\pi$ of the $p$ components, meaning that it is unchanged if $\pi$ is applied to both the columns of $X^*$ and the components of $\hat\beta_\tau$.
(iii) Selection invariance: the estimator is unchanged if one knew in advance which components would be estimated as zero and applied the procedure to the subcomposition formed by the remaining components.
These properties make the proposed estimator more appealing than the standard Lasso estimator in terms of interpretability.
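The Huber loss and the unconstrained part of objective (4) can be sketched directly; this is a simplified illustration in which the zero-sum constraint is left to the algorithm of Section 3:

```python
import numpy as np

def huber_loss(u, tau):
    """Elementwise Huber loss: quadratic for |u| <= tau, linear beyond."""
    u = np.asarray(u, dtype=float)
    a = np.abs(u)
    return np.where(a <= tau, 0.5 * u**2, tau * a - 0.5 * tau**2)

def penalized_objective(theta, W, Y, tau, lam):
    """Objective of (4): average Huber loss plus an l1 penalty on the
    slopes.  theta = (mu, beta); the zero-sum constraint on beta is
    handled separately (e.g., via the augmented Lagrangian)."""
    resid = Y - W @ theta
    return huber_loss(resid, tau).mean() + lam * np.abs(theta[1:]).sum()

# Quadratic regime vs. linear regime of the loss.
print(float(huber_loss(0.5, 1.0)))  # 0.125
print(float(huber_loss(2.0, 1.0)))  # 1.5
```

The intercept is deliberately excluded from the penalty, matching the definition of (4), which penalizes $\|\beta\|_1$ only.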

Computation Algorithm
The iterative local adaptive majorize-minimization (I-LAMM) algorithm is a first-order algorithm for efficiently solving Lasso-type problems in high-dimensional settings. Nevertheless, the term $\|\beta\|_1$ in (4) is not separable under the zero-sum constraint, so the I-LAMM algorithm is not directly applicable to (4). In this section, we introduce a fast and easy-to-implement algorithm, called the iterative augmented Lagrangian and local adaptive majorize-minimization (I-ALLAMM) algorithm, which combines the I-LAMM algorithm with the augmented Lagrangian (AL) method (Bertsekas 2014). Consider the augmented Lagrangian form of problem (4):
$$L_\nu(\theta, \gamma) = \frac{1}{n}\sum_{i=1}^{n} \ell_\tau(Y_i - W_i^\top\theta) + \lambda_n\|\beta\|_1 + \gamma \sum_{j=1}^{p}\beta_j + \frac{\nu}{2}\Big(\sum_{j=1}^{p}\beta_j\Big)^2,$$
where $\gamma$ is a Lagrange multiplier and $\nu > 0$ is a regularization parameter. Let $\hat\theta^{(k)}_\tau$ and $\hat\gamma^{(k)}$ denote the estimates of $\theta^*_\tau$ and $\gamma$ at the $k$th iteration, respectively. Following Bertsekas (2014), the AL algorithm for solving (4) is presented as Algorithm 1.
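The augmented-Lagrangian iteration can be sketched as follows. This is a simplified illustration rather than the paper's I-ALLAMM: it replaces the local adaptive majorize-minimization updates with a fixed step size from a crude Lipschitz bound, and all tuning constants are illustrative assumptions:

```python
import numpy as np

def psi(u, tau):
    """Huber score function (derivative of the Huber loss)."""
    return np.clip(u, -tau, tau)

def soft(x, t):
    """Soft-thresholding operator for the l1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def i_allamm_sketch(W, Y, tau, lam, nu=1.0, n_outer=50, n_inner=200):
    """Augmented-Lagrangian outer loop enforcing the zero-sum constraint
    on the slopes, with proximal-gradient inner steps standing in for
    the majorize-minimize updates."""
    n, d = W.shape                        # d = p + 1; first column = intercept
    # conservative step size from a Lipschitz bound on the smooth part
    L = np.linalg.eigvalsh(W.T @ W / n)[-1] + nu * (d - 1)
    step = 1.0 / L
    theta = np.zeros(d)
    gamma = 0.0
    for _ in range(n_outer):
        for _ in range(n_inner):
            resid = Y - W @ theta
            c = theta[1:].sum()           # current constraint violation
            grad = -W.T @ psi(resid, tau) / n
            grad[1:] += gamma + nu * c    # augmented-Lagrangian terms
            theta = theta - step * grad
            theta[1:] = soft(theta[1:], step * lam)  # penalize slopes only
        gamma += nu * theta[1:].sum()     # multiplier ascent
    return theta

# Tiny demo on simulated compositional data.
rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.dirichlet(np.ones(p), size=n)
beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0, 0.0])
Y = 1.0 + np.log(X) @ beta_true + 0.1 * rng.standard_normal(n)
W = np.column_stack([np.ones(n), np.log(X)])

theta_hat = i_allamm_sketch(W, Y, tau=5.0, lam=0.01)
print("constraint violation:", abs(theta_hat[1:].sum()))
```

The multiplier update drives the violation of the zero-sum constraint toward zero across outer iterations, which is the role played by Algorithm 1 in the paper.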
Remark 2. The I-LAMM algorithm adapted to our specific problem includes a thresholding step for $\beta^{\text{new}}$, which helps maintain the satisfactory performance of $\ell_1$ penalization and differs slightly from the algorithm in Fan et al. (2018).
Combining Algorithms 1 and 2, our I-ALLAMM algorithm for solving (4) can be summarized in Algorithm 3.

Theoretical Analysis
Some additional notation is needed. Throughout the paper, $\|\cdot\|_\infty$ denotes the maximal absolute value of the components of a matrix and $\|\cdot\|_{\infty,\infty}$ denotes the maximum absolute row sum of a matrix. Define $\bar{Z} = (I_p - 1_p 1_p^\top/p) Z$, so that $\bar{Z}$ is the projection of $Z$ onto the orthogonal complement of the space spanned by $1_p$; let $\bar\Sigma$ denote the covariance matrix of $\bar Z$ and $\bar\Sigma_k$ the corresponding matrix for $\bar Z_{\backslash k}$. Some conditions on $\bar\Sigma_k$, similar to the irrepresentability condition (Wainwright 2009; Lin et al. 2014), are needed for the theoretical analysis. Let $\bar\Sigma = U \Lambda U^\top$ be the eigendecomposition of $\bar\Sigma$, let $\sigma_{\max}(\bar\Sigma)$ ($\sigma_{\min}(\bar\Sigma)$) be the largest (smallest) nonzero eigenvalue of $\bar\Sigma$, and define $\bar\Sigma^- = U \Lambda^{-1} U^\top$. The matrix $\bar\Sigma^-$ acts as the inverse of $\bar\Sigma$. For any random variable $\xi$, define $\|\xi\|_{\psi_2} = \sup_{v \ge 1} (E|\xi|^v)^{1/v}/v^{1/2}$. For a vector, the active (inactive) set indexes its nonzero (zero) components. Let $S_0$, $\bar S_0$, $S^k_0$ and $\bar S^k_0$ be the active sets of $\beta^*$, $(1, (\beta^*)^\top)^\top$, $\beta^*_{\backslash k}$ and $(1, (\beta^*_{\backslash k})^\top)^\top$, respectively. In addition, let $S^c_0 = \{j : \beta^*_j = 0\}$ and $S^{kc}_0 = \{1, \ldots, p\} \setminus S^k_0$ be the inactive sets of $\beta^*$ and $(1, (\beta^*_{\backslash k})^\top)^\top$, respectively. Define $s = \|\beta^*\|_0 + 1$. We will use subsets to index a vector or matrix; for example, $Z_{i,S_0}$ is the subvector formed by the $j$th entries of $Z_i$ with $j \in S_0$. In order to study the properties of the proposed estimator, we need the following conditions: (C1) The error $\varepsilon$ is an absolutely continuous random variable independent of $X$. There exists a positive constant $N_1$ such that $E|\varepsilon| < N_1$. In addition, there exists a positive constant $L$ such that $\sup$ … Here $N_5$, $N_6$ and $N_7$ are three positive constants satisfying $N_5 \ge 4N_1$. (C7) There exists a constant $\kappa \in (0, 1)$ such that … Condition (C1) guarantees that the estimation bias of the Huber regression occurs only at the intercept. Many commonly used error distributions (e.g., the normal distribution; Student's $t$ distribution with more than 1 degree of freedom) satisfy Condition (C1). Condition (C2) is a standard assumption in the context of high-dimensional analysis (van de Geer et al. 2014; Janková and van de Geer 2016).
Conditions (C3) and (C4) are regular assumptions for high-dimensional models with compositional covariates (Shi, Zhang, and Li 2016; Lu, Shi, and Li 2019). Condition (C5) indicates that $p$ can grow at an exponential rate of $n$. Condition (C6) is a technical condition, and Condition (C7) is a regular condition for support recovery, similar to the conditions in Wainwright (2009) and Lin et al. (2014).
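The centering projection used to define $\bar Z$ can be illustrated directly; the projection matrix form $I_p - 1_p 1_p^\top/p$ is inferred from the description in the text:

```python
import numpy as np

p = 6
# Projection onto the orthogonal complement of span{1_p}.
P = np.eye(p) - np.ones((p, p)) / p

z = np.arange(1.0, p + 1.0)
z_bar = P @ z                             # the centered vector \bar{Z}

print(abs(z_bar.sum()) < 1e-12)           # True: orthogonal to 1_p
print(np.allclose(z_bar, z - z.mean()))   # True: componentwise centering
```

Applying $P$ is the same as subtracting the componentwise mean, which is why $\bar\Sigma$ has a zero eigenvalue in the $1_p$ direction and why only its nonzero eigenvalues enter the conditions.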
It is straightforward to verify that problem (4) is equivalent to the constrained convex optimization problem (11). Without loss of generality, we assume $k_0 = p$. We will use the primal-dual witness (PDW) method (Wainwright 2009) to prove that (11) recovers the support of $\beta^*$ with probability tending to 1. The PDW method lays out the following steps to construct a pair $(\hat\theta_\tau, \hat z) \in \mathbb{R}^{p+1} \times \mathbb{R}^{p}$.
The next theorem presents the support recovery of our method.

Tuning Parameter Selection
The regularization parameter $\lambda_n$ is selected by the generalized information criterion (GIC; Fan and Tang 2013), where $\hat\theta_\lambda$ denotes the regularized estimator, $p \vee n = \max\{p, n\}$, and $d_\lambda$ is the number of nonzero coefficients in $\hat\theta_\lambda$. In view of the zero-sum constraint in (2), the effective number of free parameters is $d_\lambda - 1$. The optimal $\lambda$ is then determined by minimizing $\mathrm{GIC}(\lambda)$. Meanwhile, the robustification parameter $\tau$ is chosen such that 80% of the predicted errors lie in the interval $[-\tau, \tau]$. The remaining regularization parameter $\nu$, which enforces the zero-sum constraint, does not affect the convergence of Algorithm 1 as long as $\nu > 0$, so we set $\nu = 1$ in the numerical studies. We set $\phi_0 = 5$ and $K = 100$. The tolerance parameters are set to $\varepsilon_1 = \varepsilon_2 = 10^{-3}$.
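The two data-driven choices above can be sketched as follows. The exact GIC formula is not fully recoverable from the source, so the criterion below (log of the average Huber loss plus a complexity penalty with $d_\lambda - 1$ effective parameters) is an assumed reconstruction in the spirit of Fan and Tang (2013); the $\tau$ rule follows the 80% coverage description:

```python
import numpy as np

def huber_loss(u, tau):
    a = np.abs(u)
    return np.where(a <= tau, 0.5 * u**2, tau * a - 0.5 * tau**2)

def gic(theta_hat, W, Y, tau, p):
    """GIC-type criterion: log of the average Huber loss plus a
    complexity penalty (assumed reconstruction; see lead-in)."""
    n = len(Y)
    resid = Y - W @ theta_hat
    d_lam = np.count_nonzero(theta_hat)
    # effective free parameters: d_lambda - 1, due to the zero-sum constraint
    pen = (d_lam - 1) * np.log(np.log(n)) * np.log(max(p, n)) / n
    return np.log(huber_loss(resid, tau).mean()) + pen

def choose_tau(resid, coverage=0.8):
    """Pick tau so that roughly 80% of residuals fall in [-tau, tau]."""
    return np.quantile(np.abs(resid), coverage)
```

In practice one would evaluate `gic` on a grid of $\lambda$ values and keep the minimizer, alternating with the `choose_tau` update as the residuals change.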
The initial value of $\alpha$ is set to $\hat\alpha^{(0)} = 0$, while the initial value of $\hat\theta_\tau$ is the ridge estimator for model (2). In light of the conditions on $\beta^*_{\min}$ in Theorem 1, the thresholding value $h_n$ is set to $0.1\sqrt{\log(p)/n}$.
We also observe that our method performs slightly better than HL1 in terms of $\ell_1$- and $\ell_2$-errors. Nevertheless, HL1 violates the zero-sum constraint in finite samples. Moreover, HL2 is inferior to HL1 and the proposed method across almost all scenarios, because the reference component is not regularized by the HL2 method and is always retained in the selected model. As for support recovery, the proposed method is comparable to HL1 and HL2 in terms of TP, and its FP is smaller than that of the three competitors. From Tables 5 and 6, we see that some significant covariates cannot be identified by the LS method in cases (c) and (d) with $p = 500$ and $\rho = 0.5$, while the proposed method performs well in terms of TP in all cases, indicating that our method is more robust than the LS method for support recovery with high-dimensional data. As expected, the performance of all methods improves as $n$ increases from 200 to 300. Finally, we compare our Lasso-based method with the nonconvex penalization methods SCAD (Fan and Li 2001) and MCP (Zhang 2010). The results are reported in Tables 7 and 8. We observe that, under our simulation settings, the Lasso method performs comparably to SCAD and MCP in terms of TP, FP, and $\ell_1$- and $\ell_2$-errors, which suggests that the Lasso penalty is suitable for support recovery.

GDP Satisfaction Data
We apply the proposed method to a GDP satisfaction dataset. The dataset includes 31 samples and 6 covariates: satisfaction very bad ($x_1$, very bad), satisfaction bad ($x_2$, bad), satisfaction moderately bad ($x_3$, moderately bad), satisfaction moderately good ($x_4$, moderately good), satisfaction good ($x_5$, good) and satisfaction very good ($x_6$, very good). The covariate values satisfy $\sum_{j=1}^{6} x_j = 1$ with $x_j > 0$. We take log(GDP) as the response, where GDP is measured per capita for the year 2012. The dataset is available in the R package "robCompositions" (Templ, Hron, and Filzmoser 2020). We randomly split the dataset into 21 training samples and 10 test samples. Model selection with the proposed method and the LS method is performed on the training dataset, and the prediction errors are computed on the test samples: $\sum_{i \in T}(Y_i - W_i^\top \hat\theta_\tau)^2/10$, where $T = \{i : \text{the } i\text{th sample is in the test dataset}\}$ is the test set index. We repeat the random partition 100 times. In each replication, we add 94 spurious variables so that $p$ could be larger than $n$. Specifically, for $7 \le j \le 100$, we set $\tilde x_j$ to $0.2$ times a transform of $x_j$, where $(x_7, \ldots, x_{100})$ are generated from a multivariate normal distribution with mean zero and covariance matrix $\tilde\Sigma = (\tilde\sigma_{jk}) = (0.5^{|j-k|})$. Then we define $z_j = \log(\tilde x_j / \sum_{k=1}^{p} \tilde x_k)$, $j = 1, \ldots, 100$, in which $z_7, \ldots, z_{100}$ are the spurious variables and are independent of the response log(GDP).

Table 6. TP, FP, $\ell_1$-error, $\ell_2$-error and the corresponding standard deviations (in parentheses) with $n = 300$ and $p = 500$.
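The construction of the spurious compositional covariates can be sketched as follows; the map from the correlated normals into small positive parts is garbled in the source, so the sigmoid transform below is an assumption used only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_real, p_spur = 31, 6, 94

# Real compositional covariates (6 satisfaction shares summing to one).
X = rng.dirichlet(np.ones(p_real), size=n)

# Spurious covariates with AR(1)-type correlation 0.5^{|j-k|}, as in the
# text; squashing the normals into (0, 0.2) is an assumed transform.
idx = np.arange(p_spur)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
G = rng.multivariate_normal(np.zeros(p_spur), Sigma, size=n)
X_spur = 0.2 / (1.0 + np.exp(-G))          # positive values in (0, 0.2)

# Renormalize the augmented vector and take the log-ratios z_j.
X_all = np.column_stack([X, X_spur])
X_all /= X_all.sum(axis=1, keepdims=True)
Z = np.log(X_all)

print(X_all.shape)   # (31, 100)
```

After the closure step, each augmented observation again lies in the simplex, so $p = 100 > n = 31$ while the spurious parts remain independent of the response.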
The average of the prediction errors over 100 replicates is 0.694 for the proposed method. We calculate the selection frequencies of the 100 covariates under GIC tuning parameter selection and the stability selection approach. The results are reported in Table 9. We observe that four covariates are selected more than 70 times out of 100 replicates, and the selection results are rather stable. Based on the four covariates, we refit model (2) without penalization. The refitted coefficients are also reported in Table 9, and the fitted versus observed values of log(GDP) are displayed in Figure 1(b). The model with the four selected covariates fits the data reasonably well. In comparison, three covariates are selected more than 70 times out of 100 replicates with the LS method, and its prediction error is 1.257 (> 0.694). Hence, the analysis of this economic dataset again indicates that our method suffers less from overfitting and has a smaller prediction error compared with the LS method.

HIV Microbiome Data
After HIV infection, patients may present with chronic inflammation even when antiretroviral medications have rendered the viral load undetectable. Chronic inflammation may cause tissue damage and is related to many chronic diseases (Brenchley et al. 2006). In recent years, researchers have shown considerable interest in identifying interventions that modify the gut bacterial environment and may reduce inflammation in HIV patients (Klatt et al. 2013; d'Ettorre et al. 2015). Thus, it is desirable to better understand the association between the gut microbial composition and inflammation markers. Here we consider an HIV microbiome dataset reported in Rivera-Pinto et al. (2018) to illustrate the proposed method. This dataset includes 151 samples and 60 gut microbial taxa. The level of soluble CD14 (sCD14), an immune marker related to chronic inflammation, is measured for each sample. In this analysis, we focus on identifying a subset of taxa whose subcomposition is associated with sCD14. Let log(sCD14) be the response and the microbial taxa be the covariates. Figure 1(a) displays the histogram and the density estimate of log(sCD14). The distribution of the response appears asymmetric and slightly skewed to the right. To evaluate the performance of the proposed method, we randomly split the dataset into 75 training samples and 76 test samples. Model selection with the proposed method and the LS method is performed on the training dataset, and the prediction errors are computed on the test samples. We repeat the random partition 100 times. In each replication, we add 140 spurious variables that are independently generated from a Poisson distribution with parameter 5 and are independent of the response log(sCD14). In this way, the number of covariates is $p = 200$, which is larger than the sample size $n = 151$.
Since the number of counts for microbial taxa at the genus rank varies greatly across samples, we do not directly analyze the count data; instead, we transform them into compositional data after replacing zero counts by the maximum rounding error $5 \times 10^{-5}$ (Aitchison 2003).
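The zero-replacement and closure step can be sketched as follows (a minimal version; function and variable names are illustrative):

```python
import numpy as np

def counts_to_composition(counts, delta=5e-5):
    """Turn a count matrix into compositional data, replacing zero
    proportions by the maximum rounding error delta before re-closure
    (Aitchison 2003)."""
    comp = counts.astype(float)
    comp /= comp.sum(axis=1, keepdims=True)    # close to the simplex
    comp[comp == 0.0] = delta                  # replace structural zeros
    comp /= comp.sum(axis=1, keepdims=True)    # re-close after replacement
    return comp

counts = np.array([[10, 0, 30], [5, 5, 0]])
comp = counts_to_composition(counts)
print(np.allclose(comp.sum(axis=1), 1.0))  # True
print((comp > 0).all())                    # True
```

The second closure keeps every row exactly on the simplex, so the log-ratio transform is well defined for all components.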
The average of the prediction errors over 100 replicates is 0.718 for the proposed method. We also calculate the selection frequencies of the covariates under GIC tuning parameter selection. Five taxa are selected more than 70 times out of 100 replicates. To assess the stability of the selected taxa, we further evaluate their selection probabilities using the stability selection approach (Meinshausen and Bühlmann 2010). Specifically, we apply the proposed method to a randomly chosen subsample of size $\lfloor n/2 \rfloor$, where $\lfloor a \rfloor$ denotes the integer part of $a$. Repeating this procedure 100 times, we calculate the selection probabilities of the selected taxa. The results are reported in Table 10, from which we observe that the selection probabilities of all five taxa are greater than 0.7, indicating that the selection results are rather stable. Based on the five selected taxa, we refit model (2) without penalization. The refitted coefficients are reported in Table 10, and the fitted versus observed values of log(sCD14) are displayed in Figure 2(b). The model with the five selected taxa fits the data reasonably well. For comparison, 10 taxa are selected more than 70 times out of 100 replications with the LS method, but its prediction error is 0.982, greater than that of our proposed method. As a result, our method suffers less from overfitting and yields a smaller prediction error in the analysis of this dataset.
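The stability selection step can be sketched as follows; `fit_fn` and the hard-thresholded least-squares fitter in the demo are hypothetical stand-ins for the proposed penalized Huber estimator:

```python
import numpy as np

def stability_selection(fit_fn, W, Y, n_reps=100, seed=0):
    """Refit on random subsamples of size floor(n/2) and record how often
    each slope coefficient is selected (Meinshausen and Bühlmann 2010).
    `fit_fn` is any sparse fitter returning a coefficient vector."""
    rng = np.random.default_rng(seed)
    n, d = W.shape
    freq = np.zeros(d - 1)                     # slopes only
    for _ in range(n_reps):
        sub = rng.choice(n, size=n // 2, replace=False)
        beta_hat = fit_fn(W[sub], Y[sub])[1:]
        freq += (beta_hat != 0)
    return freq / n_reps

# Toy demo with a hypothetical hard-thresholded least-squares fitter.
rng = np.random.default_rng(4)
n, p = 60, 5
W = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
Y = W[:, 1] * 2.0 + 0.1 * rng.standard_normal(n)

def ls_threshold(Wm, Ym, h=0.5):
    coef, *_ = np.linalg.lstsq(Wm, Ym, rcond=None)
    coef[np.abs(coef) < h] = 0.0
    return coef

probs = stability_selection(ls_threshold, W, Y, n_reps=50)
print(probs)   # the strong true signal should be selected in nearly every replicate
```

Coefficients with selection probability above a cutoff (0.7 in the text) are then deemed stably selected.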

Discussion
In this article, we studied signal recovery for a high-dimensional linear log-contrast model, allowing for an asymmetric and heavy-tailed error distribution. The proposed procedure was built on the Huber loss with $\ell_1$ penalization. The $\ell_1$ and $\ell_2$ consistency of the slope parameter vector was established. Under conditions analogous to the irrepresentability condition, we proved that the proposed method can recover the support of the slope parameter vector. The simulation results provided supporting evidence that our method is robust and works well in the various settings considered. Applications to a GDP satisfaction dataset and an HIV microbiome dataset were provided to illustrate our method. In this article, we focused on the high-dimensional linear log-contrast model for simplicity. Extending our method to a general subcompositional linear log-contrast model (Shi, Zhang, and Li 2016) would be an interesting problem. Moreover, the predictors are required to be sub-Gaussian in our technical derivations. It is nontrivial to directly extend our techniques to accommodate heavy-tailed predictors in high dimensions; some moment conditions on the covariates would be needed. We leave this for future research. Furthermore, the independence assumption between the error and the covariate vector is crucial for proving Proposition 1. Without this assumption, Proposition 1 may not hold, and the comprehensive theoretical techniques in Fan, Li, and Wang (2017), Fan et al. (2018) and Pan, Sun, and Zhou (2021) would be needed to study the asymptotic properties accordingly.
In addition, since we only require that the first moment of $|\varepsilon|$ is bounded, our scaling condition $s^3 \log^5(p) = o(n^{\alpha_1})$ is stronger than those imposed by Loh (2017) and Sun, Zhou, and Fan (2020). In this article, we only studied the theoretical properties of the Lasso estimator. It would be interesting to consider other folded concave penalty functions, such as SCAD and MCP, in future work.

Supplementary Materials
All proofs are provided in the supplementary materials.