Sparse Reduced Rank Huber Regression in High Dimensions

Abstract We propose a sparse reduced rank Huber regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained nonconvex optimization problem, which is then solved using a block coordinate descent and an alternating direction method of multipliers algorithm. We establish nonasymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded (1+δ)th moment with δ∈(0,1), the rate of convergence is a function of δ, and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. We illustrate the performance of the proposed method via extensive numerical studies and a data application. Supplementary materials for this article are available online.


Introduction
Low rank matrix approximation methods have enjoyed successes in modeling and extracting information from large and complex data across various scientific disciplines.However, large-scale datasets are often accompanied by outliers due to possible measurement error, or because the population exhibits a leptokurtic distribution.As shown in She and Chen (2017), one single outlier can have a devastating effect on low rank matrix estimation.Consequently, nonrobust procedures for low rank matrix estimation could lead to inferior estimates and spurious scientific conclusions.For instance, in the context of financial data, it is evident that asset prices follow heavy-tailed distributions: if the heavy-tailedness is not accounted for in statistically modeling, then the recovery of common market behaviors and asset return forecasting may be jeopardized (Müller, Dacorogna, and Pictet 1998;Cont 2001).
In the context of reduced rank regression, She and Chen (2017) addressed this challenge by explicitly modeling the outliers with a sparse mean shift matrix of parameters.Similar ideas have been considered in the context of robust linear regression (She and Owen 2011) and robust clustering (Liu et al. 2012;Wang et al. 2016).In many statistical applications, the outliers themselves are not of interest.Rather than introducing additional parameters to model the outliers, it is more natural to develop robust statistical methods that are less sensitive to outliers.There is limited work along these lines in low rank matrix approximation problems.In fact, She and Chen (2017) pointed out that in the context of reduced rank regression, directly applying a robust loss function that down-weights the outliers, such as the Huber loss, may result in nontrivial computational and theoretical challenges due to the low rank constraint.So a natural question arises: can we develop a computationally efficient robust sparse low rank matrix approximation procedure that is less sensitive to outliers and yet has sound statistical guarantees?
In this article, we propose a novel method for fitting robust sparse reduced rank regression in the high-dimensional setting.We propose to minimize the Huber loss function subject to both sparsity and rank constraints.This leads to a nonconvex optimization problem, and is thus, computational intractable.To address this challenge, we consider a convex relaxation for both the sparsity and rank constraints, which can be solved efficiently.A similar convex relaxation has also been considered in Chen and Huang (2012) and Richard, Savalle, and Vayatis (2012) under the least squares loss.We note that Bunea, She, and Wegkamp (2012) proposed a group-lasso type penalty with a rank constraint to encourage the solution to be group-wise sparse and low rank under the least squares loss.
Most of the existing theoretical analysis of reduced rank regression focuses on rank selection consistency and prediction consistency (Bunea, She, and Wegkamp 2011;Mukherjee and Zhu 2011;Bunea, She, and Wegkamp 2012;Chen, Dong, and Chan 2013;Luo and Qi 2017;She 2017).Moreover, as shown by several authors, in order to achieve consistency, the number of covariates and the number of responses need to be much smaller than the sample size (Candes et al. 2011;She 2017).This motivates the use of sparsity penalty to accommodate possible high-dimensional covariates and responses.However, in the high-dimensional setting, nonasymptotic analysis of the estimation error is not well established, even in the context of reduced rank regression. 1 To bridge this gap in the literature, we provide nonasymptotic analysis of the estimation error under both Frobenius and nuclear norms for robust sparse reduced rank regression with the Huber loss.
The Huber loss has a robustification parameter that trades bias for robustness.In past work, the robustification parameter is usually fixed using the 95%-efficiency rule (among others, Huber 1964Huber , 1973;;Portnoy 1985;Mammen 1989;He and Shao 1996).Therefore, estimators obtained under Huber loss are typically biased.To achieve asymptotic unbiasedness and robustness simultaneously, within the context of robust linear regression, Sun, Zhou, and Fan (2018) showed that the robustification parameter has to adapt to the sample size, dimensionality, and moments of the random noise.Motivated by Sun, Zhou, and Fan (2018), we will establish theoretical results for the proposed method by allowing the robustification parameter to diverge.
The robustness of our proposed estimator is evidenced by its finite sample performance in the presence of heavy-tailed data, that is, data for which high-order moments are not finite.When the sampling distribution is heavy-tailed, there is a higher chance that some data are sampled far away from their mean.We refer to these outlying data as heavy-tailed outliers.Theoretically, we establish nonasymptotic results that quantify the tradeoff between heavy-tailedness of the random noise and statistical bias: for random noise with bounded (1 + δ)th moment, the rate of convergence, depending on δ, is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we recover results as if sub-Gaussian errors were assumed; and the transition between the two regimes is smooth.
Notation: For any vector u = (u 1 , . . ., u p ) T ∈ R p and q ≥ 1, let ||u|| q = p j=1 |u j | q 1/q denote the q norm.Let ||u|| 0 = p j=1 1(u j = 0) denote the number of nonzero entries of u, and let ||u|| ∞ = max 1≤j≤p |u j |.For any two vectors u, v ∈ R p , let u, v = u T v.Moreover, for two sequences of real numbers a n , and a n b n signifies that a n b n and b n a n .If A is an m × n matrix, we use ||A|| q to denote its order-q operator norm, defined by ||A|| q = max u∈R n ||Au|| q /||u|| q .We define the (p, q)-norm of a m × n matrix A as the usual q norm of the vector of row-wise p norms of A: where A j• is the jth row of A. We use λ k to denote the nuclear norm of A, where be the Frobenius norm of A. Finally, let vec(A) be the vectorization of the matrix A, obtained by concatenating the columns of A into a vector.
1 We note that the prediction error bound can be used to derive an estimation error bound under some further incoherence condition on the design matrix.

Formulation
Suppose we observe n independent samples of q-dimensional response variables and p-dimensional covariates.Let Y ∈ R n×q be the observed response and let X ∈ R n×p be the observed covariates.We consider the matrix regression model where A * ∈ R p×q is the underlying regression coefficient matrix and E ∈ R n×q is an error matrix.Each row of E is an independent mean-zero and potentially heavy-tailed random noise vector.
Reduced rank regression seeks to characterize the relationships between Y and X in a parsimonious way by restricting the rank of A * (Izenman 1975).An estimator of A * can be obtained by solving the optimization problem minimize (2) where r is typically much smaller than min{n, p, q}.Due to the rank constraint on A, (2) is nonconvex: nonetheless, the global solution of (2) has a closed form solution (Izenman 1975).
It is well-known that squared error loss is sensitive to outliers or heavy-tailed random error (Huber 1973).To address this issue, it is natural to substitute the squared error loss with a loss function that is robust against outliers.We propose to estimate A * under the Huber loss function, formally defined as follows.
Definition 1 (Huber Loss and Robustification Parameter).The Huber loss τ (•) is defined as where τ > 0 is referred to as the robustification parameter that trades bias for robustness.
The Huber loss function blends the squared error loss (|z| ≤ τ ) and the absolute deviation loss (|z| > τ), as determined by the robustification parameter τ .Compared to the squared error loss, large values of z are down-weighted under the Huber loss, thereby resulting in robustness.Generally, an estimator obtained from minimizing the Huber loss is biased.The robustification parameter τ quantifies the tradeoff between bias and robustness: a smaller value of τ introduces more bias but also encourages the estimator to be more robust to outliers.We will provide guidelines for selecting τ based on the sample size and the dimensions of A * in later sections.Throughout the article, for M ∈ R p×q , we write τ (M) = p i=1 q j=1 τ (M ij ) for notational convenience.
In the high-dimensional setting in which n < p or n < q, it is theoretically challenging to estimate A * accurately with only the low rank assumption.To address this challenge, Chen, Chan, and Stenseth (2012) and Chen and Huang (2012) proposed methods for simultaneous dimension reduction and variable selection.In particular, they decomposed A * into the product of its singular vectors, and imposed sparsity-inducing penalty on the left and right singular vectors.Thus, their proposed methods involve solving optimization problems with nonconvex objective.
Given that the goal is to estimate A * rather than its singular vectors, we propose to estimate A * directly.Under the Huber loss, a robust and sparse estimate of A * can be obtained by solving the optimization problem: minimize and card(A) ≤ k, (3) where card(A) is the number of nonzero elements in A. Optimization problem (3) is nonconvex due to the rank and cardinality constraints on A. We instead propose to estimate A * by solving the following convex relaxation: minimize where λ and γ are nonnegative tuning parameters, || • || * is the nuclear norm that encourages the solution to be low rank, and ||•|| 1,1 is the entry-wise 1 -norm that encourages the solution to be sparse.The nuclear norm and the 1,1 norm constraints are the tightest convex relaxations of the rank and cardinality constraints, respectively (Recht, Fazel, and Parrilo 2010;Jojic, Saria, and Koller 2011).A similar convex relaxation has also been considered in Chen and Huang (2012) and Richard, Savalle, and Vayatis (2012) under the least squares loss.We now discuss the close connection between our proposed method and that of She and Chen (2017).As shown by Lemma 2 of She and Chen (2017), ( 4) is equivalent to minimize where C ∈ R p×q is a matrix of augmented parameters that models the outliers.Optimization problem (5) is motivated by the mean shift model Y = XA * + C * + E, where A * ∈ R p×q is a matrix of regression coefficients, C * is a sparse matrix that models the outliers, and E is a matrix of sub-Gaussian random noise with some abuse of notation.
Let A be a solution obtained from solving (4) or (5).The primary advantage of studying the estimator A using the proposed framework in (4) is that it allows us to study the estimator A under a different perspective than that of the mean shift model (She and Chen 2017).We derive the theoretical results under the model Y = XA * + E, where E are the random noise with bounded (1 + δ)th moment condition.In other words, we consider the case when the outliers are modeled as heavy-tailed random noise rather than a mean shift parameter.Under the bounded moment condition, we show that the rate of convergence for A undergoes a phase transition as a function of δ in Theorem 1.Our results complement that of the theoretical results derived under the mean shift model in She and Chen (2017) and offer a different perspective to A, illustrating the power of Huber loss under the setting when the random noise is heavy-tailed.
Remark 1.Since optimization problems (4) and ( 5) are equivalent in that they yield the exact same solution A, results such as those of She and Chen (2017) can be obtained by analyzing the solution A under (5).We leave this for future work.

Algorithm
In this section, we provide an efficient algorithm to obtain the proposed estimator A. Due to the equivalence between (4) and ( 5), we start with deriving a block coordinate descent type algorithm for solving (5).One advantage of solving ( 5) is that the estimated augmented matrix C for modeling outliers is outputted as a by-product, which may be of interest to practitioners.We note that optimization problem (4) can also be solved directly using an alternating direction method of multipliers (ADMM) algorithm, which we provide in Section A of the online supplementary materials for completeness.
A block coordinate descent algorithm for solving (5) involves updating the matrices A and C iteratively while holding the other fixed until convergence.Given a fixed A, the update for C can be obtained by solving minimize which yields a closed-form solution of C = S(Y−XA, τ ), where S denote the soft-thresholding operator, applied element-wise to a matrix, that is, Given C, the update for A can then be obtained by solving minimize Optimization problem ( 6) does not admit a closed-form solution.To this end, we derive an ADMM algorithm for solving (6) (Eckstein and Bertsekas 1992;Boyd et al. 2010).Specifically, we note that ( 6) is equivalent to minimize The scaled augmented Lagrangian of (7) takes the form where A, Z, W are the primal variables, and N and M are the dual variables.Note that the ADMM algorithm is an iterative algorithm.At the kth iteration, the ADMM algorithm requires the following updates: The derivation for the closed-form updates are standard for least squares loss and are omitted.The details of the proposed algorithm are summarized in Algorithm 1.
Note that the term (X T X + 2ρI) −1 can be computed outside of the loop.Thus, the computational bottleneck in each iteration of Algorithm 1 is the singular value decomposition of a p × q matrix with computational complexity O{p 2 q + q 2 p + min(q 3 , p 3 )}.constants ρ > 0, τ > 0, λ > 0, γ > 0, and > 0. 2. Iterate until the stopping criterion , where A t is the value of A obtained at the tth iteration of the block coordinate descent algorithm.
(a) C = S(Y − XA, τ ), where S denote the soft-thresholding operator, applied element-wise to a matrix:

Statistical Theory
We study the theoretical properties of A obtained from solving (4).Let V p,q = {U ∈ R p×q : U T U = I q } be the Stiefel manifold of p × q orthonormal matrices.Throughout the theoretical analysis, we assume that A * can be decomposed as where n, and rs u s v n.Consequently, A * is sparse and low rank.Let S = supp(A * ) be the support set of A * with cardinality |S| = s, that, S contains indices for the nonzero elements in A * .Note that s ≤ rs u s v .
For simplicity, we consider the case of fixed design matrix X and assume that the covariates are standardized such that max i,j |X ij | = 1.To characterize the heavy-tailed random noise, we impose a bounded moment condition on the random noise.
Condition 1 (Bounded Moment Condition).Let δ > 0. Assume that each entry of the random error matrix E in (1) has bounded (1 + δ)th moment, and let Condition 1 is a relaxation of the commonly used sub-Gaussian assumption to accommodate heavy-tailed random noise.For instance, the t-distribution with degrees of freedom larger than one can be accommodated by the bounded moment condition.This condition has also been used in the context of high-dimensional Huber linear regression (Sun, Zhou, and Fan 2018).Note that Condition 1 allows for heterogeneous random noise as long as the random noise has at least bounded (1 + δ)th moment.
Let H τ (A) be the Hessian matrix of the Huber loss function τ (Y − XA) /n in (3).In addition to the random noise, the Hessian matrix is a function of the parameter A, and H τ (A) may equal zero for some A, because the Huber loss is linear at the tails.To avoid singularity of H τ (A), we will study the Hessian matrix in a local neighborhood of A * .To this end, we define and impose conditions on the localized restricted eigenvalues of H τ (A).
Definition 2 (Localized Restricted Eigenvalues).The minimum and maximum localized restricted eigenvalues for H τ (A) are defined as where Condition 2. There exist constants 0 < κ lower ≤ κ upper < ∞ such that the localized restricted eigenvalues of H τ are lowerand upper-bounded by A similar type of localized condition was proposed in Fan et al. (2018) for general loss functions and in Sun, Zhou, and Fan (2018) for the analysis of robust linear regression in high dimensions.In what follows, we justify Condition 2 by showing that it is implied by the restricted eigenvalue condition on the empirical Gram matrix S = X T X/n.To this end, we define the restricted eigenvalues of a matrix and then place a condition on the restricted eigenvalues of S.
Definition 3 (Restricted Eigenvalues of a Matrix).Given ξ > 1, the minimum and maximum restricted eigenvalues of S are defined as

respectively.
Condition 3.There exist constants 0 < κ lower ≤ κ upper < ∞ such that the restricted eigenvalues of S are lower-and upperbounded by Condition 3 is a variant of the restricted eigenvalue condition that is commonly used in high-dimensional nonasymptotic analysis.It can be shown that Condition 3 holds with high probability if each row of X is a sub-Gaussian random vector.
Under Condition 3, we now show that the localized restricted eigenvalues for the Hessian matrix are bounded with high probability under conditions on the robustification parameter τ and the sample size n.That is, we prove that the localized restricted eigenvalues condition in Condition 2 holds with high probability under Condition 3. The result is summarized in the following lemma.
Under Conditions 1 and 3, there exists constants κ lower and κ upper such that the localized restricted eigenvalues of H τ (A) Lemma 1 shows that Condition 2 holds with high probability, as long as Condition 3 on the empirical Gram matrix S holds.Note that the constants κ lower and κ upper also appear in Condition 3.
We now present our main results on the estimation error of A under the Frobenius norm and nuclear norm in the following theorem.For simplicity, we will present our main results conditioned on the event that Conditions 1-2 hold.
Theorem 1 establishes the nonasymptotic convergence rates of our proposed estimator under both Frobenius and nuclear norms in the high-dimensional setting.To the best of our knowledge, we are the first to establish such results on the estimation error for robust sparse reduced rank regression under heavy-tailed random noise.By contrast, most of the existing work on reduced rank regression focuses on rank selection consistency and prediction consistency (Bunea, She, andWegkamp 2011, 2012).Moreover, in order to achieve consistency, the number of covariates and the number of responses need to be much smaller than the sample size (Candes et al. 2011;She 2017).This motivates the use of the sparsity inducing penalty to reduced rank regression to accommodate high-dimensional covariates and responses.When the random noise has second or higher moments, that is, δ ≥ 1, our proposed estimator achieves a parameteric rate of convergence as if sub-Gaussian random noise were assumed.It achieves a slower rate of convergence only when the random noise is extremely heavy-tailed, that is, 0 < δ < 1.
Remark 2. As pointed out by the referees and the associate editor, several authors have considered sparse reduced rank regression using row-wise sparsity inducing penalty with low rank constraint or the nuclear norm penalty (Chen and Huang 2012;Bunea, She, and Wegkamp 2012;She 2017).We want to emphasize that our proposed method considers response selection while row-wise sparsity penalty does not.Consequently, we can relax the dependence of our result on q, by only involving the logarithmic of q.Because of this, it is difficult to compare the convergence rate between the two approaches directly.
Remark 3. The restricted eigenvalue type conditions are needed for establishing the estimation error of A. From Theorem 1, a prediction error bound for ||X( A − A * )|| 2 F can be obtained directly.On the other hand, if the risk excess error is of interest, then the restricted eigenvalue type conditions can be removed.
Intuitively, one might expect the optimal rate of convergence under the Frobenius norm to have the form A − A * F √ r(s u +s v ) {log(pq)/n} min{δ/(1+δ),1/2} , since there are a total of roughly r(s u + s v ) nonzero parameters to be estimated in A * as defined in (8).To validate the aforementioned intuition, we provide the minimax lower bound for sparse and low rank estimation under the Gaussian random noise in Theorem 2. The minimax lower bound under random noise with (1 + δ)th bounded moment remains an open problem and we leave it for future work.We consider the following family of all rank r sparse matrices: where λ k (A) be the kth largest singular value of A.
Theorem 2. Assume that each entry of the random noise E ij are independent and identically distributed from a standard normal distribution.Suppose that 2(r − 1) max(s u , s v ) ≤ min(p, q).Then, we have inf where C 1 is a constant depending only on κ upper .Moreover, under the scaling condition {max(s u , s v )} 3 pq min(p 3 , q 3 ), we have inf n log(pq) , (10) where C 2 is a constant depending only on κ upper .
If we assume that the random noise has at least a bounded second moment, that is, δ ≥ 1, then the rate of convergence in Theorem 1 reduces to the following: Comparing ( 11) and ( 10), we see that under the Frobenius norm, the rate of convergence for our proposed method is slower than that of the minimax optimal rate by a scaling factor of √ s u s v /(s u + s v ).The lost of the scaling factor is due to the convex relaxation (4) where we estimate A directly rather than estimating the sparse singular vectors u k and v k separately.Finally, we note that the above results also illustrate the power of Huber loss: the rate of convergence of the proposed method under bounded second moment condition on the random noise matches that of the optimal rate derived under Gaussian error up to a scaling factor.
Remark 4. The class F defined in ( 9) is a sub-class of rs u s vsparse and r-rank matrices.Thus, the derived lower bound in Theorem 2 is also a lower bound for element-wise sparse and low-rank matrices.Here we use F instead of a class of sparse and low rank matrices because it is needed to obtain the nuclear norm convergence rate.Moreover, as the sparse singular value decomposition structure in F naturally gives a low rank and sparse A with at most rs u s v nonzeros, we are able to obtain the convergence rate under the Frobenius norm.We emphasize that rs u s v is a tight upper bound for the number of nonzeros in A. To see this, considering the case where r = s v = 1, then rs u s v = s u and the coefficient matrix A has exactly s u nonzeros.

Numerical Studies
We perform extensive numerical studies to evaluate the performance of our proposal for robust sparse reduced rank regression.Seven approaches are compared in our numerical studies: classical reduced rank regression, classical; our proposal with Huber loss, hubersrrr; our proposal with squared error loss (with τ → ∞), srrr; signal extraction approach for sparse multivariate response (Luo and Qi 2017), SiER; robust reduced rank regression with an additional mean parameter that models element-wise outliers (She and Chen 2017), r4; penalized reduced rank regression via an adaptive nuclear norm (Chen, Dong, and Chan 2013), rrr; and the penalized reduced rank regression via a ridge penalty (Mukherjee and Zhu 2011), rrridge.The proposals classical, rrridge, rrr, and r4 do not assume sparsity on the regression coefficients.Among the seven proposals, only hubersrrr and r4 are robust against outliers.For all of our numerical studies, we generate each row of X from a multivariate normal distribution with mean zero and covariance matrix , where ij = 0.5 |i−j| for 1 ≤ i, j ≤ p.Then, all elements of X are divided by the maximum absolute value of X such that max i,j |X ij | = 1.The response matrix Y is then generated according to Y = XA * + E. We consider two different types of outliers: (a) heavy-tailed random noise E, and (b) contamination of some percentage of the elements of Y.We simulate data with sparse and nonsparse low rank matrix A * .The details for the different scenarios will be specified in Section 4.1.
Our proposal hubersrrr involves three tuning parameters.We select the tuning parameters using 5-fold crossvalidation with the absolute median loss as the criterion: we vary λ across a fine grid of values, consider four values of γ = {2.5, 3, 3.5, 4} as suggested by Theorem 1, and considered a range of the robustification parameter τ = c{n/ log(pq)} 1/2 , where c = {0.4,0.45, . . ., 1.45, 1.5}.The tuning parameters for srrr are selected in a similar fashion with τ → ∞.We note that the tuning parameters can also be selected using a calibrated structured cross-validation proposed in She and Tran (2019).For scenarios with nonsparse regression coefficients, we simply set γ = 0 for hubersrrr and srrr for fair comparison against other approaches that do not assume sparsity.For r3, we select the tuning parameter using five different information criteria implemented in the R package rrpack (Chen, Dong, and Chan 2013), and report the best result.For rrridge, we specify the correct rank for A * and consider a fine grid of tuning parameters for the ridge penalty and report the best result.For classical, we specify the correct rank for A * .We implement SiER using the default option in the R package SiER.The method r4 has two tuning parameters that control the sparsity of the mean shift parameter for modeling outliers and the rank of A * .We implement r4 by specifying the correct rank of A * , and choose the sparsity tuning parameter according to 5-fold cross-validation.Since r4 is nonconvex, the final solution may depend on the initialization of the parameter of interest.We input the true regression coefficients A * as an initial estimator for r4.In other words, we give a major advantage to rrridge, r4, and classical in that we provide the rank of A * as an input as well as A * as an initializer.
To evaluate the performance across different methods, we calculate the difference between the estimated regression coefficients A and the true coefficients A * under the Frobenius norm.In addition, for scenarios with in which A * is sparse, we calculate the true and false positive rates (TPR and FPR), defined as the proportion of correctly estimated nonzeros in the true parameter, and the proportion of zeros that are incorrectly estimated to be nonzero in the true parameter, respectively.Since some existing approaches are not applicable in the high-dimensional setting, we perform numerical studies under the low-dimensional setting in which n ≥ p in Section 4.1.We then illustrate the performance of our proposed methods, hubersrrr and srrr, compared to SiER and r4, in the high-dimensional setting in Section 4.2.In Section 4.3, we illustrate the phase transition phenomenon in Theorem 1 via numerical studies.In Section 4.4, we assess whether the proposed estimator estimates the rank accurately by plotting the top ten singular values of A.
Table 1.The mean (and standrard error) of the difference between the estimated regression coefficients and the true regression coefficients under the Frobenius norm, averaged over 100 datasets, in the setting where A * is not sparse, with n = 200, p = 50, and q = 10.

Low-Dimensional Setting with n ≥ p
In this section, we perform numerical studies with n = 200, p = 50, and q = 10.We first consider two cases in which A * has low rank but is not sparse: 1. Rank one matrix: We then generate random noise E ∈ R n×q from three different distributions: (a) the normal distribution N(0, 4), (b) the tdistribution with degrees of freedom 1.5, and (c) the log-normal distribution log N(0, 1.2 2 ).Moreover, we consider a contamination scenario in which we generate each element of E from the N(0, 4) distribution, and then randomly contaminate 5% and 10% of the elements in Y by replacing them with random values generated from a uniform distribution on the interval [10,20].The estimation error for each method under the Frobenius norm, averaged over 100 datasets, is reported in Table 1.
From Table 1, we see that rrr and rrridge outperform all other methods when A * is rank one under Gaussian noise.This is not surprising, since rrr and rrridge are tailored for reduced rank regression without outliers.We see that hubersrrr has similar performance to srrr, suggesting that there is no loss of efficiency for hubersrrr even when there are no outliers.When the random noise is generated from the t-distribution, r4 has the best performance, followed by hubersrrr.Note that r4 is nonconvex and we provide the true regression coefficients A * as an initializer.The estimation errors for methods that do not model the outliers are substantially higher.For log-normal random noise, hubersrrr outperforms r4.Under the data contamination model, r4 and hubersrrr perform similarly, and both outperform all of the other methods.These results corroborate the observation in She and Chen (2017) that the estimation of low rank matrices is extremely sensitive to outliers.As we increase the contamination percentage of the observed outcomes, we see that the performance of the nonrobust methods deteriorates.Similar results are observed for the case when A * has rank two.
Next, we consider two cases in which A * is both sparse and low rank: 1. Sparse rank one matrix: The heavy-tailed random noise and data contamination scenarios are as described earlier.The results, averaged over 100 datasets, are reported in Table 2.When A * is sparse, SiER, hubersrrr and srrr outperform all of the methods that do not assume sparsity.In particular, we see that r4 has the worst performance when the random noise is normal or log-normal, or when the data are contaminated.The method rrr has an MSE of 5.00 when the data are contaminated, due to the fact that the information criteria always select models with the regression coefficients estimated to be zero.Under the Gaussian error without any outliers, SiER has a lower MSE than our proposed method when the rank of A * is one.In summary, our proposal hubersrrr has the best performance across most scenarios and is robust against different types of outliers.

High-Dimensional Setting with p > n
In this section, we assess the performance of our proposed method in the high-dimensional setting, when the matrix A * is sparse.To this end, we perform numerical studies with q = 10, p = 200, and n = 150.The methods classical, rrr, and rrridge do not assume sparsity and do not model outliers, therefore, their results are omitted.We consider low rank and sparse matrices A * described in Section 4.1.Similarly, two types of outliers are considered: heavy-tailed random noise and data contamination.The TPR, FPR, and estimation error under Frobenius norm for both types of scenarios, averaged over 100 datasets, are summarized in Tables 3 and 4, respectively.NOTE: Three distributions of random noise are considered: normal, t, and log-normal.We also considered contaminating 5% or 10% of the elements of Y.We report the mean (and standard error) of the true and false positive rates, and the difference between A and A * under Frobenius norm, averaged over 100 datasets.We see that for Gaussian random noise, SiER has the lowest estimation error, followed by hubersrrr and srrr.We note that hubersrrr and srrr have similar results, indicating that there is little loss of efficiency when there are no outliers.However, in scenarios in which the random noise is heavytailed, hubersrrr has high TPR, low FPR, and low Frobenius norm compared to all of the other methods.In fact, we see that when the random noise is heavy-tailed, the TPR and FPR of srrr and SiER are very low.We see similar performance for the case when the data are contaminated in Table 4.These results suggest that hubersrrr should be preferred in all scenarios since it allows accurate estimation of A * when the random noise are heavy-tailed, or under data contamination.Moreover, there is little loss of efficiency compared to srrr and SiER when there are no outliers.

Phase Transition Phenomenon
Similar to Section 4.2, we generate the response matrix Y = XA * +E with a sparse rank two matrix T , and v 2 = (0 T 2 , 1.5 T 4 , 0 T q−6 ) T .To validate the phase transition behavior, we generate each element of the random noise E from a t df distribution with degrees of freedom df.The t df distribution has a finite (1 + δ)th moment provided that the degrees of freedom is larger than 1+δ.We take df = {1.5, 1.6, . . ., 3.5}.The  differences between A and A * under the Frobenius and nuclear norm across different values of δ, averaged over 200 replications, are reported in Figure 1 for the case when n = 400, p = 500, and q = 10.We also calculate the theoretical bound as a function of δ, that is, ν 1/ min(1+δ,2) δ log(pq)/n min{δ/(1+δ),1/2} , where we set ν δ = 20 to obtain the curves in Figure 1.
From Figure 1, we see that the curve obtained theoretically matches with the estimation error of A under both the Frobenius norm and nuclear norm, across a range of δ.In particular, we observe that there is a phase transition phenomenon: when δ < 1, the estimation error decreases as δ increases; when δ ≥ 1, the estimation error under both Frobenius and nuclear norms are flat even when we increase δ.

Rank Selection Consistency
We consider the simulation setting in Section 4.2 with random noise simulated from the t-distribution with degrees of freedom 1.5.The top 10 singular values for A obtained from our proposed method with n = {150, 500, 1000} and p = 200 are plotted in Figure 2. We see from the left panel of Figure 2 that the largest singular value of A is estimated to be large and the rest of the singular values are approximately zero when rank(A * ) = 1.On the other hand, when rank(A * ) = 2, our proposed method estimates both the first and second largest singular value to be significantly different from zero, while the rest of the singular values are estimated to be approximately zero.These results suggest that the proposed estimator can estimate the rank of A * quite accurately.

Data Application
We apply the proposed robust sparse reduced rank regression to the Arabidopsis thaliana dataset, which consists of gene expression measurements for n = 118 samples (Rodrígues-Concepción and Boronat 2002; Wille et al. 2004;Ma, Gong, and Bohnert 2007;Tan, Witten, and Shojaie 2015;She and Chen 2017).It is known that isoprenoids play many important roles in biochemical functions such as respiration, photosynthesis, and regulation of growth in plants.Here, we explore the connection between two isoprenoid biosynthesis pathways and some downstream pathways.
Similar to She and Chen (2017), we treat the p = 39 genes from two isoprenoid biosynthesis pathways as the predictors, and treat the q = 795 genes from 56 downstream pathways as the response.Thus, X ∈ R 118×39 and Y ∈ R 118×795 , and we are interested in fitting the model Y = XA + E. We scale each element of X such that max i,j |X ij | = 1, and standardize each column of Y to have mean zero and standard deviation one.
In Section 4.2, we illustrated with numerical studies that if the response variables are heavy-tailed, sparse reduced rank regression with squared error loss will lead to incorrect estimates.We now illustrate the difference between solving (4) with Huber loss and squared error loss.We set γ = 3, and pick λ such that there are 1000 nonzeros in the estimated coefficient matrix.For the robust method, we set the robustification parameter to equal τ = 3 for simplicity.In principle, this quantity can be chosen using cross-validation.
Let A hubersrrr and A srrr be the estimated regression coefficients for the robust and nonrobust methods, respectively.To measure the difference between the two approaches in terms of regression coefficients and prediction, we compute the quantities || A hubersrrrst − A srrr || F /|| A hubersrrr || F ≈ 37% and ||X A hubersrrr − X A srrr || F /||X A hubersrrr || F ≈ 35%.
Figure 3 displays scatterplots of the right singular vectors of X A srrr against the right singular vectors of X A hubersrrr .We see that while the first singular vectors are similar between the two methods, the second and third singular vectors are very different.These results suggest that the regression coefficients and model predictions can be quite different between robust and nonrobust methods when there are outliers, and that care needs to be taken during model fitting.
Next, we assess the prediction accuracy of our proposed method under both Huber loss and squared error loss.Specifically, we split the data into training set with n train = 100 and test set with n test = 18.Then, we fit the proposed method on the training set with tuning parameters selected using crossvalidation similar to that of described in Section 4, and evaluate the prediction accuracy on the test set.We repeat this procedure 1000 times.The prediction error under the Huber and squared error loss are 8836 and 8907, with standard errors 63 and 64, respectively.The improvement for using the Huber loss is mild in this dataset, mainly due to the fact that the outliers themselves are not very large.To further illustrate the advantage of the Huber loss, we repeat the aforementioned analysis with one entry of the response matrix perturbed with the number 50.The prediction error are now 8837 and 9115 for the Huber loss and squared error loss, with a standard error of 63 and 66, respectively.In summary, we see that the prediction error under the Huber loss does not change since it is robust to outliers, whereas the prediction error under squared error loss increases significantly.

Discussion
We propose robust sparse reduced rank regression for analyzing large, complex, and possibly contaminated data.Our proposal is based on a convex relaxation, and is thus, computationally tractable.We show that our proposal is statistically consistent under both Frobenius and nuclear norms in the highdimensional setting in which p > n.By contrast, most of the existing literature in reduced rank regression focus on prediction and rank selection consistency.
Specifically, we focus on quantifying the tradeoff between heavy-tailness of the random noise and the statistical bias.We show that the proposed robust estimator can achieve exponential-type deviation errors only under bounded loworder moments.Our work offers a different perspective to studying robustness.In particular, our framework is different from the conventional perspective on robust statistics under the Huber's -contamination model, which focuses on developing robust procedures with a high breakdown point (Huber 1964).The breakdown point of an estimator is defined roughly as the proportion of arbitrary outliers an estimator can tolerate before the estimator produces arbitrarily large estimates, or breaks down (Hampel 1971).Under the conventional perspective, the proposed method, similar to that of the classical Huber regression, has a breakdown point of 1/n, when both features and responses can be arbitrarily contaminated.We leave the theoretical investigations of robust sparse reduced rank regression with nonconvex truncated losses, that may potentially have resistant estimation, for future work.

Algorithm 1 A
Block Coordinate Descent Algorithm for Solving (5). 1. Initialize the parameters: (a) augmented variable C to the zero matrix.(b) primal variables A, Z, and W to the zero matrix.(c) dual variables B Z and B W to the zero matrix.(d)

Figure 1 .
Figure 1.Estimation error for A under the Frobenius and nuclear norm across a range of δ, averaged over 200 datasets.The black solid line is the estimation error for A and the gray dash line is the theoretical bound.

Figure 2 .
Figure 2. Singular values for A averaged over 100 datasets.The left and right panels present results for simulation settings with rank(A * ) = 1 and rank(A * ) = 2, respectively.The black, blue, and red lines correspond to the results for the case when p = 200 and n = {150, 500, 1000}, respectively.

Figure 3 .
Figure 3. Scatterplots of the leading right singular vectors of X A hubersrrr and X A srrr .
Three distributions of random noise are considered: normal, t, and log-normal.We also considered contaminating 5% or 10% of the elements of Y.

Table 2 .
Results for the case where A * is sparse, with n = 200, p = 50, and q = 10.

Table 3 .
Results for the case when A * is sparse and rank one in the high-dimensional setting with n = 150, p = 200, and q = 10.

Table 4 .
Results for the case when A * is sparse and rank two in the high-dimensional setting with n = 150, p = 200, and q = 10.