Sparse Convoluted Rank Regression in High Dimensions

Abstract. Wang et al. studied high-dimensional sparse penalized rank regression and established its nice theoretical properties. Compared with the least squares, rank regression can have a substantial gain in estimation efficiency while maintaining a minimal relative efficiency of 86.4%. However, the computation of penalized rank regression can be very challenging for high-dimensional data, due to the highly nonsmooth rank regression loss. In this work we view the rank regression loss as a nonsmooth empirical counterpart of a population-level quantity, and a smooth empirical counterpart is derived by substituting a kernel density estimator for the true distribution in the expectation calculation. This view leads to the convoluted rank regression loss and consequently the sparse penalized convoluted rank regression (CRR) for high-dimensional data. We prove some interesting asymptotic properties of CRR. Under the same key assumptions for sparse rank regression, we establish the rate of convergence of the ℓ1-penalized CRR with a tuning-free penalization parameter and prove the strong oracle property of the folded concave penalized CRR. We further propose a high-dimensional Bayesian information criterion for selecting the penalization parameter in folded concave penalized CRR and prove its selection consistency. We derive an efficient algorithm for solving sparse convoluted rank regression that scales well with high dimensions. Numerical examples demonstrate the promising performance of the sparse convoluted rank regression over the sparse rank regression. Our theoretical and numerical results suggest that sparse convoluted rank regression enjoys the best of both sparse least squares regression and sparse rank regression. Supplementary materials for this article are available online.


Introduction
Over the past two decades, there has been a surge of literature on high-dimensional statistics. We refer to Bühlmann and Van De Geer (2011) and Fan et al. (2020) for a comprehensive review of the existing work on this topic. In particular, many penalization methods have been proposed for high-dimensional regression, including ℓ1-penalized regression (Tibshirani 1996), the Dantzig selector (Candes and Tao 2007), and concave-penalized regression (Fan and Li 2001), among others. These techniques are also applicable in other statistical models. The penalized least squares method is at the center of the stage in terms of theoretical and computational developments in high-dimensional regression. The theoretical setup typically assumes that the true model is a linear regression model with homoscedastic variance. As long as the error is sub-Gaussian, the penalized least squares estimator enjoys nice theoretical guarantees even if the number of covariates grows at a nearly exponential rate with the sample size.
An approach for achieving higher efficiency is the penalized Wilcoxon rank regression (or rank regression for short). Wilcoxon rank regression is well studied in classical robust nonparametric statistics (Hettmansperger and McKean 2010). The penalized rank regression was studied by several authors (Wang and Li 2009) in the low-dimensional setting. Recently, penalized rank regression in the high-dimensional setting was fully investigated in Wang et al. (2020). The penalized rank regression estimates the regression coefficients by minimizing

(1/(n(n−1))) Σ_{i≠j} |y_i − y_j − (x_i − x_j)^T β| + Σ_{j=1}^p p_λ(|β_j|)   (1.1)

over β ∈ R^p, where p_λ(·) is some penalty function. The penalized rank regression has several advantages compared with the penalized least squares regression. First, penalized rank regression is shown to possess better efficiency than the least squares approach when the error has a heavy-tailed distribution, while maintaining a good relative efficiency when the error is normally distributed. Second, penalized rank regression enjoys a tuning-free property, which means the theoretically correct tuning parameter can be easily estimated from the dataset without any cross-validation. Although the tuning-free property can also be obtained through other methodologies such as the square-root Lasso (Belloni, Chernozhukov, and Li 2011) and penalized quantile regression (Wang, Wu, and Li 2012), these methods do not necessarily have the first aforementioned efficiency property.
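For concreteness, the short sketch below (Python with NumPy; the variable names are ours and purely illustrative) computes the first term of (1.1), the rank regression loss, as the average absolute pairwise difference of residuals.

```python
import numpy as np

def rank_regression_loss(beta, X, y):
    """Wilcoxon rank regression loss: average of |r_i - r_j| over pairs i != j,
    where r_i = y_i - x_i^T beta (the first term of (1.1))."""
    r = y - X @ beta                    # residuals
    diff = r[:, None] - r[None, :]      # pairwise differences r_i - r_j
    n = len(y)
    return np.abs(diff).sum() / (n * (n - 1))   # diagonal terms are zero
```

The double sum makes the exact loss highly nonsmooth in β, which is the computational bottleneck discussed next.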
Although penalized rank regression has the aforementioned nice theoretical advantages, it can be difficult to use in practice due to computational challenges, especially when the number of covariates in the dataset is very large. It is known that high-dimensional penalized regression with a smooth loss function can be efficiently computed by the cyclical coordinate descent algorithm (Friedman, Hastie, and Tibshirani 2010). However, the loss function in penalized rank regression is highly nonsmooth. In principle, coordinate descent may fail to deliver the right solution due to the nonsmoothness of the objective function. A similar problem arises in quantile regression, in which the check loss is nonsmooth. The computation of quantile regression is done by using interior point algorithms. One way of computing the penalized rank regression is to transform it into a linear program and then apply the interior point algorithm. However, the interior point algorithm does not scale well with high dimensions. Gu et al. (2018) developed an alternating direction method of multipliers for computing high-dimensional quantile regression. Computationally speaking, sparse penalized rank regression is more challenging than penalized quantile regression. The interior point algorithm is not a suitable choice for solving high-dimensional sparse rank regression.
It is natural to ask whether the aforementioned good theoretical properties possessed by rank regression can be retained while achieving good computational efficiency for practical applications. If one focuses on (1.1), then the only solution is to develop an efficient algorithm for solving (1.1) exactly for large-p problems. Recently, Fernandes, Guerre, and Horta (2021) proposed an interesting smoothing technique for solving quantile regression with statistical guarantees. They showed that the smoothed quantile regression can even have a smaller mean squared error than the exact quantile regression for estimating the same conditional quantile function. He et al. (2021) studied the smoothed quantile regression with growing dimensions, and Tan et al. (2022) extended the smoothed quantile regression to the high-dimensional setting. Their work is more interesting from a statistical perspective, because fast computation for quantile regression had already been addressed in Gu et al. (2018). Their work motivated us to develop a smooth version of sparse rank regression from the statistical perspective, as opposed to trying to solve it exactly. For ease of discussion, we call the first term in (1.1) the rank regression loss, although it is not the empirical average of a loss function in empirical risk minimization. If we could replace the rank regression loss in (1.1) with a smooth loss such that the resulting estimator still has the nice theoretical properties of sparse rank regression, then we should focus on solving the smooth problem instead of (1.1). This is what Fernandes, Guerre, and Horta (2021) did for quantile regression. To this end, we consider the expectation of the rank regression loss with respect to the true distribution. The rank regression loss is viewed as the expectation of a random variable with respect to an empirical distribution assigning uniform discrete probability to each observed realization. If we estimate the true distribution by a smoothed kernel density estimator, then we can take the expectation of the same random variable with respect to the smoothed kernel density estimator. The resulting quantity is shown to be smooth and convex, with a Lipschitz continuous derivative. We name it the convoluted rank loss because the kernel density estimator has a convolution interpretation. We then replace the rank regression loss in (1.1) with the convoluted rank loss, and the resulting estimator is called the sparse convoluted rank regression. By its convexity and smoothness, the sparse convoluted rank regression can be efficiently solved by using the generalized coordinate descent algorithm (Yang and Zou 2013).
We systematically study the theoretical properties of the convoluted rank regression and the sparse convoluted rank regression. We establish the asymptotic relative efficiency of convoluted rank regression compared with rank regression. The analysis shows that for a wide range of the smoothing parameter the relative efficiency is very close to one, and in some cases the relative efficiency is above one. Hence, the smoothing parameter does not have to be very small. The theory indicates that convoluted rank regression should be viewed as a new methodology, not an approximate solution to rank regression. We then consider the sparse convoluted rank regression. The goal is to show that it maintains all the essential theoretical properties of sparse rank regression. Specifically, we first establish the rate of convergence of the ℓ1-penalized convoluted rank regression in ultra-high dimensions without assuming a strong moment condition on the error, and the ℓ1-penalized convoluted rank regression is also shown to enjoy the asymptotic tuning-free property. Second, we analyze the folded concave penalized convoluted rank regression and establish its strong oracle property without imposing strong moment conditions on the error. The folded concave penalized regression involves a tuning parameter. We thus further propose a high-dimensional Bayesian information criterion (HBIC) and establish its consistency for the selection of the theoretically optimal tuning parameter.
The rest of this article is organized as follows. In Section 2, we introduce the convoluted rank regression loss and prove some interesting asymptotic properties of convoluted rank regression. Then we formally define the sparse convoluted rank regression estimator. In Section 3, we present the theoretical justifications for the proposed estimators. We also present the HBIC criterion and its theoretical results. In Section 4, we derive an efficient algorithm for solving sparse convoluted rank regression for high-dimensional data. In Section 5, we use simulations and a real data example to compare sparse convoluted rank regression and sparse rank regression. The technical proofs are given in the supplementary materials.

Convoluted Rank Regression
In this section we present the main idea that leads to the convoluted rank regression loss and the sparse convoluted rank regression.

Notation and Definitions
We begin with some necessary definitions. For an arbitrary index set A ⊂ {1, . . ., p}, any vector c = (c_1, . . ., c_p), and any n × p matrix U, let c_A = (c_i, i ∈ A), and let U_A be the submatrix consisting of the columns of U whose indices are in A. The complement of an index set A is denoted by A^c = {1, . . ., p} \ A. For any finite set B, let |B| be the number of elements in B. For a vector c, let ‖c‖_q = (Σ_j |c_j|^q)^{1/q} be its ℓ_q norm for q ≥ 1, let ‖c‖_∞ (or ‖c‖_max) = max_j |c_j| be its ℓ_∞ norm, let ‖c‖_0 = |{j : c_j ≠ 0}| be its ℓ_0 norm, and let ‖c‖_min = min_j |c_j| be its minimum absolute value. For a matrix M, let λ_min(M) and λ_max(M) be its eigenvalues with the smallest and largest absolute values, respectively. This is the common notation for eigenvalues of a matrix, and λ_min, λ_max should not be confused with the penalization parameter used in a penalty function. For any matrix G, let ‖G‖ = √(λ_max(G^T G)) be its spectral norm. In particular, for a vector c, ‖c‖ = ‖c‖_2. For a, b ∈ R, let a ∧ b = min{a, b} and a ∨ b = max{a, b}. For a sequence {a_n} and another nonnegative sequence {b_n}, we write a_n = O(b_n) if |a_n| ≤ C b_n for all n and some constant C > 0, and a_n = o(b_n) if a_n/b_n → 0. Let E(·) denote the expectation with respect to the probability measure P. For a sequence of random variables {Z_n}_{n≥1}, we write Z_n = O_p(1) if lim_{M→∞} lim sup_{n→∞} P(|Z_n| > M) = 0, and we write Z_n = o_p(1) if lim_{n→∞} P(|Z_n| > ε) = 0 for all ε > 0. For two sequences of random variables Z_n and Z'_n, we write Z_n = O_p(Z'_n) if Z_n/Z'_n = O_p(1), and Z_n = o_p(Z'_n) if Z_n/Z'_n = o_p(1).

Canonical Convoluted Rank Regression
Suppose we have the observed data {(y_i, x_i)}_{i=1}^n, where y_i ∈ R is the response value and x_i ∈ R^p is the p-dimensional covariate vector for the ith subject. Let X = (X_1, . . ., X_p) ∈ R^{n×p} be the design matrix, with X_j = (x_{1j}, . . ., x_{nj})^T containing the observations for the jth variable, j = 1, . . ., p. The ith row of X can be written as x_i^T, where x_i = (x_{i1}, . . ., x_{ip})^T. Let y = (y_1, . . ., y_n)^T be the n-dimensional response vector. For the sake of brevity, we adopt the fixed design setting in the sequel, although our methodology can also be justified under the random design setting. Assume that the data {(y_i, x_i)}_{i=1}^n are generated from the following linear regression model:

y_i = x_i^T β* + ε_i,  i = 1, . . ., n,   (2.1)

where {ε_i}_{i=1}^n are iid random errors and β* ∈ R^p is the unknown vector to be estimated. Note that we do not assume the errors in (2.1) have mean zero. Consequently, the intercept can be absorbed into the error term.
The canonical rank regression (Jaeckel 1972; Hettmansperger and McKean 2010) in the fixed dimension setting estimates β* through

min_β (1/(n(n−1))) Σ_{i≠j} |y_i − y_j − (x_i − x_j)^T β|.   (2.2)

Compared with the standard least squares method, the rank regression estimator of β* can have arbitrarily high relative efficiency when the error distribution is heavy-tailed, while having at least 86.4% asymptotic relative efficiency under an arbitrary symmetric error distribution with finite Fisher information (Hettmansperger and McKean 2010).
For each (i, j) pair with i ≠ j, define ζ_ij(β) := y_i − y_j − (x_i − x_j)^T β. For the remaining discussion in this section, we treat {(y_i, x_i)}_{i=1}^n as independent and identically distributed. Although the ζ_ij's are not independent, they still follow an identical distribution. For any β, let F(t, β) denote its cumulative distribution function. After taking the expectation of the objective function in (2.2) with respect to the true distribution of ζ_ij, the population-level objective function is

∫_R |t| dF(t, β).

The convoluted rank loss is obtained by replacing F(t, β) with a smoothed estimate of the distribution of ζ_ij(β). Specifically, we consider using the kernel density estimator with some kernel function K : R → [0, ∞) and bandwidth h > 0.
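To illustrate the construction numerically, the sketch below (Python/NumPy; all names are illustrative, and the Epanechnikov kernel of Remark 1 below is used as an example) checks that integrating |t| against a kernel density estimate built from observed ζ_ij values gives the same number as averaging the per-pair smoothed terms ∫ |ζ_ij + hu| K(u) du, which is the convolution form used in the sequel.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1)

def loss_via_kde(zeta, h, tgrid):
    """E|T| where T follows the kernel density estimate of the zeta values."""
    kde = epanechnikov((tgrid[:, None] - zeta[None, :]) / h).mean(axis=1) / h
    return np.trapz(np.abs(tgrid) * kde, tgrid)

def loss_via_pairs(zeta, h, ugrid):
    """Average over pairs of int |zeta + h*u| K(u) du (the per-pair smoothed term)."""
    vals = np.trapz(np.abs(zeta[:, None] + h * ugrid[None, :]) * epanechnikov(ugrid),
                    ugrid, axis=1)
    return vals.mean()

rng = np.random.default_rng(0)
zeta = rng.standard_normal(500)            # stand-in for observed zeta_ij values
h = 1.0
tgrid = np.linspace(-12, 12, 6001)
ugrid = np.linspace(-1, 1, 2001)
print(loss_via_kde(zeta, h, tgrid), loss_via_pairs(zeta, h, ugrid))  # agree up to quadrature error
```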
Throughout this article, we consider K(·) satisfying standard kernel regularity conditions; under these conditions, the smoothed loss defined below can be written as a convolution of the rank regression loss with the rescaled kernel K_h(·) = h^{-1} K(·/h), where "∗" stands for convolution.
Remark 1. There are many choices of the kernel function K(·) satisfying the conditions mentioned in Section 2.2. For example, the Gaussian kernel K(u) = (2π)^{−1/2} e^{−u²/2} and the Epanechnikov kernel K(u) = (3/4)(1 − u²)I(−1 ≤ u ≤ 1) both satisfy the conditions, where I(·) is the indicator function.
Thus, in the fixed dimension setting, we propose the canonical convoluted rank regression

min_β L_h(β),  where L_h(β) := (1/(n(n−1))) Σ_{i≠j} ∫_R |ζ_ij(β) − t| K_h(t) dt   (2.3)

and K_h(t) = h^{-1} K(t/h). It turns out that the rank regression (2.2) and the convoluted rank regression (2.3) share interesting connections from a theoretical perspective. We next present three theorems to elaborate the connections. The proofs of these theorems are placed in the Appendix of the supplementary materials. Some mild regularity conditions are also placed there, since they are not the main focus of this article. The main message delivered by these theorems is that convoluted rank regression is a good methodology in its own right, so it is worth extending this new method to the high-dimensional setting. Let (y, x) and (y', x') be iid random vectors having the same distribution as (y_i, x_i), which has a continuous distribution on R^{p+1} satisfying y = x^T β* + ε and y' = x'^T β* + ε', with ε being independent of x and ε' being independent of x'. For rank regression, it is well known that the minimizer of the population version of its loss function, that is, arg min_β E|y − y' − (x − x')^T β|, is exactly β*, the true regression coefficient vector. This simple fact justifies that rank regression is valid in the population sense, which is necessary in order for its sample version to aim at estimating the true regression coefficients. One may naturally ask if the population version of (2.3) also has this property. Let β*_h denote the minimizer of the population version of (2.3). We have the following theorem, stating that smoothing via convolution does not incur any bias at all in the population sense.
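For the Epanechnikov kernel, the inner integral in (2.3) can be evaluated in closed form, which makes L_h cheap to compute. The following minimal sketch (Python; our own derivation and naming, offered as an illustration under the convolution form (2.3)) evaluates L_h this way and checks the closed form against direct quadrature.

```python
import numpy as np

def smoothed_abs_epa(v, h):
    """ell_h(v) = int |v - t| K_h(t) dt for the Epanechnikov kernel K(u) = 0.75*(1 - u^2)
    on [-1, 1]: equals |v| for |v| >= h and a quartic polynomial inside (-h, h)."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    out = np.abs(v)
    inside = out < h
    t = v[inside]
    out[inside] = 3 * h / 8 + 3 * t**2 / (4 * h) - t**4 / (8 * h**3)
    return out

def crr_loss(beta, X, y, h):
    """Convoluted rank loss L_h(beta) in (2.3), via the closed form above."""
    r = y - X @ beta
    zeta = r[:, None] - r[None, :]                       # zeta_ij(beta)
    n = len(y)
    # exclude the n diagonal terms, each equal to ell_h(0)
    return (smoothed_abs_epa(zeta.ravel(), h).sum()
            - n * smoothed_abs_epa(0.0, h)[0]) / (n * (n - 1))

# numerical check of the closed form against quadrature (h = 1)
u = np.linspace(-1.0, 1.0, 200001)
K = 0.75 * (1.0 - u**2)
vs = np.array([-2.3, -0.4, 0.0, 0.7, 1.5])
quad = np.array([np.trapz(np.abs(v + u) * K, u) for v in vs])
print(np.max(np.abs(quad - smoothed_abs_epa(vs, 1.0))))   # ~1e-10
```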
Theorem 1. For any h > 0, we have β*_h = β*.

Remark 2. Note that the smoothed quantile regression (Fernandes, Guerre, and Horta 2021) does not have the good property of zero smoothing bias shown in Theorem 1. In fact, the proof of Theorem 1 crucially relies on the fact that the distributions of ε − ε' and x − x' are symmetric about zero, which can only be taken advantage of given the special form of rank regression.
Since the canonical convoluted rank regression (2.3) is newly proposed, we further study its asymptotic efficiency in the fixed dimension setting. Although the fixed dimension setting is not the main focus of this article, such a discussion helps justify the merit of (2.3) and the worthiness of extending it to the high-dimensional setting in the upcoming sections. Let β̂ be the rank regression estimator, that is, the minimizer in problem (2.2). It is well known that under fixed dimension the rank regression estimator enjoys the following asymptotic normality (Jaeckel 1972; Hettmansperger and McKean 2010):

√n (β̂ − β*) →_d N(0, [12(∫ f²(e) de)²]^{-1} [cov(x)]^{-1}),   (2.4)

where f(·) is the density of ε and cov(x) is the covariance matrix of x, which is assumed to be positive definite. Denote by β̂_h = arg min_β L_h(β) the canonical convoluted rank regression estimator, that is, the minimizer in problem (2.3). The next theorem shows that under fixed dimension β̂_h also enjoys asymptotic normality.
Theorem 2. For each h > 0, √n(β̂_h − β*) converges in distribution to a normal distribution whose variance–covariance matrix equals that in (2.4) up to a numeric factor depending on h and f.

To compare β̂_h with β̂, it is natural to borrow the idea of relative efficiency from the literature on robust statistics. For two estimators ê_1 and ê_2 of the same target e* that are both asymptotically normal at the √n rate, the relative efficiency of ê_1 with respect to ê_2 is the ratio of the asymptotic variance of ê_2 to that of ê_1. When the relative efficiency of ê_1 with respect to ê_2 is greater (or smaller) than 1, ê_1 is said to be more (or less) efficient than ê_2. For more details, the reader is referred to Hettmansperger and McKean (2010). We hereby study the relative efficiency of β̂_h with respect to β̂. From Theorem 2 and (2.4) we see that the asymptotic variance–covariance matrices of β̂_h and β̂ only differ by a numeric factor. We define ARE(β̂_h, β̂) as the asymptotic relative efficiency of β̂_h with respect to β̂, which is the ratio of the corresponding numeric factors. Intuitively, one expects ARE(β̂_h, β̂) to be very close to 1 when h is very small. The following theorem provides a rigorous quantification of this intuition.
Theorem 3 has several implications. First, it implies lim_{h→0+} ARE(β̂_h, β̂) = 1, meaning that in terms of estimation efficiency, convoluted rank regression performs arbitrarily close to rank regression if one chooses h sufficiently close to zero. Nevertheless, zero might not be the optimal choice for h. Theorem 3 confirms this point by characterizing the local behavior of ARE(β̂_h, β̂) near zero. The constant c(f), which is defined through an integral of the error density f, can be positive. In such cases, 1 is the infimum value of ARE(β̂_h, β̂) in a neighborhood of zero. For example, when the Epanechnikov kernel is used and the error distribution is normal, we can confirm that the constant c(f) is positive, which implies that ARE(β̂_h, β̂) > 1 for sufficiently small positive h, and hence the convoluted rank regression is asymptotically more efficient than the rank regression.
When h is away from zero, we do not have a unified analysis of ARE(β̂_h, β̂). On the other hand, we can directly compute the value of ARE(β̂_h, β̂) for different h with the Epanechnikov kernel for a given error distribution by using numerical integration tools in Matlab. The results are shown in Table 1, where we consider three error distributions. There are two important observations. First, we see that ARE(β̂_h, β̂) is very close to one in all cases in Table 1, implying that the performance of convoluted rank regression is insensitive to the choice of h. In smoothed quantile regression (Fernandes, Guerre, and Horta 2021; He et al. 2021; Tan et al. 2022) the smoothing parameter h has to be small. This is clearly not the case in convoluted rank regression. In the next section we show that h can be away from zero for high-dimensional sparse CRR too. Second, in the case of a normal error distribution, the convoluted rank regression has the potential of being more efficient than rank regression. Therefore, it is worth emphasizing that convoluted rank regression should be viewed as a new methodology and not an approximate solution of rank regression.

Sparse Convoluted Rank Regression
When p is large, we consider designing the estimator under a sparsity assumption that β* in the data generating model has many zero components. Let A = {j : β*_j ≠ 0} be the support set of β*, that is, the set of indices of the important covariates. Let s = |A|. Throughout this article, we allow p = p_n and s = s_n to diverge with n, and we assume s_n ≥ 1 and that p_n goes to infinity as n goes to infinity. For convenience, we still use p and s to represent these quantities since no confusion is caused. In ultra-high dimensions, the dimension p is allowed to increase exponentially with the sample size n, and we assume that s is of smaller order than n. Otherwise, no consistent estimator is possible.
To estimate β*, we propose the sparse convoluted rank regression (CRR) estimator by solving

min_β L_h(β) + Σ_{j=1}^p p_λ(|β_j|).

Here p_λ(·) is some sparsity-inducing penalty function with a tuning parameter λ > 0, and L_h(·) is the convoluted rank loss defined in (2.3).

Theoretical Justifications for Sparse CRR
In this section we study the theoretical properties of the ℓ1-penalized convoluted rank regression (CRR) and the folded concave penalized CRR under the same key regularity conditions as for the rank regression in Wang et al. (2020).

ℓ1-Penalized CRR
For a tuning parameter λ_0 > 0, we define the ℓ1-penalized CRR estimator as

β̂_{λ_0} := arg min_β L_h(β) + λ_0 ‖β‖_1.

We now state the assumptions needed throughout this article. We make the following assumptions for the kernel function K(·).
For the error distribution, we impose the following assumption.
Assumption 1. The errors {ε_i}_{i=1}^n are independent and identically distributed with density function f(·) with respect to the Lebesgue measure on R. Besides, let ς_ij := ε_i − ε_j and let g(·) be the probability density function of ς_ij. We assume |g(x) − g(y)| ≤ L_1|x − y|^{α_1} for all x, y ∈ R and some constants L_1 > 0 and α_1 ∈ (0, 1]. This implies that μ_0 := sup_{t∈R} g(t) < ∞. Meanwhile, we assume there exist positive constants δ, . . ., where A is the support set of β*. We also impose the following conditions on the design matrix.
Assumption 2. There exists a constant M > 0 bounding the entries of the design matrix, that is, max_{i,j} |x_{ij}| ≤ M.

Assumption 3. There exists a constant ρ > 0 such that the restricted eigenvalue (RE) condition holds with constant ρ. In particular, this implies that λ_min(X_A^T X_A / n) is bounded away from zero.

Assumption 4. There exists a constant ψ > 0 such that the generalized restricted nonlinearity condition holds with constant ψ.

Assumption 2 is common in the fixed design case. It can be relaxed by letting M increase with n at a suitable rate, without much difficulty. We keep it here for the sake of brevity. We can center the design matrix when estimating the β* vector because centering only affects the intercept part, which is a nuisance parameter in our method as well as in rank regression. Assumption 3, known as the restricted eigenvalue (RE) condition, is needed to establish an ℓ2-type error bound for the ℓ1-penalized estimator. It is a commonly used assumption in the literature (Bühlmann and Van De Geer 2011; Fan et al. 2020). Assumption 4 is similar in spirit to the restricted nonlinearity assumption in Belloni and Chernozhukov (2011). The restricted nonlinearity assumption corresponds to α_1 = 1 here and is thus stronger than our assumption. The same type of condition was also considered in Gu and Zou (2020). We refer to the quantity ψ as the generalized restricted nonlinear impact coefficient; it is needed to obtain tight bounds when handling the second-order derivative of our loss function.
Theorem 4. Assume Assumptions 1–4 hold, and n ≥ 2, p ≥ 2. Let 0 < λ_0 ≤ c_0 √(log p / n), where c_0 is some positive constant. Then, with probability at least 1 − 2p exp{−nλ_0²/(C M²)} for an absolute constant C > 0, the ℓ1-penalized CRR estimator satisfies ‖β̂_{λ_0} − β*‖_2 ≤ C' √s λ_0 for some constant C'.

Notice that the probabilistic lower bound in Theorem 4 does not depend on unknown quantities, since with the design matrix at hand, M, n, and p are all available. This means that, in principle, the λ_0 in the ℓ1-penalized CRR estimator is tuning-free. Under the condition s log p / n = o(1), as long as one chooses h = O(1) and λ_0 = c_0 √(log p / n) with c_0 > 8√2 M, the probabilistic lower bound in Theorem 4 converges to 1, and as a result ‖β̂_{λ_0} − β*‖_2 = O_p(√(s log p / n)), which means the ℓ1-penalized CRR estimator achieves the near-optimal rate.

Folded Concave Penalized CRR
It has been well established in the literature that folded concave penalized estimators can enjoy a strong oracle property. We prove the same is true for convoluted rank regression. Define

β̂^{ora} := arg min_{β : β_{A^c} = 0} L_h(β)   (3.1)

as the CRR oracle estimator. It can be directly verified that β̂^{ora} exists due to the convexity of L_h(·) and Assumptions 2 and 3. We establish the following property for the oracle estimator.
Theorem 5. Assume Assumptions 1–4 hold, s = o(n), and h = O(1). Then ‖β̂^{ora} − β*‖_2 = O_p(√(s/n)). In particular, for any ξ > 0, there exists a constant r_ξ > 0 such that P(‖β̂^{ora} − β*‖_2 ≤ r_ξ √(s/n)) ≥ 1 − ξ for all sufficiently large n.

Remark 3. In the case where β̂^{ora} is not unique, one may take any solution to (3.1), and our theory about the CRR oracle estimator still holds.
We now propose the folded concave penalized convoluted rank regression. It solves the following problem:

min_β L_h(β) + Σ_{j=1}^p p_λ(|β_j|).   (3.2)

For the choice of p_λ(·), we adopt the folded concave penalty class (Fan, Xue, and Zou 2014b) with some pre-specified constant a > a_2; here a_1 and a_2 are two fixed positive constants appearing in the definition of this class. Special cases of the folded concave penalty are SCAD (Fan and Li 2001) and MCP (Zhang 2010). The SCAD penalty is defined through its derivative

p'_λ(t) = λ { I(t ≤ λ) + [(aλ − t)_+ / ((a − 1)λ)] I(t > λ) },  t ≥ 0,

which corresponds to a_1 = a_2 = 1. The MCP penalty function is defined through

p'_λ(t) = (λ − t/a)_+,  t ≥ 0,

which corresponds to a_1 = 1 − 1/a and a_2 = 1. We adopt the local linear approximation (LLA) algorithm (Zou and Li 2008) to solve (3.2). The LLA algorithm iteratively solves

β̂^{(m+1)} = arg min_β L_h(β) + Σ_{j=1}^p p'_λ(|β̂_j^{(m)}|) |β_j|,  m = 0, 1, . . .,   (3.3)

where β̂^{(0)} is some initial estimator. We use β̂_λ to denote the folded concave penalized CRR estimator computed by the LLA algorithm with tuning parameter λ. Below we establish theory for the folded concave penalized CRR estimator.
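Since the LLA step only needs the penalty derivative, the following minimal sketch (Python; the names are illustrative, and `weighted_l1_crr_solver` stands in for the GCD routine of Section 4) shows the standard SCAD and MCP derivatives and how one iteration of (3.3) reduces to a weighted ℓ1-penalized CRR problem.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lam(t) of the SCAD penalty for t >= 0 (Fan and Li 2001)."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def mcp_deriv(t, lam, a=3.0):
    """Derivative p'_lam(t) of the MCP penalty for t >= 0 (Zhang 2010)."""
    return np.maximum(lam - np.abs(t) / a, 0.0)

def lla_iteration(beta_current, lam, weighted_l1_crr_solver, penalty_deriv=scad_deriv):
    """One LLA step (3.3): linearize the folded concave penalty at the current iterate,
    giving a weighted l1 problem with per-coordinate weights p'_lam(|beta_j|) / lam."""
    w = penalty_deriv(beta_current, lam) / lam
    return weighted_l1_crr_solver(weights=w, lam=lam)
```

With a zero initial value, the first LLA step has all weights equal to one (since p'_λ(0) = λ for both SCAD and MCP), so it reduces to an ℓ1-penalized CRR fit; this is the initialization used in part (ii) of Theorem 6.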
Theorem 6. Let the conditions of Theorems 4 and 5 hold. Let a_0 = min{1, a_2}, where a_2 is the constant associated with the folded concave penalty function, and let ξ > 0 be any constant. Assume min_{j∈A} |β*_j| > (a + 1)λ, and let h satisfy a lower bound condition expressed in terms of ρ, n, λ, and the constant r_ξ defined in Theorem 5. Choose the tuning parameter λ in accordance with M, n, and p. Then (i) the LLA algorithm in (3.3) initialized by β̂^{(0)} = β̂_{λ_0}, with λ_0 as described in Theorem 4, converges to β̂^{ora} in two iterations with probability tending to one; (ii) with the SCAD or MCP as the penalty function and the tuning parameter chosen such that λ ≤ c_0 √(log p / n), the LLA algorithm in (3.3) initialized by β̂^{(0)} = 0 converges to β̂^{ora} in three iterations with probability tending to one. Theorem 6 can be easily translated into an asymptotic statement that the folded concave penalized CRR estimator equals the CRR oracle estimator with overwhelming probability, which is typically referred to as the strong oracle property. We omit the details of such a translation for brevity. The theorem suggests that our estimator can perform as well as if the true set of important covariates were given.
Remark 4. In Theorems 4 and 5, we only require h = O(1), and in Theorem 6 an h of constant order is also sufficient. These are weaker conditions on the smoothing bandwidth h than those required in smoothed quantile regression (Fernandes, Guerre, and Horta 2021; He et al. 2021; Tan et al. 2022). In particular, Fernandes, Guerre, and Horta (2021), which studied smoothed quantile regression in the low-dimensional setting, requires h to shrink to zero at a rate tied to n^{1/4} to achieve the strong oracle property. Our relaxed conditions on h are consequences of the special form of rank regression, which makes important first-order terms vanish, as well as our careful theoretical analysis, as can be seen from our proofs. Finally, although the formulation of the convoluted rank regression loss is based on the kernel density interpretation, the choice of h reveals that SCRR is different from kernel density estimation, in which h should be small, such as h = O(n^{−1/5}), to achieve good performance. In SCRR, the requirement on h is much more relaxed. Again, when h is not close to zero, we should not view sparse convoluted rank regression as an approximate solution to sparse rank regression.

Consistent Tuning Parameter Selection
For the folded concave penalization, Theorem 6 guarantees that there exists a good tuning parameter in principle. Since this tuning parameter depends on unknown quantities, a data-driven approach is needed to specify the tuning parameter in practice. Motivated by Wang et al. (2013), we propose a modified high-dimensional Bayesian information criterion, defined as

HBIC(λ) = log(L_h(β̂_λ)) + |M_λ| (C_n log p) / n,

where M_λ := {j : β̂_{λ,j} ≠ 0}, and the choice of C_n is discussed in Theorem 7. The tuning parameter of the folded concave penalty is then chosen by minimizing the proposed HBIC.
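A small sketch of the resulting tuning procedure is given below (Python; a minimal sketch assuming the HBIC form displayed above, with illustrative function names).

```python
import numpy as np

def hbic(loss_at_fit, support_size, n, p, C_n):
    """HBIC(lambda) = log(fitted CRR loss) + |M_lambda| * C_n * log(p) / n  (assumed form)."""
    return np.log(loss_at_fit) + support_size * C_n * np.log(p) / n

def select_lambda(lambdas, betas, losses, n, p, K_n, C_n=None):
    """Minimize HBIC over the set {lambda : |M_lambda| <= K_n}, as in Theorem 7."""
    if C_n is None:
        C_n = np.log(np.log(n))            # C_n proportional to log log n (Remark 5)
    best_lam, best_val = None, np.inf
    for lam, beta_hat, loss in zip(lambdas, betas, losses):
        size = int(np.count_nonzero(beta_hat))
        if size > K_n:
            continue
        val = hbic(loss, size, n, p, C_n)
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam
```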
Theorem 7. Let λ̂ = arg min_{λ ∈ Λ} HBIC(λ), where Λ = {λ > 0 : |M_λ| ≤ K_n} and K_n > s is allowed to diverge to infinity. Under the conditions of Theorem 6, assume that min_{|S| ≤ 2K_n} λ_min(X_S^T X_S / n) > 0 and that C_n diverges to infinity at a suitable rate. Then P(M_{λ̂} = A) → 1 as n → ∞.

Remark 5. The condition min_{|S| ≤ 2K_n} λ_min(X_S^T X_S / n) > 0 in Theorem 7 is known as the sparse Riesz condition and is widely used in the literature on high-dimensional statistics (Zhang and Huang 2008). In our numerical studies, the sequence C_n is chosen to be proportional to log log n. This is the same choice as in the HBIC for the penalized rank regression (Wang et al. 2020).
Theorem 7 shows that with the proposed HBIC, our method can exactly identify the important variables with probability approaching one. Unlike cross-validation, the HBIC criterion does not require sample splitting or repeated evaluation of the test error on each sub-dataset. As a result, our method requires no extra computation for tuning.

Computation
We have shown that we need to solve the folded concave penalized CRR by running the LLA iteration two to three times. In each LLA iteration, we need to solve a weighted ℓ1-penalized CRR problem. In this section, we develop an efficient algorithm for computing the solution path of a weighted ℓ1-penalized CRR.
Consider the following "weighted" ℓ1-penalized CRR problem:

arg min_β L_h(β) + λ Σ_{k=1}^p w_k |β_k|,   (4.1)

where each w_k ≥ 0. In contrast to the sparse rank regression, the density convolution gives a smooth loss function L_h. To see this, recall from (2.3) that

L_h(β) = (1/(n(n−1))) Σ_{i≠j} ∫_R |y_i − y_j − (x_i − x_j)^T β − t| K_h(t) dt.

We thus establish some basic properties of L_h(·).
Lemma 1. For any h > 0, L_h(·) is convex and continuously differentiable, and its derivative is Lipschitz continuous.

Therefore, the objective function in problem (4.1) is the sum of a convex, smooth loss function and a convex, separable penalty term. It turns out that a coordinate descent-type algorithm usually works well in this situation (Tseng 2001).
In a coordinate-wise manner, suppose we have updated the coordinates β_1, β_2, . . ., β_{k−1} and we now need to update β_k. Denote by β̃ the current solution and let v_ij = y_i − y_j − (x_i − x_j)^T β̃. The standard coordinate descent algorithm cyclically updates β_k by minimizing the objective of (4.1) viewed as a function of β_k alone, denoted F(β_k | β̃), with all other coordinates held fixed at their current values. We observe that minimizing this function does not have a closed-form solution, so we consider a generalized coordinate descent algorithm (Yang and Zou 2013). The idea is to perform a majorization–minimization update rather than directly minimize F(β_k | β̃). Specifically, we need to find a quadratic function that majorizes F(β_k | β̃) at β̃_k. From the last inequality of Lemma 1, we can obtain a quadratic majorization condition for CRR that holds for all t_1 ≠ t_2. For each pair with i ≠ j, applying this condition to v_ij yields the quadratic majorization function for F(β_k | β̃), whose minimizer has a closed form given by a soft-thresholding update of β̃_k. Therefore, we solve problem (4.1) by cyclically performing the above update for each k = 1, 2, . . ., p.
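The sketch below spells out one such pass in Python (our own illustration under the convolution form (2.3) with the Epanechnikov kernel; the per-coordinate majorization constant uses the bound 1.5/h on the second derivative of the smoothed absolute value, and all names are illustrative rather than the authors' implementation).

```python
import numpy as np

def smoothed_sign_epa(v, h):
    """Derivative of the Epanechnikov-smoothed absolute value: sign(v) for |v| >= h,
    and 1.5*(v/h) - 0.5*(v/h)**3 for |v| < h."""
    t = np.clip(v / h, -1.0, 1.0)
    return np.where(np.abs(v) >= h, np.sign(v), 1.5 * t - 0.5 * t**3)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def gcd_pass(beta, X, y, h, lam, w):
    """One cycle of majorize-minimize coordinate updates for the weighted
    l1-penalized CRR problem (4.1); O(n^2 p) per pass, kept naive for clarity."""
    n, p = X.shape
    npairs = n * (n - 1)
    r = y - X @ beta
    v = r[:, None] - r[None, :]                  # v_ij = y_i - y_j - (x_i - x_j)^T beta
    for k in range(p):
        dxk = X[:, k][:, None] - X[:, k][None, :]        # x_ik - x_jk
        grad_k = -np.sum(smoothed_sign_epa(v, h) * dxk) / npairs
        gamma_k = (1.5 / h) * np.sum(dxk**2) / npairs + 1e-12   # majorization constant
        new_bk = soft_threshold(beta[k] - grad_k / gamma_k, lam * w[k] / gamma_k)
        delta = new_bk - beta[k]
        if delta != 0.0:
            v -= delta * dxk                     # keep residual differences up to date
            beta[k] = new_bk
    return beta
```

Each pass costs O(n²p) operations in this naive form; the soft-thresholding update is the essence of the GCD step.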
In our implementation, we directly compute the solution path of problem (4.1) at a sequence of tuning parameters λ^{[1]} > λ^{[2]} > · · · > λ^{[L]}, instead of calling the algorithm L times for each individual parameter. We let λ^{[1]} = max_{k: w_k > 0} |∂L_h(β)/∂β_k|_{β=0}| / w_k, which is the smallest penalization parameter that makes all β̂_k = 0. We then choose the other λ-values such that they are uniformly spaced on a logarithmic scale. In addition, we also employ the warm start and active set strategies to further accelerate the GCD algorithm; see the details of these two strategies in Friedman, Hastie, and Tibshirani (2010).
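Reusing `smoothed_sign_epa` from the previous sketch, the starting value λ^{[1]} and the log-spaced grid can be computed as follows (again a hedged illustration; the unit-weight case w_k = 1 corresponds to the plain ℓ1 path).

```python
import numpy as np

def lambda_grid(X, y, h, w, L=100, ratio=0.01):
    """Smallest lambda keeping beta = 0 optimal for (4.1) (KKT condition at zero),
    followed by a log-spaced decreasing grid of length L."""
    n, p = X.shape
    npairs = n * (n - 1)
    dy = y[:, None] - y[None, :]                 # v_ij at beta = 0
    s = smoothed_sign_epa(dy, h)
    grad0 = np.array([-np.sum(s * (X[:, k][:, None] - X[:, k][None, :])) / npairs
                      for k in range(p)])
    lam_max = np.max(np.abs(grad0) / np.maximum(w, 1e-12))   # guard against zero weights
    return np.geomspace(lam_max, ratio * lam_max, num=L)
```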
To accelerate the computation of the folded concave penalized CRR in problem (3.2), we further integrate the LLA algorithm introduced in (3.3) into the GCD algorithm. The coefficients are solved in a pathwise fashion, and some repeated computation can thus be avoided. A detailed step-by-step algorithm is presented in Algorithm 1. It is worth noting that a similar algorithm was considered in Gu and Zou (2016). Our algorithm converges at least linearly, which can be proved in the same way as the iteration complexity analysis in Gu and Zou (2016). For the sake of space, we do not repeat such a discussion in this article.

Simulation Study
In this section, we demonstrate the performance of the sparse convoluted rank regression in terms of estimation accuracy and variable selection using simulations. Because the most attractive property of rank regression is its efficiency, we focus on estimators with strong oracle properties, namely the SCAD-penalized convoluted rank regression (denoted by CRR-SCAD) and the SCAD-penalized rank regression (denoted by RR-SCAD). We use the zero vector as the initial value in the LLA algorithm for computing CRR-SCAD, so that we do not have to compute the ℓ1-penalized CRR in order to compute CRR-SCAD. We used the code from Wang et al. (2020) to compute RR-SCAD. In our numerical studies, we used the Epanechnikov kernel K(u) = (3/4)(1 − u²)I(−1 ≤ u ≤ 1) as the density convolution kernel, where I(·) is the indicator function, and the loss function is the convoluted rank loss (2.3) evaluated with this kernel.

Algorithm 1: Local linear approximation algorithm for solving problem (3.2).

To fix ideas, we set h = 1 in the simulations. We do not use a much smaller h in order to highlight that the convoluted rank loss is not an approximation to the rank regression loss. One can use cross-validation to select h by treating h as another tuning parameter. We opt not to do this to avoid the possibility that the improvement of sparse convoluted rank regression over sparse rank regression may come from this additional tuning. We also tried several different values of h, and the overall results are very similar. These additional simulation results are presented in the supplementary materials. Both RR-SCAD and CRR-SCAD are tuned based on HBIC. For comparison, we also include the SCAD-penalized least squares (denoted by LS-SCAD) and tune it by its corresponding HBIC (Wang et al. 2013).
We consider the model y = x^T β + ε, where x is independently generated from N(0, Σ) and ε is independently generated from certain error distributions specified below. We fix the sample size n = 100 and use the dimensions p = 400 and 3000. We consider four settings for the correlation structure of x: CS(0.2), CS(0.5), AR(0.5), and AR(−0.5), where CS(ρ) denotes the compound symmetry correlation, that is, Σ_{ij} = ρ if i ≠ j and Σ_{ii} = 1, and AR(ρ) denotes the autoregressive correlation, that is, Σ = (ρ^{|i−j|})_{p×p}.
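The covariance structures above are easy to reproduce; the sketch below (Python) generates data accordingly. The nonzero pattern of β and the seed are illustrative placeholders, since the exact coefficient vector used here is not restated.

```python
import numpy as np

def make_sigma(p, kind="AR", rho=0.5):
    """CS(rho): rho off-diagonal, 1 on the diagonal.  AR(rho): rho^|i-j|."""
    if kind == "CS":
        return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(n=100, p=400, kind="AR", rho=0.5, error="normal", seed=0, beta=None):
    """Generate (y, X, beta) from y = x^T beta + eps with the designs in the text."""
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = np.zeros(p)
        beta[:5] = 1.0                       # illustrative sparse signal, not the paper's
    X = rng.multivariate_normal(np.zeros(p), make_sigma(p, kind, rho), size=n)
    if error == "normal":
        eps = rng.standard_normal(n)
    elif error == "mixture":                 # 0.95 N(0,1) + 0.05 N(0,100)
        big = rng.random(n) < 0.05
        eps = rng.standard_normal(n) * np.where(big, 10.0, 1.0)
    else:                                    # eps / sqrt(2) ~ t_4
        eps = np.sqrt(2.0) * rng.standard_t(df=4, size=n)
    return X @ beta + eps, X, beta
```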
We compare these methods based on five criteria: the ℓ1 error E‖β̂ − β‖_1, the ℓ2 error E‖β̂ − β‖_2, the model error E[(β̂ − β)^T Σ (β̂ − β)], the number of false positive variables, and the number of false negative variables. All quantities are averaged over 200 independent runs, and the standard errors are provided. To compare computational efficiency, we also include the run time. Table 2 exhibits the simulation results when ε is drawn from N(0, 1). In each setting, we use boldface to indicate the best performance evaluated under each of the five criteria. When p = 400, we observe that the estimation accuracies of LS-SCAD and CRR-SCAD are similar and better than that of RR-SCAD; when p = 3000, the estimation accuracy of CRR-SCAD is the best. In addition, both LS-SCAD and CRR-SCAD have perfect performance in variable selection, and RR-SCAD is the only method that makes mistakes. By comparing the performance of CRR-SCAD at p = 400 and p = 3000, we see that CRR-SCAD is less affected by the increase in p.
Table 3 summarizes the simulation results when ε follows a mixture normal distribution: ε ∼ 0.95N(0, 1) + 0.05N(0, 100). From Table 3, we find that LS-SCAD fails to work well in this situation. For both p = 400 and p = 3000, RR-SCAD and CRR-SCAD perform similarly. Table 4 shows the results when ε/√2 follows the t-distribution with four degrees of freedom. In all settings, CRR-SCAD performs better than the other two methods in terms of both estimation accuracy and variable selection. When p is increased from 400 to 3000, CRR-SCAD suffers minimal impact, while RR-SCAD shows a significant loss in estimation accuracy.

A Real Data Application
We illustrate our proposed method on microarray gene expression data reported by Scheetz et al. (2006). The dataset contains RNA expression levels of more than 31,000 gene probes from 120 twelve-week-old laboratory rats. Following Scheetz et al. (2006), we include 18,976 genes that have sufficient variation and are considered expressed in mammalian eyes. Among these genes, TRIM32 is known to influence a rare genetic disorder, the Bardet–Biedl syndrome (Chiang et al. 2006). Thus, TRIM32 is chosen as the target variable, and our goal is to identify the genes that are associated with TRIM32.
In our experiments, we randomly split the original data into a training set and a test set in the ratio 1:1. On the training set, we apply the fused Kolmogorov filter (Mai and Zou 2015) to obtain a reduced set of 300 probes, and we retain the same 300 probes on the test set. We then fit the SCAD-penalized least squares (LS-SCAD), rank regression (RR-SCAD), and our convoluted rank regression (CRR-SCAD) on the training set and compute the prediction error on the test set. To examine the performance in higher dimensions, we repeat the same procedure except that the reduced set from the fused Kolmogorov filter has 5000 probes.
Based on 200 random partitions, we report the prediction error and run time in Table 7. We observe that CRR-SCAD has the lowest prediction error, whereas LS-SCAD has the highest. When p grows from 300 to 5000, both RR-SCAD and CRR-SCAD become more accurate; this may be because some important variables are discarded when the screening step keeps only 300 probes. In terms of speed, we see that the smoothed rank loss offers obvious benefits in computational efficiency: CRR-SCAD is as fast as LS-SCAD and is about two orders of magnitude faster than RR-SCAD. LS-SCAD is implemented in a standard way by using the LLA algorithm with the glmnet package. When we implemented CRR-SCAD, we made some effort to integrate the GCD and LLA algorithms by avoiding some repeated computation; thus our CRR-SCAD is even faster than LS-SCAD when p = 5000. Without such implementation effort, our CRR-SCAD would be slower than LS-SCAD.

Discussion
In this article we have developed a new method called convoluted rank regression (CRR), based on which we have further developed the sparse convoluted rank regression (SCRR) for high-dimensional data. Although the main idea is motivated by kernel density convolution smoothing, our theory shows that the smoothing parameter can be away from zero and the performance of the proposed method is insensitive to the choice of the smoothing parameter. Unlike smoothed quantile regression, which is a smoothed approximation to quantile regression, CRR and SCRR should not be viewed as smoothed approximations to rank regression or sparse rank regression, respectively. More interestingly, the theoretical and empirical results in this work suggest that sparse convoluted rank regression enjoys the best of both sparse rank regression and sparse least squares.
There are some further technical questions for investigation. For example, the recent paper by Bellec, Lecué, and Tsybakov (2018) proved that using a lasso penalty of value √(log(ep/s)/n), the lasso least squares estimator can achieve the minimax rate of √(s log(p/s)/n) for ℓ2 recovery under sub-Gaussian errors. Note that this rate is slightly better than our rate, √(s log p / n), for the lasso CRR. Although we do not assume sub-Gaussian errors, it would be interesting to see whether the lasso CRR can achieve the same rate with or without the sub-Gaussian error assumption. We leave this question for future study. We have used the HBIC for tuning in theory and in practice, and it works well for our method. Another possible tuning criterion could be the predictive information criterion in She and Tran (2019), where the authors studied sparse reduced rank regression. It is also good to have more tools for tuning, and we leave this as another topic for future study.

Table 1. ARE(β̂_h, β̂) for different h under three error distributions.
Table 2. Comparison of least-squares regression with SCAD (LS-SCAD), rank regression with SCAD (RR-SCAD), and convoluted rank regression with SCAD (CRR-SCAD). The comparison criteria are the ℓ1 error, ℓ2 error, model error (ME), number of false positive variables (FP), number of false negative variables (FN), and run time (in seconds). In each example, the best method evaluated under each criterion is in boldface. All quantities are averaged over 200 independent runs, and standard errors are given in parentheses. In all the examples shown in this table, the error term in the data generating model is drawn from the standard normal distribution.

Table 3. Comparison of least-squares regression with SCAD (LS-SCAD), rank regression with SCAD (RR-SCAD), and convoluted rank regression with SCAD (CRR-SCAD). The comparison criteria are the ℓ1 error, ℓ2 error, model error (ME), number of false positive variables (FP), number of false negative variables (FN), and run time (in seconds). In each example, the best method evaluated under each criterion is in boldface. All quantities are averaged over 200 independent runs, and standard errors are given in parentheses. In all the examples shown in this table, the error term in the data generating model follows a mixture normal distribution: ε ∼ 0.95N(0, 1) + 0.05N(0, 100).
The comparison criteria are the ℓ1 error, ℓ2 error, prediction error (PE), number of false positive variables (FP), number of false negative variables (FN), and run time (in seconds). In each example, the best method evaluated under each criterion is in boldface. All quantities are averaged over 200 independent runs, and standard errors are given in parentheses. All the examples in this table have 15 nonzero coefficients.

Table 6. Comparison of least-squares regression with SCAD (LS-SCAD), rank regression with SCAD (RR-SCAD), and sparse convoluted rank regression with SCAD (SCRR-SCAD). The comparison criteria are the ℓ1 error, ℓ2 error, prediction error (PE), number of false positive variables (FP), number of false negative variables (FN), and run time (in seconds). In each example, the best method evaluated under each criterion is in boldface. All quantities are averaged over 200 independent runs, and standard errors are given in parentheses. All the examples in this table have discrete predictors.

Table 7. Real data analysis. Comparison of prediction error and run time using least-squares regression with SCAD (LS-SCAD), rank regression with SCAD (RR-SCAD), and convoluted rank regression with SCAD (CRR-SCAD). The data are split into a training set and a test set in the ratio 1:1, and the fused Kolmogorov filter is applied to reduce the dimension to 300 and 5000. All quantities are averaged over 200 random partitions. The lowest prediction errors are in boldface, and standard errors are given in parentheses.