Regularized Linear Programming Discriminant Rule with Folded Concave Penalty for Ultrahigh-Dimensional Data

Abstract We propose the regularized linear programming discriminant (LPD) rule with folded concave penalty in the ultrahigh-dimensional regime. We use the local linear approximation (LLA) algorithm to convert the folded concave penalized model into a weighted ℓ1 model. The strong oracle property of the solution constructed by the one-step LLA algorithm is verified. In addition, we propose efficient and parallelizable algorithms based on feature-space splitting to address the computational challenges posed by ultrahigh dimensionality. The proposed feature-split algorithm is compared to existing methods through both numerical simulations and applications to real data examples. The numerical comparisons suggest that the proposed method works well for ultrahigh dimensions, while the linear programming solver and the alternating direction method of multipliers (ADMM) algorithm may fail at such dimensions. Supplementary materials for this article are available online.


Introduction
High-dimensional data analysis is fundamental in many research areas such as genome-wide association studies, finance, tumor classification, and biomedical imaging. Analyzing high-dimensional data is challenging in many statistical applications because high dimensionality not only increases the computational time and space required for processing the data, but also renders traditional statistical methods inappropriate. For example, the presence of many irrelevant or redundant features causes over-fitting and a lack of generalization for many traditional statistical methods. Specifically, in this article, we consider the problem of classification, which is important and has drawn considerable attention in recent years (Bickel and Levina 2004; Fan and Fan 2008; Shao et al. 2011; Mai and Zou 2013).
When p ≫ n, the sample covariance matrix Σ̂ is singular and its inverse does not exist. One natural remedy is to use some generalized inverse to approximate the inverse of the sample covariance matrix. However, most such estimates are highly biased and unstable, and will lead to a classifier with poor performance when either p is large or the correlations among variables are nontrivial. A naive method in this case is to replace Σ̂ with its diagonal, which ignores the dependence among the variables. This independence rule is equivalent to the naive Bayes rule (Bickel and Levina 2004). To overcome the curse of dimensionality, it is natural to impose structural assumptions on the parameters of the discriminant rule. Fan and Fan (2008) assumed that δ is sparse and proposed the features annealed independence rule by applying the independence rule to a set of selected important features of δ. This rule fails when the correlations between the variables are significant. To ensure consistent estimation of Σ and δ, one often needs to assume that both of them are sparse. Shao et al. (2011) proposed using a thresholding procedure to estimate Σ and δ, respectively, and then to carry out rule (1). An important observation is that (1) depends on Ω and δ only through the product Ωδ. Inspired by this fact, Cai and Liu (2011) proposed an alternative solution, the Linear Programming Discriminant (LPD) rule, by assuming sparsity of the product of Ω and δ. Specifically, LPD estimates the product Ωδ directly through the constrained ℓ1 minimization

min_β ‖β‖1 subject to ‖Σ̂β − δ̂‖∞ ≤ λ, (2)

where β = Ωδ, λ is a tuning parameter, and ‖·‖1 and ‖·‖∞ are the ℓ1 and ℓ∞ norms of vectors, respectively. Problem (2) is equivalent to a linear program with constraints |Σ̂jᵀβ − (μ̂1j − μ̂2j)| ≤ λ, j = 1, …, p, where Σ̂j is the jth column of Σ̂ and μ̂ij is the jth element of μ̂i, i = 1, 2.
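For concreteness, problem (2) can be passed to an off-the-shelf solver by splitting β into its positive and negative parts. Below is a minimal sketch using the R package lpSolve; the function name lpd_lp is ours, and Sigma_hat and delta_hat are assumed to be the sample covariance matrix and mean-difference vector computed elsewhere.

```r
# Minimal sketch: solve the LPD problem (2) as a linear program with
# beta = beta_plus - beta_minus, where beta_plus, beta_minus >= 0
# (lpSolve assumes nonnegative variables by default).
library(lpSolve)

lpd_lp <- function(Sigma_hat, delta_hat, lambda) {
  p <- length(delta_hat)
  # Objective: sum(beta_plus) + sum(beta_minus) = ||beta||_1.
  obj <- rep(1, 2 * p)
  # Constraints: -lambda <= Sigma_hat %*% beta - delta_hat <= lambda.
  A <- rbind(cbind(Sigma_hat, -Sigma_hat),   #  Sigma_hat beta <= delta_hat + lambda
             cbind(-Sigma_hat, Sigma_hat))   # -Sigma_hat beta <= lambda - delta_hat
  rhs <- c(delta_hat + lambda, lambda - delta_hat)
  sol <- lp("min", obj, A, rep("<=", 2 * p), rhs)
  sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]  # recover beta
}
```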
In this formulation, the sparsity assumption is imposed on Ωδ instead of on δ and Ω individually.
It has been shown that β can be estimated directly through the constrained ℓ1 minimization in (2) even when δ and/or Ω cannot be well estimated individually (Cai and Liu 2011). However, it is well known that the ℓ1 penalty tends to over-penalize large coefficients and thus does not possess the desirable oracle property (Fan and Li 2001). In recent years, the folded concave regularization method has received considerable attention since it does not require the irrepresentable condition to achieve variable selection consistency and can correct the intrinsic estimation bias of the ℓ1 regularization method (Fan and Li 2001; Zhang 2010; Fan, Xue, and Zou 2014). Motivated by this, we study the LPD rule regularized by a folded concave penalty in this article and verify the strong oracle property of its estimator obtained by the one-step local linear approximation (LLA) algorithm (Zou and Li 2008) in Section 2. We further establish the convergence of the misclassification rate of our new method to the Bayes misclassification rate.
Conventional methods for solving (2) rely on linear programming (LP), for instance, interior point methods or simplex-based methods. However, as the volume and dimension of data used for statistical learning increase, the efficiency of LP is severely affected. With the advancement of distributed computing resources, distributed statistical estimation and optimization have become increasingly popular in recent years (Boyd et al. 2011; Wainwright 2013, 2015; Lee et al. 2017). Within this vein, Tian and Gu (2017) proposed an efficient distributed estimation method for sparse linear discriminant analysis for high-dimensional data, which distributes the data of size n across m machines and then estimates a local sparse LDA estimator on each machine. This method significantly improves computing efficiency when the sample size n is large; however, the algorithm is still time-consuming when p ≫ n. To address the numerical challenge due to large p, we propose an alternative approach for solving the LPD rule with ultrahigh-dimensional data through the alternating direction method of multipliers (ADMM). Specifically, we propose a novel feature-split variant of ADMM that handles the computational challenge of ultrahigh dimensionality by partitioning the features and executing parallel computations on the feature subsets. Theoretical convergence results are established for our new feature-split ADMM algorithm.
Finally, to accommodate both the folded concave penalty and ultrahigh dimensionality, we introduce a multi-block ADMM algorithm based on feature-space splitting, nested within the one-step LLA, to solve the LPD rule for ultrahigh-dimensional data; it achieves computational efficiency when p ≫ n compared to the standard linear programming solver and ADMM algorithm. A comprehensive numerical study is conducted to evaluate the convergence of the LPD rule with folded concave penalty solved by the proposed multi-block feature-split ADMM algorithm. We have also conducted both numerical simulations and real data applications to compare the classification performance of the proposed method and the original LPD rule (2) solved by either linear programming (implemented in the package lpSolve) or the original ADMM algorithm using the full set of features. Results suggest that the proposed approach performs better than lpSolve and ADMM on moderate-dimensional datasets. Moreover, lpSolve and ADMM may fail on datasets with tens of thousands of features due to intensive memory usage, while the proposed method maintains good performance at such dimensions.
The rest of the article is organized as follows. In Section 2, we introduce the LPD rule with folded concave penalty and verify the strong oracle property of the estimator obtained from the one-step LLA. In Section 3, we introduce the computational framework that nests a 3-block ADMM into the one-step LLA to find the estimate of the LPD rule with folded concave penalty. In Section 4, we compare the proposed approach with existing alternatives, including the original LPD rule solved by linear programming, implemented by the R package lpSolve, and the original ADMM algorithm using the full set of features (i.e., without feature split). The usefulness of our new method is further demonstrated by applications to multiple real datasets. Finally, we conclude the article with a brief discussion in Section 5.

Linear Programming Discriminant Rule with Folded Concave Penalty
In this section, we introduce the LPD (2) with a general class of folded concave penalties and verify its oracle properties. We assume throughout this section that folded concave penalty functions satisfy the following conditions (Fan, Xue, and Zou 2014; Liu et al. 2017; Fan et al. 2020):

(a) Zero baseline penalty: Pλ(0) = 0;
(b) Monotonicity: Pλ(x) ≤ Pλ(y) for all 0 ≤ x < y < ∞;
(c) Concavity: Pλ(x) is concave for x ∈ [0, ∞);
(d) Subadditivity: Pλ(x + y) ≤ Pλ(x) + Pλ(y) for all 0 ≤ x ≤ y < ∞;
(e) Differentiability: Pλ(x) is differentiable on (0, ∞);
(f) P′λ(x) ≥ a1λ for x ∈ (0, a2λ]; P′λ(x) = 0 for x ∈ [aλ, ∞) with a > a2, where a1 and a2 are two constants associated with each penalty.

One such example is the SCAD penalty (Fan and Li 2001), whose derivative is

P′λ(x) = λ I(x ≤ λ) + ((aλ − x)+/(a − 1)) I(x > λ)

for some a > 2, with P′λ(0+) = λ. The SCAD penalty is a folded concave penalty with both the sparseness and unbiasedness properties, as well as a bounded derivative that allows gradient descent and other gradient-based algorithms to search for proper solutions in large subspaces. However, like other folded concave penalty functions, the nonconvexity of the penalty term results in multiple local optima and brings great difficulty to computing algorithms. To address this limitation, Zou and Li (2008) constructed an efficient one-step sparse estimation procedure for nonconcave penalized likelihood models through local linear approximation (LLA). By approximating the nonconvex penalized problem with a series of adaptive Lasso problems, LLA preserves the convexity of the ℓ1 regularization method. Specifically, the LLA to the penalty is given by

Pλ(|βj|) ≈ Pλ(|βj(0)|) + P′λ(|βj(0)|)(|βj| − |βj(0)|),

where βj(0) is a given initial value. Suppose we have a folded concave penalized estimation problem min_β L(β; y, X) + ∑j Pλ(|βj|), where L(β; y, X) is a convex loss function. The LLA shares the nice properties of the ℓ1 penalty in terms of computational efficiency, and Zou and Li (2008) proposed using a properly initialized one-step LLA estimator to alleviate the computational burden of iterative algorithms and to overcome the potential local-maximum problem in maximizing the folded concave penalized likelihood.
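As a concrete illustration, here is a minimal R sketch of the SCAD derivative and the resulting one-step LLA weights. The default a = 3.7 follows the value recommended by Fan and Li (2001); the function names are ours.

```r
# SCAD derivative P'_lambda(x) for x >= 0 (Fan and Li 2001), with a > 2.
scad_deriv <- function(x, lambda, a = 3.7) {
  ifelse(x <= lambda, lambda, pmax(a * lambda - x, 0) / (a - 1))
}

# One-step LLA: the folded concave penalty is replaced by a weighted l1
# penalty with weights P'_lambda(|beta0_j|) evaluated at an initial value.
lla_weights <- function(beta0, lambda, a = 3.7) {
  scad_deriv(abs(beta0), lambda, a)
}
```

Note that coefficients whose initial estimates exceed aλ receive zero weight and are no longer penalized, which is the source of the bias correction relative to the plain ℓ1 penalty.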
Returning to our LPD problem (2): in many high-dimensional settings, the ℓ1 penalty tends to over-penalize large coefficients and thus does not possess the desirable oracle property. Therefore, we generalize problem (2) by replacing ‖β‖1 with the folded concave penalty ∑j Pλ(|βj|), and obtain the LPD with folded concave penalty as follows,

min_β ∑j Pλ(|βj|) subject to ‖Σ̂β − δ̂‖∞ ≤ λ. (3)

The application of LLA redirects problem (3) to an adaptive version of problem (2):

min_β ∑j P′λ(|βj(0)|)|βj| subject to ‖Σ̂β − δ̂‖∞ ≤ λ. (4)

To use LLA, we first verify its strong oracle property. Denote the between-class scatter matrix as S_B = δδᵀ. The oracle Fisher's rule is based on the oracle estimator β̂_oracle, whose restriction to A is Σ̂_{A,A}⁻¹δ̂_A and which is zero elsewhere, where A := Supp(β*) is the index set of the true support of β* with cardinality |A| = q and β* := Ωδ is the population discriminant direction. In our theoretical analysis, it is always assumed that q ≪ n; in other words, the cardinality of A is much less than the sample size. The sign consistency of the standard LPD rule β̂_LPD in (2) was established under the uniform irrepresentability condition (A1) together with the regularity conditions (A2) and (A3) imposed in Zhang, Wang, and Bian (2020), where Σ_A is the submatrix containing the columns of Σ indexed by the set A, and ‖A‖max and ‖A‖min denote the element-wise maximum and minimum absolute values of a vector/matrix A, respectively. Condition (A1) is related to the irrepresentability condition proposed by Zhao and Yu (2006). Condition (A2) sets an upper bound on the max norm of the precision matrix. Condition (A3) states that the probability of Σ̂_{A,A}⁻¹ sign(β*_A) having no zero entries goes to 1; the pairs of matrix Σ_{A,A}⁻¹ and vector β*_A for which Σ_{A,A}⁻¹ sign(β*_A) has zero entries form a set of measure zero, where Σ_{A,A} is the submatrix containing the rows and columns of Σ indexed by the set A. Using β̂_LPD to initialize the LLA algorithm, we obtain the following theorem.

Theorem 1. Under conditions (A1), (A2), and (A3), and further assuming ‖β*_A‖min > (a + 1)λ, the LLA algorithm initialized by β̂_LPD converges to β̂_oracle after one iteration with probability 1 − P1 − P2 − P3, where Pi is the probability of the event Ei, i = 1, 2, 3, and a1 and a2 are the two constants associated with each folded concave penalty function. Furthermore, if λ ≍ (log(pq)q²/n)^{1/2}, then Pi → 0, i = 1, 2, 3, as n → ∞.
Theorem 1 confirms that the estimator obtained by the LLA algorithm enjoys the oracle property; its proof is given in the supplementary materials. We next study its classification performance. The best possible misclassification rate of Fisher's rule is R := 1 − Φ(√(δᵀΩδ)/2), where Φ(·) is the cumulative distribution function of the standard normal distribution. The conditional misclassification rate of the LPD rule given the samples, denoted Rn, is obtained with β̂ estimated from (4). The performance of the LPD rule can be measured by the difference between R and Rn. To study the consistency of Rn, we impose conditions (C1) and (C2); Cai and Liu (2011) imposed similar conditions. Under these conditions, we have the following theorem, whose proof is provided in the supplementary materials.

Theorem 2. Suppose conditions (C1) and (C2) hold, λ ≍ (log(pq)q²/n)^{1/2}, λ ≪ ‖β*‖min, and d < M for some M > 0. Then, as n → ∞ and p → ∞, Rn − R → 0 in probability.

3-Block ADMM for Ultrahigh-Dimensional Data
In this section, we propose a feature-split algorithm for solving problem (4) with ultrahigh-dimensional data. We partition β into K > 1 blocks as β = (β1ᵀ, …, βKᵀ)ᵀ with βi ∈ R^{pi}, and partition the matrix Σ̂ correspondingly as (Σ̂1, …, Σ̂K) with Σ̂i ∈ R^{p×pi}. We reframe our objective (4) as a more general optimization problem with a weighted ℓ1 penalty,

min_β ∑j αj|βj| subject to ‖Σ̂β − δ̂‖∞ ≤ λ.

When α = 1p, this weighted ℓ1 regularized problem reduces to the general ℓ1 regularized problem. The K blocks of β can be treated as a one-block variable in the objective function. To enforce a multi-block structure on the primal variables, we introduce auxiliary variables z ∈ Rp and ω = (ω2, …, ωK), arriving at the equivalent multi-block formulation (8).
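A minimal R sketch of the feature partition follows, assuming contiguous blocks of nearly equal size (the equal-split strategy used in our numerical studies); the helper name split_features is ours.

```r
# Split p feature indices into K contiguous blocks of nearly equal size,
# mirroring beta = (beta_1, ..., beta_K) and Sigma_hat = (Sigma_hat_1, ..., Sigma_hat_K).
split_features <- function(p, K) {
  split(seq_len(p), cut(seq_len(p), breaks = K, labels = FALSE))
}

# Example: column blocks of the sample covariance matrix.
# blocks <- split_features(ncol(Sigma_hat), K = 8)
# Sigma_blocks <- lapply(blocks, function(idx) Sigma_hat[, idx, drop = FALSE])
```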
The augmented Lagrangian function of (8) is formed in the standard way. Inspired by Sun, Toh, and Yang (2015), we propose the update scheme (9) for the variables, which guarantees both theoretical convergence and numerical efficiency, where θ is the step size. The scheme (9) has three separable blocks in the objective function, with the third part being linear, and takes a particular block coordinate descent cycle for updating the variable blocks. Therefore, we need to update the third variable block twice to obtain convergence. By plugging in the augmented Lagrangian function, we obtain the explicit updates (10) and (11). Compared to the traditional ADMM (Boyd et al. 2011), by introducing the slack variables ω2, …, ωK, the three-block ADMM decouples the updates of the primal variables β2, …, βK, ω2, …, ωK, and γ2, …, γK. It can be seen from (10) and (11) that the K blocks of β are separable in the augmented Lagrangian function, which allows a natural parallelization of the β updates across multiple worker machines. This is particularly desirable since the β-subproblems often play a dominant role in the computational cost.
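The block separability can be exploited with R's parallel facilities. The skeleton below is only structural: update_beta_block is a hypothetical placeholder for the actual block subproblem solver in (10), which we do not reproduce here.

```r
# Structural sketch of the parallel beta-update: each of the K blocks is
# solved independently on a worker. update_beta_block() is a placeholder
# (hypothetical) for the block subproblem in (10).
library(parallel)

parallel_beta_update <- function(beta_blocks, state, n_cores = length(beta_blocks)) {
  mclapply(seq_along(beta_blocks), function(k) {
    update_beta_block(beta_blocks[[k]], state, k)  # hypothetical solver
  }, mc.cores = n_cores)
}
```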
Algorithm 1 (FS-ADMM): while the stopping criterion is not satisfied, compute β^{t+1}, z^{t+1}, ω^{t+1}, and the multiplier updates in turn; end while. Here S(x, t) = sign(x)(|x| − t)I(|x| > t) is the soft-thresholding function, and it is not difficult to derive that the z- and ω-subproblems have analytic solutions. We summarize our feature-split ADMM (FS-ADMM) algorithm in Algorithm 1. The algorithm for the nonconvex LPD can be derived from Algorithm 1 and is summarized in Algorithm 2. Specifically, we adopt a two-step procedure for solving the LPD with the folded concave penalty. In the first step, we solve the ℓ1-norm LPD to obtain an initial estimator β̂_l. In the second step, we compute the weighted ℓ1 LPD with αj = P′λ(|β̂_l,j|)/λ for j = 1, …, p to obtain the final estimator for the LPD with a folded concave penalty such as SCAD.
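For reference, a direct R implementation of the soft-thresholding operator used in the analytic z- and ω-updates:

```r
# Soft-thresholding S(x, t) = sign(x) * (|x| - t) * I(|x| > t), vectorized in x.
soft_threshold <- function(x, t) {
  sign(x) * pmax(abs(x) - t, 0)
}

soft_threshold(c(-3, -0.5, 0, 0.5, 3), t = 1)
# [1] -2  0  0  0  2
```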
Algorithm 2 (FS-ADMM with one-step LLA): initialize αj, j = 1, …, p; while the stopping criterion is not satisfied, compute β̂^{t+1} by (14) and (15) and update the remaining variables by (20); end while. The rate of convergence of Algorithm 1 for solving the LPD is presented in Theorem 3, whose proof is available in the supplementary materials.
Theorem 3. Take θ ∈ (0, (1 + √5)/2). Then the sequence (β^t, z^t, ω^t, γ^t) generated by Algorithm 1 converges in probability to a unique limit (β̄, z̄, ω̄, γ̄) such that (β̄, z̄, ω̄) is the primal optimum of (8) and γ̄ is the dual optimum. Moreover, there exists a constant μ ∈ (0, 1) such that Dist_{t+1} ≤ μ Dist_t, where the distance function Dist_t measures the distance between the iterate at iteration t and the limit.

As indicated in the proof, the algorithmic convergence rate μ depends on the parameters in Algorithm 1 and especially on our choice of K. On the one hand, increasing K reduces the dimension of the subproblems and the values of the η's, and thus accelerates the computation of each subproblem. On the other hand, increasing K leads to an increased number of subproblems and may raise the value of μ. To guarantee the convergence of Algorithm 1, θ must lie in (0, (1 + √5)/2), and a larger θ makes the convergence faster. Hence, a reasonable choice in practice is θ slightly less than (1 + √5)/2 (e.g., θ = 1.618). The way the features are split has no influence on the convergence property of the algorithm, and we distribute the features equally into K groups without adjusting their order in our numerical studies.

Numerical Experiments
In this section, comprehensive numerical studies are carried out to illustrate the superior performance of the proposed method over existing ones. We first evaluate the convergence behavior and computational efficiency of FS-ADMM. Then, we compare the classification performance of the proposed method to that of existing methods for solving the LPD rule on simulated datasets. Finally, we demonstrate the potential usefulness of the proposed method through applications to real-world datasets. We use SCAD as a representative folded concave penalty function throughout these numerical studies, although the FS-ADMM algorithm applies to any general folded concave penalty function. We follow the selection procedure introduced by Cai and Liu (2011) to select the tuning parameter λ.

Simulation Studies for FS-ADMM
Throughout this simulation, we generate data based on the following configurations:

1. Sample sizes n1 = n2 = 200;
2. Mean vectors μ1 = 0 and μ2 = (1, …, 1, 0, …, 0)ᵀ with the first 10 elements nonzero;
3. A common covariance matrix Σ for the two classes.

To examine the convergence of the proposed algorithm, we set p = 500 and 5000 and numerically evaluated the primal residual against iterations using 100 replicates. The average residual values against iterations are plotted in Figure 1, which indicates the convergence of FS-ADMM.
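For reproducibility, here is a minimal sketch of this data-generating setup. Since the exact covariance specification is not restated here, an AR(1) structure Σij = 0.5^{|i−j|} is assumed purely for illustration; the Bayes rate formula matches the definition of R in Section 2.

```r
# Sketch of the simulation design: two Gaussian classes with a common
# covariance matrix. The AR(1) covariance below is an illustrative
# assumption, not necessarily the design used in the paper.
library(MASS)

p  <- 500
n1 <- n2 <- 200
mu1 <- rep(0, p)
mu2 <- c(rep(1, 10), rep(0, p - 10))
Sigma <- 0.5^abs(outer(seq_len(p), seq_len(p), "-"))  # assumed AR(1)

X1 <- mvrnorm(n1, mu1, Sigma)   # class 1 training sample
X2 <- mvrnorm(n2, mu2, Sigma)   # class 2 training sample

# Bayes misclassification rate R = 1 - Phi(sqrt(delta' Omega delta) / 2).
delta <- mu2 - mu1
R_bayes <- 1 - pnorm(sqrt(drop(t(delta) %*% solve(Sigma, delta))) / 2)
```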
In the computing time simulation, we compared the FS-ADMM algorithms with the original ADMM and the linear programming solver implemented in the open-source R package lpSolve. A common ℓ1 penalty as in (2) was used in all three algorithms, and we picked the dimensionality p = 2000 × s, where s = 1, …, 10. The computing time of each method is reported in Table 1, and it is evident that our new FS-ADMM algorithm is computationally much more efficient than both linear programming and ADMM under low and moderate dimensions. The FS-ADMM algorithms remain scalable under ultrahigh dimensions, while both linear programming and ADMM may fail. We next examine the effect of K (i.e., the number of feature groups) on the computing efficiency of the proposed algorithm. We consider three choices of K (4, 8, and 16) and three choices of p (2000, 5000, and 10,000), with n1 = n2 = 200 observations. The results are summarized in Table 2. When the dimension of the data is p = 2000, the computing times for the different K's are similar. When p = 5000, the computing time for K = 4 becomes larger than those for K = 8 and 16. When p = 10,000, K = 16 performs better than the other two choices. We conclude that a larger K may be more appropriate for larger dimensions. On the other hand, the choice of K also depends on the computing power of the equipment. In practice, we recommend picking K between 4 and 8 when the dimension is in the tens of thousands; we set K = 8 in the numerical studies carried out in this article.
We generated n1 samples from N(μ1, Σ) and n2 samples from N(μ2, Σ) as training samples, and the test samples were generated with the same sizes. The following two criteria were used to evaluate the performance of each method:

1. ‖β − β̂‖2², which evaluates the parameter estimates;
2. the misclassification rate (MCR) on test samples, which evaluates the model generalization capability.
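Both criteria are straightforward to compute. A minimal sketch follows, assuming the linear rule that assigns x to class 1 when (x − (μ̂1 + μ̂2)/2)ᵀβ̂ > 0, with β̂ estimating Ω(μ1 − μ2); the function names are ours.

```r
# Evaluation criteria: (1) squared l2 estimation error; (2) misclassification
# rate of the linear discriminant rule on a labeled test set.
est_error <- function(beta_hat, beta_star) sum((beta_hat - beta_star)^2)

mcr <- function(X_test, y_test, beta_hat, mu1_hat, mu2_hat) {
  # Center each test point at the midpoint of the class means, then project.
  scores <- drop(sweep(X_test, 2, (mu1_hat + mu2_hat) / 2) %*% beta_hat)
  pred <- ifelse(scores > 0, 1, 2)   # class 1 if the score is positive
  mean(pred != y_test)
}
```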
We report the average values of ‖β − β̂‖2² over 100 replicates in Table 3, along with standard errors of the mean in parentheses. The MCR values on the test samples over 100 replications are reported in Table 4. It is noteworthy that the best possible MCR under the current simulation parameter settings is R = 1 − Φ(√(δᵀΩδ)/2) = 0.1656 for both p = 2000 and 10,000. When p = 10,000, lpSolve could not complete the job due to the memory limit (4GB). From the results reported in Tables 3 and 4, we conclude the following:

1. lpSolve may fail due to excessive memory usage when the dimension of the data becomes ultrahigh, while the proposed method performs more stably at ultrahigh dimensions.
2. The SCAD-regularized LPD outperforms the standard ℓ1-norm LPD solved by either lpSolve or ADMM in terms of estimation accuracy in both moderate and high dimensions.
3. In terms of model generalization ability, the proposed FS-ADMM-SCAD method performs comparably to lpSolve and ADMM in moderate dimensions, but has a clear edge over them in high dimensions.

Application to Gene Expression Data
We test the performance of the different methods on a gene expression dataset collected from patients with inflammatory bowel disease (Morgan et al. 2015). A total of 19,908 genes had their expression measured by microarray, and it has previously been shown that gene expression patterns may be indicative of inflammatory bowel disease subtypes (Zhan et al. 2017). In particular, a total of 255 patients were classified as FAP (familial adenomatous polyposis) or non-FAP. Our goal in this real data analysis is to classify patients as FAP or non-FAP. We randomly split the data into a training dataset with 210 patients and a testing dataset with the remaining 45 patients. The testing error was then calculated as the ratio between the number of misclassifications and the testing sample size. We selected either p = 2000 or 5000 genes with the largest variances as potential predictors. The experiment was repeated 100 times, and we report the average testing errors of the different methods in Table 5. As seen in Table 5, the proposed SCAD-based feature-split ADMM methods have the best classification performance compared to lpSolve and ADMM. When p = 2000, the misclassification rate of the FS-ADMM-SCAD methods is about 65% of that of the ℓ1-based methods. When we increase the number of genes to 5000, the performance of each method deteriorates slightly, indicating that the added genes include more that are uninformative with respect to disease subtype classification.
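The variance-based screening and random splitting are simple to reproduce. In the sketch below, expr (the 255 × 19,908 expression matrix) and label (FAP status) are placeholder names for the actual data objects, and the seed is arbitrary.

```r
# Keep the p genes with the largest sample variances, then randomly split
# into 210 training and 45 testing patients. expr and label are placeholders.
p <- 2000
keep <- order(apply(expr, 2, var), decreasing = TRUE)[seq_len(p)]
X <- expr[, keep]

set.seed(1)  # arbitrary seed for illustration
train_id <- sample(nrow(X), 210)
X_train <- X[train_id, ];  y_train <- label[train_id]
X_test  <- X[-train_id, ]; y_test  <- label[-train_id]
```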

Application to Human Activity Recognition Dataset
In this part, we apply our methods to a scenario of relatively low or moderate dimension. In particular, we test the performance of the proposed method on a real dataset with around 500 features. The dataset comes from the UCI machine learning repository: the Smartphone Dataset for Human Activity Recognition (HAR) in Ambient Assisted Living (AAL). Researchers collected this dataset from an Ambient Assisted Living experiment, in which 30 participants from 22 to 79 years old were asked to wear smartphones with built-in accelerometers and gyroscopes around their waists. Each participant performed six activities (standing, sitting, laying, walking, walking upstairs, walking downstairs) for 1 min. In the end, three-axial raw signals from the accelerometer and gyroscope were collected, and 561-dimensional feature vectors were extracted and derived from the signals as potential predictors for activity recognition. We use the following two criteria to evaluate the performance of the methods: (a) training error, the ratio between the number of misclassifications and the training sample size; (b) testing error, the ratio between the number of misclassifications and the testing sample size. For ease of presentation, we take sitting as an example and report the results of comparing sitting to three other activities in Table 6. Sitting versus Walking Downstairs: we first select records corresponding to two very different activities, sitting and walking downstairs. The training dataset contains 1525 samples and the testing data contains 528 samples. Table 6 shows that the SCAD-based FS-ADMM outperforms the other two methods in terms of both training and testing error.
Sitting versus Walking Upstairs: we next select records corresponding to sitting and walking upstairs. The training dataset includes 1463 samples and the testing data includes 518 samples. The results are summarized in the middle part of Table 6. We observe patterns similar to those in the Sitting versus Walking Downstairs comparison.
Sitting versus Laying: we select records from two relatively similar activities, sitting and laying. The training dataset includes 1388 samples and the testing data includes 527 samples. The results are summarized in the bottom part of Table 6. The SCAD-based FS-ADMM methods outperform lpSolve in terms of training error, and all methods have the same testing error.

Discussion
The advent of advanced technologies has generated massive datasets, which pose grand challenges for traditional statistical classification methods. While linear discriminant analysis (LDA) approaches are in almost every data analyst's toolbox, handling such problems in the regime of ultrahigh dimensionality remains a challenging task. The linear programming discriminant (LPD) rule proposed in Cai and Liu (2011) sheds light on LDA under high-dimensional settings with a relatively weak sparsity assumption on the product of the precision matrix and the difference in mean vectors. However, as demonstrated in this article, the current LPD rule is still limited in both estimation accuracy and computational scalability.
This article has presented novel feature-split ADMM algorithms for solving the LPD rule with folded concave penalty for ultrahigh-dimensional data. By extending the LPD rule to accommodate ultrahigh-dimensional data from both theoretical and computational points of view, we aim to provide timely statistical classification tools for researchers facing problems associated with modern big data analysis. The main contributions of our new method are threefold. First, we prove that the LPD rule with folded concave penalty solved by the LLA algorithm enjoys the oracle property. Second, the proposed parallel feature-split ADMM algorithm has comparable or higher accuracy of model parameter estimates and better model generalization ability than two existing benchmark algorithms, linear programming and ADMM. Third, the new FS-ADMM algorithm for the LPD rule with folded concave penalty is computationally scalable, which is essential for high-dimensional data analysis. We have clearly demonstrated the substantial advantages of our methods over existing alternatives in the regime of ultrahigh dimensions. When the dimension of the data is only high or moderate (e.g., a few hundred), the numerical studies presented in Section 4.4 confirm that our methods are still at least comparable to, or slightly more powerful than, existing ones. The FS-ADMM method has been implemented in R, and we have provided code for readers who are interested in solving similar problems.
The extension of the method to K-class classification is not immediate, but a similar feature-split algorithm can be applied to the precision matrix estimation model proposed by Cai, Liu, and Luo (2011). After obtaining the estimate Ω̂ of the precision matrix, we can calculate the statistics Tj = (x − μ̂j)ᵀΩ̂(x − μ̂j), j = 1, …, K, and assign the new observation x to the class j with the smallest value of Tj.
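A minimal sketch of this K-class rule, assuming an estimated precision matrix Omega_hat (e.g., from the CLIME estimator of Cai, Liu, and Luo 2011) and a list mu_hat of class mean estimates; the function name is ours.

```r
# Assign a new observation x to the class with the smallest
# T_j = (x - mu_hat_j)' Omega_hat (x - mu_hat_j), j = 1, ..., K.
classify_K <- function(x, mu_hat, Omega_hat) {
  T_vals <- vapply(mu_hat, function(mu) {
    d <- x - mu
    drop(crossprod(d, Omega_hat %*% d))  # Mahalanobis-type distance
  }, numeric(1))
  which.min(T_vals)
}
```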

Supplemental Materials
Supplementary materials, available at the Journal of Computational and Graphical Statistics online, include the proofs of Theorems 1-3, additional numerical results, and R code implementing FS-ADMM.