Gaining Outlier Resistance with Progressive Quantiles: Fast Algorithms and Theoretical Studies

Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. In this paper, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point such that on regular datasets, the number of data resamplings can be substantially reduced. Based on combined statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The obtained resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology at the occurrence of gross outliers.


Introduction
Outliers are bound to occur in real-world big data and may severely affect statistical analysis. Many commonly used methods, including the lasso (Tibshirani, 1996), break down in the presence of a single bad observation. We study the problem in a general setup. Given an arbitrary loss function $l$, a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n\times p}$, the coefficient vector $\beta \in \mathbb{R}^p$ can be estimated by solving the following optimization problem
$$\min_{\beta \in \mathbb{R}^p} \ l(\eta; y) + \sum_{j=1}^{p} P(\beta_j) \quad \text{s.t.} \quad \eta = X\beta, \qquad (1)$$
where $l$ measures the discrepancy between the observed response and the prediction, and $P$ represents a regularizer that can also be a constraint. Neither $l$ nor $P$ is necessarily convex. The loss can be, say, a deviance function corresponding to a generalized linear model (GLM), a robust loss like Huber's loss, or the well-known hinge loss or exponential loss for the purpose of classification. The loss function need not be associated with any probability distribution. The goal of this work is to robustify (1) for any given $l$ at the occurrence of a small number of gross outliers; we stress that the desired outlier resistance should not only accommodate samples with mild deviations from the model assumption, but also handle extreme anomalies with severe outlyingness and leverage. Some frequently used robust losses, Huber's loss and the $\ell_1$-norm loss in particular, are not resistant to gross outliers, and hence one may want a reinforced robustification in the setup of (1). The nature of the task eventually leads to an $\ell_0$-type regularization proposal.
Research in robust statistics has a long history and has led to fruitful outcomes. A standard means of gaining robustness is to modify the loss, or equivalently, reweight the samples. We illustrate the idea with robust regression. Instead of minimizing the quadratic loss $\|y - X\beta\|_2^2$, M-estimation (Huber and Ronchetti, 2009; Hampel et al., 2011; Maronna et al., 2006) solves $\min_{\beta} \sum_i \rho(y_i - x_i^T\beta)$ or $X^T\psi(X\beta - y) = 0$, where $\rho$ is a robust loss and $\psi = \rho'$. Let $w(t) = \psi(t)/t$ and $r = X\beta - y$. The score equation can then be written as $X^T\{w(r) \circ (X\beta - y)\} = 0$, where the multiplicative weights $w(r_i)$ are applied componentwise. The design of a robust loss thus amounts to that of a weight function. See Loh (2017) and Avella-Medina (2017), for example, for some theoretical properties. We call the above scheme the multiplicative robustification, to contrast with the additive fashion to be introduced in Section 2. An iterative algorithm called iteratively reweighted least squares (IRLS) is popularly used in computation. The most critical and expensive step, however, lies in the preliminary resistant fit with high breakdown (Rousseeuw and Yohai, 1984; Rousseeuw, 1985), which is also the focus of our paper.
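To make the multiplicative scheme concrete, here is a minimal sketch of IRLS with Huber's $\psi$; the weight function $w(t) = \psi(t)/t$, the MAD-based scale, and the cutoff 1.345 are standard textbook choices rather than anything specific to this paper.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """w(t) = psi(t)/t for Huber's psi: 1 on [-c, c], c/|t| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def irls(X, y, n_iter=50, c=1.345):
    """Multiplicative robustification: iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary LS start
    for _ in range(n_iter):
        r = y - X @ beta
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # MAD scale
        w = huber_weights(r / scale, c)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]  # weighted LS
    return beta
```

As noted above, the expensive part is the preliminary resistant fit that such reweighting schemes typically take as their starting point; with gross, high-leverage outliers an ordinary least-squares start can already be ruined.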
Some challenges of resistant estimation in methodology, theory, and computation are summarized as follows. First, when $l(\eta; y)$ has a more complicated expression than one in terms of $r_i = y_i - x_i^T\beta$ only, how to reshape it to guard against gross outliers does not seem straightforward. According to Bianco and Yohai (1996), simply applying $\rho$ to the deviance residuals of a GLM (Pregibon, 1982) does not ensure consistency, and an extra bias correction term must be added to the criterion. Croux and Haesbroeck (2003) limited $\rho$ to a certain class to guarantee the existence of a finite solution, and included an additional weighting step to get a bounded influence function. We refer to Avella-Medina and Ronchetti (2018) for a robust quasi-likelihood form that uses both a $\psi$ function and a weight function. To us, another important line of work on the trimmed likelihood estimator (TLE) (Hadi and Luceño, 1997; Vandev and Neykov, 1998; Kurnaz et al., 2018), as an extension of least trimmed squares (LTS) (Rousseeuw, 1985), is the most motivating. LTS and TLE give rise to our new regularized form of resistant estimation, which is more amenable to both optimization and analysis.
The theoretical challenges faced by today's robust statisticians are perhaps even more pressing. For example, even under n > p (let alone in high-dimensional settings), (a) how trimming inflates the risk, (b) how small the universal minimax lower bound could be, and (c) what makes a theoretically sound model selection criterion are all surprisingly unknown in the literature. We emphasize the finite-sample nature of the desired statistical theory (e.g., Theorems 4-6), because in real-life data applications, it is often hard to know how large the sample size should be relative to the number of unknowns to apply asymptotics. The breakdown point provides a useful index of the robustness of a given method, but as pointed out by Huber and Ronchetti (2009), such worst-case studies are not probabilistic and may be too conservative on a particular dataset. In addition to robust analysis, parameter tuning based on asymptotics or the breakdown point suffers from the same theoretical difficulties.
The last big challenge lies in computation. Resistant fits are much more costly than IRLS or Θ-IPOD (She and Owen, 2011), which take them as initial points. Due to the inherent nonconvexity, subset sampling is heavily used in LTS-like algorithms to generate multiple starting values (Rousseeuw and Driessen, 1999). Unfortunately, the number of required samplings grows exponentially in p, and so these procedures become inaccurate and/or prohibitive even for moderate p. Developing more efficient outlier-resistant algorithms and mitigating the burden of subset sampling are at the core of modern-day robust statistics. Moreover, the fact that the computationally obtained solutions are not necessarily globally optimal poses nontrivial challenges in theoretical analysis.
This paper attempts to address some of the aforementioned obstacles to gain outlier resistance, with the hope of advancing the practice of robust statistics toward more sophisticated learning tasks. Our main contributions are threefold. a) A general resistant learning framework is introduced to robustify an arbitrarily given loss, with an "$\ell_0 + \ell_2$" form of regularization to deal with gross outliers in possibly high dimensions. It has a close connection to the method of trimming but includes explicit outlyingness parameters, which in turn facilitates computation, theory, and parameter tuning. b) Although the associated problem is highly nonconvex and nonsmooth, extremely scalable algorithms with implementation ease can be developed with guaranteed convergence. In particular, the statistical error of the sequence of iterates can converge geometrically fast. A progressive optimization technique is proposed to relax the regularity condition and alleviate the requirement on the starting point so that on regular datasets the number of data resamplings can be substantially reduced. c) Combined statistical and computational treatments make it possible to perform nonasymptotic analysis beyond the M-estimation setting, and the obtained "A-estimators", though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions.
The rest of the paper is organized as follows. Section 2 introduces our regularized resistant learning framework and compares it to trimming. Section 3 deals with the computational issues by progressive iterative quantile thresholding. Section 4 provides nonasymptotic theoretical support for the obtained estimators under proper regularity conditions. Section 5 reveals some universal minimax lower bounds and investigates the issue of parameter tuning. Section 6 performs extensive simulation studies and real world data analysis. The proof details and more simulation results are given in the appendix.
Notation. We use $X = [x_1, x_2, \cdots, x_n]^T$ to denote the design matrix and $x_i$ the $i$th sample. Throughout the paper, we use $l(\eta; y)$ to denote a differentiable loss defined on the systematic component $\eta$, and assume that $l$ is bounded from below and $\inf_\eta l(\eta; y) \geq 0$ without loss of generality. Sometimes, such as when building the connection to trimming, we assume that the overall loss takes a sample-additive form, $l(\eta; y) = \sum_{i=1}^n l_0(\eta_i; y_i)$. $l(\eta; y)$ is said to be $L$-strongly smooth or have an $L$-Lipschitz continuous gradient if
$$\|\nabla l(\eta_1; y) - \nabla l(\eta_2; y)\|_2 \leq L\,\|\eta_1 - \eta_2\|_2, \quad \forall \eta_1, \eta_2 \qquad (2)$$
for some $L > 0$. For ease of presentation, we frequently use the concatenated notation $\tilde X = [X, I]$, $\tilde\beta = [\beta^T, \gamma^T]^T$, and so $X\beta + \gamma = \tilde X\tilde\beta$. For any matrix $A$, $\|A\|_2$ is its spectral norm. Given any $\beta \in \mathbb{R}^p$, $\|\beta\|_0 = \sum_{j=1}^p 1_{\beta_j \neq 0}$. The hat matrix $H$ of $X$ is $X(X^TX)^{+}X^T$, with $^{+}$ denoting the Moore-Penrose inverse. Given any two vectors $\alpha, \beta$ of the same size, the inner product $\langle\alpha, \beta\rangle$ is given by $\sum_j \alpha_j\beta_j$. Given $X \in \mathbb{R}^{n\times p}$ and $s \leq p$, we introduce $0 \leq m_X(s), M_X(s)$ associated with the restricted isometry property (Candes and Tao, 2005) by $m_X(s)\|\beta\|_2^2 \leq \|X\beta\|_2^2 \leq M_X(s)\|\beta\|_2^2$, $\forall \beta: \|\beta\|_0 \leq s$. In particular, $M_{\tilde X}(s, o)$ ($s \leq p$, $o \leq n$) obeying $\|\tilde X\tilde\beta\|_2^2 \leq M_{\tilde X}(s, o)\|\tilde\beta\|_2^2$, $\forall \tilde\beta: \|\beta\|_0 \leq s, \|\gamma\|_0 \leq o$, will play an active role in our algorithm design and analysis. Obviously, $M_X(s) \leq \|X\|_2^2$ and $M_{\tilde X}(s, o) \leq \|\tilde X\|_2^2$. But since we are primarily interested in $s, o \ll n$, $X\beta$ or $\tilde X\tilde\beta$ for an $s$-restricted $\beta$ or an $(s, o)$-restricted $\tilde\beta$ involves a much thinner design, and so $M_X(s)$, $M_{\tilde X}(s, o)$ can be far smaller.
We use $C$, $c$, and $L$ to denote universal constants which are not necessarily the same at each occurrence, and $\lesssim$ denotes an inequality that holds up to a multiplicative numerical constant. Finally, $0\log 0 = 0$, $a \vee b = \max\{a, b\}$, and $a \wedge b = \min\{a, b\}$ are adopted.

$\ell_0$-regularization vs. Trimming
We begin with the robust estimation problem (1) without regularization. The customizable loss determines the performance metric, so we prefer not to alter its form in the process of robustification. This is made possible by additive regularized estimation.
Concretely, we re-define $\eta$ as $X\beta + \gamma$, where $\gamma_i$ characterizes the outlyingness of the $i$th sample and is allowed to be large. Taking advantage of the sparsity of $\gamma$, as outliers are never the norm, we propose an $\ell_0$-constrained, $\ell_2$-penalized outlier-resistant estimation
$$\min_{\beta,\gamma} \ l(X\beta + \gamma; y) + \frac{\nu}{2}\|\gamma\|_2^2 \quad \text{s.t.} \quad \|\gamma\|_0 \leq q, \qquad (3)$$
where $q$ satisfies $q \leq n/2$ and is assumed to be an integer throughout the paper. Here, the number of unknowns, $n + p$, is larger than the number of observations, and the "$\ell_0 + \ell_2$" regularization plays an effective role in dealing with the high-dimensional challenge.
Unlike the popular $\ell_1$-norm penalty, the $\ell_0$ 'norm' used in the constraint is indifferent to the magnitude of $\gamma_i$ and does not result in any undesired bias. An alternative idea is to utilize a nonconvex sparse penalty; in the regression setting, this amounts to M-estimation, where diverse choices of the penalty lead to different robust losses (She and Owen, 2011), and the conclusion extends to regularized $\beta$-estimation in possibly high dimensions (She and Chen, 2017, Lemma 2). But (3) has some distinct advantages: the constraint directly controls the maximum number of outliers, and $\|\cdot\|_0$ is arguably the ideal function to enforce sparsity regardless of the severity of aberrant samples. In practice, specifying the value of $q$, rather than a penalty parameter $\lambda$, is easier and more convenient.
In addition to the sparsity-promoting regularization, we include the $\ell_2$-shrinkage to enhance numerical stability and compensate for collinearity in the presence of clustered outliers. We will see in Section 3.1 that with $\nu$ introduced, the samples with minor deviations can still contribute to the model fitting. Empirically, $\nu$ is not a sensitive parameter, and we often set it to a mild value (say 1e-4). As long as $\nu > 0$, the $\gamma$-estimate is finite. But if we force $\nu = 0$, (3) has an interesting and important connection to the method of trimming (Rousseeuw, 1985; Hadi and Luceño, 1997; Vandev and Neykov, 1998).
Theorem 1. i) Given any minimizer $(\hat\beta, \hat\gamma)$ of the $\ell_0$-constrained joint problem $\min_{\beta,\gamma} l(X\beta + \gamma; y)$ s.t. $\|\gamma\|_0 \leq q$, $\hat\beta$ is also a globally optimal solution to the trimmed problem on $\beta$ that minimizes the sum of the $n - q$ smallest values of $l_0(x_i^T\beta; y_i)$. ii) The same connection holds between the $\ell_0$-penalized joint problem $\min_{\beta,\gamma} l(X\beta + \gamma; y) + \tau\|\gamma\|_0$ and the winsorized problem $\min_{\beta} \sum_i \tau \wedge l_0(x_i^T\beta; y_i)$ for any cutoff $\tau \geq 0$. iii) Finally, any minimizer $\hat\beta$ of the winsorized optimization problem is a solution to the trimmed problem with $q = |\{i: l_0(x_i^T\hat\beta; y_i) > \tau\}|$, and any $(\hat\beta, \hat\gamma)$ that minimizes the $\ell_0$-penalized joint criterion is also an optimal solution to the $\ell_0$-constrained joint problem with $q = \|\hat\gamma\|_0$.
So in a sense, imposing the $\ell_0$ regularization on $\gamma$ amounts to using a trimmed version or setting a cutoff of the original loss on $\beta$, both of which can effectively bound the influence of outliers and necessarily result in nonconvexity. The joint $(\beta, \gamma)$ formulation under $\ell_0$-penalization corresponds to winsorizing the loss (which unsurprisingly gives a redescending $\psi$ in the setup of robust regression), and the $\ell_0$-constrained form has a similar power but utilizes a data-adaptive cutoff. It is noteworthy that our additive regularized framework makes it possible to utilize any loss (which can be convex) for resistant estimation.
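As a one-line illustration of the winsorization in ii) (a sketch under the Lemma 1 normalization $\inf_\eta l_0(\eta; y) = 0$ and with $\nu = 0$), minimizing the $\ell_0$-penalized criterion over each $\gamma_i$ separately gives
$$\min_{\gamma_i}\ \big\{ l_0(x_i^T\beta + \gamma_i;\, y_i) + \tau\,1_{\gamma_i\neq 0} \big\}
  = \min\big\{ l_0(x_i^T\beta;\, y_i),\ \tau + \inf_\eta l_0(\eta;\, y_i) \big\}
  = \tau \wedge l_0(x_i^T\beta;\, y_i),$$
so each sample contributes at most $\tau$ to the criterion, which is exactly the capped loss appearing in Theorem 1 ii).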
Remark 2.1. Because of these connections, some breakdown-point and influence-function results for the estimators defined by (3) can be directly obtained from, say, Alfons et al. (2013) and Öllerer et al. (2015). For example, Proposition 1 of Alfons et al. (2013) can be slightly modified to argue that the breakdown point is greater than $q/n$ when $\nu > 0$ and $l$ is convex under some regularity conditions. Moreover, interested readers may refer to Appendix A.11 for a risk-based breakdown point defined via the Orlicz norm of the effective noise, which takes the randomness into account, cf. (A.76), (A.77). One can then combine empirical process theory and generalized Bregman calculus to perform breakdown point studies for an extended real-valued loss $l(\eta; y)$ that may not be a function of $y - \eta$ as in M-estimation. See Remark A.5 and Theorem A.5 for details, where the benefit of using a strongly convex $l$ on the systematic component is shown.
On the other hand, the equivalence between the sets of globally optimal solutions does not preclude practically obtained estimates from being different. For the analysis of locally optimal or alternatively optimal estimators to be presented in later sections, our additive regularized form appears to be advantageous over the trimming form.
Finally, an extension of (3) to high dimensions gives rise to resistant variable selection or resistant shrinkage estimation:
$$\min_{\beta,\gamma} \ l(X\beta + \gamma; y) + \frac{\nu}{2}\|\gamma\|_2^2 + P(\beta; \lambda) \quad \text{s.t.} \quad \|\gamma\|_0 \leq q, \qquad (4)$$
where $P(\beta; \lambda)$ can be, say, a sparsity-inducing penalty/constraint, or an $\ell_2$-type penalty. (4) allows for $p > n$. Similar to Theorem 1, in the special case of $l(\eta; y) = \|y - \eta\|_2^2/2$, $P(\beta; \lambda) = \lambda\|\beta\|_1$ and $\nu = 0$, (4) can be shown to be equivalent to the sparse LTS (Alfons et al., 2013). In comparison, our explicit introduction of the outlyingness vector $\gamma$ provides new insights. First, outliers can be directly revealed from the $\gamma$-estimate. Second, the formulation eases the design of optimization-based algorithms (Section 3). Third, (3) and (4) enable us to borrow high-dimensional statistics tools to investigate the nonasymptotic behavior of resistant estimators (Section 4) and to develop a new information criterion for regularization parameter tuning (Section 5).

Progressive Iterative Quantile-thresholding
This section studies how to solve the computational problems defined in Section 2, where the main obstacle lies in nonconvexity and nonsmoothness. Thanks to the sparsity of the problem, obtaining a globally optimal solution may not be necessary in many real world applications. This section develops two classes of iterative algorithms, the BCD type and the MM type, both of which can accommodate a differentiable loss function that is not necessarily convex. More importantly, we propose a progressive optimization technique to relax the condition required to enjoy the best statistical accuracy.

Outlier-resistant regression
To illustrate the main ideas and techniques, this part studies outlier-resistant regression with $n > p$:
$$\min_{\beta,\gamma} \ \frac{1}{2}\|y - X\beta - \gamma\|_2^2 + \frac{\nu}{2}\|\gamma\|_2^2 \quad \text{s.t.} \quad \|\gamma\|_0 \leq q. \qquad (5)$$
Compared with LTS, the inclusion of $\gamma$ in (5) keeps the quadratic loss intact, and so block-wise coordinate descent (BCD) can be used for computation. The sub-problem of $\beta$ is trivial, and given $\beta$, the best $\gamma$ can be obtained with a quantile thresholding operator $\Theta^{\#}$ (She et al., 2013). Given any $s \in \mathbb{R}^n$, define $\Theta^{\#}(s; q, \nu)$ to be a vector $t \in \mathbb{R}^n$ satisfying $t_{(j)} = s_{(j)}/(1 + \nu)$ if $1 \leq j \leq q$, and $0$ otherwise, where $s_{(1)}, \ldots, s_{(n)}$ are the order statistics of $s_1, \ldots, s_n$ satisfying $|s_{(1)}| \geq \cdots \geq |s_{(n)}|$, and $t_{(1)}, \ldots, t_{(n)}$ are defined similarly. To ensure $\Theta^{\#}$ is a function and avoid ambiguity, we assume throughout the paper that in performing $\Theta^{\#}(s; q, \nu)$, either $|s_{(q)}| > |s_{(q+1)}|$ or $|s_{(q)}| = |s_{(q+1)}| = 0$ occurs, referred to as the $\Theta^{\#}$-uniqueness assumption. The resultant BCD algorithm proceeds as
$$\beta^{(t+1)} = (X^TX)^{+}X^T(y - \gamma^{(t)}), \qquad \gamma^{(t+1)} = \Theta^{\#}(y - X\beta^{(t+1)}; q, \nu), \qquad (6)$$
or, after substituting the $\beta$-update into the $\gamma$-update,
$$\gamma^{(t+1)} = \Theta^{\#}\big((I - H)y + H\gamma^{(t)}; q, \nu\big), \qquad (7)$$
simply starting with $\beta^{(0)} \in \mathbb{R}^p$ or $\gamma^{(1)} \in \mathbb{R}^n$. (6) and (7) are referred to as iterative quantile-thresholding (IQ) algorithms in the paper. Intuitively, (7) repeatedly thresholds a weighted average of $y$ and the current $\gamma$-estimate. But the quantile thresholding is no ordinary hard-thresholding due to its iteration-varying threshold and concurrent $\ell_2$ shrinkage. (7) shares some similarity with the celebrated "C-step" designed on the basis of trimming (Rousseeuw and Driessen, 1999). On the other hand, at each iteration the C-step only uses a subset of observations to get the coefficient vector, while IQ uses all the samples with adjustment.
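For concreteness, here is a minimal Python sketch of $\Theta^{\#}$ and of the BCD-type IQ iteration as written in (6); the fixed iteration count and the pseudoinverse-based $\beta$-step are simplifications for illustration.

```python
import numpy as np

def quantile_threshold(s, q, nu=1e-4):
    """Theta#(s; q, nu): keep the q entries largest in magnitude, shrunk by 1/(1+nu)."""
    t = np.zeros_like(s)
    if q > 0:
        idx = np.argsort(-np.abs(s))[:q]          # q largest |s_i|
        t[idx] = s[idx] / (1.0 + nu)
    return t

def iq_regression(X, y, q, nu=1e-4, n_iter=100):
    """Iterative quantile-thresholding (IQ) for resistant regression (5)."""
    n, p = X.shape
    gamma = np.zeros(n)
    pinv = np.linalg.pinv(X)                      # (X^T X)^+ X^T, computed once
    for _ in range(n_iter):
        beta = pinv @ (y - gamma)                 # beta-step: LS on adjusted response
        gamma = quantile_threshold(y - X @ beta, q, nu)   # gamma-step
    return beta, gamma
```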
In a sense, the C-step's subset selection amounts to checking the zero-nonzero pattern of $\gamma_i^{(t+1)}$ instead of its exact value. The algorithm also suggests some virtues of the complementary $\ell_2$-shrinkage. When $\nu = 0$, given the optimal $\hat\beta$, $\hat\gamma_i = 0$ or $y_i - x_i^T\hat\beta$. Letting $d_i = 1_{\hat\gamma_i \neq 0}$ and $D = \mathrm{diag}\{d_i\}$, from (6), $\hat\beta$ satisfies $X^T(I - D)(y - X\hat\beta) = 0$. Because of the binary nature of $d_i$, the $i$th observation is either kept unaltered or removed completely, which might hurt the estimation accuracy in the presence of a number of mildly outlying observations. When $\nu > 0$, the above equation still holds for $d_i = 1_{\hat\gamma_i \neq 0}/(1 + \nu)$, and the fact that $I - D$ is nonsingular even when $q$ takes a conservatively large value secures numerical stability and variance reduction. More insights can be gained by transforming $y = X\beta^* + \gamma^* + \epsilon$ to $\tilde y = V^T\gamma^* + \tilde\epsilon$, where $V$ is the orthogonal complement to the range of $X$, $\tilde y = V^Ty$ and $\tilde\epsilon = V^T\epsilon$. Then, during the occurrence of clustered outliers with a large overall leverage, the $\ell_2$ regularization has a beneficial effect in coping with the coherent Gram matrix $V^TV = I - H$. Establishing IQ's rate of convergence is nontrivial since it is a nonconvex optimization algorithm in high dimensions (rigorously speaking, $n$ observations and $p + n$ unknowns). We give two results regarding the convergence of its computational error and statistical error.
Theorem 2. Consider the algorithm defined by (6) or (7). a) For any $T \geq 1$, the computational error of the first $T$ iterates satisfies a universal bound of the order $O(1/T)$. b) Assume the effective noise $\epsilon$ is a sub-Gaussian random vector with mean zero and scale bounded by $\sigma$ (cf. Definition 1 in Appendix A.2). Let $q = \vartheta o^*$ with $\vartheta > 1$. Suppose the design matrix satisfies a restricted isometry property (RIP) for some $\varepsilon$, $0 < \varepsilon < 1$:
$$\Delta^T(I - H)\Delta \geq \varepsilon\|\Delta\|_2^2, \quad \forall \Delta: \|\Delta\|_0 \leq (1 + \vartheta)o^*. \qquad (8)$$
Then with probability at least $1 - C(n - p)^{-c}$, for any $t \geq 1$ the statistical error bound (9) holds: the $\gamma$-iterates approach $\gamma^*$ geometrically fast with linear rate parameter $\kappa$, up to a statistical precision term of the order $\sigma^2\vartheta o^*\log(en/o^*)$, where $C, c > 0$ are constants. Furthermore, under a slightly stronger noise assumption that the $\epsilon_i$ are independent sub-Gaussian$(0, \sigma^2)$, the analogous convergence result (11) holds for the $\beta$-iterates for all $t \geq 1$ with probability at least $1 - C(n - p)^{-c} - C\exp(-cp)$.

The first result in (a) states a universal convergence rate of the order $O(1/t)$, and the conclusion is free of any conditions. On the other hand, this rate does not take the model parsimony into full account.
The statistical error convergence shown in (b) is surprising and delightful. Although $\Theta^{\#}$ is not a nonexpansive mapping, let alone a contraction, IQ approaches the statistical target geometrically fast with high probability, as long as $\vartheta$ or $\nu$ is properly large. Such a linear convergence result in the discrete nonconvex setup is novel to the best of our knowledge. A positive $\nu$ can improve the linear rate parameter $\kappa$. The error bound remains of the same order as long as $\nu\|\gamma^*\|_2^2$ is dominated by a constant multiple of $\sigma^2\vartheta o^*\log(en/o^*)$. (The best value of $\nu$ should be data-adaptive, but fixing it at a mild value already gives satisfactory performance.) The statistical error does not vanish as $t \to \infty$ due to the existence of noise, but Section 5 will show that $p\sigma^2 + o^*\sigma^2\log(en/o^*)$ is the desired error rate in robust regression. Hence over-optimization seems to be unnecessary.
Of course, to enjoy the fast convergence, a regularity condition on $H$ must be assumed. Restricted isometry like (8) is widely used and known to hold in compressed sensing and many high-dimensional sparse problems (Candes and Tao, 2005; Hastie et al., 2015). Intuitively, $\Delta^T(I - H)\Delta \geq \varepsilon\|\Delta\|_2^2$, $\forall \Delta: \|\Delta\|_0 \leq (1 + \vartheta)o^*$, means that all principal submatrices of size $(1 + \vartheta)o^*$ of $I - H$ should have eigenvalues greater than or equal to a positive $\varepsilon$. The cone defining the restricted eigenvalue shrinks when $o^*$ is small. It is well known that RIP is much less demanding than the mutual coherence restrictions on $h_{ij}$ ($i \neq j$) (van de Geer and Bühlmann, 2009; Hastie et al., 2015), which are realized by low-leveraged data since $|h_{ij}| \leq (h_{ii}h_{jj})^{1/2}$.
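As an informal numerical check (a heuristic sketch only, since certifying RIP exactly would require examining all size-$(1+\vartheta)o^*$ submatrices), one can sample principal submatrices of $I - H$ and record the smallest eigenvalue encountered:

```python
import numpy as np

def rip_heuristic(X, size, n_draws=200, seed=0):
    """Sample principal submatrices of I - H of a given size and report the
    smallest eigenvalue seen; this is only an optimistic estimate of the
    restricted eigenvalue, not a certificate."""
    n = X.shape[0]
    H = X @ np.linalg.pinv(X)                 # hat matrix X (X^T X)^+ X^T
    M = np.eye(n) - H
    rng = np.random.default_rng(seed)
    worst = np.inf
    for _ in range(n_draws):
        idx = rng.choice(n, size=size, replace=False)
        worst = min(worst, np.linalg.eigvalsh(M[np.ix_(idx, idx)]).min())
    return worst
```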
We will show that for strongly convex losses, pursuing globally optimal estimators in our additive regularized framework can ultimately remove all regularity conditions, namely, outlier resistance regardless of leverage and outlyingness. But on regular datasets this may be unnecessary. The global minimization comes at a heavy computational cost: multiple random starts have to be used due to the nonconvexity, and experience shows that even when p = 50, the subset sampling popularly used in robust statistics already becomes ineffective and expensive (Rousseeuw and Hubert, 1997).
So a legitimate question on large data is how to relax the condition required for statistical accuracy while maintaining computational efficiency. Theorem 2 sheds new light on the issue. For example, with $\nu = 0$, $\varepsilon > 1/\sqrt{\vartheta}$ ensures the geometric statistical convergence, and to make it hold more easily, an obvious means is to increase $\vartheta$: the larger the value of $\vartheta$, the weaker the requirement on $\varepsilon$, and the smaller the influence of the initial point according to (9) or (11). The adverse effect is the larger error term $\vartheta o^*\log(en/o^*)$. But we can gradually tighten the cardinality constraint with a sequence $Q(t)$ which decreases from $n$ to $q$ as $t$ increases. The resultant algorithm is termed Progressive Iterative Quantile-thresholding (PIQ). PIQ can substantially improve the mediocre performance of IQ. It is easy to extend Theorem 2 from fixed quantiles to varying quantiles, but deriving a theoretically optimal cooling scheme is challenging. Fortunately, this is not a big issue in practice; various schemes seem to do a decent job, such as $Q(t) = n - at^2$ (quadratic), $Q(t) = 2n/(1 + \exp(at))$ (sigmoidal), $Q(t) = n - a\log t$ (logarithmic), and so on.
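A sketch of the progressive variant is given below; the only change from IQ is that the cardinality bound $Q(t)$ is tightened from $n$ down to $q$, here with the quadratic schedule (the schedule choice and the iteration budget $T$ are illustrative, and `quantile_threshold` is the helper from the IQ sketch above).

```python
import numpy as np

def piq_regression(X, y, q, nu=1e-4, T=200):
    """Progressive IQ: run the IQ updates while cooling Q(t) from n down to q."""
    n, p = X.shape
    gamma = np.zeros(n)
    pinv = np.linalg.pinv(X)
    a = (n - q) / T**2                            # quadratic cooling Q(t) = n - a t^2
    for t in range(1, T + 1):
        Qt = max(q, int(round(n - a * t**2)))
        beta = pinv @ (y - gamma)
        gamma = quantile_threshold(y - X @ beta, Qt, nu)
    return beta, gamma
```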
Our progressive quantile proposal is innovative in that it is not just another intricate sampling scheme and does not increase the number of initial points. Compared with state-of-the-art algorithms, the tables and figures in Section 6.1 clearly show that it can significantly reduce the need for data resampling (and thereby the overall computational time) without sacrificing statistical accuracy. In addition, we must point out that PIQ works in the opposite direction to the class of forward pathwise algorithms and boosting methods (Bühlmann and Yu, 2003; Needell and Tropp, 2009; Zhang, 2013), which all grow a model from the null. Both our analysis and computer experiments support the backward manner in resistant learning.
An alternative (and perhaps simpler) algorithm can be developed following the principle of majorization-minimization (MM) (Hunter and Lange, 2004). The technique is general and will play a vital role in extending BCD-type PIQ to conquer a general loss. Recall the concatenated notation $\tilde X = [X, I]$, $\tilde\beta = [\beta^T, \gamma^T]^T$. Instead of directly minimizing the original objective function $f(\tilde\beta) = \|y - \tilde X\tilde\beta\|_2^2/2 + \nu\|\gamma\|_2^2/2$, we construct a surrogate function $g_\rho(\tilde\beta; \tilde\beta^-)$ by (joint) linearization,
$$g_\rho(\tilde\beta; \tilde\beta^-) = f(\tilde\beta^-) + \langle\nabla f(\tilde\beta^-),\, \tilde\beta - \tilde\beta^-\rangle + \frac{\rho}{2}\|\tilde\beta - \tilde\beta^-\|_2^2,$$
where $\rho$ is the inverse stepsize parameter to be chosen later. Then minimizing $g$ with respect to $\tilde\beta$ yields a sequence of iterates $\tilde\beta^{(t+1)} = \arg\min_{\tilde\beta}\, g_\rho(\tilde\beta; \tilde\beta^{(t)})$ s.t. $\|\gamma\|_0 \leq q$. After some algebraic manipulations, $g$ can be rephrased as $\frac{\rho}{2}\|\tilde\beta - \{\tilde\beta^- - \nabla f(\tilde\beta^-)/\rho\}\|_2^2 + c(\tilde\beta^-)$, where $c(\tilde\beta^-)$ does not depend on $\tilde\beta$. Noticing the separability in $\beta$ and $\gamma$, the resultant update (15) performs the joint gradient step $\tilde\beta^{(t)} - \nabla f(\tilde\beta^{(t)})/\rho$ and then quantile-thresholds the $\gamma$-block. As long as the stepsize is properly small, e.g., $\rho > \|\tilde X\|_2^2$, (15) is convergent (cf. Theorem 3). A similar conclusion to Theorem 2 can be proved, which supports a progressive scheme with $Q(t)$ decreasing to $q$ to relax the requirement on the initial point.
Different from (6), (15) updates $(\beta, \gamma)$ simultaneously and does not involve any matrix inversion or other expensive operations. A downside of MM, however, lies in its smaller stepsize, which not only slows down convergence but may (surprisingly) affect statistical accuracy in resistant learning; see Section 4.
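For comparison, a sketch of the MM-type update is below: $\beta$ and $\gamma$ move together by one gradient step on $f$ with inverse stepsize $\rho$, after which only the $\gamma$-block is quantile-thresholded (the $\ell_2$ term on $\gamma$ is already accounted for inside the gradient, so no extra shrinkage is applied in the projection; `quantile_threshold` is reused from the IQ sketch).

```python
import numpy as np

def mm_iq_regression(X, y, q, nu=1e-4, T=300):
    """MM-type simultaneous update for resistant regression."""
    n, p = X.shape
    beta, gamma = np.zeros(p), np.zeros(n)
    rho = np.linalg.norm(np.hstack([X, np.eye(n)]), 2) ** 2 + nu  # rho > ||Xtilde||_2^2
    for _ in range(T):
        r = X @ beta + gamma - y                  # working residual
        beta = beta - (X.T @ r) / rho             # gradient step on beta
        g = gamma - (r + nu * gamma) / rho        # gradient step on gamma
        gamma = quantile_threshold(g, q, nu=0.0)  # keep the q largest, no shrinkage
    return beta, gamma
```

The explicit $\rho$ above also illustrates the downside mentioned in the text: it is driven by $\|\tilde X\|_2$, so the implied stepsize is smaller than the exact $\beta$-solve used by BCD.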

Harnessing a general loss
This part uses the optimization techniques developed for resistant regression to deal with a general objective with $p$ possibly larger than $n$:
$$\min_{\beta,\gamma} \ l(X\beta + \gamma; y) + \frac{\nu}{2}\|\gamma\|_2^2 + P(\beta; \lambda) \quad \text{s.t.} \quad \|\gamma\|_0 \leq q, \qquad (16)$$
where $P$ is often a sparsity-promoting penalty, but can also take the form of a ridge penalty or an $\ell_0$-constraint. Here, we add the algorithmic parameter $\varsigma > 0$ to match the scale of the design and aid stepsize selection. With linearization and MM, we can extend (15) to the scaled thresholding updates (17), (18) below, applicable to a broad family of $l$ and $P$.
In robust regression, (17) degenerates to the linear $\beta$-update in (15) with $\rho = \varsigma^2$, but notice the distinctive scaled form of (17) associated with a nonlinear thresholding operator $\Theta$. The Lipschitz condition is satisfied by the squared error loss, logistic deviance, and Huber's loss, among many others, but is only used to give a universal stepsize. In fact, with a backtracking line search (Boyd and Vandenberghe, 2004), $L$ need not be known in implementation; see Remark A.1. The equations satisfied by $(\hat\beta, \hat\gamma)$ actually define a broader class of estimators that will be investigated in Section 4.
In the remainder, we use BCD to tackle (16). The alternating update involves solving two sub-problems: a $\gamma$-step, $\min_{\gamma} l(X\beta^{(t)} + \gamma; y) + \frac{\nu}{2}\|\gamma\|_2^2$ s.t. $\|\gamma\|_0 \leq q$, and a $\beta$-step (21), $\min_{\beta} l(X\beta + \gamma^{(t+1)}; y) + P(\beta; \lambda)$. The first problem can be addressed using a similar MM trick as in Theorem 3. But there is a more efficient way to directly locate the optimal support of $\gamma$ when $l(\eta; y) = \sum_{i=1}^n l_0(\eta_i; y_i)$. Specifically, with $\tilde l(t; a, b) = l_0(a + t; b) + \nu t^2/2$, the criterion becomes $\sum_{i=1}^n \tilde l(\gamma_i; x_i^T\beta^{(t)}, y_i)$ subject to the cardinality constraint. Let $l_{0,i} = \tilde l(0; x_i^T\beta^{(t)}, y_i)$ and $\tilde l_{\min,i} = \min_s \tilde l(s; x_i^T\beta^{(t)}, y_i)$. Then the index set $\mathcal{I}$ associated with the $q$ largest differences $\delta_i := l_{0,i} - \tilde l_{\min,i}$ ($1 \leq i \leq n$) gives the support. When $\nu = 0$ and $\tilde l_{\min,i} = 0$, simply sorting the losses $l_{0,i}$ achieves the goal. Next, one merely needs to solve $q$ univariate problems $\min_s \tilde l(s; x_i^T\beta^{(t)}, y_i)$, $i \in \mathcal{I}$, to get $\gamma^{(t+1)}$. For (21), an iterative MM algorithm similar to (17) can be developed, and we will show that it suffices to perform the update just once, $\beta^{(t+1)} = \Theta(\varsigma\beta^{(t)} - X^T\nabla l(X\beta^{(t)} + \gamma^{(t+1)}; y)/\varsigma;\ \lambda)/\varsigma$. Again, we advocate pursuing $\beta$-sparsity via the hybrid "$\ell_0 + \ell_2$" regularization in place of (21), and, by use of linearization, the resultant progressive quantile-thresholding for $\beta$ takes the form of a $\Theta^{\#}$-based gradient update with cardinality bound $q_\beta^{(t)}$, where $\rho \geq L\,M_X(2q_\beta)$ (recall the definition of $M_X$ in Section 1) and the sequence $q_\beta^{(t)}$ drops to $q_\beta$ eventually. (One might want to merge the two $\ell_0$-constraints into $\|\tilde\beta\|_0 \leq q_\beta + q_\gamma$ to simplify the computation, but this would result in a larger statistical error due to the enlarged search space; see the analysis in Section 4.) At the end of the section, we make a comparison between the MM-type and the BCD-type PIQ algorithms. The first kind has a single-loop structure and maintains low per-iteration complexity. We do observe that MM is faster than BCD in some large-$p$ classification problems (say when $p > 500$), but in the conventional $n > p$ setup and many other scenarios, BCD is more accurate and efficient. We will mainly use the BCD-type PIQ in applications and analysis unless otherwise specified.
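To illustrate the $\gamma$-step just described for a sample-additive loss, the sketch below uses the logistic deviance $l_0(\eta; y) = -y\eta + \log(1 + e^{\eta})$; the univariate minimizations are done by bisection on the (monotone) derivative, and the bracket and iteration counts are illustrative choices rather than part of the algorithm's specification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def logistic_loss(eta, y):
    """l0(eta; y) = -y*eta + log(1 + exp(eta)), computed stably."""
    return -y * eta + np.logaddexp(0.0, eta)

def gamma_step_logistic(eta, y, q, nu=1e-4, lo=-50.0, hi=50.0, iters=60):
    """Locate the optimal support of gamma: rank samples by
    delta_i = l0(eta_i; y_i) - min_s {l0(eta_i + s; y_i) + nu s^2 / 2}."""
    n = len(y)
    gamma_opt = np.zeros(n)
    for i in range(n):
        a, b = lo, hi                             # derivative is increasing in s
        for _ in range(iters):
            m = 0.5 * (a + b)
            grad = -y[i] + sigmoid(eta[i] + m) + nu * m
            a, b = (a, m) if grad > 0 else (m, b)
        gamma_opt[i] = 0.5 * (a + b)
    delta = logistic_loss(eta, y) - (logistic_loss(eta + gamma_opt, y)
                                     + 0.5 * nu * gamma_opt**2)
    support = np.argsort(-delta)[:q]              # q largest improvements
    gamma = np.zeros(n)
    gamma[support] = gamma_opt[support]
    return gamma
```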

Statistical Accuracy of A-estimators
A major theoretical challenge brought by the class of estimators from our computational algorithms is their lack of global optimality and stationarity, which significantly differentiates them from conventional M-estimators and Z-estimators (van der Vaart, 1998). We call such estimators A-estimators due to their alternative optimality nature. In fact, even the alternative optimality may be merely in a local sense. As far as we know, there is a lack of statistical accuracy studies for such algorithm-driven estimators. In this section, we aim to develop new techniques that reveal tight error rates of A-estimators as well as the comprehensive roles of algorithmic parameters in resistant learning, in both low and high dimensions.
Introducing the notion of noise in the non-likelihood setting is an essential component. We define the effective noise $\epsilon := -\nabla l(X\beta^* + \gamma^*; y)$ associated with the statistical truth $\tilde\beta^* = [\beta^{*T}, \gamma^{*T}]^T$, which is not affected by the regularizer. In a GLM with cumulant function $b$ and canonical link function $g = (b')^{-1}$, the deviance-based loss is given by $l(\eta) = -\langle y, \eta\rangle + \langle 1, b(\eta)\rangle$, and so $\epsilon = y - g^{-1}(X\beta^* + \gamma^*) = y - \mathbb{E}(y)$.
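For instance, with the logistic deviance (a Bernoulli GLM, where $b(\eta_i) = \log(1 + e^{\eta_i})$), the effective noise reduces to the centered response:
$$l(\eta; y) = -\langle y, \eta\rangle + \langle 1, \log(1 + e^{\eta})\rangle
\quad\Longrightarrow\quad
\epsilon = y - \frac{1}{1 + e^{-(X\beta^* + \gamma^*)}} = y - \mathbb{E}(y),$$
which is bounded (each component lies in $(-1, 1)$) and hence sub-Gaussian no matter how the labels are contaminated.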
In general, however, the effective noise jointly determined by the loss and $y$ may not be the raw noise on $y$. For example, a robust loss associated with a bounded $\psi$ function always gives rise to a bounded $\epsilon$, which is thereby sub-Gaussian regardless of the distribution of $y$, making our analysis nonparametric in nature; the same is true for many classification losses.
Unless otherwise specified, we assume that $\epsilon$ is a sub-Gaussian random vector with mean zero and scale bounded by $\sigma$ (cf. Definition 1 in Appendix A.2). This gives a broad family including Gaussian and bounded random variables; in particular, any Lipschitz $l$ results in a sub-Gaussian effective noise. However, our analysis is not restricted by sub-Gaussianity; Theorem A.4 in Appendix A.11 gives a universal risk bound when $\epsilon$ has a general Orlicz norm. Also, we allow $\gamma_i^*$ to take arbitrarily large values to capture extreme anomalies; another possible way is to assume that the $\gamma_i$ are i.i.d. following a certain distribution, but treating $\gamma^*$ as an $n$-dimensional unknown parameter vector (with sparsity) is convenient for detecting outliers and for handling models with non-additive noise.
We work in a general setting with $p$ possibly larger than $n$ unless otherwise mentioned. For simplicity, all $\ell_2$-shrinkage parameters are set to zero, and we assume $\nabla l$ is $L$-Lipschitz continuous and take $L = 1$ without loss of generality.
Let us first consider the advocated doubly constrained form introduced at the end of Section 3 to exemplify the error bounds. As aforementioned, to make the statistical error study more realistic, rather than restricting attention to the set of globally optimal solutions, one needs to pay particular attention to the A-estimators defined by the alternative optimality conditions (24a) and (24b). But the $\hat{\tilde\beta}$ delivered by the fast PIQ algorithm may not possess the global alternative optimality in (24a). The good news is that, according to Theorem A.1 in Appendix A.4, all these global or local A-estimators satisfy the fixed-point equations (25) for any $\rho > M_X(2q_\beta)$. Our statistical accuracy analysis is applicable to the whole class (and to MM-type non-global estimators as well, with slight modification). For clarity, define the symmetrized generalized Bregman function $\bar\Delta_l$ associated with the loss, which appears in the regularity condition (26) below.

Theorem 4. Assume that there exists some $\delta > 0$ such that (26) holds for sparse $\tilde\beta$, $\tilde\beta'$ satisfying $\|\beta\|_0 \leq \vartheta s^*$, $\|\beta'\|_0 \leq s^*$, $\|\gamma\|_0 \leq \vartheta o^*$, $\|\gamma'\|_0 \leq o^*$. Then the error bound (27) holds, where $C, c$ are positive constants. More generally, if the starting point $(\beta^{(0)}, \gamma^{(0)})$ is of good quality, (26) can be relaxed to the generalized regularity condition (28) for some constant $C > 0$.

The regularity condition is an extension of the classical $\ell_2$-form restricted isometry, imposed on the augmented design. Indeed, for $l(\eta; y) = \|\eta - y\|_2^2/2$, the condition can be written as a restricted lower bound involving $\|X\beta + \gamma\|_2^2$ for some $\delta > 0$ and large enough $\vartheta$. According to the proof, under $q_\beta = p$, the term $\rho\|\beta\|_2^2$ on its right-hand side can be dropped, and so the condition degenerates to the ordinary RIP used in Theorem 2, owing to the orthogonal decomposition $\|X\beta + \gamma\|_2^2 = \|X\{\beta + (X^TX)^{+}X^T\gamma\}\|_2^2 + \|(I - H)\gamma\|_2^2$. We have argued in Section 3 that in low dimensions, low leverage implies low mutual coherence, which in turn implies RIP. But the plain definition of leverage cannot incorporate outlier scarcity, and it is noninformative when $p > n$, since $H = I$ when $X$ has full row rank. The Bregman-based restricted isometry on the augmented design gives an extension of the notion of leverage to high dimensions and non-quadratic losses.
Theorem 4 gives useful implications regarding the algorithmic parameters as well. For example, from (26), to enjoy the statistical guarantee, the inverse stepsize ρ must be properly small. The finding is insightful and novel. Throughout the machine learning literature, slow learning rate is recommended when training a nonconvex learner. But our statistical error analysis strongly cautions against using an extremely small step size when nonconvexity occurs. The BCD-type algorithm design appears advantageous in this aspect. Moreover, a high-quality starting point, usually expensive in computation, certainly brings some statistical benefit seen from the generalized regularity condition (28) (consider an extreme case R = 1). But simply enlarging ϑ gives another promising way to relax (26), and supports the progressive tightening proposal in Section 3.
When $\vartheta, \delta$ are treated as constants and $s^* \vee o^* > 0$, (27) shows that A-estimators can achieve an error bound of the order $\sigma^2 s^*\log(ep/s^*) + \sigma^2 o^*\log(en/o^*)$, which reduces to $p\sigma^2 + \sigma^2 o^*\log(en/o^*)$ when there is no $\beta$-sparsity. This error rate involves the number of outliers apart from the number of relevant predictors. But the influence of outliers remains controlled in the sense that the bound does not grow with the magnitude of $\gamma^*$. In the situation of relatively few outliers, $o^*\log(en/o^*) \lesssim s^*\log(ep/s^*)$, the proposed method can attain the celebrated rate $\sigma^2 s^*\log(ep/s^*)$ for (clean) large-$p$ variable selection. On the other hand, one should be aware that on 'big dirty' data with large $n$ and $o^*$, $\sigma^2 o^*\log(en/o^*)$ could be the dominant error term.
Remark 4.1. One may wonder whether the overall error rate in Theorem 4 can be further improved by pursuing a global minimizer. Theorem A.2 shows that this is not the case, but the regularity condition gets relaxed. Concretely, the symmetrized GBF $\bar\Delta_l$ in (26) will be replaced by $\Delta_l$, and the right-hand side of (26) will be just zero. Hence if the loss $l$ defined on the systematic component is $\mu$-strongly convex (e.g., in regression), the error bound holds with $\delta = \mu$, free of any RIP or leverage restriction. This shows the inherent soundness of the proposed resistant learning framework.
Remark 4.2. The $\ell_2$-recovery result proved in Theorem 4 is fundamental, and from it one can also obtain estimation error bounds. For example, Theorem A.3 shows that under a slightly stronger regularity condition, a corresponding estimation error bound of the same order holds with high probability. In the classical $n > p$ setup with no penalty imposed on $\beta$, it becomes $\|\hat\beta - \beta^*\|_2^2 \lesssim \{p\sigma^2 + o^*\sigma^2\log(en/o^*)\}/n$. See Appendix A.5 for details.
Our theoretical techniques can cope with the penalized form of (16) as well. For instance, Theorem 5 studies the outlier-resistant lasso (cf. Section 2), where $P(\beta; \lambda) = \lambda\|\beta\|_1$. (The theorem proved in Appendix A.6 is actually stated for a general loss, and its proof applies to all subadditive penalties.)

Theorem 5. Let $\lambda = A\sigma\log(ep)$ with $A$ a sufficiently large constant. Then if $(\hat\beta, \hat\gamma)$ is an A-estimator of the resistant lasso for some $\varsigma$ and $\vartheta$ (cf. Theorem A.1), the risk bound holds under the assumption that there exists a large enough $K$ such that the associated regularity condition is satisfied, where $\varepsilon$ is a positive constant, $\mathcal{J} = \{j: \beta_j^* \neq 0\}$ and $J = |\mathcal{J}|$. When $K$ and $\vartheta$ are treated as constants, the error rate of the resistant lasso is slightly worse (by a logarithmic term) than (27) for the doubly constrained form.

Minimax Lower Bounds and Model Comparison
We first derive some universal lower bounds that are satisfied by all estimators. These nonasymptotic results are stated for GLMs. Let $\eta^* = \tilde X\tilde\beta^*$, let $y_i \mid \eta_i^*$ be independent and follow a distribution in the regular exponential family with dispersion $\sigma^2$, and let $l_0$ be the negative log-likelihood function (cf. Appendix A.7 for details); consider a general signal class, with $p$ possibly larger than $n$, of jointly sparse $(\beta^*, \gamma^*)$ with $\|\beta^*\|_0 \leq s^*$ and $\|\gamma^*\|_0 \leq o^*$.

Theorem 6. In the regular exponential dispersion family with $n \geq 2$, let $I(\cdot)$ be any nondecreasing function with $I(0) = 0$, $I \not\equiv 0$. Then there exist positive constants $c, c'$, depending on $I(\cdot)$ only, such that the minimax lower bounds (i) and (ii) hold, where $(\hat\beta, \hat\gamma)$ denotes any estimator of $(\beta^*, \gamma^*)$.
Finer analysis and lower bounds are presented in Appendix A.7. We illustrate the conclusion for classification. Because the logistic deviance has a $1/4$-Lipschitz continuous gradient, the regularity condition in (i) is satisfied for any $\kappa \geq M_{\tilde X}(s^*, o^*)/4$ (cf. Section 1 and Appendix A.3). So when $\kappa \leq cn$, $\{s^*\log(ep/s^*) + o^*\log(en/o^*)\}/n$, or $\{p + o^*\log(en/o^*)\}/n$ when $s^* = p$, is the desired minimax lower error rate for all classifiers, both with constant positive probability and in expectation (corresponding to $I(t) = 1_{t\geq c}$ and $I(t) = t$). A similar conclusion for prediction can be drawn from (ii). To the best of our knowledge, such results using GBFs for resistant GLMs are novel in robust statistics.
Together with the upper error bounds (e.g., Theorem 4 and Theorem A.3), PIQ enjoys minimax rate optimality provided that $q_\gamma$ is not set too large relative to $o^*$. It is common in practice to directly specify the cardinality bound based on domain knowledge, say, $\|\gamma\|_0 \leq 0.25n$, for the sake of $\beta$-estimation. On the other hand, if the goal is to identify all outliers so as to have the best predictive model, one ought to choose the regularization parameters in a more data-adaptive manner. This is regarded as a vital and challenging task in robust statistics (Ronchetti et al., 1997; Müller and Welsh, 2005; Salibian-Barrera and Van Aelst, 2008). Indeed, outliers may render resampling-based cross-validation inappropriate, and although there is an abundance of information criteria, which one is sound in theory and reality lacks a clear conclusion. The joint variable selection in high dimensions and the flexible choice of the loss exacerbate the issue. (The $\ell_2$-shrinkage parameter is, however, much less sensitive and can be fixed at a mild value.) In the general setup where both $\beta$ and $\gamma$ are possibly sparse, the upper and lower error bounds suggest a universal function $P(\beta, \gamma)$ of $(\|\beta\|_0, \|\gamma\|_0)$, given in (34), to penalize model complexity. We refer to the associated criterion as the predictive information criterion (PIC). In the case that no variable selection is needed ($\|\beta\|_0 = p$), (34) can be rewritten as (35) for pure outlier identification.
(34) has some involved logarithmic terms and is not a simple $\ell_0$-type penalty, but it leads to an ideal model comparison criterion with solid finite-sample theoretical support.
Theorem 7. Assume that $\epsilon$ defined in (22) is a sub-Gaussian random vector with mean zero and scale bounded by a constant, and let $\tilde\beta^* = [\beta^{*T}, \gamma^{*T}]^T$ denote the statistical truth. Then for a sufficiently large constant $A$, any $(\hat\beta, \hat\gamma)$ that minimizes the criterion (36) enjoys the desired error rate with high probability.

Theorem 7 achieves the same optimal error rate as Theorem 4 and Theorem 5, but a major difference here is that no regularization parameters (like $q$, $\lambda$) are involved, only some absolute constants. Moreover, Theorem 7 does not require any RIP or large signal-to-noise-ratio conditions.
When the noise distribution has a dispersion parameter $\sigma^2$, Theorem 7 still applies, but the penalty in (36) becomes $A\sigma^2 P(\beta, \gamma)$ with an unknown factor. A preliminary robust scale estimate can possibly be used. But an appealing result for robust regression is that the estimation of $\sigma$ can be bypassed entirely. For clarity, Theorem 8 assumes $n > p$ and no sparsity in $\beta$, in which situation a scale-free form of PIC, motivated by She and Tran (2019), is given by (38), where $\alpha_1, \alpha_2$ are absolute constants. Again, as long as the outliers are not overly numerous, the soundness of (38) places no limit on their leverage or outlyingness. (More generally, using (34) in place of (35) gives a more general form of scale-free PIC for joint variable selection and outlier identification; Theorem 8' proved in Appendix A.9 implies Theorem 8 and applies to $p > n$.)

Theorem 8. Let $y = X\beta^* + \gamma^* + \epsilon$, where $X$ has full column rank, the $\epsilon_i$ are independent sub-Gaussian$(0, \sigma_i^2)$ with $\mathbb{E}\epsilon_i^2 = \sigma_i^2 = c_i\sigma^2$ for some positive constants $c_i$, and $\sigma^2$ is unknown. Define $l(X\beta + \gamma; y) = \|X\beta + \gamma - y\|_2^2$ and $P(\gamma) = \|\gamma\|_0\log(en/\|\gamma\|_0)$.
Combining Theorem 6 and Theorem 8 supports the modified BIC proposed by She and Owen (2011) on the basis of empirical studies. However, (38) has some subtle but important differences from BIC, and the resemblance is just a coincidence. First, unlike most information criteria, its derivation and justification do not require an infinite sample size. Second, the factor $\log(en/\|\gamma\|_0)$, not exactly $\log n$ as in BIC (which assumes clean data), arises from the search for outliers. Third, the general PIC complexity term (34) (cf. Theorem 7 or Theorem 8') has the overall inflation effect added to the degrees of freedom, as opposed to the standard multiplicative relation in AIC, BIC, and many others.
In common with most nonasymptotic studies, our theorems do not show tight constants. However, these numerical constants do not depend on a specific problem, and in some easier cases they can be determined by Monte Carlo experiments; for example, when $n > p$, we recommend $5.5\|\gamma\|_0 + \|\gamma\|_0\log(en/\|\gamma\|_0)$ in (38) for robust regression. Theoretically deriving the optimal constants requires much finer analysis, which we do not pursue in the current paper.
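As a rough illustration of PIC-based tuning (a sketch only: the $n\log(\mathrm{RSS}/n)$ fit term below is an assumption of this sketch and not the statement of (38)), one can pair the recommended complexity term $5.5\|\gamma\|_0 + \|\gamma\|_0\log(en/\|\gamma\|_0)$ with a robust-regression fit term and tune $q$ over a grid; `piq_regression` is the helper from the Section 3.1 sketch.

```python
import numpy as np

def pic_select_q(X, y, q_grid, nu=1e-4, T=200):
    """Pick the cardinality bound q by a scale-free PIC-style criterion.
    The n*log(RSS/n) term is an assumed stand-in for the fit part of (38);
    the complexity term uses the constants recommended in the text."""
    n = len(y)
    best = None
    for q in q_grid:
        beta, gamma = piq_regression(X, y, q, nu=nu, T=T)
        o = int(np.count_nonzero(gamma))
        rss = float(np.sum((y - X @ beta - gamma) ** 2))
        pen = 5.5 * o + o * np.log(np.e * n / max(o, 1))
        crit = n * np.log(max(rss, 1e-12) / n) + pen
        if best is None or crit < best[0]:
            best = (crit, q, beta, gamma)
    return best[1], best[2], best[3]
```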

Simulations
In this part, we perform simulation studies to compare PIQ with some popular robust and/or sparse estimation methods in $p < n$ and $p > n$ setups. Unless otherwise mentioned, the raw predictor matrix $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n\times p}$ is generated in the default way, with rows drawn i.i.d. from $N(0, \Sigma)$, where $\Sigma$ is a Toeplitz design covariance matrix with $\Sigma_{ij} = \rho^{|i-j|}$. Here, we consider regression and classification models contaminated with highly-leveraged outliers in the following four setups. More experimental results in other settings are reported in Appendix A.12 due to limited space. Recall $o^* = \|\gamma^*\|_0$, $s^* = \|\beta^*\|_0$.
Example 2 ($n > p$, classification): $n = 1000$, $p = 10$, $\rho = 0.5$, $\beta^* = [3, 3, 1.5, 1.5, 3, 3, -3, -3, 3, 3]^T$. To introduce outliers, we change the first $o^*$ rows of $X$ to $[3, \ldots, 3]$ so that the first $o^*$ elements in $X\beta^*$ are 45, and set $\gamma_i^* = -90$ for $1 \leq i \leq o^*$ and $0$ otherwise. The response vector is generated according to the Bernoulli distribution with mean $1/(1 + \exp(-x_i^T\beta^* - \gamma_i^*))$ for the $i$th observation. The following 11 methods are included for comparison: S-estimator (Rousseeuw and Yohai, 1984), LTS (Rousseeuw, 1985), Bianco and Yohai's robust logistic regression (B-Y) (Bianco and Yohai, 1996), TLE (Hadi and Luceño, 1997), robust quasi-likelihood estimator (QLE) (Cantoni and Ronchetti, 2001), quantile lasso (QL) (Belloni and Chernozhukov, 2011), robust LARS (RLARS) (Khan et al., 2007), sparse LTS (S-LTS) (Alfons et al., 2013), sparse maximum tangent-likelihood estimator (S-MTE) (Qin et al., 2017), elasticnet LTS for classification (enetLTS) (Kurnaz et al., 2018) and PENSE (Freue et al., 2019). To reduce the interference of different regularization parameter tuning schemes, and to make a fair comparison, sparsity parameters and cut-off values for the sake of outlier identification or variable selection are all chosen to yield $1.5o^*$ or $1.5s^*$ nonzeros using the true model. The other parameters are set to their default values. In calling PIQ, we use the BCD type, with $\nu$ fixed at $10^{-4}$. A quadratic cooling schedule $Q(t) = -at^2 + U$ with $a = (U - L)/T^2$ ($0 \leq t \leq T$) is applied in regression, and a logarithmic cooling schedule is used in classification, where $L$ is the desired cardinality, $U = n$ or $p$ for outlier identification or variable selection, and $T = 200$. Owing to the progressive backward optimization, a simple single starting point $\beta^{(0)} = 0$ already seems satisfactory in PIQ. In each setup, we repeat the experiment 50 times and evaluate the performance of an algorithm using the following metrics. For outlier identification, we report the masking or missing (M) rate, as well as the rate of successful joint detection (JD); see She and Owen (2011). Concretely, the masking probability is the fraction of undetected true outliers (misses), and the JD is the fraction of simulations with zero misses. For variable selection, we report the false alarm (FA) rate (the fraction of spuriously identified variables) in addition to M and JD. In regression experiments, the mean square error on $\beta$, denoted by Err, is used to evaluate estimation accuracy, while in classification, Err refers to the misclassification error on a separate clean test set containing 10,000 observations. The computational time in seconds, denoted by T, is also included in the tables or figures to describe the complexity of each algorithm. Table 1 shows a performance comparison for robust regression and robust classification in Example 1 and Example 2, respectively. To investigate algorithm scalability, we also give a computational time comparison in Figure 1 with respect to the dimension $p$. According to the table, with less than 5% outliers, all methods show satisfactory results for regression or classification. But as the percentage of outliers goes up to 10% (say) or more, the performance of many conventional methods degrades significantly. Also, some algorithms have poor numerical stability when $p$ is just over 40. In comparison, PIQ is much more accurate and robust, and its scalability is evident.
The advantages of PIQ are even more impressive in Table 2 which summarizes the high-dimensional results. In particular, its boost in accuracy and outlier-resistance is not accompanied by a sacrifice in computational time relative to other robust methods. PIQ offered substantial time savings: its computational cost was at most about one fourth of that of the other algorithms.
We also conducted experiments with different correlation strengths, covariance structures and sparsity levels, and similar conclusions can be drawn from the performance comparison; see Appendix A.12 for details.

Robust classification of spam email
The email spam dataset (available at ftp.ics.uci.edu) contains 57 predictors including relative frequencies of 54 most commonly occurring words and characters and 3 statistics of capital letters. The total number of emails is 4,601, of which 1,813 are spam (y = 1) and 2,788 are non-spam (y = 0). We would like to know whether accommodating mislabeled emails (if any) could lead to improved spam/non-spam classification. We applied PIQ with the logistic deviance and the Huberized hinge loss (Chapelle, 2007), denoted by PIQ-L and PIQ-H, respectively, with PIC for tuning. Table 3 compares PIQ with logistic regression, B-Y, TLE, and QLE (cf. Section 6.1) in terms of misclassification error and F1-score (Chinchor, 1992), both averaged over 100 data splits (70% for training and 30% for test). From the table, TLE and QLE failed due to the large sample size. PIQ outperforms the other methods and the choice of the loss does not affect its performance on this dataset.
We then studied the subset of outliers identified by PIQ. Take PIQ-L for instance: there are 73 nonzero $\hat\gamma_i$'s, the smallest magnitude being 10.97. To verify whether there is indeed a distinction between the outliers ($o = 1$) and the clean observations ($o = 0$), we examined 6 indicative features: free, !, $, hp, 650, and cap avg. The first three words and signs are often used in spam to attract people's attention; hp and 650 are related to the data creator's employer and area code; cap avg measures the average length of uninterrupted sequences of capital letters. Given each feature, we made a 2 × 2 table with its rows corresponding to spam and non-spam ($y = 1, 0$) and columns corresponding to outlying and clean ($o = 1, 0$). The cross-product ratio or odds ratio visualizes the strong association between $y$ and $o$, as shown in Figure 2. Indeed, even the odds ratios closer to 1 are either as small as $\exp(-2.8) = 0.06$ or as large as $\exp(4.89) = 133$, demonstrating a strong discrepancy between the outlier subset and the clean subset. We also built an ANOVA model on each feature to test the significance of $o$; all the obtained p-values are below 1e-10, clear evidence of outlyingness. The data analysis suggests that even in some conventional datasets, mislabeling frequently occurs and guarding against outliers is beneficial. It is fair to say that modern statistical analysis involves more and more large-scale datasets, but with bigger data also come more junk and errors. This could be a severe issue for binary classification, since any inaccuracy in labeling simply means the response goes to the other extreme. Applying a classifier designed with no consideration of gross outliers could be risky in reality, and our resistant learning framework offers an effective fix in this regard.

Robust neural network for Parkinson's disease detection
Neural networks have attracted a great deal of attention in machine learning. In this experiment, we adapt the PIQ technique to a neural net model and test its performance for Parkinson's disease (PD) detection. There has been a surge of interest in speech analysis of PD recently, as voice recordings from these patients tend to show typical patterns like aperiodic vibrations. We use the dataset from Little et al. (2007) which collects a range of biomedical voice measurements from normal people and patients with PD, where the 22 predictors for 195 recordings include various measures of vocal fundamental frequency and variation in amplitude, as well as some nonlinear metrics constructed from raw voice recordings. We randomly split the dataset into a training subset (60%) and a test subset (40%).
Previous studies suggest the presence of nonlinear effects and interactions of the features in this challenging classification task (Little et al., 2007; Sakar et al., 2013). Hence we designed a feedforward neural network that consists of an input layer, two dense layers with 22 and 12 nodes respectively, and one output layer. The dense layers deploy the popular rectified linear units (ReLU) (Nair and Hinton, 2010), and the output node uses a softmax activation. We trained the network for 200 epochs and attained a misclassification error rate of 9.1% on the test data; in contrast, an SVM with a nonlinear RBF kernel gave 17.9%. (Meanwhile, classical robust classification methods like B-Y, QLE, and TLE all behaved poorly, with misclassification error rates above 25%.) On the other hand, the noisy signals and potential outliers suggest the need for robustification. But the overall network criterion barely resembles that of regression or logistic regression due to its substantial nonlinearity and hierarchy, and how to reweight the samples in this sophisticated model to limit the influence of outliers is far from obvious. Following the additive robustification scheme, we simply added sample-indicator features into the input layer (cf. Figure 3), accompanied by a group $\ell_0 + \ell_2$ regularization ($q = 0.05n$, $\nu =$ 1e-4) to correct for sample outlyingness in constructing predictive factors. Our optimization-based PIQ algorithm can be seamlessly incorporated into the process of back-propagation. Though a simple modification, the robustified neural network gave an impressive misclassification error rate of 6.4%, an improvement of about 30% over that of the vanilla neural network.
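A simplified PyTorch sketch of this robustification is given below. It departs from the exact architecture of Figure 3 in two labeled ways: the per-sample outlyingness parameters are added directly to the output logit rather than entering through indicator features in the input layer, and a single-logit output with binary cross-entropy replaces the softmax output node; the layer widths follow the description above, while the optimizer and learning rate are illustrative.

```python
import torch
import torch.nn as nn

class RobustNet(nn.Module):
    def __init__(self, n_samples, p_in=22):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(p_in, 22), nn.ReLU(),
            nn.Linear(22, 12), nn.ReLU(),
            nn.Linear(12, 1),
        )
        # per-sample outlyingness parameters (simplification: added to the logit)
        self.gamma = nn.Parameter(torch.zeros(n_samples))

    def forward(self, x, idx):
        return self.body(x).squeeze(-1) + self.gamma[idx]

def quantile_project_(gamma, q, nu=1e-4):
    """Project gamma onto {||gamma||_0 <= q}, with mild l2 shrinkage, in place."""
    with torch.no_grad():
        keep = torch.topk(gamma.abs(), q).indices
        new = torch.zeros_like(gamma)
        new[keep] = gamma[keep] / (1.0 + nu)
        gamma.copy_(new)

def train(model, X, y, q, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    idx = torch.arange(len(y))
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X, idx), y.float())
        loss.backward()
        opt.step()
        quantile_project_(model.gamma, q)     # PIQ-style cardinality projection
    return model
```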

Summary
The work studied how to gain outlier resistance for a given loss that is not necessarily quadratic and may go beyond the likelihood setup. Motivated by the method of trimming, we proposed an $\ell_0 + \ell_2$ regularized learning framework. We showed that pursuing its (global) minimizers can cope with extreme outliers regardless of leverage and outlyingness, but on large-scale data, there is always a balance between statistical accuracy and computational efficiency. Hence a more meaningful and deeper question is how to design a scalable algorithm that alleviates the starting-point requirement and the regularity conditions needed to obtain the desired order of statistical accuracy.
In this work, the interplay between high-dimensional statistics and nonconvex optimization led to fruitful results for resistant learning on the learning rate, cardinality control, the choice of regularization parameters, and so on. In particular, a novel progressive backward optimization scheme is brought forward which, unlike subset sampling, is cost effective and can significantly improve the statistical performance of iterative quantile-thresholding. We hope that this work raises concern about outliers among (big) data analysts, as well as providing a sensible and universal means of boosting resistance in regression, classification, and more sophisticated learning tasks.

A Proofs
A.1 Proof of Theorem 1

First, we introduce a lemma to be used for proving the three statements.
Proof of Lemma 1. Recall that the loss is assumed to be bounded from below. Without loss of generality, let $\inf_\eta l_0(\eta; y) = 0$ for all $y \in \mathcal{Y}$ (otherwise, we can redefine the loss function as $l_0(\eta; y) - \inf_\eta l_0(\eta; y)$).

A.2 Proof of Theorem 2
First, we have the following result to argue the (alternative) optimality of $\gamma^{(t+1)}$ given $\beta^{(t)}$; cf. Lemma C.1 in She et al. (2013).
The below definition of a sub-Gaussian random vector is standard in the literature; see, e.g., Vershynin (2012).
can be bounded using the following lemma.
Lemma 4. Let $\epsilon$ be a sub-Gaussian random vector with mean zero and scale bounded by $\sigma$. Given $X \in \mathbb{R}^{n\times p}$ and $0 \leq J \leq p$, there exist universal constants $A, C, c > 0$ such that for any $a > 0$, the following event occurs with probability at most $C\exp(-ct)p^{-cA^2}$ for any $t \geq 0$.
In fact, according to (A.11), we may choose any $a \geq 1$ and $b > 0$, and so the linear convergence rate holds if $\varepsilon > 1/(2\sqrt{\vartheta}) - (2 - 1/\sqrt{\vartheta})\nu$. To show the conclusion on the sequence of $\beta$-iterates, notice that the following holds for any $t \geq 1$, where $r(X)$ denotes the rank of $X$. The last inequality can be shown from the Hanson-Wright inequality (Rudelson and Vershynin, 2013) under the independent sub-Gaussian assumption, since $\|H\|_2 = 1$ and $\|H\|_F^2 = r(X)$. Then applying the Cauchy-Schwarz inequality gives the result (details omitted).

A.3 Proof of Theorem 3
First, a thresholding function (She and Owen, 2011) is defined as a real-valued function $\Theta(t; \lambda)$; it applies componentwise when $t$ or $\lambda$ is replaced by a vector. Throughout the paper, we assume that $\lambda$ is the threshold parameter.
These equations are also satisfied by the fixed points of our BCD algorithms seen from the iterate updates given in Section 3.

A.5 Proof of Theorem 4
Recall that given any $\beta$, $\mathcal{J}(\beta)$ denotes its support, i.e., $\mathcal{J}(\beta) = \{j: \beta_j \neq 0\}$, and $J(\beta) = |\mathcal{J}(\beta)|$, $s^* = J(\beta^*)$, and $o^* = J(\gamma^*)$. Given a matrix $A$, let $\mathcal{P}_A$ be the orthogonal projection matrix onto the column space of $A$ and $\mathcal{P}_A^{\perp}$ be its orthogonal complement. The range of $A$ is denoted by $\mathcal{R}(A)$. For convenience, we also use $\mathcal{P}_A$ to stand for $\mathcal{R}(A)$ or $\mathcal{R}(\mathcal{P}_A)$. In the proofs, $[n]$ denotes the set $\{1, \ldots, n\}$, and we use $C$, $c$, and $L$ to denote universal constants which are not necessarily the same at each occurrence.
Lemma 5. Any $(\hat\beta, \hat\gamma)$ satisfying (25) can be re-characterized by (A.27). The proof of the lemma follows from Lemma 2, and the details are omitted. A pleasant fact is that (A.27) gives a joint optimization problem, rather than an alternative one. Also, notice that $g$ may not majorize $l$.
To achieve this purpose, we decompose $\tilde X(\hat{\tilde\beta} - \tilde\beta^*) =: \tilde X\tilde\Delta$ into two parts $I$ and $II$. It is easy to verify that $\|I\|_2^2 + \|II\|_2^2 = \|\tilde X\tilde\Delta\|_2^2$. We use Lemma 6 and Lemma 7 to handle $\langle\epsilon, I\rangle$ and $\langle\epsilon, II\rangle$, respectively. The proof of Lemma 6 is given at the end; the proof of Lemma 7 is similar and omitted.
where $L$, $C$, $c$ are universal constants.
…, $|J_1| \le s$, $|J_2| \le o\}$. Given any $a, b, a' > 0$, we have the stated bound. Lemma 6 indicates that $R_{s,o}$ is bounded above by a constant times $\sigma^2$ in expectation when choosing $b/a$ to be a constant greater than $1/4$. Note that when $s = o = 0$, $R_{0,0} = 0$. When $s \ge 1$ and $o \ge 1$, a tail bound holds for any $t \ge 0$, where we used $\log\binom{p}{J} \le CJ\log(ep/J)$ and $C, c > 0$ are constants. Similarly, when $s = 0$, we know $P(R_{s,o} > t\sigma) \le C\exp(-ct^2)\,n^{-c}$ for any $0 \le o \le n$, and when $o = 0$, $P(R_{s,o} > t\sigma) \le C\exp(-ct^2)\,p^{-c}$ for any $0 \le s \le p$. From these tail bounds, we get an expectation bound for $\langle \epsilon, I\rangle$; repeating the treatment using Lemma 7 gives the analogous bound for $\langle \epsilon, II\rangle$. Combining the two bounds and using the monotone property of $P$, we get the stated inequality for any $4b > a > 0$ and $a' > 0$, where $C, c$ are positive constants. It follows from (A.30), (A.35), and the regularity condition (26) that the desired bound holds, where $L$ is a constant. Taking $a = a' = 8/\delta$ and $b = 4/\delta$ gives the conclusion.
To bound the metric entropy $\log \mathcal{N}(\varepsilon, \Gamma_{J_1, J_2}, d)$, where $\mathcal{N}(\varepsilon, \Gamma_{J_1, J_2}, d)$ is the smallest cardinality of an $\varepsilon$-net that covers $\Gamma_{J_1, J_2}$ under $d$, we notice that $\alpha$ always lies in an $(r_1 \wedge J_1 + r_2 \wedge J_2)$-dimensional ball, and the number of such balls is at most $\binom{p_1}{J_1}\binom{p_2}{J_2}$. The entropy bound then follows by a standard volume argument, where $C$ is a universal constant. From Dudley's integral bound, the expected supremum is controlled accordingly. The conclusion follows from the Cauchy-Schwarz inequality.
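For completeness, we record the two standard facts being invoked; the radius $R$ and diameter $D$ below are generic placeholders rather than the paper's notation. For a $d$-dimensional Euclidean ball of radius $R$, a volume comparison gives $\mathcal N(\varepsilon) \le (1 + 2R/\varepsilon)^d$, so that
\[
\log \mathcal N(\varepsilon, \Gamma_{J_1,J_2}, d) \le (r_1\wedge J_1 + r_2\wedge J_2)\log(1 + 2R/\varepsilon) + \log\binom{p_1}{J_1} + \log\binom{p_2}{J_2},
\]
and Dudley's entropy integral bounds the expected supremum of a process with sub-Gaussian increments with respect to $d$ by
\[
C\int_0^{D} \sqrt{\log \mathcal N(\varepsilon, \Gamma_{J_1,J_2}, d)}\, d\varepsilon .
\]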

A.6 Proof of Theorem 5
We prove a more general theorem for a strongly smooth loss l with ∇l 1-Lipschitz.
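As a reminder (a standard fact, not a result specific to this paper), a 1-Lipschitz gradient implies 1-strong smoothness via the descent lemma:
\[
l(\eta') \le l(\eta) + \langle \nabla l(\eta), \eta' - \eta\rangle + \tfrac{1}{2}\|\eta' - \eta\|_2^2 \quad \text{for all } \eta, \eta' \in \mathbb{R}^n,
\]
which is the form of strong smoothness referred to here.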
Then, similar to the proof of Lemma 5, we can prove the corresponding characterization. By Lemma 2 in She (2016) and Lemma 3, we obtain the associated inequality. Next, we bound the stochastic term in a proper way so that it can be merged with the $\beta$-penalties.
Then, for any quantity no smaller than $\|X\|_2$, there exist universal constants $A_0, A_1, C, c, c_1, c_2 > 0$ such that for any $a \ge 2b > 0$ and $t \ge 0$, the following event occurs with probability at most $C\exp(-ct)$, where $\lambda = A\sigma\log(ep)$ and $A = \sqrt{ab}\,A_1$.
The remaining part follows the lines of the proof of Theorem 4 and is omitted.

A.7 Proof of Theorem 6
Assume the density of $y \in \mathcal{Y}^n$ given $\eta^* = X\beta^*$ is of the stated form with respect to a base measure $\mu$ defined on $\mathcal{Y}^n$, and the corresponding loss is $l(\eta) = \langle 1, l_0(\eta)\rangle$. Assume the natural parameter space $\Omega = \{\eta \in \mathbb{R}^n : b(\eta) < +\infty\}$ is open. Then the loss corresponds to a distribution in the regular exponential dispersion family with dispersion $\sigma^2$ and natural parameter $\eta_i$. In the Gaussian case, $l_0(\eta_i)$ is $(\eta_i - y_i)^2/(2\sigma^2)$ up to an additive term independent of $\eta_i$. Consider the signal class introduced in the statement. The following theorem implies Theorem 6 by letting the magnitude bounds in the signal class (including $M_\beta$) be $+\infty$ and assuming the associated $\kappa_\beta$- and $\kappa_\gamma$-quantities are positive constants.
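Regarding the density assumption at the start of this subsection, one standard parameterization consistent with the description (our notation for the normalizing terms, with the slight abuse $b(\eta) = \sum_{i} b(\eta_i)$; not necessarily the paper's exact display) is
\[
p(y; \eta, \sigma^2) = \exp\Big\{\frac{\langle y, \eta\rangle - b(\eta)}{\sigma^2}\Big\}\, c(y; \sigma)
\]
with per-coordinate loss $l_0(\eta_i; y_i) = \{-y_i\eta_i + b(\eta_i)\}/\sigma^2$; taking $b(\eta_i) = \eta_i^2/2$ recovers the Gaussian case, where $l_0(\eta_i)$ equals $(\eta_i - y_i)^2/(2\sigma^2)$ up to an additive term independent of $\eta_i$.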
To prove the desired rate for estimation, we discuss two cases.

A.8 Proof of Theorem 7
Let $J(\gamma) = |\mathcal{J}(\gamma)| = \|\gamma\|_0$, where $\mathcal{J}(\gamma)$ is the support of $\gamma$, i.e., $\mathcal{J}(\gamma) = \{i : \gamma_i \ne 0\}$. The optimality of $(\hat\beta, \hat\gamma)$ implies the basic inequality. The stochastic term $\langle \epsilon, X\hat\beta - X\beta^*\rangle$ can be decomposed and bounded in a similar way as in the proof of Theorem 4. The difference is to use the union bound to show $E\sup_{s\le p,\, o\le n} R_{s,o}^2 \le C$. Take the first term $\langle \epsilon, I\rangle$ from the decomposition as an example. When $s = o = 0$, $R_{0,0} = 0$. When $s \ge 1$ and $o \ge 1$, a tail bound holds for any $t \ge 0$ if $4b/a$ is a constant greater than 1, where the last inequality is due to the sum of a geometric series. Similarly, when $s = 0$, $P(\sup_{0\le o\le n} R_{0,o} > t\sigma) \le C\exp(-ct^2)\,n^{-c}$, and when $o = 0$, $P(\sup_{0\le s\le p} R_{s,0} > t\sigma) \le C\exp(-ct^2)\,p^{-c}$ for any $0 \le s \le p$. Hence $E\sup_{s,o} R_{s,o}^2 \le C$. To summarize, we obtain (A.64) for any constants $a, b, a', b' > 0$ satisfying $4b > a$. Using the regularity condition and choosing constants $a, a', b, b'$ and $A$ sufficiently large such that $1/a + 1/a' = \delta/4$, $4b > a$, and $A = A_0 + 2bL$, we obtain the error bound.
A.9 Proof of Theorem 8
We prove a general theorem in possibly high dimensions where $\beta$ is also desired to be sparse.

It follows that
The proof of Theorem 7 gives a high-probability bound for the stochastic term: for any constants $a, b, a' > 0$ satisfying $4b > a$, the bound holds with probability at least $1 - Cn^{-c(1\wedge o^*)}p^{-c(1\wedge s^*)}$ for some $c, C > 0$. Assume $c_0\sigma^2 \le E\epsilon_i^2 \le C_0\sigma^2$ with $c_0, C_0$ positive constants, and let $\varepsilon$ and $\varepsilon'$ be two constants. With $A_0$ large enough, we can choose $a, a', b, A$ such that $1/a + 1/a' < 1/2$, $4b > a$, and $8bL/\{c_0(1 - \varepsilon)\} \le A$. From the Hanson-Wright inequality (Rudelson and Vershynin, 2013), the event in question occurs with probability at most $c_2\exp(-c_2' n)$, where $c_2, c_2'$ depend on the constants $\varepsilon, \varepsilon'$. The conclusion results.
Remark A.3. For robust regression in low dimensions, applying Theorem 8' to the reduced model $U_\perp^T y = U_\perp^T\gamma^* + \tilde\epsilon$ gives Theorem 8, where $U_\perp \in \mathbb{R}^{n\times(n - r(X))}$ is the orthogonal complement of $U$ from the SVD $X = UDV^T$, and $\tilde\epsilon = U_\perp^T\epsilon$. The proof follows the same lines; in particular, when applying the Hanson-Wright inequality, note that $E[\cdots$

A.10 Error bounds of optimal solutions
This part points out that the error rate remains the same if $(\hat\beta, \hat\gamma)$ is globally optimal, but the regularity condition gets relaxed. For simplicity, we drop the $\ell_2$ penalty terms.
Under a slightly stronger regularity condition than that used in Theorem 4, we get an estimation error bound.
Theorem A.3. Let $(\hat\beta, \hat\gamma)$ be an A-estimator satisfying $\|\hat\gamma\|_0 = q_\gamma$, $\|\hat\beta\|_0 = q_\beta$ with $q_\gamma = \vartheta o^*$, $q_\beta = \vartheta s^*$, and $\vartheta \ge 1$. Assume $\beta^* \ne 0$. Then the same type of error bound holds with (26) replaced by the stated condition.
Treating the stochastic term in the same way as in the proof of Theorem 4, we obtain a bound valid for any $a, a', b > 0$ satisfying $4b > a$. The regularity condition implies a further inequality. Combining the above three inequalities and choosing $a, a', b$ such that $a = a' = 8/\delta$ and $b = 4/\delta$ gives the desired result.
To prove Theorem A.3, recall that from the proof of Theorem 4 we obtain the stated bound with probability at least $1 - Cn^{-c(1\wedge o^*)}p^{-c}$. Using the regularity condition and setting $a = a' = 4/\delta$ and $b = 2/\delta$, we get the estimation error bound (A.69) with high probability (details omitted).

(A.75)
This applicability to approximately sparse signals is quite useful in practice.
Remark A.5. Taking the specific choice $\beta = \beta^*$ in (A.73) and assuming the associated regularity condition holds, the error bound depends on $\gamma^*$ through its support size only, and is thereby finite regardless of the magnitude or outlyingness of $\gamma^*$. This gives a risk-based breakdown point result that accounts for the randomness of the estimators. Concretely, to define a general stochastic breakdown point, we fix the true systematic component $X\beta$, but can freely alter the response $y$ within a contamination class $\mathcal{Y}(o)$ with $o \in \mathbb{N}\cup\{0\}$, where $\epsilon = -\nabla l(X\beta + \gamma; y)$. Given any estimator $(\hat\beta, \hat\gamma)$ (which implicitly depends on the data $X, y$), its finite-sample breakdown point $\epsilon^*$ can be defined accordingly. Then, in the setup of Theorem A.4, the estimator given $q_\beta, q_\gamma$ has the stochastic breakdown point $\epsilon^* \ge (q_\gamma + 1)/n$ (A.78). For example, for $\hat\beta$ obtained from $\min_{\|\beta\|_0 \le q_\beta,\, \|\gamma\|_0 \le q_\gamma} l(X\beta + \gamma; y)$ with a quadratic loss $l(\eta; y) = \|y - \eta\|_2^2/2$, the contamination model associated with $\mathcal{Y}(o)$ is $y = X\beta + \gamma + \epsilon$ subject to $\|\gamma\|_0 \le o$, where the nonzero components of $\gamma$ can be arbitrarily large; but even in high dimensions, the breakdown point of $\hat\beta$ is no lower than $(q_\gamma + 1)/n$. The direct link between the $\ell_0$-constraint and the breakdown point facilitates parameter tuning.
Statistically speaking, obtaining a precise error rate of the order $\psi^{-1}\exp\{c(q_\gamma\log\frac{en}{q_\gamma} + q_\beta\log\frac{ep}{q_\beta})\}\sigma^2$ in Theorem A.4 is much more informative than simply knowing that the risk is finite for the purpose of breakdown studies. However, it is an interesting question to determine what relaxed conditions $\epsilon$ should satisfy to guarantee (A.78) for a general extended real-valued function $l: \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$. Toward this, we change the $D_2$ in (A.77) to the generalized Bregman divergence $\Delta_l$: $\epsilon^* = \min\{o : \sup_{y\in\mathcal{Y}(o)} E[\Delta_l(X\hat\beta, X\beta)] = +\infty\}/n$, and make a no-model-ambiguity assumption: $l$ is differentiable at $X\beta^* \in \mathcal D = \{\eta: l(\eta) < +\infty\}$ with gradient $\nabla l(X\beta^*) = -\epsilon$; $X\beta^*$ is a finite optimal solution to the Fenchel conjugate $l^*(\zeta) = \sup_{\eta\in\mathbb{R}^n} \langle \zeta, \eta\rangle - l(\eta)$ (A.79) when $\zeta = -\epsilon$; and the extended real-valued convex function $l^*$ is differentiable at $-\epsilon$. The assumption simply means that $(X\beta^*, -\epsilon)$ makes a conjugate pair. Note that $l$ need not be strictly convex overall, especially when $\mathcal D$ is compact, according to Danskin's min-max theorem (Bertsekas, 1999). Then we have the following conclusion.
In the special case that $l$ is strongly convex on $\mathbb{R}^n$ with respect to a certain norm, as long as the effective noise has a finite second moment, the theorem recovers (A.78), since $l^*$ can be shown to be strongly smooth with respect to the Euclidean norm (Rockafellar, 1970), albeit not delivering a concrete error rate as before. Two other notable features, apart from the randomness: $l$ is a general extended real-valued function (not restricted to be a function of $y - \eta$ as in M-estimation), and its conjugate plays an important role in bounding the risk; also, we do not need to assume a positive norm penalty, which is a key element of the proof of Proposition 1 in Alfons et al. (2013).

A.12 More simulations
In this part, we present more experimental results by varying the correlation strength, the covariance structure, and the sparsity level. Concretely, in Example 1 with $n = 1000$, $p = 10$, $o^* = 100$, we changed the correlation strength $\rho$ from 0.2 to 0.8. In Example 2 with $n = 1000$, $p = 10$, $o^* = 120$, three possible covariance structures for $\Sigma$ were included: the Toeplitz structure $[\rho^{|i-j|}]$, the equally correlated structure $[\rho 1_{i\ne j}]$ (with unit diagonal), and the blocked structure $\mathrm{diag}\{\Sigma_0, \Sigma_1\}$ with two blocks of equal size and the predictors within each block equally correlated with $\rho = 0.5$. In Example 3 and Example 4, where $n = 200$, $p = 1000$, $o^* = 30$, we considered different sparsity levels. The subsequent steps of introducing high-leverage outliers remain the same. The results are summarized in Table A.1, Table A.2, and Table A.3. The overall comparison clearly supports PIQ as an extremely resistant and efficient method.
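For reproducibility of the covariance settings described above, the following minimal Python sketch constructs the three structures; the function names, the sampling routine, and the seed are illustrative and not taken from the paper's code.

import numpy as np

def toeplitz_cov(p, rho):
    # Toeplitz structure: Sigma[i, j] = rho ** |i - j|.
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def equicorr_cov(p, rho):
    # Equally correlated structure: unit diagonal, all off-diagonal entries rho.
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)

def block_cov(p, rho=0.5):
    # Blocked structure diag{Sigma_0, Sigma_1}: two equal-size blocks, predictors
    # within each block equally correlated with the given rho.
    half = p // 2
    sigma = np.zeros((p, p))
    sigma[:half, :half] = equicorr_cov(half, rho)
    sigma[half:, half:] = equicorr_cov(p - half, rho)
    return sigma

def gaussian_design(n, p, sigma, seed=0):
    # Draw an n-by-p Gaussian design whose rows have covariance sigma.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, p)) @ np.linalg.cholesky(sigma).T

# Example: an n = 1000, p = 10 design with Toeplitz correlation rho = 0.5.
X = gaussian_design(1000, 10, toeplitz_cov(10, 0.5))
print(X.shape)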