The robust desparsified lasso and the focused information criterion for high-dimensional generalized linear models

Classical lasso estimation for sparse, high dimensional regression models is typically biased and lacks the oracle properties. Desparsified versions of the lasso that attempt to overcome these drawbacks have been proposed in the literature. In this paper, we propose an outlier-robust version of the desparsified lasso for high dimensional generalized linear models. The robustness, consistency and high dimensional asymptotics are investigated rigorously in a general framework of M-estimation under potential model misspecification. The desparsification mechanism is subsequently utilized to construct the focused information criterion (FIC), thereby facilitating robust, focused model selection in high dimensions. The applications are demonstrated with Poisson regression under the robust quasilikelihood loss function. The empirical performance of the proposed methods is examined via simulations and a real data example.


Introduction
The celebrated least absolute shrinkage and selection operator (lasso) of Tibshirani [1] performs simultaneous regularized estimation and variable selection in high dimensional regression models when the number of covariates d exceeds the sample size n and grows with n. This seminal article opened the doors for a plenitude of investigations in statistical modelling of high dimensional data. Extensive discussions of such methods may be found in Knight and Fu [2], van de Geer [3], Fan and Lv [4], and Bühlmann and van de Geer [5], among others. The asymptotic properties of non-concave penalized likelihood estimators for high dimensional sparse linear models are investigated thoroughly in Fan and Lv [6]. Extensions of these results using robust quasilikelihood functions for high dimensional data observed with outliers are facilitated by Avella-Medina and Ronchetti [7]. Robust penalized M-estimation for sparse, high dimensional data using locally convex estimating functions is discussed by Loh and Wainwright [8] and Loh [9]. Recently, Zhou, Claeskens and Bradic [10] discussed the robustness aspects in high dimensions in terms of model averaging.
The oracle properties of penalized estimation ensure that an estimator of a low dimensional non-sparse component is asymptotically normal, while the remaining sparse component is estimated almost surely by a zero vector. The unavailability of non-degenerate limiting laws for sparse, penalized estimators poses a major hindrance in devising formal statistical inference in high dimensions. Moreover, the well-known and widely used lasso penalty does not belong to the class of penalty functions leading to oracle estimators. In the fixed dimension case, the oracle properties of penalized estimators as given by Theorem 2 of Fan and Li [11] are based on the assumption that the regularization parameter λ satisfies λ = o(1) and n^{1/2}λ → ∞ as n → ∞, where n denotes the sample size. These two conditions cannot be simultaneously satisfied by the lasso penalty, since n^{1/2}-consistency of lasso estimators requires λ = O(n^{−1/2}), which contradicts the condition that n^{1/2}λ → ∞. The lack of oracle properties is also exhibited by robust lasso estimators when the model dimension grows with n. Let k denote the number of active, non-zero components of β. By virtue of Theorem 3 of Avella-Medina and Ronchetti [7], the oracle properties of robust lasso estimators require that λ = O((nk)^{−1/2}) and λ ≫ (k/n)^{1/2}, which are mutually contradictory conditions.
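To spell out the incompatibility, here is the routine calculation, added for completeness:

```latex
% Oracle conditions of Fan and Li [11]:
\lambda = o(1) \quad\text{and}\quad n^{1/2}\lambda \to \infty .
% Root-n consistency of the lasso forces \lambda = O(n^{-1/2}), whence
n^{1/2}\lambda = O(1) \not\to \infty .
% In the robust, growing-dimension case of Avella-Medina and Ronchetti [7],
% \lambda = O\big((nk)^{-1/2}\big) implies
\lambda \,(n/k)^{1/2} = O\big(k^{-1}\big) = O(1),
% which is incompatible with \lambda \gg (k/n)^{1/2}.
```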
In order to carry out statistical inference for the non-zero component of β, one must first correctly identify the zero coefficients. Hence, inference in high dimensions implicitly involves model selection as the first step. The validity of such post-model-selection inference in high dimensions often requires the uniform signal strength condition that the active components of β be asymptotically larger than O(n^{−1/2}). Leeb and Pötscher [12,13] demonstrate that even minutely weak signals produce highly non-normal estimators. Limitations of post-model-selection inference are also discussed in extensive detail by Hjort and Claeskens [14], Berk et al. [15] and Kuchibhotla et al. [16].
The desparsified lasso proposed by van de Geer et al. [17], Zhang and Zhang [18], and Javanmard and Montanari [19] attempts to overcome these limitations. Desparsification facilitates tracking the asymptotic normality of a penalized estimator of the entire parameter vector without having to identify the sparse components at the outset. The asymptotics of the desparsified lasso do not require the uniform signal conditions. Desparsification is especially useful for variable selection in high dimensions, since it facilitates constructing tests of hypotheses about the regression coefficients. Inference for high dimensional data using the desparsified lasso is also discussed extensively in van de Geer [20]. Recently, Honda [21] proposed the desparsified version of the group lasso for inference in high dimensional varying coefficient regression models.
The existing literature on desparsification of regularized estimation is confined to the least squares or likelihood loss function. It is well known that estimation based on these loss functions is susceptible to potential model misspecification caused by outliers in the data. Hampel [22] formalized the robustness evaluation of estimators by means of their influence functions. An estimator whose influence function is bounded is said to be infinitesimally robust against an ε-contamination, where 0 < ε < 1. The influence functions of likelihood and least squares estimators are unbounded. Hence, as such, these estimators are non-robust. The class of M-estimators proposed by Huber [23] nests various practically useful robust estimators with bounded influence functions. Here, M stands for maximum likelihood-type.
Ever since its inception, applications of Huber's M-estimation have been extensively discussed for a variety of problems. A formal, extensive analysis of the robustness of various estimation schemes for classical linear models can be found in Hampel et al. [24] and Huber and Ronchetti [25]. A number of advances have recently been made in the field of robust M-estimation. Robust covariance and scatter matrix estimation using M-estimators is discussed by Chen et al. [26]. Robust versions of the Kalman filter using Huber's loss function, with applications to attitude estimation of small satellites, are discussed by Cao and Chen [27]. Avella-Medina and Ronchetti [7] proposed robust estimation in high dimensional generalized linear models. Robust gradient estimation has been developed by Prasad et al. [28]. Adaptive Huber regression for high dimensional data has been discussed by Sun, Zhou and Fan [29]. Robust estimation of the mean and the covariance matrix for time series has been developed by Zhang [30]. Quite recently, Zhu, Jiao and Steinhardt [31] have proposed robust estimation via generalized quasi-gradients.
The existing literature on robust estimation in high dimensions does not incorporate testing of hypotheses and model/variable selection aspects. In order to facilitate these techniques for contaminated high dimensional generalized linear models (GLM) under model misspecification and in the presence of outliers, it is essential to construct a robust version of the desparsified lasso. With this view, we propose in this paper the desparsified, lasso-regularized robust M-estimator. We formulate a robust nodewise regression procedure to compute the 'relaxed' inverse of the underlying Hessian matrix, which is eventually instrumental in constructing the desparsified estimators. We derive the influence function of the proposed estimators to formally investigate their robustness and asymptotics in the general framework of M-estimation. Our results facilitate robust tests of hypotheses and variable selection for high dimensional GLM. Our results also generalize the existing desparsification techniques based on the likelihood or least squares loss functions to a general class of loss functions.
Another aim of this paper is to bridge the gap between inference and model selection in high dimensions. We construct the focused information criterion (FIC) for high dimensional GLM using the robust desparsified lasso. Desparsification in fact facilitates variable selection by means of testing hypotheses about the regression coefficients. However, such variable selection is aimed at selecting a single suitable model for the data based on its overall goodness of fit, with little regard to the purpose of model selection. In order to facilitate focused and contextual model selection, Claeskens and Hjort [32] in their seminal paper proposed a novel model selection criterion termed the Focused Information Criterion (FIC). The FIC typically evaluates the candidate models on the basis of the mean squared error of an estimator of a user-defined focus function. Hence, FIC-based model selection enables decision makers to select contextually interpretable models that align with the purpose of statistical modelling.
This idea of focused inference and model selection has received noteworthy attention and appreciation. Hansen [33] described the FIC as 'an intriguing challenger to the existing model selection methods and deserves attention and scrutiny'. The FIC has been developed to address a variety of model selection problems arising in linear and generalized linear regression models (Claeskens and Hjort [32,34]), Gaussian autoregressive and moving average time series models (Claeskens et al. [35], Rohan and Ramanathan [36]), semiparametric generalized linear models (Claeskens and Carroll [56]), and generalized additive partially linear models (Zhang and Liang [37]), to name a few. An extensive review of focused statistical modelling and inference can be found in Claeskens and Hjort [34, Chap. 6] and Claeskens [38].
The FIC in its original form as proposed by Claeskens and Hjort [32] is constructed using the maximum likelihood estimators (MLE) of the model parameters. The asymptotic normality of the estimators of the model parameters under a sequence of locally misspecified, mutually contiguous parametric models is crucial for the FIC construction. Such local asymptotics are based on the assumption that, as the sample size n → ∞, the sequence of locally misspecified models approaches the correctly specified core parametric model. However, the core parametric model is itself misspecified when the data are contaminated with outliers. Hence, in such cases, estimators such as the MLE or the least squares estimators, which exhibit unbounded bias as a function of data contamination, are not suitable for developing the FIC. Therefore, developing focused model selection methods by means of non-likelihood estimators that are robust to data contamination is worth investigating. Du et al. [39] discussed robust and focused model averaging aspects in a framework of M-estimation. We have recently proposed the robust FIC in Pandhare and Ramanathan [40,41] and Ramanathan and Pandhare [42].
The aforementioned investigations into robust focused modelling are confined to low dimensional linear models. Gueuning and Claeskens [43] proposed the FIC for high dimensional regression models; theirs can be regarded as a pioneering work in the field of focused modelling of high dimensional data. However, their investigations are again confined to the least squares and likelihood based loss functions and hence have the same drawbacks as those exhibited by the traditional low dimensional FIC in the presence of data contamination.
The FIC that we propose in this paper attempts to overcome these limitations when modelling sparse high dimensional data observed with outliers. Our FIC is applicable for robust model selection in a wide class of GLM in which the conditional distribution of the response given the covariates need not be Gaussian but can be any model in the exponential family. Moreover, it is constructed using a wide class of M-estimators. We establish the asymptotic normality of the robust desparsified lasso penalized M-estimators under local model misspecification, which is eventually used to derive the FIC. We also propose various mechanisms to estimate the mean squared error of the focus function under model misspecification, leading to the required FIC formulas. The general results are exemplified with high dimensional Poisson regression under the robust quasilikelihood estimation scheme.
The rest of the paper is organized as follows. Robust estimation in high dimensional GLM is briefly discussed in Section 2. The desparsified version of these estimators is discussed in Section 3, along with its robustness aspects. In Section 4, the asymptotic normality of these estimators under local model misspecification is investigated. Using the results of Section 4, the general FIC expressions are derived in Section 5. The results in Sections 2-5 are fairly general and applicable to a wide range of loss functions and GLMs. Section 6 is devoted to empirically examining the performance of the robust desparsified lasso and the robust FIC for the case of high dimensional Poisson regression by means of simulation studies. The loss function for the empirical investigations is taken to be the robust quasilikelihood. Analysis of real data on riboflavin production is provided in Section 7. Conclusions and future directions are sketched in Section 8. The computational details, algorithms and proofs of all the theoretical results are presented in the supplementary material. In particular, the coordinate descent algorithm that is instrumental in performing the nodewise regression and computing the desparsified estimator for the case of high dimensional Poisson regression is discussed there.

Model setting
Let (Y_i, x_i), i = 1, 2, . . . , n denote n independent and identically distributed (iid) copies of (Y, x). The GLM is typically specified in terms of the conditional distribution F_β(.|x) of Y given x. We assume throughout that F_β(.|x) is picked from the class F of distribution functions equipped with the uniform topology. Furthermore, we assume that the conditional distribution F_β(.|x) is a member of the exponential family with the canonical link function g(.) and the linear predictor η_x(β) = x^T β, with probability density of the form

f(y | x; β) = exp{ [y η_x(β) − b(η_x(β))]/a(κ) + c(y, κ) }

for some functions a(.), b(.), c(.) and the known dispersion parameter κ. The conditional mean is given by μ_x(β) = E_β(Y | x) = b′(η_x(β)) and the conditional variance is given by V(μ_x(β)) = κ b′′(η_x(β)), where b′ and b′′ denote the first and the second derivatives of the function b with respect to the linear predictor. For extensive details of GLM, we refer to Nelder and Wedderburn [44] and McCullagh and Nelder [45]. Let T : F → R^d denote a functional such that T(F_β) = β* = argmin_β E_β ρ(Y, μ_x(β)) denotes the 'least false' parameter value of β that minimizes the expectation of a certain parametric loss function ρ(Y, μ_x(β)) over the parameter space, while T(F_n) = β̂ denotes the estimator of β, where F_n denotes the empirical distribution function. Such a functional T is known as a Fisher consistent M-functional and the resultant estimator β̂ is called an M-estimator, where M stands for minimization.
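For concreteness, consider the canonical Poisson case with the log-link, which serves as the running example later in the paper; here the dispersion is κ = 1 and the exponential family ingredients specialize as follows.

```latex
\eta_x(\beta) = x^{T}\beta, \qquad b(\eta) = e^{\eta}, \qquad
\mu_x(\beta) = b'(\eta_x(\beta)) = e^{x^{T}\beta}, \qquad
V(\mu_x(\beta)) = b''(\eta_x(\beta)) = \mu_x(\beta).
```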

High dimensional influence function
In robust statistics, the conditional distribution of Y is permitted to deviate from the assumed core parametric model in an ε-contamination neighbourhood for given ε > 0. Let Δ_Y denote the probability measure that assigns mass 1 at the point Y and 0 elsewhere. Given x, the model for Y under ε-contamination is given as F_ε(.|x) = (1 − ε)F_β(.|x) + εΔ_Y. The purpose of robust regression modelling is to devise estimators whose bias remains stable under an ε-contamination. The sensitivity of an M-estimator to the misspecification in the parametric model caused by the outliers is measured in terms of its influence function (Hampel, [22]). Given the covariates x, the conditional influence function of T at the parametric model F_β*(.|x) at the least false parameter value β* is defined as

IF(Y; T, F_β*(.|x)) = lim_{ε↓0} [T((1 − ε)F_β*(.|x) + εΔ_Y) − T(F_β*(.|x))]/ε. (1)

The boundedness of the influence function (1) implies that the underlying M-estimator is bias-robust, meaning that it exhibits a bias of maximum size O(ε) under ε-contamination. A sufficient condition for the influence function (1) to be well defined is the Hadamard differentiability of the functional T with respect to the core parametric model. For details of Hadamard derivatives and other related functional calculus, we refer to Fernholz [46]. In low dimensional cases, the Hadamard differentiability, and hence the existence of the influence function of T, is ensured by the non-singularity of the Hessian matrix of the loss ρ(Y, μ_x(β)) (Hampel, [22]). However, this criterion is not directly applicable to obtain the influence function in the high dimensional case, since the Hessian matrix is rank deficient. High dimensional data often exhibit sparsity, implying that many of the components of β are essentially zero. Let k denote the number of active, non-zero components of β, where k ≪ d, k < n. The penalized version of the functional T identifies the zero coefficients and reduces the dimension. Following Avella-Medina and Ronchetti [7], the lasso penalized risk function takes the form

L_λ(β) = E_β ρ(Y, μ_x(β)) + λ ||β||_1, (2)

where the function ρ is a bounded loss function and λ > 0 is the regularization parameter. Let T(F_β) = β* denote the minimizer of the lasso risk function (2); based on the data, T(F_n) = β̂ denotes the corresponding lasso penalized M-estimator. We now discuss the influence function of T when d > n. In order to derive the influence function of the lasso functional T, we assume the following regularity conditions.

Assumption 2.2:
There exists a unique β* ∈ R^d such that the lasso risk (2) attains a well-separated minimum at β*; that is, for every open set U containing β*, inf_{β∉U} L_λ(β) > L_λ(β*).

Assumption 2.3:
There exists a unique minimizer β̃ of the loss in a neighbourhood containing β*, and the loss function is almost surely continuous at β̃ for almost every x ∈ X.
Assumptions 2.1 and 2.3 are traditional smoothness conditions on the loss function. The boundedness conditions imply that the influence function is bounded, implying the robustness of T. Assumption 2.2 implies that the loss function has a well-separated minimum. The usual convex loss functions satisfy these assumptions; cf. Loh and Wainwright [8] and Avella-Medina [47].
The following theorem provides the influence function of T.
The boundedness conditions in Assumption 2.1 imply that this influence function is bounded, implying the bias robustness of the resultant lasso penalized estimators obtained by minimizing the sample analogue of the risk function (2).

Robust nodewise regression
We now propose a robust version of the nodewise lasso regression to compute the 'relaxed' inverse of the Hessian matrix of the loss ρ(Y, μ_x(β)), which is eventually useful to desparsify the lasso and obtain its asymptotic distribution applicable to high dimensional GLM. The nodewise lasso involves successive regression of each j-th column (j = 1, 2, . . . , d) of the underlying design matrix on the remaining d − 1 columns. The regression coefficients estimated in each case are eventually used to construct the relaxed inverse of the Hessian matrix of the underlying loss function. Let M_X(β) denote the Hessian of the expected loss. The Hessian M_X(β) defined in this manner is a linear transformation of the columns of the weighted design matrix H. Hence, the relaxed inverse of M_X(β) can be obtained by successively regressing each j-th column of H on the remaining d − 1 columns. Let h_{i,j} denote the i-th entry of the j-th column of H, let h_{i,−j} denote the corresponding row of H without its j-th entry, and let μ_{i,−j}(ν_j) denote the mean functions (conditional on h_{i,−j}) with respect to the link function g(.). For every j ∈ {1, 2, . . . , d}, let F_{ν_j} denote the common distribution function of the iid copies of h_{i,j}, i = 1, 2, . . . , n, with the conditional density in the exponential family of the form given in Section 2.1, with the response y_i replaced by h_{i,j} and the covariates x_i replaced by h_{i,−j}.
Let T_j : F → R^{d−1} be the functional given as

T_j(F_{ν_j}) = ν*_j = argmin_{ν_j} E ρ(h_{i,j}, μ_{i,−j}(ν_j)) + λ_j ||ν_j||_1, (4)

where λ_j > 0 is the nodewise regularization parameter. The functional defined by (4) represents the lasso penalized, nodewise M-functional. The following theorem is in the spirit of Theorem 2.1 and provides the influence function of the nodewise lasso functional (4). In what follows, k_j denotes the number of active, non-zero components of ν*_j, and the corresponding subgradient vector is of length d, with the first k_j elements equal to λ_j sgn(ν*_{j,s}), s = 1, 2, . . . , k_j, and each of the remaining d − k_j elements equal to 0. Under the assumptions of this theorem, the influence function is seen to be bounded, implying the robustness of the nodewise lasso estimators given by (5).

Remark 3.1:
The influence functions given by Theorems 2.1 and 3.1 are presented for situations where d > n. However, these results are non-asymptotic in nature and hence are based on the assumption that the model dimension is fixed. In the forthcoming discussions, we present the asymptotic results for desparsified estimation and the FIC, wherein we assume that d grows with n, exhibiting non-polynomial (NP) dimensionality such that log(d) = o(n^{1/2}), as in Avella-Medina and Ronchetti [7]. We also assume the sparsity condition that the dimension of the active part of β satisfies k = o(n^{1/2}/log(d)). This condition is comparable with the sparsity condition (C5) of Gueuning and Claeskens [43]. Allowing the model dimension d to grow with n and incorporating the sparsity is instrumental in establishing the classical root-n consistency of the desparsified lasso. It is implicit throughout that d = d_n, β = β_n, k = k_n and λ = λ_n, that is, the parameters and model dimensions are functions of n. However, for the sake of notational simplicity, we suppress the indexing by n.

Relaxed inverse and the formula for desparsified estimator
The notion of the relaxed inverse in the context of high dimensional regression models was proposed by van de Geer et al. [17]. The nodes ν̂_j can be used to estimate the relaxed inverse of the Hessian M_X(β), which is eventually useful to define the desparsified version of β̂. Let ĥ_{i,j} denote the responses estimated using the nodes ν̂_j. Let Â denote the d × d matrix with (j, j′)-th element (j ≠ j′) equal to −ν̂_{j,j′} and (j, j)-th element equal to 1. Define T̂² = diag(τ̂²_1, . . . , τ̂²_d), where τ̂²_j denotes the estimated scale of the j-th nodewise regression. Following van de Geer et al. [17], we define the nodewise, relaxed inverse Θ̂ of M̂ as

Θ̂ = T̂^{−2} Â. (6)

The desparsified estimator, say b̂, of β̂ is given as

b̂ = β̂ − Θ̂ n^{−1} Σ_{i=1}^{n} ψ(Y_i, μ_{x_i}(β̂)). (7)

Equations (6) and (7) serve as the formulas to compute the desparsified estimators. The key step to arrive at (7) is to invert the well-known Karush-Kuhn-Tucker (KKT) conditions corresponding to the minimization scheme (4). The derivation follows along the lines of Section 3.1 of van de Geer et al. [17] and hence is omitted.
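To fix ideas, the following is a minimal, self-contained R sketch of the construction (6)-(7) in the simplest special case of the Gaussian linear model with quadratic loss, where the weighted design H reduces to X itself and the score ψ reduces to the ordinary residual. The function names and the plain coordinate descent are illustrative choices, not the implementation used for this paper; the robust GLM version replaces X by H and the residuals by the bounded scores of the robust loss.

```r
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)  # soft-thresholding operator

# Coordinate descent for (1/(2n)) ||y - X b||^2 + lambda ||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 50) {
  n <- nrow(X); d <- ncol(X)
  beta <- rep(0, d)
  xx <- colSums(X^2) / n               # per-coordinate curvature
  r <- y                               # residual at beta = 0
  for (it in seq_len(n_iter)) {
    for (j in seq_len(d)) {
      r <- r + X[, j] * beta[j]        # restore the j-th contribution
      beta[j] <- soft(sum(X[, j] * r) / n, lambda) / xx[j]
      r <- r - X[, j] * beta[j]        # subtract the updated contribution
    }
  }
  beta
}

# Nodewise regressions: the j-th row of Theta estimates a row of the relaxed inverse (6)
nodewise_theta <- function(X, lambda_node) {
  n <- nrow(X); d <- ncol(X)
  Theta <- matrix(0, d, d)
  for (j in seq_len(d)) {
    nu <- lasso_cd(X[, -j, drop = FALSE], X[, j], lambda_node)
    res <- X[, j] - X[, -j, drop = FALSE] %*% nu
    tau2 <- sum(res^2) / n + lambda_node * sum(abs(nu))  # tau_j^2
    Theta[j, j] <- 1 / tau2
    Theta[j, -j] <- -nu / tau2
  }
  Theta
}

# One-step desparsification (7): invert the KKT conditions of the lasso
desparsify <- function(X, y, beta_hat, Theta) {
  n <- nrow(X)
  as.vector(beta_hat + Theta %*% crossprod(X, y - X %*% beta_hat) / n)
}

# Small demonstration (sizes kept modest so the sketch runs quickly)
set.seed(1)
n <- 100; d <- 120
X <- matrix(rnorm(n * d), n, d)
beta0 <- c(2, -2, rep(0, d - 2))
y <- as.vector(X %*% beta0 + rnorm(n))
bhat <- lasso_cd(X, y, lambda = sqrt(log(d) / n))
b_desp <- desparsify(X, y, bhat, nodewise_theta(X, sqrt(log(d) / n)))
```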

Additional assumptions and related discussions
The following additional assumptions are instrumental in establishing the asymptotics of the desparsified lasso estimator given by relation (7). In the subsequent discussions, we set Q* = Q_X(β*), where Q_X(β) denotes the covariance matrix of the conditional score vector.

Assumption 3.1:
In addition to the conditions of Assumption 2.1, the entries ψ_{j,j′}(Y, μ_x(β)) of the Hessian of the loss are uniformly bounded and locally Lipschitz on an open set O containing β*; that is, for any β_a ≠ β_b ∈ O, there exists a δ-neighbourhood on which |ψ_{j,j′}(Y, μ_x(β_a)) − ψ_{j,j′}(Y, μ_x(β_b))| ≤ L ||β_a − β_b|| for some constant L. Moreover, the smallest eigenvalue l_min of Q* satisfies 1/l_min = O(1).

Assumption 3.2:
There exists an open set O containing β* such that, for every β ∈ O and for almost every x ∈ X, the smallest eigenvalue l_min of M_X(β) is strictly positive with 1/l_min = O(1), the largest eigenvalue l_max is O(1), and ||M*||_∞ = O(1), where M* = M_X(β*).

Assumption 3.6:
There exist an n × n weight matrix W_j = W_H(ν_j) and an n × 1 column vector z_j = z_H(ν_j) such that the j-th nodewise estimating equations can be expressed in weighted least squares form.

Assumption 3.3: It holds that the weight matrices W_j in the nodewise regressions are uniformly bounded and Lipschitz continuous in their arguments.
We also make Assumptions 3.4-3.7; in particular, Assumptions 3.4 and 3.7 require strong boundedness of the design matrices, while Assumption 3.5 imposes the sparsity condition k = o(n^{1/2}/log(d)) of Remark 3.1.
Assumptions 3.1-3.7 are instrumental in establishing the asymptotic normality of the desparsified lasso. Assumption 3.1 is stronger than Assumption 2.1, since it also requires the local Lipschitz continuity of the Hessian of the underlying loss. Assumptions 3.4 and 3.7 imply that the design matrices are strongly bounded in the sense of Assumption (B1) of van de Geer et al. [17]. Assumptions 3.2, 3.4, 3.6 and 3.7 are sufficient conditions for the so-called compatibility condition to hold with respect to M̂ with probability approaching 1 as n → ∞ (Bühlmann and van de Geer [5, Section 6.12]). Extensive details on the compatibility condition, and on stronger conditions that imply it, can be found in van de Geer and Bühlmann [48]. In particular, the compatibility condition, and hence the assumptions here which imply it, are weaker than restricted eigenvalue conditions such as those in Raskutti, Wainwright and Yu [49] that are used to establish the oracle inequalities for the lasso. Moreover, the above assumptions are also weaker than the irrepresentability conditions of Zhao and Yu [50], which are necessary and sufficient for the variable selection consistency of the lasso. The compatibility condition is in fact a direct implication of the restricted eigenvalue and the irrepresentability conditions (van de Geer and Bühlmann [48, Section 9]).
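For reference, a statement of the compatibility condition in the form of Bühlmann and van de Geer [5], written here for M̂ and the active set S_0 of β* with |S_0| = k (the constant φ > 0 is the compatibility constant):

```latex
\text{For all } \beta \text{ with } \|\beta_{S_0^c}\|_1 \le 3\,\|\beta_{S_0}\|_1:
\qquad
\|\beta_{S_0}\|_1^2 \;\le\; \frac{k\,\bigl(\beta^{T}\widehat{M}\,\beta\bigr)}{\phi^2}.
```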
The compatibility condition is crucial to the high dimensional asymptotics presented here, since it provides bounds for the L_2 error of the lasso estimates. In particular, it implies that ||β̂ − β*||_2 = O_P(k^{1/2}λ). The compatibility condition also implies that the estimators of the nodes satisfy ||ν̂_j − ν*_j||_2 = O_P(k_j λ_j). These key results eventually lead to the asymptotics of the desparsified lasso. Assumption 3.3 implies that the weight matrices in the nodewise regression are bounded and Lipschitz continuous. This condition, along with the uniform boundedness condition in Assumption 3.1, implies the robustness of the desparsified lasso. Assumption 3.5 is the usual sparsity condition in high dimensional asymptotics which, coupled with the compatibility condition, establishes the rate of convergence of the lasso estimators.
The existence and the formulas of the matrices appearing in the above assumptions depend on the nature of the underlying loss function. The robust quasilikelihood loss of Cantoni and Ronchetti [51] is a well-known and widely used loss function that satisfies these assumptions. Additional examples of such loss functions can be found in the supplementary material of Avella-Medina and Ronchetti [7].

Asymptotics of relaxed inverse and desparsified lasso
Let Θ* denote the population counterpart of the relaxed inverse, as defined in Theorem 2.1. Let Θ*_j and Θ̂_j denote the j-th columns of Θ* and Θ̂, respectively. The following theorem ensures the consistency of Θ̂.
We now discuss the robustness and the asymptotic normality of the desparsified estimators. The following theorem establishes that the desparsified lasso is indeed robust and asymptotically normal: (i) the estimator b̂_j is asymptotically linear, with an empirical influence function that consistently estimates its population counterpart; (ii) the estimator b̂_j is asymptotically normal, with n^{1/2}(b̂_j − β*_j) converging to a centred normal law.

Remark 3.2:
Statement (i) of Theorem 3.3 implies that the empirical influence function is a consistent estimator of its population version. Moreover, the influence function is bounded, implying the robustness of the desparsified estimators. In fact, by virtue of this theorem, it is clear that the robust version of the nodewise regression yields robust desparsified estimators. Statement (ii) provides the asymptotic normality of all the components of the penalized estimators without having to correctly identify the zero coefficients via regularization. By means of the Cramér-Wold device, the joint asymptotic normality of any fixed, finite subcollection of the components of b̂ also holds true.

Local model misspecification setting
Without loss of generality, partition β as β = (α, γ), α ∈ R^p, γ ∈ R^q, p + q = d. Accordingly, partition the desparsified estimator b̂ as b̂ = (α̂, γ̂). We assume that the subvector α is common to all the models under consideration, while the subvector γ is subject to selection. Consider a subset S ⊆ {1, 2, . . . , q}. Let |S| denote the cardinality of S, and let π_S denote the |S| × q projection matrix such that γ_S = π_S γ denotes the portion of γ that is unknown under a subset model S. The case when S is the empty set corresponds to the smallest model, under which α alone is an unknown parameter while γ is known completely. Let this known value be denoted by γ_0. We assume throughout that γ_0 lies in the interior of the parameter space. In the context of covariate selection in regression models, we set γ_0 = 0_q, in which case we arrive at the smallest model with p protected covariates. Hence, α denotes the coefficients corresponding to the covariates that are protected and included in all the models under consideration, while the components of γ that do not appear in a subset model lead to the omission of the corresponding covariates from the concerned model. The case S = {1, 2, . . . , q} corresponds to the full model, under which the vector β is completely unknown and all p + q = d coefficients are to be estimated.
In order to derive the FIC for a high dimensional submodel S, we need to track the limiting distribution of the desparsified estimators under the submodel S. The essence of focused model selection lies in the asymptotic normality of the parameters estimated under potential model misspecification. Parallel to the local model misspecification setting of Gueuning and Claeskens [43], we define γ_S = π_S γ under a subset model S as

γ_{0,S,δ} = π_S(γ_0 + n^{−1/2} δ), (8)

where δ ∈ R^q is the misspecification parameter. Typically, the parameter vector α is common to all the models. It is regarded as a predetermined, low dimensional subvector of β. Since its dimension is fixed, this component is kept free of local misspecification. The γ component, on the other hand, is regarded as a high dimensional subvector of β whose dimension potentially grows with n. A locally perturbed model F_{α*,γ_{0,S,δ}} is contiguous to the fixed model F_{α*,γ*_S}, and the well-known contiguity lemmas of Le Cam [52] may be utilized to derive the local asymptotics of the estimators under the model misspecification (8), as discussed in the following section.

Asymptotic normality under locally misspecified high dimensional subset models
We now discuss the asymptotic normality of the desparsified estimators of the coefficients in a subset model S under the local model misspecification (8). Let S_x(β) denote the conditional score vector given x.
Define G_X(β) = E_β(ψ(Y, μ_x(β)) S_x(β)^T). Partition the matrix M̂ according to the α and γ components as

M̂ = [ M̂_{α,α}  M̂_{α,γ} ; M̂_{γ,α}  M̂_{γ,γ} ]. (9)

Let the matrices Q̂ and Ĝ be partitioned similarly. Let M̂_S denote the submatrix of M̂ under a subset model S, partitioned in the same manner. Partition the nodewise, relaxed inverse Θ̂_S of M̂_S according to (9) as Θ̂_S = [ Θ̂_{S,α,α}  Θ̂_{S,α,γ} ; Θ̂_{S,γ,α}  Θ̂_{S,γ,γ} ]. The following theorem provides the asymptotic distribution of the subvector b̂_S = (α̂, π_S γ̂) of b̂ pertinent to a subset model S under the local model misspecification (8).

Asymptotic mean squared error of the estimated focus under model misspecification
Let ζ(β) denote a real-valued parametric focus function. Let ζ(α*, γ_{0,S,δ}) denote the least false value of the focus under the local misspecification. Let ζ(α̂, π_S γ̂) denote the focus estimated using the desparsified estimators of the coefficients under a submodel S. Let τ_S = (τ_{S,α}, τ_{S,γ})^T and C_S denote, respectively, the mean vector and the covariance matrix of the asymptotic normal distribution of n^{1/2}(α̂ − α*, π_S γ̂ − π_S γ_0)^T. By means of Theorem 4.1, the components of τ_S are given as τ_{S,α} = L^1_S δ and τ_{S,γ} = L^2_S δ, where L^1_S and L^2_S are matrices of dimensions p × q and |S| × q, respectively. These matrices are estimated by L̂^1_S and L̂^2_S, respectively. Let Ĉ_S be partitioned conformably with (α̂, π_S γ̂). The following corollary to Theorem 4.1 provides the asymptotic normality of the desparsified estimator of the focus function under model misspecification, which eventually leads to the formula for its approximate mean squared error (AMSE).

Corollary 5.1:
Let ζ(β) = ζ(α, γ) be a real-valued, differentiable focus function with non-vanishing derivative at β*. Suppose the assumptions of Theorem 4.1 hold true. Then, under the locally misspecified subset model F_{α*,γ_{0,S,δ}}, n^{1/2}(ζ(α̂, π_S γ̂) − ζ(α*, γ_{0,S,δ})) is asymptotically normal, with mean determined by τ_S and covariance determined by C_S.

Set ζ̂_S = ζ(α̂, π_S γ̂). By virtue of Corollary 5.1, the AMSE of the focus estimated under a subset model S is given as

AMSE(ζ̂_S) = { (∂ζ/∂α)^T τ_{S,α} + (∂ζ/∂γ_S)^T τ_{S,γ} }² + (∂ζ/∂α, ∂ζ/∂γ_S) C_S (∂ζ/∂α, ∂ζ/∂γ_S)^T,

with the derivatives evaluated at the least false values.

Remark 5.1:
Common choices of the focus function in the context of GLM are the linear predictor η_0(β) = x_0^T β and the conditional mean function μ_0(β) = g^{−1}(η_0(β)) for a fixed covariate level x_0 and a monotone link function g(.). Partition x_0 according to the α and γ components as x_0 = (x_{α,0}, x_{γ,0}). Since g(.) is a differentiable link function with non-vanishing derivative g′(.), it is seen that ∂η_0(β)/∂α = x_{α,0} and ∂η_0(β)/∂γ = x_{γ,0}, while ∂μ_0(β)/∂α = {g′(μ_0(β))}^{−1} x_{α,0} and ∂μ_0(β)/∂γ = {g′(μ_0(β))}^{−1} x_{γ,0}.
Using these relations, we get AMSE(μ̂_{0,S}) = {g′(μ_0(β))}^{−2} AMSE(η̂_{0,S}). Since the factor {g′(μ_0(β))}^{−2} is common to all the subset models, this relation implies that the FIC based on the linear predictor leads to the same model choice as the FIC based on the conditional mean function.

Estimation of misspecification parameter and FIC formulas
The FIC for a subset model S is an estimator of AMSE(ζ̂_S). However, to compute the FIC, we need to estimate the matrix δδ^T in order to estimate the squared bias component. The following corollary to Theorem 4.1 is instrumental in this regard.
Corollary 5.2:
Suppose M* is non-singular, with its inverse partitioned as in (9). Let ψ_i = (ψ_{i,α}, ψ_{i,γ}), where ψ_{i,α} and ψ_{i,γ} denote the gradients of the quasilikelihood loss ρ(Y_i, μ_{x_i}(β)), i = 1, 2, . . . , n, with respect to α and γ, respectively. Then, given x, the following convergence holds with respect to the conditional distribution function F_{β*}(.|x) as n → ∞.
In view of Corollary 5.2, the following are possible choices for estimating δδ^T, with the corresponding FIC formulas reported in Table 1. The AUL and AUDL estimators enjoy asymptotic unbiasedness but may estimate δδ^T by a non-positive definite matrix. As suggested by Gueuning and Claeskens [43, Section 5], the corresponding squared bias component in the AMSE of the focus is then truncated to 0 to avoid negative estimates of the FIC. The PIL and PIDL estimators, on the other hand, are biased but ensure the non-negativity of the FIC. The comparative performance of these four versions of the FIC for the case of Poisson regression is investigated empirically in the next section.
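In implementation terms, the four versions differ only in how the squared-bias ingredient is estimated and whether it is truncated at zero. A minimal R sketch of this final assembly step is given below; the function and argument names are illustrative and not taken from the paper's code.

```r
# bias2: estimated squared bias of the focus under a candidate model
#        (may be negative under the asymptotically unbiased AUL/AUDL schemes)
# variance: estimated asymptotic variance of the focus under the same model
fic_value <- function(bias2, variance, truncate = TRUE) {
  if (truncate) bias2 <- max(bias2, 0)   # truncation keeps the FIC non-negative
  bias2 + variance
}

# Pick the candidate model with the smallest FIC
select_model <- function(bias2_vec, var_vec)
  which.min(mapply(fic_value, bias2_vec, var_vec))
```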

Model setting
We simulate observations on the response variable from a high dimensional Poisson regression model with the log-link, with a fraction of the observations contaminated by outliers. We assume the proportion of contamination to be 10% and the coefficient vector β to be of length d = 1000. The coefficient vector β is partitioned as β = (α, γ). We consider α = (−1, 1, 1, −1) and γ = (1, 1, 0, 0, . . . , 0). Hence the true model is sparse, having dimension k = 6 with the covariates x_1, . . . , x_6, while the smallest model is of dimension 4 with the covariates x_1, . . . , x_4. The robust and non-robust versions of the lasso estimator of β are computed via coordinate descent using the R functions made publicly available by Avella-Medina and Ronchetti [7]. These codes can be accessed at https://doi.org/10.1093/biomet/asx070. The relevant technical deductions underlying these computations are provided in the supplementary material to this paper.
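One plausible data-generating sketch for this setting is given below; the paper's exact contamination mechanism and covariate covariance structures are detailed in the supplementary material, so the Toeplitz correlation 0.5^{|j−k|} and the multiplicative response outliers used here are illustrative assumptions only.

```r
set.seed(123)
n <- 300; d <- 1000
beta <- c(-1, 1, 1, -1, 1, 1, rep(0, d - 6))   # alpha = (-1,1,1,-1), gamma = (1,1,0,...,0)

# Toeplitz-correlated Gaussian covariates (illustrative choice 0.5^{|j-k|})
R <- chol(outer(1:d, 1:d, function(j, k) 0.5^abs(j - k)))
X <- matrix(rnorm(n * d), n, d) %*% R

# Poisson responses with the canonical log-link
y <- rpois(n, lambda = exp(as.vector(X %*% beta)))

# Contaminate 10% of the responses with gross outliers (illustrative mechanism)
idx <- sample(n, size = round(0.1 * n))
y[idx] <- y[idx] * 5 + 10
```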

Robust versus non-robust tests
We now examine the performance of the robust and non-robust versions of the desparsified lasso in terms of the power of tests of hypotheses under local model misspecification. Partition β = (α, γ). We assume that the narrow model is of dimension 4, with the coefficients α = (β_1, β_2, β_3, β_4), while γ consists of the components γ = (β_5, β_6, . . . , β_d).
The null hypothesis H_0 : γ = 0 is to be tested against the local alternatives H_{1n} : γ = n^{−1/2}δ at the 5% level of significance.
By virtue of Remark 3.2, it is seen that the statistic n γ̂^T Ω̂_{γ,γ} γ̂ is approximately distributed as χ²_q, where Ω̂_{γ,γ} is the q × q submatrix of Ω̂ = M̂ Q̂_X(β̂)^{−1} M̂^T corresponding to the γ component. There is little evidence against H_0 if the computed test statistic is smaller than χ²_{0.95,q}, where χ²_{0.95,q} denotes the 0.95 quantile of χ²_q; otherwise, H_0 is rejected at the 5% level. While computing this statistic, the term Q_X(β) needs to be estimated, for which an estimate of β needs to be computed.
There are four potential choices for estimating β, namely, the robust quasilikelihood lasso estimator (β̂_R), the robust quasilikelihood desparsified estimator (b̂_R), the non-robust quasilikelihood lasso estimator (β̂_NR) and the non-robust quasilikelihood desparsified estimator (b̂_NR). The explicit formulations of the robust and non-robust versions of the quasilikelihood function are provided in the supplementary material to this paper. In order to compute the power, we also need to estimate the term δ. Due to Corollary 5.2, robust and non-robust estimates of the parameter δ can be obtained from K_X(β̂) by replacing β̂ with each of the above-mentioned estimators.
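A minimal sketch of the resulting Wald-type χ² test is given below, assuming the desparsified estimate gamma_hat of γ and an estimate Omega_gg of the q × q precision submatrix Ω̂_{γ,γ} (built from any of the four estimators above) are available; the function name is illustrative.

```r
wald_chisq_test <- function(gamma_hat, Omega_gg, n, level = 0.05) {
  q <- length(gamma_hat)
  stat <- n * as.numeric(t(gamma_hat) %*% Omega_gg %*% gamma_hat)
  crit <- qchisq(1 - level, df = q)      # chi-square critical value at 1 - level
  list(statistic = stat, critical = crit, reject = stat > crit)
}
```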
The empirical powers of the tests based on the robust and non-robust versions of Q_X(β), K_X(β) and δ for various values of the sample size n are provided in Table 2.
The above empirical investigations reveal that the tests based on robust estimation schemes are more powerful than those based on non-robust estimation. Moreover, as the sample size increases, the power of the robust tests approaches unity, implying the consistency of the robust tests. The power of the non-robust tests is visibly affected when the regressors are mutually correlated, whereas the power of the robust tests remains largely unaffected. The robust desparsification of the lasso is thus seen to safeguard the power of the tests against model misspecification caused by contamination and correlated regressors. For instance, for the Toeplitz covariance structure, the power of the test based on non-robust desparsified estimators reaches only up to 83%, while its robust counterpart exhibits a power of 90%. Desparsification of the lasso is seen to improve the performance of the tests even in the case of non-robust estimation. For instance, when the pairwise correlations among the regressors are equal, the maximum power of the test based on non-robust, non-desparsified estimation is only 69%, while its desparsified counterpart achieves a maximum power of 76%. The better performance of the desparsified estimators is justified, since the usual lasso lacks the oracle properties while desparsification yields consistent estimators.

Robust versus non-robust FIC
It is now of interest to examine the performance of the FIC computed using the robust as well as the non-robust desparsified lasso. We simulate 100 independent samples, each of size n = 300, using the model setting of Section 6.1. The models are to be chosen for two purposes. The first purpose is the minimization of the risk in estimating the average response. For this purpose, the first focus function ζ_1 is chosen to be the linear predictor x_0^T β at a fixed covariate level x_0. As justified in Remark 5.1, the FIC based on this focus identifies the same models as the FIC with focus on the conditional mean function. The second purpose is minimizing the risk in estimation based on the robust quasilikelihood loss under model misspecification. For this purpose, the second focus function ζ_2 is chosen to be the robust quasilikelihood loss function ρ(Y_0, μ_{x_0}(β)).
The narrow model is chosen to be of dimension 4, with covariates (x_1, x_2, x_3, x_4). The covariates are simulated from the multivariate normal distribution with the three choices of covariance structure mentioned in Section 6.1. The model dimensions are subsequently increased by adding covariates one by one until all d = 1000 covariates are included. Theoretically, there are (d choose m) choices for a subset model of dimension m ≤ d. However, for demonstration purposes, we consider a single model of each dimension. In the discussions hereinafter, by a subset model of dimension m, we mean the model consisting of the covariates x_1, . . . , x_m. The FIC for each subset model is computed using the formulas given in Table 1.
We now identify the most preferred model in all the 100 simulations by each of the four versions of the FIC, as well as those obtained by testing of hypotheses. In Table 3, the model choices by the four versions of the FIC are reported, along with the mean squared error (MSE) of the focus function averaged over all 100 simulation runs. We also report the models which attain the smallest MSE of the focus. Let ζ̂^{NR}_{r,i}, r = 1, 2, denote the non-robust estimator of the focus function computed using the desparsified version of the usual non-robust quasilikelihood lasso estimator in simulation run i ∈ {1, 2, . . . , 100}, while ζ̂^{R}_{r,i}, r = 1, 2, denotes the estimator of the focus function computed using the desparsified lasso estimator based on the robust quasilikelihood loss. The non-robust MSE of these focus functions averaged over all the 100 simulation runs is hence given as

NRMSE_r = (1/100) Σ_{i=1}^{100} (ζ̂^{NR}_{r,i} − ζ_r)², r = 1, 2,

where ζ_r denotes the true value of the focus. Similarly, the averaged robust MSE of these focus functions is given as

RMSE_r = (1/100) Σ_{i=1}^{100} (ζ̂^{R}_{r,i} − ζ_r)², r = 1, 2.
From Table 3, it is seen that the values of the MSE based on the robust desparsified lasso (RMSE) are smaller than their non-robust counterparts (NRMSE). For both choices of focus functions and all three choices of covariance structures for the regressors, the robust FIC based on the bias-corrected desparsified lasso (RFIC AUDL) identifies the models that estimate the focus functions with the lowest realized MSE. These choices are marked in boldface in Table 3. The presence of systematic multicollinearity among the regressors is seen to adversely affect the performance of the non-robust FIC. Amongst all the versions of the FIC under consideration, the non-robust FIC based on the PIL and PIDL schemes is affected most significantly by correlated regressors. When the regressors are uncorrelated, the non-robust FIC based on PIL selects models of dimensions 12 and 13 for the focus functions ζ_1 and ζ_2, respectively. These model dimensions increase to 15 and 16 for the cases of the Toeplitz and equal correlation structures. The introduction of correlated regressors is also seen to bring about significant shifts in the magnitude of the NRMSE estimated under the models chosen by the non-robust versions of the FIC, whereas the RMSE values under the models chosen by the robust versions of the FIC are comparatively stable. It is also worth noticing that in the majority of the scenarios, the FIC selects two distinct models for the two distinct focus functions. The MSE values of the two focus functions are seen to differ considerably from one another. These findings reiterate the philosophy underlying focused model selection: models which may be suitable for one purpose may not be suitable for another.

Real data example
We present an analysis of the data of Bühlmann et al. [53] on riboflavin production by Bacillus subtilis. The dataset consists of n = 71 observations on d = 4088 gene expression levels of Bacillus subtilis, which potentially determine the abundance of riboflavin production. The dataset is publicly available in the R library hdi (high-dimensional inference). Let X = {x_ij, i = 1, 2, . . . , n, j = 1, 2, . . . , d} denote the n × d matrix of gene expressions, and let x_i denote its i-th row. Let Q_{m,i}, m = 1, 2, 3 denote the three quartiles of x_i, i = 1, 2, . . . , n, and let IQR_i = Q_{3,i} − Q_{1,i} denote the corresponding interquartile range.
Corresponding to x_i, the response variable Y_i is defined as the count of the gene expressions that lie in the interval [Q_{1,i} − 1.5 IQR_i, Q_{3,i} + 1.5 IQR_i]. The gene expressions which fall outside this interval represent potential outliers in the riboflavin production. The box-and-whiskers plot of these data is given in Figure 1.
A number of outliers are exhibited by the boxplot, a fact which calls for robust modelling of these data.
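For reproducibility, the construction of the count response can be sketched in a few lines of R, assuming the riboflavin data in the hdi package with the gene-expression matrix exposed as riboflavin$x:

```r
library(hdi)
data(riboflavin)                     # 71 observations, 4088 gene expression levels
X <- as.matrix(riboflavin$x)

# Y_i counts the gene expressions of observation i lying inside the Tukey fences
Y <- apply(X, 1, function(xi) {
  q <- quantile(xi, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(xi >= q[1] - 1.5 * iqr & xi <= q[2] + 1.5 * iqr)
})
boxplot(Y)                           # cf. Figure 1
```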
We now discuss model selection for the data under consideration. We train models of dimensions m = 4, 5, . . . , d (with d = 4088 in the present case) on a sub-sample of size 35 and test them on the remaining sub-sample of size 36. The conditional parametric distribution of the response given the covariates is assumed to be Poisson with the log-link. We first perform the non-robust lasso regularization to identify the narrow model. The narrow model hence chosen consists of 32 genes as significant predictors. Starting with this model, we successively add one variable from the data to the existing set of variables and compute all four versions of the FIC, as in Table 1, with respect to the non-robust lasso estimators. The successive inclusion of variables is terminated when the absolute difference between the FIC values at the latest iteration and the previous iteration attains or exceeds a threshold of 0.1. The same algorithm is followed to select the models by the robust FIC, starting with the narrow model chosen by robust lasso regularization. The narrow model in this case identifies 36 genes as significant predictors. The two focus functions ζ_1 and ζ_2 are chosen as the linear predictor and the quasilikelihood loss function, respectively.
We compute the cross-validation MSE of the focus function for each candidate model. In what follows, the terms ζ̂^{NR,train}_r and ζ̂^{NR,test}_r denote the estimators of the focus functions for the training and the test data, respectively, computed using the desparsified version of the likelihood based lasso, while ζ̂^{R,train}_r and ζ̂^{R,test}_r denote their corresponding robust versions based on the robust quasilikelihood loss, r ∈ {1, 2}. The non-robust MSE of the focus function predicted for the test data using the models fitted on the training data is defined as

NRMSE_r = ||ζ̂^{NR,train}_r − ζ̂^{NR,test}_r||², r = 1, 2,

with the robust version RMSE_r defined analogously.
In Table 4, the models chosen by the robust as well as the non-robust versions of the FIC are reported. We also report the MSE values of the focus functions for each of these models. The FIC based on robust desparsification selects the models with the smallest predicted MSE of the focus functions, while the FIC based on non-robust plug-in estimation selects the models with the largest predicted MSE. Overall, the robust versions of the FIC are seen to select models with smaller predicted MSE. Moreover, the FIC is seen to select two different models for the two different focus functions, implying that a model that may be suitable with respect to one focus function may not be suitable for another in terms of minimizing the MSE of the focus estimated under model misspecification.

Conclusions and future directions
In the present paper, we have proposed the robust desparsified lasso using robust M-estimators and established its robustness and asymptotic normality. Using the desparsified estimators, we have derived the FIC for contextual model selection in high dimensions. The simulation studies reveal that tests based on the robust desparsified lasso are more powerful than their non-robust counterparts in the presence of data contamination. Hence, the robust version of the desparsified lasso performs better than the non-robust desparsified lasso of van de Geer et al. [17]. Moreover, the FIC based on such robust desparsification selects models with smaller MSE of the focus functions as opposed to the models chosen by the FIC obtained using likelihood based desparsification. The empirical investigations reveal that the FIC based on robust desparsification and bias-correction leads to models that minimize the risk in the estimation of the focus function under potential model misspecification and systematic multicollinearity among the regressors. The discussions in this paper are confined to penalized estimation using the lasso penalty. It is of potential interest to construct the desparsified versions and the corresponding FIC using penalized estimators with general penalty functions, including important special cases such as the smoothly clipped absolute deviation penalty of Fan and Li [11] and the elastic net penalty of Zou and Hastie [54].
The methodology proposed in this paper is fairly general and applies to a wide range of robust loss functions, the robust quasilikelihood loss being an important special case. Beyond the quasilikelihood, the density power divergence family proposed by Basu et al. [55] and the locally concave loss functions discussed in Loh and Wainwright [8] are also often used for robust inference. Demonstrating the applications of our general results to robust focused modelling using these loss functions is also worth investigating.

Supplementary material
Rigorous proofs of all the results and technical discussions underlying the empirical investigations in Section 6 are placed in the supplementary material.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The first author would like to thank the Council of Scientific and Industrial Research and the University Grants Commission, Government of India, for the award of a Senior Research Fellowship. The second author would like to acknowledge grant MTR/2020/000126 received from the Science and Engineering Research Board, Government of India.