Asymmetric influence measure for high dimensional regression

Abstract

Identification of influential observations is crucial in data analysis, particularly with high dimensional datasets, where the number of predictors exceeds the sample size. These rich, highly detailed datasets are increasingly analyzed in many fields of science, such as genomics, neuroscience, and finance. Unfortunately, classical diagnostic tools are not tailored for identifying influential observations in the high dimensional setup. In this paper, we use the concept of expectiles to develop an influence measure for high dimensional regression. The influence measure is based on the asymmetric marginal correlation, and its derived asymptotic distribution is used to define a threshold based on statistical principles. Our comprehensive simulation results display the favorable qualities of this influence measure under various scenarios. The usefulness of the proposed measure is illustrated through the analysis of a neuroimaging dataset. An R package implementing the procedure is publicly available on GitHub (https://github.com/AmBarry/hidetify).


Introduction
Statistical inference and prediction rely on the calculated values of various statistics or estimators. Frequently, these calculated values are biased by the contribution of one or more influential observations in the dataset, that often stand out from the bulk of the observations in some way. A sample with such observations is often called a contaminated dataset if there is reason to believe that these observations do not accurately represent the signals of interest. Under these conditions, statistical analysis and interpretation based on these estimators will be most likely erroneous. Thus, it is crucial to be able to detect such influential observations and to limit their adverse effect.
For classical linear regression, there is an extensive literature on identifying influential observations and reducing their impact. These efforts led to the development of several influence measures, most of which are available in standard statistical software. These measures are generally based on the estimated coefficients and residuals of the regression model; for example, Cook's distance measure (Cook and Weisberg 1982;Cook 1977) uses the estimated coefficients and the leave-one-out approach to identify influential observations. The Cook's distance of an observation is defined as the difference between the full sample coefficient estimate and the coefficient estimate on the sample excluding that observation, and observations with large values of Cook's distance are identified as influential. Cook suggested identifying these influential observations by determining a threshold based on the Fisher distribution. For more details, Chatterjee and Hadi (1986) presented an excellent literature review, including a broad class of influence measures and a description of the relationship between them and Cook's measure.
Bias and inference problems associated with influential observations also occur in high-dimensional datasets. Indeed, high dimensionality increases the likelihood of an observation or multiple observations to be influential and amplifies their potential impact on downstream analysis. Unfortunately, traditional diagnostic measures, such as Cook's distance, are not directly applicable in high dimension due to singularities. The gram matrix, derived from the design matrix (whose number of columns is often larger than the number of rows), is not invertible and the parameters must be estimated by regularization methods which are computationally demanding and often produce very unstable estimators. In addition, it becomes more difficult or impossible to use visualization techniques to identify influential observations when the data is very large.
Recently, the emergence of high-dimensional data in fields such as genetics, medical imaging, astronomy and finance has spawned a new scientific literature on high-dimensional influence measures. In this regard, She and Owen (2010) suggested a novel penalty-based measure applicable when the number of predictors is larger than the number of observations. They introduced into the classical regression model an indicator variable for each observation and a new parameter (the mean shift parameter) in addition to the parameter of interest. This new parameter is subject to a penalty that identifies the influential observations while simultaneously finding a sparse solution.
Subsequently, using the leave-one-out approach, Zhao et al. (2013) proposed a new influence measure, the high dimensional influence measure (HIM), that captures each individual's contribution to the marginal correlation between the response variable and each predictor in the model. This influence measure is based on the concept of "sure independence screening" (SIS), a variable selection method based on the marginal correlation between predictors and the response variable (Fan and Lv 2008). The HIM defines the contribution of an individual observation as the difference between the full sample marginal correlation and the marginal correlation derived from the sample excluding that observation. Another important contribution of Zhao et al. (2013) is the derivation of the asymptotic distribution of the HIM measure, which defines the statistical threshold beyond which observations are considered influential.

Our asymHIM influence measure, which generalizes the influence measure of Zhao et al. (2013), is based on asymmetric correlations. In this regard, we introduce the asymmetric covariance, which measures the variability of the data around the expectile level $\tau$. When $\tau = 0.5$, the asymmetric covariance becomes the classical covariance and the asymHIM measure corresponds to the HIM measure. Our asymHIM measure captures the heterogeneity of the impact of the observations on the variability of the data. Through this property, it increases the probability of identifying influential observations and therefore provides greater power than the HIM measure of Zhao et al. (2013). To illustrate, consider a sample with 100 observations, where the response variable of the first 10 observations is contaminated (according to the design of Model I of Section 3).
Figure 1 shows the asymHIM influence measure values of the first 20 observations for three values of $\tau \in \{0.25, 0.5, 0.75\}$. The influence of the 10 contaminated observations is better captured by the asymmetric influence measure (asymHIM) defined at level $\tau = 0.25$ or $\tau = 0.75$ than by the influence measure (HIM) defined at level $\tau = 0.5$.

In this paper, we developed a computationally efficient influence measure and derived its asymptotic distribution to determine a statistical cutoff threshold. We evaluated the corresponding p-values by applying a multiple testing procedure, adopting the Bonferroni test to control the family-wise error at the nominal level $\alpha = 5\%$. We provide a publicly available statistical package to simplify the implementation of the influence measure. Being computationally efficient, the asymHIM measure is very easy to implement and will be a very useful instrument in the toolkit of data scientists using high dimensional regression.

Figure 1. The asymmetric points (expectiles) give our influence measure a good ability to identify influential observations with greater power. The response variable of the first 10 observations is contaminated according to the scheme of Model I described in Section 3. The influence measures for the first 10 (contaminated) observations tend to be much larger than for observations 11-20.
In the next section (Section 2), we introduce the expectile statistic, the asymmetric covariance, and the asymmetric correlation. We present the new asymmetric influence measure as well as its asymptotic distribution, which then allows choosing a cutoff threshold for influential observations based on sound statistical principles. In Section 3, we undertake an extensive simulation to assess the performance of our influence measure with respect to the HIM measure. In Section 4, we apply our asymmetric high dimensional influence measure to the Autism Brain Imaging Data Exchange (ABIDE) neuroimaging dataset (Di Martino et al. 2014) and assess its impact on the downstream analysis to predict brain maturity level. Finally, in Section 5, we present the conclusions as well as future avenues of research. We have built an R package for the influence measure, which is publicly available on our GitHub (https://github.com/AmBarry/hidetify).

Notation
Vectors are written in lower-case bold letters, $\mathbf{x} \in \mathbb{R}^p$, and matrices in capital bold letters, $\mathbf{X} \in \mathbb{R}^{n \times p}$. Estimated quantities are written with a hat, e.g. $\hat{\mathbf{X}}$; $\mathbf{X}^{-1}$ denotes the inverse matrix and $\mathbf{X}^T$ the transposed matrix. $\mathbf{I}_{p \times p}$ or $\mathbf{I}_p$ is the identity matrix, written $\mathbf{I}$ when the dimension is implicitly known. The symbol $1(t < 0)$ is the indicator function, equal to 1 if $t < 0$ and 0 otherwise, and the vectors $\mathbf{1}$ and $\mathbf{0}$ are constant vectors filled with 1 and 0, respectively. The norm $\|\cdot\|$ is the $\ell_2$ norm, and $|S|$ is the number of elements in the set $S$. We denote by $S_{\mathrm{inf}}$ the set of influential observations and by $n_{\mathrm{inf}} = |S_{\mathrm{inf}}|$ its cardinality. The set $\hat{S}_{\mathrm{inf}}$ is its estimator, and $S^c_{\mathrm{inf}}$ is its complement. We define $\mathrm{Supp}(\boldsymbol{\beta}) = \{j : \beta_j \neq 0,\ j = 1, \dots, p\}$ as the set of nonzero coefficients (active set) and $\mathrm{Supp}^c(\boldsymbol{\beta}) = \{1, \dots, p\} \setminus \mathrm{Supp}(\boldsymbol{\beta})$ as its complement. Let $\mathbf{y} = (y_1, \dots, y_n)^T$ be the $n \times 1$ response vector and $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_p]$ the associated $n \times p$ design matrix with $\mathbf{x}_j = (x_{1j}, \dots, x_{nj})^T$, $j \in \{1, \dots, p\}$. The observation formed by the pair $(y_i, \mathbf{x}_i^T)$ of each subject $1 \le i \le n$ is assumed to be generated from the following regression model:

$$y_i = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 + \varepsilon_i, \tag{1}$$

where the random error $\varepsilon_i \sim N(0, 1)$ and $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1^T)^T$ is the parameter vector. Notice that observations generated by model (1) are error free and not contaminated.

High dimensional influence measure (HIM)
Under the classical setup $(n > p)$, Cook defined the influence measure (Cook's distance) using the ordinary least squares (OLS) estimator, $\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$. In the high dimensional regression setting, the classical Cook's distance or any other OLS-based influence measure is no longer applicable, because the Gram matrix is not invertible and the OLS estimator is unstable. Additionally, most high dimensional regression estimators rely on a regularization parameter and are computationally expensive. Instead, Zhao et al. (2013) used the marginal correlation between the response and the predictors to define an influence measure in high dimensional regression (HIM). Fan and Lv (2008) showed that marginal correlation is a componentwise regression estimator, which is a specific case of ridge regression with a large regularization parameter. The marginal correlation is easy to compute, and this computational advantage is critical for high dimensional data analysis.

Specifically, define $\rho_j = E[(x_{ij} - \mu_{x_{ij}})(y_i - \mu_{y_i})] / (\sigma_{x_{ij}} \sigma_{y_i})$ as the marginal correlation between the $j$th predictor and the response, for $j = 1, \dots, p$, with $\mu_{x_{ij}} = E[x_{ij}]$, $\mu_{y_i} = E[y_i]$, $\sigma^2_{x_{ij}} = \mathrm{Var}[x_{ij}]$, and $\sigma^2_{y_i} = \mathrm{Var}[y_i]$. Denote by $\hat{\rho}_j = n^{-1} \sum_{i=1}^n (x_{ij} - \hat{\mu}_{x_{ij}})(y_i - \hat{\mu}_{y_i}) / (\hat{\sigma}_{x_{ij}} \hat{\sigma}_{y_i})$ the sample estimate of $\rho_j$, and by $\hat{\mu}_{x_{ij}}$, $\hat{\mu}_{y_i}$, $\hat{\sigma}_{x_{ij}}$, and $\hat{\sigma}_{y_i}$ the sample estimates of $\mu_{x_{ij}}$, $\mu_{y_i}$, $\sigma_{x_{ij}}$, and $\sigma_{y_i}$, respectively. Employing the leave-one-out approach, Zhao et al. (2013) quantified the influence of the $k$th observation in high dimensional regression by the following measure:

$$D_k = \frac{1}{p} \sum_{j=1}^{p} \left( \hat{\rho}_j - \hat{\rho}_j^{(k)} \right)^2,$$

where $\hat{\rho}_j^{(k)}$ is the sample marginal correlation with the $k$th observation removed. The $k$th observation is identified as an influential observation if its measure $D_k$ is large, where large is defined by a statistical threshold. When there are no influential observations in the sample, Zhao et al. (2013) showed that $n^2 D_k \sim \chi^2(1)$ asymptotically, where $\chi^2(1)$ is a chi-square distribution with one degree of freedom.
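The leave-one-out computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Zhao et al. (2013) or of the hidetify package: the function names (`pearson`, `him`, `flag_influential`) are hypothetical, moments use a $1/n$ normalisation, the measure is assumed to be the average over predictors of the squared change in marginal correlation, and the cutoff assumes a Bonferroni correction applied to $\chi^2(1)$ tail probabilities of $n^2 D_k$.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation using 1/n-normalised moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sxy / (sx * sy)


def him(X, y):
    """Leave-one-out influence D_k for each row k of X (n rows, p columns)."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    full = [pearson(c, y) for c in cols]  # full-sample marginal correlations
    D = []
    for k in range(n):
        y_k = y[:k] + y[k + 1:]  # response without observation k
        d = 0.0
        for j, c in enumerate(cols):
            d += (full[j] - pearson(c[:k] + c[k + 1:], y_k)) ** 2
        D.append(d / p)
    return D


def chi2_1_sf(x):
    """Upper tail of chi-square(1): P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


def flag_influential(D, alpha=0.05):
    """Indices whose n^2 * D_k has a chi-square(1) p-value below alpha / n."""
    n = len(D)
    return [k for k, d in enumerate(D) if chi2_1_sf(n * n * d) < alpha / n]
```

Each call to `him` costs on the order of $n^2 p$ operations in this naive form; a practical implementation would update the moments incrementally rather than recomputing each correlation from scratch.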
In the following section, we use the theory of expectiles to introduce a new measure that we call the asymmetric high dimensional influence measure (asymHIM). The asymHIM measure generalizes the HIM measure and increases its sensitivity.

Asymmetric high dimensional influence measure (asymHIM)
We start by introducing the expectile function and presenting some of its properties.
The expectile is an asymmetrically weighted average which characterizes the cumulative distribution function of a random variable in the same way a quantile does. The expectile of a random variable $Y$ is defined as the solution $\mu_\tau(Y)$ which minimizes the following risk function over $\theta \in \mathbb{R}$ for a fixed value of $\tau \in (0, 1)$:

$$\mu_\tau(Y) = \arg\min_{\theta \in \mathbb{R}} E[R_\tau(Y - \theta)]. \tag{2}$$

The function $R_\tau(\cdot)$, of the form

$$R_\tau(t) = |\tau - 1(t \le 0)| \, t^2,$$

is an asymmetric square loss function that assigns weights $\tau$ and $1 - \tau$ to positive and negative deviations, respectively. By equating the first derivative of (2) to zero, the expectile can also be defined as the solution of

$$E\left[ \psi_\tau(Y - \mu_\tau(Y)) \, (Y - \mu_\tau(Y)) \right] = 0,$$

where $\psi_\tau(t) = |\tau - 1(t \le 0)|$ is the piecewise linear check function. Notice that, when $\tau = 0.5$, $\psi_{0.5}(Y - \mu_{0.5}) = 0.5$ and $\mu_{0.5} = \mu = E[Y]$ is the expectation of the random variable $Y$. Given a random sample $\{y_i\}_{i=1}^n$, the $\tau$th empirical expectile $\hat{\mu}_\tau$ is the solution which minimizes the empirical loss function $\frac{1}{n} \sum_{i=1}^n R_\tau(y_i - \theta)$. The sample expectile is computed using the iteratively reweighted least squares (IRLS) algorithm. Since the asymmetric square loss function is convex and continuously differentiable, the IRLS algorithm converges very fast. An R package (expectreg) for estimating expectiles and other expectile models is available (R Core Team 2018; Sobotka et al. 2014; Sobotka and Kneib 2012; Schnabel and Eilers 2009).
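To make the IRLS step concrete, the following is a minimal Python sketch of a sample expectile, assuming the asymmetric weights $\tau$ for observations above the current estimate and $1 - \tau$ otherwise. The function name `expectile` and its tolerance arguments are hypothetical, and this is not the expectreg implementation.

```python
def expectile(values, tau=0.5, tol=1e-10, max_iter=100):
    """Sample tau-expectile via iteratively reweighted least squares."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    mu = sum(values) / len(values)  # the mean is the tau = 0.5 expectile
    for _ in range(max_iter):
        # Weight tau for points above the current estimate, 1 - tau otherwise.
        w = [tau if v > mu else 1.0 - tau for v in values]
        new_mu = sum(wi * v for wi, v in zip(w, values)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu
```

For the sample 1, 2, 3, 4, 5, the first-order condition gives 3 at $\tau = 0.5$ and $11/3$ at $\tau = 0.75$, which the iteration reaches in two steps because the weights stop changing once the estimate settles between two data points.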
We therefore propose to use the expectile function to define an asymmetric covariance and an asymmetric correlation function. An asymmetric covariance measures the covariance between variables after centering them at the expectiles of their distribution. Using this approach, the asymmetric covariance and the asymmetric correlation between the response variable and predictor $j$ are, for a fixed $\tau \in (0, 1)$, defined as:

$$\mathrm{Cov}_{\tau j} = E\left[ (x_{ij} - \mu_\tau(x_{ij})) (y_i - \mu_\tau(y_i)) \right], \qquad \rho_{\tau j} = \frac{\mathrm{Cov}_{\tau j}}{\sigma_\tau(x_{ij}) \, \sigma_\tau(y_i)},$$

where $\sigma^2_\tau(y_i) = E[(y_i - \mu_\tau(y_i))^2]$ is the asymmetric variance and $\sigma^2_\tau(x_{ij})$ is defined similarly. Notice that, when $\tau = 0.5$, $\mu_{0.5} = \mu$, $\mathrm{Cov}_{0.5j} = E[(x_{ij} - \mu(x_{ij}))(y_i - \mu(y_i))]$, and $\rho_{0.5j} = \mathrm{Cov}_{0.5j} / (\sigma_{0.5}(x_{ij}) \sigma_{0.5}(y_i))$ are the mean, the covariance and the classical correlation, respectively.
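The corresponding sample estimates of the asymmetric covariance and correlation, for a fixed $\tau \in (0, 1)$, can be defined as:

$$\widehat{\mathrm{Cov}}_{\tau j} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \hat{\mu}_\tau(x_{ij})) (y_i - \hat{\mu}_\tau(y_i)), \qquad \hat{\rho}_{\tau j} = \frac{\widehat{\mathrm{Cov}}_{\tau j}}{\hat{\sigma}_\tau(x_{ij}) \, \hat{\sigma}_\tau(y_i)},$$

where $\hat{\sigma}^2_\tau(x_{ij}) = n^{-1} \sum_{i=1}^{n} (x_{ij} - \hat{\mu}_\tau(x_{ij}))^2$ is the empirical asymmetric variance and $\hat{\sigma}^2_\tau(y_i)$ is defined similarly. Therefore, using the leave-one-out principle, we introduce an asymmetric high dimensional influence measure for a subject $k \in \{1, \dots, n\}$ and for a single $\tau$ as:

$$D_{\tau k} = \frac{1}{p} \sum_{j=1}^{p} \left( \hat{\rho}_{\tau j} - \hat{\rho}^{(k)}_{\tau j} \right)^2,$$

where $\hat{\rho}^{(k)}_{\tau j}$ is the marginal asymmetric correlation with the $k$th observation removed, for $j = 1, \dots, p$ and $k = 1, \dots, n$. It is computed from $\hat{\mu}^{(k)}_\tau(x_{ij})$, $\hat{\mu}^{(k)}_\tau(y_i)$, $\hat{\sigma}^{(k)}_\tau(x_{ij})$, and $\hat{\sigma}^{(k)}_\tau(y_i)$, the sample estimates with the $k$th observation removed.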
The asymHIM measure retains the same computational advantages as the HIM measure of Zhao et al. (2013). As noted by Zhao et al. (2013), the impact of the influence measure is not limited to the marginal correlation alone. On the contrary, any substantial change in the marginal correlation will have implications on downstream analyses such as variable selection and parameter estimation. In Section 3, we show the effect of identifying influential observations with asymHIM measure on parameter estimation and variable selection.
Several different strategies can be proposed for using the asymmetric influence measure, depending on the value of $\tau$. For example, we could choose a sequence of $\tau$ values covering the left, right and center of the distribution and select the best measure based on criteria such as the true positive rate $\mathrm{TPR}_{\mathrm{inf}}$ and the false positive rate $\mathrm{FPR}_{\mathrm{inf}}$ (which are defined in Section 3). We could also use the max or the min over that sequence as an influence measure. We tested all these scenarios, and the simulation results, which are not shown here, were the same for any single $\tau$. However, when summarizing across values of $\tau$, we found that the min influence measure had a small true positive rate $(\mathrm{TPR}_{\mathrm{inf}})$, whereas the max influence measure had an inflated false positive rate $(\mathrm{FPR}_{\mathrm{inf}})$. Among all the scenarios, the sum-based asymmetric influence measure led to a better tradeoff between $\mathrm{TPR}_{\mathrm{inf}}$ and $\mathrm{FPR}_{\mathrm{inf}}$. In the following, we therefore propose the sum-based asymmetric influence measure as an instrument for identifying influential observations in high dimensional regression. We define our asymmetric high dimensional influence measure (asymHIM) for a subject $k$ and for a sequence of asymmetric points $(\tau_1, \dots, \tau_q)$ as:

$$D_k = \sum_{l=1}^{q} \frac{1}{p} \sum_{j=1}^{p} \left( \hat{\rho}_{\tau_l j} - \hat{\rho}^{(k)}_{\tau_l j} \right)^2,$$

where $\hat{\rho}_{\tau_l j}$ and $\hat{\rho}^{(k)}_{\tau_l j}$ are the sample asymmetric correlation estimates for the full sample and with the $k$th observation removed, for a fixed $\tau_l$, $l = 1, \dots, q$. This newly defined measure helps capture possible asymmetry in the data across several values of $\tau$ in the range $(0, 1)$. The asymHIM influence measure did not perform as well when the sequence of $\tau$ values included the extremes of the distribution (e.g., expectiles such as 0.95, 0.9, 0.1, 0.05); these expectiles have a higher probability of being influential observations themselves or of being affected by them.
In our explorations, we noted that the asymHIM influence measure based on 5 values, $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$, or on 3 values, $\tau \in \{0.25, 0.5, 0.75\}$, gave better true positive and false positive rates than larger series of values for $\tau$. Therefore, in Section 3, we present the results of the asymHIM measure based on 5 (denoted asymHIM5) and 3 (asymHIM3) values of $\tau$. The term asymHIM refers to either of the two variants. We also present in the Supplementary material file the impact of using a larger number of expectiles, especially when some of the expectiles lie at the extreme ends of the distribution.
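A sum-based measure over several expectile levels can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the hidetify implementation: the names `expectile`, `asym_corr`, and `asym_him` are hypothetical, centering and scaling use moment-type sample estimates with a $1/n$ normalisation, and each leave-one-out correlation is recomputed from scratch for clarity rather than speed.

```python
import math


def expectile(values, tau, tol=1e-10, max_iter=100):
    """Sample tau-expectile via iteratively reweighted least squares."""
    mu = sum(values) / len(values)
    for _ in range(max_iter):
        w = [tau if v > mu else 1.0 - tau for v in values]
        new_mu = sum(wi * v for wi, v in zip(w, values)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


def asym_corr(x, y, tau):
    """Correlation of x and y after centering both at their tau-expectiles."""
    n = len(x)
    mx, my = expectile(x, tau), expectile(y, tau)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)


def asym_him(X, y, taus=(0.25, 0.5, 0.75)):
    """Sum over tau of the leave-one-out change in asymmetric correlation."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    D = [0.0] * n
    for tau in taus:
        full = [asym_corr(c, y, tau) for c in cols]
        for k in range(n):
            y_k = y[:k] + y[k + 1:]
            D[k] += sum(
                (full[j] - asym_corr(c[:k] + c[k + 1:], y_k, tau)) ** 2
                for j, c in enumerate(cols)
            ) / p
    return D
```

With `taus=(0.5,)` the function reduces to the symmetric measure based on the ordinary correlation, which is a convenient sanity check when experimenting with other sequences of levels.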

Asymptotic properties of asymHIM
An important step in developing an influence measure is the derivation of its asymptotic distribution. Toward that goal, we establish in this section the asymptotic distribution of the asymHIM measure under the following conditions.
H1. For any fixed $\tau \in (0, 1)$ and $1 \le j \le p$, $\rho_{\tau j}$ is constant and does not change as $p$ increases.
H2. The asymmetric covariance matrix $\Sigma_\tau = \mathrm{Cov}_\tau(\mathbf{x}) = E[(\mathbf{x} - \mu_\tau(\mathbf{x}))(\mathbf{x} - \mu_\tau(\mathbf{x}))^T]$, with the eigen decomposition $\Sigma_\tau = \sum_{j=1}^{p} \lambda_{\tau j} \mathbf{u}_j \mathbf{u}_j^T$, is assumed to verify $l_{p\tau} = \sum_{j=1}^{p} \lambda^2_{\tau j} = O(p^r)$ for some $0 \le r < 2$ and for any fixed $\tau \in (0, 1)$.
H3. The predictor $\mathbf{x}_i^T$ follows a multivariate normal distribution and the random noise $\varepsilon_i$ follows a normal distribution.
The stated conditions are similar to those stated by Zhao et al. (2013). Condition H1 assumes a fixed correlation $\rho_{\tau j}$ between the response and predictor $j$ for any fixed $\tau$, independently of $p$. Condition H2 permits large eigenvalues of the covariance matrix $\Sigma_\tau$, but at a rate controlled by the dimensionality $p$. A sufficient condition would be that $\max_{1 \le j \le p} \lambda_{\tau j}$ be bounded. The normality assumption, Condition H3, is used, among other things, to ensure independence across columns of the matrix $\mathbf{X}$. For reading convenience we introduce the following notation: $\mu_{j\tau} = \mu_\tau(x_{ij})$, $\mu_\tau = \mu_\tau(y_i)$, $\sigma_{j\tau} = \sigma_\tau(x_{ij})$ and $\sigma_\tau = \sigma_\tau(y_i)$.

Theorem 1. Assume conditions H1-H3 and that the parameters $\mu_{j\tau}$, $\mu_\tau$, $\sigma_{j\tau}$ and $\sigma_\tau$ are known. When there are no influential points and $\min(n, p) \to \infty$, then

$$n^2 D_k \to \chi^2(q),$$

where $\chi^2(q)$ is the chi-square distribution with $q$ degrees of freedom, and $q$ is the number of asymmetric points. Notice that the number of asymmetric points is accounted for through the degrees of freedom of the chi-square distribution.
Theorem 1 is stated under the assumption that the parameters $\mu_{j\tau}$, $\mu_\tau$, $\sigma_{j\tau}$ and $\sigma_\tau$ are known. However, in reality they are unknown and are replaced by their estimators. We now replace them by their $\sqrt{n}$-consistent estimators and show that the asymptotic result continues to hold. We choose their corresponding sample moment estimates: $\hat{\mu}_{j\tau}$, $\hat{\mu}_\tau$, $\hat{\sigma}_{j\tau}$ and $\hat{\sigma}_\tau$. Notice that we used robust estimators to compute the asymmetric influence measure in practice: we estimated $\mu_{j\tau}$ and $\mu_\tau$ by the corresponding empirical quantile of level $\tau$, and $\sigma_{j\tau}$ and $\sigma_\tau$ by the median absolute deviation (MAD) estimator. To derive the next result we introduce the following quantities: $(Q_{j\tau}, R_{j\tau}) = ((\hat{\mu}_{j\tau} - \mu_{j\tau}) / \sigma_{j\tau},\ \hat{\sigma}_{j\tau} / \sigma_{j\tau})$ and $(Q_\tau, R_\tau) = ((\hat{\mu}_\tau - \mu_\tau) / \sigma_\tau,\ \hat{\sigma}_\tau / \sigma_\tau)$. In addition, let: We make the following additional assumption.
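H4. For all $1 \le j \le p$, $(Q_{j\tau}, R_{j\tau})$ are the same symmetric function of $\{\dot{x}_{tj\tau} = (x_{tj} - \mu_{j\tau}) / \sigma_{j\tau}, \text{ for } 1 \le t \le n\}$, and $(Q_\tau, R_\tau)$ are also the same symmetric function of $\{\dot{y}_{t\tau} = (y_t - \mu_\tau) / \sigma_\tau, \text{ for } 1 \le t \le n\}$. We assume that $S_{Qx\tau}$, $S_{Rx\tau}$, $S_{Qy\tau}$, and $S_{Ry\tau}$ are finite.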
Proposition 2. For a fixed $\tau \in (0, 1)$ and $j = 1, \dots, p$, assume that $\hat{\mu}_{j\tau}$, $\hat{\mu}_\tau$, $\hat{\sigma}_{j\tau}$ and $\hat{\sigma}_\tau$ are $\sqrt{n}$-consistent and satisfy assumption H4. Substituting $\mu_{j\tau}$, $\mu_\tau$, $\sigma_{j\tau}$ and $\sigma_\tau$ with their corresponding estimates in $D_{\tau k}$, Theorem 1 continues to hold under the same conditions.
The proofs of Theorem 1 and Proposition 2 are in the Supplementary material file. In their paper, Zhao et al. (2013) showed that the influence measure (when $\tau = 0.5$) is a function of two main terms that play similar roles to those of the residuals and the diagonal elements of the hat matrix in the expression of the classical Cook's distance. They also showed that the influence measure identifies the influential observation with probability one as $n$ and $p \to \infty$. All these statistical properties remain valid for the asymmetric influence measure.
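In practice, a chi-square limit of this kind turns each influence value into a p-value that can be compared with a Bonferroni-adjusted level. The Python sketch below assumes the test statistic is $n^2 D_k$ referred to a $\chi^2(q)$ distribution and that the family-wise level $\alpha$ is split evenly over the $n$ observations. The helper names `chi2_sf` and `flag_influential` are hypothetical, and the tail probability is computed with a textbook series for the regularized incomplete gamma function rather than any particular library routine.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi2_df > x), using the series for the regularized
    lower incomplete gamma function P(a, z) with a = df / 2, z = x / 2."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    if z > 600.0:
        # Guard against overflow; the tail is far below any practical level.
        return 0.0
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > 1e-16 * abs(total) and k < 10000:
        k += 1
        term *= z / (a + k)
        total += term
    log_p = a * math.log(z) - z - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))


def flag_influential(D, q, alpha=0.05):
    """Indices k whose p-value for n^2 * D_k is below alpha / n (Bonferroni)."""
    n = len(D)
    pvals = [chi2_sf(n * n * d, q) for d in D]
    return [k for k, pv in enumerate(pvals) if pv < alpha / n]
```

For two degrees of freedom the tail probability has the closed form $e^{-x/2}$, which gives a quick check of `chi2_sf` before using it with other values of $q$.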

Design
In this section, we evaluate the performance of the asymHIM measure through extensive simulations. The simulated samples are generated from model equation (1) and the contaminated data are generated by the following model where $\tilde{n} = 0.1n$ is the number of subjects with contaminated data. Following the contamination scheme of Zhao et al. (2013), we contaminated the data according to two different schemes: Model I and Model II. In Model I, only the response variable of the $\tilde{n}$ subjects is altered and the predictors are left unchanged, i.e., $\tilde{\mathbf{x}}_i = \mathbf{x}_i$. The alteration is generated from the following equation: The parameter $\boldsymbol{\gamma}$ allows the contamination of the response variable by creating a false relationship between the response variable and the variables without any predictive power. The parameter $\kappa \in \{0, 0.4, 0.8, 1.2, 1.6\}$ controls the degree of contamination, with larger values resulting in points that display more deviation. When $\kappa = 0$, there is no contamination and the dataset is clean; that is, there are no a priori known influential points in the dataset. The values of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are set below.
In Model II, only the predictors $\tilde{\mathbf{x}}_i$ of the $\tilde{n}$ subjects are altered and the response variable is unaltered, i.e., $\tilde{y}_i = y_i$. The contamination of the predictors is set up under four different scenarios according to the degree of contamination and the contaminated region $(S)$. We contaminated either the first $\tilde{p}_1$ or $\tilde{p}_2$ predictors, $S_1 = \{1, \dots, \tilde{p}_1\}$ and $S_2 = \{1, \dots, \tilde{p}_2\}$, or the last $\tilde{p}_1$ or $\tilde{p}_2$ predictors, $S_3 = \{p - \tilde{p}_1 + 1, \dots, p\}$ and $S_4 = \{p - \tilde{p}_2 + 1, \dots, p\}$, using two degrees of contamination, $\tilde{p}_1 = 0.1p$ and $\tilde{p}_2 = 0.3p$. The contamination is formalized by the following equation: We set the number of subjects $n \in \{100, 500\}$ and the number of predictors $p \in \{1000, 5000\}$. The simulation is carried out with 200 replications for each setting. We estimated the asymmetric correlation at 5 points, $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$, and at 3 points, $\tau \in \{0.25, 0.5, 0.75\}$.

We evaluated the performance of the asymHIM measure according to two criteria. For the first criterion, we measured its power to retrieve all influential observations with the true positive rate $(\mathrm{TPR}_{\mathrm{inf}})$. For the second criterion, we computed the false positive rate $(\mathrm{FPR}_{\mathrm{inf}})$ to assess its tendency to falsely flag non-influential observations as influential. Formally, we have:

$$\mathrm{TPR}_{\mathrm{inf}} = \frac{|\hat{S}_{\mathrm{inf}} \cap S_{\mathrm{inf}}|}{n_{\mathrm{inf}}}, \qquad \mathrm{FPR}_{\mathrm{inf}} = \frac{|\hat{S}_{\mathrm{inf}} \cap S^c_{\mathrm{inf}}|}{n - n_{\mathrm{inf}}},$$

where $S_{\mathrm{inf}}$ is the set of influential observations, $n_{\mathrm{inf}}$ its size, and $\hat{S}_{\mathrm{inf}}$ its estimator generated by an influence measure. We generated the estimated influential set $\hat{S}_{\mathrm{inf}}$ by controlling the family-wise error at the nominal level $\alpha = 5\%$. Among the multiple testing procedures, we chose the Bonferroni test.
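The two detection criteria amount to simple set arithmetic on observation indices. A minimal Python sketch follows; the function name `detection_rates` is hypothetical, and the rates are assumed to be the share of truly influential observations that are flagged and the share of clean observations that are flagged.

```python
def detection_rates(flagged, influential, n):
    """Return (TPR_inf, FPR_inf) for flagged indices among n observations."""
    flagged, influential = set(flagged), set(influential)
    n_inf = len(influential)
    n_clean = n - n_inf
    # Share of truly influential observations that were flagged.
    tpr = len(flagged & influential) / n_inf if n_inf else float("nan")
    # Share of clean observations that were flagged by mistake.
    fpr = len(flagged - influential) / n_clean if n_clean else float("nan")
    return tpr, fpr
```

For example, with $n = 100$, the first 10 observations contaminated, and observations 0, 1, 2 and 50 flagged, `detection_rates([0, 1, 2, 50], range(10), 100)` returns a true positive rate of 0.3 and a false positive rate of 1/90.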
In addition, we evaluated the performance of the asymHIM influence measure through its impact on coefficient estimation and variable selection. To do this, we fit LASSO models to the raw data and to the clean data obtained after applying the influence measure. Then we evaluated the accuracy of the coefficient estimates and of their support (the set of non-zero estimated coefficients). The accuracy of the coefficient estimates is evaluated by the following error function:

$$\mathrm{ERR} = \| \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \|_2,$$

where $\boldsymbol{\beta}$ and $\hat{\boldsymbol{\beta}}$ are respectively the parameter and its estimator, and $\| \cdot \|_2$ is the $\ell_2$ norm.
We evaluated the influence measure in terms of variable selection by comparing the support of the parameter (the set of non-zero parameters) with that of the parameter estimator. For this we defined two criteria: the true positive rate (TPR) and the false positive rate (FPR). The TPR is the percentage of the true non-zero coefficients selected by the model and the FPR is the percentage of the false non-zero coefficients selected by the model. Let $\mathrm{Supp}(\boldsymbol{\beta}) = \{j : \beta_j \neq 0,\ j = 1, \dots, p\}$ and $\mathrm{Supp}^c(\boldsymbol{\beta}) = \{1, \dots, p\} \setminus \mathrm{Supp}(\boldsymbol{\beta})$; then the TPR and the FPR are defined as:

$$\mathrm{TPR} = \frac{|\mathrm{Supp}(\hat{\boldsymbol{\beta}}) \cap \mathrm{Supp}(\boldsymbol{\beta})|}{|\mathrm{Supp}(\boldsymbol{\beta})|}, \qquad \mathrm{FPR} = \frac{|\mathrm{Supp}(\hat{\boldsymbol{\beta}}) \cap \mathrm{Supp}^c(\boldsymbol{\beta})|}{|\mathrm{Supp}^c(\boldsymbol{\beta})|}.$$

Simulations were conducted using high performance computing clusters provided by Calcul Quebec and Compute Canada. All computations were performed with the R (v3.5.0) statistical programming language (R Core Team 2018). The R package hidetify that accompanies this manuscript is publicly available on GitHub at https://github.com/AmBarry/hidetify.

Recall that asymHIM5 is the asymHIM measure based on 5 values, $\tau \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$, and asymHIM3 is the asymHIM measure based on 3 values, $\tau \in \{0.25, 0.5, 0.75\}$. In the following, asymHIM refers to both asymHIM5 and asymHIM3. Note that the simulation results for the parameter values $\boldsymbol{\beta}_2$ and $\boldsymbol{\beta}_3$ are in the Supplementary material file.

Figure 2 reports the $\mathrm{TPR}_{\mathrm{inf}}$ and $\mathrm{FPR}_{\mathrm{inf}}$ of the three influence measures (asymHIM5, asymHIM3 and HIM) in Model I. The two columns show results for different sample sizes (100 and 500) and the two rows correspond to different numbers of predictors (1000 and 5000). The x-axis shows the degree of contamination of the datasets and the y-axis the values of the $\mathrm{TPR}_{\mathrm{inf}}$ and $\mathrm{FPR}_{\mathrm{inf}}$ of the influence measures. The $\mathrm{TPR}_{\mathrm{inf}}$ and $\mathrm{FPR}_{\mathrm{inf}}$ results of Model II are reported in Figure 3, where the first column (A) shows results for $n = 100$ and the second (B) shows results for $n = 500$. An effective influence measure will have large $\mathrm{TPR}_{\mathrm{inf}} \approx 1$ and small $\mathrm{FPR}_{\mathrm{inf}} \approx 0$.
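The estimation and variable selection criteria used in this section can be computed from a true coefficient vector and its estimate as in the following Python sketch. The names `support` and `selection_metrics` are hypothetical, the estimation error is assumed to be the Euclidean distance between the two vectors, and TPR and FPR are returned as proportions rather than percentages.

```python
import math


def support(beta, tol=0.0):
    """Indices of coefficients whose absolute value exceeds tol."""
    return {j for j, b in enumerate(beta) if abs(b) > tol}


def selection_metrics(beta_true, beta_hat):
    """Return (ERR, TPR, FPR) comparing an estimate with the true vector."""
    err = math.sqrt(sum((bh - b) ** 2 for b, bh in zip(beta_true, beta_hat)))
    s_true, s_hat = support(beta_true), support(beta_hat)
    s_comp = set(range(len(beta_true))) - s_true  # true zero coefficients
    tpr = len(s_hat & s_true) / len(s_true) if s_true else float("nan")
    fpr = len(s_hat & s_comp) / len(s_comp) if s_comp else float("nan")
    return err, tpr, fpr
```

With a true vector (1, 0, 2, 0, 0) and an estimate (0.5, 0, 0, 0.1, 0), one of the two active coefficients is recovered and one of the three inactive coefficients is wrongly selected, so the sketch gives a TPR of 1/2 and an FPR of 1/3.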

$\mathrm{TPR}_{\mathrm{inf}}$ and $\mathrm{FPR}_{\mathrm{inf}}$
The power $(\mathrm{TPR}_{\mathrm{inf}})$ of the influence measures increases with the degree of contamination, while the $\mathrm{FPR}_{\mathrm{inf}}$ remains relatively stable when the data are contaminated, particularly for the asymHIM3 and HIM measures. Our influence measures (asymHIM5 and asymHIM3) have higher power than the HIM measure $(\tau = 0.5)$, and the difference is greater in Model II, where the power of the HIM measure can be below 30% (Figure 3). Our influence measures have larger $\mathrm{FPR}_{\mathrm{inf}}$ than the HIM measure. However, the $\mathrm{FPR}_{\mathrm{inf}}$ of the asymHIM3 influence measure is below the nominal level of 5% and falls below 2% when the sample size increases ($n = 500$; Figure 3). Finally, in Model II, the results are similar across the contaminated regions ($S_1$ to $S_4$). Overall, our influence measures perform better than the HIM measure $(\tau = 0.5)$, particularly in Model II. The HIM influence measure has smaller $\mathrm{FPR}_{\mathrm{inf}}$, but in terms of the trade-off between $\mathrm{TPR}_{\mathrm{inf}}$ and $\mathrm{FPR}_{\mathrm{inf}}$, the asymHIM3 influence measure, whose power is comparable to that of asymHIM5 and whose $\mathrm{FPR}_{\mathrm{inf}}$ is comparable to that of HIM, is more efficient.
Results for the parameter settings $\boldsymbol{\beta} = \boldsymbol{\beta}_2$ and $\boldsymbol{\beta} = \boldsymbol{\beta}_3$ lead to the same conclusions and are in the Supplementary material file. Table 2 and Table 3 report the results for the criteria ERR, TPR and FPR, which serve to evaluate the impact of the correlation-based influence measures on downstream statistical analyses such as parameter estimation and variable selection.

Variable selection and parameter estimation
In Model I, we observe that the TPR and FPR of all the methods are close to their optimal values when the dataset is clean $(\kappa = 0)$. The results show that the bias (ERR) of the LASSO increases significantly with the degree of contamination $(\kappa)$, while the bias of the LASSO + asymHIM and LASSO + HIM decreases. Overall, the bias of the LASSO + asymHIM, especially that of the LASSO + asymHIM3, is smaller than that of the LASSO + HIM. The LASSO + asymHIM performs better in terms of TPR, while the FPR of the LASSO is somewhat smaller than that of the LASSO + asymHIM and LASSO + HIM, which have the same FPR values.
In general, the LASSO + asymHIM estimate has lower bias (ERR), and its support most often includes the parameter support (TPR). The results for $\boldsymbol{\beta}_2$ and $\boldsymbol{\beta}_3$ in Model I are similar and are in the Supplementary material file.
The results of Model II (Table 3) depend on the overlap between the support of the parameter and the contaminated region, $S_1$ to $S_4$. When the support of the parameter $\boldsymbol{\beta}$ overlaps with the contaminated region, the LASSO + asymHIM method outperforms the LASSO + HIM and the LASSO methods in terms of ERR, TPR and FPR. Otherwise, the three methods perform similarly in terms of ERR and TPR, and LASSO + asymHIM falls behind the LASSO + HIM and the LASSO methods in terms of FPR. For example, when the support of the parameter $\boldsymbol{\beta}$ is that of $\boldsymbol{\beta}_1$ and the contaminated regions are $S_1$ and $S_2$, LASSO + asymHIM outperforms the other methods, and it performs as well as the other methods when the contaminated regions are $S_3$ and $S_4$. When the support of the parameter $\boldsymbol{\beta}$ is that of $\boldsymbol{\beta}_2$ and the contaminated regions are $S_3$ and $S_4$, the LASSO + asymHIM outperforms the other methods, and it performs similarly otherwise. Finally, when the support of the parameter $\boldsymbol{\beta}$ is that of $\boldsymbol{\beta}_3$, the support overlaps with all the contaminated regions $S_1$ to $S_4$. In this case, the LASSO + asymHIM performs better than the other methods in terms of ERR and TPR, and its FPR is slightly higher than that of the other methods. We also observe in Model II the same phenomenon observed in Model I when $n = 500$ and $p = 1000$. The results for $\boldsymbol{\beta}_2$ and $\boldsymbol{\beta}_3$ are in the Supplementary material file.
We observe that the LASSO method slightly outperforms the other methods when there is no overlap between the predictors with non-zero coefficients and the contaminated region $S \in \{S_1, \dots, S_4\}$. We also notice a poor performance of all methods (high ERR, small TPR) when $p = 1000$ and $n = 500$ (Table 2 and Table 3).

Computation time
We used the microbenchmark package (Mersmann 2018), with 100 replications, to evaluate the computation time of the influence measures on a sample of size $n = 500$ with $p = 5000$ predictors. On average, identifying the influential observations in such a sample takes 5 seconds with the HIM measure, 7 seconds with the asymHIM3 measure, and 11 seconds with the asymHIM5 measure (Figure 4).

Masking and swamping effects
We also simulated a third model, where both the response and the predictors of the first 10 observations are contaminated. The response is contaminated according to the scheme of Model I and the predictors are contaminated according to the scheme of Model II. We applied our asymHIM measure and the influence measure (HIM) of Zhao et al. (2013) to detect the influential observations, and neither method performed well. Results, which are not shown here, displayed high false positive rates $(\mathrm{FPR}_{\mathrm{inf}} > 50\%)$ for both methods. That is, both methods falsely reported many good observations as influential. Notice that Model III was also simulated by Zhao et al. (2013), but they did not report the $\mathrm{FPR}_{\mathrm{inf}}$ of their HIM measure. Under this third model, the poor performance of both methods can be explained by the strong contamination of the data (response and predictors) and by the violation of the underlying asymptotic assumption. The contamination in Model III occurs simultaneously in the response and the predictors, and their effects are multiplicative since the influence measure is based on the marginal correlation of the response and the predictors. Additionally, the influence measure of each subject $k \in \{1, \dots, n\}$ is affected by the contribution of the influential observations, even though the theory assumes that there are no influential observations in the dataset.

Table 3. Model II. Impact is measured by bias (ERR), true positive rate (TPR), and false positive rate (FPR) according to the sample size $n \in \{100, 500\}$, the number of predictors $p \in \{1000, 5000\}$, the parameter value $\boldsymbol{\beta}_1$, the contaminated region ($S_1$ to $S_4$) and the degree of contamination, $\kappa \in \{0, 0.4, 0.8, 1.2, 1.6\}$.
Hence, when there are influential observations in the dataset and their contributions are very large, which is the case when the effect is multiplicative, the influence measures derived from the good observations are swamped. To illustrate the problem, consider a fixed $\tau$ and a single predictor ($p = 1$), then the influence measure of the k-th observation is: Then, the k-th observation is identified as a good observation when the contribution of the influential observations is bounded, that is: where $\chi^2_{1-\alpha}(1)$ is the $100(1-\alpha)\%$ quantile of a $\chi^2(1)$ distribution. However, the influential observations, by definition, have high correlations, and this upper bound is difficult to verify.
In summary, with the above equation, we can see how good observations can be swamped by the presence of even a single influential observation, and this swamping effect is exacerbated when the contamination effect is multiplicative, i.e., when response and predictor are contaminated simultaneously. A similar problem arises in classical statistics (Bendre 1989; Nurunnabi, Hadi, and Imon 2014; Roberts, Martin, and Zheng 2015), and many solutions exist in this framework. Some of those solutions have been extended to the high dimensional regression setting (Zhao et al. 2019; Wang et al. 2018; Wang et al. 2019), and their adaptation to the asymHIM measure may help mitigate the swamping effect in general.
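The mechanics of the single-predictor decision rule discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the symmetric leave-one-out marginal correlation, in the spirit of the HIM measure, rather than the asymHIM formula; it assumes the statistic $n^2(\hat\rho - \hat\rho^{(k)})^2$ is compared with a $\chi^2(1)$ reference distribution; and all function names and the toy data are hypothetical and are not taken from the hidetify package.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def loo_influence(x, y):
    """Assumed statistic n^2 * (rho - rho_without_k)^2 for each k (p = 1)."""
    n = len(x)
    rho = pearson(x, y)
    stats = []
    for k in range(n):
        # Marginal correlation recomputed with observation k left out.
        rho_k = pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        stats.append(n ** 2 * (rho - rho_k) ** 2)
    return stats

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square(1) variable, P(Z^2 > stat)."""
    return math.erfc(math.sqrt(stat / 2.0))

def flag_influential(x, y, alpha=0.05):
    """Indices whose p-value falls below the Bonferroni threshold alpha / n."""
    n = len(x)
    pvals = [chi2_1_pvalue(s) for s in loo_influence(x, y)]
    return [k for k, p in enumerate(pvals) if p < alpha / n]

# Toy data (hypothetical): 20 uncorrelated clean points plus one point
# contaminated in both the predictor and the response.
x = [-1.0, 1.0, -1.0, 1.0] * 5 + [10.0]
y = [-1.0, -1.0, 1.0, 1.0] * 5 + [10.0]
flagged = flag_influential(x, y)
```

In this toy example only the jointly contaminated point is flagged. The sketch does not reproduce the swamping seen in Model III; it only shows how each observation's statistic depends on a full-sample correlation that the contaminated points also enter.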

Materials and methods
Structural and functional neuroimaging datasets are high dimensional, their vertex-wise measurements are strongly correlated, and they are often contaminated by acquisition and preprocessing artifacts (Fritsch et al. 2012). Indeed, the brain is a spatially embedded system, so there is a large amount of spatial autocorrelation. This means that an influential observation will likely be extreme on many features. We applied our asymmetric high dimensional influence measure (asymHIM3) to the Autism Brain Imaging Data Exchange (ABIDE) neuroimaging dataset, and conducted a downstream analysis to predict brain maturity based on cortical thickness. Brain maturity prediction is a well studied subject in neuroscience, used to establish a baseline for normal brain development against which neurodevelopmental disorders can be assessed (Khundrakpam, Tohka, and Evans 2015).

Participants
The ABIDE dataset comprises 573 control and 539 autism spectrum disorder (ASD) individuals from 16 international sites (Di Martino et al. 2014). The neuroimaging data of these individuals were obtained from the DataLad repository (http://fcon_1000.projects.nitrc.org/indi/abide/). Only a subset of individuals is used in this work due to failures in the image processing pipeline. The demographic description of these individuals is provided in Table 4. In this paper, we used 542 controls from the ABIDE study to illustrate the asymHIM influence measure.

MR image processing and cortical thickness measurements
The MR images (T1-weighted scans) were processed using the FreeSurfer 6.0 pipeline (Fischl 2012; Dale, Fischl, and Sereno 1999) deployed on CBrain, a high-performance computing facility (Sherif et al. 2014). FreeSurfer delineates the cortical surface from a given MR scan and quantifies thickness measurements on this surface for each brain hemisphere. The pipeline consists of 1) affine registration to the MNI305 space (Collins et al. 1994); 2) bias field correction; 3) removal of skull, cerebellum, and brainstem regions from the MR image; 4) estimation of the white matter surface based on MR image intensity gradients between the white and gray matter; and 5) estimation of the pial surface based on intensity gradients between the gray matter and cerebrospinal fluid (CSF). The distance between the white and pial surfaces provides the thickness estimate at a given location of the cortex. For detailed descriptions, refer to Fischl (2012) and Dale, Fischl, and Sereno (1999). The individual cortical surfaces are then projected onto a common space (i.e., fsaverage) characterized by 163,842 vertices per hemisphere to establish inter-individual correspondence.

Data cleaning and age prediction
We applied the HIM influence measure of Zhao et al. (2013) and our influence measure (asymHIM) to identify and remove influential observations from the ABIDE dataset, using a Bonferroni correction at the 5% level. Our influence measure identified 48 subjects as influential, while the HIM measure identified 21 subjects, all of which are among our 48 subjects (see Figure 5). Then, we fit an elastic net model, with the R package glmnet (Friedman, Hastie, and Tibshirani 2010), to the three resulting datasets (Full, HIM-cleaned, and asymHIM-cleaned) to predict age based on cortical thickness measurements. The Full dataset contained 542 subjects, the HIM-cleaned dataset 521 subjects, and the asymHIM-cleaned dataset 494 subjects. In each of the 3 datasets, the age response is predicted with 299,569 cortical vertex features: 149,778 cortical vertices from the right hemisphere and 149,791 cortical vertices from the left hemisphere. Following Khundrakpam, Tohka, and Evans (2015), we estimated the elastic net model using the hyper-parameter $\theta = 0.5$, which balances the $L_1$ and $L_2$ penalization. To simultaneously select the best regularization parameter and evaluate the prediction accuracy, we applied nested 10-fold cross-validation (CV) loops (Ambroise and McLachlan 2002). In the inner 10-fold CV loop, we selected the best regularization parameter, and then age prediction accuracy was evaluated in the outer 10-fold CV loop. To measure goodness of fit, we used the mean absolute error (MAE) between the chronological and estimated age in the inner loop, and the correlation coefficient between the chronological and estimated age in the outer loop. Finally, since every 10-fold CV is random, we repeated the analysis 100 times and report the distribution of results as violin plots.
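The nested cross-validation scheme described above can be sketched as follows. This is a minimal illustration, not the analysis code: it swaps the glmnet elastic net for a toy one-predictor ridge regression with a closed-form fit, scores the outer loop with MAE instead of the correlation coefficient for brevity, and uses simulated data. All function names are hypothetical.

```python
import random
from statistics import mean

def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """One-predictor ridge fit on centered data; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / (sxx + lam)
    return my - slope * mx, slope

def mae(model, xs, ys):
    """Mean absolute error of the fitted line on (xs, ys)."""
    b0, b1 = model
    return mean(abs(b - (b0 + b1 * a)) for a, b in zip(xs, ys))

def subset(values, idx):
    return [values[i] for i in idx]

def nested_cv(x, y, lambdas, k=10, seed=0):
    """Return one (selected lambda, held-out MAE) pair per outer fold."""
    rng = random.Random(seed)
    all_idx = list(range(len(x)))
    results = []
    for test_idx in k_folds(all_idx, k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        inner = k_folds(train_idx, k, rng)

        def inner_mae(lam):
            # Inner loop: average validation MAE over the inner folds.
            errs = []
            for val_idx in inner:
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                model = fit_ridge_1d(subset(x, fit_idx), subset(y, fit_idx), lam)
                errs.append(mae(model, subset(x, val_idx), subset(y, val_idx)))
            return mean(errs)

        best_lam = min(lambdas, key=inner_mae)
        # Outer loop: refit on the whole training part, score on the held-out fold.
        model = fit_ridge_1d(subset(x, train_idx), subset(y, train_idx), best_lam)
        results.append((best_lam, mae(model, subset(x, test_idx), subset(y, test_idx))))
    return results

# Simulated toy data (hypothetical): a noisy linear relationship.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(100)]
y = [2.0 * xi + data_rng.gauss(0, 0.5) for xi in x]
results = nested_cv(x, y, lambdas=[0.01, 1.0, 100.0])
```

Because the regularization parameter is chosen using only the training part of each outer fold, the held-out error is not biased by the selection step. Repeating the call with different seeds would give a distribution of results, in the spirit of the 100 repetitions reported here.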

Results
We summarize the application results in terms of sparsity, goodness of fit and prediction accuracy (Figure 6). Although the differences are small, the models based on the cleaned datasets achieve better sparsity, with 355 and 344 cortical vertices selected on average when using the asymHIM and the HIM measures, respectively. Without cleaning the dataset, the  Figure 6). In terms of estimation of the chronological age, the model based on the full dataset has a higher correlation than the models based on the cleaned datasets, with average correlations of 0.78 for the full dataset and 0.75 for each of the two cleaned datasets. However, the model based on the full dataset also has a higher residual variance (12.65 on average) than the models based on the cleaned datasets (10.65 and 11.62 for the asymHIM and HIM measures, respectively). The lower estimated correlation displayed by the models based on the cleaned datasets might be related to the removed influential observations. Indeed, the two influence measures (asymHIM and HIM) are correlation based, and observations with high marginal correlation have a higher probability of being flagged as influential and removed from the dataset. Similarly, the high variance displayed by the model based on the full dataset can be related to the presence of the influential observations.

Conclusion
In this paper, we introduced the asymmetric influence measure (asymHIM) to identify influential observations in high dimensional regression. We derived its asymptotic distribution, which yields a statistical decision rule for identifying influential observations. The availability of an asymptotic distribution avoids the need for subjective methods such as visualization, or computationally demanding methods such as the bootstrap, to define appropriate thresholds for influential observations. The existence of a statistical threshold facilitates systematic application of the influence measure during the quality control and preprocessing steps, which will contribute to improved reproducibility. The simulations showed that the asymHIM measure outperforms the HIM measure of Zhao et al. (2013) in terms of the $\mathrm{TPR}_{\mathrm{inf}}$ to $\mathrm{FPR}_{\mathrm{inf}}$ ratio. We explored several sequences of asymmetric points with different ranges. We noticed that the asymHIM measure did not perform as well in the presence of extreme expectiles, which have a higher probability of being influential. The exploration of several sequences of expectiles demonstrated that the asymHIM measure with three points inside the interquartile range had the best performance, and that the use of asymmetric points (expectiles) outside the interquartile range gave results that were overly prone to influential observations. We used both the asymHIM and HIM measures to detect influential observations present among the controls in the ABIDE dataset and then to predict brain maturity.
Figure 5. Logarithm of the p-values assessing whether each observation is highly influential. Each control subject is shown along the horizontal axis. The observations flagged as influential by the asymHIM measure are denoted by triangles, and those identified by both methods, asymHIM and HIM, are denoted by squares. Inset box shows number of influential observations (NIO) identified by each method.
The prediction models derived from the cleaned datasets displayed more sparsity, lower mean absolute error between chronological and predicted age, and smaller residual variance than the prediction model derived from the full dataset. We noticed that the correlation decreased for the cleaned datasets relative to the model based on the full dataset; such patterns are often observed even in simple models, where elimination of an influential point may lead to an attenuated slope estimate. This in itself is not a bad thing, as we can argue that it prevents us from "overestimating" the association between cortical thickness and age.
In simulations, we noted poor performance of both the asymHIM and HIM measures when the response and the predictors are simultaneously contaminated. This poor performance can be explained by the presence of influential observations which violate the asymptotic assumption, and the fact that the simultaneous contamination amplifies the swamping effect multiplicatively. We are currently exploring different alternatives such as random group deletions to mitigate the dual phenomenon of the swamping and masking effects. We have promising preliminary results which will be presented in the near future.
Finally, it will be interesting to extend the work proposed in this article to other high dimensional complex models, such as generalized linear models or models containing complex dependence structures, to evaluate the benefits associated with identification of influential points in those situations.