Adaptive Handling of Dependence in High-Dimensional Regression Modeling

Abstract Dependence within a high-dimensional profile of explanatory variables affects the estimation and prediction performance of regression models. However, the strong belief that dependence should not be ignored, based on our well-proven knowledge of low-dimensional regression modeling, is not necessarily true in high dimension. To investigate this point, we introduce a new class of prediction scores defined as linear combinations of the same random vector, including the naive prediction score obtained when ignoring dependence and the Ordinary Least Squares (OLS) prediction score that, on the contrary, fully accounts for dependence by a preliminary whitening of the explanatory variables. Interestingly, this class also contains the Ridge and Partial Least Squares prediction scores, which both offer intermediate ways of dealing with dependence. Through a theoretical comparative study, it is first shown that the best handling of dependence depends on the interplay between the structure of conditional dependence across explanatory variables and the pattern of the association signal. We also derive the closed-form expression of the prediction score with the best prediction performance within the proposed class, leading to an adaptive handling of dependence. Finally, it is demonstrated through simulation studies and using benchmark datasets that this prediction score outperforms existing methods in various settings. Supplementary materials for this article are available online.


Introduction
Regression modeling in a prediction perspective is well studied and proven in standard designs where the number n of observations exceeds by far the number p of explanatory variables. This experience has created strong beliefs about good practice of regression rules that are yet not necessarily true in high-dimensional designs. The question of whether to ignore dependence when fitting prediction models perfectly illustrates this point. Indeed, let us consider the usual multivariate normal setting for the regression model where the response Y is real-valued, and let us assume that the explanatory variables $X = (X_1, \ldots, X_p)^\top$ are uncorrelated, with positive standard deviations $\sigma = (\sigma_1, \ldots, \sigma_p)^\top$, conditionally on the response. Under this conditional independence assumption, the unconditional variance-covariance matrix of X is $\Sigma_x = D_\sigma^2 + \sigma_{xy}\sigma_{xy}^\top/\sigma_y^2$, where $D_\sigma$ is the p × p diagonal matrix whose vector of diagonal entries is σ, $\sigma_{xy} = \mathrm{cov}(X, Y)$ and $\sigma_y^2 = \mathrm{var}(Y)$. The linear prediction score reaching the largest squared correlation with the response, or equivalently the lowest mean squared error of prediction (MSEP), is $L_{\mathrm{opt}}(X) = \mu_y + (X - \mu_x)^\top \Sigma_x^{-1}\sigma_{xy}$, with $\mu_x = E(X)$ and $\mu_y = E(Y)$, or equivalently, using an explicit inversion formula for $\Sigma_x$, $L_{\mathrm{opt}}(X) = \mu_y + \varphi_{xy}(X - \mu_x)^\top D_\sigma^{-2}\sigma_{xy}$, where $\varphi_{xy} = \sigma_y^2/(\sigma_y^2 + \sigma_{xy}^\top D_\sigma^{-2}\sigma_{xy})$. Therefore, if $\tilde{X} = D_\sigma^{-1}(X - \mu_x)$ stands for the vector of scaled explanatory variables and $\sigma_{\tilde{x}y} = D_\sigma^{-1}\sigma_{xy}$ for the p-vector of covariances between Y and $\tilde{X}$, then $L_{\mathrm{opt}}(X) = L_{\mathrm{opt}}(\tilde{X}) = \mu_y + \varphi_{\tilde{x}y}\tilde{X}^\top\sigma_{\tilde{x}y}$, where $\varphi_{\tilde{x}y} = \sigma_y^2/(\sigma_y^2 + \sigma_{\tilde{x}y}^\top\sigma_{\tilde{x}y})$. In situations where the explanatory variables are no longer independent given Y, with positive definite p × p conditional covariance matrix $\Sigma$ and correlation matrix $C = D_\sigma^{-1}\Sigma D_\sigma^{-1}$, there exist matrices W such that $C^{-1} = W^\top W$ or equivalently $WCW^\top = I_p$. Such matrices are called either decorrelation or whitening matrices, since the variance-covariance matrix of the whitened explanatory variables $Z = W\tilde{X}$ is the identity matrix $I_p$. The optimal linear prediction score is obtained by applying the regression rule designed to be optimal under independence to Z:
$$L_{\mathrm{opt}}(X) = \mu_y + \varphi_{zy} Z^\top \sigma_{zy}, \qquad (1)$$
where $\sigma_{zy} = W\sigma_{\tilde{x}y}$ is the p-vector of covariances between Y and Z and $\varphi_{zy} = \sigma_y^2/(\sigma_y^2 + \sigma_{zy}^\top\sigma_{zy})$. The same idea also holds for the standard two-group classification issue, where Y only takes values 0 or 1 with a 50-50 chance, under the assumption of a mixture of normal distributions for X with equal variance-covariance matrices for the two components of the mixture. Indeed, if conditional independence across explanatory variables is further assumed, the linear Bayes classifier, minimizing the probability of misclassification, is an affine transformation of the linear prediction score $L_{\mathrm{opt}}(X) = \tilde{X}^\top\delta_{\tilde{x}y}$, where $\tilde{X} = D_\sigma^{-1}(X - \mu_x)$ with $\mu_x = \{E(X \mid Y = 1) + E(X \mid Y = 0)\}/2$, and $\delta_{\tilde{x}y} = E(\tilde{X} \mid Y = 1) - E(\tilde{X} \mid Y = 0)$ stands for the vector of association parameters between the scaled explanatory variables and the response. Analogously with the situation of a real-valued response, if it is now assumed that the conditional variance-covariance matrix of $\tilde{X}$ is $C \neq I_p$, then the optimal linear score is, up to an affine transformation:
$$L_{\mathrm{opt}}(X) = Z^\top \delta_{zy}, \qquad (2)$$
with $Z = W\tilde{X}$ and $\delta_{zy} = W\delta_{\tilde{x}y}$.
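To make the whitening argument concrete, here is a minimal R sketch, assuming all population parameters are known and taking W as the inverse symmetric square root of C (one of the many valid whitening matrices); all names are illustrative.

```r
## Minimal sketch of optimal linear prediction by whitening, assuming all
## population parameters (sigma, C, sigma_xy, sigma_y2, mu_x, mu_y) are known.
set.seed(1)
p <- 5
sigma    <- runif(p, 0.5, 2)                 # conditional standard deviations
A        <- matrix(rnorm(p^2), p, p)
C        <- cov2cor(crossprod(A) + diag(p))  # conditional correlation matrix
sigma_xy <- rnorm(p)                         # cov(X, Y)
sigma_y2 <- 1                                # var(Y)
mu_x     <- rep(0, p); mu_y <- 0

## Whitening matrix W such that W C W' = I_p: inverse symmetric square root of C
e <- eigen(C, symmetric = TRUE)
W <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

sigma_xty <- sigma_xy / sigma                # cov(Y, scaled X)
sigma_zy  <- drop(W %*% sigma_xty)           # cov(Y, whitened Z)
phi_zy    <- sigma_y2 / (sigma_y2 + sum(sigma_zy^2))

## Optimal score (1): the independence rule applied to Z = W D_sigma^{-1}(X - mu_x)
L_opt <- function(x) {
  z <- drop(W %*% ((x - mu_x) / sigma))
  mu_y + phi_zy * sum(z * sigma_zy)
}
L_opt(rnorm(p))                              # score for one explanatory profile
```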
One general conclusion is that, when the explanatory variables are known to be mutually dependent conditionally on the response, optimal prediction requires taking this dependence explicitly into account, and the suitable way to do so is to apply prediction rules designed to be optimal under independence to whitened explanatory variables. OLS regression and Linear Discriminant Analysis (LDA), which can both be viewed as empirical counterparts of (1) and (2), respectively, are gold-standard methods for regression and classification. However, in high-dimensional settings, where p is nearly as large as or larger than n, plugging the sample estimate $\hat{C}$ of C into expressions (1) and (2) is either no longer possible or leads to poor prediction performance. Starting from the observation that the OLS and LDA prediction scores result from the optimization of a least-squares goodness-of-fit criterion, a very popular and numerically efficient approach to address these shortcomings is to modify the objective function of the fitting algorithm by adding a regularization term J(β) proportional to the magnitude of the regression coefficients:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}\left(Y_i - \bar{Y} - (X_i - \bar{x})^\top\beta\right)^2 + \kappa J(\beta),$$
where $\bar{x}$ and $\bar{Y}$ are, respectively, the sample means of X and Y and κ > 0 is a penalty parameter. Lasso (Tibshirani 1996), with $J(\beta) = \sum_{j=1}^{p}|\beta_j|$, Ridge (Hoerl and Kennard 1970), with $J(\beta) = \sum_{j=1}^{p}\beta_j^2$, and elastic net (Zou and Hastie 2005), with $J(\beta) = \alpha\sum_{j=1}^{p}|\beta_j| + (1-\alpha)\sum_{j=1}^{p}\beta_j^2$, $0 \leq \alpha \leq 1$, are the most widely used regularized estimation methods. It is especially obvious in the explicit expression of the Ridge estimate of β that the tuning parameter κ defines a compromise between two opposite dependence-handling strategies: completely whitening the explanatory variables with κ = 0 and, for a large κ, ignoring dependence. This point will be detailed later in Section 3.
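As an illustration, the three penalties can be fitted with the glmnet package (also used in the experiments below); the mixing parameter alpha selects Ridge (0), Lasso (1) or elastic net (in between), and the toy data are, of course, only for illustration.

```r
## Sketch: Lasso, Ridge and elastic net fits of the penalized criterion above.
library(glmnet)
set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), n, p)
Y <- drop(X[, 1:5] %*% rep(1, 5)) + rnorm(n)

fit_ridge <- glmnet(X, Y, alpha = 0)    # J(beta) = sum(beta_j^2)
fit_lasso <- glmnet(X, Y, alpha = 1)    # J(beta) = sum(|beta_j|)
fit_enet  <- glmnet(X, Y, alpha = 0.5)  # elastic net compromise

## The penalty parameter (kappa above, lambda in glmnet) is usually tuned by CV:
cv_ridge <- cv.glmnet(X, Y, alpha = 0)
cv_ridge$lambda.min
```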
In the general framework introduced by Witten and Tibshirani (2009), both for regression and classification issues, a two-stage regularized estimation procedure is proposed: a shrunken estimate of the matrix of partial correlations across explanatory variables conditionally on the response is first obtained by maximization of a penalized likelihood, and the resulting estimate is then plugged into a second likelihood-based criterion for a regularized estimation of the regression coefficients. Again, the first stage of this so-called Scout procedure offers an alternative way of handling dependence with respect to a complete decorrelation. Ridge, Lasso, and elastic net estimation are special cases of the Scout general estimation procedure (Witten and Tibshirani 2009). Many other variants of the former regularized estimation methods exist. In a regression context with a real-valued response, we just mention here an alternative shrinkage-based regression method introduced by Opgen-Rhein and Strimmer (2007) (see also Zuber and Strimmer 2011), where a James-Stein-type estimate (Schäfer and Strimmer 2005) is used both for the variance-covariance matrix of the explanatory variables and for the covariances between the explanatory variables and the response. A similar approach is introduced in the LDA framework by Ahdesmäki and Strimmer (2010).
Still within the framework of linear prediction rules, a rank-reduced estimation of the linear prediction score can also lead to excellent prediction performance. Partial Least Squares (PLS) regression (Wold et al. 1984) and Discriminant Analysis (Boulesteix 2004), which consist in a preliminary extraction of latent variables summarizing the explanatory variables in a low-dimensional linear kernel, are among the most widely used methods for high-dimensional regression and classification. Here also, the choice of the number of latent variables can be viewed as the search for a compromise between ignoring dependence, by extracting only one latent component, and fully decorrelating the explanatory variables, with as many latent components as possible.
Consistently with the normal framework introduced at the beginning of the present section, the list of high-dimensional regression and classification methods above focuses on leading procedures estimating a linear prediction score. However, the variety of approaches used to accommodate high dimension, including tree-based prediction and neural networks, has generated a much larger list of possible methods, with unclear recommendations about which should be preferred in which circumstances (see chapter 3 of Hastie, Tibshirani, and Friedman (2009) for a review of high-dimensional regression methods). Many comparative studies highlight the impact on prediction performance of the pattern of association between the response and the explanatory variables, namely its fraction of zeros, and of the amount of correlation across explanatory variables (see, e.g., Tibshirani 1996; Chong and Jun 2005). Despite some theoretical advances on those points, it is often recommended in practice to compare methods by cross-validated evaluation of their prediction performance (Krstajic et al. 2014). However, optimization of the hyperparameters involved in these methods, either shrinkage parameters or numbers of latent variables, can be challenging and can degrade the prediction performance.
Essentially for classification issues in high dimension, naive options consisting in ignoring dependence have been tried in some comparative studies (Dudoit, Fridlyand, and Speed 2002), with surprisingly good prediction performance. The so-called naive Bayes classifier is also compared with LDA on a theoretical basis in Bickel and Levina (2004), and it turns out to show very good prediction performance in large or high-dimensional settings. Starting from this observation, we introduce a similar naive linear prediction score for a real-valued response and demonstrate that both the OLS and the naive prediction scores are just different linear combinations of the same random p-vector $\xi(\hat{Z})$. The class $\mathcal{L} = \{h^\top\xi(\hat{Z}),\ h^\top h = 1\}$ of linear combinations of $\xi(\hat{Z})$ is introduced for a flexible handling of dependence through the adaptive choice of the p-vector h of linear coefficients. Interestingly, $\mathcal{L}$ turns out to define a convenient general framework for comparing a large scope of high-dimensional regression methods, including regularized and rank-reduced estimation procedures, and for deriving optimal prediction scores. This article is organized as follows. Section 2 focuses on a theoretical comparison of the prediction performance of the naive and the OLS linear prediction scores in the standard linear regression framework. Extending the results obtained by Bickel and Levina (2004) in a two-group classification context, sharp bounds for the relative efficiency of the naive prediction score are given under an arbitrary dependence structure, with explicit expressions for the association parameters corresponding to the lowest and largest efficiency. In Section 3, based on the former comparative study, the class $\mathcal{L}$ of prediction scores briefly introduced above is defined as a general framework extending the handling of dependence to more flexible approaches than just ignorance or a complete whitening of the explanatory variables. We also show that Ridge and PLS prediction scores form subclasses of $\mathcal{L}$. An explicit expression of the optimal predictor within $\mathcal{L}$ is derived in Section 4 and compared to existing methods through a simulation study and using two benchmark datasets in Section 5. Finally, a discussion ends this article.

Optimal Linear Prediction
The n ≪ p paradigm has raised fundamental questions about the extension of well-proven estimation methods in regular n > p designs, such as least-squares or, more generally, maximum-likelihood estimation. One crucial issue in those discussions is whether dependence across explanatory variables should be ignored or not. As in Bickel and Levina (2004) for two-group classification using LDA, let us first discuss this issue in the standard linear regression framework.
Hereafter, it is assumed that $(X^\top, Y)$ is normally distributed with mean $(\mu_x^\top, \mu_y)$ and a positive definite variance-covariance matrix, with $\mathrm{var}(X) = \Sigma_x$, $\mathrm{var}(Y) = \sigma_y^2$ and $\mathrm{cov}(X, Y) = \sigma_{xy}$. In the present multivariate normal set-up, the unconditional dependence of the explanatory variables is captured by the variance-covariance matrix $\Sigma_x$. A part of this dependence is due to the one-to-one association between each of the explanatory variables and the response, whereas a complementary part can be viewed as intrinsic dependence conditionally on the response. Indeed, if $\Sigma = \mathrm{var}(X \mid Y)$ stands for the conditional variance-covariance matrix of the explanatory variables, then $\Sigma_x = \Sigma + \sigma_{xy}\sigma_{xy}^\top/\sigma_y^2$. In order to properly disentangle the association parameters and the intrinsic dependence across explanatory variables, the parameters $(\Sigma, \sigma_{xy}, \sigma_y^2)$ will be preferred hereafter to the natural variance parameters $(\Sigma_x, \sigma_{xy}, \sigma_y^2)$ of the joint distribution.
The optimal linear prediction score can then be expressed using the p × p diagonal matrix $D_\sigma$ whose diagonal entries are the conditional standard deviations $\sigma = (\sigma_1, \ldots, \sigma_p)^\top$, the vector $\tilde{X} = D_\sigma^{-1}(X - \mu_x)$ of scaled explanatory variables and the p-vector $\sigma_{\tilde{x}y} = D_\sigma^{-1}\sigma_{xy}$ of covariances between $\tilde{X}$ and Y:
$$L_{\mathrm{opt}}(X) = \mu_y + \varphi_{\tilde{x}y}\tilde{X}^\top C^{-1}\sigma_{\tilde{x}y}, \qquad (3)$$
with $\varphi_{\tilde{x}y} = \sigma_y^2/(\sigma_y^2 + \sigma_{\tilde{x}y}^\top C^{-1}\sigma_{\tilde{x}y})$. Still within the set of linear predictors, it is straightforwardly proved that $L_{\mathrm{opt}}(X)$ also has the largest squared correlation $R^2_{\mathrm{opt}}$ with Y, where:
$$R^2_{\mathrm{opt}} = \frac{\sigma_{\tilde{x}y}^\top C^{-1}\sigma_{\tilde{x}y}}{\sigma_y^2 + \sigma_{\tilde{x}y}^\top C^{-1}\sigma_{\tilde{x}y}}.$$

Naive Linear Prediction
Similarly as in the LDA framework (Dudoit, Fridlyand, and Speed 2002; Bickel and Levina 2004), a naive linear prediction score is obtained by replacing C in expression (3) by $I_p$, which amounts to ignoring the conditional dependence between explanatory variables:
$$L_N(X) = \mu_y + \varphi_{\tilde{x}y}\tilde{X}^\top\sigma_{\tilde{x}y},\quad \text{with } \varphi_{\tilde{x}y} = \sigma_y^2/(\sigma_y^2 + \sigma_{\tilde{x}y}^\top\sigma_{\tilde{x}y}).$$
For a straightforward comparison of the naive and optimal linear predictors, the squared correlation $R^2_N$ between $L_N(X)$ and the response is expressed as a function of $R^2_{\mathrm{opt}}$:
$$R^2_N = f\left(g(\sigma_{\tilde{x}y}, C),\ R^2_{\mathrm{opt}}\right) = \frac{R^2_{\mathrm{opt}}}{R^2_{\mathrm{opt}} + g(\sigma_{\tilde{x}y}, C)\,(1 - R^2_{\mathrm{opt}})},$$
where $g(\sigma_{\tilde{x}y}, C) = \sigma_{\tilde{x}y}^\top C\sigma_{\tilde{x}y}\,\sigma_{\tilde{x}y}^\top C^{-1}\sigma_{\tilde{x}y}/(\sigma_{\tilde{x}y}^\top\sigma_{\tilde{x}y})^2$ and $f(g, R^2) = R^2/\{R^2 + g(1 - R^2)\}$ is decreasing in g. Analogously with Bickel and Levina (2004) for the comparison of the classification performance of the linear Bayes and naive Bayes classifiers, Kantorovitch inequality provides an upper bound for $g(\sigma_{\tilde{x}y}, C)$ over all possible $\sigma_{\tilde{x}y}$:
$$g(\sigma_{\tilde{x}y}, C) \leq \frac{(1 + \tau(C))^2}{4\tau(C)},$$
where $\tau(C) = \lambda_{\max}(C)/\lambda_{\min}(C)$ is the condition number of C. This quantity τ(C) can be interpreted as a measure of the amount of conditional dependence between explanatory variables: when it is close to 1, C is itself close to the identity matrix, whereas a large τ(C) means that there exists a linear combination of the scaled explanatory variables that concentrates a large part of the variability of those explanatory variables. Finally, a lower bound for the squared correlation between the naive prediction score and the response is deduced:
$$R^2_N \geq \frac{R^2_{\mathrm{opt}}}{R^2_{\mathrm{opt}} + \frac{(1 + \tau(C))^2}{4\tau(C)}\,(1 - R^2_{\mathrm{opt}})}. \qquad (4)$$
Figure 1 displays the lower bound for $R^2_N$ given in expression (4) as a function of $R^2_{\mathrm{opt}}$ for a range of values of τ(C) going from 1 to 100. It clearly shows that ignoring dependence can strongly degrade the prediction performance, all the more so as the explanatory variables are more dependent conditionally on the response. Even if this conclusion is generally true over all possible association patterns between X and Y, it is important to keep in mind that, whatever the dependence pattern in C, $\sigma_{\tilde{x}y}$ may be such that $R^2_N$ is actually close to $R^2_{\mathrm{opt}}$. Moreover, the lower bound given in (4) may not be reached for any vector $\sigma_{\tilde{x}y}$, as will be demonstrated in Theorem 1.
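The relation between $R^2_N$, $R^2_{\mathrm{opt}}$ and g written above can be checked numerically; the sketch below, with an arbitrary C and $\sigma_{\tilde{x}y}$, recomputes $R^2_N$ directly from the joint covariance structure and compares it with the f-based expression and the Kantorovitch lower bound (4).

```r
## Numerical check of R2_N = f(g, R2_opt) and of the Kantorovitch bound (4).
set.seed(2)
p <- 10
A <- matrix(rnorm(p^2), p, p)
C <- cov2cor(crossprod(A) + diag(p))   # arbitrary conditional correlation
s_xty <- rnorm(p)                      # arbitrary cov(Y, scaled X)
sy2   <- 1
quad  <- function(M, v) drop(crossprod(v, M %*% v))

R2_opt <- quad(solve(C), s_xty) / (sy2 + quad(solve(C), s_xty))
g      <- quad(C, s_xty) * quad(solve(C), s_xty) / sum(s_xty^2)^2
R2_N   <- R2_opt / (R2_opt + g * (1 - R2_opt))

## Direct computation of cor^2(L_N, Y): var(scaled X) = C + s_xty s_xty'/sy2
R2_N_direct <- sum(s_xty^2)^2 /
  (sy2 * quad(C + tcrossprod(s_xty) / sy2, s_xty))
tau   <- max(eigen(C)$values) / min(eigen(C)$values)
bound <- R2_opt / (R2_opt + (1 + tau)^2 / (4 * tau) * (1 - R2_opt))
c(R2_N = R2_N, direct = R2_N_direct, lower_bound = bound)
```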

Prediction Performance of Naive Linear Prediction
The eigendecomposition $C = UD_\lambda U^\top$ is now introduced, where $D_\lambda$ is the p × p diagonal matrix whose diagonal entries are the eigenvalues $\lambda = (\lambda_1, \ldots, \lambda_p)^\top$ of C and U is the p × p matrix of corresponding eigenvectors, with $UU^\top = I_p$. The notation $\gamma = U^\top\sigma_{\tilde{x}y}$ will be used hereafter for the covariance between the response and the vector of whitened explanatory variables $Z = U^\top\tilde{X}$, with $\mathrm{var}(Z) = U^\top CU = D_\lambda$. With these notations,
$$L_{\mathrm{opt}}(X) = \mu_y + \varphi_{\tilde{x}y}\,Z^\top D_\lambda^{-1}\gamma \qquad (5)$$
and
$$R^2_{\mathrm{opt}} = \frac{\gamma^\top D_\lambda^{-1}\gamma}{\sigma_y^2 + \gamma^\top D_\lambda^{-1}\gamma}. \qquad (6)$$
Sharp upper and lower bounds for $R^2_N$ over all possible vectors γ of association parameters are given in the following theorem, with explicit values of γ for which the limits are reached.

Theorem 1. Over all possible vectors γ of association parameters, $R^2_N$ admits sharp lower and upper limits, whose explicit expressions are derived in the supplementary materials. If the coordinates of γ are the square roots of the coordinates of $v(\lambda)$, where $v(\lambda)$ denotes the only unit-norm eigenvector of $\lambda\lambda^{-1\top} + \lambda^{-1}\lambda^\top$ associated with a positive eigenvalue and $\lambda^{-1} = (1/\lambda_1, \ldots, 1/\lambda_p)^\top$ is the coordinate-wise inverse of λ, then $R^2_N$ reaches its lower limit. On the contrary, for any vector γ with only one nonzero coordinate, $R^2_N$ reaches its upper limit, namely $R^2_N = R^2_{\mathrm{opt}}$.
Proof. See Section 1.1 in the supplementary materials.
It is deduced from Theorem 1 that the potential loss of prediction performance induced by ignoring dependence depends on the interplay between the patterns of association and dependence.
Up to now, the optimal and naive linear prediction scores have been introduced in a purely probabilistic framework, where all parameters of the joint distribution of (X , Y) are supposed to be known. In the following, we introduce the sample counterparts of L N (Z) and L opt (Z) and show that, in large or high-dimensional contexts, naive linear prediction can actually markedly outperform optimal linear prediction.

OLS and Naive Linear Prediction in High Dimension
It is now supposed that the training sample contains n ≥ 2 independent observations $(X_i^\top, Y_i)$ of $(X^\top, Y)$, where the sample size n can be smaller than p. Hereafter, $(\bar{X}, \bar{Y})$ will denote the sample estimates of $(\mu_x, \mu_y)$ and $(S_x, s_{xy}, s_y^2)$ the sample estimates of $(\Sigma_x, \sigma_{xy}, \sigma_y^2)$. Consistently, $S = S_x - s_{xy}s_{xy}^\top/s_y^2$ will stand for the estimate of $\Sigma$ and $\hat{C} = D_s^{-1}SD_s^{-1}$ for the estimate of C, where $D_s$ is the p × p diagonal matrix whose diagonal entries are the sample conditional standard deviations $s = (s_1, \ldots, s_p)^\top$ of the explanatory variables. The eigendecomposition of $\hat{C}$ is also introduced here: $\hat{C} = \hat{U}D_{\hat{\lambda}}\hat{U}^\top$, where $D_{\hat{\lambda}}$ is the p × p diagonal matrix whose diagonal entries are the eigenvalues $\hat{\lambda} = (\hat{\lambda}_1, \ldots, \hat{\lambda}_p)^\top$ of $\hat{C}$ and $\hat{U}$ is the p × p matrix of corresponding eigenvectors, with $\hat{U}\hat{U}^\top = I_p$. It is also assumed that the rank q of $\hat{C}$ can be smaller than p.
The sample counterpart of $L_{\mathrm{opt}}(Z)$ is the OLS linear prediction score $L_{\mathrm{OLS}}(\hat{Z})$ obtained by plugging the estimates of the parameters of the joint distribution of $(X^\top, Y)$ into expression (5) of $L_{\mathrm{opt}}(Z)$:
$$L_{\mathrm{OLS}}(\hat{Z}) = \bar{Y} + \hat{\varphi}\,\hat{Z}^\top D_{\hat{\lambda}}^{-}\hat{\gamma}, \qquad (7)$$
where $\hat{\gamma} = \hat{U}^\top D_s^{-1}s_{xy}$, $\hat{Z} = \hat{U}^\top D_s^{-1}(X - \bar{X})$ and $\hat{\varphi} = s_y^2/(s_y^2 + \hat{\gamma}^\top D_{\hat{\lambda}}^{-}\hat{\gamma})$. In expression (7), the notation $D_{\hat{\lambda}}^{-}$ denotes the Moore-Penrose generalized inverse of $D_{\hat{\lambda}}$, namely the p × p diagonal matrix whose first q diagonal entries are the inverses of the positive eigenvalues $\hat{\lambda}_j$ and whose remaining p − q diagonal entries are 0.
Analogously, the sample counterpart of $L_N(Z)$ has the following expression:
$$L_N(\hat{Z}) = \bar{Y} + \hat{\varphi}_N\,\hat{Z}^\top\hat{\gamma}, \qquad (8)$$
with $\hat{\varphi}_N = s_y^2/(s_y^2 + \hat{\gamma}^\top\hat{\gamma})$. The theoretical naive linear prediction score is not used in the remainder of this article; thus, from now on, $L_N$ refers to the empirical naive prediction score.
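The sample quantities and the two empirical scores can be put together in a few lines of R; the sketch below follows the reconstructed expressions (7)-(8), with fit_decomp a helper name introduced here for illustration, and the $\hat{\varphi}$ slope factors omitted since they do not affect squared correlations.

```r
## Sketch of gamma-hat, Z-hat and the empirical OLS and naive scores (7)-(8).
fit_decomp <- function(X, Y, tol = 1e-10) {
  n <- nrow(X)
  Xbar <- colMeans(X); Ybar <- mean(Y)
  Sx  <- cov(X) * (n - 1) / n                 # sample covariance of X
  sxy <- drop(cov(X, Y)) * (n - 1) / n        # sample covariance with Y
  sy2 <- var(Y) * (n - 1) / n
  S   <- Sx - tcrossprod(sxy) / sy2           # conditional covariance estimate
  Ds_inv <- diag(1 / sqrt(diag(S)))
  e   <- eigen(Ds_inv %*% S %*% Ds_inv, symmetric = TRUE)  # C-hat
  q   <- sum(e$values > tol)                  # rank q of C-hat
  gam <- drop(t(e$vectors) %*% Ds_inv %*% sxy)              # gamma-hat
  Zfun <- function(x) drop(t(e$vectors) %*% Ds_inv %*% (x - Xbar))
  list(Ybar = Ybar, lam = e$values, q = q, gam = gam, Zfun = Zfun)
}
L_OLS <- function(d, x)                       # Moore-Penrose: keep the q nonzero terms
  d$Ybar + sum((d$gam * d$Zfun(x) / d$lam)[seq_len(d$q)])
L_N   <- function(d, x) d$Ybar + sum(d$gam * d$Zfun(x))
```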
Since $(\bar{X}, \bar{Y}, \hat{\lambda}, \hat{\gamma}, s, s_y^2)$ are consistent estimators of $(\mu_x, \mu_y, \lambda, \gamma, \sigma, \sigma_y^2)$, the bounds given in Theorem 1 hold asymptotically for the comparison of $L_N(\hat{Z})$ and $L_{\mathrm{OLS}}(\hat{Z})$. As illustrated below, this is far from true in small-sample and high-dimensional situations.

Illustrative Comparative Study in High Dimension
The benchmark dataset described in Lu et al. (2004), freely available in the R package care (Zuber and Strimmer 2011) or from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1572, is used hereafter to illustrate the comparison of $L_N(\hat{Z})$ and $L_{\mathrm{OLS}}(\hat{Z})$. It contains gene expression profiles X of p = 403 genes for 30 human brain samples, used to predict the age Y of each patient. The list of 403 selected genes results from prescreening and preprocessing as described in Zuber and Strimmer (2011). A data-driven simulation study is conducted hereafter, in which the conditional variance-covariance matrix of the explanatory variables given the response is set to the estimate S obtained from this illustrative gene expression dataset.
We now consider a first pattern of association between Y and X in which the asymptotic prediction performance of $L_N(\hat{Z})$ is the lowest possible with respect to $L_{\mathrm{OLS}}(\hat{Z})$, according to Theorem 1. For that, a vector γ is obtained by taking the square roots of the coordinates of the only eigenvector $v(\hat{\lambda})$ of $\hat{\lambda}\hat{\lambda}^{-1\top} + \hat{\lambda}^{-1}\hat{\lambda}^\top$ associated with a positive eigenvalue, with $v(\hat{\lambda})^\top v(\hat{\lambda}) = 1$. Since $\hat{C}$ is not full rank, $\hat{\lambda}$ and $\hat{\lambda}^{-1}$ are restricted to the q = n − 2 = 28 nonzero eigenvalues of $\hat{C}$. The resulting vector γ is multiplied by a scalar, adequately chosen so that $R^2_{\mathrm{opt}} = 0.8$ (arbitrarily). The covariance vector $\hat{\sigma}_{xy} = D_s\hat{U}\gamma$ is deduced, where the q columns of $\hat{U}$ are the eigenvectors of $\hat{C}$ corresponding to nonzero eigenvalues. Finally, the estimate of $\Sigma_x$ is updated to be consistent with the fixed patterns of conditional variance-covariance of X and association between X and Y: $\hat{\Sigma}_x = S + \hat{\sigma}_{xy}\hat{\sigma}_{xy}^\top/s_y^2$. Using this set $(\hat{\Sigma}_x, \hat{\sigma}_{xy}, s_y^2)$ of variance parameters for the joint distribution of $(X^\top, Y)$, it is straightforwardly checked that the theoretical squared correlation between the naive predictor and Y is 0.2. Note that the lower bound for $R^2_N$ given by Kantorovitch inequality yields 0.09, smaller than the smallest reachable limit.
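Under the above reading of $v(\hat{\lambda})$, the construction of the least favorable vector γ can be sketched as follows; lam is assumed to hold the q nonzero eigenvalues of $\hat{C}$, and the scaling step relies on expression (6).

```r
## Sketch of the least favorable association pattern (first scenario):
## gamma = square roots of the eigenvector of lam lam^{-T} + lam^{-1} lam^T
## with positive eigenvalue, rescaled so that R2_opt takes a target value.
worst_gamma <- function(lam, sy2, R2_opt = 0.8) {
  M <- tcrossprod(lam, 1 / lam) + tcrossprod(1 / lam, lam)
  e <- eigen(M, symmetric = TRUE)
  v <- e$vectors[, which.max(e$values)]   # the only positive eigenvalue
  v <- v * sign(sum(v))                   # orient so coordinates are >= 0
  gam <- sqrt(v)
  ## scale so that sum(gam^2 / lam) = sy2 * R2_opt / (1 - R2_opt), see (6)
  gam * sqrt(sy2 * R2_opt / (1 - R2_opt) / sum(gam^2 / lam))
}
## For the second pattern below, gamma instead has a single nonzero coordinate.
```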
A second pattern of association between Y and X is obtained by starting from a vector γ with only one nonzero coordinate. The same sequence of operations as described above for the first pattern of association leads here to a set $(\hat{\Sigma}_x, \hat{\sigma}_{xy}, s_y^2)$ of variance parameters for the joint distribution of $(X^\top, Y)$, for which both $L_N(\hat{Z})$ and $L_{\mathrm{OLS}}(\hat{Z})$ have an asymptotic squared correlation with Y equal to 0.8.
The above two sets of variance parameters are now used to simulate 1000 training datasets of dimension n × (p + 1), whose rows are independent realizations of $(X^\top, Y)$, with expectation $(\mu_x^\top, \mu_y) = 0$, in large-sample (n = 1000) and small-sample (n = 30, as in the original dataset) conditions. For each training dataset, a test dataset of 1000 individuals is also simulated from the same joint distribution of $(X^\top, Y)$. Using each training dataset, the response values in the corresponding test dataset are predicted using $L_N(\hat{Z})$ and $L_{\mathrm{OLS}}(\hat{Z})$. Moreover, Ridge regression (Hoerl and Kennard 1970) and PLS regression (Wold, Martens, and Wold 1983; Wold et al. 1984) are implemented using the R packages glmnet (Friedman, Hastie, and Tibshirani 2010) and pls (Mevik, Wehrens, and Liland 2020), respectively. For each method, the hyperparameter (the penalty coefficient for Ridge regression and the number of components for PLS regression) is chosen by minimizing the MSEP estimated using a 10-fold cross-validation procedure. Table 1 reports the mean squared correlation between the predicted and observed values of the response in the test dataset over 1000 simulations for each of the four scenarios (two patterns of association parameters, two values for n).
As expected, in both scenarios of association between the response and the explanatory variables, in large-sample conditions (n = 1000) the prediction performance of $L_N(\hat{Z})$ and $L_{\mathrm{OLS}}(\hat{Z})$ is close to the values fixed by the simulation parameters, namely $(R^2_N = 0.2, R^2_{\mathrm{opt}} = 0.8)$ in the first scenario and $(R^2_N = 0.8, R^2_{\mathrm{opt}} = 0.8)$ in the second scenario. Still in these large-sample conditions, the prediction performance of Ridge and PLS regression is close to the best performance, which can be explained by the fact that the present asymptotic conditions favor an accurate choice of the optimal hyperparameter, either a small regularization parameter in Ridge regression or a large number of PLS components, for which the corresponding predictors are close to $L_{\mathrm{OLS}}(\hat{Z})$.
In small-sample conditions (n = 30), the prediction performance of the four methods in the first scenario of association parameters is dramatically degraded, especially for $L_N(\hat{Z})$, $L_{\mathrm{OLS}}(\hat{Z})$ and PLS regression. Interestingly, still in small-sample conditions but now in the second scenario of association parameters, the prediction performance of $L_N(\hat{Z})$, Ridge and PLS regression is as good as in the large-sample conditions, whereas the prediction performance of $L_{\mathrm{OLS}}(\hat{Z})$ is markedly lower.
Clearly, for the same conditional dependence pattern across explanatory variables but in two different scenarios of association between X and Y, comparing $L_N(\hat{Z})$ to $L_{\mathrm{OLS}}(\hat{Z})$ leads to very different conclusions, the second scenario being far more favorable to ignoring dependence. To complete the above study, the four prediction methods compared in Table 1 are now implemented to predict the age of a patient from the original illustrative dataset (Lu et al. 2004). Figure 2 displays the distributions of the squared correlations between predicted and observed values over 50 random splittings of the dataset in a 10-fold cross-validation set-up. In the present situation, $L_N(\hat{Z})$ turns out to show better prediction performance than $L_{\mathrm{OLS}}(\hat{Z})$, Ridge and PLS regression.
In the following, we introduce a new class of linear prediction scores, including the four regression methods compared in Table 1, and deduce an optimal choice within this class for a flexible handling of dependence.

A New Class of Prediction Scores
In the following, we will focus on prediction performance in terms of squared correlation between the prediction score and the response. Therefore, from now on, two predictors $L_1(X)$ and $L_2(X)$ will be said to be equivalent, denoted $L_1 \equiv L_2$, if there exist scalars a and b, with b ≠ 0, such that, for all X, $L_2(X) = a + bL_1(X)$. For example, it is deduced from (7) and (8) that $L_{\mathrm{OLS}}(\hat{Z}) \equiv h_{\mathrm{OLS}}^\top\xi(\hat{Z})$ and $L_N(\hat{Z}) \equiv h_N^\top\xi(\hat{Z})$, where $\xi(\hat{Z}) = (\hat{\gamma}_1\hat{Z}_1, \ldots, \hat{\gamma}_p\hat{Z}_p)^\top$, $h_{\mathrm{OLS}} \propto (1/\hat{\lambda}_1, \ldots, 1/\hat{\lambda}_q, 0, \ldots, 0)^\top$ and $h_N \propto (1, \ldots, 1)^\top$.
Therefore, ignoring dependence in linear prediction or, on the contrary, fully whitening the explanatory variables can both be obtained by an ad hoc choice of linear coefficients for $\xi(\hat{Z})$. In order to enlarge the scope of dependence-handling solutions, the class $\mathcal{L}$ of prediction scores defined as linear combinations of the elements of $\xi(\hat{Z})$ is now introduced:
$$\mathcal{L} = \left\{h^\top\xi(\hat{Z}),\ h^\top h = 1\right\}.$$
The arbitrary restriction $h^\top h = 1$ just aims at reducing equivalence subclasses of scores to a unique element in $\mathcal{L}$. For any p-vector h of linear coefficients with $h^\top h = 1$ and for a p-vector $X_0$ of explanatory variables, the prediction of the corresponding response value is
$$L_h(\hat{Z}_0) = \bar{Y} + \hat{\varphi}_h\,h^\top\xi(\hat{Z}_0),$$
where $\hat{Z}_0 = \hat{U}^\top D_s^{-1}(X_0 - \bar{X})$ and $\hat{\varphi}_h$ is the OLS estimate of the slope of the regression line of the response variable on $h^\top\xi(\hat{Z})$.
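A generic score in $\mathcal{L}$ can be sketched by reusing the fit_decomp helper from the sketch of Section 2; the $\xi(\hat{Z})$ used here is the reconstruction given above, and $\hat{\varphi}_h$ is fitted as the OLS slope of the response on $h^\top\xi(\hat{Z})$ over the training sample.

```r
## Sketch of L_h for an arbitrary unit-norm vector h of linear coefficients.
L_h <- function(X, Y, h) {
  d  <- fit_decomp(X, Y)                 # helper from the earlier sketch
  h  <- h / sqrt(sum(h^2))               # enforce h'h = 1
  xi <- function(x) d$gam * d$Zfun(x)    # xi(Z-hat): coordinates gam_j * Z_j
  s  <- apply(X, 1, function(x) sum(h * xi(x)))
  phi_hat <- cov(s, Y) / var(s)          # OLS slope of Y on h'xi(Z-hat)
  function(x0) d$Ybar + phi_hat * sum(h * xi(x0))
}
## h = rep(1, p)/sqrt(p) recovers the naive score, while h proportional to
## (1/lam_1, ..., 1/lam_q, 0, ..., 0) recovers the OLS score.
```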
Note that $L_h(\hat{Z}_0)$ is also a linear combination of the coordinates of $X_0$; therefore, $\mathcal{L}$ is included in the general class of linear prediction scores. However, $\mathcal{L}$ does not contain the linear prediction scores deduced from Principal Component Regression (PCR, see Jolliffe 2011). Indeed, the PCR prediction score for the explanatory profile $X_0$, when applied with explanatory variables scaled using the conditional standard deviations s, is a linear combination of $\hat{V}_j^\top D_s^{-1}(X_0 - \bar{X})$, j = 1, …, min(n − 1, p), where $\hat{V}_j$ is the eigenvector of $D_s^{-1}S_xD_s^{-1}$ associated with the jth nonzero eigenvalue. These eigenvectors cannot be deduced from linear combinations of the eigenvectors $\hat{U}_j$ of $D_s^{-1}SD_s^{-1}$. In the following, we show how Ridge (Hoerl and Kennard 1970) and Partial Least Squares (Wold et al. 1984) prediction scores form subsets of $\mathcal{L}$ that also contain $L_{\mathrm{OLS}}(\hat{Z})$ and $L_N(\hat{Z})$. Both methods can be viewed as resulting from a least-squares optimization under restrictions on the vector of regression coefficients.

L contains Ridge Prediction Scores
Provided the explanatory variables have been scaled using their conditional standard deviations s, with $\tilde{X}_i = D_s^{-1}(X_i - \bar{X})$, Ridge regression consists in the minimization over β of the penalized least-squares criterion
$$\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar{Y} - \tilde{X}_i^\top\beta\right)^2 + \kappa\,\beta^\top\beta,$$
where κ is a nonnegative penalty parameter, which leads to the following estimate of the unscaled regression coefficients β:
$$\hat{\beta}_{\mathrm{Ridge}}(\kappa) = D_s^{-1}\left(D_s^{-1}S_xD_s^{-1} + \kappa I_p\right)^{-1}D_s^{-1}s_{xy}.$$
A closed-form expression of the corresponding Ridge prediction score is deduced:
$$L_{\mathrm{Ridge}}(X; \kappa) = \bar{Y} + (X - \bar{X})^\top\hat{\beta}_{\mathrm{Ridge}}(\kappa).$$
Hereafter, $\mathcal{L}_{\mathrm{Ridge}}$ denotes the class of Ridge prediction scores $L_{\mathrm{Ridge}}(X; \kappa)$.
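Under the reconstruction of the Ridge estimate above (with sample covariances normalized by n), the closed-form score can be sketched as follows; the 1/n factor in the criterion is a convention that only reparametrizes κ.

```r
## Sketch of the closed-form Ridge prediction score L_Ridge(X; kappa).
ridge_score <- function(X, Y, kappa) {
  n <- nrow(X); p <- ncol(X)
  Xbar <- colMeans(X); Ybar <- mean(Y)
  Sx  <- cov(X) * (n - 1) / n
  sxy <- drop(cov(X, Y)) * (n - 1) / n
  sy2 <- var(Y) * (n - 1) / n
  s   <- sqrt(diag(Sx - tcrossprod(sxy) / sy2))  # conditional sd estimates
  Cx  <- Sx / tcrossprod(s)                      # D_s^{-1} S_x D_s^{-1}
  beta <- solve(Cx + kappa * diag(p), sxy / s) / s  # unscaled coefficients
  function(x0) Ybar + sum((x0 - Xbar) * beta)
}
```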
Theorem 2. $\mathcal{L}_{\mathrm{Ridge}}$ is a subclass of $\mathcal{L}$. Moreover, if $h_\kappa$ denotes the vector of linear coefficients, with $h_\kappa^\top h_\kappa = 1$, such that the Ridge prediction score obtained with penalty parameter κ is equivalent to $h_\kappa^\top\xi(\hat{Z})$, then $h_\kappa \to h_{\mathrm{OLS}}$ as κ → 0 and $h_\kappa \to h_N$ as κ → ∞. As a consequence, $L_N(\hat{Z})$ (resp. $L_{\mathrm{OLS}}(\hat{Z})$) can be approximated by a Ridge prediction score provided κ is chosen sufficiently large (resp. small).
Proof. See Section 1.2 in the supplementary material.
Optimization of the penalty parameter κ in Ridge regression usually aims at minimizing the MSEP, which generally leads to a compromise between the OLS prediction score, for values of κ close or equal to zero, and the constant predictor with all regression coefficients set to zero, for large values of κ. In the present context, where the goal is to maximize the squared correlation between the prediction score and the response, the normalization restriction on h implies that the limiting Ridge prediction score for large κ is the naive predictor.

L contains Partial Least Squares (PLS) Prediction Scores
Let us now introduce PLS regression (Wold et al. 1984) using the Krylov-space representation of Helland (1988) (see also Blazère, Gamboa, and Loubes 2014). As above for Ridge regression, it is supposed that the explanatory variables are scaled using their conditional standard deviations s. Then, as recalled by Lingjaerde and Christophersen (2000), the PLS estimate $\hat{\beta}_{\mathrm{PLS},m}$ of the regression coefficients β, with m ≥ 1 PLS components, is obtained by minimizing the least-squares criterion
$$\sum_{i=1}^{n}\left(Y_i - \bar{Y} - \tilde{X}_i^\top\beta\right)^2 \quad \text{subject to } \beta \in \mathcal{K}_m,$$
where $\mathcal{K}_m$ is the Krylov space spanned by $\{\hat{c}_{xy}, \hat{C}_x\hat{c}_{xy}, \ldots, \hat{C}_x^{m-1}\hat{c}_{xy}\}$, with $\hat{C}_x = D_s^{-1}S_xD_s^{-1}$ and $\hat{c}_{xy} = D_s^{-1}s_{xy}$. The corresponding PLS prediction score is deduced:
$$L_{\mathrm{PLS}}(X; m) = \bar{Y} + (X - \bar{X})^\top D_s^{-1}\hat{\beta}_{\mathrm{PLS},m}.$$
Let $\mathcal{L}_{\mathrm{PLS}}$ denote the class of PLS prediction scores, indexed by the number m of PLS components.

Theorem 3. $\mathcal{L}_{\mathrm{PLS}}$ is a subclass of $\mathcal{L}$.

Proof. See Section 1.3 in the supplementary materials.
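The Krylov-space characterization suggests a direct, if numerically naive, sketch: build the Krylov basis, then fit Y by least squares constrained to β = Ka. In practice the pls package uses stabler algorithms (e.g., NIPALS), so the sketch below is for intuition only.

```r
## Sketch of beta_PLS,m via the Krylov representation, on scaled variables Xs
## (columns centered and scaled) and centered response Yc.
pls_krylov <- function(Xs, Yc, m) {
  n   <- nrow(Xs)
  Cx  <- crossprod(Xs) / n                 # covariance of scaled X
  cxy <- drop(crossprod(Xs, Yc)) / n       # covariance with centered Y
  K   <- matrix(NA_real_, ncol(Xs), m)
  K[, 1] <- cxy                            # Krylov basis: cxy, Cx cxy, ...
  if (m > 1) for (j in 2:m) K[, j] <- drop(Cx %*% K[, j - 1])
  a <- qr.solve(Xs %*% K, Yc)              # least squares with beta = K a
  drop(K %*% a)                            # beta_PLS,m
}
```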

Illustration
In the two scenarios of association signals introduced in Section 2.5, where the conditional dependence structure across explanatory variables is deduced from the conditional variance-covariance matrix estimated using Lu et al.'s (2004) gene expression data, with n = 30 and p = 403, the naive linear prediction score shows very different prediction performance. Indeed, in the first scenario, the prediction performance of $L_N(\hat{Z})$ is poor, even asymptotically, whereas in the second scenario, it remains as good for a small sample size as expected for a large sample size. The Ridge and PLS prediction scores markedly outperform both $L_{\mathrm{OLS}}(\hat{Z})$ and $L_N(\hat{Z})$ in the first scenario. Figure 3 compares the corresponding vectors of linear coefficients: although the linear coefficients of the Ridge score only depend on the eigenvalues $\hat{\lambda}$, whereas the linear coefficients of $L_{\mathrm{PLS}}(\hat{Z})$ also depend on the association parameter $\hat{\gamma}$, the two methods lead to a similar handling of dependence.

Closed-Form Expression of the Best Predictor
We now propose to search for the optimal predictor within $\mathcal{L}$, that is, the vector $h_{\mathrm{opt}}$ of linear coefficients maximizing the squared correlation between $L_h(\hat{Z})$ and Y. Assuming the moments of $\xi(\hat{Z})$ are known and provided $\mathrm{var}(\xi(\hat{Z}))$ is nonsingular, the scaled optimal vector of linear coefficients has the following expression:
$$h_{\mathrm{opt}} = \frac{\mathrm{var}(\xi(\hat{Z}))^{-1}\mathrm{cov}(\xi(\hat{Z}), Y)}{\left\|\mathrm{var}(\xi(\hat{Z}))^{-1}\mathrm{cov}(\xi(\hat{Z}), Y)\right\|_2}, \qquad (13)$$
where $\|h\|_2^2 = \sum_{j=1}^{p}h_j^2$. The following theorem gives the nonasymptotic moments of $\xi(\hat{Z})$.
Theorem 4. Let S be a random p × p matrix distributed as $W_p(\Sigma; n-2)/n$. Let $D_s$ denote the p × p diagonal matrix whose diagonal entries are the square roots of the diagonal entries of S. Let U denote the p × p matrix of eigenvectors of $D_s^{-1}SD_s^{-1}$. Under the normality assumption introduced in Section 2 for the joint distribution of the explanatory variables and the response, the expectation and variance of $\xi(\hat{Z})$ and its covariance with the response Y admit closed-form expressions in terms of $(\Sigma, \sigma_{xy}, \sigma_y^2)$, given in Section 1.4 of the supplementary material together with the proof.
The prediction performance of $L_{h_{\mathrm{opt}}}(\hat{Z})$, measured by its squared correlation $R^2_{h_{\mathrm{opt}}}$ with the response, can straightforwardly be deduced from Theorem 4. It turns out to depend both on the vector $\sigma_{xy}$ of one-to-one association parameters between the explanatory variables and the response and on the conditional dependence across explanatory variables in a nontrivial manner, through expectations of quadratic forms in the coordinates of $U^\top D_s^{-1}\sigma_{xy}$. The expectation of the vector $U^\top D_s^{-1}\sigma_{xy}$ approximates the vector γ of association parameters between the whitened explanatory variables and the response.
To illustrate how optimal prediction within $\mathcal{L}$ adapts to the specific combination of γ and C, let us consider the two situations introduced in Section 2, where the conditional variance-covariance matrix of the explanatory variables is the same as estimated using data from Lu et al. (2004) and two association signals are considered: in scenario 1, $L_N(\hat{Z})$ is clearly outperformed by $L_{\mathrm{OLS}}(\hat{Z})$, whereas in scenario 2, $L_N(\hat{Z})$ outperforms $L_{\mathrm{OLS}}(\hat{Z})$ in small-sample conditions (see Table 1).
For both scenarios, Theorem 4 is used to derive the squared correlation between the response and the optimal predictor $L_{h_{\mathrm{opt}}}(\hat{Z})$, the moments of $\xi(\hat{Z})$ being approximated using 1000 Monte-Carlo samples. A sequence of values for the sample size n going from n = 50 (n ≪ p) to n = 1000 (n ≫ p) is considered. Figure 4 shows how the prediction performance varies with n in both scenarios for the optimal, OLS and naive predictors. The prediction performance of the optimal predictor is equivalent to the best of the other two, either the OLS predictor in scenario 1 or the naive predictor in scenario 2, which demonstrates its ability to adapt to combinations of a conditional dependence structure and a pattern of association between the response and the predictors.

Estimation of the Optimal Prediction Score within L
The expressions of $\mathrm{var}(\xi(\hat{Z}))$ and $\mathrm{cov}(\xi(\hat{Z}), Y)$ given in Theorem 4 can now be used for the moment estimation of $h_{\mathrm{opt}}$, by plugging in the sample estimates of $\Sigma$, $\sigma_{xy}$ and $\sigma_y^2$. Due to its complexity, the expression of $\mathrm{var}(\xi(\hat{Z}))$ does not give much insight into the conditions under which it is nonsingular, which is required for the calculation of $h_{\mathrm{opt}}$ (see Equation (13)). Hereafter, in cases where the moment estimate of $\mathrm{var}(\xi(\hat{Z}))$ turns out to be singular, its inverse is replaced by its Moore-Penrose generalized inverse in expression (13) of $h_{\mathrm{opt}}$.
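Given moment estimates V of $\mathrm{var}(\xi(\hat{Z}))$ and c of $\mathrm{cov}(\xi(\hat{Z}), Y)$ (from Theorem 4 or by Monte-Carlo approximation), the estimation step of (13) with the Moore-Penrose fallback just described can be sketched as follows; the retained number of positive eigenvalues plays the role of the hyperparameter discussed in the next paragraph.

```r
## Sketch of the moment estimation of h_opt in (13), with a Moore-Penrose
## inverse of V when it is singular; n_pos is the retained number of
## positive eigenvalues (tuned by cross-validation in the comparisons below).
h_opt_from_moments <- function(V, cxy, n_pos = NULL) {
  e <- eigen(V, symmetric = TRUE)
  if (is.null(n_pos)) n_pos <- sum(e$values > 1e-10 * max(e$values))
  Upos <- e$vectors[, seq_len(n_pos), drop = FALSE]
  Vinv <- Upos %*% (t(Upos) / e$values[seq_len(n_pos)])  # pseudo-inverse of V
  h <- drop(Vinv %*% cxy)
  h / sqrt(sum(h^2))                       # unit-norm constraint
}
```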
Since the computation of $h_{\mathrm{opt}}$ involves potentially time-consuming Monte-Carlo experiments, we propose a large-sample approximation based on the convergence in distribution of the unit-length eigenvectors of a sample correlation matrix of normal profiles to the unit-length eigenvectors of the corresponding population correlation matrix (Kollo and Neudecker 1993). This yields closed-form approximations of the moments of $\xi(\hat{Z})$ and, consistently, an alternative estimator $\hat{h}_{\mathrm{opt}}$ of the vector of optimal linear coefficients, whose expression involves the p × p diagonal matrix $D_{\hat{\lambda}\hat{\gamma}^2}$ whose diagonal entries are the coordinates of the vector $\hat{\lambda}\hat{\gamma}^2$. The resulting expression of $\hat{h}_{\mathrm{opt}}$ is divided by its $\ell_2$-norm so that the standardized version fulfills the unit-norm constraint.
However, in practice, the use of a Moore-Penrose generalized inverse involves the choice of a threshold under which the eigenvalues are set to zero. Since disentangling zero and nonzero eigenvalues is not always obvious, we introduce the number of positive eigenvalues as a hyperparameter of our method. In the comparative studies conducted in the next section, this hyperparameter is tuned by cross-validation. In the following, the prediction score obtained using this procedure with linear coefficients $\hat{h}_{\mathrm{opt}}$ is denoted $L_{\hat{h}_{\mathrm{opt}}}$.
In a toy simulation setup introduced by Witten and Tibshirani (2009) to demonstrate the estimation accuracy of linear model parameters by the Scout method, the prediction performance of five prediction scores within $\mathcal{L}$ (OLS, Naive, PLS, Ridge and the proposed adaptive method) and two out of $\mathcal{L}$ (Scout(1,1) and PCR) are compared in Section 2 of the supplementary material. Two scenarios with the same dependence across explanatory variables but different patterns of association between the response and the explanatory variables are considered. This study shows that, even using the asymptotic approximation introduced above for the estimation of $h_{\mathrm{opt}}$ in the present nonasymptotic context (n = 20), the proposed adaptive method shows the best prediction performance in both scenarios, as do Ridge and Scout. On the contrary, PLS and PCR have markedly lower prediction performance in the two scenarios, especially in scenario 2. Surprisingly, the Naive prediction score has the best prediction performance in scenario 1 (and almost the worst in scenario 2). The OLS prediction score performs badly in both scenarios, as also reported by Witten and Tibshirani (2009).

Simulation Study
A high-dimensional data-driven simulation study is now conducted to compare the same prediction methods as introduced in the toy simulation study reported in Section 2 of the supplementary material. The $\ell_1$-penalized estimation of the linear regression model (Lasso regression, see Tibshirani 1996) is added to the list of prediction methods. Finally, the SLM method proposed by Opgen-Rhein and Strimmer (2007), consisting in a doubly shrunken estimation of the linear regression parameters, is also considered. In this method, James-Stein estimates of the correlation matrix of the explanatory variables and of the vector of correlations between the response and the explanatory variables are plugged into the expression of the OLS estimator of the linear regression parameters.
When needed, a 10-fold cross-validation procedure is used to optimize hyperparameters (Lasso, Ridge, PLS, PCR, Scout and the proposed method). The PLS and PCR methods are implemented using the R package pls (Mevik, Wehrens, and Liland 2020), the SLM method using the R package care (Zuber and Strimmer 2017), the Ridge and Lasso methods using the R package glmnet (Friedman, Hastie, and Tibshirani 2010) and the Scout method using the R package scout (Witten and Tibshirani 2015), with an $\ell_1$-regularization of the vector of regression coefficients and an $\ell_2$-regularization of the partial correlation matrix (default option in the R package scout). The proposed adaptive method is implemented using the R package AdaptiveRegression (see https://github.com/fhebert).
First, N = 100,000 normal p-profiles of explanatory variables, with p = 256, are randomly generated with the same mean and variance-covariance matrix $\Sigma_x$ as estimated using 123 near-infrared spectra in the Wine dataset, available in the R package cggd (Zhang and Melnik 2012). Figure 5, displaying an image plot of the correlation matrix derived from $\Sigma_x$, shows a very strong dependence, with an increasing lag-1 auto-correlation along the wavenumbers and a clear two-block structure, with large positive within-block correlations and large negative between-block correlations.
In order to extend the comparison study to a nonnormal framework, new p-profiles whose entries are mutually dependent, with identical nonnormal marginal distributions, are obtained as follows (see the sketch after the next paragraph): if $x_{ij}$ stands for the (i, j) entry of the N × p matrix of normal data, then for all j = 1, …, p and i = 1, …, N,
• $x_{ij}$ is first replaced by $\tilde{r}_{ij} = r_{ij}/N$, where $r_{ij}$ is the rank of $x_{ij}$ in the jth column, so that the transformed values $\tilde{r}_{ij}$ in each column are uniformly distributed in [0; 1];
• the resulting values $\tilde{r}_{ij}$ are then transformed using $F^{-1}$, where F is the cumulative distribution function of a nonnormal marginal distribution.
Finally, each column is scaled to have the same mean and standard deviation as the corresponding column of the N × p matrix of normal data. Four marginal distributions are considered in the present simulation study: the normal distribution (untransformed), the Student distribution $T_5$ with five degrees of freedom, the Chi-square distribution $\chi^2_5$ with five degrees of freedom and the Fisher distribution $F_{10,10}$ with 10 degrees of freedom for both the numerator and the denominator.
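A sketch of this rank-based transformation is given below; the clamping of the largest rank away from 1 is an implementation detail added here to avoid infinite quantiles, and qfun stands for the quantile function $F^{-1}$.

```r
## Sketch of the transformation to nonnormal marginals with preserved ranks.
to_marginal <- function(Xnorm, qfun) {
  N <- nrow(Xnorm)
  apply(Xnorm, 2, function(x) {
    r <- pmin(rank(x) / N, 1 - 1 / (2 * N))  # ranks mapped into (0, 1)
    y <- qfun(r)                             # apply F^{-1}
    (y - mean(y)) / sd(y) * sd(x) + mean(x)  # match mean and sd of the column
  })
}
## e.g., Student T5 marginals: X_t5 <- to_marginal(X, function(u) qt(u, df = 5))
```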
For each p-vector of explanatory variables, the corresponding value of the response variable Y is generated by the standard linear regression model:
$$Y = (X - \mu_x)^\top\beta + \varepsilon,$$
where, for each simulation scenario, ε is distributed according to the same distribution as the marginal distribution of the explanatory variables, scaled so that ε has mean zero and standard deviation σ. The signal-to-noise ratio $\beta^\top\Sigma_x\beta/\sigma^2$ is fixed to 100 and the values of β are discussed below.
For each of the four possible choices of a marginal distribution of the explanatory variables and conditional distribution of the response, four scenarios are also considered for the association between the explanatory variables and the response, with a variety of sparsity rates: a moderately sparse vector of regression coefficients (scenario 1), a sparse vector (scenario 2), a very sparse vector (scenario 3) and a dense vector (scenario 4). In the following, β* denotes the vector such that the jth coordinate of β is obtained by dividing the jth coordinate of β* by the standard deviation of the jth explanatory variable. Training and test samples are obtained by sampling 100 and 10,000 individuals, respectively, from the simulated population of 100,000 profiles.
Tables 2-5 in Section 3 of the supplementary material give the average MSEP of the nine prediction methods over 1000 generations of training and test datasets in each of the sixteen scenarios (four distributions and four vectors of regression parameters).
Depending on the simulation scenario, the proposed adaptive prediction method either shows the best prediction performance among the nine methods or is close to the best performance. In scenario 1, with a moderately sparse vector of regression coefficients, the adaptive regression method outperforms all the other methods, whatever the marginal distribution of the explanatory variables and the conditional distribution of the response. In this scenario, the rank-reduced methods (PLS and PCR) have markedly better prediction performance than the penalized estimation procedures (Ridge and Lasso), and even more clearly better than the doubly-penalized estimation procedures (SLM and Scout). The prediction performances are even more contrasted in scenarios with heavy-tailed marginal distributions (Student and Fisher).
In scenarios 2 and 4, with a sparse and a dense vector of regression coefficients, respectively, the PLS, PCR and adaptive regression methods show the best prediction performance, with very close values of MSEP. In scenario 2, the doubly penalized estimation methods clearly outperform the Lasso regression method, whereas the benefit of a double penalization is less obvious in scenario 4.
In scenario 3, with a very sparse vector of regression coefficients, the Lasso regression method outperforms all other methods. The adaptive regression and Scout methods show very similar prediction performance, much better than the rank-reduced regression methods.
In all scenarios, the OLS and naive prediction methods are very markedly outperformed, which demonstrates that, although they can be viewed as opposite extremes in the scope of dependence-handling strategies within $\mathcal{L}$, alternative prediction scores such as Ridge, PLS and the proposed adaptive method can perform much better.
Over the variety of patterns for the vector of regression coefficients and distribution for the explanatory variables and the response, the adaptive regression method turns out to show both the most stable and among the best prediction performance, whereas all other methods are, at least in one scenario, clearly outperformed.
Moreover, departures from the normality assumption within the scope of marginal distributions introduced above do not seem to affect much the prediction performance of the adaptive regression method, which is far from true for the other methods, especially in scenarios 1 and 2, where heavy-tailed distributions clearly degrade the performance of both the rank-reduced and penalized regression methods.
A complementary simulation study is proposed in Section 4 of the supplementary materials, comparing the same prediction methods and covering a large scope of patterns for the vector of regression coefficients and of synthetic dependence structures (factor models, dependent or independent blocks of equi- or auto-correlated explanatory variables). As observed in the data-driven simulation study, the relative performance of the prediction methods is highly variable, depending on the interplay between the patterns of regression coefficients and of dependence across the explanatory variables. Even under the assumption of a sparse regression model, penalized methods can show poor prediction performance. Also, for strong dependence patterns across explanatory variables, even with a block or factor structure, rank-reduced methods can be outperformed. Over the scenarios considered in this simulation study, the adaptive regression method also turns out to show stable and among the best prediction performance.

Performance Comparisons on Datasets
The same nine prediction methods as in the above data-driven simulation study are now compared using public datasets, by their total sum of squared prediction errors estimated in a 10-fold cross-validation setup. The cross-validation procedure is repeated fifty times on random splittings to give an insight into the variability of the prediction performance.
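The evaluation protocol can be sketched as follows; fit_predict is a hypothetical interface (train on one part, predict on the other) standing for any of the nine compared methods.

```r
## Sketch of the repeated 10-fold cross-validated total sum of squared errors.
cv_total_sse <- function(X, Y, fit_predict, n_rep = 50, n_folds = 10) {
  replicate(n_rep, {
    folds <- sample(rep_len(seq_len(n_folds), length(Y)))  # random splitting
    sum(sapply(seq_len(n_folds), function(k) {
      test <- folds == k
      pred <- fit_predict(X[!test, , drop = FALSE], Y[!test],
                          X[test, , drop = FALSE])
      sum((Y[test] - pred)^2)                              # fold SSE
    }))
  })
}
```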
The dataset available in the R package cggd (Zhang and Melnik 2012), initially described and analyzed in Li, Goovaerts, and Meurens (1996), contains 218 near-infrared spectra of samples of orange juice measured between 1100 and 2500 nm at 2 nm intervals. Three spectra are outliers and are removed from the dataset. The resulting dataset is composed of 215 individuals and 700 explanatory variables. The response variable is the level of saccharose in a sample of orange juice.
The results are displayed in Figure 6. For each method, a boxplot of the 50 total sums of squared prediction errors estimated in a 10-fold cross-validation setup is given. Quite surprisingly since n < p, OLS shows a prediction performance close to Lasso and Ridge. The naive prediction score is by far outperformed by all other methods. The best prediction performance is reached by the proposed adaptive method, PLS and PCR. It turns out that Scout and SLM have markedly lower prediction performance.

Figure 6. Boxplots of 10-fold cross-validated total sums of squared prediction errors for the orange juice dataset (Li, Goovaerts, and Meurens 1996) over 50 random splittings of the data.
The second dataset used for the present comparative study is available in the R package prospectr (Stevens and Ramirez-Lopez 2020) and was used for the "Chimiometrie 2006" challenge (Minasny and McBratney 2008; Fernández-Pierna and Dardenne 2008). This dataset contains 645 absorbance spectra of samples of soil measured between 1100 and 2500 nm at 2 nm intervals. Six spectra are outliers and are removed from the dataset. The response variable is the level of total nitrogen in g/kg of dry soil.
The results are displayed in Figure 7, similarly as for the first illustrative dataset in Figure 6. Contrary to the previous example, OLS now has by far the poorest prediction performance. The naive prediction score is clearly better than OLS but is outperformed by PLS, PCR and the proposed method. This demonstrates that, even in a context where ignoring dependence might yield good prediction performance, methods that adaptively handle dependence can still show improvements. The penalized estimation methods (Lasso, Ridge, SLM, and Scout) have here a lower prediction performance than the proposed method and the rank-reduced estimation methods PLS and PCR.

Computation Time
The computation time of the proposed method is now compared to that of the previously introduced methods. The comparisons are carried out using the previous soil dataset (Fernández-Pierna and Dardenne 2008), with various sample sizes n and numbers of explanatory variables p (by artificially restricting the dataset to its first rows and columns). For each method, the computation time covers all steps of the prediction procedure, from the cross-validation hyperparameter optimization (if needed) to the calculation of the predicted values on the training dataset. The computations are performed on a laptop equipped with an i5-1035G1 CPU and 8 GB RAM, using only one core. The average computation times over 10 replications for each method are reported in Table 2. For methods involving a hyperparameter optimization (Lasso, Ridge, PLS, PCR, and Scout), the computation time depends strongly on n and p, sometimes in unexpected ways. For instance, for p = 100, Lasso is much faster for n = 500 than for n = 100, probably because the convergence of the iterative algorithm is faster for larger sample sizes. For any value of p and n, Scout is much slower than the other approaches, most certainly because it requires the optimization of two hyperparameters. The computational cost of the proposed method is moderate in all situations: it is always faster than the Scout and Lasso methods and, depending on the situation, it can be more efficient than PLS and PCR.

Discussion
How to handle dependence in large and high-dimensional prediction issues does not only depend on the pattern of dependence across explanatory variables. Under an arbitrarily complex dependence structure, neither does it depend only on the sparsity or, more generally, on the shape of the association signal between the response and the explanatory variables. Indeed, for any given dependence structure, there is no uniformly best dependence-handling strategy over all patterns of association signals. The first aim of the present paper is to demonstrate that the choice of a strategy to handle dependence has to specifically account for the interplay between the conditional dependence of the explanatory variables and the association signal. In Section 2, this point is illustrated, for an arbitrary conditional dependence pattern, by a closed-form expression of an association signal for which ignoring dependence is optimal. Some well-known prediction methods, such as Ridge and PLS, designed to address the estimation of linear regression models in high dimension, can be viewed as intermediate strategies in which dependence is partly reduced. Although both of these methods show good prediction performance in many situations, the conditions under which one has to be preferred to the other are unclear. The class $\mathcal{L}$ of prediction scores introduced in Section 3 defines a general framework for the comparison of the naive, OLS, Ridge, and PLS prediction methods and, more generally, offers a wide scope of dependence-handling strategies. Whereas OLS and Ridge correspond to linear coefficients defined as simple functions of the eigenvalues $\hat{\lambda}_j$ of the conditional correlation matrix of the explanatory variables solely, the explicit expression of the optimal linear coefficients turns out to depend in a more complex way both on $\hat{\lambda}$ and on the association signal between the response and the whitened explanatory variables. The corresponding optimal prediction score within $\mathcal{L}$ adapts to various combinations of a dependence structure and a pattern of association signal in the comparative studies conducted in Section 5.
Although these results are very promising, much remains to be done to improve the estimation of the optimal linear coefficients, since the proposed moment estimator raises numerical issues that may generate a strong variability in some high-dimensional situations. One idea could be to propose a nonparametric approximation of the relationship between the optimal linear coefficients and $\hat{\lambda}$, which would extend the parametric forms implicitly used by Ridge and PLS.
Furthermore, the class L defines a new framework for a more specific investigation of the prediction performance of the Ridge and PLS prediction methods. Indeed, although this point goes beyond the scope of the present article, the explicit expression of the squared correlation between the response and the predictors offers new leads to define optimization procedures for the regularization parameter or the number of latent components, without requiring cross-validation.
Finally, it should be noted that, although the present article focuses on the prediction of a real-valued response, similar results for the case of a two-class response variable can be deduced using an analogous approach, extending the results of Dudoit, Fridlyand, and Speed (2002) and Bickel and Levina (2004).

Supplementary Materials
Proofs of the theorems, the simulation study introduced in Section 4.2 in the setup of Witten and Tibshirani (2009), and detailed results of the simulation studies introduced in Section 5.1 are given in the supplementary material file available with this article.