Scaled Predictor Envelopes and Partial Least-Squares Regression

Partial least squares (PLS) is a widely used method for prediction in applied statistics, especially in chemometrics applications. However, PLS is not invariant or equivariant under scale transformations of the predictors, which tends to limit its scope to regressions in which the predictors are measured in the same or similar units. Cook, Helland, and Su (2013) built a connection between nascent envelope methodology and PLS, allowing PLS to be addressed in a traditional likelihood-based framework. In this article, we use the connection between PLS and envelopes to develop a new method—scaled predictor envelopes (SPE)—that incorporates predictor scaling into PLS-type applications. By estimating the appropriate scales, the SPE estimators can offer efficiency gains beyond those given by PLS, and further reduce prediction errors. Simulations and an example are given to support the theoretical claims.


INTRODUCTION
Throughout the article, we consider the multivariate linear regression model

Y = μ_Y + β^T(X − μ_X) + ε,    (1)

where Y ∈ R^r is the response vector, X ∈ R^p is the stochastic predictor vector having mean μ_X and covariance matrix Σ_X > 0, the errors ε ∈ R^r are distributed independently of X with mean 0 and covariance matrix Σ_Y|X > 0, μ_Y ∈ R^r, and β ∈ R^{p×r}. Let S_X, S_XY, and S_Y denote the sample variance of X, the sample covariance between X and Y, and the sample variance of Y.
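As a numerical point of reference for this notation, the model and the sample moments S_X, S_XY, and S_Y can be sketched as follows; the dimensions and parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

# Illustrative data generation for Y = mu_Y + beta^T (X - mu_X) + eps,
# with hypothetical dimensions p = 4 predictors and r = 2 responses.
rng = np.random.default_rng(0)
n, p, r = 500, 4, 2
mu_X, mu_Y = rng.normal(size=p), rng.normal(size=r)
beta = rng.normal(size=(p, r))
Sigma_X = np.diag([4.0, 2.0, 1.0, 0.5])                  # Sigma_X > 0
X = mu_X + rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
eps = rng.normal(scale=0.3, size=(n, r))                 # errors independent of X
Y = mu_Y + (X - mu_X) @ beta + eps

# Sample moments used throughout: S_X, S_XY, and S_Y.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
S_X = Xc.T @ Xc / n
S_XY = Xc.T @ Yc / n
S_Y = Yc.T @ Yc / n
```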
In this context we review partial least squares (PLS), envelopes and the connections between them, and describe how the scales of the predictors can impact the performance of PLS.

Partial Least Squares
PLS originated as a method for prediction in chemometrics, and has historically been defined in terms of the iterative algorithms NIPALS (Wold 1966) and SIMPLS (de Jong 1993). It is an integral part of the chemometrics culture, where much of its development has taken place. Today PLS is used in many disciplines, particularly as a method that improves prediction performance over ordinary least squares (OLS) regression.
PLS operates by reducing the predictors to a few linear combinations, X → Γ^T X, that have the largest covariances with the responses subject to certain constraints. Here Γ ∈ R^{p×u}, u ≤ p, is a semi-orthogonal matrix that we temporarily assume to be known, and u is called the number of components. Estimation and prediction are then based on the OLS fit of the reduced model Y = μ_Y + η^T{Γ^T(X − μ_X)} + ε, where the coefficients η ∈ R^{u×r}. The PLS estimator of β is β̂_PLS = Γη̂ = Γ(Γ^T S_X Γ)^{−1}Γ^T S_XY = P_Γ(S_X) β̂_ols, where β̂_ols is the OLS estimator of β, P_A(S) denotes the projection in the S inner product onto A or span(A) according as A is a subspace or a matrix, and Q_A(S) = I − P_A(S). The population version of β̂_PLS is β_PLS = Γη = Γ(Γ^T Σ_X Γ)^{−1}Γ^T Σ_XY, which depends only on span(Γ) and not on a particular basis. Compared to OLS, PLS has a dimension reduction step, which reduces the p predictors X to u components Γ^T X. When u = p, Γ = I_p and PLS degenerates to OLS. When u < p, PLS often shows better prediction performance than OLS, particularly when there is collinearity among the predictors.
The SIMPLS version of PLS uses the following algorithm to construct an estimator of Γ. Set γ̂_1 equal to the eigenvector of S_XY S_XY^T corresponding to its largest eigenvalue, and let Ĝ_k = (γ̂_1, . . . , γ̂_k), k = 1, . . . , p. Given Ĝ_k with k < p,

γ̂_{k+1} = arg max_g g^T S_XY S_XY^T g, subject to g^T S_X Ĝ_k = 0 and g^T g = 1.    (2)

Then Γ̂_PLS = Ĝ_u is the SIMPLS estimator of Γ. This estimator does not require that S_X > 0, so it does not run into computational difficulties when n < p, depending on the size of u. However, Chun and Keleş (2010) showed that β̂_PLS is inconsistent if p/n → k > 0. Therefore, in this article we limit our asymptotic results to regressions in which p is fixed and n → ∞.
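The SIMPLS steps above can be sketched in a few lines of numpy. This is an illustrative implementation of the constrained maximization by deflation, not the authors' code; the function names are ours, and the helper enforces the constraint g^T S_X Ĝ_k = 0 by projecting S_XY onto the orthogonal complement of span(S_X Ĝ_k).

```python
import numpy as np

def simpls_weights(S_X, S_XY, u):
    """SIMPLS sketch: build the weight matrix column by column.  Each new g
    maximizes g^T S_XY S_XY^T g subject to g^T g = 1 and g^T S_X G_k = 0."""
    p = S_X.shape[0]
    G = np.zeros((p, 0))
    for _ in range(u):
        if G.shape[1] == 0:
            M = S_XY
        else:
            V, _ = np.linalg.qr(S_X @ G)       # orthonormal basis of span(S_X G_k)
            M = S_XY - V @ (V.T @ S_XY)        # deflation enforces g^T S_X G_k = 0
        g = np.linalg.svd(M)[0][:, 0]          # top left singular vector of M
        G = np.column_stack([G, g])
    return G

def beta_pls(S_X, S_XY, u):
    """beta_PLS = Gamma (Gamma^T S_X Gamma)^{-1} Gamma^T S_XY."""
    G = simpls_weights(S_X, S_XY, u)
    return G @ np.linalg.solve(G.T @ S_X @ G, G.T @ S_XY)
```

With u = p the weights span R^p, so β̂_PLS coincides with β̂_ols = S_X^{−1} S_XY, matching the degeneracy noted in the text.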
The NIPALS definition of PLS is the same as SIMPLS, except that it uses a different inner product in the constraints. Since SIMPLS seems more popular and is implemented in software like R, SAS, and MATLAB as the standard PLS algorithm, we focus our discussion on SIMPLS. Through the similarity between the two algorithms, results on SIMPLS extend straightforwardly to NIPALS. Like principal component regression (PCR), ridge regression (RR), penalized regression, and many other methods, β̂_PLS is not invariant or equivariant under scale transformations. Let D ∈ R^{p×p} be a diagonal matrix with positive diagonal elements, transform X to X_D = DX, and let β̂_D,PLS denote the PLS estimator of β_D for the transformed data. Then in general we have neither β̂_D,PLS = β̂_PLS nor β̂_D,PLS = D^{−1} β̂_PLS. In fact, the number of components u may even change with a scale transformation of the predictors. This suggests that the advantages of PLS may not be realized if some of the predictors are measured in different units.
We illustrate this lack of invariance in Figure 1, which depicts a stylized population regression with a univariate response, p = 2 centered predictors represented along the axes, and three different scalings for the predictors. Figure 1(a) shows a contour of the distribution of the original unscaled predictors X = (X_1, X_2)^T along with the coefficient vector β and the eigenvectors v_{1,1} and v_{2,1} of Σ_X.
When the response is univariate, β can always be represented uniquely as a linear combination of the eigenvectors of Σ_X, and the number of components u is equal to the number of eigenvectors needed for this representation (Helland and Almøy 1994; Naik and Tsai 2000). In Figure 1(a), β aligns with neither the first eigenvector v_{1,1} nor the second eigenvector v_{2,1}. This means both eigenvectors are needed to represent β and thus that u = 2, Γ = I, and PLS reduces to OLS, so there is no predictive gain.
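The eigenvector-counting characterization of u can be checked numerically. The following is a small hypothetical p = 2 example (the matrix and vectors are ours, not from the article): u equals the number of nonzero coordinates of β in the eigenvector basis of Σ_X.

```python
import numpy as np

# Hypothetical 2x2 predictor covariance and two candidate coefficient vectors.
Sigma_X = np.array([[2.0, 0.8],
                    [0.8, 1.0]])
eigval, V = np.linalg.eigh(Sigma_X)      # columns of V are eigenvectors of Sigma_X

def n_components(beta, V, tol=1e-10):
    """Count eigenvectors needed to represent beta (univariate response case)."""
    coords = V.T @ beta                  # coordinates of beta in the eigenvector basis
    return int(np.sum(np.abs(coords) > tol))

beta_general = np.array([1.0, -0.5])     # aligns with neither eigenvector
beta_aligned = 3.0 * V[:, 0]             # a multiple of a single eigenvector

print(n_components(beta_general, V))     # 2: PLS reduces to OLS, no gain
print(n_components(beta_aligned, V))     # 1: predictive gain possible
```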
It is common practice to scale every predictor to have standard deviation equal to 1 and then apply PLS to the scaled predictors (see Chapter 10.2.1 in Eriksson et al. (2006) for more details). We followed this practice in Figure 1(b), which represents scaled predictors D −1 X, where D = diag(σ 1 , σ 2 ) with σ 1 and σ 2 being the population standard deviations for X 1 and X 2 .
The coefficient vector in this scale is Dβ, and the eigenvectors of var(D^{−1}X) = D^{−1}Σ_X D^{−1} are denoted v_{1,2} and v_{2,2}. The new coefficient vector Dβ still does not align with v_{1,2} or v_{2,2}, which implies that u = 2 and PLS again degenerates to OLS. This illustrates that standardizing the predictors does not necessarily improve prediction performance. (See Section A in the online supplement for a more general treatment of this phenomenon.) In this article, we propose a new scaling method that is designed to estimate a diagonal scaling matrix Λ = diag(λ_1, λ_2) so that the coefficient vector for the rescaled predictors Λ^{−1}X aligns with an eigenvector of var(Λ^{−1}X). Consequently, we can expect improved performance of PLS in the new scale. Applying the population version of the proposed method to Figure 1(a) results in Figure 1(c), where the coefficient vector in the transformed scale aligns with the first eigenvector v_{1,3} of var(Λ^{−1}X), so u = 1 and we have predictive gains. Our development of the new scaling scheme exploits the connection between PLS and envelopes established by Cook, Helland, and Su (2013), so the rest of this introduction is devoted to a review of envelopes.

Envelopes
The overarching goal of envelope models and methods is to increase efficiency in multivariate parameter estimation and prediction. Speaking informally, this is achieved by enveloping the information in the data that is material to the estimation of the parameters of interest while excluding the information that is immaterial to estimation. The reduction in estimative variation can be quite substantial when the variation in the immaterial information is relatively large.
Envelopes were first used by Cook, Li, and Chiaromonte (2010) to account for immaterial variation in the response vector, resulting in an estimator of β that has the potential to be much less variable than the standard OLS estimator. Cook, Helland, and Su (2013) used envelopes to account for immaterial variation in the predictor vector, resulting in a different envelope estimator that outperforms the OLS and PLS estimators. We continue the review of envelopes in this setting, which is the context that leads to our proposed scaling method.
Let S be a subspace of R^p and suppose that Q_S X satisfies the following two conditions: (I) Q_S X is uncorrelated with P_S X, and (II) Y is uncorrelated with Q_S X given P_S X. Cook, Helland, and Su (2013) showed that condition (I) is equivalent to requiring that (A) S be a reducing subspace of Σ_X, and that condition (II) is equivalent to (B) B ⊆ S, where B = span(β). The Σ_X-envelope of B, denoted by E_{Σ_X}(B), is defined as the intersection of all subspaces that satisfy (A) and (B). The envelope E_{Σ_X}(B) then has the property that cov(P_E X, Q_E X) = cov(Y, Q_E X) = 0. Consequently, Q_E X has no linear effect on either Y or P_E X. We refer informally to P_E X and Q_E X as the material and immaterial information in X.
We use E for E_{Σ_X}(B) when it appears in subscripts, and let u = dim(E_{Σ_X}(B)).
The coordinate form of the envelope model is

Y = μ_Y + η^T Γ^T(X − μ_X) + ε,  Σ_X = ΓΩΓ^T + Γ_0Ω_0Γ_0^T,    (3)

where the columns of Γ ∈ R^{p×u} form an orthogonal basis of E_{Σ_X}(B), (Γ, Γ_0) ∈ R^{p×p} is an orthogonal matrix, ΓΩΓ^T = var(P_E X) is the variation of the material information, Γ_0Ω_0Γ_0^T = var(Q_E X) is the variation of the immaterial information, and Ω ∈ R^{u×u} and Ω_0 ∈ R^{(p−u)×(p−u)} are positive definite matrices. Assuming that (X, Y) is multivariate normal, Cook, Helland, and Su (2013) developed the likelihood estimator Γ̂_env of a basis for E_{Σ_X}(B). They showed that the resulting envelope estimator β̂_env = Γ̂_env(Γ̂_env^T S_X Γ̂_env)^{−1}Γ̂_env^T S_XY = P_{Γ̂_env}(S_X) β̂_ols of β is asymptotically at least as efficient as the OLS estimator, and that the efficiency gain can be substantial when ‖Ω‖ is large relative to ‖Ω_0‖, where ‖·‖ is the spectral norm. Additionally, they proved that β̂_env is a √n-consistent estimator of β under model (1) without normality.
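The structure of model (3) can be verified numerically: building Σ_X from a random orthogonal (Γ, Γ_0) and positive definite Ω, Ω_0 gives cov(P_E X, Q_E X) = P_E Σ_X Q_E = 0. The dimensions below are hypothetical.

```python
import numpy as np

# Hypothetical check of the envelope decomposition in model (3):
# Sigma_X = Gamma Omega Gamma^T + Gamma_0 Omega_0 Gamma_0^T implies
# P_E Sigma_X Q_E = 0, i.e., no cross-covariance between material and
# immaterial parts of X.
rng = np.random.default_rng(1)
p, u = 5, 2
Q_full, _ = np.linalg.qr(rng.normal(size=(p, p)))
Gamma, Gamma0 = Q_full[:, :u], Q_full[:, u:]          # (Gamma, Gamma_0) orthogonal

A = rng.normal(size=(u, u))
Omega = A @ A.T + u * np.eye(u)                       # Omega > 0
B = rng.normal(size=(p - u, p - u))
Omega0 = B @ B.T + np.eye(p - u)                      # Omega_0 > 0
Sigma_X = Gamma @ Omega @ Gamma.T + Gamma0 @ Omega0 @ Gamma0.T

P = Gamma @ Gamma.T                                   # projection onto the envelope
Q = np.eye(p) - P
print(np.abs(P @ Sigma_X @ Q).max())                  # numerically zero
```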
Key findings for the purposes of this article are that Γ̂_PLS is a √n-consistent estimator of a basis for E_{Σ_X}(B) and that the number of PLS components corresponds to the dimension u of E_{Σ_X}(B) (Cook, Helland, and Su 2013). Thus, there is a very close connection between the SIMPLS implementation of PLS and envelopes: the envelope and SIMPLS estimators, β̂_env and β̂_PLS, have the same form and are based on the same population construct E_{Σ_X}(B), but differ in their methods of estimating a basis for E_{Σ_X}(B). Further, β̂_env typically dominates β̂_PLS in both estimation and prediction and is less sensitive to the number of components selected (Cook, Helland, and Su 2013).
Like PLS, envelope methods are not invariant or equivariant under scale transformations. Methods to achieve scale invariance for envelopes applied to response reduction in multivariate linear regression were discussed by Cook and Su (2013). The basic idea underlying our proposed method is the same: introduce scaling parameters to estimate the best rescaling of the variables under consideration. However, here our focus is on predictor reduction, which is a related but distinctly different problem. The theoretical and methodological developments and the operating characteristics of methods for predictor envelopes are quite different from those for response envelopes. For instance, Cook and Su (2013) conditioned on the predictors, treating them as ancillary, while here the predictors are random and not ancillary. The connection with PLS arises in the context of predictor reduction, not response reduction. Cook and Su allowed one rescaling parameter for each response variable, while here we allow groups of predictors to be scaled in the same way. And there are natural constraints on the dimension of scaled predictor envelopes that do not occur for scaled response envelopes, as described in Proposition 2.2.
In the following section, we develop scale-invariant versions of β̂_env and β̂_PLS that can identify the required scale transformation in Figure 1(a), transform to Figure 1(c), where estimation is carried out, and then transform the estimator back to the original scale of Figure 1(a).

Formulation
To develop scale-invariant methodology, we add scaling parameters to model (3). Let Λ ∈ R^{p×p} be a diagonal matrix with repeated diagonal elements in blocks, 1, . . . , 1, λ_1, . . . , λ_1, . . . , λ_{q−1}, . . . , λ_{q−1}, where 1, λ_1, . . . , λ_{q−1} are q positive numbers. Suppose that the ith of these q scalings has r_i replications, with r_1 + · · · + r_q = p. We propose this construction of Λ because in applications there may be groups of variables that we want to scale in the same way. We seek a transformation X → Λ^{−1}X so that the regression of Y on Λ^{−1}X follows an envelope model. Let E_{Λ^{−1}Σ_XΛ^{−1}}(ΛB) denote the envelope in the transformed scale, and let Γ ∈ R^{p×u} be an orthogonal basis of it. Then we have the following extension of model (3):

Y = μ_Y + η^T Γ^T Λ^{−1}(X − μ_X) + ε,  Σ_X = Λ(ΓΩΓ^T + Γ_0Ω_0Γ_0^T)Λ,    (4)

where β = Λ^{−1}Γη, η ∈ R^{u×r} carries the coordinates of β with respect to Λ^{−1}Γ, Γ_0 ∈ R^{p×(p−u)} is the completion of Γ such that (Γ, Γ_0) is an orthogonal matrix, and Ω ∈ R^{u×u} and Ω_0 ∈ R^{(p−u)×(p−u)} are positive definite matrices. We fix the first diagonal element of Λ at 1 for identifiability; otherwise, we could multiply Λ by an arbitrary constant and absorb the change into η. When u = p, there is no reduction and (4) degenerates to the standard multivariate linear regression model. This is consistent with PLS: when the number of components is p, the SIMPLS algorithm returns the OLS estimator. If Λ were known, then model (4) would reduce to model (3) for the regression of Y on Λ^{−1}X. We call model (4) the scaled predictor envelope (SPE) model. It is scale invariant, as scaling is considered directly in the model-building process.
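The grouped block structure of Λ can be sketched with a small helper; `make_Lambda` is a hypothetical name of our own, with the i-th of the q scales repeated r_i times and the first scale fixed at 1 for identifiability.

```python
import numpy as np

def make_Lambda(scales, reps):
    """Hypothetical helper: build the grouped diagonal scaling matrix Lambda
    from q scales (first fixed at 1) and their replication counts r_1..r_q."""
    assert scales[0] == 1.0, "first scale is fixed at 1 for identifiability"
    assert len(scales) == len(reps)
    return np.diag(np.repeat(scales, reps))

# q = 3 groups with replications (2, 3, 1), so p = 6.
Lam = make_Lambda([1.0, 2.0, 0.5], [2, 3, 1])
print(np.diag(Lam))      # [1. 1. 2. 2. 2. 0.5]
```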

Estimation
In this section, we develop estimators for β and Σ_X assuming that (X, Y) follows a multivariate normal distribution. When this multivariate normality holds, we refer to (4) as the normal SPE model. Normality is not required by the SPE model (4), and the normal-based estimators perform well even when normality does not hold; see Section 2.4 for a statement of consistency and Section 4 for a numerical experiment.
Suppose that the data (X_i, Y_i), i = 1, . . . , n, are independently and identically distributed. Let X̄ and Ȳ denote the sample means of X and Y. Let X be the n × p matrix whose ith row is X_i^T and let Y be the n × r matrix whose ith row is Y_i^T. The centered data matrices are denoted by X_c = X − 1_n X̄^T and Y_c = Y − 1_n Ȳ^T, where 1_n is an n × 1 vector of ones. With u fixed, the parameters to be estimated by maximum likelihood are μ_X, μ_Y, Λ, Γ, η, Σ_Y|X, Ω, and Ω_0. Estimates of these constituent parameters are then used to estimate β = Λ^{−1}Γη.
The maximum likelihood estimators Γ̂ of Γ and Λ̂ of Λ are obtained by minimizing an objective function L(Γ, Λ), given as (5), over the set of p × u semi-orthogonal matrices for Γ and over the positive real numbers for the diagonal elements of Λ. (See Section B in the online supplement for details.) Optimization of (5) can be performed by an alternating algorithm. Given an initial value Λ_1 of Λ, we minimize (5) over the p × u Grassmannian to get Γ_1 = arg min_Γ L(Γ, Λ_1). Having Γ_1, we can update Λ by minimizing (5) using any standard program in R or MATLAB, Λ_2 = arg min_Λ L(Γ_1, Λ). We iterate between Γ and Λ until the difference between the objective functions in two adjacent iterations is smaller than a prespecified value. Once we have Γ̂ and Λ̂, the maximum likelihood estimators of the remaining parameters follow by substitution; in particular, the SPE estimator of β is Λ̂^{−1} P_Γ̂(Λ̂^{−1}S_XΛ̂^{−1}) β̂*_ols, where β̂*_ols is the OLS estimator based on the rescaled predictors Λ̂^{−1}X and Y. So SPE estimates the scaling parameter Λ, rescales the data, performs ordinary envelope estimation on the rescaled data to get P_Γ̂(Λ̂^{−1}S_XΛ̂^{−1}) β̂*_ols, and then transforms that estimator back to the original scale. The SPE model also provides an alternative estimator of Σ_X besides the standard estimator S_X.
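Given estimates Λ̂ and Γ̂ from the alternating algorithm, the rescale-project-rescale structure of the SPE estimator can be sketched as follows. This is an illustrative outline of ours, with the likelihood fitting of Λ̂ and Γ̂ assumed to have been done elsewhere.

```python
import numpy as np

def proj(A, S):
    """Projection P_A(S) = A (A^T S A)^{-1} A^T S onto span(A) in the S inner product."""
    return A @ np.linalg.solve(A.T @ S @ A, A.T @ S)

def beta_spe(S_X, S_XY, Lam, Gamma):
    """Sketch: assemble the SPE estimator from hypothetical fitted Lam_hat and
    Gamma_hat -- rescale the sample moments, envelope-project the OLS estimator
    in the rescaled coordinates, then transform back to the original scale."""
    Lam_inv = np.diag(1.0 / np.diag(Lam))
    S_X_star = Lam_inv @ S_X @ Lam_inv            # sample variance of Lam^{-1} X
    S_XY_star = Lam_inv @ S_XY                    # sample covariance of Lam^{-1} X and Y
    beta_star = proj(Gamma, S_X_star) @ np.linalg.solve(S_X_star, S_XY_star)
    return Lam_inv @ beta_star                    # back to the original scale
```

As a sanity check, with Γ̂ = I_p (no reduction) the estimator collapses to β̂_ols for any positive diagonal Λ̂, matching the scale invariance discussed in the text.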
The global minimizer of the objective function (5) is not unique: if Γ̂ minimizes (5), then so does Γ̂O for any orthogonal matrix O ∈ R^{u×u}. However, as the optimization is essentially over a Grassmannian, span(Γ̂) is typically unique. Occasionally the objective function may be flat along some directions, and then the minimizers will not be unique or will be ill determined. But these nonuniquenesses are not an issue, as Γ̂ and Λ̂ are constituents of the estimators of the parameters of interest, β and Σ_X, which are both identifiable. Additionally, the SPE estimator β̂_spe is unique even when the global minimizer of (5) is not unique. These properties are discussed further in Section C of the online supplement. The uniqueness of β̂_spe and Σ̂_X,spe provides the foundation for our discussion of their asymptotic distributions and consistency in Sections 2.3 and 2.4. Section D of the online supplement contains proofs of the propositions to follow.

Asymptotic Variance
In this section, we give the asymptotic variances of the SPE estimators β̂_spe and Σ̂_X,spe assuming normality.
If a quantity stems from the ordinary envelope model (3), it is designated with the subscript o. For instance, Y and Λ^{−1}X follow an ordinary envelope model and thus we write β_o = Λβ and Σ_o = Λ^{−1}Σ_XΛ^{−1}. We use vec(·) to denote the operator that maps a matrix to a vector columnwise and vech(·) for the operator that maps the lower triangle of a symmetric matrix to a vector columnwise; the gradient matrix under model (3) enters the asymptotic covariance matrix in Proposition 2.1 below. Let bdiag(·) denote a block diagonal matrix with its diagonal blocks as arguments. The column vector λ = (λ_1, . . . , λ_{q−1})^T contains the q − 1 unique elements of Λ, so that λ^T = vec^T(Λ)L, where L = (e_{r_1+1} ⊗ e_{r_1+1}, . . . , e_{p−r_q+1} ⊗ e_{p−r_q+1}) ∈ R^{p²×(q−1)} extracts the q − 1 scaling parameters from vec(Λ), ⊗ denotes the Kronecker product, and e_i ∈ R^{p×1} contains a 1 in the ith position and 0 elsewhere.
Proposition 2.1. Under the normal SPE model (4), √n{vec^T(β̂_spe − β), vech^T(Σ̂_X,spe − Σ_X)}^T converges in distribution to a normal random vector with mean zero and covariance matrix V = V_1 + V_2. The estimators β̂_spe and Σ̂_X,spe are asymptotically at least as efficient as the corresponding OLS estimators.

In Proposition 2.1, the asymptotic covariance matrix V is decomposed into two parts: V_2 is the asymptotic variance when the scaling parameter Λ is known. It is a rescaled version of the asymptotic variance given by Cook, Helland, and Su (2013) for the regression of Y on Λ^{−1}X. As a consequence, we think of V_1 as the asymptotic cost of estimating Λ. Now we focus on the asymptotic variance of vec(β̂_spe), which is the upper left pr × pr block of V. We denote the upper left pr × pr block of V_1 as T_1 and the upper left pr × pr block of V_2 as T_2. Then we measure the relative cost of estimating Λ as C = tr(T_2^{−1/2} T_1 T_2^{−1/2}). Section E in the online supplement contains a plot of the relative cost of estimating Λ under different signal and noise levels. It is possible to have C = 0 in some cases. Proposition 2.1 also states that the SPE estimators are asymptotically at least as efficient as the OLS estimators. In the following discussion, we explore some cases where the asymptotic variance of the SPE estimator is the same as that of the OLS estimator. This gives clues on when the SPE model is likely to give more efficient estimators than OLS.

Proposition 2.2. Under the normal SPE model (4), when u ≥ p − (q − 1)/r, the estimators β̂_spe and Σ̂_X,spe have the same asymptotic covariances as the OLS estimators.
Let ⌈·⌉ denote the ceiling of a number and define u_0 = ⌈p − (q − 1)/r⌉. Proposition 2.2 indicates that when u ≥ u_0 the SPE estimators and the OLS estimators have the same asymptotic variances. Two special cases are summarized in the following corollaries.

Corollary 2.2. Under the normal SPE model (4), if r ≥ q, then u_0 = p and thus the SPE and OLS estimators have the same asymptotic variance when u = p.

As a consequence, when the number of responses r strictly exceeds the number of estimated scaling parameters q − 1, the SPE and OLS estimators have the same asymptotic variance when u = p; that is, when the SPE model (4) reduces to the standard multivariate linear regression model.

Corollary 2.3. Under the normal SPE model (4), if r = 1 and q = p, then u_0 = 1 and thus the SPE estimator always has the same asymptotic variance as the OLS estimator.
It follows from this corollary that there is no point in rescaling all of the predictors in univariate linear regression, since then the asymptotic variance of the SPE estimator reduces to that of the OLS estimator. However, progress is still possible in univariate regressions when rescaling the predictors in groups. For instance, suppose that r = 1, p = 20, and q = 5, so there are five groups of predictors, where predictors within a group are scaled in the same way. Then according to Proposition 2.2, the SPE and OLS estimators have the same asymptotic variance only when u ≥ 16. Since in practice the number of components u is often small relative to p, we might reasonably expect gains in this setting.
As a consequence of Proposition 2.2, the SPE estimator is effectively constrained by the condition u < u 0 , and we normally do not bother computing the SPE estimator when u ≥ u 0 . In those cases we can still consider the OLS, SIMPLS, and ordinary envelope estimators, whose relative performance was characterized by Cook, Helland, and Su (2013).
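The threshold u_0 is a one-line computation; the following sketch reproduces the cases discussed above (the function name is ours).

```python
import math

def u0(p, q, r):
    """Threshold u_0 = ceil(p - (q - 1)/r) from Proposition 2.2: for u >= u_0
    the SPE and OLS estimators share the same asymptotic variance."""
    return math.ceil(p - (q - 1) / r)

print(u0(p=20, q=5, r=1))    # 16: grouped scaling in a univariate regression
print(u0(p=10, q=10, r=1))   # 1: Corollary 2.3 (r = 1, q = p), no gain possible
print(u0(p=10, q=5, r=8))    # 10: Corollary 2.2 (r >= q), u_0 = p
```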
Cook, Helland, and Su (2013, Corollary 2.1) showed that the asymptotic variance of the envelope estimator β̂_env is the same as that of the OLS estimator β̂_ols when the predictors are uncorrelated with equal variances and β has rank r. A similar result holds for the SPE estimator β̂_spe when Σ_X is diagonal, since then scaling with Λ = Σ_X^{1/2} gives Σ_o = I_p, where the result of Cook, Helland, and Su applies. Accordingly, like the envelope and PLS estimators, the SPE estimator offers the greatest gains when there is notable collinearity among the predictors.
The efficiency gain also depends on the relative magnitudes of Ω and Ω_0. Substantial efficiency gains are expected when ‖Ω‖ is large relative to ‖Ω_0‖.

Consistency
Although the SPE estimators of β and Σ_X are derived using the normal likelihood, they are √n-consistent without the normality assumption.

Proposition 2.3. Assume that model (4) holds and that (Y, X) has finite fourth moments. Then √n{vec^T(β̂_spe − β), vech^T(Σ̂_X,spe − Σ_X)}^T is asymptotically normally distributed, and β̂_spe and Σ̂_X,spe are √n-consistent estimators of β and Σ_X.
Although we do not have a useful expression for the asymptotic variance in this case, we have found in simulations that the bootstrap gives a good estimate of the actual variance.

Selection of u
To select the dimension of E_{Λ^{−1}Σ_XΛ^{−1}}(ΛB), likelihood-based methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and likelihood ratio testing (LRT), or other information criteria, can be used. Cross-validation can also be used. We tend to prefer BIC for parameter estimation and cross-validation for prediction. To use BIC, for 0 ≤ u ≤ p, let l(u) be the maximized log-likelihood under model (4), and let N(u) be the number of parameters as discussed in Section 2.1. The BIC estimator of u is arg min_u {−2l(u) + log(n)N(u)}. Properties of BIC were studied by Cook and Su (2013, Proposition 4) in the context of response scaling. Similar results hold for the SPE model: let the candidate set be the set of SPE models having dimensions varying from 0 to p. If the true model is in the candidate set then, as n → ∞, BIC will select the true model with probability tending to 1, AIC will select a model that at least contains the true model, and LRT will select the true model with probability 1 − α, where α is the significance level.
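The BIC criterion above can be sketched generically. In the sketch, `loglik` and `n_params` stand in for the maximized log-likelihood l(u) and parameter count N(u) of fitted SPE models; the numerical values are made-up placeholders, not output from a real fit.

```python
import numpy as np

def select_u_bic(loglik, n, n_params):
    """BIC selection of the dimension: u_hat = argmin_u -2 l(u) + log(n) N(u).
    `loglik[u]` and `n_params[u]` are placeholders for the SPE fit at each u."""
    bic = [-2.0 * loglik[u] + np.log(n) * n_params[u] for u in range(len(loglik))]
    return int(np.argmin(bic))

# Hypothetical values for p = 3: the likelihood improves sharply up to u = 1
# and only marginally afterward, while N(u) keeps growing, so BIC picks u = 1.
loglik = [-520.0, -410.0, -406.0, -405.0]
n_params = [20, 26, 31, 35]
print(select_u_bic(loglik, n=100, n_params=n_params))   # 1
```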

SCALED SIMPLS ALGORITHM
Recall that the algorithm described in Section 2.2 for maximizing the likelihood requires a starting value Λ_0 for Λ. Our experience indicates that the algorithm converges reliably using the default choice Λ_0 = I_p, but also that it might take a long time to converge depending on characteristics of the regression. Better starting values can reduce the time to convergence.
In this section, we introduce a relatively fast scaled SIMPLS algorithm, which we denote scaled partial least squares (SPLS). While the main role of SPLS is to produce starting values for the primary algorithm described in Section 2.2, our experience indicates that it can also serve as an effective diagnostic for the need for scaling, since when scaling is needed it typically outperforms SIMPLS on cross-validation prediction error. It might also be used as a stand-alone prediction method in some regressions. Compared to SIMPLS, it incorporates a scaling parameter Λ and returns a scale-invariant SIMPLS estimator.
On the first iteration we set Λ_0 = I_p, or any reasonable guess, and then get Γ_1 by applying the SIMPLS algorithm to the regression of Y on Λ_0^{−1}X. Then, given Γ_1, we update Λ by minimizing the objective function (5), which gives Λ_1. Subsequent iterations proceed as follows. Let Γ_i and Λ_i be the estimates of Γ and Λ from the ith iteration. We construct Γ_{i+1} by first getting Γ̃ from the SIMPLS algorithm applied to the regression of Y on Λ_i^{−1}X. Then, for a real number a ∈ (0, 1), we construct Γ_a as an orthogonal basis of span{aΓ_i + (1 − a)Γ̃} and find the optimal value of a as a* = arg min_a L(Γ_a, Λ_i). The next update Γ_{i+1} is constructed as an orthogonal basis of the span of a*Γ_i + (1 − a*)Γ̃, and Λ_{i+1} is obtained by minimizing (5) with Γ_{i+1} fixed. In this way, the optimization processes for Γ and Λ share the same objective function, which decreases monotonically as we iterate.
The SPLS algorithm uses SIMPLS rather than Grassmann optimization to update Γ, so SPLS and SPE produce different estimators of β. But the SPLS algorithm is faster and typically provides a very good starting value for SPE. In timing experiments with p = 100, r = 8, and u = 5, the running time of the SPLS algorithm was about 25% of that of the SPE algorithm, both starting at Λ_0 = I_p. Using the SPLS algorithm to get starting values for SPE cut the running time in half relative to the SPE algorithm with Λ_0 = I_p. Additional support for using SPLS to get starting values is given in Section 4. As mentioned previously, we consider the SPE estimator only when u < u_0. To be consistent, we also consider the SPLS estimator only when u < u_0.
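The interpolation-and-line-search update for Γ can be sketched as follows. The `objective` argument is a placeholder for (5) with Λ_i held fixed; the function names and the grid search over a are our illustrative choices, not the authors' implementation.

```python
import numpy as np

def interpolate_basis(G_old, G_new, a):
    """Orthonormal basis of span{a G_old + (1 - a) G_new}, via QR."""
    Q, _ = np.linalg.qr(a * G_old + (1.0 - a) * G_new)
    return Q

def spls_step(G_old, G_new, objective, grid=np.linspace(0.01, 0.99, 99)):
    """One SPLS update for Gamma: search a in (0, 1) for the interpolated basis
    that minimizes the shared objective (a placeholder for (5) with Lambda_i
    fixed), so the update cannot do worse than any candidate on the grid."""
    a_star = min(grid, key=lambda a: objective(interpolate_basis(G_old, G_new, a)))
    return interpolate_basis(G_old, G_new, a_star), float(a_star)
```

Taking a close to 1 nearly recovers the current Γ_i, so the line search is what keeps the shared objective from increasing across iterations.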

SIMULATIONS
In this section, we report results from simulation studies to investigate the estimative and predictive behaviors of methods discussed in previous sections.

Estimative Performance
To compare SPE, SIMPLS, and OLS on estimative performance, we generated data from model (4) with p = 10, r = 8, and u = 5. We took Ω = σ²I_u and Ω_0 = σ_0²I_{p−u} with σ = 5 and σ_0 = √5. The scaling parameter Λ had diagonal elements 2^0, 2^0.5, 2^1, . . . , 2^4.5. The elements in μ_Y and Λ^{−1}μ_X were independent standard normal variates, the elements in η were from the uniform (0, 2) distribution, and (Γ, Γ_0) was obtained by normalizing a p × p matrix whose elements were generated as independent uniform (0, 1) variates. We simulated the error vector ε from the multivariate normal distribution with mean 0 and covariance matrix Σ_Y|X = ADA^T, where A was an orthogonal matrix obtained by normalizing an r × r matrix of uniform (0, 1) random variates, and D was a diagonal matrix with diagonal elements 1, 2, . . . , r. We generated 200 replications for sample sizes 100, 200, 300, 500, 800, and 1200. For each replication, we estimated β using SPE, SIMPLS, and OLS. For the SPE estimator, we used both the true value of {Λ, span(Γ)} and the SPLS estimator {Λ̂_spls, span(Γ̂_spls)} as starting values. We computed the mean squared error (MSE) for each element of the estimators at each sample size, and the results for two elements are shown in Figure 2. We always used the true u = 5 for the SPE estimator.
Prediction is often the goal in applications of the ordinary SIMPLS algorithm, and the number of components u is typically chosen by cross-validation or a hold-out sample. Because of the variance-bias tradeoff, the best u for prediction might not be the best for estimation. To give SIMPLS an edge in this simulation, for each sample size we selected the number of components u to give the smallest MSE for the selected element of β. As it turned out in this example, a larger value of u usually minimized the MSE, as a small value of u typically leads to large bias. The two panels in Figure 2 give results for two elements of β, which represent two common patterns that appear across all the elements of β. For both patterns, we notice that the SIMPLS estimator has a larger MSE than the OLS estimator, that the SPE estimator has the smallest MSE, and that the SPLS estimator offers a good starting value for the SPE estimator. Plots of the standard deviation and the absolute value of the bias are included in the online supplement. Table 1 provides the means and standard deviations of the 200 SPE estimated scales. The estimates seem quite good.
To gain insight into the effects of nonnormality, we generated the errors from the t distribution with 6 degrees of freedom, the uniform (0, 1) distribution, and the chi-squared distribution with 4 degrees of freedom, representing distributions with heavy tails, short tails, and skewness. The results for the SPE estimator are summarized in Figure 3. Since the SPE estimator is asymptotically unbiased, as indicated by Proposition 2.3, and estimation variance is the main contributor to its MSE, as demonstrated in Figure 2 and the plots in Section F of the online supplement, we plot standard deviations for clarity of comparison. From Proposition 2.3, and this and other simulations, we conclude that the SPE estimator is robust to moderate departures from normality.
We also checked the performance of the estimators when the scales are all 1, to obtain some idea of the potential loss when scaling is unnecessary. We repeated the simulation with the same settings as for Figure 2, but with all scales λ_i set to 1. The results are displayed in Figure 4. For the SIMPLS estimator, we again chose the number of components to minimize the MSE, and the SPE estimator again has the smallest MSE. The plots of the standard deviation and the absolute value of the bias are in Section F of the online supplement. When Λ = I_p no scaling is necessary and model (4) reduces to the envelope model of Cook, Helland, and Su (2013). Figure 4 then confirms what is known about the relative behavior of the estimators: SIMPLS performs better than OLS, and the envelope estimator performs better than both.
We also performed a simulation to demonstrate the effect of Ω and Ω_0 on the efficiency gains of the SPE model. We used the same setting as in Figure 2, but reversed the values of σ² and σ_0². From Figure 5, we notice that the efficiency gain from SPE is small compared to that in Figure 2, and that SIMPLS fails in this case because it always looks in the direction with the larger variation. This is not an issue when the directions of larger variation are material, as in many chemometrics applications. But it is a serious problem for SIMPLS, and by extension SPLS, when the direction of larger variation is immaterial. Following the discussion at the end of Section 2.3, the SPE model works as expected in both cases. Plots of the standard deviation and the absolute value of the bias are included in Section F of the online supplement.

Predictive Performance
To study predictive performance, we took p = 10, r = 8, u = 1, n = 60, and generated the data under the SPE model (4). The covariance matrix of X had the structure Σ_X = Λ(ΓΩΓ^T + Γ_0Ω_0Γ_0^T)Λ with Ω = σ²M_1M_1^T and Ω_0 = σ_0²M_2M_2^T, where σ = 3, σ_0 = 1, and the elements in M_1 ∈ R^{u×u} and M_2 ∈ R^{(p−u)×(p−u)} were independent uniform (0, 1) random variates. The eigenvalues of Σ_X ranged from 0.82 to 1.12 × 10^6. The orthogonal matrix (Γ, Γ_0) was obtained by normalizing a p × p matrix of independent uniform (0, 1) random variates. The error vector ε was generated from a multivariate normal distribution with mean 0 and covariance matrix Σ_Y|X, where Σ_Y|X had eigenvalues 1, 2, . . . , r. The diagonal elements of Λ were 1, 2^1, 2^2, . . . , 2^9, so q = 10. The vectors μ_Y and Λ^{−1}μ_X consisted of independent standard normal variates, and η was a u × r matrix of independent uniform (0, 5) variates. We used cross-validation to estimate the prediction error, and the identity inner product was used to combine the elements of (Y − Ŷ). For each number of components, we computed the average prediction errors of the SPE (with SPLS starting values), SIMPLS, SPLS, and OLS estimators based on 50 five-fold cross-validations with random partitions. The results are summarized in Figure 6.
With u = 1, the SPE estimator reduced the prediction error by 10.6% compared to the OLS estimator. If we overestimate u, the prediction error of the SPE estimator increases, but it was never greater than that of the OLS estimator. From Proposition 2.2, u₀ = p − ⌊(q − 1)/r⌋ = 9 and, as expected, the SPE and OLS estimators had essentially the same prediction error when u ≥ 9. The best SIMPLS estimator in this case had u = 8, its prediction error being 8.74% larger than that of the SPE estimator with u = 1. Figure 6 shows that the SPLS algorithm does quite well at the true value of u: it reduces the prediction error by 28.9% compared to the SIMPLS estimator at u = 1, and by 2.5% even compared to the best SIMPLS estimator. The SIMPLS estimator seems quite sensitive to the number of components, which is consistent with the findings of Cook, Helland, and Su (2013).
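The repeated five-fold cross-validation used to compute these averages can be sketched generically as follows (our own naming, not the authors' code); any estimator supplying a fit/predict pair plugs in, shown here with OLS:

```python
import numpy as np

def cv_prediction_error(fit, predict, X, Y, n_folds=5, n_reps=50, seed=0):
    """Average prediction error over repeated random K-fold splits, using
    the identity inner product on (Yhat - Y) as in the text."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errs = []
    for _ in range(n_reps):
        folds = np.array_split(rng.permutation(n), n_folds)
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate(folds[:k] + folds[k + 1:])
            model = fit(X[train], Y[train])
            resid = Y[test] - predict(model, X[test])
            errs.append(np.mean(np.sum(resid**2, axis=1)))
    return float(np.mean(errs))

# Example with OLS (intercept included).
ols_fit = lambda X, Y: np.linalg.lstsq(
    np.hstack([np.ones((len(X), 1)), X]), Y, rcond=None)[0]
ols_pred = lambda B, X: np.hstack([np.ones((len(X), 1)), X]) @ B

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
Y = X @ rng.normal(size=(10, 8)) + rng.normal(size=(60, 8))
print(cv_prediction_error(ols_fit, ols_pred, X, Y, n_reps=5))
```

Averaging over repeated random partitions, as done here, reduces the dependence of the reported error on any single split.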

DATA ANALYSIS
In this section, we demonstrate the performance of the SPE estimator using the chemometrics data published by Skagerberg, MacGregor, and Kiparissides (1992). The n = 56 observations were collected to study the polymerization reaction along a reactor. The r = 6 response variables are polymer properties: the number-average molecular weight, the weight-average molecular weight, the frequency of long-chain branching, the frequency of short-chain branching, and the contents of vinyl and vinylidene groups in the polymer chain. The predictors are 20 temperatures measured at equal distances along the reactor, plus the wall temperature of the reactor and the solvent feed rate.
If the multivariate linear model (1) holds for these data, then by extension the envelope model (3) and the SPE model (4) must hold as well. We performed a few diagnostic checks to see whether the data provide clear evidence against model (1), concluding that it fits quite well. With R²'s ranging between 0.946 and 0.997, the regressions of the six individual responses on the 22 predictors all showed strong linear trends. There was no evidence of curvature in plots of the responses versus their fitted values, and only slight evidence of mild curvature when quadratic terms were added. Taking multiple testing into account, we concluded that there is not sufficient evidence to justify remedial action. The eigenvalues of nS X, which range between 84.9 × 10⁻⁶ and 6.9 × 10³, clearly indicate strong multicollinearity among the predictors and thus that PLS and envelope methods may provide better predictions than OLS.
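Checks of this kind can be sketched as below. The helper name is ours, and the data here are synthetic stand-ins of the same dimensions (n = 56, 22 predictors, 6 responses), not the reactor data:

```python
import numpy as np

def regression_diagnostics(X, Y):
    """Per-response OLS R^2 and the eigenvalue range of S_X: a sketch of
    the linearity and collinearity checks described above."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)   # OLS on all predictors
    resid = Yc - Xc @ B
    r2 = 1.0 - resid.var(axis=0) / Yc.var(axis=0)
    S_X = Xc.T @ Xc / (n - 1)                     # sample covariance of X
    eig = np.linalg.eigvalsh(S_X)                 # ascending eigenvalues
    return r2, eig[0], eig[-1]

rng = np.random.default_rng(2)
X = rng.normal(size=(56, 22))
X[:, 1] = X[:, 0] + 1e-4 * rng.normal(size=56)    # induce collinearity
Y = X[:, :2] @ np.ones((2, 6)) + 0.05 * rng.normal(size=(56, 6))
r2, lo, hi = regression_diagnostics(X, Y)
print(r2.round(3), hi / lo)   # high R^2's, huge eigenvalue ratio
```

A large ratio of the largest to the smallest eigenvalue of S X is the multicollinearity signal that motivates PLS and envelope methods here.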
Skagerberg et al. applied PLS after standardizing all variables in X and Y to have sample mean 0 and sample variance 1. We computed predictions based on SIMPLS, SIMPLS with standardized variables (standardized SIMPLS), SPE and SPLS with q = 22, and OLS, obtaining the results displayed in Figure 7.
The prediction performance was measured by the average of the prediction errors from 50 five-fold cross-validations with random splits. For better visibility, we truncated the vertical axis at 6. At u = 1, SIMPLS and standardized SIMPLS have average prediction errors as large as 9.335 and 9.077, and SPLS has an average prediction error of 6.958. SIMPLS has its smallest average prediction error, 1.621, at u = 5, and standardized SIMPLS has its smallest, 1.618, at u = 6. That is about a 45.2% reduction in prediction error compared to OLS, which has an average prediction error of 2.952. The SPE estimator has an average prediction error of 1.555 at u = 2, and its prediction error decreases thereafter as u increases until it reaches its minimum, 1.075, at u = 11. Compared to SIMPLS or standardized SIMPLS, that is a 33.6% reduction in prediction error. We also notice that at u = 1 the SPE estimator performs slightly better than OLS, while SIMPLS and standardized SIMPLS both have very large prediction errors and do not outperform OLS until u = 3 and u = 4, respectively. The SPE estimator seems more stable for small u. The SPLS estimator has its smallest average prediction error, 1.615, at u = 4, which is about the same as the smallest average prediction errors from SIMPLS and standardized SIMPLS, but SPLS achieves it with a smaller u. Though not shown here, we also fitted the envelope model in the predictor space (Cook, Helland, and Su 2013), obtaining a minimum average prediction error of 2.360, which again indicates that properly scaling the predictors can bring substantial efficiency gains.
To gain more insight into the efficiency gains obtained by SPE, we fitted the SPE model that scales only the last two predictors, the wall temperature of the reactor and the solvent feed rate. Recall that in the formulation of the SPE model (4), we allow the scaling parameters to be replicated to accommodate regressions in which we want to scale groups of variables in the same way. In this example, the first 20 predictors are all temperatures around the reactor, and it may be natural to apply the same scale to them. The diagonal elements of Λ are then 1, . . ., 1, λ₂, and λ₃. Under this construction, the SPE estimator has a minimum average prediction error of 1.140 at u = 11, and the prediction performance across all u is quite similar to that of the SPE estimator that scales all the predictors, as indicated in Figure 7. This suggests that the efficiency gain obtained by the SPE estimator is largely due to rescaling the last two predictors, which measure different characteristics from the first 20.
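Grouped scaling of this kind amounts to building Λ from a small set of scale parameters shared within groups. A minimal sketch, with illustrative scale values that are not estimates from these data:

```python
import numpy as np

def group_scaling_matrix(scales, groups):
    """Diagonal Lambda whose jth entry is scales[groups[j]]; replicated
    group labels force shared scales, as described above."""
    return np.diag([scales[g] for g in groups])

scales = [1.0, 0.2, 35.0]        # illustrative values, not estimates
groups = [0] * 20 + [1, 2]       # 20 temperatures share one scale; wall
                                 # temperature and feed rate get their own
Lam = group_scaling_matrix(scales, groups)
print(Lam.shape)                 # (22, 22)
```

Tying the first 20 diagonal entries together reduces the number of free scale parameters from 22 to 3 (and to 2 after fixing the first scale at 1, as in the text).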

DISCUSSION
Prediction in the context of the multivariate linear model (1) has been addressed by many traditional methods, including reduced-rank regression (RRR), PCR, and RR, all of which use information in Σ X. These methods, together with PLS, have been studied and compared in the literature. For example, Frank and Friedman (1993) examined the mechanisms behind PCR, PLS, and RR and compared their performance numerically; Stone and Brooks (1990) incorporated PLS and PCR into a general framework called continuum regression; and Yuan et al. (2007) compared RRR, PLS, PCR, and RR in simulations. However, none of the aforementioned methods is invariant or equivariant under a scale transformation of the predictors, while the SPE model is a scale-invariant method.
The other prediction methods operate from vantage points that are distinctly different from that of envelopes. For instance, RRR offers no gain in univariate regressions, since then the only possible ranks for β ∈ R p×1 are 0 and 1, while envelopes and scaled envelopes can still produce gains. Similarly, RRR offers no gain when β ∈ R p×r has full rank, while again envelopes and scaled envelopes can still give substantial gains. Traditional PCR neglects the response vector in its reduction step, which can result in very inefficient regressions. RR is a regularization method that, depending on how the ridge parameter(s) are determined, can also neglect the response. In contrast, envelope, scaled envelope, and PLS methods capitalize on collinearity rather than attempting to mitigate its effects through regularization.
The discussion in this article is confined to regressions in which n > p. Developing a scale-invariant prediction method such as the SPE model for n < p is an important problem, as many contemporary applications feature small sample sizes.

SUPPLEMENTARY MATERIALS
Supplementary materials for this article are available on the publisher's website.