Improving estimated sufficient summary plots in dimension reduction using minimization criteria based on initial estimates

In this paper we show that estimated sufficient summary plots can be greatly improved when the dimension reduction estimates are adjusted according to minimization of an objective function. The dimension reduction methods primarily considered are ordinary least squares, sliced inverse regression, sliced average variance estimation and principal Hessian directions. Some consideration is also given to minimum average variance estimation. Simulations support the usefulness of the approach, and three data sets are considered with an emphasis on two- and three-dimensional estimated sufficient summary plots.


Introduction
Consider a univariate response $Y \in \mathbb{R}$ and $p$-dimensional regressor $X \in \mathbb{R}^p$. Determining the relationship between $Y$ and $X$ can be difficult when $p > 2$ due to our own inadequacies when it comes to visualizing data in more than three dimensions. However, an increasingly popular approach is to seek lower-dimensional projections of $X$ which sufficiently capture the nature of this relationship. Li (1991) proposed the model

$$Y = f(\beta_1^\top X, \ldots, \beta_K^\top X, \varepsilon) \tag{1}$$

where $\beta_1, \ldots, \beta_K$ are unknown $p$-dimensional vectors of coefficients, $\varepsilon$ is the error term assumed independent of $X$, and $f$ is called the unknown link function. Let $S = \text{span}(\beta_1, \ldots, \beta_K)$. Then (1) can be re-specified in terms of $\gamma_1^\top X, \ldots, \gamma_K^\top X$, $\varepsilon$ and a new link function provided the $\gamma_k$'s form a complete basis for $S$. As such, and when $K < p$, this model allows for dimension reduction by replacing $X$ with $B^\top X$ for $B \in \mathbb{R}^{p \times K}$ when the column space of $B$ is equal to $S$. A more general model is

$$Y \perp\!\!\!\perp X \mid B^\top X \tag{2}$$

(see, for e.g., Cook 1998b), which includes the class of models given in (1).
Regardless of which model is employed, it is useful to ensure that $S$ is well defined in the sense that $K$ is the minimum for which (1) and (2) hold, and to further assume that it is unique. Under these conditions we will then follow the lead of Cook (1996) and refer to $S$ as the Central Dimension Reduction Subspace (CDRS). We also note that $S$ has been termed (though not explicitly under these aforementioned conditions) the effective dimension reduction (e.d.r.) subspace (Li 1991), where any element of $S$ is called an e.d.r. direction. For convenience we will follow this convention and refer to elements of the CDRS $S$ as e.d.r. directions.
Dimension reduction in the sense described here is largely a visualization pursuit. Given a basis for the CDRS, which we will again denote $B$, a plot of $Y$ versus $B^\top X$ can be used to visually determine regression structure. For example, under the model in (1) and $K = 2$, a 3-dimensional plot of $Y$ versus $B^\top X$ can be used to seek $f$. These plots are referred to as Sufficient Summary Plots (SSPs; Cook 1998b) since they encapsulate all sufficient information regarding $Y|X$. Let $\{y_i, x_i\}_{i=1}^n$ denote $n$ sample realizations of $(Y, X)$. Then, given an estimated basis for $S$ denoted $\hat{B}$, a plot of the $y_i$'s versus the $\hat{B}^\top x_i$'s is called an Estimated SSP (ESSP).
Popular and easy to implement methods that may be used to estimate a basis (or at least part of a basis) for the CDRS include Ordinary Least Squares (OLS), Sliced Inverse Regression (SIR; Li 1991), Sliced Average Variance Estimation (SAVE; Cook and Weisberg 1991) and principal Hessian directions (pHd; Li 1992). Robustness concerns with respect to these methods have been raised in the literature (see, for e.g., Gather et al. 2001, 2002; Lue 2001; Prendergast 2007; Prendergast and Smith 2010). Chang and Olive (2007) showed that OLS ESSPs could be improved by removing outliers, and Prendergast (2008) showed that improvements can also be found via the trimming of influential observations even when no gross outliers are evident. Garnham and Prendergast (2013) also showed that improvements to OLS can be made by replacing the response values with their ranks when large response values are present. An advantage of the aforementioned methods is that they are simple to implement in almost any statistics package and can be executed quickly. Consequently, we favor these approaches when seeking reasonable initial estimates for our proposed methods. However, in what follows other methods may also be used to obtain the initial estimates, and some examples using Minimum Average Variance Estimation (MAVE; Xia et al. 2002) and Principal Fitted Components (PFC; Cook 2007; Cook and Forzani 2008) will also be given.
The purpose of this paper is to show how improved ESSPs can be obtained based on minimization criteria and initial estimates. The algorithm is introduced in Sect. 2 and some possible methods for the initial estimates are discussed in Sect. 3. Simulations that highlight the usefulness of our approach are considered in Sect. 4, followed by application to three data sets in Sect. 5. We finish with a discussion, including possible extensions, in Sect. 6 and provide computing details in the "Appendix".

Employing minimization criteria following dimension reduction
For convenience let $F_n$ denote the empirical cumulative distribution function associated with a sample denoted $\{y_i, x_i\}_{i=1}^n$. Furthermore, let $g$ denote a link function associated with the population model for $B^\top X$ such that $E(Y|X) = g(B^\top X)$. We also let $\alpha$ denote a vector of ancillary parameters which fully define $g$. We define our estimates for $\alpha$ and $B$ via

$$(\hat{\alpha}, \hat{B}) = \arg\min_{\alpha, B} S_g(F_n; \alpha, B) \tag{3}$$

where $S_g$ is the objective function to be minimized. Since we are searching for useful ESSPs, defined to be plots of the $y_i$'s versus the $\hat{B}^\top x_i$'s, the estimate of $\alpha$ may not be of interest. Before introducing more general objective functions we will briefly discuss how this relates to OLS under an assumed multiple linear regression (MLR) model. For example, for $K = 1$ with $\beta = B \in \mathbb{R}^p$ and under the MLR model we have $E(Y|X) = g(\beta^\top X) = \alpha_0 + \alpha_1(\beta^\top X)$. Now, for the $i$th observation, the MLR model residual associated with $\alpha$ and $\beta$ is denoted $r_i(\alpha, \beta) = y_i - \alpha - \beta^\top x_i$, so that setting $S_g$ in (3) equal to $\sum_{i=1}^n r_i(\alpha, \beta)^2$ then leads to the least squares estimates. Letting $\hat{\beta}$ denote this OLS estimate for $\beta$, an ESSP can then be obtained by plotting the $y_i$'s versus the $\hat{\beta}^\top x_i$'s.
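For illustration, the OLS-based ESSP for $K = 1$ can be produced with a few lines of R. The following is a minimal sketch using simulated data; the variable names are ours and not part of any package.

```r
## Minimal sketch: OLS estimate of a single e.d.r. direction and its ESSP.
## The data are simulated here purely for illustration.
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, 1, 0, 0, 0)
y <- X %*% beta + (X %*% beta)^2 / 5 + rnorm(n)

fit <- lm(y ~ X)                     # multiple linear regression fit
beta_ols <- coef(fit)[-1]            # drop the intercept: the OLS gradient vector
plot(X %*% beta_ols, y,
     xlab = "OLS dimension reduced predictor", ylab = "y",
     main = "Estimated sufficient summary plot")
```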

An objective function based on polynomial fitting
Suppose that

$$E(Y|X) \approx P_{L,K}(\beta_1^\top X, \ldots, \beta_K^\top X; \alpha) \tag{4}$$

where $P_{L,K}$ denotes a polynomial function of degree $L$ for $K$ arguments and where $\alpha$ is a vector of coefficients for each term in the resulting model, including the intercept. For example, for $K = 2$ and $L = 2$, the assumed model is

$$E(Y|X) \approx \alpha_0 + \alpha_1 \beta_1^\top X + \alpha_2 \beta_2^\top X + \alpha_3 (\beta_1^\top X)^2 + \alpha_4 (\beta_2^\top X)^2 + \alpha_5 (\beta_1^\top X)(\beta_2^\top X).$$

We avoid strict equality between $E(Y|X)$ and $P_{L,K}(\beta_1^\top X, \ldots, \beta_K^\top X; \alpha)$ here since many models can be approximated well by a polynomial for appropriately chosen $(L, K)$ as a consequence of Taylor's theorem. This then allows us to seek good ESSPs rather than strictly committing to a fitted model.
We then denote the $i$th residual associated with the polynomial model in (4) as

$$r_i(\alpha, B) = y_i - P_{L,K}(\beta_1^\top x_i, \ldots, \beta_K^\top x_i; \alpha) \tag{5}$$

and a simple, yet conceptually effective, objective function would again be the sum of a function of the residuals. That is, we can choose estimates according to

$$(\hat{\alpha}, \hat{B}) = \arg\min_{\alpha, B} \sum_{i=1}^n \rho[r_i(\alpha, B)] \tag{6}$$

for $r_i(\alpha, B)$'s defined as in (5) and convex function $\rho$. To minimize the sum of squared residuals, $\rho$ is chosen such that $\rho(r) = r^2$. For this choice of $\rho$, if $B$ were fixed and minimization of (6) were carried out over just $\alpha$ then this would be polynomial least squares estimation for transformed regressor vectors given by $B^\top x_i$ for each $i = 1, \ldots, n$.
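As a minimal sketch for the $K = 1$ case with $\rho(r) = r^2$ (the function and variable names are ours), the objective in (6) can be evaluated in R for a candidate direction $b$ by profiling out $\alpha$ through a polynomial least squares fit:

```r
## Sketch (K = 1, rho(r) = r^2): evaluate the objective in (6) for a
## candidate direction b by profiling out alpha with a polynomial
## least squares fit of degree L in b'x.
obj_poly <- function(b, X, y, L) {
  z <- as.vector(X %*% b)                          # dimension reduced predictor
  fit <- lm(y ~ poly(z, degree = L, raw = TRUE))   # alpha-update for fixed b
  sum(residuals(fit)^2)                            # sum of squared residuals
}
```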
Alternatively, $\rho$ can be chosen such that large residuals are down-weighted. For example, M-estimation (Huber 1964, 1973) could be implemented using the Huber weight function, where

$$\rho(r) = \begin{cases} r^2/2, & |r| \le c \\ c|r| - c^2/2, & |r| > c \end{cases} \tag{7}$$

although other choices for $\rho$ could similarly be made. Due to the large number of model coefficients that can be encountered when either or both of $p$ and $L$ are large, we consider an iterative approach to estimation that utilizes non-linear least squares functionality (non-robust and robust) based on initial estimates of the $\beta_k$'s. Let $y = [y_1, \ldots, y_n]^\top$ be the response vector and $X_n$ be the $n \times p$ matrix of predictor vectors (i.e. whose $i$th row is $x_i^\top$). For $r_i(\alpha, B)$ defined as in (5), the iterative approach to (6) is as follows:

Step 0.1: Estimate $B$ using an appropriate method and denote this as $\hat{B}^{(0)}$.
Step 0.2: For fixed $\hat{B}^{(0)}$, estimate the coefficients of the polynomial least squares regression (or robust equivalent) with degree $L$, response vector $y$ and $n \times K$ predictor matrix $X_n \hat{B}^{(0)}$, and let $\hat{\alpha}^{(0)}$ denote the vector of estimated coefficients. Let $S^{(0)}_\rho$ denote the corresponding sum of the $\rho[r_i(\hat{\alpha}^{(0)}, \hat{B}^{(0)})]$'s.
Step i: While tol.met is FALSE do

Step i.1: With a starting choice of $B$ equal to $\hat{B}^{(i-1)}$ and with fixed $\hat{\alpha}^{(i-1)}$, estimate $\hat{B}^{(i)}$ by minimizing (6) over $B$ via non-linear least squares (or a robust equivalent).

Step i.2: For fixed $\hat{B}^{(i)}$, estimate the coefficients of the polynomial least squares regression (or robust equivalent) with response vector $y$ and $n \times K$ predictor matrix $X_n \hat{B}^{(i)}$, and let $\hat{\alpha}^{(i)}$ denote the vector of estimated coefficients. Let $S^{(i)}_\rho$ denote the corresponding sum of the $\rho[r_j(\hat{\alpha}^{(i)}, \hat{B}^{(i)})]$'s. If $|S^{(i)}_\rho - S^{(i-1)}_\rho| < \text{tol}$ then set tol.met to TRUE.
Step i+1: Return $\hat{B}^{(i)}$ as the estimate of $B$ (and $\hat{\alpha}^{(i)}$ as the estimate of $\alpha$). Details on computations using R (R Core Team 2014) can be found in the "Appendix".
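A minimal non-robust R sketch of the algorithm for $K = 1$ follows, reusing obj_poly from the sketch above. The names are ours; for brevity the $\alpha$-update of Step i.2 is profiled inside the objective, and the general-purpose optimizer optim stands in for the nls call used in our actual computations (see "Appendix").

```r
## Minimal sketch of the iterative scheme (K = 1, rho(r) = r^2).
## b0 is an initial direction estimate (e.g. from OLS, SIR, SAVE or pHd).
improve_direction <- function(b0, X, y, L, tol = 1e-5, max_iter = 100) {
  S_old <- obj_poly(b0, X, y, L)                      # Step 0.2
  b <- b0
  for (it in seq_len(max_iter)) {                     # Step i
    b <- optim(b, obj_poly, X = X, y = y, L = L)$par  # Step i.1: update B
    S_new <- obj_poly(b, X, y, L)                     # Step i.2: update alpha
    if (abs(S_new - S_old) < tol) break               # tolerance check (tol.met)
    S_old <- S_new
  }
  b / sqrt(sum(b^2))                                  # Step i+1: unit-length estimate
}
```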

Minimum average variance estimation
When $\rho(r) = r^2$, our approach seeks the $B$ that minimizes $E\{[Y - E(Y|B^\top X)]^2\}$, where the unknown $E(Y|B^\top X)$ is replaced by the approximating polynomial of degree $L$ in (4). Minimum Average Variance Estimation (MAVE; Xia et al. 2002) seeks such a $B$ by considering residuals $r_{i0} = y_i - [a + b^\top B^\top(x_i - x_0)]$ for $a \in \mathbb{R}$ and $b \in \mathbb{R}^K$, where $a + b^\top B^\top(x_i - x_0)$ is a local linear expansion of $E(y_i \mid B^\top x_i)$ at a given $x_0$. The MAVE estimate is the $B$ that minimizes the weighted sum of squared $r_{ij}$'s ($i = 1, \ldots, n$; $j = 1, \ldots, n$), where the weights are based on the distances between the $x_i$'s and $x_j$'s. We will give further consideration to MAVE later.

An objective function based on spline fitting
The above least squares approach assumed an (at least approximately) underlying polynomial model and subsequently defined the residuals to be those from a polynomial regression fit. Another option is to define the residuals as those coming from a spline fit. That is,

$$r_i(B) = y_i - \text{spline}_K(B^\top x_i) \tag{8}$$

where $\text{spline}_K$ indicates a predetermined type of spline function curve, fitted over all observations, operating on the $K$ dimension reduced predictors. Here we focus on the estimation of $B$ only and hence the objective function in (3) becomes, for $r_i(B)$ defined above in (8),

$$\hat{B} = \arg\min_B \sum_{i=1}^n \rho[r_i(B)]. \tag{9}$$
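A minimal sketch of (9) for $K = 1$, pairing smooth.spline with nlminb as in the "Appendix" (the function and variable names are ours):

```r
## Sketch: spline-based objective in (9) for K = 1, minimized with nlminb.
obj_spline <- function(b, X, y) {
  z <- as.vector(X %*% b)                  # dimension reduced predictor
  fit <- smooth.spline(z, y)               # cubic smoothing spline over b'x
  sum((y - predict(fit, z)$y)^2)           # sum of squared spline residuals
}
## b_init would come from OLS, SIR, SAVE or pHd:
## b_hat <- nlminb(b_init, obj_spline, X = X, y = y)$par
```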

Limitations and recommendations
Below we discuss some potential problems with the approach and provide some recommendations as to how they can likely be avoided.

Failing to find a global minimum
When the number of parameters to be estimated increases, it becomes more likely that a local minimum (rather than a global minimum) of the objective function is detected. That is, whilst new estimates resulting from the minimization procedure result in a decrease in the value of the objective function, there is no guarantee that the smallest possible value of the objective function has been achieved. Hence, whilst an improved ESSP may be detected, care should be taken when a large number of parameters need to be estimated. It is recommended that $L$ be kept small when $p$ is large relative to $n$. Another possibility is to consider a smaller subset of the $p$ predictors, and an example of this will be given in Sect. 5.1.

Overfitting
Overfitting may occur, resulting in an ESSP that implies meaningful dimension reduction directions have been found when they have not. This problem can be exacerbated by the fact that the ESSP is a plot based on a low dimensional projection of the data, in which case it may be easy to forget that many parameters have been estimated to produce the view. The same advice that was provided for the first problem above is also valid here. Additionally, a cross-validation approach for the choice of $K$ and $L$, which will not be influenced by over-fitting, is recommended. Details are provided after the simulations in Sect. 4.

Basis constraints
When $K > 1$, an identifiable basis for $S$ may be preferred. However, since the model in (1) is completely unknown, methods such as SIR, SAVE and pHd (to be discussed in the next section) simply return a basis for $S$ rather than a specific identifiable basis. Nevertheless, the algorithms for these approaches usually result in some constraints on the estimated basis. For example, for some of the methods that we discuss in the next section, the dimension reduced predictors are uncorrelated with one another. It can be argued that these constraints are not necessary when we are simply targeting the best view of the dimension reduced data. However, if uncorrelated dimension reduced predictors are required then there is an obvious possibility, which we discuss below.
In the simulations provided later, we show that the minimization approach can be used to find an improved basis estimate for $S$. However, without constraints the dimension reduced predictors will not be uncorrelated (unless the starting estimate already achieved the minimization). Let $\hat{\Sigma}_x$ be the sample covariance of the predictors and $\hat{B}$ the estimated basis for $S$. Let $\hat{B}_z = \hat{\Sigma}_x^{1/2} \hat{B}$, where we use the subscript $z$ since this is an estimated dimension reduction basis on the standardized predictor scale. Now let $\hat{B}^*_z$ be the Gram-Schmidt orthonormalization of $\hat{B}_z$ such that $(\hat{B}^*_z)^\top \hat{B}^*_z = I_K$. Then $\hat{B}^* = \hat{\Sigma}_x^{-1/2} \hat{B}^*_z$ is an estimated basis for $S$ such that the columns of $X_n \hat{B}^*$ (which are the dimension reduced predictors) are mutually uncorrelated with unit variance. If it is preferred that the columns of $\hat{B}^*$ have unit length, but where the dimension reduced predictors are still uncorrelated, then each column of $\hat{B}^*$ simply needs to be scaled with respect to its length.
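A short sketch of this post-estimation transformation in R (the function name is ours; qr.Q performs the Gram-Schmidt step):

```r
## Sketch: transform an estimated basis B so that the dimension reduced
## predictors are mutually uncorrelated with unit variance.
orthonormalize_basis <- function(B, X) {
  eS   <- eigen(cov(X), symmetric = TRUE)
  Sx_h <- eS$vectors %*% diag(sqrt(eS$values)) %*% t(eS$vectors)  # Sigma_x^{1/2}
  Bz   <- Sx_h %*% B                    # basis on the standardized predictor scale
  Bz_s <- qr.Q(qr(Bz))                  # Gram-Schmidt orthonormalization of Bz
  solve(Sx_h, Bz_s)                     # back-transform: Sigma_x^{-1/2} Bz*
}
```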

Some simple dimension reduction methods for the initial estimates
We focus our attention mainly on the following methods because they are simple to implement in almost any package and previous studies have shown that they are usually capable of efficiently finding good estimates for the types of models we are considering. However, any dimension reduction method that utilizes linear combinations of the predictor variables can be used to obtain the initial estimates. For example, in Sect. 2.2 we mentioned MAVE, which could also be used to determine the initial estimates. In fact, the SIR method described below is also used to provide the initial MAVE estimates in the functionality that we have used (see "Appendix"). Throughout this section suppose that, for $Y$ and $X$ denoting a univariate response and $p$-dimensional regressor respectively, $\mu_y = E(Y)$, $\mu_x = E(X)$ and $\Sigma_x = \text{Var}(X)$.

Ordinary least squares and pHd
Let $\beta_{\text{ols}}$ denote the OLS gradient vector for the regression of $Y$ on $X$ such that $\beta_{\text{ols}} = \Sigma_x^{-1}\text{Cov}(X, Y)$, and consider the following condition:

Condition 1: $E(X \mid \beta_1^\top X, \ldots, \beta_K^\top X)$ is a linear function of $\beta_1^\top X, \ldots, \beta_K^\top X$.

When $K = 1$ in the model in (1) and Condition 1 holds, we have that (see, for e.g., Theorem 2.1 of Li and Duan 1989) $\beta_{\text{ols}} = c\beta_1$ for some $c \in \mathbb{R}$, therefore providing a complete basis for $S$ when $c \ne 0$.
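In sample terms this is a one-line computation; the sketch below assumes X is the $n \times p$ predictor matrix and y the response vector:

```r
## Sketch: moment-based OLS direction, Sigma_x^{-1} Cov(X, Y) with sample moments.
beta_ols <- solve(cov(X), cov(X, y))
```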
We discuss pHd here because of its strong link with OLS. Let $Z = \Sigma_x^{-1/2}(X - \mu_x)$ denote the standardized predictor. When $X \sim N_p(\mu_x, \Sigma_x)$ (satisfying Condition 1), Li (1992) noted that eigenvectors of $H_z = E[(Y - \mu_y)ZZ^\top]$ result in e.d.r. directions and uncorrelated dimension reduced regressors. Li also noted that, under the aforementioned conditions for $(Y, X)$, it is also true that $H_z = E[R\,ZZ^\top]$, where $R = Y - \mu_y - \beta_{\text{ols}}^\top(X - \mu_x)$ is the population OLS residual for the regression of $Y$ on $X$. Thus there are two commonly employed pHd methods, which we will refer to as pHd$_y$ (for the $Y$-based approach) and pHd$_r$ (for the $R$-based approach), with pHd$_r$ typically preferred since it appears to be more suited to finding elusive linear components (see, for e.g., Cook 1998a) and for many models there is evidence that it has lower estimator variability (Prendergast and Smith 2010).
Given a sample denoted $\{y_i, x_i\}_{i=1}^n$, the use of OLS and pHd is straightforward in practice since moment-based estimates are used. For pHd$_r$, the $y_i$'s are replaced with the usual OLS residuals arising from a multiple linear regression model fit. For more on estimation for pHd see Li (1992).
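A minimal moment-based sketch of pHd$_r$ in R (the function name is ours and the standardization uses sample moments):

```r
## Sketch: residual-based pHd. The y_i's are replaced by OLS residuals,
## Hz is the moment estimate of E[R Z Z'], and its leading eigenvectors
## (by absolute eigenvalue) are back-transformed to the X scale.
phd_r <- function(X, y, K = 1) {
  n  <- nrow(X)
  eS <- eigen(cov(X), symmetric = TRUE)
  Sx_inv_half <- eS$vectors %*% diag(1 / sqrt(eS$values)) %*% t(eS$vectors)
  Z  <- scale(X, center = TRUE, scale = FALSE) %*% Sx_inv_half
  r  <- residuals(lm(y ~ X))                 # OLS residuals
  Hz <- crossprod(Z * r, Z) / n              # sum of r_i * z_i z_i' over n
  eH <- eigen(Hz, symmetric = TRUE)
  keep <- order(abs(eH$values), decreasing = TRUE)[1:K]
  Sx_inv_half %*% eH$vectors[, keep, drop = FALSE]   # e.d.r. direction estimates
}
```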

SIR and SAVE
Let $S_1, \ldots, S_H$ denote $H$ non-overlapping partitions over the range of $Y$, and let $\mu_h = E(X \mid Y \in S_h)$ ($h = 1, \ldots, H$) denote what are referred to as the slice means. Li (1991) introduced SIR, which utilizes the slice means to obtain e.d.r. directions under mild conditions. Specifically, eigenvectors of

$$V = \sum_{h=1}^H p_h\, \Sigma_x^{-1/2}(\mu_h - \mu_x)(\mu_h - \mu_x)^\top \Sigma_x^{-1/2}, \quad \text{where } p_h = P(Y \in S_h),$$

corresponding to non-zero eigenvalues are elements of $\Sigma_x^{1/2} S$ when Condition 1 holds. Again, a re-standardization leads to e.d.r. directions.
In introducing SIR, Li (1991) noted that SIR may sometimes fail to find a complete basis for $S$ due to $V$ being of rank less than $K$, which can occur when, for example, the link function is symmetric around the mean of some of the $\beta_k^\top X$'s. This was also discussed by Cook and Weisberg (1991), who then introduced SAVE to counter this problem. Consider the condition:

Condition 2: $\text{Var}(X \mid \beta_1^\top X, \ldots, \beta_K^\top X)$ is constant.

Under Conditions 1 and 2, Cook and Weisberg (1991) note that the $\Sigma_h = \text{Var}(X \mid Y \in S_h)$ ($h = 1, \ldots, H$) also contain information regarding $\Sigma_x^{1/2} S$. They therefore suggest estimating

$$M = \sum_{h=1}^H p_h \left(I_p - \Sigma_x^{-1/2}\Sigma_h\Sigma_x^{-1/2}\right)^2,$$

whose eigenvectors corresponding to non-zero eigenvalues are elements of $\Sigma_x^{1/2} S$. In practice, implementation of SIR and SAVE is straightforward: given a chosen $H$, the $x_i$'s are allocated to slices according to the magnitudes of the $y_i$'s, where $n_h$ (the number of observations allocated to the $h$th slice) is usually chosen to be approximately equal across slices. Let $\bar{x}_h$ and $\hat{\Sigma}_h$ denote estimates of the $h$th slice mean and covariance respectively. Then estimates for $V$ and $M$ are

$$\hat{V} = \sum_{h=1}^H \frac{n_h}{n}\, \hat{\Sigma}_x^{-1/2}(\bar{x}_h - \bar{x})(\bar{x}_h - \bar{x})^\top \hat{\Sigma}_x^{-1/2} \quad \text{and} \quad \hat{M} = \sum_{h=1}^H \frac{n_h}{n}\left(I_p - \hat{\Sigma}_x^{-1/2}\hat{\Sigma}_h\hat{\Sigma}_x^{-1/2}\right)^2.$$
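For concreteness, a minimal sketch of $\hat{V}$ and the resulting SIR direction estimates is given below ($\hat{M}$ for SAVE follows the same slicing pattern); in our actual computations the dr package was used instead (see "Appendix"). The function name and slicing rule here are ours.

```r
## Sketch: slice-based SIR estimate with H slices of near-equal size.
sir_directions <- function(X, y, H = 8, K = 1) {
  n  <- nrow(X)
  eS <- eigen(cov(X), symmetric = TRUE)
  Sx_inv_half <- eS$vectors %*% diag(1 / sqrt(eS$values)) %*% t(eS$vectors)
  Z  <- scale(X, center = TRUE, scale = FALSE) %*% Sx_inv_half  # standardized X
  slice <- cut(rank(y, ties.method = "first"), H, labels = FALSE)
  V  <- matrix(0, ncol(X), ncol(X))
  for (h in 1:H) {                      # weighted outer products of slice means
    zh <- colMeans(Z[slice == h, , drop = FALSE])
    V  <- V + (sum(slice == h) / n) * tcrossprod(zh)
  }
  # eigenvectors of V lie in Sigma_x^{1/2} S; re-standardize to the X scale
  Sx_inv_half %*% eigen(V, symmetric = TRUE)$vectors[, 1:K, drop = FALSE]
}
```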

Simulations
In this section we report the results from various simulation studies. We start by evaluating the performance of the estimators for varying choices of $K$ and $L$. We finish with a discussion of using cross-validation techniques to choose a suitable pair $(K, L)$.

Estimating the directions
For each of Models 1, 2 and 3 defined in this section, and for the contamination scheme for Model 3, let $X \sim N_p(0, I_p)$ (with $p = 10$ for Models 1 and 3, and $p = 15$ for Model 2) and $\varepsilon \sim N(0, 1)$. We will begin by looking at Model 1, which incorporates a second order polynomial link function.
Model 1: Summary results associated with 1000 trials of data simulated from Model 1 are shown in Table 1 for four choices of $n$ (50, 100, 250 and 500). Reported are the mean (standard deviation in parentheses) of $\text{cor}(X_n\beta, X_n\hat{\beta})^2$, where $\hat{\beta}$ is the estimated $\beta$. The methods considered are OLS, SIR, SAVE, residuals-based pHd (pHd$_r$) and MAVE. We also consider each of these approaches followed by polynomial residual minimization with $L = 2, 3, 4$ according to the iterative approach described in Sect. 2.1 with $\rho(R) = R^2$ (identified in the table with a postscript -P to the initial method). Initial OLS, SIR, SAVE and MAVE results are poor for $n = 50$, indicating that these methods typically fail to return a good estimate of the direction of $\beta$. OLS and SIR both perform poorly when the link function is symmetric around $\beta^\top\mu_x$. Whilst this is not strictly the case here, the quadratic term in Model 1 is not useful to either OLS or SIR and may swamp the detection of $\beta$ in the linear component of the link function. SIR still returns poor results for $n = 250$, whereas SAVE performs very well. pHd$_r$ performs comparatively well even for small $n = 50$. While MAVE did not perform well initially, recall that SIR was used for its initial estimates. Given that the SIR initial estimates were typically poor, it is possible that the MAVE approach consequently suffered. This was confirmed when we replaced the SIR initial estimate with the pHd$_r$ estimate and the MAVE results (identified by MAVE*) improved remarkably.
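The performance measure reported throughout the tables is straightforward to compute; the sketch below assumes Xn ($n \times p$), the true beta and an estimate beta_hat are available in the workspace:

```r
## Sketch: squared sample correlation between the true and estimated
## dimension reduced predictors, cor(Xn beta, Xn beta_hat)^2.
perf <- cor(Xn %*% beta, Xn %*% beta_hat)^2
```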
Increases in the average $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ are witnessed in the table when polynomial minimization follows the initial estimation, although only a very small gain was achieved for MAVE* when $L = 2$. However, timed on a single run, it took MAVE* 47.44 s to obtain the initial estimates compared to just 0.03 s for the initial pHd$_r$ estimate plus 0.15 s for the improvements based on the algorithm. Hence the total time to obtain the comparable results based on an initial pHd$_r$ estimate was just 0.18 s, compared to over 47 s for MAVE. For the other methods, improvements were particularly pronounced for smaller $n$ and where $L = 2$. Since Model 1 is a quadratic polynomial, $L = 2$ is the suitable choice. For $L = 3$ and $L = 4$, and in particular for smaller $n$, the results provide improvements over the initial estimation although they are on average the same or worse than when $L = 2$.

Now consider a model with $K = 2$ directions and increase the dimension to $p = 15$.

Model 2: To assess the estimation methods for Model 2, we consider the first two canonical correlations ($r_1$ and $r_2$) comparing $X_n B$ and $X_n \hat{B}$, where $B = [\beta_1, \beta_2]$ and $\hat{B}$ is the estimated basis. The results are summarized for 500 trials and $n = 250$ in Table 2. SIR, SAVE, pHd$_r$ and MAVE perform at least moderately well in estimating the first direction, with SAVE and MAVE being the best performers. However, on average the methods perform poorly in estimating the second direction. SIR is expected to have trouble detecting the $\beta_2$ component due to symmetry of $(\beta_2^\top X)^2$ about the mean of $\beta_2^\top X$ (see, for e.g., Cook and Weisberg 1991). However, big improvements for both correlations are witnessed for the polynomial minimization approaches, with the reported mean first canonical correlations all close to 1.00. Similarly, the mean second canonical correlations are mostly at least 0.9. Of particular note, however, is that the median first and second canonical correlations for each method followed by polynomial minimization are all approximately equal to one, indicating typically excellent estimation of a basis for $S$. This example also highlights that MAVE initial estimates can also be improved.
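For the $K = 2$ comparisons, the canonical correlations can be obtained with the cancor function (a sketch, with B and B_hat the true and estimated $p \times 2$ bases assumed available):

```r
## Sketch: first two canonical correlations between X_n B and X_n B_hat.
cc <- cancor(Xn %*% B, Xn %*% B_hat)$cor   # cc[1] = r1, cc[2] = r2
```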
For the third model we consider a trigonometric link function and restrict our attention to the use of OLS, SIR, SAVE and pHd$_r$ to obtain initial estimates.
Summary results for 1000 simulated trials of Model 3 are provided in Table 3. The results reported for the polynomial minimization are for $L = 5$. For all choices of $n$, OLS performs at least reasonably well, with a minimum average $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ of 0.82 reported when $n = 50$. As expected, results improve for larger sample sizes and decreases in the standard deviation of the $\text{cor}(X_n\beta, X_n\hat{\beta})^2$'s are observed. Whilst the initial OLS results are good, the polynomial minimization provides an obvious improvement. For example, for $n = 50$ the mean $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ has increased to 0.96. For $n = 50$, the initial SIR, SAVE and pHd estimates are typically poor. However, for $n = 100$ SIR performs well, and it is typically an excellent estimator for $n = 250$ (means reported as 0.83 and 0.96 respectively). Again, polynomial minimization provides improved results. For SAVE, results are reasonable only for $n = 250$. However, despite the poor initial estimates, polynomial minimization provides very good to excellent results for $n \ge 100$. For $n = 100$ the mean is reported as 0.75 for SAVE-P, although the high median of approximately 1 suggests that a typical result is excellent; the large standard deviation of the $\text{cor}(X_n\beta, X_n\hat{\beta})^2$'s also suggests that, in some cases, poor final results are still possible. For pHd$_r$, increasing $n$ does little to improve the initial estimates (for $n = 500$ the reported mean and median are still only 0.45 and 0.46 respectively). This is expected since it is possible to show that $H_z = 0$ for this model so that, on average, pHd$_r$ is not expected to find information regarding $S$. However, on average the polynomial minimization provides very good results even for $n = 50$ (mean $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ of 0.85), and a typical result is excellent for all considered $n$, with median $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ approximately equal to one.
In Table 3 we also provide results for the spline approach (indicated by the -SP suffix after OLS etc.) described in Sect. 2.3. However, results are not provided for $n = 50$ due to regularly encountered computational errors. In many cases it can be seen that, for $n \ge 100$, the spline approach provides results comparable to those of the polynomial fitting. However, in other cases the results are, on average, worse. For example, for SAVE and $n = 250$, the polynomial fitting provides superior results.
We now consider the results associated with robust polynomial minimization for Model 3 when contamination that typically results in poor initial estimates is present. The contamination scheme is described below.
Contamination Scheme for Model 3: For $n$ observations randomly generated according to Model 3, we contaminate the data by further introducing $0.2 \times n$ contaminated observations generated according to (12). Results associated with OLS and SIR are provided in Table 4 where, for simplicity, we chose to include OLS and SIR since they were the better-performing estimators for the uncontaminated simulation results reported in Table 3. We report means, standard deviations and medians for the $\text{cor}(X_n\beta, X_n\hat{\beta})^2$'s associated again with 1000 trials and four choices of $n$. Considered were OLS and SIR as well as further polynomial residual minimization as described in Sect. 2.1 with $\rho(R) = R^2$ (-P identifying this approach) and also with $\rho$ as in (7) with $c = 1.345$, which is the default choice in R (-P(ROB) identifying this approach). For $n = 60$ (50 observations simulated from Model 3 with an additional 10 contamination points added), OLS and SIR are clearly adversely affected by the introduction of contamination, with reported means of 0.65 and 0.57 respectively. Whilst polynomial minimization with $\rho(R) = R^2$ provides somewhat improved results, it is the robust equivalents with Huber weighting that result in big improvements, with reported mean $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ of 0.86 and 0.80 for OLS and SIR respectively. While OLS and SIR improve with increasing $n$, we still find improvements resulting from minimization, in particular with respect to the robust approaches. For example, for $n = 120$ and SIR, initial results were good (mean 0.82), improvements were observed with non-robust minimization (mean 0.91) and consistently excellent results were found for the robust approach (mean 1.00 and standard deviation of 0.05).

Fig. 1 Typical OLS ESSPs for one trial of the simulation results summarized in Table 1 with $n = 100$. OLS (a) refers to standard OLS estimation. OLS-P and OLS-P(ROB) (b, c) refer to non-robust and robust additional polynomial minimization as described in Sect. 2.1. The $\text{cor}(X_n\beta, X_n\hat{\beta})^2$ for each plot is 0.81, 0.93 and 0.995 respectively, which are approximately equal to the medians reported in Table 1
It is also interesting to note that the introduced contamination typically does not result in ESSPs with obvious problematic points when, say, $n = 120$. For example, in Fig. 1 we provide three ESSPs associated with the OLS-based estimators from Table 4. These ESSPs are for the same simulated data, and this particular simulated data set was chosen since the resulting $\text{cor}(X_n\beta, X_n\hat{\beta})^2$'s were close to the associated medians reported in Table 3. The first is the ESSP resulting from standard OLS estimation. For this ESSP it is difficult to appreciate anything other than a mild association between the response and dimension reduced predictors. Also, aside from one observation clearly not close to the remainder of the data (the point near the top left corner of the plot), other points are not so distinguishable as to clearly identify them as non-typical. The ESSP resulting from the non-robust estimation (Plot B) now exhibits some signs of the sine structure defined in Model 3. However, the ESSP resulting from robust minimization clearly shows the sine structure well and, additionally, outlying observations are clearly identifiable as those points that are not tightly associated with this structure.

Choosing K and L
So far we have not discussed how to choose $K$ and $L$. One possibility is to visually inspect the views provided by a range of choices for $K$ and $L$ and choose the combination that provides the best view describing a relationship between the response and dimension reduced predictors. However, the danger of this approach is that one may find a trait of the specific data set rather than a good representation of the targeted population-level relationship. Another option is to use a $\mathcal{K}$-fold cross-validation approach, where we write $\mathcal{K}$ (while $K$-fold is the common terminology) to avoid a notational conflict with $K$, which is commonly used to denote the number of $\beta_k$'s in the model. We describe the process below.
To carry out the $\mathcal{K}$-fold validation process, the $n$ observations are randomly partitioned into $\mathcal{K}$ subgroups of approximately equal size. In turn, each of the groups is nominated as the test data while the remaining $\mathcal{K} - 1$ groups form the training set. When the $k$th group is the test set, for a selected $K$ and $L$ we start by estimating $\widehat{E}(Y|X)_{(k)}$, which is the estimated model without the data from the test set (the $k$th group), using an estimation approach described in Sect. 2. Let $(y_{ki}, x_{ki})$ index an observation from the test data and also let $n_k$ denote the number of observations in this test data. The estimated model is then used to obtain predicted responses, denoted $\hat{y}_{ki}$, which leads to an estimate for the model error of

$$\frac{1}{n_k}\sum_{i=1}^{n_k}(y_{ki} - \hat{y}_{ki})^2.$$

This is repeated for all $\mathcal{K}$ subgroups, leading to an estimate of the model error for the choice of $K$ and $L$, denoted $\widehat{E[Y - E(Y|X)]^2}_{K,L}$, given by the average of these $\mathcal{K}$ fold-wise errors. For $\widehat{E[Y - E(Y|X)]^2}_{K,L}$'s calculated over a range of $K$ and $L$, we can choose a combination that returns a small estimated error. Our simulated explorations have revealed that when one of the initial estimated directions is not close to the corresponding true direction, there is a chance that one of the estimations in the cross-validation may fail, leading to a large overestimation of the error variance. While another possibility is to use the median of the $\mathcal{K}$ estimated variances, we have decided to instead use a measure similar to the coefficient of determination in multiple linear regression. Consider two equal-length vectors of numeric values, denoted $u_1$ and $u_2$, and let

$$r_c = \frac{2 s_{12}}{s_1^2 + s_2^2 + (\bar{u}_1 - \bar{u}_2)^2}$$

denote the concordance correlation coefficient, where $s_{12}$ is the sample covariance between the values in $u_1$ and $u_2$, $s_i^2$ the sample variance of the values in $u_i$ ($i = 1, 2$) and $\bar{u}_i$ the sample mean of the values in $u_i$ ($i = 1, 2$) (Lin 1989). This is a correlation measure for the agreement of the values in $u_1$ and $u_2$ (i.e. a value close to one tells us that the values are very close to one another, not just in a strong linear relationship). Let $y_k$ be the test data from the $k$th fold and $\hat{y}_k$ be the associated fits based on the training data. Then we report

$$\bar{r}_c = \frac{1}{\mathcal{K}}\sum_{k=1}^{\mathcal{K}} r_c(y_k, \hat{y}_k) \tag{15}$$

which, for some models, is not as influenced by one bad estimate of the basis as is the mean of the estimated error variances. Since it does not have a polynomial link function, meaning that the choice of $K$ and $L$ is important in order to achieve a good approximation to the fit, we begin by considering Model 3 (a trigonometric link function with $K = 1$).
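A minimal sketch of $r_c$ and of (15) in R (the function and list names are ours):

```r
## Sketch: Lin's (1989) concordance correlation coefficient, computed
## from the sample moments of two equal-length numeric vectors.
concordance <- function(u1, u2) {
  2 * cov(u1, u2) / (var(u1) + var(u2) + (mean(u1) - mean(u2))^2)
}
## Average over folds as in (15), with y_test[[k]] the test responses and
## y_fit[[k]] the corresponding fits (both lists are assumptions here):
## r_c_bar <- mean(mapply(concordance, y_test, y_fit))
```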
In Table 5 we focus our attention on polynomial minimization using SIR initial estimates for a single simulated data set from Model 3. For both $n = 100$ and $n = 250$, $L = 5$ appears to be a good choice, with high values of (15) reported. However, good results are also achieved for $L = 3$, meaning that it would be worthwhile considering these estimates if one is also after a simple fitted model. Overall, the polynomial minimization does seem to provide a good fit to the data despite the model not being polynomial. We simulated the cross-validation 100 times to assess the process for deciding on a suitable combination of $K$ and $L$. For both $K = 1$ and $K = 2$, $L = 2$ was too small, with average correlations of 0.559 and 0.556 respectively. The best choices were for $K = 1$ with $L = 5$ and $L = 3$ (0.982 and 0.952), while another possible good model is for $K = 2$ and $L = 3$ (0.91). The good cross-validation results for $L = 5$ and $K = 1$ mirror the good estimates for this model that can be seen in Table 3.
We now also consider the cross-validation approach for a $K = 2$ model, returning to Model 2. In Table 6 we summarize results for 100 simulated data sets, focusing on $L = 2, 3, 4$.
Table 6 Average values of $r_c$ (standard deviations in parentheses) for 100 iterations of fivefold cross-validation for data simulated ($n = 250$) from Model 2, considering three initial estimation methods followed by further polynomial minimization: SIR-P, SAVE-P and pHd$_r$-P.

Mean cross-validated concordance correlations for $\mathcal{K} = 5$ are provided with standard deviations in parentheses. On average, the best results for each method are associated with $L = 3$ and $K = 2$ (the correct choices based on the true model), with SAVE-P (SAVE initial estimates followed by polynomial minimization) the best method overall. SAVE-P also performs well with larger choices of $L$ than is necessary. Another option is to test for a suitable $K$ using, for example, large sample tests for SIR as described by Li (1991) or bootstrap approaches such as those provided by Liquet and Saracco (2012). This would be the preferred option when the sample size is small, in which case the cross-validation procedure may not perform well.

Examples
In this section we look at three examples. For the first two we explore estimated directions where OLS and/or SIR seemed to be the most effective estimators. We then consider an example based on initial estimates from another method.

Example: Arsenic levels in soil
The data for this example was provided by Hayley Castlehouse and was analyzed as part of her PhD dissertation (Castlehouse 2008); it was considered with respect to dimension reduction by Shaker and Prendergast (2011). The data consist of soil composition measurements for 35 samples (after samples with missing values were removed) taken from a site in North Lincolnshire, England, which is known to have elevated levels of arsenic. For the purpose of this example we will let the response be iron-oxide bound arsenic (As) in mg kg$^{-1}$. There are 14 predictors, being the levels of $\mathrm{PO_4^{3-}}$, $\mathrm{Mg^{2+}}$, $\mathrm{Cl^-}$, $\mathrm{NO_3^-}$, $\mathrm{NH_4^+}$, $\mathrm{SO_4^{2-}}$, $\mathrm{Ca^{2+}}$, $\mathrm{K^+}$, $\mathrm{F^-}$, $\mathrm{Na^+}$, TIC, TOC, Fe and Mn. Before considering ESSPs based on two estimated directions, we begin by looking at the two-dimensional ESSP based on the OLS estimated direction (shown in Plot A of Fig. 2). An obvious linear relationship is evident between the response (arsenic level) and the dimension reduced predictor vectors. The fitted polynomial of degree two is included, although over the domain of the dimension reduced predictors the fit is approximately linear. In order to choose an appropriate $L$ for improving the ESSP based on minimization of residuals for a polynomial fit, we used the cross-validation approach from Sect. 4.2 based on (15). Since there are 35 values, we used $\mathcal{K} = 7$ with an equal number, 5, of observations randomly assigned to each fold.
We considered two methods. The first was based on the initial OLS fit, where the improved estimate was sought by minimizing the sum of squared residuals after fitting polynomials of degree $L = 2, 3$ and 4. For the second we used the M-estimator objective function from (7), again for $L = 2, 3$ and 4. The average concordance correlations for the non-robust approach were 0.63, 0.44 and 0.30, and for the robust approach were 0.66, 0.62 and 0.52. We therefore use the robust approach with $L = 2$. With the exception of one outlier, the improved ESSP in Plot B highlights a clear quadratic relationship between the response and final dimension reduced predictor. The final polynomial provides an excellent fit to the data with the exception of the one outlier. This outlier can also affect the cross-validation values.
Looking back at the robust fit correlations over all seven folds, for three of the folds the correlations were 0.9 or higher and the average without the smallest (0.31) was 0.73.
Given that there are 14 initial predictors, to avoid over-fitting in the $K = 2$ setting we applied SIR to a smaller number of predictors (four predictors chosen via pair-wise scatterplots between each predictor and the response). For these chosen predictors we ran the cross-validation over both $K = 1, 2$ and $L = 2, 3, 4$ based on SIR initial estimates and minimizing over the sum of squared residuals. The clear standout was $K = 2$ and $L = 2$, where the average concordance correlation was 0.75 and all others were below 0.52. Looking at the results from the seven folds for $K = 2$, $L = 2$, one result was 0.52 and the average without this fold was 0.79. These results suggest an excellent choice of model, especially given the small sample size of 35. In Plots C and D of Fig. 2, we provide the $K = 2$ estimated SIR ESSP and the response versus the fits resulting from the polynomial degree two fit, respectively. The linear relationship between the fits and the response indicates that the choice of degree for the polynomial is appropriate, although the estimated model does not fit all of the points well. However, after application of the algorithm in Sect. 2.1 and minimizing over the sum of squared residuals, an improved fit is found (see Plots E and F) with no large and obvious outliers. Whilst the $K = 1$ ESSP in Plot B provides a good fit to all but one observation, the $K = 2$ ESSP from Plot F has some merit since it is based on just four predictors (ten fewer than required for Plot B) and fits all observations well.

Example: Bridge construction
For this example we consider a bridge construction data set from Tryfos (1998), which has also been recently considered by Sheather (2009). The response is Time (days taken to design the bridge) and there are five predictors: deck area of the bridge, construction cost, number of structural drawings, length and number of spans. In total there are 45 observations. In the previous example, the cross-validation was reasonably consistent across folds and re-running the process provided very similar results. For this example this is not the case, and several possibilities are suggested when considering SIR. The largest average correlation was for $L = 2$ and $K = 1$ at approximately 0.65, where the resulting ESSP (not shown) provided a reasonable, mostly linear, view which was very similar to the initial views for $K = 1$. On the other hand, average correlations for $K = 2$ and $L = 2, 3$ and 4 were lower, although an inspection of the folds suggested potential for some excellent fitted models with $K = 2$. For example, the largest correlation achieved for a fold was 0.97 with $L = 3$ and $K = 2$, and two other folds had values over 0.8. Conversely, two of the folds had close to zero correlation. For one of these folds an inspection of the ESSPs revealed a large outlier in the test group, and for the other a very poor view of the training data was the result. On the entire data set it was this combination of $K$ and $L$ that provided the best ESSP and fitted model.
In Plot A of Fig. 3 we provide the $K = 2$ ESSP following SIR estimation, which includes the fitted polynomial equation of degree 3. In Plot B we depict the observed response values versus the fitted responses obtained from the polynomial fit shown in Plot A. The fits are linearly associated with the responses, and SIR has done a reasonable job of finding a relationship between design time and the predictors. However, in Plots C and D we show that big improvements can be found when re-estimating the directions using the algorithm described in Sect. 2.1 with a tolerance level of $10^{-5}$. Here we have chosen to use the M-estimation objective function from (7), the result of which is one noticeable outlier that is evident in both the ESSP (Plot C) and the response versus fits (Plot D). The remaining observations fit the final degree three polynomial very well. There were no constraints placed on the final estimated basis, although it would have been possible to carry out some post-estimation orthonormalization as discussed in Sect. 2.4.3 and re-fit the polynomial based on the new basis. For example, Gram-Schmidt orthonormalization on $\hat{\Sigma}_x^{1/2}\hat{B}$ followed by a re-standardization with respect to $\hat{\Sigma}_x^{-1/2}$ would result in a new estimated basis where the dimension reduced predictors are uncorrelated with one another and have unit variance. This was also carried out, resulting in an almost identical ESSP and response versus fits, and the plots are therefore not shown here.

Example: BigMac data
As mentioned earlier, one is not limited to using the dimension reduction methods considered in Sect. 3. In fact, on inspecting ESSPs obtained from any linear dimension reduction method, if the relationship between the response and dimension reduced predictor looks as though it may be approximated by a polynomial function, then our proposed approach may result in an improved ESSP. For example, consider the BigMac data from Enz (1991). Adragni and Raim (2014) used a method called Principal Fitted Components (PFC; Cook 2007; Cook and Forzani 2008) to find a useful ESSP between the response (the minimum labor needed to purchase a Big Mac burger with fries in 45 cities) and 9 socio-economic predictor variables. The underlying relationship looks as though it could be adequately approximated by a degree three or four polynomial, and we therefore use their initial estimated direction to seek an improved ESSP.
We first ran fivefold cross-validation over $K = 1, 2$ and $L = 2, 3$ and 4. The clear choice was $K = 1$ and $L = 3$, with an average concordance correlation of over 0.68. We therefore use $K = 1$ and $L = 3$ as the parameter choices when seeking an improved ESSP.
In Plot A of Fig. 4 we provide the ESSP resulting from the first estimated PFC direction. This view was also provided by Adragni and Raim (2014). One characteristic of the ESSP is that there seems to be one outlying observation (that with the largest response) that does not conform well to the underlying relationship between the response and dimension reduced predictors. There are two possibilities that may arise when seeking an improved ESSP. Firstly, this outlier may be adversely affecting the estimation, so that use of the M-estimator objective function from (7) may find an estimated basis that provides improved fits for the remaining observations. The ESSP resulting from the robust estimation approach is shown in Plot B, and we can see that not only has the ESSP improved in general for the observations that were not highlighted as the outlier, but the outlier itself also appears to conform at least reasonably well with the rest of the data. The second possibility is that non-robust estimation when seeking an improved ESSP may result in an estimated basis for which the ESSP adequately accounts for the outlier. This ESSP is shown in Plot C, and the outlying observation conforms to the relationship illustrated in the ESSP very well. Consequently, the ESSPs in Plots B and C provide improved views of the data, with the Plot C ESSP being preferred since the underlying relationship between the response and dimension reduced predictors is clear for all observations. Additionally, while a degree three polynomial was used to seek the improved estimated basis, one could seek to explain the resulting relationship using something like an exponential model in a final modeling step.

Extensions and discussion
In this paper we have shown how adjustments to initial SIR, SAVE and pHd estimates can result in vastly superior ESSPs. This was achieved through minimization of functions of residuals under an assumed polynomial fit. The polynomial fit was used here for convenience and because this assumed model can approximate many other model types well for a suitably chosen degree. However, it would also be possible to choose other assumed models on the basis of what is observed in the initial ESSP. It should also be noted that by using the polynomial we are not committing to it as the fitted model to be used in a subsequent analysis, but rather using it to find a good ESSP that provides a strong visual association between the response and dimension reduced regressors. Lastly, we also provided an example of improved ESSPs based on what we saw in an initial ESSP resulting from the PFC method. Consequently, other methods can also be used to find initial estimates, and our advice is to be guided by the initial ESSPs as to whether a method provides estimates that may benefit from the strategies employed in this paper.

Appendix: Computing details
In this section we detail the use of R, which was used to compute all results. The analyses in the previous sections were carried out using the freely available statistical software package R version 2.13.1 (R Core Team 2014). The tolerance level for exiting the iterative process was set at tol $= 1 \times 10^{-5}$, although a maximum of 100 iterations was also enforced. Initial SIR, SAVE and pHd estimates were obtained using the dr package (Weisberg 2002). For SIR and SAVE we did not specify the slicing parameter $H$ and instead used the package default of $H = \max(8, p + 3)$. When robust estimates were obtained in Steps 0.2 and i.2, we used the function rlm from the 'MASS' package (Venables and Ripley 2002) with the Huber weight function as shown in (7), where the default choice of $c = 1.345\sigma$ was used with $\sigma$ replaced by a robust estimate of the residual standard deviation (the median absolute deviation estimate). Otherwise, ordinary least squares was utilized. For $\rho(r) = r^2$ (non-robust), in Step i.1 non-linear least squares was carried out using the function nls with the Gauss-Newton algorithm. For the robust equivalent, robust non-linear least squares was employed via the nlrob function from the 'robustbase' package (Rousseeuw et al. 2011), again with the Huber weight function. Polynomial transformation of the $\hat{B}^\top x_i$'s in R is routine using the poly function and setting the optional argument raw to TRUE to ensure that the dimension reduced regressors are transformed such that they are original polynomial terms and not orthogonal terms.
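For instance, the polynomial least squares step (and its robust equivalent) described above takes the following form for $K = 1$; the objects z, y and the degree here are illustrative:

```r
## Sketch: raw polynomial fitting of the dimension reduced predictor,
## as used in Steps 0.2 and i.2 (non-robust and robust versions).
z <- as.vector(X %*% B_hat)                       # K = 1 dimension reduced predictor
fit <- lm(y ~ poly(z, degree = 3, raw = TRUE))    # ordinary least squares
library(MASS)
rfit <- rlm(y ~ poly(z, degree = 3, raw = TRUE))  # Huber M-estimation (k = 1.345 default)
```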
With respect to the spline fitting in Sect. 2.3, for the minimization we used the nlminb function available through the R stats package (available as standard in the R base distribution), whose arguments were the initial estimates based on either OLS, SIR, SAVE or pHd and the objective function to be minimized. A cubic smoothing spline was fitted using the R function smooth.spline.
For MAVE, we used functionality provided in support of work from Li et al. (2010) found at http://www4.stat.ncsu.edu/~li/software/GroupDR.R. This functionality uses SIR to initialize the estimates, and we used the default tolerance and kernel settings. In the final example where the PFC method was employed, we used the R package ldr (Adragni and Raim 2014), which includes functionality for several likelihood-based dimension reduction methods. We used the same arguments in the call to the PFC functionality for this example as were found in the paper.

Fig. 2
Fig. 2 Estimated ESSPs for the arsenic levels in soil example. OLS was used to obtain the $K = 1$ ESSP in (a) and SIR was used for the $K = 2$ ESSP in (c), which includes the $L = 2$ degree fitted polynomial. An improved $K = 1$ ESSP using the Huber objective function from (7) and the iterative algorithm from Sect. 2.1 is shown in (b). For the $K = 2$ case and using just four of the 14 predictors, an improved ESSP is shown in (e) and the responses versus fits are depicted in (f). For these plots minimization was carried out over the sum of squared residuals

Fig. 3
Fig. 3 Estimated ESSPs for the bridge example. SIR was used to obtain the $K = 2$ ESSP in (a), which includes the $L = 3$ degree fitted polynomial and where the responses versus subsequent fits are shown in (b). An improved ESSP using the Huber objective function from (7) and the iterative algorithm from Sect. 2.1 is shown in (c) and the corresponding responses versus fits are depicted in (d)

Fig. 4
Fig. 4 Estimated ESSPs for the BigMac example using PFC for the initial estimates. The initial PFC ESSP is shown in (a), and improved ESSPs using the robust Huber objective function from (7) and the non-robust sum of squared residuals objective function are provided in (b) and (c) respectively

Table 1
Results for 500 trials of $n$ observations simulated from Model 1 using OLS, SIR, SAVE, residuals-based pHd (pHd$_r$) and MAVE. MAVE* is MAVE where pHd$_r$ has been used for the initial estimates; -P denotes initial estimates followed by further polynomial minimization described in Sect. 2.1 with $\rho(R) = R^2$, with the degree of polynomial set to $L = 2$, $L = 3$ or $L = 4$. Reported are the mean (standard deviation in parentheses) $\text{cor}(X_n\beta, X_n\hat{\beta})^2$, where $\hat{\beta}$ is the estimated $\beta$

Table 2
Results for 500 trials of 250 observations simulated from Model 2 using SIR, SAVE, residuals-based pHd (pHd$_r$) and MAVE; -P denotes initial estimates followed by further polynomial minimization (where $L = 3$ was used) described in Sect. 2.1 with $\rho(R) = R^2$. Reported are the mean (standard deviation in parentheses) and median of the first and second canonical correlations (labelled $r_1$ and $r_2$)

Table 3
Results for 1000 trials of $n$ observations simulated from Model 3 using OLS, SIR, SAVE and residuals-based pHd (pHd$_r$); -P denotes initial estimates followed by further polynomial minimization (where $L = 5$ was used) described in Sect. 2.1 with $\rho(R) = R^2$, and -SP denotes the spline fitting approach in Sect. 2.3. Reported are the mean (standard deviation in parentheses) and median $\text{cor}(X_n\beta, X_n\hat{\beta})^2$, where $\hat{\beta}$ is the estimated $\beta$

Table 4
Results for 1000 trials of $n$ observations simulated from Model 3 with the contamination scheme detailed in (12). OLS and SIR refer to standard least squares and sliced inverse regression estimation; OLS-P and SIR-P are the initial estimates followed by further polynomial minimization described in Sect. 2.1, and OLS-P(ROB) and SIR-P(ROB) are the robust versions with Huber weighting. Reported are the mean (standard deviation in parentheses) and median $\text{cor}(X_n\beta, X_n\hat{\beta})^2$, where $\hat{\beta}$ is the estimated $\beta$

Table 5
Values of $r_c$ based on cross-validation for a single data set simulated from Model 3 ($n = 100$ and $n = 250$), using SIR initial estimates followed by polynomial minimization