An Outer-Product-of-Gradient Approach to Dimension Reduction and its Application to Classification in High Dimensional Space

Abstract Sufficient dimension reduction (SDR) has progressed steadily. However, its ability to improve general function estimation or classification has not been well established, especially for high-dimensional data. In this article, we first devise a local linear smoother for high-dimensional nonparametric regression and then utilize it in the outer-product-of-gradients (OPG) approach to SDR. We call the method high-dimensional OPG (HOPG). To apply SDR to classification of high-dimensional data, we propose an ensemble classifier that aggregates the results of classifiers built on subspaces reduced consecutively by random projection and HOPG. Asymptotic results for both HOPG and the classifier are established. Superior performance over existing methods is demonstrated in simulations and real data analyses. Supplementary materials for this article are available online.


Introduction
Consider a response Y ∈ R and covariates X = (X^[1], ..., X^[p]) ∈ R^p. Without any assumptions on the relationship between X and Y, the regression function m(x) = E(Y | X = x) is usually our interest. However, as the number of covariates increases (i.e., the dimension of X grows), it is hard to estimate m(x) efficiently. Fortunately, when the dimension of X is high, the covariates are usually not equally important in the regression, and they may also contribute to the regression through a few linear combinations. Methods for the former include sure independence screening and variable selection; see, for example, Fan and Lv (2008), Li, Zhong, and Zhu (2012), and Lin and Zhang (2006). For the latter, PCA (Hotelling 1933) is the most popular method, and extensions of it have been made (Gorban et al. 2008; Lu, Plataniotis, and Venetsanopoulos 2011). The sufficient dimension reduction method (SDR for short hereafter) is another well-known approach of the latter type and is the interest of this article.
SDR searches for projections B_0 such that Y is independent of X given B_0^⊤X, that is, Y ⊥ X | B_0^⊤X. The space spanned by the column vectors of B_0, denoted by S(B_0), is called a dimension reduction space. If S(B_0) is a subspace of all the other dimension reduction spaces, it is called the central dimension reduction subspace (CS); see, for example, Cook (2009). Although the central subspace provides complete information about the relationship between Y and X, one might sometimes be interested only in the conditional mean function, for which the dimension reduction assumes

E(Y | X) = E(Y | B_0^⊤X), (1)

where B_0 is a p × d orthogonal matrix with d ≤ p. Similar to CS, the central mean subspace (CMS; Cook and Li 2002) is defined as the subspace contained in all S(B_0) with B_0 satisfying (1).
The sliced inverse regression (SIR; Li 1991) is arguably the most popular method for estimating CS. Based on SIR, many other methods have been proposed; see, for example, Cook and Weisberg (1991), Li (1992), Zhu and Fang (1996), Cook and Li (2002), Li, Zha, and Chiaromonte (2005), and Li and Wang (2007). Another class of dimension reduction methods is based on kernel regression; see, for example, Xia et al. (2002) and Xia (2007). The outer-product-of-gradients (OPG) estimation is one in the second class; Fukumizu, Bach, and Jordan (2009) and Fukumizu and Leng (2014) investigated OPG in reproducing kernel Hilbert spaces (RKHS). In addition, Yin and Cook (2005) and Yin, Li, and Cook (2008) proposed a general dimension reduction method based on information extraction. Ma and Zhu (2012, 2013) developed a semiparametric approach that has appealing theoretical efficiency.
For high-dimensional data where p = p_n → ∞ as the sample size n → ∞, most feature selection techniques work well, but methods of dimension reduction are less straightforward, partially because the theory is no longer valid as p → ∞. Lin, Zhao, and Liu (2018) showed that the consistency of SIR holds if and only if p_n/n → 0 as n increases. Yu, Dong, and Shao (2016) proposed a marginal SIR utility for ultrahigh-dimensional feature selection and dimension reduction. By imposing a sparsity assumption on the model, Lin, Zhao, and Liu (2019) proposed a Lasso-SIR technique for dimension reduction with p ≫ n. Other methods based on the forward approach or sparsity assumptions were also proposed; see, for example, Yin and Hilafu (2015), Tan et al. (2018), and Qian, Ding, and Cook (2018). However, extension of the kernel-based dimension reduction approaches to high-dimensional data is more challenging and has received less attention.
In this article, we will extend OPG to high-dimensional data, allowing p = p_n → ∞ as n → ∞. The proposed method is called the high-dimensional outer-product-of-gradients estimation, denoted by HOPG hereafter. The main idea of this new approach is to utilize the distance correlation (Székely, Rizzo, and Bakirov 2007) to define a window for the local linear regression (Fan and Gijbels 1996) so that it works for high-dimensional data and is able to estimate gradients efficiently. Under some mild regularity conditions, HOPG is proved to be consistent in estimating CMS directions and can be easily extended to the estimation of CS; see Xia (2007). Note that Fukumizu and Leng (2014) did pioneering work based on OPG in the high-dimensional setting by assuming that the regression function and its gradients belong to a reproducing kernel Hilbert space. In comparison, HOPG is proposed for general regression functions. More importantly, HOPG is based on the local linear regression to facilitate the use of the Lasso penalty when the dimension of the covariates is high, and thus it has better numerical performance.
To showcase the application of the proposed method, a new ensemble classification method will be proposed for high-dimensional data. Inspired by Cannings and Samworth (2017), our method can be briefly summarized as follows. Firstly, the original data are randomly projected to subspaces of moderate dimension that can be handled by HOPG efficiently. Then, the projected spaces are further reduced by HOPG. In each of the reduced subspaces, a classifier, such as k-nearest neighbors (kNN), is estimated or trained. The final classification is obtained by aggregating the individual classifiers trained in the projected subspaces. The method is called the classifier via dimension reduction in projection space (denoted by DRIPS hereafter).
The contribution of this article is threefold: (1) we devise a general approach to local linear nonparametric kernel regression in high-dimensional space; as far as we know, this is the first successful attempt for high-dimensional kernel regression (see Lafferty and Wasserman (2008) for a related earlier attempt); (2) we extend OPG to general high-dimensional SDR and give theoretical justification in more general settings than Fukumizu and Leng (2014); and (3) because of the local linear formulation, our method facilitates the use of Lasso approaches, and thus enjoys more theoretical generality, better numerical performance, and faster calculation compared with the existing methods. In addition, the proposed ensemble classifier showcases the ability of SDR to improve function estimation and classification.
The rest of this article is organized as follows. Details of a new local linear smoother and HOPG are stated in Section 2. The ensemble classifier is proposed in Section 3; some theoretical issues are also discussed in that section. Simulation studies for HOPG and real data analyses for DRIPS are reported in Section 4. Assumptions for the theorems and comments on them can be found in the Appendix. Theoretical proofs are given in a separate supplementary material document.

High-Dimensional OPG
Let X = (X^[1], ..., X^[p]) ∈ R^p be a random vector with p = p_n → ∞ and let Y be the corresponding response variable. To illustrate our idea, let us focus on CMS and consider the model

m(x) = g(B_0^⊤x), (2)

where m(x) = E(Y | X = x) is the regression function, g(·) is an unknown link function, and B_0 is the p × d orthogonal projection matrix. Denote the jth row vector of the projection matrix B_0 by β^[j], j = 1, ..., p. Hereafter, ||·|| is the Euclidean norm. Suppose (X_i, Y_i), i = 1, ..., n, are random observations of (X, Y). Let K(u) be a kernel function on R^p and define

K_h^α(v) = h^{−|α|} K(v_1/h^{α_1}, ..., v_p/h^{α_p}), (4)

where the bandwidth h = h_n → 0, α = (α_1, ..., α_p), and |α| = Σ_{j=1}^p α_j. The local linear smoother (Fan 1993) at x is

(a(x), b(x)) = argmin_{a, b} Σ_{i=1}^n {Y_i − a − b^⊤(X_i − x)}^2 K_h^α(X_i − x). (5)

In conventional kernel smoothing, α_j = 1 is used uniformly for all j ∈ {1, ..., p}; thus, the "volume of the window" is proportional to h_n^p. For the consistency of a(x) and b(x) in (5) as estimators of m(x) and its gradients, it is required that (i) the number of observations in the window, which is proportional to nh_n^p, tends to infinity, and (ii) the volume of the window tends to zero. However, in the high-dimensional setting where p = p_n → ∞, the two requirements cannot be fulfilled simultaneously, because for any h_n → 0 the window volume h_n^{p_n} decays so fast that nh_n^{p_n} cannot tend to infinity as n → ∞. As a consequence, classical kernel estimation fails to achieve statistical consistency. Lafferty and Wasserman (2008) proposed a method for kernel smoothing of high-dimensional data, but their theory was only justified for finite dimension.
The indices α_1, ..., α_p and the bandwidth for the local linear estimation (5) are chosen as follows. Note that if X^[j] contributes more linearly to the response, α_j should be smaller, resulting in a bigger bandwidth for X^[j]. On the contrary, if X^[j] contributes more nonlinearly, α_j should be bigger, resulting in a smaller bandwidth for the calculation of the partial derivative along X^[j]. With this consideration, we define in (6) the index α_j as the distance correlation between X^[j] and R_j, where R_j is the residual of the linear regression of Y on X^[j], and dCor(·,·) is the distance correlation coefficient of Székely, Rizzo, and Bakirov (2007). It can be seen that α_j measures the nonlinear dependence between Y and X^[j], and it is 0 if their dependence is either absent or purely linear.
Denote the sample version of α based on the data (X_i, Y_i), i = 1, ..., n, by α̂ = (α̂_1, ..., α̂_p); see Székely, Rizzo, and Bakirov (2007) for the calculation. In practice, the local linear smoother at a point x is the minimizer of (5) with α replaced by α̂. When p is large compared with n, a Lasso penalty on the gradient (i.e., the L_1 norm Σ_j |b_j|) can be added to (5) in the minimization. Denote the minimizers by â(x) and b̂(x) = (b̂^[1](x), ..., b̂^[p](x)), respectively.
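To make the construction concrete, the following sketch computes the sample exponents α̂_j from the distance correlation between X^[j] and the residual R_j, and then carries out the local linear fit (5) at a single point with coordinate-wise bandwidths h^{α̂_j}. It is only an illustration of the idea: the Gaussian kernel, the plain weighted least-squares solver, and all function names are our own choices rather than the paper's implementation, and the Lasso-penalized variant mentioned above is indicated only in a comment.

```python
import numpy as np

def dcor(x, y):
    # Sample distance correlation of two univariate samples (Szekely, Rizzo, and Bakirov 2007).
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    a, b = np.abs(x - x.T), np.abs(y - y.T)                  # pairwise distance matrices
    A = a - a.mean(0) - a.mean(1, keepdims=True) + a.mean()  # double centering
    B = b - b.mean(0) - b.mean(1, keepdims=True) + b.mean()
    dcov2 = max((A * B).mean(), 0.0)
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def nonlinearity_exponents(X, Y):
    # alpha_hat_j = dCor(X^[j], R_j), where R_j is the residual of the simple linear
    # regression of Y on X^[j]; it is near 0 when the dependence is absent or purely linear.
    n, p = X.shape
    alpha = np.empty(p)
    for j in range(p):
        Z = np.column_stack([np.ones(n), X[:, j]])
        resid = Y - Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]
        alpha[j] = dcor(X[:, j], resid)
    return alpha

def local_linear_fit(X, Y, x0, h, alpha):
    # Local linear smoother at x0 with coordinate-wise bandwidths h**alpha_j and a
    # product Gaussian kernel; returns the local estimates of m(x0) and its gradient.
    U = (X - x0) / h ** alpha
    w = np.exp(-0.5 * np.sum(U ** 2, axis=1))
    Z = np.column_stack([np.ones(len(Y)), X - x0])
    sw = np.sqrt(w)
    theta = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)[0]
    # For large p, a Lasso penalty on theta[1:] (the gradient part) can be added instead.
    return theta[0], theta[1:]
```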
As far as we know, Theorem 1 is the first result on the local linear smoother in a high-dimensional setting. Note that the right-hand side of (7) in Theorem 1 is the sum of two terms corresponding to the estimation variance and bias, respectively. It is interesting to see that using the estimated α̂_j's does not affect the asymptotic property of the local linear estimation. Denote by b̂_j = b̂(X_j) the estimator from (5) at each observed X_j, and calculate the average outer product of the estimated gradients, Σ̂ = n^{-1} Σ_{j=1}^n b̂_j b̂_j^⊤, whose first d eigenvectors, collected in B̂, estimate the CMS directions.

Theorem 2. Suppose model (2) and Assumptions (A1)-(A7) in the Appendix hold. Then |B̂B̂^⊤ − B_0B_0^⊤| → 0 in probability as n → ∞, where |A| represents the largest absolute value of the entries of a matrix A.
Based on the restrictions on p_n, h_n, and ω_n in Assumptions (A4) and (A7), p_n is allowed to diverge at a speed of p_n = o(n^{2/(|α|+4)}). This restriction on p_n is stronger than that of Lasso-SIR (Lin, Zhao, and Liu 2019). This is reasonable, as SIR is based on a one-dimensional partition of the sample space while OPG is based on a multi-dimensional partition. It is possible to allow a faster-growing p_n by following the refining procedure of Xia et al. (2002) and Xia (2007). On the other hand, the smaller |α| is, the bigger p_n can be. When p ≫ n, a variable screening step is suggested to reduce the number of variables before applying HOPG; details are given in Section 4. It is worthwhile to mention that d = d_n can also increase with n at a slow rate, d_n = o(log(n)); see Remarks 1 and 2 in the supplementary material.
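Continuing the sketch above, the full HOPG estimator averages the outer products of the estimated gradients over the observations and takes the leading eigenvectors. The bandwidth formula below simply mirrors the rate n^{2/(|α|+4)} discussed here; it is an illustrative choice, not the rule-of-thumb bandwidth used in the paper's experiments.

```python
def hopg_directions(X, Y, d, h=None):
    # HOPG sketch: average outer products of local-linear gradient estimates and take
    # the leading d eigenvectors as the estimated CMS directions.
    # Reuses nonlinearity_exponents() and local_linear_fit() from the previous sketch.
    n, p = X.shape
    alpha = nonlinearity_exponents(X, Y)
    if h is None:
        h = n ** (-1.0 / (alpha.sum() + 4.0))   # illustrative choice, not the paper's rule of thumb
    Sigma = np.zeros((p, p))
    for i in range(n):
        _, b = local_linear_fit(X, Y, X[i], h, alpha)
        Sigma += np.outer(b, b) / n
    _, eigvec = np.linalg.eigh(Sigma)           # eigenvalues returned in ascending order
    return eigvec[:, ::-1][:, :d]               # columns span the estimated CMS
```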
Following Xia (2007), HOPG also works for a multivariate response Y. Specifically, suppose Y_{iℓ}, ℓ = 1, ..., L, are the multiple responses or dummy variables of observation Y_i. Let Σ̂_ℓ denote the outer-product-of-gradients matrix computed from (X_i, Y_{iℓ}), i = 1, ..., n, where α^(ℓ) = (α_1^(ℓ), ..., α_p^(ℓ)) is calculated as in (6) with Y replaced by Y_ℓ. Then, the dimension reduction directions are the eigenvectors of Σ_{ℓ=1}^L Σ̂_ℓ. This idea also allows HOPG to estimate CS if the response Y is replaced by 1(Y < c_ℓ) for a number of c_ℓ's, where 1(·) is the indicator function. The same idea can be applied to the MAVE of Xia et al. (2002). Another issue about CMS or CS is their dimension, which will be discussed in Section 3.5.
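For a multivariate response, or for estimating CS through indicator responses, the individual outer-product matrices can simply be summed before the eigen-decomposition. The sketch below, under the same assumptions as the previous sketches, applies this to dummy responses 1(Y < c_ℓ) over a grid of cut points.

```python
def hopg_cs_directions(X, Y, d, cuts):
    # Estimate CS by running the gradient step on each dummy response 1(Y < c) and
    # summing the resulting outer-product-of-gradients matrices.
    n, p = X.shape
    Sigma = np.zeros((p, p))
    for c in cuts:
        Yc = (Y < c).astype(float)
        alpha = nonlinearity_exponents(X, Yc)
        h = n ** (-1.0 / (alpha.sum() + 4.0))    # illustrative bandwidth, as before
        for i in range(n):
            _, b = local_linear_fit(X, Yc, X[i], h, alpha)
            Sigma += np.outer(b, b) / n
    _, eigvec = np.linalg.eigh(Sigma)
    return eigvec[:, ::-1][:, :d]
```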

Classification in High-Dimensional Space
The goal of dimension reduction is to estimate the regression function or classification more efficiently. However, little effort has been paid to this goal in the literature of SDR. Based on HOPG, we will propose an ensemble classification method in this section.
The method first uses random projection to reduce the high dimension p to a moderate dimension m, and then uses HOPG to further reduce the dimension of the projected subspace. The choice of m will be discussed later. This approach is inspired by Cannings and Samworth (2017), where they did not use SDR as a second dimension reduction but set m = 5 directly. Their practice is not justifiable either by the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss 1984) or by sufficient dimension reduction.
Suppose a random pair (X, Y) takes values in R^p × C, where Y is the class label of X = (X^[1], ..., X^[p]) and C = {c_1, ..., c_L} is the set of classes. The proportion of each class is denoted by π_ℓ = P(Y = c_ℓ), ℓ = 1, ..., L. Combined with the k-nearest-neighbor classification method (kNN; Fix and Hodges 1951), the workflow of our method is as follows.
Given data (X, Y) with X = (X^[1], ..., X^[p]) and Y ∈ {c_1, ..., c_L}, we (i) draw B random projections, (ii) reduce each projected dataset further by HOPG, (iii) train a base classifier such as kNN in each reduced subspace, and (iv) aggregate the base classifiers by voting. Here, B is the number of random projections, that is, the number of individual classifiers. Details of the workflow are discussed in the following sections, and a minimal code sketch is given below.
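The following minimal sketch puts the four steps together under simplifying assumptions: Gaussian random projection matrices, a fixed projection dimension m, a crude in-sample error for ranking the base classifiers (the paper estimates misclassification rates and tunes d and k by leave-one-out cross-validation), and binary labels coded as 0/1 so that HOPG can be run on them directly; multi-class problems would use the dummy responses of Section 2. All function and parameter names are ours.

```python
import numpy as np
from collections import Counter

def knn_predict(U_train, y_train, U_test, k):
    # Plain kNN with Euclidean distance in the reduced space.
    preds = []
    for u in U_test:
        idx = np.argsort(np.linalg.norm(U_train - u, axis=1))[:k]
        preds.append(Counter(y_train[idx]).most_common(1)[0][0])
    return np.array(preds)

def drips_predict(X, y, X_new, B=50, m=10, d=2, k=5, r=0.4, seed=None):
    # DRIPS sketch: random projection -> HOPG -> kNN, then keep the fraction r of base
    # classifiers with the smallest error and classify new points by majority vote.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = min(m, p)
    fits = []
    for _ in range(B):
        P = rng.standard_normal((p, m)) / np.sqrt(m)     # random projection
        Z = X @ P
        D = hopg_directions(Z, y.astype(float), d)       # second reduction by HOPG (0/1 labels assumed)
        U = Z @ D
        err = np.mean(knn_predict(U, y, U, k) != y)      # crude in-sample error
        fits.append((err, P @ D))
    fits.sort(key=lambda t: t[0])
    kept = fits[: max(1, int(r * B))]                    # discard the worse base classifiers
    votes = np.stack([knn_predict(X @ W, y, X_new @ W, k) for _, W in kept])
    return np.array([Counter(votes[:, i]).most_common(1)[0][0]
                     for i in range(len(X_new))])
```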

Random Projection and Sufficient Dimension Reduction
The first step of DRIPS is to project the original data space, R^p, into a lower-dimensional subspace R^m with m ≤ p, such that the projected data approximately retain the structure of the original data; see Johnson and Lindenstrauss (1984). Another motivation for this step is to generate many individual classifiers for the ensemble learning; see, for example, Bingham and Mannila (2001) and Rokach (2010). In practice, we randomly generate a large number of candidate projections, denoted by P_b, b = 1, ..., B. As suggested by Cannings and Samworth (2017), B = 500 is used in our calculation. Next, we need to estimate the conditional class probability p^P(z) of Y given the projected covariates P^⊤X = z. By the Johnson-Lindenstrauss lemma, however, m in the random projection cannot be too small and depends on the sample size n in order to retain the space structure of the data. As a consequence, p^P(z) cannot be estimated well. This requires a second dimension reduction, which will be done by HOPG by assuming that p^P(z) depends on z only through D^⊤z. Based on dataset S_n^{P_b}, denote the estimator of the first d directions of CMS by D̂_b. The choice of d will be discussed later.
For the choice of m, note that the approximation error in the Johnson-Lindenstrauss lemma decreases as m increases, while the consistency of HOPG requires that m not be too large. To balance the performance of both, we select m ∝ n^{2/(α_1+···+α_m+4)}, the largest dimension at which HOPG is justified; see the discussion below Theorem 2. In our calculation, the integer m is randomly selected in the range between min{n^{2/(α̂_1+···+α̂_m+4)}, p} and min{4n^{2/(α̂_1+···+α̂_m+4)}, p}. The voting procedure in the ensemble will automatically select an appropriate m in the range.
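As a small illustration of the range just described (treating the sum of the α̂_j's over the selected coordinates as given, since in practice it depends on the projection), the candidate interval for m can be computed as follows.

```python
def projection_dim_range(n, p, alpha_sum):
    # Candidate range for the projection dimension m: between n^{2/(alpha_sum+4)} and
    # 4 * n^{2/(alpha_sum+4)}, each capped at p, following the heuristic in the text.
    base = n ** (2.0 / (alpha_sum + 4.0))
    return min(int(base), p), min(int(4.0 * base), p)

# e.g. projection_dim_range(1000, 561, 6.0) -> (3, 15)
```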

Classifier in the Dimension Reduced Space
Let P: p × m and D: m × d be the two consecutive projections constructed above, and let C_S^{PD}(x) denote the resulting classifier. It can be seen that the value of C_S^{PD}(x) depends on x only through its d-dimensional projection (PD)^⊤x. This is the reason why our method works efficiently in theory. Following the steps above, the B classifiers can be constructed based on the training set S_n. Suppose kNN is used to implement the above classification; the details are as follows. Let u_i = (PD)^⊤x_i and u = (PD)^⊤x, where x is a new observation. Rearrange the samples in increasing order of ||u_i − u||, and denote their corresponding class labels by y_(1), ..., y_(n). The estimated probability for x to fall into class c_ℓ is p̂_ℓ(x) = k^{-1} Σ_{i=1}^k 1(y_(i) = c_ℓ), (11) where k is the number of neighbors. Then, the kNN classifier is defined as Ĉ_S^{PD}(x) = c_{ℓ̂} with ℓ̂ = argmax_{1≤ℓ≤L} p̂_ℓ(x). (12)
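A direct transcription of the class-probability estimate (11) and classifier (12), with names of our own choosing, is:

```python
import numpy as np

def knn_class_probs(U_train, y_train, u, k, classes):
    # p_hat_l(x) = proportion of the k nearest neighbours of u (in the reduced space)
    # whose label equals c_l; the kNN classifier returns the class with the largest p_hat_l.
    idx = np.argsort(np.linalg.norm(U_train - u, axis=1))[:k]
    return np.array([np.mean(y_train[idx] == c) for c in classes])
```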

Voting
As an ensemble classifier, a rule is needed to aggregate the classification results of all the individual classifiers. Note that, for the random forest (Breiman 2001; Liaw and Wiener 2002) and bagging methods (Breiman 1996), the individual classifiers are assumed to be unbiased; therefore, a simple majority vote of the classification results is efficient. In our case, however, a base classifier is unbiased only when the random projection satisfies the Johnson-Lindenstrauss lemma. Therefore, we need to discard some of the biased classifications. Similar to Cannings and Samworth (2017), we select the B_0 = [rB] classifiers that have relatively smaller estimated misclassification rates, where 0 < r < 1. Some theory could be developed for the selection of r; see, for example, Dasgupta and Gupta (2003).
In our calculation, r = 0.4 is used. For ease of exposition, we assume that the first B_0 classifiers are selected while the others are discarded. For any observation x, the voting probability for x to be classified as c_ℓ is thus v_ℓ^n(x) = B_0^{-1} Σ_{b=1}^{B_0} 1(Ĉ_b(x) = c_ℓ). Usually, x is assigned to class c_{ℓ_0} with ℓ_0 = arg max_{1≤ℓ≤L} v_ℓ^n(x).

Selection of Tuning Parameters
Note that the dimension d_0 of the CMS is unknown and may depend on the projection P_b. In addition, for the kNN classification, the number of neighbors is also an essential tuning parameter.
In order to select these parameters, we define a leave-one-out cross-validated error rate with respect to d and k as CV(d, k) = n^{-1} Σ_{i=1}^n 1(Ĉ^{(-i)}(x_i) ≠ y_i), where Ĉ^{(-i)} is the kNN classifier defined in (12) based on the training set with the ith observation deleted: {(x_j, y_j): j = 1, ..., i − 1, i + 1, ..., n}. The minimizers of CV(d, k) with respect to d and k are used as the selected numbers of dimension and neighbors, respectively. The consistency of the selection will be discussed in the next section.
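A minimal sketch of this selection step is given below; it assumes the reduced predictors (PD̂)^⊤x_i have already been computed at the largest candidate dimension, so that smaller values of d correspond to the leading columns, and it uses a simple majority-vote kNN.

```python
import numpy as np
from collections import Counter

def select_d_k(U_full, y, d_grid, k_grid):
    # Leave-one-out cross-validation over the structural dimension d and the number of
    # neighbours k; U_full is the n x d_max matrix of reduced predictors.
    n = len(y)
    best, best_err = None, np.inf
    for d in d_grid:
        U = U_full[:, :d]
        dist = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)            # exclude the ith point from its own neighbours
        order = np.argsort(dist, axis=1)
        for k in k_grid:
            pred = [Counter(y[order[i, :k]]).most_common(1)[0][0] for i in range(n)]
            err = np.mean(np.array(pred) != y)
            if err < best_err:
                best, best_err = (d, k), err
    return best
```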

Consistency of the Classifier
For simplicity, we only consider the two-category case (i.e., L = 2). For a fixed projection P, let p(x) = P(Y = 1 | P^⊤X = P^⊤x). Let D̂ be an m × d̂ dimension reduction matrix obtained by HOPG using the data S_n^P = {(P^⊤X_i, Y_i), i = 1, ..., n}, where d̂ is the estimated dimension, and let p̂(x) be the kNN estimator of p(x) based on the reduced data {((PD̂)^⊤X_i, Y_i), i = 1, ..., n}.

Theorem 4. Suppose the assumptions in Theorem 3 hold. Then lim_{n→∞} p̂(x) = p(x) in probability.
Note that the consistency is for each classifier with a fixed projection. This is essential for the voting procedure in Section 3.3. The results stated above can be extended to multi-class classification by modifying some of the assumptions; see Appendix A.2.

Numerical Study
To assess the performance of HOPG in SDR and DRIPS in classification, we carry out numerical studies in this section and compare them with their respective competitors.

Simulations for HOPG
For dimension reduction, we compare HOPG with MAVE (Xia et al. 2002), gKDR (gKDR-v; Fukumizu and Leng 2014), sparse SIR (SSIR; Tan et al. 2018), Lasso-SIR (Lin, Zhao, and Liu 2019), sequential SIR (Yin and Hilafu 2015), and TC-SIR (Qian, Ding, and Cook 2018). For a fair comparison, when the dimension p > n, we screen the number of predictors down to n/log(n) using Li, Zhong, and Zhu (2012). We found that the screening can improve all the methods, even though some of them work for p > n. In the calculation of HOPG, we select the bandwidth by a rule of thumb. We adopt the measure of Li (1991) to evaluate estimation performance. Suppose U is an n × d matrix and V is a column vector of length n. Let R(V, U) be the R^2 of V regressed on U; with a slight abuse of notation, the same notation is used when V has several columns. Note that a larger R(V, U) represents a stronger linear correlation between V and U. Thus, if B̂ is an estimator of B_0, a larger R(B_0^⊤X, B̂^⊤X) means B̂ is a better estimator of B_0.
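A small sketch of this criterion is given below; since the precise treatment of a multi-column V is not spelled out here, we take one natural reading and average the column-wise R^2 values.

```python
import numpy as np

def r_measure(V, U):
    # R(V, U): R^2 of (each column of) V regressed on U, averaged over the columns of V.
    V = np.asarray(V, float)
    V = V.reshape(len(V), -1)
    U1 = np.column_stack([np.ones(len(U)), U])
    r2 = []
    for v in V.T:
        fit = U1 @ np.linalg.lstsq(U1, v, rcond=None)[0]
        r2.append(1.0 - np.sum((v - fit) ** 2) / np.sum((v - v.mean()) ** 2))
    return float(np.mean(r2))

# Usage: r_measure(X @ B0, X @ Bhat) close to 1 means span(Bhat) is close to span(B0).
```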
In the following simulated models, X = (X^[1], ..., X^[p]) ∼ N(0, Σ) with Σ = (0.5^|i−j|)_{p×p}, and the error term ε follows a standard normal distribution N(0, 1). The sample sizes used in the simulations are n = 500 and 1000, while the dimensions are p ∈ {n/50, n/10, n/5, n/2, 1.5n, 2n} for each n. Four models (Models 1-4) are considered, in which sign(·) denotes the sign function and Q_δ(Z) is the δth quantile of Z. Let B_0 = (β_1, β_2) be the dimension reduction matrix, which is designed as follows. First, suppose there are only s_n important variables in the model, which are randomly placed among the covariates. Generate a p × 2 matrix U, where each entry is independently drawn from the standard normal distribution. Randomly select s_n rows of U, keep the values in these rows, and set the others to 0. Then B_0 is obtained by orthonormalizing U.
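For reference, the sparse direction matrix B_0 described above can be generated as in the short sketch below (using a QR factorization for the orthonormalization; the function name is ours).

```python
import numpy as np

def make_sparse_B0(p, s, d=2, seed=None):
    # p x d matrix with s randomly placed nonzero rows of N(0,1) entries, orthonormalised.
    rng = np.random.default_rng(seed)
    U = np.zeros((p, d))
    rows = rng.choice(p, size=s, replace=False)
    U[rows] = rng.standard_normal((s, d))
    Q, _ = np.linalg.qr(U)          # columns of Q[:, :d] are orthonormal and keep the sparsity pattern
    return Q[:, :d]
```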
Let B̂ be the estimated dimension reduction matrix with fixed d_0 = 2, and calculate R(B_0^⊤X, B̂^⊤X). For each triple (n, p, s_n), we repeat the simulation and estimation 100 times. The averages of R(B_0^⊤X, B̂^⊤X) over the 100 replications are shown in Figures 1 and 2 for s_n = min{p, √n} and s_n = min{p, n/log(n)}, respectively. Figures 1 and 2 show that the performance of HOPG is always among the top two in all cases, and it is outstanding in Models 1 and 3. The performance of HOPG is similar to that of MAVE when the dimension p is small, while it performs much better when p gets larger. For Models 2 and 4, the results in Figure 1 show that sparse SIR is comparable to HOPG but becomes worse than HOPG when s_n increases. Although gKDR and TC-SIR perform similarly to HOPG in Models 2 and 4 for larger s_n, they perform poorly in the other cases. On the other hand, in Figure 2, all the methods get worse when p is very large, revealing the fact that the existing dimension reduction methods, including HOPG, are unable to estimate CS efficiently when p ≫ n. To see the performance of the proposed method for the selection of the dimension d, we apply the method to Model 4. With n = 500 and different p, the boxplots of the selected d based on 100 simulations are shown in Figure 3. It can be seen that the selection is very accurate when p ≤ 50. When p gets larger, the method selects a bigger d. This is reasonable because the estimation of the directions gets less efficient as p gets larger; as a consequence, more directions need to be included to compensate for the loss of efficiency in the estimation of directions.

Real Data Examples for DRIPS
In this section, we use real data to assess the performance of DRIPS when HOPG, SIR, and gKDR are used for the SDR step in DRIPS. Our method is also compared with several popular classification methods, including the random-projection ensemble classifier (RPknn) of Cannings and Samworth (2017), kNN, the support vector machine (SVM), random forests (RF), adaptive boosting (AdaBoost), and extreme gradient boosting (xgBoost). Two high-dimensional linear classification methods, the multi-class sparse discriminant analysis (MSDA; Mai, Yang, and Zou 2019) and the sparse optimal scoring method (SOS; Clemmensen et al. 2011), are included in the comparison. We also consider using kNN for classification after SDR, which is used to show the improvement brought by the ensemble procedure.
The calculation is done in R using the following packages and tuning parameters. Package RPEnsemble (Cannings and Samworth 2016) is used for the calculation of RPknn, where the dimension of the image space of the projections is 5, and B_1 = 500 and B_2 = 50 as recommended by Cannings and Samworth (2017). For kNN, k is chosen from {1, 3, ..., 15} by leave-one-out cross-validation. SVM is implemented using the svm function in the e1071 package (Dimitriadou et al. 2008) with the radial kernel. RF is implemented with the randomForest package (Liaw and Wiener 2002), where 1000 trees are used and [√p] components are randomly selected in training each tree.
We use the package adabag (Alfaro, Gamez, and Garcia 2013) for AdaBoost, setting the number of iterations to the default of 100. xgBoost is implemented with the xgboost package (Chen et al. 2015), with the maximum number of boosting iterations set to 25.
To make a fair comparison, we use all the six real datasets in Cannings and Samworth (2017) that have more than 500 observations. As these datasets contain only two categories (i.e., L = 2), two other real datasets with more categories are also used for comparison. However, as RPknn is not applicable to the last two datasets, the adaptive boosting (AdaBoost) method is used as a replacement. Details of the datasets are listed below.

The human activity recognition using smartphones dataset (http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) contains 561 accelerometer measurements (p = 561). In the binary case, we select the walking and laying activities from the original training set. Thus, we form a dataset containing 2633 observations, with 1266 belonging to "walking" (class 0) and 1407 belonging to "laying" (class 1).

data 6 (GISETTE dataset) The GISETTE dataset (https://archive.ics.uci.edu/ml/datasets/Gisette) is a dataset to separate the highly confusable handwritten digits "4" and "9", which consists of 3000 handwritten "4"s and 3000 handwritten "9"s. The digits have been size-normalized and centered in a fixed-size image of dimension 28 × 28. Together with some distractor features called "probes" having no predictive power, the dimension is finally p = 5000.

data 7 (Full human activity recognition dataset) The human activity recognition using smartphones dataset (http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) consists of 10,299 observations of 6 activities (L = 6) with 561 accelerometer measurements (p = 561). In the dataset, there are 1722 observations belonging to "walking" (class 1), 1544 to "walking upstairs" (class 2), 1406 to "walking downstairs" (class 3), 1777 to "sitting" (class 4), 1906 to "standing" (class 5), and 1944 to "laying" (class 6).

data 8 (Gas drift dataset) The gas sensor array drift dataset.

In the calculation, training sets of size n are randomly sampled from the original datasets, and the remaining data are used as the test sets. We take n = 50, 100, 200, 500, 1000, and 2000 successively when applicable. After repeating the calculation randomly 200 times, the average classification errors for each setting are calculated and presented in Tables 1 and 2. In the tables, the third to fifth columns report the classification errors of DRIPS based on different dimension reduction methods, denoted by DRIPS+HOPG, DRIPS+SIR, and DRIPS+gKDR, respectively; the sixth column gives the errors of applying kNN directly to all the original predictors, denoted by kNN+all; the seventh column gives the errors of applying kNN to the space reduced by HOPG, denoted by kNN+HOPG.
By comparing the classification errors, we have the following observations. The classical methods kNN, RF, SVM, and xgBoost have reasonably good performance on all the datasets, while the more recently proposed linear classification methods MSDA and SOS are less appealing, indicating that there is nonlinearity in the data. Among all the methods, DRIPS+HOPG attains the smallest misclassification errors in most cases, especially when the size of the training set n is large. For the cases where DRIPS+HOPG fails to achieve the minimum classification error, its performance is very close to the best. In particular, for the hill-valley dataset (data 3), DRIPS+HOPG is outstanding, with a misclassification rate close to 0 when n ≥ 100. In contrast, the other classifiers are far from satisfactory even when the training sizes increase, implying possible inconsistency of those methods.
Note that RPknn is an ensemble classifier based on the random projection alone. It usually has much bigger classification errors than our double dimension reduction approach, that is, the random projection followed by SDR, indicating the big contribution of SDR to the classification. On the other hand, kNN+HOPG can improve on kNN (i.e., kNN+all) but is obviously less efficient than DRIPS, indicating the importance of the ensemble procedure. In summary, both the random projection and SDR are essential for DRIPS in the classification.
Based on the above observations, we have the following conclusions. First, HOPG is more efficient than the other SDR methods for dimension reduction and thus is able to produce more accurate classification. Second, both the random projection and SDR are essential for DRIPS in the classification, while the random projection or SDR alone is much less efficient. Third, if it is properly used, SDR is indeed very helpful in function estimation and classification and can improve the existing classification methods significantly.

Conclusion
In this article, we first devised a general approach to local linear nonparametric kernel regression in high-dimensional space. Based on this approach, we extended OPG to high-dimensional data and gave theoretical justification under general settings. Because of the local linear formulation, HOPG facilitates the use of Lasso approaches and thus has appealing performance. The ensemble classifier DRIPS aggregates classifiers trained in the subspaces reduced consecutively by the random projection and HOPG. Both HOPG and DRIPS have showcased their promising performance in comparison with their competitors. Some issues have not been fully addressed in this article. The theoretical consistency rate is slow, and p = p_n can only grow slowly with n. These rates could be improved by employing more complicated mathematical derivations. For DRIPS, how to select the optimal dimension m of the projection, and to what extent it can improve the individual classifier, also need further investigation.

Supplementary Materials
Technical Proofs: this supplement consists of the justification for Assumption (A1) and the proofs of Theorems 1-4 (online_supplement.pdf).

Figure 1. Simulation results for Models 1-4 with s_n = min{p, √n}. In each panel, the thick solid, dot-dash, dotted, solid, dashed, long-dash, and double-dot-dash curves represent the average correlations between the estimated and true sufficient dimension reduction spaces, R(B_0^⊤X, B̂^⊤X), with B̂ estimated by HOPG, MAVE, gKDR, SSIR, Lasso-SIR, sequential SIR, and TC-SIR, respectively.

Figure 2. Simulation results for Models 1-4 with s_n = min{p, n/log(n)}. In each panel, the thick solid, dot-dash, dotted, solid, dashed, long-dash, and double-dot-dash curves represent the average correlations between the estimated and true sufficient dimension reduction spaces, R(B_0^⊤X, B̂^⊤X), with B̂ estimated by HOPG, MAVE, gKDR, SSIR, Lasso-SIR, sequential SIR, and TC-SIR, respectively.

Figure 3. Boxplot of the selected structural dimension d for Model 4.

Appendix: Assumptions for the Classifier

(B1) The support of the random vector U is W ⊂ R^d, which is a compact d-dimensional manifold with boundary ∂W.
(B2) Let Γ = {u ∈ W : p_d(u) = 1/2} be the nonempty decision boundary set. There is an open subset V_0 of R^d containing Γ such that: firstly, p_d is continuous on V\V_0, where V is an open set containing W; secondly, p_d is differentiable and ṗ_d(u) is bounded in V_0; thirdly, the restriction of F_d to V_0 has a twice continuously differentiable density function f_d.
(B3) There exists ρ > 0 such that ∫_{R^d} ||u||^ρ dF_d(u) < ∞. Moreover, for sufficiently small δ > 0, the ratio F_d(B_δ(u))/(a_d δ^d) is bounded and bounded away from zero uniformly for u ∈ W.
(B4) For all u ∈ Γ, ṗ_d(u) ≠ 0. And for all u ∈ Γ ∩ ∂W, ∂ṗ_d(u) ≠ 0, where ∂W is the boundary of the set W and ∂ṗ_d denotes the restriction of ṗ_d to ∂W.
(B5) The number of neighbors k = k(n) takes values in an interval k_n = [k_L, k_U], where C^{-1} max{n^β, log^2(n)} ≤ k_L ≤ k_U ≤ C min{n^{(1−βn/4)}, n^{1−β}, n^{4/11}} for some constant C > 0 and β ∈ (0, 1/2).

Assumptions (B1)-(B4) are almost the same as the conditions in Theorem 1 of Samworth (2012), where explanations of these assumptions can be found. The only difference is (B2), where we assume p_d(u) is differentiable with bounded derivative on V_0. This assumption is used to derive the smoothness condition on p_d(u) needed to apply the theorems in Chaudhuri and Dasgupta (2014). Assumption (B5) is adopted to guarantee the asymptotic result for CV(d, k). In addition, the following assumption is needed:
(B6) For each (ℓ_1, ℓ_2) ≠ (ℓ_3, ℓ_4), the sub-manifolds Γ_{ℓ_1,ℓ_2} and Γ_{ℓ_3,ℓ_4} of R^d are transversal.