Sufficient Dimension Reduction via Distance Covariance

We introduce a novel approach to sufficient dimension-reduction problems using distance covariance. Our method requires very mild conditions on the predictors and estimates the central subspace effectively even when many predictors are categorical or discrete. It retains the model-free advantage of not estimating a link function. Under regularity conditions, root-n consistency and asymptotic normality are established for our estimator. We compare the performance of our method with some existing dimension-reduction methods by simulation and find that our method is very competitive and robust across a number of models. We also analyze the Auto MPG data to demonstrate the efficacy of our method. Supplemental materials for this article are available online.


1. INTRODUCTION
Suppose X is a p × 1 predictor vector and the response Y is a scalar. Sufficient dimension reduction aims to find sufficiently reduced variables so that the regression information of Y | X is completely retained. Ever since Li (1991) proposed sliced inverse regression (SIR), many sufficient dimension-reduction methods have been proposed, for instance, sliced average variance estimation (SAVE; Cook and Weisberg 1991), inverse regression (IR; Cook and Ni 2005), and directional regression (DR; Li and Wang 2007). All of these methods need the linearity condition, the constant covariance condition, or both; however, these conditions are not easy to verify in practice, and the results may be misleading if either condition is violated. Thus, many authors have endeavored to develop sufficient dimension-reduction methods that avoid these two conditions. Some representative ones are the minimum average variance estimation method (MAVE; Xia et al. 2002), sliced regression (SR; Wang and Xia 2008), and the ensemble approach (Yin and Li 2011); however, these methods use nonparametric tools, which implicitly require the predictors to be continuous and the link function to be smooth.

2. METHODOLOGY

2.1 DISTANCE COVARIANCE

Székely, Rizzo, and Bakirov (2007) proposed the distance covariance (hereafter DCOV) as a new distance measure of dependence between two random vectors. Let U ∈ R^p and V ∈ R^q, where p and q are positive integers. The DCOV between U and V with finite first moments is the nonnegative number V(U, V) defined by

\[
V^2(U, V) = \int_{\mathbb{R}^{p+q}} \bigl| f_{U,V}(t, s) - f_U(t)\, f_V(s) \bigr|^2 \, w(t, s)\, dt\, ds,
\]

where f_U and f_V stand for the characteristic functions of U and V, respectively, and their joint characteristic function is denoted by f_{U,V}. The norm |·| is defined in the following way: for complex-valued functions f(·), |f|^2 = f f̄, where f̄ is the complex conjugate of f. The weight function w(t, s) is specially chosen so that the above integral exists and has attractive properties.
More details of w(t, s) can be found in Székely, Rizzo, and Bakirov (2007). Székely and Rizzo (2009) gave an equivalent form of DCOV:

\[
V^2(U, V) = E\bigl(|U - U'|\,|V - V'|\bigr) + E|U - U'|\, E|V - V'| - 2\, E\bigl(|U - U'|\,|V - V''|\bigr),
\]

where (U', V') and (U'', V'') are iid copies of (U, V).
An attractive property of DCOV is that it equals 0 if and only if the two random vectors are independent (Székely, Rizzo, and Bakirov 2007). This property makes it possible to use DCOV as a sufficient dimension-reduction tool, as we will see in the following sections.
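As an illustration only (not the authors' MATLAB implementation), the equivalent moment form above can be estimated by plugging sample averages into the three expectations. The following Python sketch assumes numpy and data arrays with one row per observation; its usage lines also show the independence property numerically.

```python
import numpy as np

def dcov_sq_moment(U, V):
    """Plug-in estimate of V^2(U, V) from the moment form:
    E|U-U'||V-V'| + E|U-U'| E|V-V'| - 2 E|U-U'||V-V''|."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    if V.ndim == 1:
        V = V[:, None]
    # pairwise Euclidean distances |U_k - U_l| and |V_k - V_l|
    a = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    b = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    term1 = (a * b).mean()                            # E|U - U'| |V - V'|
    term2 = a.mean() * b.mean()                       # E|U - U'| E|V - V'|
    term3 = (a.mean(axis=1) * b.mean(axis=1)).mean()  # E|U - U'| |V - V''|
    return term1 + term2 - 2.0 * term3

rng = np.random.default_rng(0)
u = rng.standard_normal(500)
print(dcov_sq_moment(u, rng.standard_normal(500)))  # independent: close to 0
print(dcov_sq_moment(u, u ** 2))                    # nonlinearly dependent: clearly positive
```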

2.2 SUFFICIENT DIMENSION REDUCTION
To facilitate our discussion, let B be an arbitrary matrix and let S(B) be the subspace spanned by the column vectors of B. Let dim(S(B)) denote the dimension of S(B) and let P_{B(Σ_X)} denote the projection operator that projects onto S(B) with respect to the inner product ⟨a, b⟩ = a^T Σ_X b. Let β be a p × q matrix with q ≤ p, and let ⊥⊥ denote independence. Sufficient dimension reduction is based on the following conditional independence:

\[
Y \perp\!\!\!\perp X \mid \beta^{T}X, \qquad (1)
\]

where (1) indicates that the regression information of Y given X is completely contained in the linear combinations β^T X. The column space of β in (1), denoted by S(β), is called a dimension-reduction subspace. If the intersection of all dimension-reduction subspaces is itself a dimension-reduction subspace, then it is called the central subspace (CS), denoted by S_{Y|X} (Li 1991; Cook 1994, 1996). Conditions for its existence are established by Cook (1998) and Yin, Li, and Cook (2008). Throughout the article, we assume the central subspace exists; it is then unique. Furthermore, let d denote the structural dimension of the central subspace, and let Σ_X be the covariance matrix of X, which is assumed to be nonsingular. Our primary goal is to identify the central subspace by estimating d and a p × d basis matrix B of the central subspace.
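For a concrete illustration (a toy example of ours, not taken from the article), suppose Y depends on X only through two linear combinations,

\[
Y = f(\beta_1^{T}X) + g(\beta_2^{T}X)\,\varepsilon, \qquad \varepsilon \perp\!\!\!\perp X .
\]

Then Y ⊥⊥ X | (β_1^T X, β_2^T X), so the column space of (β_1, β_2) is a dimension-reduction subspace; if no proper subspace of it satisfies (1), it equals the central subspace S_{Y|X} and the structural dimension is d = 2.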

2.3 DCOV AS A SUFFICIENT DIMENSION REDUCTION METHOD
Let β be an arbitrary p × d_0 matrix, where 1 ≤ d_0 ≤ p, and consider the optimization problem

\[
\max_{\beta}\; V^2(\beta^{T}X, Y) \quad \text{subject to} \quad \beta^{T}\Sigma_X\beta = I_{d_0}. \qquad (2)
\]

We will show that, under mild conditions, solving (2) yields a basis of the central subspace.
Here the squared DCOV between β^T X and Y is defined exactly as in Section 2.1 with U = β^T X and V = Y. Székely, Rizzo, and Bakirov (2007) showed that the conditions E|X| < ∞ and E|Y| < ∞ guarantee that the DCOV is finite; thus, throughout the article we assume E|X| < ∞ and E|Y| < ∞. In the optimization problem (2), we use the constraint β^T Σ_X β = I_{d_0}. The reason is that V^2(cβ^T X, Y) = |c| V^2(β^T X, Y) for any constant c (Székely and Rizzo 2009, theorem 3), so we can always obtain a larger value of V^2(β^T X, Y) by multiplying β by a constant with a larger absolute value; a scale constraint is therefore needed for the maximization procedure to work.
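To spell this out (our restatement of the scale argument above), the identity

\[
V^2\bigl(c\,\beta^{T}X, \; Y\bigr) = |c|\, V^2\bigl(\beta^{T}X, \; Y\bigr)
\]

implies that the unconstrained supremum of V^2(β^T X, Y) over β is infinite whenever V^2(β^T X, Y) > 0 for some β, because we may let |c| → ∞; the quadratic constraint β^T Σ_X β = I_{d_0} removes exactly this scale indeterminacy.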
The following propositions ensure that if we maximize V^2(β^T X, Y) with respect to β under the constraint and some mild conditions, the solution indeed spans the central subspace.

Proposition 1. Let η be a basis of the central subspace with η^T Σ_X η = I_d, and let β be a p × d_1 matrix with β^T Σ_X β = I_{d_1} and S(β) ⊆ S(η), where d_1 ≤ d. Then V^2(β^T X, Y) ≤ V^2(η^T X, Y). The equality holds if and only if S(β) = S(η).
Proposition 2. Let η be a basis of the central subspace with η^T Σ_X η = I_d, and let β be a p × d_2 matrix with β^T Σ_X β = I_{d_2}. Here d_2 could be larger than, less than, or equal to d. If S(β) is not a subspace of S(η) and P^T_{η(Σ_X)} X ⊥⊥ Q^T_{η(Σ_X)} X, where Q_{η(Σ_X)} = I_p − P_{η(Σ_X)}, then V^2(β^T X, Y) < V^2(η^T X, Y).
Proposition 1 indicates that if S(β) is a subspace of the central subspace S_{Y|X} = S(η), then the DCOV between β^T X and Y is always less than or equal to the DCOV between η^T X and Y, and equality holds if and only if β is also a basis matrix of the central subspace, that is, S(β) = S(η). Proposition 2 indicates that if S(β) is not a subspace of the central subspace, then under a mild condition the DCOV between β^T X and Y is always less than the DCOV between η^T X and Y. Together, the two propositions show that we can always identify the central subspace by maximizing V^2(β^T X, Y) with respect to β under the quadratic constraint. The independence condition P^T_{η(Σ_X)} X ⊥⊥ Q^T_{η(Σ_X)} X in Proposition 2 is not as strong as it seems, and it can be satisfied asymptotically when p is reasonably large. For further discussion of the independence condition, see Sheng and Yin (2013, sec. 3.5).
Propositions 1 and 2 in this article are extensions of proposition 1 in Sheng and Yin (2013). Intuitively, the idea behind the extensions seems straightforward: it is analogous to the RSS in an OLS model, whose value decreases until the correct model is reached and then increases as predictors change, whereas our DCOV values increase until the correct dimension is reached and then decrease as the dimension changes. However, the complicated DCOV index leads to certain difficulties in the proof. We use a novel geometric idea in the proof of Lemma A.1 to overcome the difficulty, and then combine proposition 4.3 in Cook (1998) with the third property in theorem 3 of Székely and Rizzo (2009) to reach the conclusions. We believe this new technique may be useful for other, similarly complicated distance measures.

2.4 ESTIMATING THE CENTRAL SUBSPACE WHEN d IS SPECIFIED
In this section, we propose an algorithm for estimating the central subspace when the structural dimension d is known. Let (X, Y) = {(X_i, Y_i), i = 1, ..., n} be a random sample from (X, Y) and let β be a p × d matrix. The sample version of V^2(β^T X, Y), denoted by V^2_n(β^T X, Y) (Székely, Rizzo, and Bakirov 2007), has the following form:

\[
V_n^2(\beta^{T}\mathbf{X}, \mathbf{Y}) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl},
\]

where, for k, l = 1, ..., n,

\[
A_{kl} = a_{kl} - \bar a_{k\cdot} - \bar a_{\cdot l} + \bar a_{\cdot\cdot}, \qquad a_{kl} = |\beta^{T}X_k - \beta^{T}X_l|, \qquad
\bar a_{k\cdot} = \frac{1}{n}\sum_{l=1}^{n} a_{kl}, \quad \bar a_{\cdot l} = \frac{1}{n}\sum_{k=1}^{n} a_{kl}, \quad \bar a_{\cdot\cdot} = \frac{1}{n^2}\sum_{k,l=1}^{n} a_{kl},
\]

and B_{kl} is defined analogously from b_{kl} = |Y_k − Y_l|. Here |·| is the Euclidean norm in the respective dimension. Let Σ̂_X be the sample version of Σ_X; then an estimated basis matrix of the central subspace, say η_n, is

\[
\eta_n = \arg\max_{\beta:\ \beta^{T}\hat\Sigma_X\beta = I_d} V_n^2(\beta^{T}\mathbf{X}, \mathbf{Y}).
\]

To find such an η_n, we use the sequential quadratic programming method (SQP; Gill, Murray, and Wright 1981, chap. 6) to solve this nonlinear optimization problem. The SQP procedure incorporated in MATLAB can be adopted directly in our algorithm. The outline of our algorithm is similar to the algorithm in Sheng and Yin (2013), hence we omit it here. One main difference is the way the initial value is chosen: in this article, we use SIR, SAVE, and LAD to estimate initial values and choose the one that gives the biggest squared DCOV as the final initial value.
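To make the estimator concrete, here is a minimal Python sketch (our illustration; the authors use MATLAB's SQP routine). The function dcov_sq_n computes V_n^2 via the double-centered distance matrices, and estimate_basis maximizes it under the quadratic constraint using scipy's SLSQP solver, with a user-supplied initial value beta0 (in practice obtained from SIR, SAVE, or LAD as described above).

```python
import numpy as np
from scipy.optimize import minimize

def dcov_sq_n(Z, Y):
    """Sample distance covariance V_n^2(Z, Y) via double-centered distance matrices:
    V_n^2 = (1/n^2) * sum_{k,l} A_kl * B_kl."""
    Z = np.asarray(Z, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    if Y.ndim == 1:
        Y = Y[:, None]
    a = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

def estimate_basis(X, Y, d, beta0):
    """Sketch of the constrained maximization: maximize V_n^2(beta^T X, Y) subject to
    beta^T Sigma_hat beta = I_d. SLSQP (an SQP method) stands in for MATLAB's SQP."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)

    def neg_objective(v):
        beta = v.reshape(p, d)
        return -dcov_sq_n(X @ beta, Y)

    def constraint(v):
        beta = v.reshape(p, d)
        M = beta.T @ Sigma @ beta - np.eye(d)
        return M[np.triu_indices(d)]   # upper triangle of beta^T Sigma beta - I_d must be 0

    res = minimize(neg_objective, np.asarray(beta0, dtype=float).ravel(),
                   method="SLSQP", constraints=[{"type": "eq", "fun": constraint}])
    return res.x.reshape(p, d)
```

This naive sketch recomputes the n × n distance matrices at every iteration, so it conveys only the structure of the algorithm, not an efficient implementation.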
Note that, by the invariance law, we can equivalently work on the standardized predictor Z-scale and then transform back to the X-scale. Indeed, Propositions 1 and 2 hold for the standardized predictor Z. Thus, we write the algorithm on the Z-scale and transform the estimate back to the X-scale afterward; a small sketch of this standardization step is given below. This scheme seems to work well in our simulations and real-data analysis. An alternative procedure suggested by a referee is to use a successive one-at-a-time search similar to that of Yin, Li, and Cook (2008). Indeed, we coded such a procedure, and its performance is overall equivalent to that of the procedure above; hence, we only report the results of the former algorithm. In the next section, we show that the estimator η_n is consistent and asymptotically normal.
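A minimal sketch of the standardization step (assuming numpy; the eigendecomposition route to the inverse square root of Σ̂_X is just one convenient choice):

```python
import numpy as np

def to_z_scale(X):
    """Map X to Z = Sigma_hat^{-1/2} (X - X_bar). A basis B_z estimated on the
    Z-scale transforms back to the X-scale as B_x = Sigma_hat^{-1/2} B_z."""
    Xbar = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric inverse square root
    Z = (X - Xbar) @ Sigma_inv_half
    return Z, Sigma_inv_half

# Example back-transform after estimating a basis on the Z-scale:
# Z, S_ih = to_z_scale(X); B_z = estimate_basis(Z, Y, d, beta0); B_x = S_ih @ B_z
```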

2.5 CONSISTENCY AND ASYMPTOTIC NORMALITY

Proposition 3. Assume η is a basis matrix of the central subspace S_{Y|X} and η_n = arg max_{β^T Σ̂_X β = I_d} V^2_n(β^T X, Y). Then, under the regularity conditions given in the online supplementary file, S(η_n) is a consistent estimator of S_{Y|X}.
In general, the support of X does not have to be compact; however, Yin, Li, and Cook (2008, proposition 11) showed that if one chooses a support of X, say S, that is compact and large enough, then S_{Y|X_S} = S_{Y|X}, where X_S is X restricted to S. We therefore assume the support of X is compact to simplify the proof. The proof of Proposition 3 is given in the online supplementary file. Indeed, we can further prove the √n-consistency and asymptotic normality of the estimator, as stated below. The proof of Proposition 4 is also given in the online supplementary file.
Proposition 4. Let η be a basis matrix of the central subspace and η_n = arg max_{β^T Σ̂_X β = I_d} V^2_n(β^T X, Y). Then, under the regularity conditions given in the online supplementary file, there exists a rotation matrix Q such that √n (η_n − ηQ) is asymptotically normal.

A referee pointed out that, while η is not unique in Proposition 4, the subspace spanned by η is unique; hence, its projection ηη^T is unique (identifiable). Based on Proposition 4, we are able to give the asymptotic normality of η_n η_n^T in the following corollary, whose proof is given in the online supplementary file.
Corollary 1. Let η be a basis matrix of the central subspace and η_n = arg max_{β^T Σ̂_X β = I_d} V^2_n(β^T X, Y). Then, under the same assumptions and conditions as in Proposition 4,

\[
\sqrt{n}\; \mathrm{vec}\bigl(\eta_n\eta_n^{T} - \eta\eta^{T}\bigr) \xrightarrow{\ d\ } N(0, \Lambda),
\]

where Λ is the covariance matrix defined in the online supplementary file.

2.6 ESTIMATING d
In practice, the dimension of the central subspace is unknown and must be estimated. Li (1991, 1992) proposed a sequential test based on a chi-squared statistic. Cook and Yin (2001) proposed a permutation test to determine the structural dimension. The bootstrap method for estimating d has been used by Ye and Weiss (2003) and Zhu and Zeng (2006). In this article, we also adopt the bootstrap method to determine d.
To precisely present the bootstrap idea, we introduce a measure of distance based on Li, Zha, and Chiaromonte (2005):

\[
m(S_1, S_2) = \bigl\| P_{S_1} - P_{S_2} \bigr\|,
\]

where S_1 and S_2 are two d-dimensional subspaces of R^p, P_{S_1} and P_{S_2} are the orthogonal projections onto S_1 and S_2, and ‖·‖ is the maximum singular value of a matrix. The smaller m is, the closer the two subspaces are.
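A direct Python translation of this distance (our sketch, with the subspaces given by basis matrices):

```python
import numpy as np

def subspace_distance(B1, B2):
    """m(S1, S2): maximum singular value of P_{S1} - P_{S2}, where P_S is the
    orthogonal projection onto the column space S of the given basis matrix."""
    P1 = B1 @ np.linalg.pinv(B1)       # projection onto span(B1)
    P2 = B2 @ np.linalg.pinv(B2)       # projection onto span(B2)
    return np.linalg.norm(P1 - P2, 2)  # spectral norm = largest singular value
```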
The basic idea of the bootstrap method is as follows. For each possible working dimension 1 ≤ k ≤ p − 1, we obtain an estimated subspace, say Ŝ_k; we then randomly sample the data with replacement B times and obtain the estimated subspaces based on the bootstrap samples, denoted by Ŝ^b_k, b = 1, ..., B. For each k, we calculate the distances between Ŝ_k and Ŝ^b_k, denoted by m(Ŝ_k, Ŝ^b_k), for b = 1, ..., B. Finally, we use the mean of m(Ŝ_k, Ŝ^b_k), b = 1, ..., B, as the measure of variability for each k, and the k corresponding to the smallest variability is our estimate of d.
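Putting this together, here is a sketch of the bootstrap selection of d (reusing subspace_distance from above; fit_basis(X, Y, k) is a hypothetical placeholder for any basis estimator returning a p × k matrix, for example, the DCOV estimator sketched in Section 2.4):

```python
import numpy as np

def choose_dimension(X, Y, fit_basis, B=200, seed=0):
    """For each working dimension k, compare the full-data estimate with B bootstrap
    estimates via the subspace distance m; return the k with the smallest average m."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    variability = {}
    for k in range(1, p):
        S_hat = fit_basis(X, Y, k)
        dists = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)          # resample cases with replacement
            S_b = fit_basis(X[idx], np.asarray(Y)[idx], k)
            dists.append(subspace_distance(S_hat, S_b))
        variability[k] = float(np.mean(dists))
    return min(variability, key=variability.get), variability
```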
The bootstrap method works well in our setup, partly because of the properties described in Propositions 1 and 2 in Section 2.3: under mild conditions, only when k = d is the maximization of V^2_n(β^T X, Y) with respect to β achieved at the central subspace; thus, when k = d, the variability is the smallest. When k < d, the maximization is achieved at a subspace of the central subspace. Theoretically, this could be any k-dimensional subspace of the central subspace; however, simulation studies show that in some cases a certain, perhaps fixed, subspace of the central subspace is estimated much more frequently than the others, whereas when k = d the unique central subspace is always estimated. This may explain why the variability for k < d is still bigger than that for k = d. When k > d, there are (k − d) directions that are placed randomly in the null space of the central subspace; therefore, the variability increases significantly when k > d.

3. SIMULATION STUDIES
In this section, we compare the performance of our method with some existing dimension-reduction methods, namely the well-known SIR (Li 1991), SAVE (Cook and Weisberg 1991), PHD (Li 1992; Cook 1996), and LAD (Cook and Forzani 2009).
We simulated three different models to illustrate the efficacy of our method, using two configurations, (n, p) = (100, 6) and (500, 20). Let β_1, β_2, and β_3 be p-dimensional vectors whose first six components are (1, 0, 0, 0, 0, 0)^T, (0, 1, 0, 0, 0, 0)^T, and (1, 0.5, 1, 0, 0, 0)^T, respectively; when p = 20, the remaining components of β_1, β_2, and β_3 are taken to be 0. For each configuration, we use M = 100 simulated random datasets. The mean and standard error (SE) of m are reported in the corresponding tables, where m is defined in Section 2.6; the smaller m is, the better the estimator. The three models are built from ε, ε_1, and ε_2, which are independent standard normal random variables. For each model, we generate three different kinds of predictors X = (x_1, x_2, ..., x_p): Part (1), standard normal predictors X ∼ N(0, I_p); Part (2), nonnormal predictors; and Part (3), discrete predictors.

Model A: Simulation results are listed in Table 1. In Model A, the first component is symmetric about 0 in Part (1) and semi-symmetric in Part (2); therefore, SIR will fail to estimate this part, but it favors SAVE and PHD. The second component is monotone, which favors SIR, but SAVE and PHD may miss it. In Part (3), both components have a linear trend, so SIR will do better. LAD does not work when the predictors are discrete, which may be due to the existence of a singular matrix in the LAD algorithm. Overall, DCOV performs consistently better than the other methods.
Model B: This model was previously studied by Chen and Li (1998) and Wang and Xia (2008). The predictors for Part (2) and Part (3) are generated as follows: Part (2), x_i ∼ Uniform(−2, 2), i = 1, ..., p; Part (3), x_i ∼ Binomial(10, 0.1), i = 1, ..., p. Simulation results are listed in Table 2. In Model B, the first component has a rough linear trend, and therefore SIR has the capability to catch it, but SAVE and PHD may fail to, as explained by Chen and Li (1998); both SAVE and PHD failed to recover this direction. LAD performs well under Model B; however, when n and p are not big enough, LAD may fail due to a singular matrix in the algorithm. Overall, DCOV still performs much better than the other methods under Model B.
Model C: Model C again comes from Wang and Xia (2008). The predictors for Part (2) and Part (3) are generated as follows: Part (2), (x_i + 1)/2 ∼ Beta(1.5, 1), i = 1, ..., p; Part (3), x_1, ..., x_5 ∼ Poisson(1), x_6 ∼ Binomial(10, 0.3), x_7, ..., x_p ∼ Poisson(1). Simulation results are listed in Table 3. In Model C, the direction appears in the variance component, and the plot of Y versus β_3^T X has a clear megaphone trend. SIR is capable of finding this direction, as explained by Chen and Li (1998), while SAVE and PHD fail to find it. SIR, LAD, and DCOV all perform well under Model C.
To illustrate the bootstrap method for estimating d, we again consider the three different kinds of predictors described above. For each model, we use (n, p) = (400, 6) with bootstrap size B = 200. The results for Models A, B, and C are summarized in Table 4. It can be seen clearly that the bootstrap method correctly chooses the dimension under the different models with different predictors.

4. AUTO MPG DATA
We use the auto MPG data (http://archive.ics.uci.edu/ml/datasets/Auto+MPG) to illustrate the proposed method. After deleting the missing values, there are 392 cases in the data. The response variable is the auto MPG, and we consider six covariates: the number of cylinders, displacement, horsepower, weight, acceleration, and model year. We treat the covariate "number of cylinders" as a continuous variable, as its values (3, 4, 5, 6, 8) have practical meanings. For the covariate "model year," we slice it into three intervals: 70-73, 74-79, and 80-82. We do not use the covariate "origin," because "origin" and "number of cylinders" are strongly related: on average, cars made in the United States (origin = 1) tend to have more cylinders, cars made in Europe (origin = 2) have fewer cylinders, and Japanese cars (origin = 3) have the fewest cylinders.
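A hypothetical preprocessing sketch in Python/pandas (the file name, column names, and the 1/2/3 coding of the sliced model year are our assumptions for illustration, not part of the article):

```python
import pandas as pd

# Assumed: a pre-cleaned CSV of the UCI auto MPG data with these column names.
df = pd.read_csv("auto_mpg.csv")
cols = ["mpg", "cylinders", "displacement", "horsepower",
        "weight", "acceleration", "model_year", "origin"]
df = df[cols].dropna()                  # 392 complete cases after removing missing values
y = df["mpg"].to_numpy()
X = df.drop(columns=["mpg", "origin"])  # drop origin (strongly related to cylinders)
# slice model year into three intervals, 70-73, 74-79, 80-82, coded 1/2/3 here
X["model_year"] = pd.cut(X["model_year"], bins=[69, 73, 79, 82], labels=[1, 2, 3]).astype(int)
X = X.to_numpy(dtype=float)
```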
The bootstrap method described in Section 2.6 results in Table 5, which suggests that the estimated dimension is two. Figures 1 and 2 show the scatterplots of the response versus the first two DCOV directions, respectively. The first direction indicates a strong linear trend while the second direction shows a clear nonlinear wave pattern.
Furthermore, we compare our method, DCOV, with four competitors: SIR, SAVE, PHD, and LAD. For the given dimension of two, we prefer the estimate with the smallest variability by the bootstrap criterion described in Section 2.6; the same procedure was also used by Zhu, Zhu, and Feng (2010) and Cook and Zhang (2014). We used the bootstrap method to generate 200 datasets, estimated the central subspace based on the bootstrap samples, computed m(Ŝ_{Y|X}, Ŝ^b_{Y|X}), b = 1, ..., 200, and calculated the mean as the measurement. The results reported in Table 6 show that DCOV has the smallest average m; therefore, DCOV is the best method for this dataset, which might be because DCOV is very good at dealing with categorical predictors.
APPENDIX

Proof of Proposition 1. Since S(β) ⊆ S(η) = S_{Y|X} and d_1 ≤ d, there exists a matrix A satisfying β = ηA. Therefore, V^2(β^T X, Y) = V^2(A^T η^T X, Y). Let the singular value decomposition of A be UΛV^T, where U is a d × d orthogonal matrix, V is a d_1 × d_1 orthogonal matrix, and Λ is a d × d_1 diagonal matrix with nonnegative entries on the diagonal; it is easy to show that all the diagonal entries of Λ are 1. According to Székely and Rizzo (2009, theorem 3(ii)), V^2(A^T η^T X, Y) = V^2(VΛ^T U^T η^T X, Y) = V^2(Λ^T U^T η^T X, Y). Let U^T η^T X = (X̃_1, ..., X̃_d)^T. Since all the diagonal entries of Λ are 1, Λ^T U^T η^T X = (X̃_1, ..., X̃_{d_1})^T, and by Lemma A.1, V^2(Λ^T U^T η^T X, Y) ≤ V^2(U^T η^T X, Y). The equality holds if and only if d = d_1. Again by Székely and Rizzo (2009, theorem 3(ii)), V^2(U^T η^T X, Y) = V^2(η^T X, Y). Thus, V^2(β^T X, Y) ≤ V^2(η^T X, Y), and equality holds if and only if S(β) = S(η).

SUPPLEMENTARY MATERIALS
Online supplementary file: The online supplementary file includes the detailed proofs of Proposition 3, Proposition 4, and Corollary 1 in Section 2.5 of the article, and they are properly referred to in the article. (supplement.pdf)

MATLAB code: The dcov.zip file contains the MATLAB code for the DCOV method described in the article. In addition, one simulation example and the code for the real-data analysis are also in the file. The "readme.txt" file describes each MATLAB code file, and "mpg1.txt" contains the auto MPG data. (dcov.zip)