A Data-Adaptive Principal Component Analysis: Use of Composite Asymmetric Huber Function

This article considers a new type of principal component analysis (PCA) that adaptively reflects the information in the data. Ordinary PCA is useful for dimension reduction and for identifying important features of multivariate data. However, it uses only the second moment of the data, and consequently it is not efficient for analyzing real observations that are skewed or asymmetric. To extend the scope of PCA to non-Gaussian data that cannot be well represented by the second moment, a new approach to PCA is proposed. The core of the methodology is to replace the conventional squared-error loss with a composite asymmetric Huber function, defined as a weighted linear combination of modified Huber loss functions. A practical algorithm to implement the data-adaptive PCA is discussed. Results from numerical studies, including a simulation study and a real data analysis, demonstrate the promising empirical properties of the proposed approach. Supplementary materials for this article are available online.


BACKGROUND
Principal component analysis (PCA) is among the most widely used tools for dimension reduction and feature extraction from high-dimensional data. Suppose we have a p × n data matrix M that approximately lies on a low-dimensional subspace. Then the data matrix can be expressed as
$$M = L + Z, \quad (1)$$
where L is a low-rank matrix that represents the subspace and Z is a matrix that models a small noisy perturbation of each entry of L. PCA finds the best rank-q estimate of L in the $\ell_2$ sense, which can be derived from the singular value decomposition (SVD) of the data matrix followed by thresholding of the singular values. Thus, it yields an optimal estimate of the subspace when the entries of Z are iid Gaussian random variables. Following the projection point of view, we consider a "self-contained" regression-type criterion to pose the PCA problem. Let $x = (x_1, \ldots, x_p)^T$ be a p-dimensional random vector with mean vector $\mu$ and covariance matrix $\Sigma_{xx}$. PCA finds a linear approximation to represent the data x as
$$x \approx \mu + A\xi, \quad (2)$$
where A is a p × q matrix and $\xi$ is a q-dimensional vector of parameters given by the linear projection $\xi = B(x - \mu)$ with a q × p weight matrix $B = (b_1, \ldots, b_q)^T$ (q ≤ p). Then PCA can be defined as a solution of the following least-squares criterion,
$$\min_{\mu, A, \xi} E\|x - \mu - A\xi\|^2. \quad (3)$$
For a sample version of (3), given the observations $x_1, \ldots, x_n$, PCA fits the model (2) to the data by minimizing the sum of squared reconstruction errors,
$$\min_{\mu, A, \{\xi_j\}} \sum_{j=1}^{n} \|x_j - \mu - A\xi_j\|^2. \quad (4)$$
The solutions of the optimization above can be summarized as
$$\hat{\mu} = \bar{x}, \quad \hat{A} = V_q = (v_1, \ldots, v_q), \quad \hat{\xi}_j = V_q^T (x_j - \hat{\mu}),$$
where $v_j$ is the p-dimensional eigenvector associated with the jth largest eigenvalue of $\Sigma_{xx}$. Hence, the best rank-q approximation to the data is given by
$$\hat{x}_j = \hat{\mu} + V_q V_q^T (x_j - \hat{\mu}). \quad (5)$$
An interpretation of (5) is that $V_q V_q^T$ is a projection matrix that maps $x_j$ onto the rank-q subspace with minimum squared reconstruction error.
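The least-squares solution above can be sketched numerically. The following is a minimal illustration (hypothetical data, numpy only) of ordinary PCA as centering, SVD, and rank-q projection:

```python
import numpy as np

def pca_rank_q(X, q):
    """Ordinary PCA as the least-squares solution: center the data,
    take the top-q left singular vectors (eigenvectors of the sample
    covariance), and project onto the rank-q subspace."""
    mu = X.mean(axis=1, keepdims=True)          # p x 1 sample mean
    Xc = X - mu                                  # centered data
    # SVD of the centered matrix gives the eigenvectors of the covariance
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Vq = U[:, :q]                                # p x q loading matrix
    scores = Vq.T @ Xc                           # q x n principal components
    Xhat = mu + Vq @ scores                      # best rank-q reconstruction
    return Vq, scores, Xhat

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))
Vq, scores, Xhat = pca_rank_q(X, 2)
# the loadings are orthonormal: V_q^T V_q = I_q
print(np.allclose(Vq.T @ Vq, np.eye(2)))
```

The reconstruction `Xhat` realizes the projection $\hat{\mu} + V_q V_q^T (x_j - \hat{\mu})$ of (5) column by column.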
As one can observe, the conventional PCA explained by the above approach uses only up to the second moment of the data and, hence, may overlook various aspects of the data structure (Caussinus 1992). In the literature, various extensions of PCA have been studied. A popular approach is to replace the conventional covariance matrix with a robust scatter matrix (Campbell 1980; Devlin, Gnanadesikan, and Kettenring 1981). However, this method shows low breakdown points in high dimensions. The projection pursuit method is also used to find directions at which a certain robust criterion is maximized (Li and Chen 1985). Hubert, Rousseeuw, and Vanden Branden (2005) proposed a robust method, named ROBPCA, that combines projection pursuit with robust scatter matrix estimation. However, these existing methods focus on robustness to outliers. More recently, Hubert, Rousseeuw, and Verdonck (2009) proposed a modified ROBPCA for skewed data, termed S-ROB.

MOTIVATION AND MAIN IDEA
Along with the linear projection ("regression-type") formulation of PCA in (3) and (4), Caussinus (1986, 1992) considered a fixed effect model for PCA to obtain a generalized PCA. In our context, the model can be expressed as
$$x = \mu + A\xi + e, \quad (6)$$
where e denotes a vector of independent random errors with E(e) = 0 and a certain covariance matrix, typically I. Tipping and Bishop (1999) also presented a probabilistic model for PCA, of the same form as (6), based on a Gaussian latent variable model that is closely related to the factor model. In addition, they assumed Gaussian distributed errors for statistical inference and further analysis. However, if the structure of the errors cannot be captured by second-order statistics, that is, if the distribution of the errors is skewed or asymmetric, then the linear projection-type PCA may not perform dimension reduction properly. One can conjecture that although the conventional PCA represents the major patterns of the data, some minor patterns might be distorted by the structure of the errors.
As a particular example, we consider summer precipitation obtained by averaging the three summer months (June-July-August) for the years 1979-2007, taken from the Climate Prediction Center merged analysis of precipitation (CMAP) data (Xie and Arkin 1997). To understand regional activity of the precipitation field, we focus on the East Asia region spanning 20°N-50°N and 120°E-150°E, which includes only 169 grid points. We use anomalies of the CMAP data for our analysis; an anomaly is computed by simply subtracting the sample mean from the raw data. We perform the conventional PCA on the precipitation data and obtain the first four leading PCs and the corresponding loading matrices. Note that the cumulative variances explained by the first four PCs are 31.7%, 49.1%, 60.0%, and 68.4%, respectively.
Using the four loading matrices and PCs, we generate a linear approximation g(ξ) that represents the precipitation data. Figure 1(a) shows the first three loading matrices used for generating g(ξ). We now simulate synthetic data x by adding Gaussian distributed errors e to g(ξ), and then apply the conventional PCA and the proposed data-adaptive PCA (DPCA), which will be discussed in Section 2. The resulting loading matrices are displayed in Figures 1(b) and 1(c). As shown, the loading matrices obtained by both PCAs are almost identical to the true loading matrices in Figure 1(a). The results of PCA and DPCA thus appear unaffected by Gaussian distributed noise.
On the other hand, to reflect structured noise, we consider a skewed t-distribution St(ξ = 0, ω = 1, α = 5, df = 3), where ξ is a location parameter, ω is a scale parameter, α is a shape parameter, and df is the degrees of freedom (Azzalini and Capitanio 2014). We apply the conventional PCA and DPCA to the synthetic data contaminated by the skewed t-distributed errors. Figure 2 shows the first three true loading matrices and the loading matrices estimated by PCA and DPCA, respectively. To further assess the quality of the methods, we consider a discrepancy between the true loading matrices and the loading matrices estimated by each PCA,
$$\mathrm{RMSE} = \Big\{\frac{1}{N}\sum_{(i,j)} (v_{ij} - \hat{v}_{ij})^2\Big\}^{1/2},$$
where $v_{ij}$ and $\hat{v}_{ij}$ denote the true loading value and its estimate at grid point (i, j), respectively, and N is the number of grid points. The results are listed in Table 1, where DPCA provides much smaller root mean squared error (RMSE) values than PCA. From this example, we make an important observation: the conventional PCA is sensitive to the noise effect e. More specifically, PCA can easily be distorted by structured errors, while it is not affected by Gaussian distributed errors.
Although we consider only positively skewed errors across observations in the experiment above, it is plausible that real data can be contaminated by various structured errors such as skewed, heavy-tailed, or bimodal ones. The primary goal of this study is to propose a new data-adaptive PCA that does not require prior information about the data; that is, it automatically reflects the distributional features of the data. For this purpose, on the basis of the regression-type view of PCA in (3) and (4), we propose a new PCA by replacing the least-squares loss function in the conventional PCA criterion with a composite asymmetric Huber function, introduced in Section 2. The key components of the proposed method are twofold: (i) PCA can be expressed in a least-squares framework, and (ii) the quadratic loss function is replaced by a weighted linear combination of asymmetric Huber functions. The weights are set to reflect the heavy-tailed, bimodal, or skewed form of the distribution of the data. Hence, the proposed method can be expected to work well for various types of data. Before closing this section, we remark that the conventional PCA focuses on extracting the major features of the data, whereas ROBPCA and S-ROB pay attention to detecting outliers when the data are heavy-tailed or skewed, respectively. As mentioned earlier, the proposed approach is designed to represent the important patterns of the data as well as to identify minor activities properly when the data are contaminated by structured errors. Thus, our approach strikes a middle ground between the two objectives.

COMPOSITE ASYMMETRIC HUBER FUNCTION
We now discuss a connection between skewed distributions and asymmetric Huber-type loss functions, and consider its utility for developing a new PCA.
To represent skewed or bimodal data, we consider the following asymmetric density function,
$$f_{\tau,c}(u; \mu, \sigma) = \frac{C_{\tau,c}}{\sigma} \exp\Big\{-\rho_{\tau,c}\Big(\frac{u - \mu}{\sigma}\Big)\Big\},$$
where $\mu$ and $\sigma$ are location and scale parameters, respectively, $C_{\tau,c}$ is a normalizing constant, and the function $\rho_{\tau,c}(u)$ is given by
$$\rho_{\tau,c}(u) = |\tau - I(u < 0)|\, \rho_c(u), \qquad \rho_c(u) = \begin{cases} u^2, & |u| \le c, \\ 2c|u| - c^2, & |u| > c, \end{cases}$$
with c = 1.345. The function $\rho_{\tau,c}(u)$ with τ = 0.5 is equivalent to the Huber loss function (Huber 1964). Because it introduces the skewness parameter τ, this function can be termed the "asymmetric Huber function." Furthermore, we consider a weighted average of the densities $f_{\tau,c}(u; \mu, \sigma)$ over different τ's. Figure 3(a) shows five density functions $f_{\tau,c}(u; \mu, \sigma)$ with τ = 0.05, 0.25, 0.5, 0.75, 0.95, respectively, and three different weighted averages of $f_{\tau,c}(u; \mu, \sigma)$ are displayed in Figures 3(b)-3(d). The weight for Figure 3(b) is 1 for $f_{\tau,c}(u; \mu, \sigma)$ with τ = 0.1, and all other density functions have weight 0. The uniform weight in Figure 3(c) is set to 1/19 for all 19 density functions, where the skewness parameter varies from τ = 0.05 to τ = 0.95 with equal spacing. The weights in Figure 3(d) are 0 for all density functions except those with τ = 0.5, 0.9, 0.95, whose weights are 0.125, 0.625, 0.25, respectively. As shown, the weighted averages of $f_{\tau,c}(u; \mu, \sigma)$ generate skewed or multi-modal density functions that can reflect various noise distributions. We observe that the weighted average of $f_{\tau,c}(u)$ corresponds to the loss function $\sum_{k=1}^{K} w_k \rho_{\tau_k,c}(u)$, termed the "composite asymmetric Huber function." In other words, given observations, minimization of the composite asymmetric Huber function is equivalent to maximization of the likelihood from the weighted average of $f_{\tau,c}(u)$.
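The asymmetric Huber function can be sketched in a few lines. This is a minimal illustration, assuming the parameterization $\rho_{\tau,c}(u) = |\tau - I(u<0)|\,\rho_c(u)$ with $\rho_c$ a Huber-type function that is quadratic near zero and linear in the tails:

```python
import numpy as np

def huber(u, c=1.345):
    """Huber-type function: quadratic for |u| <= c, linear beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= c, u**2, 2*c*np.abs(u) - c**2)

def asym_huber(u, tau, c=1.345):
    """Asymmetric Huber function: the Huber function tilted by the
    skewness parameter tau, in the spirit of the quantile check loss."""
    u = np.asarray(u, dtype=float)
    return np.abs(tau - (u < 0)) * huber(u, c)

# tau = 0.5 recovers the Huber function up to a constant factor of 1/2
u = np.linspace(-3, 3, 7)
print(np.allclose(asym_huber(u, 0.5), 0.5 * huber(u)))
```

With τ > 0.5 positive residuals are penalized more heavily than negative ones of the same magnitude, which is what allows the loss to track skewness.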

PCA BASED ON COMPOSITE ASYMMETRIC HUBER FUNCTION
Here, we propose a new PCA that produces data-adaptive loadings ("regression coefficients") under the regression-type optimization view of PCA. For this purpose, we consider the following optimization problem,
$$\min_{\mu, A, \xi} E\Big[\sum_{k=1}^{K} w_k \|x - \mu - A\xi\|_{\rho_{\tau_k}}\Big] \quad \text{subject to } A^T A = I_q, \quad (7)$$
where $\|u\|_{\rho_{\tau_k}} := \sum_{i=1}^{p} \rho_{\tau_k,c}(u_i)$ with $u = (u_1, \ldots, u_p)^T$, and the weights $w_1, \ldots, w_K$ are positive constants summing to 1 ($\sum_k w_k = 1$) that control the contribution of each function $\rho_{\tau_k,c}$. Employing the composite asymmetric Huber function makes it possible to reflect distributional features of the data, such as skewness, as well as to bound outliers effectively, compared to using a single skewed function.
We now consider an empirical version of (7) with n independent random vectors $x_1, \ldots, x_n$. The data-adaptive loadings of q principal components can then be expressed as the solution of
$$\min_{\mu, A, \{\xi_j\}} \sum_{j=1}^{n} \sum_{k=1}^{K} w_{j,k} \|x_j - \mu - A\xi_j\|_{\rho_{\tau_k}} \quad \text{subject to } A^T A = I_q. \quad (8)$$
The above criterion can be rewritten as
$$\min_{\mu, A, \{\xi_j\}} \sum_{j=1}^{n} \sum_{i=1}^{p} \sum_{k=1}^{K} w_{j,k}\, \rho_{\tau_k,c}\big((x_j - \mu - A\xi_j)_i\big),$$
where $(u)_i$ denotes the ith element of u. As one can see from the criterion, the weights $\{w_{j,k}\}_{k=1}^{K}$ for a particular observation vector $x_j$ control the contributions of the functions $\{\rho_{\tau_k,c}(\cdot)\}_{k=1}^{K}$. We point out a distinction between the above criterion and a criterion based on a single weighted skewness parameter $\tau^* = \sum_{k=1}^{K} w_k \tau_k$. For simplicity of discussion, assume that all weights are 0 except $w_{k_1} = w_{k_2} = 1/2$. Then the criterion with the weighted skewness parameter (not the composite one) uses the single loss
$$\rho_{(\tau_{k_1} + \tau_{k_2})/2,\, c}(u),$$
which reflects only the average skewness of $\tau_{k_1}$ and $\tau_{k_2}$. On the other hand, the cost function for the proposed composite criterion is
$$\tfrac{1}{2}\rho_{\tau_{k_1},c}(u) + \tfrac{1}{2}\rho_{\tau_{k_2},c}(u),$$
and hence the proposed criterion can reflect the skewness information of both $\tau_{k_1}$ and $\tau_{k_2}$.
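Evaluating the composite asymmetric Huber function on a residual vector is straightforward. A minimal sketch (assuming the tilted-Huber parameterization above; the helper `composite_loss` is our own name, not from the paper):

```python
import numpy as np

def asym_huber(u, tau, c=1.345):
    # asymmetric Huber: tau-tilted Huber function (a sketch of rho_{tau,c})
    u = np.asarray(u, dtype=float)
    hub = np.where(np.abs(u) <= c, u**2, 2*c*np.abs(u) - c**2)
    return np.abs(tau - (u < 0)) * hub

def composite_loss(u, taus, weights, c=1.345):
    """Composite asymmetric Huber function:
    sum_k w_k * sum_i rho_{tau_k, c}(u_i)."""
    return sum(w * asym_huber(u, t, c).sum() for t, w in zip(taus, weights))

taus = np.linspace(0.05, 0.95, 19)          # K = 19 equally spaced levels
weights = np.full(19, 1/19)                 # uniform weights
resid = np.array([-1.2, 0.3, 2.5, -0.1])
print(composite_loss(resid, taus, weights))
```

Note that with uniform weights over levels symmetric about τ = 0.5 the composite loss is symmetric in the residual sign; skewness only enters through non-uniform weights.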
We note that the number K must be predetermined to implement the method. The function $\rho_{\tau_k,c}(\cdot)$ is almost identical to the quantile check function when c is sufficiently small. Thus, the choice of quantile levels in the setting of composite quantile regression (CQR) is, in some sense, similar to our setting. Zou and Yuan (2008) showed that a large K leads to an oracle-like estimator in the variable selection framework under some conditions, including the assumption that the errors follow a symmetric distribution. More recently, Kai, Li, and Zou (2010) suggested that a substantial gain in efficiency of the local CQR estimator can be obtained with a relatively small K such as K = 9. Computationally, the cost increases dramatically with the number of quantile levels. In this study, we set K = 19 with equally spaced skewness parameters, $\tau_1 = 0.05, \ldots, \tau_{19} = 0.95$. As an objective way of selecting K, one could consider a data-adaptive method such as cross-validation; this topic is left for future study.
For the weights of the composite asymmetric Huber functions, it is important to obtain optimal weights that reflect the distribution (or skewness) information of the data well. The ideal weight vector is one that places high values on the meaningful skewness parameters and near-zero values on the meaningless ones. A simple choice is uniform weights, which yield the criterion
$$\min_{\mu, A, \{\xi_j\}} \sum_{j=1}^{n} \sum_{k=1}^{K} \|x_j - \mu - A\xi_j\|_{\rho_{\tau_k}} \quad \text{subject to } A^T A = I_q.$$
In this study, for better performance, we consider a K-dimensional (marginal) weight vector $w_j = (w_{j,1}, \ldots, w_{j,K})^T$ for each observation vector $x_j = (x_{1j}, \ldots, x_{pj})^T$, $j = 1, \ldots, n$, constructed from $f_j$ and $F_j$, the marginal probability density function and cumulative distribution function of the observation $x_j$, respectively. In practice, we estimate $f_j$ and $F_j$ by kernel density estimation,
$$\hat{f}_j(u) = \frac{1}{ph} \sum_{i=1}^{p} K\Big(\frac{u - x_{ij}}{h}\Big),$$
where $K(\cdot)$ is a Gaussian kernel function and h denotes a bandwidth.
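The weight construction can be sketched as follows. This is an assumption, not the paper's exact formula: we weight each skewness level $\tau_k$ in proportion to the estimated marginal density at the corresponding empirical quantile, so that levels where the data concentrate receive large weights, matching the "high values on the meaningful skewness parameters" description above:

```python
import numpy as np
from scipy.stats import gaussian_kde

def adaptive_weights(x, taus):
    """Sketch of data-adaptive weights (hypothetical construction):
    weight tau_k in proportion to the Gaussian-kernel density estimate
    evaluated at the empirical tau_k-quantile of the observation vector."""
    kde = gaussian_kde(x)                      # Gaussian-kernel density estimate
    q = np.quantile(x, taus)                   # empirical tau_k-quantiles
    w = kde(q)
    return w / w.sum()                         # normalize to sum to 1

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, size=500)             # positively skewed sample
taus = np.linspace(0.05, 0.95, 19)
w = adaptive_weights(x, taus)
print(w.round(3))
```

For the positively skewed sample, the low-τ levels (where most mass lies) receive larger weights than the far-right tail levels.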

PSEUDO DATA-BASED DATA-ADAPTIVE PCA
Here, we introduce a pseudo data approach and show that the solution of (8) can be interpreted as a conventional PCA of the pseudo data. The pseudo data were introduced by Huber (1964), and used by Cox (1983) for M-type smoothing splines and by Oh, Nychka, and Lee (2007) for penalized robust smoothing including wavelet shrinkage.
We modify the pseudo data for PCA as
$$\tilde{x}_j = g(\xi_j) + \frac{1}{2} \sum_{k=1}^{K} w_{j,k}\, \psi_{\tau_k,c}\big(x_j - g(\xi_j)\big), \quad (9)$$
where $\psi_{\tau_k,c}(u) := (\psi_{\tau_k,c}(u_1), \ldots, \psi_{\tau_k,c}(u_p))^T$ with $u = (u_1, \ldots, u_p)^T$ and $\psi_{\tau_k,c}(u) = \rho'_{\tau_k,c}(u)$. The least-squares optimization of (4) for PCA, coupled with the pseudo data, becomes
$$\min_{\mu, A, \{\xi_j\}} \sum_{j=1}^{n} \|\tilde{x}_j - g(\xi_j)\|^2, \quad (10)$$
where $\tilde{x}_j = (\tilde{x}_{1j}, \ldots, \tilde{x}_{pj})^T$ and $g(\xi_j) = \mu + A\xi_j = (g_{1j}, \ldots, g_{pj})^T$. The above solution can be characterized by its score equations. Let $\Psi(g)$ be the gradient of the least-squares criterion in (10). Then one obtains
$$\Psi_{ij}(g) = -2\big(\tilde{x}_{ij} - g_{ij}\big), \quad (11)$$
and the PCA solution $\hat{g}$ is given by $\Psi(\hat{g}) = 0$. We restate the optimization for the data-adaptive PCA in (8) as
$$\min_{g} \sum_{j=1}^{n} \sum_{k=1}^{K} w_{j,k} \|x_j - g(\xi_j)\|_{\rho_{\tau_k}}. \quad (12)$$
Then it follows, from the definition of the pseudo data, that the score equations of (11) can be expressed as
$$\Psi_{ij}(g) = -\sum_{k=1}^{K} w_{j,k}\, \psi_{\tau_k,c}\big(x_{ij} - g_{ij}\big),$$
which are the component-wise score equations of $\Lambda(g)$, the gradient of the criterion in (12). The solution of the data-adaptive PCA is thus the root of $\Lambda(\hat{g}) = 0$. An important question we now pose is whether the minimization of (8) is equivalent to the minimization of the conventional PCA criterion with pseudo data $\tilde{x}$,
$$\min_{\mu, A, \{\xi_j\}} \sum_{j=1}^{n} \|\tilde{x}_j - \mu - A\xi_j\|^2. \quad (13)$$
To answer this question, we investigate some theoretical results. Let the q × n matrix $\tilde{\xi} = (\tilde{\xi}_1, \ldots, \tilde{\xi}_n)$ with $\tilde{\xi}_j = (\tilde{\xi}_{1j}, \ldots, \tilde{\xi}_{qj})^T$ be a solution of (13). We then show that the solution $\tilde{\xi}$ is asymptotically equivalent to $\hat{\xi} = (\hat{\xi}_1, \ldots, \hat{\xi}_n)$ with $\hat{\xi}_j = (\hat{\xi}_{1j}, \ldots, \hat{\xi}_{qj})^T$, the minimizer of (8).

Theorem 1. Assume that the diagonal elements of the projection matrix $\Gamma = A(A^T A)^{-1} A^T$ are uniformly small, that is, $\max_{1 \le i \le p} \gamma_{ii} = \epsilon_{n,p} \ll 1$. Under the further assumption that $n\,\epsilon_{n,p}\, q^2 \to 0$, we have
$$\max_{t,j} \big|\hat{\xi}_{tj} - \tilde{\xi}_{tj}\big| \to 0 \text{ in probability},$$
where $\hat{\xi}_{tj}$ is the (t, j)th component of $\hat{\xi}$, and $\tilde{\xi}_{tj}$ is the (t, j)th component of $\tilde{\xi}$.
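The pseudo-data transform itself is a one-line update given the current fit. A sketch under our assumed conventions (tilted-Huber $\rho$ with quadratic part $u^2$, so that $\psi = \rho'$ and the half-step below matches the least-squares score equations):

```python
import numpy as np

def psi(u, tau, c=1.345):
    """Derivative of the asymmetric Huber function (sketch):
    psi_{tau,c}(u) = d/du [ |tau - 1(u<0)| * rho_c(u) ]."""
    u = np.asarray(u, dtype=float)
    dhub = np.where(np.abs(u) <= c, 2*u, 2*c*np.sign(u))
    return np.abs(tau - (u < 0)) * dhub

def pseudo_data(x, g, taus, weights, c=1.345):
    """Pseudo-data transform (assumed form): the current fit g plus a
    half-step of the weighted score, so least squares on the pseudo data
    reproduces the composite-Huber score equations."""
    r = x - g                                   # residual from current fit
    step = sum(w * psi(r, t, c) for t, w in zip(taus, weights))
    return g + 0.5 * step

# large residuals are bounded by the psi function, small ones pass through
print(pseudo_data(np.array([1.0, 2.0, 50.0]), np.zeros(3), [0.5], [1.0]))
```

Note how the outlying entry 50.0 is pulled back toward the fit, which is exactly the robustifying effect the pseudo data are designed to have.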
Theorem 1 implies that the solutions $\hat{\xi}$ and $\tilde{\xi}$ share the same asymptotic properties. In fact, the above result can be considered an extension of Huber (1973) to the PCA framework. A detailed proof can be found in the online supplement.

PRACTICAL ALGORITHM
Here, we develop a practical algorithm based on the pseudo data. Ideally, if one knew A, one could form the pseudo data and apply the least-squares method. In practice, the pseudo data are not available since A and B are unknown. Thus, we consider a fixed-point analogy to the pseudo data, the "empirical pseudo data," defined as
$$\tilde{x}_j = \hat{\mu} + \hat{A}\hat{\xi}_j + \frac{1}{2} \sum_{k=1}^{K} \hat{w}_{j,k}\, \psi_{\tau_k,c}\big(x_j - \hat{\mu} - \hat{A}\hat{\xi}_j\big),$$
where $\hat{\mu}$, $\hat{A}$, and $\hat{\xi}$ are computed by the following iterative algorithm. Suppose that we observe a p × n data matrix $X = (x_1, \ldots, x_n)$ and an initial rank-m weight matrix $\hat{B}^{(0)}$.

1. Regress X on the rank-m subspace $\hat{B}^{(0)}X$ and obtain a p × m coefficient matrix $\hat{A}^{(1)}$.
2. Form the empirical pseudo data matrix $\tilde{X}^{(1)}$ from the current fit.
3. Update the weight matrix from $\tilde{X}^{(1)}$ and repeat the previous steps until convergence.

The first q eigenvectors and eigenvalues of the covariance matrix of the converged matrix $\tilde{X}^{(L)}$ are our final estimators. Here, the number of principal components q is chosen to explain more than 80% of the total variance. A few remarks are in order.
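The iteration can be sketched compactly. This is a minimal illustration under stated assumptions (uniform weights shared across observations, a fixed iteration count, and ordinary SVD-based PCA as the inner least-squares step), not the paper's exact implementation:

```python
import numpy as np

def psi(u, tau, c=1.345):
    # derivative of the asymmetric Huber function (sketch)
    u = np.asarray(u, dtype=float)
    dhub = np.where(np.abs(u) <= c, 2*u, 2*c*np.sign(u))
    return np.abs(tau - (u < 0)) * dhub

def dpca(X, q, taus, weights, n_iter=20):
    """Sketch of the pseudo-data fixed-point iteration: alternate
    (a) ordinary PCA on the current pseudo data and
    (b) refreshing the pseudo data from the rank-q fit of the raw data."""
    Z = X.copy()                               # pseudo data, initialized at X
    for _ in range(n_iter):
        mu = Z.mean(axis=1, keepdims=True)
        U, _, _ = np.linalg.svd(Z - mu, full_matrices=False)
        Vq = U[:, :q]                          # current rank-q loadings
        G = mu + Vq @ (Vq.T @ (X - mu))        # rank-q fit of the raw data
        R = X - G                              # residuals
        Z = G + 0.5 * sum(w * psi(R, t) for t, w in zip(taus, weights))
    return Vq, Z

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 50))
taus = np.linspace(0.05, 0.95, 19)
Vq, Z = dpca(X, 3, taus, np.full(19, 1/19))
print(Vq.shape, Z.shape)
```

In practice one would replace the fixed iteration count with a convergence check on successive pseudo-data matrices and plug in the per-observation adaptive weights.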

The effect of pseudo data transformation:
To evaluate the effect of the pseudo-data transformation, we generate positively (negatively) skewed data and compute the converged pseudo data $z^{(L)}$ from the above iterative algorithm. Figure 4(a) shows a histogram of the positively skewed original data together with, as a contrast, a normal density (red line) with the same mean and variance. The converged pseudo data $z^{(L)}$ of the positively skewed data are displayed in Figure 4(b). As shown, the distribution of the transformed pseudo data tends to be close to a bell-shaped distribution. Figures 4(c) and 4(d) show similar results for the negatively skewed data.

Theorem 2. Let $\tilde{X}^{(L)}$ denote the solution of the proposed algorithm at stage L. Under the assumption that $\psi_{\tau_k,c}(u) = \rho'_{\tau_k,c}(u)$ is continuous and $\psi'_{\tau_k,c}(u) \le 2$ almost everywhere, the sequence $\tilde{X}^{(L)}$ converges as $L \to \infty$.
A proof of Theorem 2 can be found in the online supplement. For further analysis, letting $\mathcal{G}$ be the set of p × n matrices of rank q, we define a stationary point of the function R(G) as a point $G \in \mathcal{G}$ at which the gradient vector is zero. We further assume two conditions on a subset $\mathcal{G}_0$ ($\subset \mathcal{G}$) and the function R, under which Proposition 1 characterizes the limit of the algorithm as a stationary point of R. A proof of Proposition 1 can be found in the online supplement.

PRACTICAL PERFORMANCE
In this section, simulation studies and a real data analysis are conducted to evaluate the practical performance of the proposed methods. Throughout this section, we compare the proposed data-adaptive PCA (DPCA) with existing methods: the ordinary PCA, ROBPCA, and S-ROB. The R code used to implement the methods and to carry out all the experiments is available at http://stat.snu.ac.kr/heeseok/dpca so that the results can be reproduced.

A SIMULATED EXAMPLE
We consider three orthonormal sine functions as true eigenvectors, shown in Figure 5(a), where each curve has length 200 (Bailey 2012). Thus, the dimension of the loading matrix A is 200 × 3. To simulate 30 data vectors of length 200, we generate a 3 × 30 dimensional principal component matrix from a multivariate normal distribution MVN(μ, Σ) with $\mu = (0, \ldots, 0)^T$ and $\Sigma = \mathrm{diag}(1, 25, 50)$, and obtain a 200 × 30 dimensional low-rank matrix L of the model (1). Note that each column of L represents a noiseless data vector g. Figure 5(b) shows five of the 30 projected data vectors. We finally generate a noisy data matrix M by adding a noise component; that is, each column of M is $x_i = g_i + e_i$, $i = 1, \ldots, 30$, where the noise term follows one of two cases: (i) normal noise, $e_i \sim N(\mu, \sigma^2)$, and (ii) skewed t-distributed noise, $e_i \sim St(\xi, \omega, \alpha, df)$ with location parameter ξ, scale parameter ω, shape parameter α, and degrees of freedom df. The corresponding five noisy data vectors for the two types of noise are displayed in Figures 5(c) and 5(d). Detailed descriptions of the skewed distributions can be found in Azzalini and Capitanio (2014). In the simulation, we set μ = 0 and ξ = 0 for all cases. For the skewed case, the shape parameter α is set to 15 for the positively skewed distribution or −15 for the negatively skewed one, and the degrees of freedom are set as df = 3. Note that, in the skewed t-distribution, the mean is $\xi + \omega\delta\sqrt{df/\pi}\,\Gamma((df-1)/2)/\Gamma(df/2)$ with $\delta = \alpha/\sqrt{1+\alpha^2}$. For our simulation, we consider two cases for ω and σ: ω = σ = 8 and ω = σ = 11.
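The simulated data can be generated with a short script. A sketch using the standard construction of Azzalini's skewed t (a skew-normal variate divided by the square root of a scaled chi-square); the helper name `skew_t` and the use of scipy's `skewnorm` are our choices, not the paper's code:

```python
import numpy as np
from scipy.stats import skewnorm, chi2

def skew_t(xi, omega, alpha, df, size, rng):
    """Draw from Azzalini's skewed t-distribution via the standard
    construction: skew-normal / sqrt(chi-square / df), then located
    and scaled."""
    w = skewnorm.rvs(alpha, size=size, random_state=rng)
    v = chi2.rvs(df, size=size, random_state=rng)
    return xi + omega * w / np.sqrt(v / df)

rng = np.random.default_rng(4)
# rank-3 signal: orthonormal loadings times MVN(0, diag(1, 25, 50)) scores
A = np.linalg.qr(rng.normal(size=(200, 3)))[0]           # 200 x 3 loadings
scores = rng.multivariate_normal(np.zeros(3), np.diag([1.0, 25.0, 50.0]), 30).T
L = A @ scores                                            # 200 x 30 low-rank matrix
# noisy data: positively skewed t noise with omega = 8, alpha = 15, df = 3
M = L + skew_t(0.0, 8.0, 15.0, 3, L.shape, rng)
print(M.shape)
```

(Here random sine-free orthonormal loadings stand in for the three sine eigenvectors, which suffices to exercise the noise model.)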
To compare the PCA methods, we evaluate a measure of discrepancy between the three true eigenvectors and the corresponding estimated eigenvectors,
$$\mathrm{RMSE}(v_j) = \|v_j - \hat{v}_j\|_2, \quad j = 1, 2, 3,$$
where $v_j$ denotes the p-dimensional eigenvector associated with the jth largest eigenvalue. Since the dimension of the subspace is 3, we consider only the first three eigenvectors. In addition, we consider a reconstruction error (RE),
$$\mathrm{RE}(\hat{g}) = \Big\{\frac{1}{np} \sum_{i=1}^{p} \sum_{j=1}^{n} (g_{ij} - \hat{g}_{ij})^2\Big\}^{1/2},$$
where $g_{ij}$ is the (i, j)th component of the noiseless data and $\hat{g}_{ij}$ is its reconstruction by a PCA method based on up to three principal components. For each combination of noise scenario (normal, positively skewed α = 15, negatively skewed α = −15), σ, and ω, we generate 200 sets of samples. For each simulated dataset, the PCA methods are applied to obtain principal components, and then RMSE(v) and RE(ĝ) are computed; the results are listed in Tables 2 and 3. We observe that (i) the results of DPCA are comparable to those of PCA under normal noise, (ii) the ROBPCA, S-ROB, and DPCA approaches outperform PCA for skewed noisy data, and (iii) in all cases, DPCA either gives results identical to PCA or is superior to ROBPCA and S-ROB. It appears that the third PCs are distorted by the noise components, so that the RE values increase when the third PCs are used for reconstruction.
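The eigenvector discrepancy is simple to compute, with one subtlety worth making explicit: eigenvectors are identified only up to sign, so a sensible comparison aligns signs first (an implementation detail not spelled out in the text; we assume it here):

```python
import numpy as np

def rmse_eigvec(v_true, v_est):
    """Discrepancy ||v_j - v_hat_j||_2 between a true eigenvector and its
    estimate, after aligning the arbitrary sign of the estimate."""
    v_true = np.asarray(v_true, dtype=float)
    v_est = np.asarray(v_est, dtype=float)
    if v_true @ v_est < 0:                    # flip to the matching sign
        v_est = -v_est
    return np.linalg.norm(v_true - v_est)

v = np.array([1.0, 0.0, 0.0])
print(rmse_eigvec(v, -v))    # a pure sign flip gives zero discrepancy
```

Without the sign alignment, a perfectly estimated but sign-flipped eigenvector would incorrectly report a discrepancy of $2\|v_j\|_2$.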

REAL DATA ANALYSIS
We consider the daily maximum precipitations in August from CMAP during the period 1997 to 2008 (n = 12). These are the maximum precipitations on the 360 × 180 grid that covers the entire globe at a 1° interval. We focus on the East Asia region covering 30°N-50°N and 120°E-140°E, and hence the number of variables is p = 441. We apply PCA, ROBPCA, S-ROB, and DPCA to the maximum precipitation data, and obtain the loading matrices of the four leading PCs. As shown in Figure 6, the loadings obtained by DPCA are different from those of the conventional PCA, especially in the third and fourth PCs. The main features of the maximum precipitation data are explained by the first and second PCs, and the differences in noise treatment between the PCA methods are represented in the third and fourth PCs. We note that the four PCs explain almost 90% of the variance, and the data satisfy the assumption of Theorem 1 that the diagonal elements of $\Gamma = A(A^T A)^{-1} A^T$ are uniformly small. For the analysis, we compute the one-year-out cross-validated prediction by several PCA methods as follows: (i) a particular year t* is excluded from the total n years; (ii) a PCA method is applied to the remaining (n − 1) × p dimensional data to obtain the p × q dimensional eigenvector matrix A; (iii) the principal components for prediction are computed as $\xi = A^T x_{t^*}$ with the excluded year's data $x_{t^*}$; (iv) the prediction values are obtained as $\hat{x}_{t^*} = A\xi$.
Then, we compute the prediction error of each method, $\mathrm{PE}(\hat{x}) = \sum_{t^*=1}^{n} \|x_{t^*} - \hat{x}_{t^*}\|^2$. The prediction errors are computed sequentially using up to four PCs. Figure 7 shows the prediction errors over years computed by the different methods. As one can see, for all years the proposed DPCA produces the smallest PE values, while the other PCA methods give similar results to one another. We remark that ROBPCA and S-ROB are suitable for identifying outliers, but they provide almost identical results to those of PCA in this prediction problem; in particular, the prediction results from PCA are almost identical to those from ROBPCA.

Figure 6. From the first row to the fourth row, the first four loading matrices (from left to right) of PCs for regional daily maximum precipitations by PCA, ROBPCA, S-ROB, and DPCA, respectively.
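The one-year-out procedure in steps (i)-(iv) together with the PE criterion can be sketched directly. A minimal illustration with ordinary PCA as the stand-in method and synthetic data (we also center with the training-year mean, an assumption added to make the projection step well posed):

```python
import numpy as np

def loo_prediction_error(X, q):
    """One-year-out cross-validated prediction error: hold out year t*,
    fit loadings A on the remaining years, project the held-out year,
    reconstruct it, and accumulate squared errors."""
    n, p = X.shape                             # years x grid points
    pe = 0.0
    for t in range(n):
        rest = np.delete(X, t, axis=0)         # (i) exclude year t*
        mu = rest.mean(axis=0)
        # (ii) loadings: top-q eigenvectors of the remaining years' covariance
        U, _, _ = np.linalg.svd((rest - mu).T, full_matrices=False)
        A = U[:, :q]                           # p x q eigenvector matrix
        xi = A.T @ (X[t] - mu)                 # (iii) scores for held-out year
        xhat = mu + A @ xi                     # (iv) predicted year t*
        pe += np.sum((X[t] - xhat) ** 2)
    return pe

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 40))                  # n = 12 years, p = 40 grid points
print(loo_prediction_error(X, 4))
```

Because the fitted subspaces are nested in q, the held-out prediction error computed this way cannot increase as more PCs are used.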

CONCLUDING REMARKS
In this article, we have proposed a new data-adaptive PCA method that represents skewed or heavy-tailed distributed data, which may not be well reflected by the ordinary PCA. We believe that this is the main contribution of the article and by itself is an important step. We have found it useful to focus on a composite asymmetric Huber function that replaces the conventional least-square loss function. Furthermore, by introducing the concept of pseudo data, the proposed data-adaptive PCA method can be interpreted as another version of the ordinary PCA, and a practical algorithm based on the pseudo data has been developed. We have also discussed some theoretical properties related to the algorithm such as the convergence property.
Finally, the problems we raise here can also be analyzed by independent component analysis (ICA), which considers higher-order statistics of the data (Comon 1994; Hyvarinen and Oja 2000). As with PCA, by constraining the number of independent components (ICs), ICA can be used to carry out dimension reduction while retaining useful information. Figure 8 shows the patterns obtained from both methods, ICA and DPCA, and we observe differences between them. These differences may occur because ICA can identify neither a correct ordering of the ICs nor their proper scaling (including sign). Moreover, the ICA method was developed to find hidden signals from mixed ones, and hence it may be difficult to interpret the extracted patterns as we do in PCA methods. On the other hand, the proposed methods can be understood within the PCA framework, which makes them easy to interpret.

SUPPLEMENTARY MATERIALS
Supplementary materials: Proofs of Theorems 1 and 2 and Proposition 1.