Supervised Sparse and Functional Principal Component Analysis

Principal component analysis (PCA) is an important tool for dimension reduction in multivariate analysis. Regularized PCA methods, such as sparse PCA and functional PCA, have been developed to incorporate special features in many real applications. Sometimes additional variables (referred to as supervision) are measured on the same set of samples, which can potentially drive low-rank structures of the primary data of interest. Classical PCA methods cannot make use of such supervision data. In this article, we propose a supervised sparse and functional principal component (SupSFPC) framework that can incorporate supervision information to recover underlying structures that are more interpretable. The framework unifies and generalizes several existing methods and flexibly adapts to the practical scenarios at hand. The SupSFPC model is formulated in a hierarchical fashion using latent variables. We develop an efficient modified expectation-maximization (EM) algorithm for parameter estimation. We also implement fast data-driven procedures for tuning parameter selection. Our comprehensive simulation and real data examples demonstrate the advantages of SupSFPC. Supplementary materials for this article are available online.


INTRODUCTION
Principal component analysis (PCA) has been widely used in multivariate analysis to extract important features in data. Principal component (PC) loadings usually provide useful interpretation of major variations, while PC scores facilitate follow-up statistical analyses such as clustering and regression. It is a powerful tool for dimension reduction, pattern recognition, and visualization for big data.
This article concerns regularized PCA methods, which impose useful structural regularization on PCA, and have been extensively studied in the literature. Special structures like sparsity and smoothness are imposed on the loading vectors to model high-dimensional data with complex structure. For example, functional PCA is used to model functional observations such as temporal data or spatial data (see Rice and Silverman 1991; Silverman et al. 1996; Huang, Shen, and Buja 2008, and references therein). In high-dimensional situations where most variables are noise and only a few variables are important, sparse PCA is used to simultaneously select variables and capture major variations (see Zou, Hastie, and Tibshirani 2006; Shen and Huang 2008; d'Aspremont, Bach, and Ghaoui 2008, and references therein). More recently, some researchers studied two-way extensions of the above one-way regularized PCA methods (see, e.g., Huang, Shen, and Buja 2009; Lee et al. 2010; Allen 2013).
Although powerful, the above regularized PCA methods have one limitation in common: they only make use of a single dataset, and by default ignore any other measurements collected on the same set of samples. It is now increasingly common that multiple related datasets are available on the same set of samples. In such cases, borrowing information across datasets may lead to recovery of a more interpretable low-rank structure. This is especially relevant when the additional measurements, referred to as supervision information, can potentially drive underlying patterns within the primary data. For example, in Section 5, we are interested in studying expression patterns of a number of yeast genes over two cell cycles. In addition to the gene expression data, we have extra binding information of transcription factors (TFs) for each gene. Since TFs regulate gene expression biologically (Nikolov and Burley 1997; Lee and Young 2000), using TF binding information as supervision when studying expression patterns can lead to a more intrinsic and meaningful discovery. Another motivating example, considered in Section D of the online supplement, concerns daily arrival rates of patients to a hospital emergency room. It is of interest to understand patient arrival patterns to better allocate medical resources. In addition to the primary data, that is, the arrival rates at different times of day over many days, we want to use the day-of-week index as supervision to extract day-of-week specific arrival patterns.
Motivated by these applications, in this article, we develop a supervised regularized PCA framework that makes use of extra supervision information when doing regularized PCA. We name it the supervised sparse and functional PCA, or SupSFPC. Supervision, subject to variable selection, directly affects the PC scores, while smooth and sparse structures are imposed on the PC loadings. The SupSFPC framework is very general and flexible. It unifies and generalizes many variants of PCA. In particular, without the supervision, it encompasses regularized PCA methods such as functional PCA and sparse PCA as special cases. Supervision and regularization complement each other under SupSFPC. By smoothing the loading vectors, our method can borrow strength across neighboring variables to reduce noise; with sparsity, the variation of the functional estimate is reduced; supervision indirectly affects the loading vectors to make them more interpretable. Overall, the proposed SupSFPC method can recover an interpretable and accurate low-rank approximation of a primary dataset with potential guidance from supervision data.
Recently, Li et al. (in press) studied a supervised version of the standard PCA. However, their method cannot accommodate special features of functional or high-dimensional data. Incorporating smoothness and sparsity in such data reduces estimation variability and improves interpretability. Furthermore, their method cannot achieve variable selection of the supervision set: when auxiliary data contain information irrelevant to the low-rank structure of the primary data, it is desirable to eliminate unimportant variables and identify crucial driving factors. For example, in the yeast gene expression application of Section 5, researchers are also interested in identifying TFs that regulate cell cycles. The SupSFPC method addresses the above problems through regularization. The regularization brings new challenges for model fitting, such as optimization of a nondifferentiable objective function and selection of multiple tuning parameters. We develop an innovative algorithm for SupSFPC by combining an expectation-maximization (EM) algorithm with several ascent algorithms. We further alleviate the computational burden by embedding tuning parameter selection in the iterative scheme. Numerical results show high computational efficiency and improvement in interpretability over existing methods.
The rest of the article is organized as follows. In Section 2, after reviewing the functional PCA model, we propose our new SupSFPC model, followed by the penalized likelihood framework. We then elaborate on connections of the SupSFPC framework to various regularized PCA and supervised PCA methods. In Section 3, we develop a computationally efficient algorithm to estimate the model parameters, and briefly discuss tuning parameter selection. We then demonstrate our method using comprehensive simulation studies in Section 4 and a real data example in Section 5. Additional technical details and numerical studies can be found in the online supplement.

MODEL AND LIKELIHOOD
In this section, we first review the functional PCA model, and then develop the SupSFPC model and introduce a regularized likelihood approach for the model fitting.

FUNCTIONAL PCA MODEL
We assume that Z_i(s) (i = 1, ..., n) are independent realizations of a smooth random function Z(s) with mean function E(Z(s)) = μ(s) and covariance function cov(Z(s), Z(s')) = G(s, s'). The index variable s can represent any continuous measure such as time, spatial location, and so on. Its domain S is assumed bounded. The covariance function can be decomposed as
G(s, s') = Σ_{k=1}^∞ d_k V_k(s) V_k(s'),
where d_1 ≥ d_2 ≥ ··· ≥ 0 are the eigenvalues and V_k(s) (k = 1, 2, ...) are the corresponding orthogonal unit-norm eigenfunctions. Consequently, by the Karhunen-Loève theorem, the random function Z(s) can be expressed as a linear combination of the eigenfunctions as
Z(s) = μ(s) + Σ_{k=1}^∞ u_k V_k(s),
where the u_k are uncorrelated random variables with mean zero and variance d_k. In particular, Z_i(s) has the expression
Z_i(s) = μ(s) + Σ_{k=1}^∞ u_{ik} V_k(s),   (1)
where u_{ik} (i = 1, ..., n) are independent realizations of u_k. This is the classical functional PCA model, where (u_{1k}, ..., u_{nk})^T is the kth score vector, corresponding to the kth loading function V_k(s), k ≥ 1. Researchers usually consider the above functional model with measurement errors added, as in, for example, Yao, Müller, and Wang (2005). Namely, each observed trajectory, denoted by X_i(s), is expressed as
X_i(s) = Z_i(s) + e_i(s),
where Z_i(s) is the latent random function given in (1), and e_i(s) is a measurement error process that is assumed to be uncorrelated at each point, with mean zero and variance σ_e^2, independently and identically distributed (iid) across observations.
In practice, the majority of the variation in the data is contained in the subspace spanned by the first few PC loadings in (1). Namely, the first few layers of the latent function Z(s) dominate and the rest are negligible. Hereafter, we consider the following rank-r functional PCA model:
X_i(s) = μ(s) + u_{(i)}^T V(s) + e_i(s),   (2)
where u_{(i)} = (u_{i1}, ..., u_{ir})^T is the r × 1 score vector for the ith observation, and V(s) = (V_1(s), ..., V_r(s))^T is the collection of r loading functions. In particular, the finite linear combination u_{(i)}^T V(s) is referred to as the low-rank approximation of the ith demeaned observation X_i(s) − μ(s).

SUPSFPC MODEL
Let X_i(s) be the ith functional observation from Model (2). Let y_{(i)} be a q × 1 vector containing q auxiliary variables for the ith observation. We assume that y_{(i)}, the supervision data, drives the low-rank structure of X_i(s), the primary data, by directly affecting its PC score vector u_{(i)} in Model (2). In particular, we propose the following multivariate linear model for the scores:
u_{(i)} = β_0 + B^T y_{(i)} + f_{(i)},   (3)
where β_0 is an r × 1 intercept vector, B is a q × r coefficient matrix with rows corresponding to the supervision variables and columns corresponding to the PC scores, and f_{(i)} is an independent realization of an r × 1 random vector with mean zero and unknown covariance Σ_f. For example, in the genetic application of Section 5, X_i(s) denotes the gene expression profile of the ith sample, while y_{(i)} contains the corresponding transcription factor binding measurements. Model (3) consists of a fixed term β_0 + B^T y_{(i)} and a random term f_{(i)}. The fixed term captures the variations in u_{(i)} that can be explained by the supervision data y_{(i)}. The random term effectively collects the leftover variations driven by other (unknown) factors. Model (3) is flexible enough to adapt to different situations, including those where the supervision information is indeed redundant, as we discuss later in Section 2.3.
Combining (2) and (3), we obtain the supervised functional PCA model. In particular, substituting (3) for u_{(i)} in (2) gives the following equivalent expression of the model:
X_i(s) = μ(s) + β_0^T V(s) + y_{(i)}^T B V(s) + f_{(i)}^T V(s) + e_i(s).   (4)
The first term, μ(s) + β_0^T V(s), is an intercept term. Without loss of generality, we assume that X_i(s) and y_{(i)} are centered at each variable, so we can omit this intercept term. The second term, y_{(i)}^T B V(s), is a fixed term that incorporates the supervision information.
The third term, f_{(i)}^T V(s) + e_i(s), is a random term, with covariance function V(s)^T Σ_f V(s') + σ_e^2 δ(s − s'), where δ(·) is the Dirac delta function. The recovery of the low-rank structure y_{(i)}^T B V(s) + f_{(i)}^T V(s) is of primary interest in dimension reduction. We further generalize Model (4) by assuming that B and V(s) are potentially sparse. Consequently, we name Model (4) the supervised sparse and functional PCA model, or the SupSFPC model. Recall that B is a coefficient matrix that incorporates supervision. Sparsity on B can effectively identify auxiliary variables that do not provide supervision to the low-rank structure of the primary data. In particular, when B is a zero matrix, all auxiliary variables are irrelevant to the primary data, and the SupSFPC model reduces to the functional PCA model (2). The loading functions in V(s) can be sparse as well, in the sense that the support of each loading function may not be the entire domain S. Similar to James, Wang, and Zhu (2009), where the authors studied a regression model with a (potentially sparse) functional predictor, we remark that sparse functions facilitate model interpretations by removing unnatural wiggles around zero. Overall, sparsity is usually a desirable (and sometimes necessary) feature in practice, especially when analyzing high-dimensional data.
As it stands, Model (4) is not identifiable: for any r × r orthogonal matrix R, replacing (B, V(s), f_{(i)}) by (BR, R^T V(s), R^T f_{(i)}) leaves the model unchanged. Moreover, the columns of B and the entries of V(s) and f_{(i)} are subject to scale and order shifts. To rule out this kind of ambiguity, we impose the following identifiability constraints:
1. The loading functions in V(s) form an orthonormal basis, that is, ∫_S V_k(s) V_l(s) ds = 1 if k = l, and 0 otherwise.
2. The covariance matrix Σ_f is diagonal with distinct positive eigenvalues.
3. The diagonal values of Σ_f are strictly decreasing.
The orthonormality constraint on the loading functions, also used in the functional PCA model (1), facilitates interpretation and rules out scale shifts. The diagonality of the covariance matrix with distinct eigenvalues prevents random rotations. The order of the eigenvalues of Σ_f determines the order of the loading functions in V(s) and the columns in B. We remark that under the above conditions, the loading functions carry explicit interpretations: the first loading captures the direction in which the variation in the data from unknown sources is maximized; subsequent loadings are orthogonal to the previous ones and sequentially maximize variations from unknown sources. This is similar to functional PCA, where there is no supervision and all variations are from unknown sources.

PENALIZED LIKELIHOOD
In reality, we typically do not observe an entire function but rather its values at discrete sampling points. In particular, we assume that there are p sampling points in domain S indexed by s_1, ..., s_p, which may not be evenly spaced. For notational simplicity, unless otherwise noted we use i = 1, ..., n to index samples, j = 1, ..., p to index discretized points, and k = 1, ..., r to index PC layers.
The discrete observations of the functional data X_i(s) (i = 1, ..., n) are collected in an n × p matrix X, where x_ij = X_i(s_j). We discretize V(s) and e_i(s) in Model (4) accordingly as a p × r loading matrix V with v_jk = V_k(s_j) and an n × p error matrix E with e_ij = e_i(s_j). We further denote U = (u_{(1)}, ..., u_{(n)})^T as an n × r score matrix, Y = (y_{(1)}, ..., y_{(n)})^T as an n × q supervision data matrix (viewed as fixed in the current context), and F = (f_{(1)}, ..., f_{(n)})^T as an n × r random error matrix. As a result, we obtain the following discretized version of the SupSFPC model (4):
X = YBV^T + FV^T + E.   (5)
The identifiability conditions follow directly from those for the functional version of the model (4). Namely, V^T V equals the r × r identity matrix I_r, and Σ_f is diagonal with positive, strictly decreasing eigenvalues.
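To make the discretized model (5) concrete, the following sketch simulates data from it; all dimensions and parameter values here are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 200, 100, 4, 2          # hypothetical dimensions

# Supervision data Y (n x q), column-centered as in the text
Y = rng.standard_normal((n, q))
Y -= Y.mean(axis=0)

# Coefficient matrix B (q x r); loading matrix V (p x r) with orthonormal columns
B = rng.standard_normal((q, r))
V, _ = np.linalg.qr(rng.standard_normal((p, r)))   # V^T V = I_r

Sigma_f = np.diag([9.0, 4.0])        # diagonal with decreasing values
sigma2_e = 1.0

# Model (5): X = Y B V^T + F V^T + E
F = rng.standard_normal((n, r)) @ np.sqrt(Sigma_f)
E = np.sqrt(sigma2_e) * rng.standard_normal((n, p))
X = Y @ B @ V.T + F @ V.T + E
```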
To fit the SupSFPC model, we adopt a maximum likelihood approach. We assume normality for E and F. In particular, we assume that e_ij is iid from a univariate normal distribution N(0, σ_e^2), and f_{(i)} is iid from a multivariate normal distribution N_r(0, Σ_f). In addition, e_ij is independent of f_{(i)}. From (5), we see that the observation vector x_{(i)} is normally distributed as N_p(VB^T y_{(i)}, VΣ_f V^T + σ_e^2 I_p), and different samples are independent. As a result, the log-likelihood of the observed data matrix X is
L(X) = −(1/2) Σ_{i=1}^n { log det(2π(VΣ_f V^T + σ_e^2 I_p)) + (x_{(i)} − VB^T y_{(i)})^T (VΣ_f V^T + σ_e^2 I_p)^{−1} (x_{(i)} − VB^T y_{(i)}) }.
To impose desirable structures (i.e., smoothness and sparsity) on V and B, we optimize a regularized log-likelihood function to estimate the model parameters. Let θ = (B, V, σ_e^2, Σ_f) denote the model parameter set and Θ be the parameter space under the identifiability conditions. We propose to solve the following optimization problem:
max_{θ ∈ Θ}  L(X) − P_f(V) − P_s(V) − P_s(B),   (6)
where P_f(V) is the roughness penalty ("f" stands for functionality) on the columns of V, and P_s(V) and P_s(B) are the sparsity-inducing penalties ("s" stands for sparsity) on the entries of V and B, respectively. We remark that by imposing sparsity on B we also avoid overfitting in the multivariate linear model (3) when the dimension of the supervision data is high (q > n). Therefore, SupSFPC places no restrictions on the order of n, p, and q, and is suitable for high-dimensional data. For sparsity, numerous penalties have been proposed and studied in the literature (see Tibshirani 1996; Fan and Li 2001; Tibshirani et al. 2005; Yuan and Lin 2006). In this article, we present our method using the LASSO penalty (Tibshirani 1996); it can be easily generalized to incorporate other penalties as well. The sparsity-inducing penalties in (6) take the following form:
P_s(V) = Σ_{k=1}^r λ_k ||v_k||_1,   P_s(B) = Σ_{k=1}^r γ_k ||b_k||_1,   (7)
where v_k and b_k are the kth columns of V and B corresponding to the kth layer of the low-rank structure, respectively, and λ_k and γ_k are the corresponding layer-specific tuning parameters. For smoothness, generalized ℓ2 penalties are widely used in the literature. Here we consider the elliptical ℓ2 penalty:
P_f(V) = Σ_{k=1}^r α_k v_k^T Ω v_k,   (8)
where the α_k are layer-specific tuning parameters, and Ω is a fixed p × p positive semidefinite matrix depending on the sampling points, with the quadratic form v_k^T Ω v_k penalizing differences among adjacent values in v_k. Here, we use the same formulation of Ω as in Green and Silverman (1994), which connects nicely with smoothing splines.
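As a concrete illustration, the sketch below builds a simple squared-second-difference matrix as a stand-in for the smoothing-spline Ω of Green and Silverman (1994) (this substitution is our simplification) and evaluates the three penalties in (7) and (8).

```python
import numpy as np

def second_difference_omega(p):
    """Omega = D^T D with D the (p-2) x p second-difference operator:
    a simple positive semidefinite surrogate for the smoothing-spline
    penalty matrix of Green and Silverman (1994)."""
    D = np.zeros((p - 2, p))
    for j in range(p - 2):
        D[j, j:j + 3] = [1.0, -2.0, 1.0]
    return D.T @ D

def penalties(V, B, alphas, lams, gams, Omega):
    """Evaluate P_f(V), P_s(V), and P_s(B) as in (7) and (8)."""
    P_f = sum(a * v @ Omega @ v for a, v in zip(alphas, V.T))
    P_s_V = sum(l * np.abs(v).sum() for l, v in zip(lams, V.T))
    P_s_B = sum(g * np.abs(b).sum() for g, b in zip(gams, B.T))
    return P_f, P_s_V, P_s_B
```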
The penalized likelihood framework (6) is very general, and it subsumes many existing methods as we now discuss. If P_f(V) = P_s(V) = P_s(B) = 0, that is, without any structural constraints, it reduces to the supervised SVD (SupSVD) method of Li et al. (in press). When P_s(B) = ∞, that is, B = 0, it reduces to regularized PCA methods: if P_f(V) ≠ 0 and P_s(V) = 0, it corresponds to the functional PCA of Huang, Shen, and Buja (2008); if P_f(V) = 0 and P_s(V) ≠ 0, it results in the sparse PCA of Shen and Huang (2008); if P_f(V) ≠ 0 and P_s(V) ≠ 0, one obtains the one-way version of the sparse and functional PCA (SFPC) method of Allen (2013). We also note that the general framework includes many degenerate situations that have not been well studied before. For instance, when P_f(V) = 0 while P_s(V) ≠ 0 and P_s(B) ≠ 0, the framework reduces to a supervised PCA method with sparsity in V and B.
We want to comment on situations where the sampling points differ across samples. For example, in longitudinal studies, patients may follow up at different times and also have distinct time domains. Similar situations have been referred to as sparsely observed data in functional data analysis (see, e.g., James, Hastie, and Sugar 2000; Yao, Müller, and Wang 2005). In such situations, we can think of two possible approaches. For the first one, we can find a set of common grid points that is finer than the irregular sampling points, and treat the functional observations as missing at those grid points where no data are observed. Our estimation algorithm can be extended to incorporate missing values. The second approach is to use basis expansion to interpolate the functional data, and then evaluate them on a set of common sampling points, as in the sketch below.
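As a minimal illustration of the second approach, one could interpolate each curve onto a common grid; the grid and the use of linear (rather than basis-expansion) interpolation are our simplifying choices.

```python
import numpy as np

def to_common_grid(times_list, values_list, grid):
    """Linearly interpolate each irregularly observed curve onto a
    common grid. Curves should cover the grid range; np.interp holds
    endpoint values flat outside each curve's observed range."""
    n, p = len(times_list), len(grid)
    X = np.empty((n, p))
    for i, (t, x) in enumerate(zip(times_list, values_list)):
        X[i] = np.interp(grid, t, x)
    return X
```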

COMPUTATIONAL ALGORITHM
In this section, we propose an algorithm for parameter estimation of the SupSFPC model. For the sake of clarity in describing the estimation algorithm, we first assume that all the tuning parameters, including the rank of the model, are given. We motivate and summarize the algorithm in Section 3.1, and derive the algorithm in more detail in Section 3.2. Then we briefly discuss the data-driven selection of tuning parameters in Section 3.3. Detailed derivation of the tuning parameter selection can be found in Section A of the online supplement.

EM ALGORITHM
Directly optimizing the penalized log-likelihood (6) under the identifiability constraints is nontrivial. The model parameters are intertwined in the log-likelihood L(X): both the mean and the covariance terms share the parameter matrix V. In addition, the sparsity-inducing penalties are nondifferentiable, and the feasible region determined by the identifiability conditions is nonconvex. We propose an algorithm that effectively combines the expectation-maximization (EM) algorithm with proximal gradient ascent (Nesterov 2005; Beck and Teboulle 2009) and block-coordinate descent (Ortega and Rheinboldt 2000) to overcome these computational difficulties.
To motivate the EM formulation, we first note that the hierarchical Model (5) contains the PC scores U as latent variables. It is easily seen that x_{(i)} and u_{(i)} are jointly normally distributed, and different samples are independent. The joint log-likelihood of the observed data X and the latent data U can be decomposed as L(X, U) = L(X | U) + L(U), where the conditional log-likelihood of X given U is
L(X | U) = −(np/2) log(2πσ_e^2) − ||X − UV^T||_F^2 / (2σ_e^2),   (9)
which only depends on V and σ_e^2, while the marginal log-likelihood of U is
L(U) = −(n/2) log det(2πΣ_f) − (1/2) Σ_{i=1}^n (u_{(i)} − B^T y_{(i)})^T Σ_f^{−1} (u_{(i)} − B^T y_{(i)}),   (10)
which only depends on B and Σ_f. Therefore, an EM algorithm can effectively separate the parameter estimation into two parts to simplify the optimization. The EM algorithm iterates between an E step and an M step. In the (t + 1)th iteration, the E step calculates E_{U|X,θ^(t)}[L(X, U)], where the expectation is taken with respect to U given X and θ^(t) = (B^(t), V^(t), σ_e^{2(t)}, Σ_f^(t)), the estimated parameter set obtained in the tth iteration. The M step maximizes E_{U|X,θ^(t)}[L(X, U)] − P_f(V) − P_s(V) − P_s(B) with respect to θ ∈ Θ, with the penalty terms as in (7) and (8). We denote the corresponding optimizer as θ^(t+1). After convergence, we obtain a local optimum of the regularized log-likelihood (6). Algorithm Summary: Before the detailed technical derivation, we summarize the algorithm with fixed tuning parameters in Algorithm 1: each iteration consists of an E step computing (11)-(15), followed by M-step updates of V column by column via (20), of B via (22), and of the variances σ_e^2 and Σ_f via (18) and (19), until convergence.

DERIVATION OF THE EM ALGORITHM
Since u_{(i)} and x_{(i)} are jointly normally distributed, the conditional distribution of u_{(i)} given x_{(i)} and θ^(t) is easily derived as N_r(μ_{(i)}^(t), Σ^(t)), where
μ_{(i)}^(t) = (I_r − W^(t)) B^{(t)T} y_{(i)} + W^(t) V^{(t)T} x_{(i)},  with  W^(t) = (I_r + σ_e^{2(t)} Σ_f^{(t)−1})^{−1},   (11)
Σ^(t) = σ_e^{2(t)} (I_r + σ_e^{2(t)} Σ_f^{(t)−1})^{−1}.   (12)
We remark that the conditional expectation of the PC scores for the ith sample is a weighted average of B^{(t)T} y_{(i)} and V^{(t)T} x_{(i)}, where the weight is determined by (I_r + σ_e^{2(t)} Σ_f^{(t)−1})^{−1}: in SupSFPC, the PC scores are partially driven by the supervision effect y_{(i)}, and partially affected by the observation x_{(i)} as in ordinary PCA.
In the E step, given (11) and (12), we can derive the explicit expression of E_{U|X,θ^(t)}[L(X, U)]. As a matter of fact, we do not need to calculate the expectation of the entire joint log-likelihood, but rather only the following three terms:
first-order term: E_{U|X,θ^(t)}(U) = M^(t) = (μ_{(1)}^(t), ..., μ_{(n)}^(t))^T,   (13)
second-order term: E_{U|X,θ^(t)}(U^T U) = nΣ^(t) + M^{(t)T} M^(t),   (14)
quadratic form: E_{U|X,θ^(t)}[tr(Λ U^T U)] = n tr(ΛΣ^(t)) + tr(Λ M^{(t)T} M^(t)),   (15)
where Λ is any r × r symmetric matrix.
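A sketch of the E step, implementing the moments (11)-(14) as reconstructed above; it treats V^(t) as exactly orthonormal, which the algorithm only maintains approximately, and all variable names are ours.

```python
import numpy as np

def e_step(X, Y, B, V, sigma2_e, Sigma_f):
    """Conditional moments of the scores given the data, as in (11)-(14).
    Assumes V has (approximately) orthonormal columns."""
    r = V.shape[1]
    Sf_inv = np.linalg.inv(Sigma_f)
    W = np.linalg.inv(np.eye(r) + sigma2_e * Sf_inv)      # weight matrix in (11)
    # (11), in matrix form: weighted average of supervision fit and PCA projection
    M = (Y @ B) @ (np.eye(r) - W).T + (X @ V) @ W.T
    # (12): conditional covariance, shared across samples
    Sigma_cond = sigma2_e * W
    # (14): E(U^T U | X) = n * Sigma_cond + M^T M
    second_moment = X.shape[0] * Sigma_cond + M.T @ M
    return M, Sigma_cond, second_moment
```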
In the M step, we optimize E_{U|X,θ^(t)}[L(X, U)] − P_f(V) − P_s(V) − P_s(B) with respect to θ ∈ Θ. This is equivalent to the following two separate optimization problems:
{V^(t+1), σ_e^{2(t+1)}} = argmax_{V, σ_e^2}  E_{U|X,θ^(t)}[L(X | U)] − P_f(V) − P_s(V),   (16)
{B^(t+1), Σ_f^(t+1)} = argmax_{B, Σ_f = diag(Σ_f)}  E_{U|X,θ^(t)}[L(U)] − P_s(B),   (17)
where L(X | U) is given by (9), and L(U) is given by (10). The notation diag(Σ_f) represents a diagonal matrix whose diagonal entries are the diagonal entries of Σ_f.

Estimation of σ_e^2 and Σ_f.
We take the first-order derivatives of (16) (or (17)) with respect to σ_e^2 (or the diagonal entries of Σ_f), set them to zero, and obtain the analytical expressions (see Section B of the online supplement for the derivation)
σ_e^{2(t+1)} = E_{U|X,θ^(t)} ||X − U V^{(t+1)T}||_F^2 / (np),   (18)
Σ_f^(t+1) = diag( E_{U|X,θ^(t)} [(U − Y B^(t+1))^T (U − Y B^(t+1))] ) / n,   (19)
where V^(t+1) and B^(t+1) are the optimizers of V and B for (16) and (17), respectively, to be discussed below. In particular, the conditional expectation terms in (18) and (19) can be obtained using (13)-(15) from the E step.
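Given the E-step output, the variance updates reduce to simple moment computations; below is a sketch under the same orthonormality assumption, with the closed forms following (18) and (19) as reconstructed above.

```python
import numpy as np

def update_variances(X, Y, B_new, V_new, M, Sigma_cond):
    """M-step variance updates following the reconstructed (18)-(19).
    M and Sigma_cond are the conditional moments from the E step."""
    n, p = X.shape
    # (18): sigma_e^2 = E||X - U V^T||_F^2 / (np)
    resid = X - M @ V_new.T
    sigma2_e = (np.sum(resid**2)
                + n * np.trace(V_new @ Sigma_cond @ V_new.T)) / (n * p)
    # (19): Sigma_f = diag( E[(U - Y B)^T (U - Y B)] ) / n
    R = M - Y @ B_new
    Sigma_f = np.diag(np.diag(Sigma_cond + (R.T @ R) / n))
    return sigma2_e, Sigma_f
```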

Estimation of V.
Optimizing (16) with respect to V under the orthogonality constraint is formidable. Instead, we propose to drop the orthogonality and optimize the criterion with respect to the columns of V, one at a time while fixing the others, mimicking a block-coordinate descent algorithm. Since the conditional distribution (9) of X given U is identifiable even without the orthogonality condition, the optimization problem is still well defined. The scheme is similar to the deflation method used in the regularized PCA literature (see, e.g., Huang, Shen, and Buja 2008; Shen and Huang 2008; Hays et al. 2012; Allen 2013). We remark that the greedy algorithm maintains orthogonality of the columns of V approximately throughout the EM iterations. In our simulation study, the angle between any two loading vectors is typically greater than 85°; in the yeast cell cycle example studied in Section 5, the smallest angle between any of the first four loadings is 87.7°. Therefore, the column-by-column optimizers serve as a reasonable surrogate of the global optimizer of (16).
Given all the parameters except the kth column of V, we can estimate v_k^(t+1) by solving
v_k^(t+1) = argmin_{v_k} { −β_k^{(t)T} v_k + (1/2) v_k^T (I_p + α_k^(t) Ω) v_k + λ_k^(t) ||v_k||_1 },   (20)
where β_k^(t) = E_{U|X,θ^(t)}[(X − U_{−k} V_{−k}^{(t)T})^T u_k] / c_k^(t), and c_k^(t) = E_{U|X,θ^(t)}(u_k^T u_k). The matrices U_{−k} and V_{−k}^(t) are the submatrices of U and V^(t) leaving out the kth columns u_k and v_k^(t), respectively. This setup facilitates parallel computing across the columns of V. The constants β_k^(t) and c_k^(t) can be calculated from (13) and (14). The modified tuning parameters λ_k^(t) and α_k^(t) absorb the unknown constant σ_e^{2(t+1)} and can be selected adaptively in a data-driven fashion in each iteration. For now, we treat them as known.
To solve (20), we adopt the proximal gradient scheme studied by Nesterov (2005) and Beck and Teboulle (2009). Dropping the subscripts and superscripts in (20) for simplicity, the optimization problem becomes
min_{v ∈ R^p}  f(v) + λ ||v||_1,  with  f(v) = −β^T v + (1/2) v^T (I_p + αΩ) v,   (21)
where ∇f is the gradient of f, and L is the Lipschitz constant of ∇f such that ||∇f(a) − ∇f(b)||_2 ≤ L ||a − b||_2 for every a, b ∈ R^p. Since ∇f(v) = −β + (I_p + αΩ)v, L is the largest eigenvalue of I_p + αΩ. Note that l is the proximal gradient iteration index for estimating one column of V, not to be confused with the EM iteration index t. In particular, we solve (21) approximately through the following two steps:
w^(l) = v^(l) − ∇f(v^(l)) / L = v^(l) + (β − (I_p + αΩ) v^(l)) / L,
v^(l+1) = thres(w^(l), λ/L),
where thres(·,·) is the soft-thresholding function thres(β, λ) = sign(β)(|β| − λ)_+.
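A bare-bones ISTA sketch for (21); the paper's implementation may differ (e.g., the accelerated FISTA variant is equally compatible with the cited references), and the iteration count is an arbitrary choice of ours.

```python
import numpy as np

def soft_threshold(beta, lam):
    """thres(beta, lam) = sign(beta) * (|beta| - lam)_+"""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def prox_gradient_v(beta, Omega, alpha, lam, n_iter=200):
    """Proximal gradient (ISTA) for (21):
    min_v  -beta^T v + 0.5 * v^T (I + alpha*Omega) v + lam * ||v||_1."""
    p = beta.shape[0]
    A = np.eye(p) + alpha * Omega
    L = np.linalg.eigvalsh(A)[-1]          # Lipschitz constant of the gradient
    v = np.zeros(p)
    for _ in range(n_iter):
        grad = -beta + A @ v               # gradient of the smooth part
        v = soft_threshold(v - grad / L, lam / L)
    return v
```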

Estimation of B.
To estimate B, we can rewrite (17) as r independent unconstrained optimization problems, and obtain each column of B as
b_k^(t+1) = argmin_{b_k} { (1/2) ||E_{U|X,θ^(t)}(u_k) − Y b_k||_2^2 + γ_k^(t) ||b_k||_1 },   (22)
where the unknown variance (the kth diagonal entry of Σ_f) is absorbed by the modified tuning parameter γ_k^(t), which can be adaptively selected. The vector E_{U|X,θ^(t)}(u_k) can be calculated from (13).
The optimization problem (22) is a univariate LASSO problem with E_{U|X,θ^(t)}(u_k) being the response vector, Y being the n × q design matrix, and b_k being the coefficient vector. In addition, both the response vector and the design matrix are column centered, so there is no intercept term. Many methods have been developed to solve (22) (see Efron et al. 2004; Friedman, Hastie, and Tibshirani 2010). Here we use the coordinate descent algorithm in Matlab (Friedman, Hastie, and Tibshirani 2010).
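The paper relies on the Matlab implementation of Friedman, Hastie, and Tibshirani (2010); as a self-contained stand-in, a plain coordinate descent loop for (22) might look as follows (convergence checking is omitted for brevity).

```python
import numpy as np

def lasso_cd(y, Y, gamma, n_iter=100):
    """Plain coordinate descent for (22):
    min_b  0.5 * ||y - Y b||_2^2 + gamma * ||b||_1,
    with y and the columns of Y assumed centered (no intercept)."""
    q = Y.shape[1]
    b = np.zeros(q)
    col_norm2 = (Y**2).sum(axis=0)
    resid = y - Y @ b
    for _ in range(n_iter):
        for j in range(q):
            resid += Y[:, j] * b[j]                      # remove jth effect
            rho = Y[:, j] @ resid                        # partial correlation
            b[j] = np.sign(rho) * max(abs(rho) - gamma, 0.0) / col_norm2[j]
            resid -= Y[:, j] * b[j]                      # add it back
    return b
```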

TUNING PARAMETER SELECTION
The tuning parameters in (6) play an important role in balancing the likelihood and the penalties. Note that there are 3r tuning parameters in the model. Searching over a 3r-dimensional grid and refitting the model (potentially multiple times, if one uses cross-validation) for each tuning configuration can be a huge computational burden. Instead, we adopt the nested procedure of selecting tuning parameters introduced by Huang, Shen, and Buja (2009). In each iteration, we find the optimal tuning parameters λ_k^(t) and α_k^(t) for v_k^(t+1) while solving (20), and find the best γ_k^(t) for b_k^(t+1) while solving (22). Numerical results illustrate that this nested procedure always converges; theoretical justification of its convergence properties remains an open question.
In particular, when selecting λ_k^(t) and α_k^(t), we assume that they do not interfere with each other and select one while fixing the other at zero. As a result, the selection of α_k^(t) is equivalent to selecting the smoothing parameter in a smoothing spline problem. For this selection task, we use leave-one-out cross-validation (LOOCV), since the LOOCV score has an analytical form from Green and Silverman (1994), which facilitates fast computation. To select λ_k^(t), we set it at an asymptotic value, since the problem is equivalent to a filtering problem studied in Yang, Ma, and Buja (2013). The asymptotic value induces an appropriate amount of sparsity in v_k^(t+1). The tuning parameter γ_k^(t) in (22) is selected using the Bayesian information criterion (BIC), which is a popular choice in LASSO problems (Wang, Li, and Leng 2009; Chand 2012). The degrees of freedom are determined in the same way as in Tibshirani et al. (2012). As a result, the algorithm is computationally efficient and scalable to high-dimensional data. Numerical results in Sections 4 and 5 suggest that the scheme performs well. A more detailed derivation of tuning parameter selection can be found in Section A of the online supplement.
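For illustration, a BIC-based grid search for γ_k might look like the sketch below; the BIC form with the number of nonzero coefficients as the degrees of freedom is a common convention and only approximates the criterion detailed in the paper's supplement. It reuses the lasso_cd solver from the sketch above.

```python
import numpy as np

def select_gamma_bic(y, Y, gammas):
    """Select the LASSO tuning parameter by BIC, using the number of
    nonzero coefficients as the degrees of freedom (a common choice,
    not necessarily the paper's exact criterion)."""
    n = len(y)
    best_gamma, best_bic = None, np.inf
    for g in gammas:
        b = lasso_cd(y, Y, g)                  # solver from the sketch above
        rss = np.sum((y - Y @ b)**2)
        df = np.count_nonzero(b)
        bic = n * np.log(rss / n) + np.log(n) * df
        if bic < best_bic:
            best_gamma, best_bic = g, bic
    return best_gamma
```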
So far we have assumed that the rank r of the model is known. In practice, the rank needs to be determined from the data. It is reasonable to assume that the rank of the underlying signal is an inherent property of the primary data matrix; therefore, any rank selection method studied in the PCA literature may be used in our framework. In this article, we adopt the popular approach of using the scree plot of the primary data matrix to determine a proper rank. One can also consider other methods, such as the permutation assessment method of Buja and Eyuboglu (1992) and the bi-cross-validation method of Owen and Perry (2009). More sophisticated rank selection methods for functional data and high-dimensional data need further investigation and are beyond the scope of the current article.
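A scree plot is typically read by eye; as a reproducible stand-in, one can threshold the cumulative variance explained, as in this sketch (the 90% threshold is our arbitrary choice, not the paper's rule).

```python
import numpy as np

def scree_rank(X, energy=0.9):
    """A crude automatic surrogate for eyeballing the scree plot:
    pick the smallest rank explaining a given share of total variance."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), energy) + 1)
```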

SIMULATIONS
In this section, we compare SupSFPC with the SupSVD method proposed by Li et al. (in press), the one-way SFPC method proposed by Allen (2013), and classical PCA, using comprehensive simulations.

SIMULATION SETTINGS
Data are generated from the low-rank model X = YBV^T + FV^T + E, which connects to the SupSFPC, SupSVD, one-way SFPC, and PCA models, respectively, through specific choices of B, Σ_f, and V. Throughout the section, we assume that each entry of E is iid standard normal (i.e., σ_e^2 = 1).
Study I: We first consider a unit-rank setup where n = 200, p = 100, q = 4, r = 1. The 200 × 4 supervision matrix Y is filled with standard normal random numbers and then column centered. The 200 × 100 primary data matrix X is also column centered after being generated. We focus on four settings where data are generated from each model, respectively:
• Case 1 (SupSFPC): The loading vector V is shown in the left panel of Figure 1; the coefficient vector B is (3, −3, 5, 0)^T; F is a 200 × 1 random vector where each entry is iid standard normal (i.e., Σ_f = 1).
• Case 2 (SupSVD): The parameters B and Σ_f are the same as in Case 1; the loading vector V is filled with standard normal random numbers and scaled to have norm one. Namely, there is no smoothness or sparsity in the loading.

• Case 3 (SFPC): The vector V is the same as in Case 1; the coefficient B = 0, which eliminates the supervision effect; each entry of F is iid N(0, 9) (i.e., Σ_f = 9).
• Case 4 (PCA): The parameters B and Σ_f are the same as in Case 3, and the loading vector V is obtained in the same way as in Case 2.
• Case 6 (SupSVD): The parameters B and Σ_f are the same as in Case 5; the 120 × 3 loading matrix V is filled with standard normal random numbers and normalized to have orthonormal columns.
• Case 7 (SFPC): The loading matrix V is the same as in Case 5; the coefficient matrix B = 0, which eliminates the supervision effect; the covariance matrix Σ_f is diagonal with diagonal values (16, 9, 4).
• Case 8 (PCA): The parameters B and Σ_f are the same as in Case 7, and the loading vectors are obtained in the same way as in Case 6.

RESULTS
For each case, we repeat the simulation 100 times and present the median and the median absolute deviation (MAD) of each performance measurement for all methods in Table 1. The results show that SupSFPC outperforms the other methods in all cases, in terms of the considered aspects. One explanation for the superior performance is that SupSFPC is a general framework unifying many existing methods. It automatically adapts to a wide range of practical situations.
There are several interesting observations in Table 1. First, in Case 3 and Case 7, SFPC surprisingly performs badly in all aspects. This is likely due to an inadequate tuning parameter selection procedure: the original SFPC article did not provide guidance on how to set tuning grids for BIC, which is a crucial issue in practice. We consulted with the author and used a suggested tuning grid here. Second, in Case 4 and Case 8, SupSFPC and SupSVD outperform SFPC and PCA in terms of score prediction and low-rank structure recovery. Since the auxiliary data are irrelevant in both cases, the improvement in score prediction must come from the shrinkage effect imposed by (I_r + σ_e^2 Σ_f^{−1})^{−1} in (13). This has been studied from a random matrix point of view by Shabalin and Nobel (2013). Third, in Case 6, where the generating model is SupSVD, the medians of MSPE_U and MSE_V for SupSVD are larger than those for SupSFPC. In this case the only difference between SupSFPC and SupSVD is that the former does not require strict orthogonality in the loading estimation, so we attribute the improvement to this extra flexibility. Nevertheless, both SupSVD and SupSFPC have similar medians of MSPE_{UV^T} that are superior to SFPC and PCA. This suggests that the recovery of low-rank structures indeed benefits from incorporating auxiliary data.

REAL DATA EXAMPLE: YEAST CELL CYCLE DATA
In this section, we demonstrate the advantage of SupSFPC using a yeast cell cycle dataset. Two additional real data examples, a government bond yield dataset and a hospital emergency room visit dataset, are considered in Sections C and D of the online supplement.
We consider microarray expression measurements (X) of yeast genes over a certain time period. About 800 cell cycle-related genes were identified in Spellman et al. (1998) through three independent synchronization methods. We consider the data from the α-factor-based experiment, where mRNA levels were measured every 7 min at 18 time points (about 2 hr) covering two cell cycles. In addition to the expression data, we also have ChIP-chip data (Lee et al. 2002) that contain binding information (Y) of 106 TFs for the cell cycle-related genes. We exclude genes with missing values in either expression measurements or TF binding information, as in Chen and Huang (2012) and Chun and Keleş (2010), and consider a subset of 542 genes. The data are publicly available in the R package "spls." Figure 2 shows the raw expression time series of the 542 cell cycle-related genes.
The goal of the yeast cell cycle data analysis is two-fold: (i) understanding the underlying expression patterns of cell cycle-related genes, and (ii) identifying transcription factors (TFs) that regulate cell cycles. Below we address both topics simultaneously using SupSFPC. Zhao, Marron, and Wells (2004) primarily focused on the former by projecting the raw time series onto Fourier basis functions with even frequencies and carrying out principal component analysis of the projected data. Chun and Keleş (2010) and Chen and Huang (2012) studied the latter by regressing the gene expression data onto the TF data through sparse partial least squares and sparse reduced rank regression, respectively.
The primary data matrix X contains expression measurements of 542 genes at 18 time points. The supervision data matrix Y contains binding information of the same genes for 106 TFs. We mean center each time point in X and each TF in Y. Based on the scree plot of singular values of the column-centered data matrix X, we select the rank to be r = 4. The fitting procedure took about an hour (641 EM iterations) on a standard desktop (Intel Xeon CPU X5570 @ 2.93 GHz dual processor) to reach relatively high accuracy (the ℓ2 difference between consecutive estimates of the loading matrix fell below 10^{-3}).
Figure 3 compares the loading estimates from four methods: SupSFPC, SupSVD, SFPC, and PCA. By taking into account the auxiliary binding information, the SupSFPC loadings are the most interpretable ones. The first and the fourth loading vectors of SupSFPC effectively capture periodic patterns of cell cycles without referring to a priori knowledge of true cyclic information as in Zhao, Marron, and Wells (2004). In addition, the second loading mainly presents the variation in the first cell cycle, the third loading reflects the contrast of the two cycles, and the fourth loading also emphasizes the variation in the second cycle.
Figure 4 shows the clustering results of the 542 cell cycle-related genes based on the SupSFPC scores. We apply K-means clustering with five clusters, where the number of clusters is suggested by Zhao, Marron, and Wells (2004). Different clusters contain genes with different periodic phases. In particular, the genes in the 2nd-5th clusters clearly exhibit different cyclic patterns, similar to the results in Zhao, Marron, and Wells (2004). The genes in the first cluster, on the other hand, do not show strong periodicity, which may need further investigation.
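For readers reproducing the clustering step, a minimal sketch is given below; the placeholder score matrix stands in for the fitted SupSFPC scores, and five clusters follow Zhao, Marron, and Wells (2004).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical post-processing: cluster genes by their estimated
# SupSFPC scores (an n x r matrix from the fitted model).
U_hat = np.random.default_rng(0).standard_normal((542, 4))  # placeholder scores
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(U_hat)
```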
We also investigate the TF activities. Active TFs correspond to the nonzero rows of the estimated supervision coefficient matrix B. Out of the 106 TFs, we identify 32 to be active, with 13 of them being among the 21 experimentally confirmed TFs in Wang, Chen, and Li (2007). The TF activities for those discovered by SupSFPC are shown in Figure 5. Most of the confirmed TFs have clear periodic behavior; among the unconfirmed ones, DOT6, MET4, SFL1, and YAP5 have the most significant cyclic patterns, which may provide useful guidance to further investigate the regulation effect of TFs on yeast cell cycle.

SUPPLEMENTARY MATERIALS
Appendices: Technical derivations of the algorithm in Sections A and B; additional real data examples in Sections C and D. (Appendix.pdf; pdf file)
Code and Data: Matlab codes for the different methods (i.e., SupSFPC, SFPC, and SupSVD) and numerical analyses, along with the yeast cell cycle data, bond yield data, and emergency room data. (codedata.zip; zip file)