Eigen selection in spectral clustering: a theory-guided practice

Based on a Gaussian mixture-type model, we derive an eigen selection procedure that improves the usual spectral clustering in high-dimensional settings. Concretely, we derive an asymptotic expansion of the spiked eigenvalues under eigenvalue multiplicity, together with concentration results for the eigenvalue ratio, giving rise to the first theory-backed eigen selection procedure in spectral clustering. The resulting eigen-selected spectral clustering (ESSC) algorithm enjoys better stability and compares favorably against canonical alternatives. We demonstrate the advantages of ESSC using extensive simulations and multiple real data studies.


1. Introduction.
Clustering is a widely used unsupervised learning approach to divide observations into subgroups without the guidance of labels. It is a natural statistical and machine learning formulation when there are no meaningful labels in the training datasets, such as in customer segmentation and criminal cyber-profiling applications. It is also a sensible approach when labels in theory do exist, but we have solid reasons to believe that the labels in the datasets are far from accurate. For instance, Medicare-Medicaid fraud detection cannot be formulated as a supervised learning problem, because although the labeled fraudulent transactions are real frauds, it is believed that there are a large number of undiscovered frauds in the record.
Over the last sixty years, many clustering approaches have been proposed. The most dominant ones include k-means, hierarchical clustering, spectral clustering, and their variants (Hastie, Tibshirani and Friedman, 2009; James et al., 2014). The k-means algorithms (Bradley, Fayyad and Mangasarian, 1999; Witten and Tibshirani, 2010) adopt a centroid-based clustering approach. Hierarchical clustering algorithms (Ward Jr, 1963) first build a hierarchy of clusters and then make a cut at a hierarchical level. Spectral clustering (Ng, Jordan and Weiss, 2002; Von Luxburg, 2007) clusters observations using the spectral information of some affinity matrix derived from the original data to measure the similarity among observations. Among the above-mentioned mainstream clustering approaches, spectral clustering is particularly well suited to high-dimensional settings, which refer to situations where the number of features is comparable to or larger than the sample size. High-dimensional settings emerged mainly with modern biotechnologies such as microarrays and remain relevant due to subsequent technological advances such as next-generation sequencing (NGS) technologies. Methodological and theoretical questions in high-dimensional supervised learning (i.e., regression and classification) have been attracting a great deal of attention in the statistics community over the last 20 years (see the review paper Zou (2019) and references therein). In contrast, high-dimensional unsupervised problems have received far less attention so far. They are challenging mainly because effective dimension reduction is difficult without the assistance of a response variable. Spectral clustering alleviates the curse of dimensionality in high-dimensional clustering by consulting only a few less noisy eigenvectors of an affinity matrix. For example, suppose that we would like to cluster n observations into K groups, where K is the predetermined cluster number. Spectral clustering algorithms usually compute the top K eigenvectors of an affinity matrix and then perform k-means clustering using just these K eigenvectors.
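For instance, a minimal sketch of this vanilla procedure in Python (a sketch only: here the affinity is taken to be the Gram matrix X^⊤X, one common choice, and the function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def vanilla_spectral_clustering(X, K):
    """Cluster the n columns of the p x n data matrix X into K groups by running
    k-means on the top-K eigenvectors of the affinity matrix X^T X."""
    A = X.T @ X                               # n x n affinity matrix
    _, eigvecs = np.linalg.eigh(A)            # eigenvalues returned in ascending order
    top_K = eigvecs[:, -K:]                   # eigenvectors of the K largest eigenvalues
    return KMeans(n_clusters=K, n_init=10).fit_predict(top_K)
```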
The intuition behind the above spectral clustering method is that, under a broad data-matrix generative model of a low-rank mean matrix plus noise, the data label information is completely captured by the eigenvectors corresponding to the top eigenvalues of an affinity matrix based on the low-rank mean matrix. Thus, the eigenvectors corresponding to non-spiked eigenvalues can be safely dropped, and the purpose of noise reduction is achieved.
In this paper, we formalize the above intuition by considering the special case of K = 2 and Gaussian distributions. Concretely, the data matrix follows the aforementioned structure of a low-rank (i.e., rank 2) mean matrix plus noise, X = IEX + (X − IEX), where X is a p × n matrix and n is the sample size. A natural and popular way is to construct the affinity matrix as X^⊤X. We show that the two spiked eigenvectors of H := (IEX)^⊤IEX, which can be understood as the noiseless version of the affinity matrix, completely capture the label information. We also identify scenarios where exactly one of the two spiked eigenvectors of H is useful for clustering.
Here, an eigenvector is useful if its entries take two distinct values, corresponding to the true cluster labels. Note that the eigenvectors of H are unavailable to us, and spectral clustering is applied to their sample counterparts, that is, the eigenvectors of the affinity matrix X^⊤X.
These motivate us to select useful eigenvectors of the affinity matrix in implementing spectral clustering.
Specifically, in this paper, we propose an innovative eigen selection procedure for the usual spectral clustering algorithms and name the resulting algorithm ESSC. Our eigenvector selection step is guided by a theoretical investigation of the top two eigenvectors of H. We also provide theoretical justification for our selection criteria. Our theoretical development does not require a sparsity assumption on the data generative model, such as those in Cai, Ma and Wu (2013) and Jin and Wang (2016). This suggests that our procedure is potentially suitable for a wider range of applications. A by-product of our theoretical development is an asymptotic expansion of the eigenvalues when the population eigenvalues are close to each other (Proposition 1), which is a result of stand-alone interest. We provide extensive simulation studies and observe that, for small sample sizes, our clustering algorithm ESSC compares favorably in terms of stability and misclustering rate against the spectral clustering algorithm without the eigen selection step. This empirical evidence suggests that ESSC in general increases the stability of spectral clustering algorithms and achieves competitive clustering results compared with the canonical alternatives. Although our theoretical analysis is conducted under the Gaussian distribution assumption, the general idea of eigenvector selection extends to other high-dimensional clustering problems such as community detection using network data.
We acknowledge that although the eigen selection idea for spectral clustering is mostly absent from the statistics community, it has been practiced in one previous work in the computer science literature. Indeed, Xiang and Gong (2008) proposed an EM algorithm to select the eigenvectors of an affinity matrix. However, their approach is a heuristic practice and lacks theoretical analysis of the eigenvalues and eigenvectors to back up the method.
There is relatively recent literature on theoretical and methodological developments in high-dimensional clustering. For instance, Ng, Jordan and Weiss (2002) proposed a symmetric-Laplacian-matrix-based spectral clustering approach and proved the corresponding consistency. Cai, Ma and Zhang (2019) proposed a clustering procedure based on the EM algorithm for a high-dimensional Gaussian mixture model and proved consistency and minimax optimality of the procedure. Jin and Wang (2016) proposed a Kolmogorov–Smirnov (KS) score based feature selection approach (IF-PCA) to first reduce the feature dimension before implementing spectral clustering. The feature selection idea for clustering was also considered in other works, including Chan and Hall (2010) and Azizyan, Singh and Wasserman (2013). None of these aforementioned works selects eigenvectors. In this sense, our method and theory complement the existing literature by providing a way to stabilize and improve the performance of existing spectral clustering methods.
The rest of the paper is organized as follows. We introduce the statistical model and key notations in Section 2. In Section 3, we present the main algorithm. Section 4 includes the theoretical results. Simulation studies and real data analysis are conducted in Section 5. Technical proofs and further discussion are relegated to the Supplementary Material.
2. Model setting and notations. In the methodological development and theoretical analysis, we consider the following sampling scheme. We assume that the data matrix X = (x_1, . . ., x_n) is generated by
(1)   x_i = Y_i µ_1 + (1 − Y_i) µ_2 + w_i,   i = 1, . . ., n,
where {w_i}_{i=1}^n are i.i.d. from the p-dimensional Gaussian distribution N(0, Σ), µ_1 and µ_2 are two p-dimensional non-random vectors, and Y_1, . . ., Y_n ∈ {0, 1} are deterministic latent class labels.
As such, Y_i = 1 means that the i-th observation x_i is from class 1, and Y_i = 0 means that x_i is from class 2. The parameters µ_1, µ_2 and Σ are assumed to be unknown. Without loss of generality, we assume that µ_1 ≠ µ_2 and µ_2 ≠ 0.
The main objective is to recover the latent labels Y_i's from the data matrix X. If {Y_i}_{i=1}^n were i.i.d. Bernoulli random variables, (1) would be a Gaussian mixture model. Our analysis can be extended to this setting, but we opt for fixed Y_i's to focus our attention on the eigen selection principle.
We introduce some notation that will be used throughout the paper. For a matrix B, we use ‖B‖ to denote its spectral norm. For any vector x, x(i) represents its i-th coordinate. For any random matrix (or vector) A, we use IEA to denote its expectation. We define c_11 = ‖µ_1‖_2², c_22 = ‖µ_2‖_2² and c_12 = µ_1^⊤µ_2, where ‖·‖_2 is the L_2 norm of a vector. For any positive sequences u_n and v_n, if there exists some positive constant c such that u_n ≥ c v_n for all n ∈ N, then we write u_n ≳ v_n. We denote the i-th largest eigenvalue of a square matrix A by λ_i(A). Finally, we denote σ_n² = ‖Σ‖(n + p).

3. Algorithm.
In this section, we develop a novel eigen selection procedure that improves the widely used spectral clustering algorithms. We start our reasoning from the noiseless case.
The entire logic flow of the development process is presented before we introduce the final eigen-selected spectral clustering algorithm (ESSC).
3.1. Motivation if the signal were known. Spectral methods frequently act on the top eigenvectors of the affinity matrix X^⊤X to recover the underlying latent class labels. As introduced previously, a common practice is to use the top K = 2 eigenvectors. In this section, we provide some intuition on how the top two eigenvectors contain useful information for clustering.
For notational convenience, denote a_1 = (Y_1, . . ., Y_n)^⊤ and a_2 = 1_n − a_1, so that IEX = µ_1 a_1^⊤ + µ_2 a_2^⊤, and let n_1 = ‖a_1‖_2² and n_2 = ‖a_2‖_2². Then n_1 and n_2 are the numbers of non-zero components of a_1 and a_2, respectively, and
(2)   H = (IEX)^⊤ IEX = c_11 a_1 a_1^⊤ + c_12 (a_1 a_2^⊤ + a_2 a_1^⊤) + c_22 a_2 a_2^⊤.
Next we discuss the properties of the spectrum of H. Because
(3)   rank((IEX)^⊤) ≤ rank(a_1 µ_1^⊤) + rank(a_2 µ_2^⊤) = 2,
there exist at most two n-dimensional orthogonal unit vectors u_1 and u_2 such that
(4)   H = d_1² u_1 u_1^⊤ + d_2² u_2 u_2^⊤.
Here, d_1² and d_2² are the top two eigenvalues of H and u_1 and u_2 are the corresponding (population) eigenvectors. Under our model setting, we have d_1² > 0 because otherwise µ_1 = µ_2 = 0, contradicting the model assumption. For simplicity, in the following, we use u = (u(1), . . ., u(n))^⊤ to denote either u_1 or u_2 and d² to denote its corresponding eigenvalue.
By exchanging the 2nd and 3rd rows and columns of H simultaneously, we can get a matrix H̃ with a clear block structure, given in (5).
The eigenvalues of H and H̃ are the same, and the eigenvectors are the same up to a proper permutation of their coordinates. Inspired by the block structure of H̃, we can see that (2) and (5) imply
(6)   c_11 Σ_{j: a_1(j) ≠ 0} u(j) + c_12 Σ_{j: a_2(j) ≠ 0} u(j) = d² u(i)   for each i such that a_1(i) ≠ 0,
(7)   c_12 Σ_{j: a_1(j) ≠ 0} u(j) + c_22 Σ_{j: a_2(j) ≠ 0} u(j) = d² u(i)   for each i such that a_2(i) ≠ 0.
From (6) and (7), we conclude that if d² > 0, then
(8)   u(i) takes one common value for all i with a_1(i) ≠ 0 and another common value for all i with a_2(i) ≠ 0.
Therefore, the eigenvector u corresponding to a nonzero eigenvalue d² > 0 takes at most two distinct values in its components. On the other hand, if d² > 0 and u takes two distinct values in its components, then these values have a one-to-one correspondence with the cluster labels. We also notice that when d² = 0, u would not be informative for clustering. Given these observations, we introduce the following definition for ease of presentation.
Definition 1. A population eigenvector u is said to have clustering power if its corresponding eigenvalue d² is positive and its coordinates take exactly two distinct values.
Theorem 1. The top two eigenvalues of H can be expressed as
(9)   d_1² = ( n_1 c_11 + n_2 c_22 + √((n_1 c_11 − n_2 c_22)² + 4 n_1 n_2 c_12²) ) / 2,   and
(10)  d_2² = ( n_1 c_11 + n_2 c_22 − √((n_1 c_11 − n_2 c_22)² + 4 n_1 n_2 c_12²) ) / 2.
Moreover, we conclude the following regarding the clustering power of u_1 and u_2. Theorem 1 implies that under our model described in equation (1), at least one of u_1 and u_2 has clustering power. More importantly, this theorem indicates that even in the noiseless setting (i.e., when H is known), there are cases in which only one eigenvector has clustering power, and that this eigenvector could be either u_1 or u_2. This suggests the potential importance of eigenvector selection in spectral clustering, and we propose Oracle Procedure 1 below to select a set U of important eigenvectors under the noiseless setting. Despite its simple form, Oracle Procedure 1 is difficult to implement at the sample level.
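For a quick numerical illustration that the informative eigenvector can be u_2 rather than u_1, consider a balanced noiseless toy example with ‖µ_1‖_2 = ‖µ_2‖_2 and c_12 > 0 (the specific numbers below are ours, chosen only for illustration):

```python
import numpy as np

n1 = n2 = 5
mu1 = np.array([1.0, 0.6])                  # equal norms and positive inner product,
mu2 = np.array([0.6, 1.0])                  # so c_11 = c_22 and c_12 > 0
labels = np.array([1] * n1 + [0] * n2)
EX = np.column_stack([mu1 if y == 1 else mu2 for y in labels])   # p x n mean matrix
H = EX.T @ EX                                                    # noiseless affinity matrix
_, eigvecs = np.linalg.eigh(H)
u1, u2 = eigvecs[:, -1], eigvecs[:, -2]     # eigenvectors of the two largest eigenvalues
print(np.round(u1, 3))   # all coordinates equal: u_1 has no clustering power
print(np.round(u2, 3))   # two distinct values that match the class labels
```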
To elaborate, note that in practice we will have to estimate the eigenvalues and eigenvectors (d_i², u_i), i = 1, 2. Without loss of generality, assume that d_1 ≥ d_2 ≥ 0. Note that d_1 and d_2 are the top two singular values of IEX, which can be naturally estimated by the top two singular values of X. Further note that u_1 and u_2 are the top two right singular vectors of IEX, which can be naturally estimated by û_1 and û_2, the top two right singular vectors of X. One useful technique in the literature for obtaining these sample estimates is to consider the linearization matrix
Z = ( 0    X^⊤
      X    0  ),
an (n + p) × (n + p) symmetric random matrix with a low-rank mean matrix. It can be shown that the top two singular values of X are the same as the top two eigenvalues of Z, and that the corresponding singular vectors of X, after appropriate rescaling, are subvectors of the top two eigenvectors of Z. See the next subsection for a detailed discussion of this relationship.
It has been proved in the literature that for random matrices with an expected low-rank structure, such as Z, the estimation accuracy of the spiked eigenvectors largely depends on the magnitudes of the corresponding eigenvalues. Specifically, as shown in Abbe et al., the entrywise estimation error for each spiked eigenvector is of order inversely proportional to the magnitude of the corresponding eigenvalue. Thus, a dense eigenvector may be estimated very poorly unless the corresponding eigenvalue has a large magnitude, that is, unless it is highly spiked. The results in Abbe et al. apply to large Gaussian ensemble matrices with independent entries on and above the diagonal. Similar conclusions can be found in Fan et al. (2018) and Bao, Ding and Wang under Wigner or generalized Wigner matrix assumptions.
Since spectral clustering is applied to estimated eigenvectors, these existing results suggest that, in high-dimensional two-class clustering, one should drop the second eigenvector in spectral clustering if the corresponding eigenvalue is not spiked enough, unless it is absolutely necessary to include it, for example when the first spiked population eigenvector has no clustering power.
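The following toy experiment (ours, not reproduced from the paper) illustrates this point numerically: as the second spiked singular value of IEX shrinks toward the noise level, the second sample eigenvector degrades quickly.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, n1 = 400, 100, 50
labels = np.array([1] * n1 + [0] * (n - n1))

def sign_aligned_error(a, b):
    # estimation error up to the sign ambiguity of eigenvectors
    return min(np.linalg.norm(a - b), np.linalg.norm(a + b))

for scale in (1.5, 0.8, 0.3):                  # controls d_2 relative to the noise level
    mu1 = np.zeros(p); mu1[:20] = 2.0           # class-1 mean (fixed, strong signal)
    mu2 = np.zeros(p); mu2[20:40] = scale       # class-2 mean, orthogonal to mu1
    EX = np.column_stack([mu1 if y else mu2 for y in labels])
    X = EX + rng.normal(size=(p, n))            # model (1) with Sigma = I
    pop_u2 = np.linalg.svd(EX, full_matrices=False)[2][1]   # population u_2
    hat_u2 = np.linalg.svd(X, full_matrices=False)[2][1]    # sample estimate of u_2
    print(f"d_2 ~ {np.sqrt(n1 * 20) * scale:5.1f}, "
          f"error of the estimated u_2: {sign_aligned_error(hat_u2, pop_u2):.3f}")
```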
On the other extreme, if the two spiked eigenvalues are the same, that is, in the case of multiplicity, at least one of u_1 and u_2 has clustering power by part (b) of Theorem 1. We argue that in this situation, at the sample level, it is better to use both spiked eigenvectors in clustering, for at least two reasons. First, by Proposition 1 to be presented in Section 4 and the remark after it, each d_i, i = 1, 2, can only be estimated with accuracy O_p(1); therefore, detecting exact multiplicity can be challenging. Second, the two spiked population eigenvectors are not identifiable: the two spiked sample eigenvectors estimate some rotation of (u_1, u_2), each with estimation accuracy of order inversely proportional to d_1 (or d_2) (Abbe et al.). Thus, even in the worst case where exactly one eigenvector is useful, including both in clustering will not deteriorate the clustering result much, because the additional estimation error caused by the useless eigenvector is of the same order as that caused by the useful one. In view of the discussions above, we update the oracle procedure as follows. Our implementable algorithm will mimic the oracle procedure below.
Algorithm 2 [Oracle Procedure 2]
1: Set U = ∅.
2: Check whether d_1² and d_2² are close enough (e.g., d_1²/d_2² < 1 + c_n), with c_n > 0 some threshold depending on n to be specified. If yes, add both u_1 and u_2 to U and go to Step 4; if no, go to Step 3.
3: Check whether u_1 has two distinct values in its components. If yes, add u_1 to U and go to Step 4; if no, add u_2 to U and go to Step 4.
4: Return U.
5: Use eigenvector(s) in U for clustering.

In Step 2 of Oracle Procedure 2, the positive sequence c_n is to help check whether d_1² and d_2² are close enough. We include a buffer c_n because, in implementation, d_1 and d_2 are estimated with errors (c.f. Proposition 1). As discussed above, the rationale behind Step 3 is that when the second eigenvalue is much smaller than the first, the estimated second eigenvector can be too noisy to be included for clustering; hence we use the second eigenvector only when the first one is not usable. Oracle Procedure 2 prepares us to introduce our final practical selection procedure.

3.2. Eigen selection algorithm.
The two oracle algorithms discussed in the previous subsection assume the knowledge of H. In practice, we observe X instead of H. Next, we elevate our reasoning on H to that on X and propose an implementable algorithm for eigenvector selection.
Denote by û_1 and û_2 the eigenvectors of the matrix Ĥ := X^⊤X corresponding to its two largest eigenvalues t̂_1² and t̂_2² (t̂_1 ≥ t̂_2 ≥ 0), respectively. As discussed after Oracle Procedure 1, t̂_1 and t̂_2 are the top singular values of X, while d_1 and d_2 are the top singular values of IEX. Thus, t̂_1² and t̂_2² estimate d_1² and d_2², respectively. Further note that û_1 and û_2 are the top two right singular vectors of X, while u_1 and u_2 are the top two right singular vectors of IEX. Under some conditions, when d_1²/d_2² ≠ 1, i.e., in the absence of multiplicity, û_1 and û_2 estimate u_1 and u_2 individually (up to sign); when d_1² = d_2², it is only possible for us to show that (û_1, û_2) ≈ (u_1, u_2)U (e.g., by the Davis–Kahan theorem), where U is some 2 × 2 orthogonal matrix.
Spectral clustering clusters the x_i's into two groups by dividing the coordinates of û_1 (and/or û_2) into two groups via the k-means algorithm. In some scenarios, d_2 is small (compared with d_1) and û_2 is significantly disturbed by the noise matrix X − IEX; in these scenarios, û_2 is likely not good enough to distinguish the memberships. Putting these observations together, Oracle Procedure 2 can be implemented by replacing the population quantities d_1, d_2, u_1 and u_2 with their sample counterparts. As briefly discussed in the previous subsection, for easier analysis of the eigenvalues and eigenvectors of Ĥ = X^⊤X, we consider the linearization matrix Z.
It can be shown that the top two eigenvalues of Z are t̂_1 and t̂_2. Let v̂_1 and v̂_2 be the eigenvectors of Z corresponding to t̂_1 and t̂_2, respectively, and let v̂_{−1} and v̂_{−2} be the eigenvectors of Z corresponding to −t̂_1 and −t̂_2, respectively. By Lemma 5 in the Supplementary Material, ±d_1 and ±d_2 are the eigenvalues of IEZ, and the vector consisting of the first n entries of the eigenvector of IEZ corresponding to d_k equals u_k/√2, k = 1, 2. Moreover, the vector consisting of the first n entries of the eigenvector of Z corresponding to t̂_k equals û_k/√2, k = 1, 2. Given these correspondences, we will leverage the two largest eigenvalues of Z and the corresponding eigenvectors for clustering.
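These correspondences can be checked numerically; the following sketch (with arbitrary dimensions, ours) verifies that the positive eigenvalues of Z equal the singular values of X and that the first n entries of the leading eigenvector equal the leading right singular vector divided by √2, up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 30, 10
X = rng.normal(size=(p, n))
Z = np.block([[np.zeros((n, n)), X.T],        # linearization Z = [[0, X^T], [X, 0]]
              [X, np.zeros((p, p))]])
eigvals, eigvecs = np.linalg.eigh(Z)
_, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.sort(eigvals)[-n:][::-1], s))              # top eigenvalues = singular values
v1 = eigvecs[:, -1]                                              # eigenvector of the largest eigenvalue
print(np.allclose(np.abs(v1[:n]), np.abs(Vt[0]) / np.sqrt(2)))   # first n entries = u_1 / sqrt(2)
```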
Based on the discussions above, we propose Algorithm 3, the eigen-selected spectral clustering algorithm (ESSC). Let τ_n and δ_n be two diminishing positive sequences (i.e., τ_n + δ_n = o(1)), and let u_0 be an (n + p)-dimensional vector whose first n entries are 1 and whose last p entries are 0. In our numerical implementation, we choose τ_n = log^{−1}(n + p) and δ_n = log^{−2}(n + p), which are guided by Theorems 2-3. Moreover, let f be a statistic, built from u_0 and v̂_1, that measures how far the first n entries of v̂_1 are from being constant; here v̂_1 estimates v_1, the unit eigenvector of IEZ corresponding to d_1. Hence, checking whether |f| is small enough (e.g., |f| < δ_n) is a reasonable substitute for checking whether u_1 has all equal entries.

Algorithm 3 [Eigen-Selected Spectral Clustering (ESSC)]
1: Set U = ∅.
2: Calculate t̂_1 and t̂_2 and the corresponding eigenvectors v̂_1 and v̂_2 from Z. Form û_1 and û_2 using the first n entries of v̂_1 and v̂_2, respectively.
3: Check whether t̂_1/t̂_2 < 1 + τ_n. If yes, add both û_1 and û_2 to U and go to Step 5; if no, go to Step 4.
4: Check whether |f| ≥ δ_n. If yes, add û_1 to U and go to Step 5; if no, add û_2 to U and go to Step 5.
5: Return U.
6: Apply the k-means algorithm to the vector(s) in U to cluster the n instances into two groups.

4. Theory. In this section, we derive theoretical results that support Steps 3 and 4 of Algorithm 3. We first prove in Proposition 1 asymptotic expansions for the eigenvalues t̂_1 and t̂_2. These results potentially allow us to design a thresholding procedure on either t̂_1 − t̂_2 or t̂_1/t̂_2 to detect the multiplicity of eigenvalues. Indeed, our proposition fully characterizes the behavior of t̂_1 and t̂_2, so that we can derive an expansion for t̂_1 − t̂_2, but this expansion depends on the covariance matrix Σ (see Remark 3), which is not easy to estimate without the class label information. Similarly, an expansion of t̂_1/t̂_2 involves Σ. These concerns motivate us to resort to a less accurate but empirically feasible detection rule for eigenvalue multiplicity. Concretely, we derive concentration results for t̂_1/t̂_2 that do not rely on estimates of Σ; they give rise to Step 3 of Algorithm 3. Theorems 2-3 provide a guarantee for using diminishing positive sequences τ_n and δ_n as thresholds in Steps 3 and 4 of Algorithm 3. We adopt the following assumption in the theory section.
Assumption 1. (i) The eigenvalues of Σ are bounded away from 0 and ∞. (ii) n^{1/C} ≤ p ≤ n^{C} for some constant C > 0. (iii) d_1 ≥ n^{ε} σ_n for some absolute constant ε > 0 and all n ≥ n_0(ε), where n_0(ε) ∈ N depends on ε.
Remark 1. Assumption 1 can accommodate both sparse and non-sparse parameters µ_1, µ_2 and Σ. To gain better insight, consider the balanced setting n_1 ∼ n_2. Combining Assumption 1 with the expression (9) for d_1², either inequality (11) or inequality (12) must hold. If inequality (11) holds, a lower bound on c_11 follows; otherwise, if inequality (12) holds, then by the Cauchy–Schwarz inequality c_12² ≤ c_11 c_22, a lower bound on c_22 follows. In the particular case p ∼ n, we have σ_n² = ‖Σ‖(n + p) ∼ n, and the inequality above reduces to a polynomial lower bound on the norm of µ_1 or µ_2. In other words, a sufficient condition for Assumption 1 is that the norm of µ_1 or µ_2 tends to infinity at some small polynomial rate in n. This includes both the sparse and non-sparse cases.
Remark 2. We illustrate a simple example that validates the condition above.

Before presenting Proposition 1, we introduce population quantities t_1 and t_2, which are asymptotically equivalent to the population eigenvalues d_1 and d_2. We will establish below that t_1 and t_2 are indeed the asymptotic means of t̂_1 and t̂_2, respectively.
As we work with Z, a linearization of Ĥ, we will investigate IEZ and Z − IEZ. Let v_1 and v_2 be the unit eigenvectors of IEZ corresponding to d_1 and d_2, and v_{−1} and v_{−2} the unit eigenvectors corresponding to −d_1 and −d_2. Then the eigen decomposition of IEZ can be written as IEZ = VDV^⊤ with V = (v_1, v_2, v_{−1}, v_{−2}) and D = diag(d_1, d_2, −d_1, −d_2) (equation (15)). For a complex variable z and any matrices (or vectors) M_1 and M_2 of suitable dimensions, we define the quantities used below, including g_11(z), g_12(z), g_21(z), g_22(z) and f(z) (see (17) and (18)). Under Assumption 1 and the signal strength condition (19), we have the following conclusions. 1. The equation det(f(z)) = 0 has exactly two solutions in [a_n, b_n] (equation (20)); we denote these solutions by t_1 and t_2 with t_2 ≤ t_1. 2. The expansion (21) holds. Equation (19) is a signal strength assumption requiring that the top two eigenvalues be spiked enough and that the second eigenvalue not be too much smaller than the top one. Proposition 1 provides asymptotic expansions of t̂_1 and t̂_2 in terms of g_11, g_12, g_21 and g_22; for t̂_2, we also have an alternative expression (25).
Proposition 1 provides asymptotic expansions of t̂_k around t_k (k = 1, 2) that are not achievable by a routine application of Weyl's inequality. To bound the main term in (26), we calculate the variance and covariance of v_i^⊤ W v_j, 1 ≤ i, j ≤ 2, in (27), where w_i denotes the vector of the last p entries of v_i. By Lemma 2 in the Supplementary Material and (17), we obtain (28) and (29). By (28) and Assumption 1 on Σ, for M_1 and M_2 with finitely many columns and bounded spectral norms, we obtain (30). Then (30), (S.73), (29), Assumption 1 and the definition of g(z) together imply (31). By Lemma 1 and (31), with probability tending to 1, inequality (32) holds, which involves the quantity ((g_11(t_1) + g_22(t_1))² − 4(g_11(t_1) g_22(t_1) − g_12²(t_1)))^{1/2}, for any positive constant ε. Through (27) and (29), we see that the information of Σ plays an important role on both sides of (32). Therefore, a good thresholding procedure on t̂_1 − t̂_2 would involve an accurate estimate of Σ, which is difficult to obtain in the absence of the label information.
Similar to the asymptotic expansion for t̂_1 − t̂_2, an asymptotic expansion for t̂_1/t̂_2 would also involve the covariance matrix Σ. Nevertheless, the latter has better concentration properties than the former, which motivates us to consider a non-random thresholding rule on t̂_1/t̂_2. The concentration properties of t̂_1/t̂_2 under different population scenarios are summarized in Theorem 2 and the first part of Theorem 3, respectively, with the former corresponding to the case close to multiplicity and the latter to the case far from multiplicity. Moreover, the second part of Theorem 3 validates Step 4 of ESSC. We would like to emphasize that Theorem 3 does not require d_2 to be spiked and thus can be applied even when d_2 = 0.
Theorem 2. Under Assumption 1, if d_1/d_2 ≤ 1 + n^{−c} for all n ≥ n_0, where c and n_0 are some positive constants, then there exists a positive constant C such that (33) holds, where ε is the constant in Assumption 1.
Theorem 3. Let u_0 be an (n + p)-dimensional vector whose first n entries are 1's and whose last p entries are 0's. Assume that Assumption 1 holds and that d_1/d_2 ≥ 1 + c for some positive constant c. Then for any positive constant D, (34) holds for all n ≥ n_0, where n_0 is some constant that depends only on the ε in Assumption 1 and the constant D. Moreover, if the first n entries of v_1 are equal, then (35) holds for all n ≥ n_0. By Theorems 2 and 3, we can choose τ_n ≥ C(n^{−c} + n^{−2ε}) and δ_n ≥ n^{−ε/2} in Algorithm 3.
In our simulations, we let τ_n = log^{−1}(n + p) and δ_n = log^{−2}(n + p). These choices were made because, in view of Assumption 1(ii), log^{−1}(n + p) ≥ n^{−c} + n^{−2ε} and log^{−2}(n + p) ≥ n^{−ε/2} for sufficiently large n and p.
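To make the procedure concrete, the following Python sketch follows Steps 1-6 of Algorithm 3 with the default thresholds above. It is a sketch only: the statistic f is implemented as a hypothetical surrogate (a normalized deviation of û_1 from a constant vector), since the exact definition of f is not reproduced in this section, and the function name essc is ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def essc(X, tau_n=None, delta_n=None):
    """Sketch of eigen-selected spectral clustering (ESSC) for two clusters.
    X is a p x n data matrix whose columns are observations."""
    p, n = X.shape
    if tau_n is None:
        tau_n = 1.0 / np.log(n + p)            # default threshold from the text
    if delta_n is None:
        delta_n = 1.0 / np.log(n + p) ** 2
    # Step 2: linearization Z = [[0, X^T], [X, 0]]; its top eigenvalues are the
    # top singular values of X, and the first n entries of its top eigenvectors
    # are (up to scaling) the top right singular vectors of X.
    Z = np.block([[np.zeros((n, n)), X.T], [X, np.zeros((p, p))]])
    eigvals, eigvecs = np.linalg.eigh(Z)
    t1, t2 = eigvals[-1], eigvals[-2]
    u1, u2 = eigvecs[:n, -1], eigvecs[:n, -2]
    # Hypothetical surrogate for f: relative deviation of u1 from a constant vector.
    f = np.linalg.norm(u1 - u1.mean()) / max(np.linalg.norm(u1), 1e-12)
    if t1 / t2 < 1 + tau_n:          # Step 3: eigenvalues too close, keep both
        U = np.column_stack([u1, u2])
    elif abs(f) >= delta_n:          # Step 4: u1 is informative
        U = u1.reshape(-1, 1)
    else:                            # otherwise fall back to u2
        U = u2.reshape(-1, 1)
    # Steps 5-6: cluster the n instances using the selected eigenvector(s).
    return KMeans(n_clusters=2, n_init=10).fit_predict(U)
```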
5. Simulation Studies. In this section, we compare our newly proposed eigen-selected spectral clustering (ESSC) with k-means, spectral clustering, CHIME, IF-PCA and the oracle classifier (a.k.a. the Bayes classifier). Recall that the oracle classifier distinguishes x|(Y = 1) ∼ N(µ_1, Σ) from x|(Y = 0) ∼ N(µ_2, Σ), and π = 0.4. For spectral clustering, there are many variants; in the simulations, we follow Ng, Jordan and Weiss (2002) with the common non-linear kernel k(x, y) = exp{−‖x − y‖_2²/(2p)} to construct an affinity matrix. For IF-PCA in Jin and Wang (2016), we directly apply the Matlab code provided by the authors without modification.

Throughout this section, we set µ_1 = r(µ_11^⊤, µ_12^⊤)^⊤, where µ_11 is an l-dimensional vector whose entries are all 1, µ_12 is a (p − l)-dimensional vector whose entries are all 0, and r is a scaling parameter. Our simulation is based on the following five models.
The covariance matrix Σ = I.
In Model 1, the covariance matrix Σ has non-zero off-diagonal entries.In Models 2-4, each non-zero entry of µ 1 and µ 2 with magnitude not bigger than r is covered by Gaussian noise with variance r 2 .In Models 3-4, µ 1 is parallel to µ 2 .With Model 5, we investigate how the trend of the misclustering rate changes with n.
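For reference, here is a minimal sketch of the data-generating step under model (1) with the µ_1 specified above; the defaults µ_2 = 0 and Σ = I are placeholders only (the exact specifications of Models 1-5 are given in the paper), and the function names are ours.

```python
import numpy as np

def simulate_two_class(n, p, l, r, pi=0.5, mu2=None, Sigma=None, seed=0):
    """Generate (X, labels) from model (1): x_i = Y_i*mu1 + (1 - Y_i)*mu2 + w_i,
    with mu1 = r * (1, ..., 1, 0, ..., 0) containing l ones. X is p x n."""
    rng = np.random.default_rng(seed)
    mu1 = np.concatenate([r * np.ones(l), np.zeros(p - l)])
    mu2 = np.zeros(p) if mu2 is None else mu2          # placeholder default
    Sigma = np.eye(p) if Sigma is None else Sigma      # placeholder default
    n1 = int(round(pi * n))
    labels = np.array([1] * n1 + [0] * (n - n1))
    means = np.where(labels[:, None] == 1, mu1, mu2)                 # n x p mean rows
    W = rng.multivariate_normal(np.zeros(p), Sigma, size=n)          # Gaussian noise
    return (means + W).T, labels

def misclustering_rate(pred, labels):
    """Misclustering rate of a two-class partition, up to label permutation."""
    err = np.mean(pred != labels)
    return min(err, 1 - err)
```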
For CHIME, we use the Matlab codes uploaded to GitHub by the authors of Cai, Ma and Zhang (2019). Since CHIME involves an EM algorithm, the initial value is very important; we use the default initial values provided in the Matlab codes. We also need to provide initial values for µ_1, µ_2, β_0 = Σ^{−1}(µ_1 − µ_2) and π, denoted by µ̂_1, µ̂_2, β̂_0 and π̂, respectively.
We repeat each model setting 100 times and report the average misclustering rate and the corresponding standard error in Tables 1-5. Tables 1-2 indicate that CHIME outperforms the other approaches for Models 1-2, while for Models 3-4 the performance of CHIME is worse than the others. We conjecture that this happens because the difference between µ_1 and µ_2 is small and µ_1 − µ_2 has more non-zero coordinates than in Model 2, which does not cater well to the sparsity assumptions in CHIME. In all five models, IF-PCA compares unfavorably against ESSC. We elaborate on our reasoning as follows. IF-PCA contains a step that subtracts the mean from the data, which is equivalent to considering the centered data X − x̄1_n^⊤, where x̄ = (1/n) Σ_{i=1}^n x_i and X = (x_1, . . ., x_n). The mean part of this matrix is IEX − (IEx̄)1_n^⊤. By the SVD of IEX, we have IEX = d_1 w_1 u_1^⊤ + d_2 w_2 u_2^⊤, where w_1 and w_2 are the corresponding left singular vectors. In some scenarios, the subtraction of (IEx̄)1_n^⊤ decreases the magnitude of the useful spiked singular value. For example, in Model 1, if n_1 = n/2, we can see that IEX = d_1 w_1 u_1^⊤ and IEX − (IEx̄)1_n^⊤ = (d_1/√2) w̃_1 ũ_1^⊤, where w̃_1 and ũ_1 are new left and right singular vectors. Note here that the singular value has decreased from d_1 to d_1/√2. Similar to our argument on eigenvalues, when the spiked singular values are small, the corresponding singular vectors might be too noisy for clustering. However, subtracting (IEx̄)1_n^⊤ does not always impact clustering negatively. For example, if u_1 and u_2 are orthogonal to 1_n and w_1 and w_2 are orthogonal to IEx̄, then d_1 and d_2 remain the singular values of IEX − (IEx̄)1_n^⊤, the same as those of IEX. Hence, the effect of subtracting (IEx̄)1_n^⊤ is complicated and varies case by case. In fact, as we will see in the real data analysis, IF-PCA performs among the best on several datasets. Table 5 for Model 5 indicates how the misclustering rates change as n increases. When n is small, ESSC performs better than the other methods.
In addition to the tables, we also report the average misclustering rates visually in Figures 1-5. In these figures, we plot the theoretical optimal misclassification rate of the oracle classifier (36), which is the Bayes error. Note that this is not the "Oracle" in the tables, which records the misclustering rates of the oracle rule evaluated on the samples.

5.1. Real data analysis. We use several gene microarray data sets collected and processed by the authors of Jin and Wang (2016). These data sets are canonical datasets analyzed in the literature, such as in Dettling (2004), Gordon et al. (2002) and Yousefi et al. (2009). We use a processed version at www.stat.cmu.edu/jiashun/Research/software/GenomicsData. On these data sets, we compare ESSC with IF-PCA and two spectral clustering methods. The first spectral method (SC1) directly applies k-means to the first n rows of (v̂_1, v̂_2), and the second method (SC2) uses the non-linear kernel described in the simulation section. We do not report the performance of CHIME in this section, as initializations of parameters such as Σ are not communicated in the original paper and, unlike in the simulations, there is no obvious initialization choice for real data. All the datasets considered in this section belong to the ultra-high-dimensional setting: in each dataset, the number of features is about two orders of magnitude larger than the sample size; see Table 6 for a summary. In supervised learning, when feature dimensionality and sample size have such a relation, an independence screening procedure is usually beneficial before implementing methods based on joint modeling. We adopt a similar two-step pipeline for clustering. As IF-PCA involves an independence screening step via normalized Kolmogorov–Smirnov scores, we follow the screening step of Jin and Wang (2016). Moreover, we subsample each dataset so that the resulting datasets all have an average size of 60. Concretely, when a dataset has n instances, we keep each instance with probability 60/n. For each dataset, we repeat the subsampling procedure 10 times and report the average misclustering rates of the clustering methods on the subsamples.
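A minimal sketch of this subsampling step (the function name and interface are ours):

```python
import numpy as np

def subsample_instances(X, labels, target=60, reps=10, seed=0):
    """Keep each of the n instances (columns of X) independently with probability
    target/n; repeat `reps` times and return the resulting (X, labels) pairs."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    keep_prob = min(1.0, target / n)
    subsamples = []
    for _ in range(reps):
        mask = rng.random(n) < keep_prob
        subsamples.append((X[:, mask], labels[mask]))
    return subsamples
```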
S. Proof of Theorem 1. We use u = (u(1), . . ., u(n))^⊤ to denote either u_1 or u_2 and d² to denote its corresponding eigenvalue, unless specified otherwise.
Because a_1 only takes two values, by (8), there are at most two values of u(i), i = 1, . . ., n.
We denote these values by v_1 and v_2. By (6) and (7), the number of v_1's in u is either n_1 or n_2.
Without loss of generality, we assume that the number of v_1's in u is n_1 and the number of v_2's in u is n_2.

SUPPLEMENTARY MATERIAL TO "EIGEN SELECTION IN SPECTRAL CLUSTERING: A THEORY-GUIDED PRACTICE"

S. Proof of Proposition 1. The main idea for proving Proposition 1 is to carefully construct a matrix whose eigenvalue is t_k − t_1; then, using an idea similar to that in the proof of Lemma 1, we can obtain the desired asymptotic expansions.
Assumption (19) implies that (S.6) It follows from d 2 σ n (by Assumption 1) and (S.6) that (S.7) It follows from (S.6) and Assumption 1 that Throughout the proof, (S.8) will be applied in every O p (•), o p (•), O(•) and o(•) terms without explicit quotation.We define a Green function of W (defined in ( 16)) by (S.9) Thus, by (S.7) and Lemma 3, with probability tending to 1, ) and G( t 2 ) are well defined and nonsingular with probability tending to 1. Since we only need to show the conclusions of Proposition 1 hold with probability tending to 1, in the sequel of this proof, we will assume the existence and nonsingu- By the decomposition of IEZ in (15) and definition of W in (16), we have Z = VDV − Combining this with the identity det(I + AB) = det(I + BA) for any matrices A and B, we have Since D > 0, it follows from the equation above that To analyze (S.11), we prove some properties of G(z) and the related expressions.First of all, by Lemma 1, we have (S.12) Therefore the distance of t k and d k is well controlled and will be used later in this proof.Now we turn to analyse t k , k = 1, 2. By (S.10), we have By ( 14), (S.13), (S.14), Lemmas 2 and 3, for any z ∈ [a n , b n ] we have It follows from Lemma 2 and (17 By (S.15) and Lemma 2, we can conclude that for all z By (S.17) and (S.20), we have (S.21)Moreover, by (S.17), (S.18) and (S.20) we have By (S.16)-(S.22),we have the following expansions Now we turn to (S.11).By (S.15), (S.17) and (S.20), we can see that ).In other words, the off diagonal terms in the determinant (S.11) are all O p ( 1 The 3rd diagonal entry in the determinant (S.11 By (S.15), (S.17) and (S.20), we have where The three equations (S.16), (S.18) and (S.25) lead to

and
(S.28) Further, recalling the definition in (S.28), it holds that By (S.17), (S.18) and (S.23), we have Plugging this into (S.27) and by Lemmas 2, we have for all Hence there exists a 2 × 2 random matrix B such that where B(z Further, in light of expressions (S.15) and (S.24), we can obtain the asymptotic expansion In view of (S.32) and the definition of t k , we have (S.33) By (S.26), (S.31) and (S.33), an application of the mean value theorem yields and for some positive constant C 0 and sufficiently large n.Combining the above two equations with Assumption 1 and where C is some positive constant.
By (S.48) and the condition that 2 }, by (S.47) and (S.49), we have 2 }, By Assumption 1, (S.47) and (S.49), for sufficiently large n we have This together with the assumption that d 1 /d 2 ≥ 1 + c implies (34).Now we turn to (35).Let Notice that v 1 is the unit eigenvector of Z corresponding to d 1 .By the definition of Z, we know that 2 1/2 u 1 is the unit eigenvector of X X corresponding to d 2 1 and 2 1/2 u 1 is the unit eigenvector of XX corresponding to d 2 1 .Similarly, by the condition that the first n entries of v 1 are equal, we imply that the first entries of v 1 are equal to (2n) −1/2 .Let 1 n be an n-dimensional vector whose entries are all 1's.
S.4. Technical Lemmas and their proofs.
Lemma 2. For X we considered in this paper, for any positive integer l, there exists a positive constant C l (depending on l) such that and IEx Wy = 0 and where x and y are two unit vectors (random or not random) independent of W.
where {W i } n i=1 are i.i.d.from N (0, Σ).The entries of Y are i.i.d.standard normal random variables.Moreover, we decompose W defined in ( 16) by Let the eigen decomposition of Σ be UΛU .Since the entries of Y are i.i.d.standard normal random variables, we have Y d = UY.Then W can be written as To establish Lemma 2, it remains to relax the bounded restriction (S.59).In other words, we need to replace the condition (S.59) by the condition of w ij , 1 ≤ i, j ≤ n in (S.58).We highlight the difference of the proof.Expanding IE( Let i = (i 1 , . . ., i l+1 ) and j = (j 1 , . . ., j l+1 ) with 1 We define an undirected graph G i whose vertices represent i 1 , . . ., i l+1 in i, The next Lemma follows directly from Theorem 2.1 in Bloemendal et al. (2014).
Lemma 3.For any constant c > 1.Under the same conditions as Lemma 2, we have for any , D > 0, there exists an integer n 0 ( , D) depending on and D, such that for all n ≥ n 0 ( , D), it holds Lemma 4. Suppose that c 12 = 0.If n 1 c 11 ≥ n 2 c 22 , then we have Proof.We prove this Lemma under the condition n 1 c 11 ≥ n 2 c 22 .Recall the definition of H in (2), if c 12 = 0, we have Notice that a 1 a 2 = 0, a 1 2 2 = n 1 and a 2 2 2 = n 2 , we imply that a 1 a 1 2 and a 2 a 2 2 are the two eigenvectors of H with corresponding eigenvalues n 1 c 11 and n 2 c 22 .By the definition of d 1 and d 2 in (4) and the condition that n 1 c 11 ≥ n 2 c 22 , we have We prove (20) first.By the definition of f (z) in ( 18), we have By Lemma 2 and the expansion (17), for any M 1 and M 2 with finite columns and spectral norms, we have By Assumption 1 on Σ, there exists a constant c such that Σ ≥ cI, therefore we have (S.74) By (S.74) and Lemma 2, for large enough n we have  By similar arguments and there exists t2 ∈ [a n , b n ] such that f 22 ( t2 ) = 0.
Hence we complete the proof of ( 20) and now we turn to (21).Calculating the first derivative of f ii , by Lemma 2, (S.79) and (S.80) we have (S.84) , for f 11 , by Lemma 2 we have Combining this with (S.84), we imply that (S.85) Similarly, we also have Finally, by Lemma 2 and (S.68), similar to the arguments of (S.76) and (S.77), we have (S.87) By (S.87) and (S.88) and the convexity of det(f (z)), we have Combining this with (S.83), (S.85) and (S.86), we imply that (S.89) a k a l c kl ≥ 0 .
By the same arguments as in (2)-(5), we conclude that H has a block structure. Let u be the unit eigenvector corresponding to one of the largest three eigenvalues of H and let d² be the corresponding eigenvalue. Following arguments similar to (5)-(8), we have that u has at most three distinct values in its components. The expression for d² will be more complicated than in the two-component case considered in this paper, which suggests the technical challenges that one would face in extending the current work to multiple-component Gaussian mixture models.
S.2. Non-Gaussian distributions. Checking the proofs of our main theorems carefully, we can see that the key tool is Lemma 2. As long as Lemma 2 holds, all of our theorems hold. Hence, for a non-Gaussian Z, it suffices to show that Lemma 2 holds under the non-Gaussian distribution.
The proof is expected to be more complicated than those of Lemmas 4 and 5 in Fan et al. (2018) and is worthy of further investigation.


Fig 7: Misclustering rate of the Breast Cancer data vs. different feature dimension p. The red curve represents IF-PCA, the cyan curve represents SC1, the blue curve represents SC2, and the black curve represents ESSC.

Fig 9: Misclustering rate of the Lung Cancer 2 data vs. different feature dimension p. The red curve represents IF-PCA, the cyan curve represents SC1, the blue curve represents SC2, and the black curve represents ESSC.
invertible with probability tending to 1. Recalling the determinant formula for block structure matrix that det (C) det(A − B C −1 B) , for any invertible matrix C and setting C = V − G( t k )V − − D, we have with probability x W l y = x W l y , where above diagonal entries of W = ( w ij ) 1≤i,j≤n are independent normal random variables such that for any positive integer r,(S.58)max 1≤i,j≤n IE| w ij | r ≤ Σ r c r ,where c r is the r-th moment of standard normal distribution.Actually, if { w ij } 1≤i,j≤n were bounded random variables with (S.59) max 1≤i,j≤n | w ij | ≤ 1 , then Lemmas 4 and 5 of Fan et al. (2018) imply that there exists a positive constant c l depending on l such that (S.60) IE| x ( W l − IE W l ) y| 2 ≤ c l σ l−1 n , and (S.61) |IE x W l y| ≤ c l σ l n .
Lemma 5. Let A be a p × n matrix and denote by Ã the (n + p) × (n + p) symmetric matrix with off-diagonal blocks A^⊤ and A (the linearization of A). If λ² is a non-zero eigenvalue of A^⊤A, then ±λ (λ > 0) are eigenvalues of Ã. Moreover, if a and b are the unit eigenvectors of A^⊤A and AA^⊤, respectively, corresponding to λ², then (S.66) holds, i.e., the corresponding unit eigenvectors of Ã are (a^⊤, ±b^⊤)^⊤/√2.

Proof. Any eigenvalue of Ã (denoted by x) satisfies the characteristic equation (S.67). If x ≠ 0, then (S.67) is equivalent to det(A^⊤A − x²I) = 0. This proves the first conclusion that ±λ are eigenvalues of Ã. By the definition of a and b, they are the right singular vector and left singular vector of A, respectively, corresponding to the singular value λ. Then equations (S.66) follow.

S.5. Proof of Lemma 1. The high-level idea for proving (20) is to show that i) det(f(a_n)) > 0 and det(f(b_n)) > 0, ii) the function det(f(z)) is strictly convex on [a_n, b_n], and iii) there exists some z ∈ (a_n, b_n) such that det(f(z)) ≤ 0. The result in (21) is then proved by carefully analyzing the behavior of the function det(f(z)) around d_1 and d_2.
70)Substituting z = a n into f , by (S.69), for large enough n we have(S.71)|R(v 1 , v 2 , a n )− D + R(V − , V − , z) −1 = O(b n ) z ∈ [a n , b n ] .By (S.71) and (S.72) we have (S.73) z ∈ [a n , b n ] .Therefore det(f (z)) is a strictly convex function and has at most two solutions to the equation det(f (z)) = 0, z ∈ [a n , b n ].By (S.69) and (S.70), we have a n , b n ] .Therefore f 11 (z) is a monotonic function in [a n , b n ].Moreover, by the definitions of a n , b n , σ n and Lemma 2, we havef 11 (a n ) < 0, f 11 (b n ) > 0.Hence we conclude that there is a unique point t1 ∈ [a n , b n ] such that f 11 ( t1 ) = 0.
Lemma 1 in the Supplementary Material), while Weyl's inequality gives |t̂_k − d_k| ≤ ‖W‖, which, combined with Lemma 3 in the Supplementary Material, implies that the fluctuation of t̂_1 − t̂_2 around d_1 − d_2 is O_p(σ_n). On the other hand, Proposition 1 also suggests that designing a statistical procedure by thresholding t̂_1 − t̂_2 would be a difficult task, as argued in detail in Remark 3.

Table 3
The misclustering rate of several approaches for Model 3 with π = 0.5

Since the number of non-zero coordinates of µ_1 and µ_2 in Model 4 is much smaller than that in Model 3, the signal strength of the means in Model 4 is not strong enough to produce large spiked singular values. As such, the performance of ESSC in Table 4 is worse than that of k-means when p is small (e.g., less than 200). However, since the misclustering rate of ESSC increases slowly as p increases, once p passes 200, ESSC competes favorably against k-means. Compared with Spectral Clustering, ESSC excels in all models for almost all p and n.

Table 4
The misclustering rate of several approaches for Model 4 with π = 0.5

Table 5
The misclustering rate of several approaches for Model 5 with π = 0.5

Table 6
Sample size and dimensionality of real data sets

Fig 6: Misclustering rate of the Colon Cancer data vs. different feature dimension p. The red curve represents IF-PCA, the cyan curve represents SC1, the blue curve represents SC2, and the black curve represents ESSC.

From Figures 6-10, we compare the methods as follows. ESSC and SC1 work better than IF-PCA for the Colon Cancer and Leukemia data. For the Lung Cancer 1 data, ESSC has a misclustering rate similar to IF-PCA in general and outperforms the other two approaches. For the Breast Cancer data, SC2 outperforms the other approaches, SC1 works a little better than IF-PCA, and ESSC has performance similar to SC1. For the Lung Cancer 2 data, IF-PCA has the best performance and ESSC is the second best. Overall, ESSC belongs to the top two across all five datasets, demonstrating its efficiency and stability.

6. Conclusion. In this work, with a two-component Gaussian mixture type model, we propose a theory-backed eigen selection procedure for spectral clustering. The rationale behind the selection procedure is generalizable to more than two components in the mixture; we refer interested readers to the Supplementary Material for further discussion. Moreover, for future work, it would be interesting to study how an eigen selection procedure might help spectral clustering when a non-linear kernel is used to create an affinity matrix.

by (S.31), and t_ij is some number between t_1 and t_k. By (S.32), similar to (S.84)-(S.89), we can show that g_11(t_2) − g_22(t_2) − ((g_11(t_2) + g_22(t_2))² − 4(g_11(t_1)g_22(t_2) − g_12²(t_2)))
Proof. The proofs of (S.44) and (S.45) are the same, so we only prove (S.44). By Lemma 2, we have g_ij(t_1) = g_11(t_1) − g_22(t_1) + ((g_11(t_1) + g_22(t_1))² − 4(g_11(t_1)g_22(t_1) − g_12²(t_1)))
Proof of Theorem 2. By Lemma 3 and Weyl's inequality, |t̂_k