Covariate Regularized Community Detection in Sparse Graphs

In this paper, we investigate community detection in networks in the presence of node covariates. In many instances, covariates and networks individually give only a partial view of the cluster structure, and one needs to consider both jointly to infer the full structure. In statistics, an emerging body of work combines information from the network edges and the node covariates to infer community memberships. However, so far the theoretical guarantees have been established in the dense regime, where the network alone can lead to perfect clustering under a broad parameter regime, and hence the role of covariates is often unclear. In this paper, we examine sparse networks in conjunction with finite dimensional sub-gaussian mixtures as covariates under moderate separation conditions. In this setting each individual source can only cluster a non-vanishing fraction of nodes correctly. We propose a simple optimization framework which provably improves clustering accuracy when the two sources each carry partial information about the cluster memberships, and hence perform poorly on their own. Our optimization problem can be solved with scalable convex optimization algorithms. Using a variety of simulated and real data examples, we show that the proposed method outperforms existing methods.


Introduction
Community detection in networks is a fundamental problem in machine learning and statistics. A variety of important practical problems, like analyzing socio-political ties among leading politicians (Gil-Mendieta & Schmidt 1996), understanding brain graphs arising from diffusion MRI data (Binkiewicz et al. 2017), and investigating ecological relationships between different tiers of the food chain (Jacob et al. 2011), can be framed as community detection problems. Much attention has been focused on developing models and methodology to recover latent community memberships. Among generative models, the stochastic block model (Holland et al. 1983) and its variants (Airoldi et al. (2008) etc.) have attracted a lot of attention, since their simplicity facilitates efficient algorithms and asymptotic analysis (Rohe et al. 2011, Amini et al. 2013, Chen & Xu 2016).
Although most real world network datasets come with covariate information associated with nodes, existing approaches primarily use only the network to infer the hidden community memberships or labels. Take for example the Mexican political elites network (described in detail in Section 4). This dataset consists of 35 politicians (military or civilian) and their connections. The associated covariate for each politician is the year he or she came into power.
After the military coup in 1913, the political arena was dominated by the military. In 1946, the first civilian president since the coup was elected. Hence those who came into power later are more likely to be civilians. Politicians who have a similar number of connections to the military and civilian groups are hard to classify from the network alone; here the temporal covariate is crucial in resolving which group they belong to. On the other hand, politicians who came into power around the 1940s are ambiguous to classify using covariates; here the number of connections to the two groups in the network helps in classifying these nodes. Our method successfully classifies these politicians and has higher classification accuracy than existing methods (Binkiewicz et al. 2017, Zhang et al. 2016).
In the statistics literature, there has been some interesting work on combining covariates and dense networks (average degree growing faster than the logarithm of the number of nodes). In Binkiewicz et al. (2017), the authors present assortative covariate-assisted spectral clustering (ACASC), which does spectral clustering on the gram matrix of the covariates plus the regularized graph Laplacian weighted by a tuning parameter. A joint criterion for community detection (JCDC) with covariates is proposed by Zhang et al. (2016), which can be seen as a covariate-reweighted Newman-Girvan modularity; this approach enables learning a different influence for each covariate. In concurrent work, Weng & Feng (2016) provide a variational approach for community detection.
All of the above works are carried out in the dense regime with strong separability conditions on the linkage probabilities. ACASC also requires the covariate dimension to grow with the number of nodes in order to establish consistency.
In contrast, we prove our result for sparse graphs where the average degree is constant and the covariates are finite dimensional sub-gaussian mixtures with moderate separability conditions. In our setting, neither source can yield consistent clustering in the limit. We show that combining the two sources leads to improved upper bounds on clustering accuracy under weaker separability conditions on each individual source.
Leveraging information from multiple sources has long been studied in machine learning and data mining under the general umbrella of multi-view clustering methods. Kumar et al. (2011) use a regularization framework that penalizes disagreement between the clusterings obtained from each view. Liu et al. (2013) optimize the nonnegative matrix factorization loss function on each view, plus a regularization forcing the factors from each view to be close to each other. The only provable method is by Chaudhuri et al. (2009), where the authors obtain guarantees when the two views are mixtures of log-concave distributions. This algorithm does not apply to networks.
In this paper, we propose a penalized optimization framework for community detection when node covariates are present. We work in the sparse degree regime of stochastic block models, where one can only correctly cluster a non-vanishing fraction of nodes. Similarly, we assume that the covariates are generated from a finite dimensional sub-gaussian mixture with moderate separability conditions. We prove that our method leads to improved clustering accuracy under weaker conditions on the separation between clusters from each source. As byproducts of our theoretical analysis, we obtain new asymptotic results for sparse networks under weak separability conditions and for kernel clustering of finite dimensional mixtures of sub-gaussians. Using a variety of real world and simulated data examples, we show that our method often outperforms existing methods. Using simulations, we also illustrate that when the two sources have only partial and in some sense orthogonal information about the clusterings, combining them leads to better clustering than using either individual source.
In Section 2, we introduce relevant notation and present our optimization framework. In Section 3, we present our main results, followed by experimental results on simulations and real world networks in Section 4. The majority of the proofs are presented in the appendix, with details deferred to the supplementary material.

Problem Setup
In this section, we introduce our model and set up the convex relaxation framework. For clarity, we list all definitions and notations that will be used later in Table 3.
Let $m_i$ be the size of cluster $i$, and let $m_{\min}$ and $m_{\max}$ be the minimum and maximum cluster sizes respectively. We use $\pi_i := m_i/n$, $\pi_{\min} := m_{\min}/n$ and $\alpha := m_{\max}/m_{\min}$. We denote by $A$ the $n \times n$ binary adjacency matrix and by $Y$ the $n \times d$ matrix of $d$-dimensional covariates. The generation of $A$ and $Y$ shares the true and unknown membership matrix $Z \in \{0,1\}^{n \times r}$. We define the graph model as $A_{ij} \sim \mathrm{Bernoulli}\big((ZBZ^T)_{ij}\big)$ independently for $i < j$, where $B$ is an $r \times r$ matrix of within- and across-cluster connection probabilities; furthermore $A_{ii} = 0$ for all $i \in [n]$. We consider the sparse regime where $n \max_{k,\ell} B_{k\ell}$ is a constant, and hence the average expected degree is also constant w.r.t. $n$. Amini et al. (2018) define two classes of block models in terms of separability properties of $B$: a block model is strongly assortative if $\min_k B_{kk} > \max_{k \neq \ell} B_{k\ell}$, and weakly assortative if $B_{kk} > \max_{\ell \neq k} B_{k\ell}$ for every $k$.
This distinction is important because the weakly assortative class of block models is a superset of the strongly assortative class, and most existing analyses are done in the stronger setting. To our knowledge, there has not been any work on weakly assortative block models in the sparse setting. For the covariates, we define $Y_i = \mu_k + W_i$ for $i \in C_k$, where the $W_i$ are mean zero $d$-dimensional sub-gaussian vectors with spherical covariance matrices $\sigma_k^2 I_d$ and sub-gaussian norm $\psi_k$. Standard definitions of sub-gaussian random variables (for more detail see Vershynin (2010)) are provided in the supplementary material. We define the distance between clusters $C_k$ and $C_\ell$ as $d_{k\ell} = \|\mu_k - \mu_\ell\|$ and the separation as $d_{\min} = \min_{k \neq \ell} d_{k\ell}$. From now on, we use $I_n$ to denote the identity matrix of size $n$, $\mathbf{1}_n$ to represent the all-one $n$-vector, and $E_n$, $E_{n,k}$ to represent the all-one matrices of size $n \times n$ and $n \times k$ respectively. We use standard order notations $O$, $o$, $\Omega$, $\omega$, etc., and $\tilde{O}$ to exclude multiplicative terms that are logarithmic in $n$.
A number of SDP relaxations have been proposed for network community detection; the common element in all of these is maximizing the inner product $\langle A, X \rangle$ between $A$ and a positive semidefinite matrix $X$. Here $X$ is a stand-in for the clustering matrix $ZZ^T$. Unequal cluster sizes are usually handled with an extra regularization term added to the objective function (see Hajek et al. (2016), Perry & Wein (2017), Cai & Li (2015) among others). While the above consistency results are for dense graphs, Guédon & Vershynin (2015) and Montanari & Sen (2016) show that in the sparse regime one can use this method to obtain an error rate which is a constant w.r.t. $n$. SDP relaxations have also been used for clustering sub-gaussian mixtures (see Peng & Wei (2007), Mixon et al. (2017), Yan & Sarkar (2016) for more references). In these settings one maximizes $\langle W, X \rangle$ for some positive semidefinite matrix $X$, where $W$ is a matrix of similarities between pairs of data points. For classical k-means $W_{ij}$ can be $Y_i^T Y_j$, whereas for k-means in kernel space one uses a suitably defined kernel similarity between the $i$th and $j$th covariates. We analyze the widely used Gaussian kernel to allow for non-linear boundaries between clusters. Let $K$ be the $n \times n$ kernel matrix whose $(i,j)$-th entry is $f(\|Y_i - Y_j\|^2)$, where $f(x) = \exp(-\eta x)$ for $x \geq 0$. This kernel function is upper bounded by 1 and is Lipschitz continuous w.r.t. the distance between two observations. Furthermore, in contrast to network-based SDPs, the above uses $X$ as a stand-in for the normalized variant of the clustering matrix $ZZ^T$, i.e. the desired solution is $X_0 = Z(Z^TZ)^{-1}Z^T$. It can be seen that $\|X_0\|_F^2 = r$. In our optimization framework, we propose to add a k-means type regularization term to the network objective, which enforces that the estimated clusters are consistent with the latent memberships in the covariate space:
$$\hat{X}_{A+\lambda_n K} := \arg\max_{X \in \mathcal{F}} \ \langle A + \lambda_n K,\, X \rangle, \qquad (4)$$
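As a concrete illustration, the Gaussian kernel matrix above can be computed in a few lines. This is our own numpy sketch (the function name and the vectorized distance computation are ours, not from the paper):

```python
import numpy as np

def gaussian_kernel(Y, eta):
    """K[i, j] = f(||Y_i - Y_j||^2) with f(x) = exp(-eta * x)."""
    sq = np.sum(Y**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Y @ Y.T   # pairwise squared distances
    d2 = np.maximum(d2, 0.0)                        # guard against round-off
    return np.exp(-eta * d2)
```

By construction the entries lie in $(0, 1]$, matching the boundedness of $f$ noted above.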
where $\lambda_n$ is a tuning parameter (possibly depending on $n$) and the constraint set $\mathcal{F} = \{X \succeq 0,\ 0 \leq X \leq \tfrac{1}{m_{\min}},\ X\mathbf{1}_n = \mathbf{1}_n,\ \mathrm{trace}(X) = r\}$ is similar to Peng & Wei (2007). The $m_{\min}$ in the constraint can be replaced by any lower bound on the smallest cluster size, and is mainly for convenience in the analysis; in the implementation, it suffices to enforce the elementwise positivity constraints and the other linear constraints. For ease of exposition, we define $M := A + \lambda_n K$. When $K(i,j) = Y_i^T Y_j$, the non-convex variant of the objective function naturally assumes a form similar to that of ACASC (modulo normalization of $A$).
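To make the feasibility claims concrete, here is a small numpy check (toy sizes are our choice) that the normalized clustering matrix $Z(Z^TZ)^{-1}Z^T$ satisfies the constraints in $\mathcal{F}$ and has squared Frobenius norm $r$:

```python
import numpy as np

n, r = 12, 3
labels = np.repeat(np.arange(r), n // r)         # ground-truth cluster of each node
Z = np.eye(r)[labels]                            # n x r binary membership matrix
X0 = Z @ np.linalg.inv(Z.T @ Z) @ Z.T            # normalized clustering matrix

assert np.allclose(X0, X0.T)                     # symmetric (a projection, hence PSD)
assert np.all(X0 >= 0)                           # elementwise nonnegative
assert np.allclose(X0 @ np.ones(n), np.ones(n))  # X0 1_n = 1_n
assert np.isclose(np.trace(X0), r)               # trace(X0) = r
assert np.isclose(np.sum(X0**2), r)              # ||X0||_F^2 = r
```

Since $X_0$ is an orthogonal projection of rank $r$, its squared Frobenius norm equals its rank, consistent with the claim above.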

Main Results
Typically in the existing SDP literature for sparse networks or sub-gaussian mixtures (Guédon & Vershynin 2015, Mixon et al. 2017), one obtains a relative error bound on the deviation of $X_M$ (the solution of the SDP) from the ideal clustering matrix $X_0$. This relative error is typically proportional to the ratio of the deviation of the observed matrix from a suitably defined reference matrix to a quantity which measures the separation between the different clusters. Our theoretical result shows that the relative error of the solution to the combined SDP is proportional to the ratio of the deviation of the observed $A + \lambda_n K$ matrix from a suitably defined reference matrix to a quantity which measures the separation between clusters; this quantity is a non-linear combination of the separations stemming from the two sources. We first present an informal version of the main result.

Main theorem (informal): Let $X_{A+\lambda_n K}$ be the solution of SDP (4). Let $s_k^G$ and $s_k^C$ be constants denoting the separations of cluster $k$ from the other clusters, defined in terms of the model parameters of the network and the covariates respectively. If the tuning parameter $\lambda_n = \lambda_0/n$ for some constant $\lambda_0$, then
$$\frac{\|X_{A+\lambda_n K} - X_0\|_F^2}{\|X_0\|_F^2} \;\lesssim\; \frac{c_G + \lambda_0\, c_C}{\min_k \big(s_k^G + \lambda_0\, s_k^C\big)},$$
where $c_G$ and $c_C$ are constants representing the error corresponding to the graph and the covariates.
Note that in the SBM the separation is well defined: when $M = A$, a natural choice of reference matrix is $E[A|Z]$, which is blockwise constant. In this case the separation is given by $\min_k (B_{kk} - \max_{\ell \neq k} B_{k\ell})$, and leads to a result on weakly assortative sparse block models which we present in more detail in Section 3.1. However, for the kernel matrix $K$, the main difficulty is that one cannot achieve elementwise or operator norm concentration of $K$ (also discussed in Von Luxburg et al. (2008)). This makes the choice of the reference matrix difficult. To better understand the role of the separation parameter, we first present a key technical lemma bounding $\|X_M - X_0\|_F$. The main goal of this lemma is to establish an upper bound on the Frobenius norm difference between the solution of an SDP with input matrix $M$ and the ideal clustering matrix.
Lemma 1. Let $X_M$ be defined by Eq. (4) for some input matrix $M$, and let $Q$ be a reference matrix satisfying a weak assortativity condition: the within-block entries of $Q$ exceed the corresponding between-block entries by a positive separation (the precise condition is stated in the appendix). Then $\|X_M - X_0\|_F^2$ is bounded by a constant multiple of the ratio of the deviation of $M$ from $Q$ to the minimum block separation of $Q$ (Eq. (5)).

Remark 1. The key to the above lemma is to find a suitable reference matrix $Q$ which satisfies some separation conditions between the blocks. The deviation between $X_M$ and $X_0$ is small if $M - Q$ is small, and large if the separation between blocks in $Q$ is small. While the proof technique is inspired by Guédon & Vershynin (2015), the details are different because of our use of different constraints and because our reference matrix $Q$ does not have to be blockwise constant and can be weakly assortative instead of strongly assortative.
The results on networks, covariates, and the combination of the two essentially reduce to identifying good reference matrices $Q$ for the input matrices $A$, $K$, and $A + \lambda K$, which:

1. satisfy the properties of $Q$ in the above lemma;

2. have a large separation between blocks, increasing the denominator of Eq. (5);

3. have a small deviation from $M$, thereby reducing the numerator of Eq. (5).

Now the main work is to choose the reference matrix $Q$ for $A + \lambda K$. As pointed out before, a common choice of reference matrix for $A$ is $E[A|Z]$. For the covariates, we divide the nodes into "good" nodes $S_k := \{i \in C_k : \|Y_i - \mu_k\| \leq \Delta_k\}$ and the rest, and define $S = \cup_{k=1}^r S_k$. The radii $\Delta_k$ will be chosen such that the kernel matrix induced by the rows and columns in $S$ is weakly assortative and $3\Delta_k + \Delta_\ell \leq d_{k\ell}$. A simple use of the triangle inequality gives $\min_{i,j \in S_k} K_{ij} \geq r_k$ and $\max_{i \in S_k, j \in S_\ell, \ell \neq k} K_{ij} \leq s_k$, for suitable constants $r_k$ and $s_k$ defined in terms of the kernel function and the $\Delta_k$'s. Hence the separation for cluster $k$ is $\nu_k := r_k - s_k$ (Eq. (6)). We define the reference matrix $K_I$ blockwise in terms of these quantities (Eq. (11)). The choice of $\Delta_k$ is crucial: a large $\Delta_k$ makes the set of non-separable nodes $S^c$ small, but drives down the separation $\nu_k$.
We are now ready to present our main result. As we will show in the proof, the new separation for cluster $k$ is $(a_k - b_k) + \lambda_0 \nu_k$. Typically, in the general case with unequal sub-gaussian norms, one benefits from using different $\Delta_k$'s for different clusters. For example, for a cluster with a large $a_k - b_k$, we can afford a small $\nu_k$; in terms of $\Delta_k$, such a cluster can have a large $\Delta_k$, which makes $|S_k|$ larger than before without affecting the separation $(a_k - b_k) + \lambda_0 \nu_k$ very detrimentally. We now present our first main theorem; its full statement, which bounds $\|X_{A+\lambda_n K} - X_0\|_F^2 / \|X_0\|_F^2$ in terms of the separations $(a_k - b_k) + \lambda_0 \nu_k$ and the cluster proportions $\pi_k$, is given in the appendix. Here $K_G$ is Grothendieck's constant; its exact value is still unknown, and the best known upper bound is $K_G \leq 1.783$ (Braverman et al. 2013). First note that in the sparse case we take $\lambda_n = \lambda_0/n$ for some constant $\lambda_0$. In general the upper bound depends on several parameters, such as $\lambda_n$ and the scale parameter $\eta$ in the Gaussian kernel; we provide procedures for tuning $\lambda_n$ and $\eta$ in Section 4. The $\Delta_k$'s appear in both the numerator and the denominator, and finding the optimal $\Delta_k$ is cumbersome in the general case with unequal $\psi_k$'s; in Section 3.2 we derive an upper bound for equal $\Delta_k$'s for concreteness. We now present two natural byproducts of our analysis: the result on graphs, i.e. bounds on $\|X_0 - X_A\|_F$, and the result on covariate clustering, i.e. bounds on $\|X_0 - X_K\|_F$.

Result on Sparse Graph
While most dense network-based community detection schemes give perfect clustering in the limit (Amini et al. 2013, 2018, Cai & Li 2015, Chen & Xu 2016, Yan, Sarkar & Cheng 2017), in the sparse case no algorithm is consistent; however, semidefinite relaxations (among others) can achieve an error rate governed by the within- and across-cluster probabilities (Guédon & Vershynin 2015, Montanari & Sen 2016). Existing sparse network analyses are done under strongly assortative settings.
Proposition 1 (Analysis for graphs). Let $a_k, b_k$, defined as in Theorem 1, be positive constants and let $g \geq 9$, where $g$ denotes the average expected degree. Then, with probability tending to 1, the relative error $\|X_0 - X_A\|_F^2 / \|X_0\|_F^2$ is bounded by a constant multiple of $K_G\, \alpha r \sqrt{g}\, /\, \min_k (a_k - b_k)$, where $\alpha := m_{\max}/m_{\min}$.
Note that in the above result, in order for the error rate to go to zero one would require $a_k - b_k$ to go to infinity, whereas by definition $a_k, b_k$ are constants. Therefore one can only hope for a small albeit constant error. In addition, both the number of clusters $r$ and the ratio $\alpha$ between the largest and smallest cluster sizes need to be of constant order w.r.t. $n$ in order to guarantee that the error rate does not increase as the network grows.
Remark 2 (Comparison with prior work). In contrast to having $\min_k a_k - \max_k b_k$ (strong assortativity) in the denominator as in Guédon & Vershynin (2015), we have $\min_k (a_k - b_k)$ (weak assortativity), which allows for a much broader parameter regime.

Result on Covariates
We present a result for covariates analogous to the sparse graph setting, which establishes that, while the SDP with covariates is not consistent at a finite signal-to-noise ratio, it achieves a small error rate if the cluster centers are far apart. Before delving into our analysis, we provide a brief overview of existing work.
For covariate clustering, it is common to make distributional assumptions; usually a mixture model with well-separated centers suffices to show consistency. The most well-studied model is the Gaussian mixture model, which can be inferred by the Expectation-Maximization (EM) algorithm, for which there have recently been local convergence results (Balakrishnan et al. 2017, Yan, Yin & Sarkar 2017), and by its variants (Dasgupta & Schulman 2007). The separation condition required for provable recovery is usually that the minimum distance between cluster centers is greater than some multiple of the square root of the dimension (or effective dimension).
Another popular technique is based on SDP relaxations. For example, Peng & Wei (2007) and Mixon et al. (2017) propose SDP relaxations for k-means type clustering. To make the analysis concrete, for Proposition 2 we use $\Delta_k = \Delta$.
Proposition 2 (Analysis for covariates). Let $K$ be the kernel matrix generated from the kernel function $f$, and let $\nu_k$ be as in Eq. (6). Then, with probability tending to 1, the relative error $\|X_0 - X_K\|_F^2 / \|X_0\|_F^2$ is bounded by a constant multiple of $1/\min_k \nu_k$; the precise statement is given in the appendix.
Remark 3 (Comparison with prior work). In recent work, Mixon et al. (2017) show the effectiveness of the SDP relaxation of k-means clustering for sub-gaussian mixtures, provided the minimum distance between centers is greater than the standard deviation of the sub-gaussian times the number of clusters $r$. We provide a dimensionality reduction scheme, which shows that the separation condition requires $d_{\min} = \Omega(\psi_{\max}\sqrt{\min(r, d)})$. Our proof technique is new and involves carefully constructing a reference matrix for Lemma 1.

Analysis of Covariate Clustering when $d \gg r$
In high dimensional statistical problems, the signal is often assumed to lie in a low dimensional subspace or manifold. This is why much of the Gaussian mixture modeling literature first computes a projection of the data onto a low dimensional subspace (Vempala & Wang 2004). To reduce the dimensionality of the raw data, one could perform feature selection on the covariates (e.g. Jin et al. (2017), Verzelen et al. (2017)). In contrast, here we propose a much simpler dimensionality reduction step, which does not distort the pairwise distances between cluster means too much. The intuition is that, for clustering a sub-gaussian mixture with $d \gg r$, the effective dimensionality of the data is $r$, since the cluster means lie in an at most $r$-dimensional subspace.
Hence we propose the following simple dimensionality reduction algorithm for $d \gg r$, in a spirit similar to Chaudhuri et al. (2009). We split the sample into two random subsets $P_1$ and $P_2$ of sizes $n_1$ and $n - n_1$, and compute the top $r-1$ eigenvectors $U_{r-1}$ of the second-moment matrix $\frac{1}{n_1}\sum_{i \in P_1} Y_i Y_i^T$. We then project the covariates from subset $P_2$ onto this lower dimensional subspace as $\tilde{Y}_i = U_{r-1}^T Y_i$ to get the low dimensional projections. We take $n_1 = n/\log n$.
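The split-sample projection can be sketched as follows. This is our own numpy rendering of the scheme; the use of the second-moment matrix of the first split is our reading of the (garbled) original, and the function name is ours:

```python
import numpy as np

def reduce_dimension(Y, r, rng):
    """Project covariates onto the top r-1 eigenvectors of the
    second-moment matrix estimated on a held-out split P1."""
    n = Y.shape[0]
    n1 = max(1, int(n / np.log(n)))            # size of the first split
    perm = rng.permutation(n)
    P1, P2 = perm[:n1], perm[n1:]
    M_hat = Y[P1].T @ Y[P1] / n1               # d x d second-moment matrix
    vals, vecs = np.linalg.eigh(M_hat)         # eigenvalues in ascending order
    U = vecs[:, -(r - 1):]                     # top r-1 eigenvectors
    return Y[P2] @ U                           # (n - n1) x (r - 1) projections
```

Splitting ensures the projection matrix is independent of the projected points, as discussed below.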
Lemma 2. If $\theta_{r-1}(M) \geq 12\psi_{\max}^2 + C\,\frac{d\log^2 n}{n}$ for some constant $C$, the projected $\tilde{Y}_i$ are also independent data points generated from an isotropic sub-gaussian mixture in $r-1$ dimensions. Furthermore, the minimum distance between the means in the $(r-1)$-dimensional space is at least $d_{\min}/2$ with probability at least $1 - \tilde{O}(r^2 n^{-d})$, where $d_{\min}$ is the separation in the original space.
The proof of this lemma is deferred to the supplementary material. We believe the proof can be generalized to non-spherical cases as long as the largest eigenvalue of the covariance matrix of each cluster is bounded. Typically $\theta_{r-1}(M)$ signifies the amount of signal. For example, for the simple case of a mixture of two Gaussians with $\pi_1 = 1/2$ and $\mu_2 = -\mu_1$, we have $\theta_{r-1}(M) = \|\mu_1\|^2$, which is essentially $d_{\min}^2/4$. Hence the condition on $\theta_{r-1}(M)$ essentially translates to a lower bound on the signal-to-noise ratio, i.e. $d_{\min}^2 \geq 48\psi_{\max}^2 + C'\,\frac{d\log^2 n}{n}$ for some constant $C'$. When $d > r$, if one applies Lemma 2 and works in the $(r-1)$-dimensional space, then as long as $d_{\min}^2 = \Omega(\psi_{\max}^2 r)$, the separation in the low dimensional space also satisfies the separation condition of Proposition 2. Thus the dimensionality reduction brings the separation condition in Proposition 2 down from $\Omega(\psi_{\max}\sqrt{d})$ to $\Omega(\psi_{\max}\sqrt{\min(r, d)})$.
The sample splitting is merely for theoretical convenience: it ensures that the projection matrix and the projected data are independent, so that the final projection is also an independent sample from a sub-gaussian mixture. To be concrete, the labels of $P_1$ do not matter asymptotically, since they incur a vanishing relative error in $\|X_0 - X_K\|_F / \|X_0\|_F$ (the explicit bound, which involves $\alpha$ and $r$, is in the supplement), where $\alpha$ and $r$ are both constants. In our setting, the relative error in Proposition 2 is a small but non-vanishing constant, and so this additional vanishing error term does not affect it. However, this sample splitting step is not necessary in practice (Chaudhuri et al. 2009), and so we do not pursue it further.
We now present the tuning procedure, and experimental results.

Experiments
In this section, we present results on real and simulated data. The cluster labels in our method are obtained by spectral clustering of the solution matrix returned by the SDP. We use SDP-comb, SDP-net and SDP-cov to denote the labels estimated from $X_{A+\lambda_n K}$, $X_A$ and $X_K$ respectively. Performance of the clustering is measured by normalized mutual information (NMI), defined as the mutual information of the two label distributions divided by the square root of the product of their entropies. We have also calculated classification accuracy; it shows similar trends, so only NMI is reported in this section. For real and simulated data, we compare: (1) covariate-assisted spectral clustering (ACASC) (Binkiewicz et al. 2017); (2) JCDC (Zhang et al. 2016); (3) SDP-comb; (4) SDP-net; and (5) SDP-cov. The last two serve as references for graph-only and covariate-only clustering respectively.
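For reference, NMI as defined above (mutual information divided by the square root of the product of the entropies) can be computed as in this sketch of ours; labels are assumed to be nonnegative integers:

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information: I(a; b) / sqrt(H(a) * H(b))."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    C = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):                    # contingency table of the labelings
        C[x, y] += 1
    P = C / n                                  # joint distribution
    pa, pb = P.sum(axis=1), P.sum(axis=0)      # marginals
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / (pa[:, None] * pb[None, :])[nz]))
    ha = -np.sum(pa[pa > 0] * np.log(pa[pa > 0]))
    hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
    return mi / np.sqrt(ha * hb)
```

Identical labelings (up to permutation of label names) give NMI 1, and independent labelings give NMI 0.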

Implementation and computational cost
Solving semidefinite programs with linear and non-linear constraints has been a challenging problem in the numerical optimization community. Many SDPs proposed in the statistical literature (Cai & Li 2015, Chen & Xu 2016, Amini et al. 2018) are solved by the alternating direction method of multipliers (ADMM) (Boyd et al. 2011). Although ADMM is tractable for mid-sized problems and exhibits reasonable numerical behavior, whether it converges in the presence of nonnegativity constraints, which are prevalent in the network literature, remains an open problem. Recently, Yang et al. (2015) proposed a majorized semismooth Newton-CG augmented Lagrangian method, called SDPNAL+, which is provably convergent. We solve the SDP using the MATLAB package SDPNAL+ in all our experiments. The package provides an efficient implementation of the algorithm; solving the SDP for a matrix of size $1000 \times 1000$ takes less than a minute on a MacBook with a 1.1 GHz Intel Core M processor.

Choice of Tuning Parameters
As we pointed out earlier, the elementwise upper bound $\frac{1}{m_{\min}}$ is only for convenience of the theoretical analysis, and we do not enforce this constraint in the implementation. The main tuning parameters are thus the scale parameter $\eta$ in the kernel matrix and the tradeoff parameter $\lambda_n$ between graph and covariates. In most of our experiments the number of clusters is assumed known; in this section we also provide a practical way to choose among candidate values of $r$ when it is not given.
Choice of $\eta$. We use the method proposed in Shi et al. (2009) to select the scale parameter. The intuition is to keep enough (say 10%) of the data points within the "range" of the kernel for most (say 95%) of the data points. Given the covariates, we first compute the pairwise distance matrix. Then for each data point $Y_i$ we compute $q_i$ as the 10% quantile of $\{d(Y_i, Y_j),\ j \in [n]\}$. The bandwidth is defined as $w = \frac{95\%\ \text{quantile of } \{q_i\}}{95\%\ \text{quantile of } \chi^2_d}$ and the scale parameter as $\eta = \frac{1}{2w^2}$. When the data is high dimensional, we first perform dimensionality reduction as in Section 3.3, and then use the intrinsic dimension to tune the scale parameter.
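The quantile heuristic above can be sketched as follows. This is our own reading of the (flattened) formula in the text; in particular the normalization by the $\chi^2_d$ quantile is reproduced as printed and may differ (e.g. by a square root) from the original, and the function name is ours:

```python
import numpy as np
from scipy.stats import chi2

def select_eta(Y, q_keep=0.10, q_cover=0.95):
    """Scale parameter eta = 1 / (2 w^2) via the per-point quantile heuristic."""
    n, d = Y.shape
    # pairwise Euclidean distance matrix
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    q = np.quantile(D, q_keep, axis=1)          # per-point 10% distance quantile
    w = np.quantile(q, q_cover) / chi2.ppf(q_cover, df=d)   # bandwidth
    return 1.0 / (2 * w**2)
```

In practice one would apply this after the dimensionality reduction of Section 3.3, with `d` set to the intrinsic dimension.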
Choice of $\lambda_n$. As $\lambda_n$ increases, the resulting $X_{A+\lambda_n K}$ clustering gradually changes from the $X_A$ clustering to the $X_K$ clustering. Our theoretical results show that, with the right $\lambda_n$, $X_{A+\lambda_n K}$ and $X_0$ should be close, and hence have similar eigenvalues. Let $\theta_i(M)$ be the $i$-th largest eigenvalue of matrix $M$, and define the eigen gap function for clustering matrices as $g(X) := (\theta_r(X) - \theta_{r+1}(X))/\theta_r(X)$. Using Weyl's inequality and the fact that $X_0$ has exactly $r$ nonzero eigenvalues (so that $g(X_0) = 1$), we pick the $\lambda_n$ maximizing $g(X_{A+\lambda_n K})$. In Figure 1(a)-(c), the panels from left to right represent the situations where the graph is uninformative (Erdős-Rényi), both sources are informative, and the covariates are uninformative. We plot $g(X_{A+\lambda_n K})$ and the NMI of the clustering from $X_{A+\lambda_n K}$ with the true labels against $\lambda_n$. Figure 1 shows that $g(X_{A+\lambda_n K})$ and the NMI of the predicted clustering have a similar trend, justifying the effectiveness of the tuning procedure.
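The eigen-gap criterion can be sketched as follows; here `solve_sdp` is a placeholder for any solver of the SDP (4), e.g. SDPNAL+, and both function names are ours:

```python
import numpy as np

def eigen_gap(X, r):
    """g(X) = (theta_r(X) - theta_{r+1}(X)) / theta_r(X), eigenvalues descending."""
    theta = np.sort(np.linalg.eigvalsh(X))[::-1]
    return (theta[r - 1] - theta[r]) / theta[r - 1]

def pick_lambda(A, K, r, lambdas, solve_sdp):
    """Return the lambda whose SDP solution maximizes the eigen gap."""
    gaps = [eigen_gap(solve_sdp(A + lam * K), r) for lam in lambdas]
    return lambdas[int(np.argmax(gaps))]
```

For the ideal clustering matrix $X_0$ (eigenvalues 1 with multiplicity $r$, then 0), the eigen gap equals 1, which is why solutions close to $X_0$ score highly.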
Unknown number of clusters. In many real world settings it is hard to know the number of clusters. Methods have been proposed for selecting the number of blocks under sparse stochastic block models (Le & Levina 2015), but most of these methods are designed specifically for the graph adjacency matrix and cannot be generalized to continuous-valued matrices. We observe that the eigen gap acts as an informative indicator for picking the number of clusters. So when the number of clusters is unknown, we run the SDP over a grid of $(\lambda_n, k)$ and choose the pair that maximizes the eigen gap. As shown in Figure 2, we construct two settings and test the performance of using the eigen gap to select $r$. In the first setting, the true model has 3 clusters with proportions 3:4:5 and probability matrix $B = 0.01 \begin{pmatrix} 1.6 & 1.2 & 0.16 \\ 1.2 & 1.6 & 0.02 \\ 0.16 & 0.02 & 1.2 \end{pmatrix}$. We sample $n = 800$ data points and run the SDP with different choices of $\lambda_n$ and specified number of clusters $k$. For each parameter pair we compute the NMI and the eigen gap, and plot them in the upper and lower panels of Figure 2(a). As we can see, the eigen gap presents a similar trend as the NMI, hence picking the pair that optimizes the eigen gap yields a relatively high NMI as well. Note that here the mis-specified $k = 2$ has a higher NMI than the true value of $r$. This tells us that even when the number of clusters is mis-specified, the SDP is still able to find structure that correlates with the underlying model. This phenomenon has also been observed in several other works (Yan, Sarkar & Cheng 2017, Perry & Wein 2017).
In the second scenario, we generate a planted partition model with 10 equal-sized clusters, where $B = 0.046 I_{10} + 0.004 E_{10}$, along with Gaussian covariates whose centers are the rows of $[3 I_{10} \mid 0_{10, 90}]$. We conduct the same type of experiment as above and plot the NMI and the eigen gap. In this case, the eigen gap successfully recovers the true number of clusters.

Simulation Studies
In this part we consider two simulation settings. In the first setting, we generate three clusters with sizes in ratio 3:4:5 and $n = 800$. The probability matrix is $B = 0.01 \begin{pmatrix} 1.6 & 1.2 & 0.16 \\ 1.2 & 1.6 & 0.02 \\ 0.16 & 0.02 & 1.2 \end{pmatrix}$, and the covariates for each cluster are generated from 100-dimensional unit-variance isotropic Gaussians whose centers are non-zero only in the first two dimensions, with $\mu_1 = (0, 2, 0, \ldots)$. This is the same setting as in the first simulation for unknown $r$. In this example, the network cannot separate clusters one and two well, whereas the covariates can. On the other hand, clusters two and three are not well separated in the covariate space, while they are well separated using the network parameters. The experiments are repeated on 10 independently generated samples, and the box plot of the NMI is shown in Figure 3(c). In the second row of Figure 3, we examine covariates with nonlinear cluster boundaries. The graph used here is the same as above, and the covariates are 2-dimensional; their scatter plot is shown in Figure 3(e). In this case, the kernel matrix is able to pick up local similarities and hence performs better than combination via the inner product similarity used in ACASC. In both simulations, SDP-comb outperforms the others.

Real World Networks
Now we present results on a real-world social network and an ecological network. Clustering performance is evaluated by NMI with the ground-truth labels. Table 4 shows the NMI of all methods, where our method outperforms the other covariate-assisted approaches. From Figure 4(a, c), for example, node 35 has exactly one connection to each of the military and civilian groups, but seized power in the 90s, which strongly indicates a civilian background. On the other hand, node 9 took power in 1940, a year when civilians and the military had an almost equal presence in politics, making it hard to detect node 9's political affiliation. However, this node has more edges to the military group than to the civilian group. By taking the graph structure into consideration, we can correctly assign the military label to it.
Weddell sea trophic dataset. The next example we consider is an ecological network collected by Jacob et al. (2011) describing the marine ecosystem of the Weddell Sea, a large bay off the coast of Antarctica. The dataset lists 489 marine species and their directed predator-prey interactions, as well as the average adult body mass of each species. We use a thresholded symmetrization of the directed graph as the adjacency matrix. Let G be the directed graph; the (i, j)-th entry of GG^T captures the number of other species on which both i and j feed. We create binary matrices A_τ = 1(GG^T ≥ τ). Choosing different τ's between 1 and 10 gives similar clusterings; we use τ = 5. All species are labeled into four categories based on their prey types. Autotrophs (e.g. plants) do not feed on anything. Herbivores feed on autotrophs. Carnivores feed on animals that are not autotrophs, and the remaining species are omnivores, which feed both on autotrophs and on other animals (herbivores, carnivores, or omnivores). Since the body masses of species vary widely, from nanograms to tons, we work with the normalized logarithm of mass, following the convention in Newman & Clauset (2016). Figure 5(b) illustrates the log body mass for each species. Without loss of generality, we order the nodes as autotrophs, herbivores, carnivores and omnivores.
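The preprocessing just described can be sketched in a few lines of numpy. The helper names `threshold_cooccurrence` and `normalize_log_mass` are hypothetical, and the snippet assumes G is a 0/1 adjacency matrix with G[i, j] = 1 when species i feeds on species j.

```python
import numpy as np

def threshold_cooccurrence(G, tau):
    """Symmetrize a directed predator-prey graph: (G @ G.T)[i, j] counts
    the prey shared by species i and j; threshold the count at tau."""
    C = G @ G.T
    A = (C >= tau).astype(int)
    np.fill_diagonal(A, 0)  # keep a zero diagonal, as assumed in the analysis
    return A

def normalize_log_mass(mass):
    """Body masses span nanograms to tons, so standardize the logarithm
    of mass (the convention attributed to Newman & Clauset)."""
    log_m = np.log(mass)
    return (log_m - log_m.mean()) / log_m.std()
```

With τ = 5 this yields the matrix A_τ plotted in Figure 5(c).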
In Figure 5(c), we plot A_τ. Since the autotrophs do not feed on other species in this dataset, and since herbivores do not have much overlap in the autotrophs they feed on, the upper left corner of the input network is extremely sparse. On the other hand, the body sizes of autotrophs are much smaller than those of the other prey types; therefore the kernel matrix clearly separates them out.
We see that SDP-net (Figure 5(e)) heavily misclusters the autotrophs, since it relies only on the network. SDP-cov (Figure 5(f)) takes only the covariates into account and cannot distinguish herbivores from omnivores, since they possess similar body masses. However, SDP-comb (Figure 5(d)) achieves a significantly better NMI by combining both sources. Table 4 shows the NMI between the predicted labels and the ground truth for SDP-comb, JCDC and ACASC. While JCDC and ACASC can only do as well as the better of the graph or the covariates, our method achieves a higher NMI.
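For completeness, the NMI used throughout these comparisons can be computed directly from the contingency table of the two labelings. The function name `nmi` is ours; library implementations such as scikit-learn's `normalized_mutual_info_score` can be used instead. This minimal numpy sketch assumes both labelings have at least two clusters, so the entropies in the denominator are nonzero.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information I(a; b) / sqrt(H(a) H(b))."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    ua, ub = np.unique(a), np.unique(b)
    # contingency table of joint label counts
    C = np.array([[np.sum((a == x) & (b == y)) for y in ub] for x in ua])
    P = C / n
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    I = np.sum(P[nz] * np.log(P[nz] / np.outer(pa, pb)[nz]))
    H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return I / np.sqrt(H(pa) * H(pb))
```

NMI is 1 for labelings that agree up to a permutation of cluster names and 0 for independent labelings.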

Discussion
In this paper, we propose a regularized convex optimization framework to infer community memberships jointly from sparse networks and finite dimensional covariates. We show theoretically that our framework improves on the clustering accuracy of either source under weaker separation conditions. In particular, when each source carries only partial information about the clustering, our methodology can achieve high clustering accuracy even though either source on its own fails. We demonstrate the performance of our methodology on simulated and real networks, and show that it generally performs better than other state-of-the-art methods. While for ease of exposition we limit ourselves to two sources, our method generalizes easily to multiple views or sources. Empirically, we demonstrate that our method works for covariates with nonlinear cluster boundaries; we intend to extend our theoretical analysis to this setting, and to non-isotropic covariates, as well.
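As a concrete illustration of how the two sources enter the optimization, the following sketch forms the covariate kernel K_ij = exp(-η‖Y_i - Y_j‖²) used in our analysis and the regularized input A + λK. The function names are hypothetical, and solving the resulting SDP over the feasible set is left to a standard convex solver.

```python
import numpy as np

def covariate_kernel(Y, eta):
    """Gaussian kernel K_ij = exp(-eta * ||Y_i - Y_j||^2) on node
    covariates Y (n x d), matching f(x) = exp(-eta x^2) in the analysis."""
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Y @ Y.T   # squared distances
    return np.exp(-eta * np.maximum(D2, 0.0))

def combined_similarity(A, Y, lam, eta):
    """Input to the SDP: graph plus regularized covariate kernel.
    The SDP then maximizes <A + lam * K, X> over the feasible set."""
    return A + lam * covariate_kernel(Y, eta)
```

The parameter λ trades off the two sources; λ = 0 recovers network-only clustering, and large λ recovers covariate-only clustering.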

B Proof of Lemma 1
We start with the following lemma, whose proof is in the Supplementary.
Proof. We first show that for all such X, the eigenvalues of X lie in [0, 1]. Let v_i be the eigenvector of X corresponding to the i-th largest eigenvalue θ_i. Since X is positive semi-definite, θ_i ≥ 0 for all i. Without loss of generality, let i* = argmax_i |v_1(i)|, i.e. the index of the entry of v_1 with the largest absolute value. Since Xv_1 = θ_1 v_1, Σ_j X_{ij} = 1 and X_{ij} ≥ 0, we have θ_1 |v_1(i*)| = |Σ_j X_{i*j} v_1(j)| ≤ Σ_j X_{i*j} |v_1(i*)| = |v_1(i*)|. Therefore θ_1 ≤ 1. We are now in a position to prove Lemma 1.
Proof of Lemma 1. Note that both X_0 and X_M are in the feasible set F; by optimality, we have ⟨M, X_M⟩ ≥ ⟨M, X_0⟩. We construct Q as stated in the lemma to obtain the required bound. Note that Q is constant on the diagonal blocks and upper bounded by q_k on the off-diagonal blocks, with respect to the clustering of the nodes. The third line and the last inequality use the constraints Σ_j X_{ij} = 1 and 1 − Σ_{j∈C_k} X_{ij} ≥ 1 − Σ_j X_{ij} = 0. On the other hand, the claim follows by Lemma 5 and the bound on ‖X_0‖. We next introduce the following result on sparse graphs, obtained via Grothendieck's inequality by Guédon & Vershynin (2015).
Then, with probability at least 1 − e^3 5^{−n}, the stated bound holds, where K_G is Grothendieck's constant, whose best known upper bound is 1.783.
Proof of Proposition 1. Notice that A and P := E[A | Z] have zero diagonals. Therefore the stated inequality holds, where p_max = max_k a_k/n and p_min = min_k a_k/n. Thus, by Lemma 1 and Eq. (8), the bound follows. In the sparse regime, both m_min X_0 and m_min X_A belong to the set M_G^+. Letting g = np ≥ 9 and applying Lemma 6, we get that, with probability at least 1 − e^3 5^{−n}, the stated inequality holds. Recalling that α := m_max/m_min, we get, with probability tending to 1, the claimed bound.

D Proof of Proposition 2
Proof of Proposition 2. Recall that, by definition, for i ∈ C_k, Y_i − μ_k is a sub-gaussian random vector with sub-gaussian norm ψ_k. Using the following concentration inequality from Hsu et al. (2012) for sub-gaussian random vectors, we can, with high probability, divide the nodes into "good nodes" S_k (those close to their population mean) and the rest. Note that |S_k| is a sum of i.i.d. random variables; therefore, using the Hoeffding bound and taking δ = √(log m_k / (2m_k)), we obtain the stated bound. Finally, a union bound over all clusters gives the claim. By Lemma 1, all diagonal blocks of K_I are blockwise constant and the off-diagonal blocks are upper bounded accordingly. It now remains to bound the ∞ → 1 norm of K − K_I. Note that if i ∈ S_k, j ∈ S_ℓ, k ≠ ℓ, then a simple use of the triangle inequality yields the stated chain of inequalities, where (i) is due to |K_ij − (K_I)_ij| ≤ 1, and (ii) comes from the definition of K_I. Eq. (12) then follows. Recall that f(x) = exp(−ηx²), with η depending on ψ²_max, d and a parameter φ > 0 to be chosen later. We also define the quantities (A) and (B). We first bound part (A), where step (i) uses the mean value theorem, e^x − 1 ≤ x + e^y x²/2 for x ∈ [0, y]; using the fact that log x ≤ √x, the bound follows. Using Eq. (15), we see that ξ ≥ 2, and hence γ > 0. We now pick φ = (log ξ)/ξ² and use this to obtain a lower bound on 1 − exp(φ − φξ²). Since ξ ≥ 2, we have ξ²/4 ≥ 1. Using the fact that the function (log x)/x² is monotonically decreasing for x > 2, we see that φ ≤ (log 2)/4 and exp(φ) ≤ 1.2. Eq. (18) now yields the stated bound. Finally, we bound (B) in Eq. (17) using Eq. (19), for some constant c_1 > 0.
Putting the pieces together, we obtain the claim.
Proof of Theorem 1. Let K_I be defined as in Eq. (11). Applying Grothendieck's inequality to both ⟨A − P, X_{A+λ_n K} − X_0⟩ and ⟨K − K_I, X_{A+λ_n K} − X_0⟩, one gets the stated bound. By Lemma 6 and Eq. (13), in conjunction with Eq. (10), we get the claim with probability tending to 1.
F Analysis of covariate clustering when d ≫ r
Before proving Lemma 2, we clearly state our assumptions and collect other useful lemmas.
Assumption 1. We assume that M is of rank r − 1, i.e. the means are not collinear or otherwise linearly dependent, beyond the fact that they are centered.
This is a direct consequence of Corollary 5.50 of Vershynin (2010). The main ingredient of the proof is provided below.
Lemma 8. Let U_{r−1} be the top r−1 eigenvectors of Ŝ estimated using P_1, and let λ be the smallest positive eigenvalue of M. For any vector v in the span of {μ_i}_{i=1}^r, as long as λ > 5(ψ²_max + C√(d log² n / n)), we have ‖U^T_{r−1} v‖ ≥ ‖v‖/2 with probability at least 1 − Õ(n^{−d}).
Proof. Take n_1 = n/log n and let v be a vector in the span of {μ_i}_{i=1}^r. By definition, we have ‖Mv‖ ≥ λ‖v‖. Let R = Ŝ − S. Denote σ̄² = Σ_i π_i σ_i²; by Lemma 7, S = M + σ̄² I_d. We also know that σ̄² ≤ σ²_max ≤ ψ²_max by the properties of sub-gaussian distributions. Since S is estimated from P_1 with n_1 points, applying Lemma 7 with n = n_1 we get ‖R‖ ≤ ε := C√(d log n_1 / n_1). By Weyl's inequality, ‖Ŝv‖ = ‖(M + R + σ̄² I_d)v‖ ≥ (λ − σ²_max − ε)‖v‖. Let U_{r:d} be the eigenspace orthogonal to U_{r−1}. Assume for contradiction that ‖U^T_{r−1} v‖ < ‖v‖/2. Then there must be a unit d-dimensional vector u ∈ span(U_{r:d}) such that |u^T v| > ‖v‖/2. On the one hand, if we write u = c v/‖v‖ + √(1 − c²) v_⊥ for |c| > 1/2 and some unit vector v_⊥ orthogonal to v, we have ‖Ŝu‖ ≥ (λ − σ²_max − ε)/2 − √(1 − c²)‖Ŝv_⊥‖. Note that Ŝv_⊥ = (M + R + σ̄² I_d)v_⊥. Since v_⊥ is orthogonal to the span of M, ‖Ŝv_⊥‖ ≤ σ²_max + ε. On the other hand, since u ∈ span(U_{r:d}), by Weyl's inequality, ‖Ŝu‖ ≤ |λ_k(Ŝ)| ≤ σ²_max + ε. This contradicts Eq. (20), since we assume λ > 5(ψ²_max + ε) ≥ 5(σ²_max + ε). The result follows by contradiction.
Remark 4. Note that the result can be generalized to the non-spherical case as long as the largest eigenvalue of the covariance matrix of each cluster is bounded.
We are now ready to prove Lemma 2.

G From X to cluster labels
Given a solution matrix X, we can apply spectral clustering to it to obtain cluster labels. Below we present a theorem that bounds the misclassification error by the Frobenius norm of the matrix difference. The proof technique is inspired by those in Rohe et al. (2011) and Yan & Sarkar (2016).
Theorem 2. The number of misclassified nodes is bounded by 64 m_max ‖X − X_0‖²_F.
Proof. Let Û be the top r eigenvectors of X, and let U ∈ R^{n×r} be the top r eigenvectors of X_0. Let ν ∈ R^{r×r} collect the population eigenvector values corresponding to each cluster, so that U = Zν. By the Davis-Kahan theorem (Yu et al. 2014), we obtain the stated bound. Define M = {i : ‖c_i − Z_i νO‖ ≥ 1/√(2 m_max)}. We now prove that M is a superset of all nodes misclassified by the above procedure, and that its cardinality is bounded as in the theorem statement. U is a unit basis, so the stated identity holds. We now prove that all points lying outside of M are correctly labeled, or equivalently, that ‖c_i − Z_i νO‖ < ‖c_i − Z_j νO‖ for all Z_j ≠ Z_i. To see this, note that for all i, j ∈ [n] with Z_i ≠ Z_j, the stated separation holds.
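The rounding procedure analyzed in Theorem 2, taking the top-r eigenvectors of X and clustering their rows with k-means, can be sketched as follows. This is a minimal numpy implementation with a deterministic farthest-point initialization standing in for a standard k-means routine; the function name is ours.

```python
import numpy as np

def labels_from_X(X, r, n_iter=50):
    """Round an SDP solution X to cluster labels: embed nodes via the
    top-r eigenvectors of X, then run Lloyd (k-means) iterations."""
    _, vecs = np.linalg.eigh((X + X.T) / 2)  # eigh returns ascending order
    U = vecs[:, -r:]                         # rows are embedded nodes
    # deterministic greedy farthest-point initialization
    centers = [U[0]]
    for _ in range(r - 1):
        d = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for k in range(r):
            if np.any(lab == k):
                centers[k] = U[lab == k].mean(0)
    return lab
```

On the ideal block-constant solution X_0, the embedded rows coincide within each cluster, so the rounding recovers the partition exactly.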

Figure 2: NMI and eigen gap for various choices of r.

Figure 3: The first and second rows show results for isotropic Gaussian covariates and for covariates lying on a nonlinear manifold, respectively. We plot the adjacency matrix A in (a) and (d), where blue, red and purple points represent within-cluster edges for the 3 ground-truth clusters and yellow points represent inter-cluster edges. In (b) and (e) we plot the covariates; different shapes and colors indicate different clusters. (c) and (f) show the box plots of NMI.
For simplicity, we assume c_k = c_0. We take c_0 = log(d²_min/(ψ²_max d))/d and the scale parameter η = φ/(20c_0²).

Lemma 7. Let M = Σ_k π_k μ_k μ_k^T and let S be the covariance matrix of n data points from a sub-gaussian mixture; then S = M + (Σ_i π_i σ_i²) I_d. Let Ŝ be the sample covariance matrix Ŝ = (1/n) Σ_{i=1}^n (Y_i − Ȳ)(Y_i − Ȳ)^T. Then ‖Ŝ − S‖ ≤ C√(d log n / n) for some constant C, with probability at least 1 − O(n^{−d}).
Proof of Lemma 2. Recall that Ỹ_i = U^T_{r−1} Y_i, where U_{r−1} and Y_i come from two different partitions and are hence independent. Let Z_i ∈ [r] denote the latent variable associated with i. Thus E[Ỹ_i | Z_i = a, P_2] = U^T_{r−1} E[Y_i | Z_i = a] = U^T_{r−1} μ_a. Hence the means of the new mixture are μ̃_a := U^T_{r−1} μ_a, and the covariance matrix remains isotropic, i.e. E[(Ỹ_i − μ̃_a)(Ỹ_i − μ̃_a)^T | P_2, Z_i = a] = σ_a² I_{r−1}. Furthermore, using Lemma 8 we have min_{k≠ℓ} ‖μ̃_k − μ̃_ℓ‖ = min_{k≠ℓ} ‖U^T_{r−1}(μ_k − μ_ℓ)‖ ≥ d_min/2. Since this requires an application of Lemma 8 to each of the vectors μ_k − μ_ℓ, k, ℓ ∈ [r], the success probability is at least 1 − Õ(r² n^{−d}) by a union bound.
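The sample-splitting projection used in Lemma 2 can be sketched directly: the top r−1 eigenvectors U_{r−1} are estimated from partition P_1, and the held-out partition P_2 is projected onto them, so the projection directions are independent of the projected points. Function and variable names in this sketch are ours.

```python
import numpy as np

def split_and_project(Y, r, seed=0):
    """Split the n covariate vectors into two halves; estimate the top
    (r-1) eigenvectors of the sample covariance on the first half, then
    project the held-out half onto them."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    P1, P2 = idx[: len(Y) // 2], idx[len(Y) // 2:]
    Yc = Y[P1] - Y[P1].mean(0)
    S_hat = Yc.T @ Yc / len(P1)        # sample covariance from P1
    _, vecs = np.linalg.eigh(S_hat)    # eigenvalues in ascending order
    U = vecs[:, -(r - 1):]             # top r-1 eigenvectors
    return Y[P2] @ U, P2               # projected held-out covariates
```

Here the even 50/50 split is an illustrative choice; the proof uses an uneven split with n_1 = n/log n points for the covariance estimate.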

Table 2: Random variables used in the paper.

Table 1: Population quantities used in the paper.
Notation. For a matrix M ∈ R^{n×n}, we use ‖M‖_F and ‖M‖ to denote the Frobenius and operator norms of M, respectively. The ℓ_∞ norm of a vector x is defined as ‖x‖_∞ = max_i |x_i|.

Table 3: Useful notations and definitions.
depends on the gap between the within-cluster and across-cluster probabilities. SDPs are not limited to network clustering; several convex relaxations of k-means-type losses have been proposed in the literature (see

Table 4: NMI with ground truth for various methods.
and using the fact that 2r(p_max − p_min)/m_min bounds the minimum over k, which means node i is correctly clustered. Below we bound the cardinality of M by Markov's inequality.