Estimation of Symmetry-Constrained Gaussian Graphical Models: Application to Clustered Dense Networks

We propose a model selection algorithm for high-dimensional clustered data. Our algorithm combines a classical penalized likelihood method with a composite likelihood approach in the framework of colored graphical Gaussian models. Our method is designed to identify high-dimensional dense networks with a large number of edges but sparse edge classes. Its empirical performance is demonstrated through simulation studies and a network analysis of a gene expression dataset.


INTRODUCTION
The analysis of complex high-dimensional data is one of the main problems of modern statistics. To identify the dependencies or independencies among continuous variables, the main parameter of interest is the covariance matrix between these variables. Conditional independence between variables can be represented by means of a graph where each vertex represents a variable and where the absence of an edge (i, j ) implies the conditional independence of the variable X i and the variable X j given all the other variables. Gaussian models with conditional independences between selected pairs of variables represented by means of a graph are called graphical Gaussian models and have been one of the main tools of modern statistics for the analysis of high-dimensional data. Since, in the case of Gaussian data, conditional independence between variables translates into fixed zeros in the precision (inverse covariance) matrix, graphical Gaussian models allow for a substantial dimensionality reduction.
In real applications modeled with a graphical Gaussian model, there often exist additional symmetry constraints on the parameters. For example, genes belonging to the same functional or structural group may behave in a similar manner and thus share similar network properties (Toh and Horimoto 2002). Similarly, in the analysis of social networks data, a fundamental problem is to estimate the friendship patterns among individuals, where individuals belonging to the same geographical region or social group can be considered as nodes from the same cluster in the network (Ma, Gong, and Bohnert 2007). For such clustered networks, we may assume that the edges linking the same pair of clusters share similar properties and that the nodes belonging to the same cluster also have similar properties. These restrictions will result in restrictions on the model parameters and further dimensionality reduction.
To express these additional symmetry constraints, Højsgaard and Lauritzen (2008) introduced colored graphical Gaussian models. Two of the models they define are the so-called RCON and RCOR models, which are graphical Gaussian models Markov with respect to an undirected graph, with additional symmetry constraints on, respectively, the entries of the concentration matrix Θ = Σ^{-1} and the partial correlation matrix P = diag(Θ)^{-1/2} Θ diag(Θ)^{-1/2}. Likelihood estimation in both models can be obtained through Newton iteration or partial maximization (Højsgaard and Lauritzen 2007). However, at each iterative step, such an algorithm involves the inversion of the concentration matrix, which can be computationally costly for large matrices.
In this article, we work with colored graphical models but use a penalized likelihood approach which performs model selection and estimation simultaneously. In practice, either the clusters are known or we identify them through one of the many known clustering techniques (see Section 4). Once the clusters of variables are identified, the model selection is done within the class of colored graphical Gaussian models with given clusters. The penalized likelihood function will select edges within and between given clusters. We circumvent the problem of the inversion of large matrices for the computation of the maximum likelihood estimate by using a composite likelihood function rather than the likelihood function. Following Besag (1974) and Friedman, Hastie, and Tibshirani (2010), the factors of the composite likelihood are the likelihoods derived from the conditional distributions of the variable X_i given all the other variables. These conditional distributions are univariate Gaussian distributions, where the conditional mean and variance can be formulated as functions of the other variables and the original parameters. The idea of forming the fully conditional composite likelihood can be viewed as performing all the linear regressions of one variable on all the other variables in the graphical model. The composition of all these conditional likelihoods naturally yields a composite likelihood as a function of the entries of Θ and P, the parameters of the RCON and RCOR models, respectively.
In the literature, various penalty functions and algorithms have been proposed to estimate and select unconstrained Gaussian graphical models. Yuan and Lin (2007) proposed penalized likelihood methods for estimating the concentration matrix with the L_1 LASSO penalty. Banerjee, Ghaoui, and D'Aspremont (2007) proposed a block-wise updating algorithm for the estimation of the concentration matrix. Further along this line, Friedman, Hastie, and Tibshirani (2008) proposed the graphical LASSO algorithm through a coordinate-wise updating scheme. Fan, Feng, and Wu (2009) proposed to estimate the concentration matrix using the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) penalty to reduce the bias problem. Zhang (2010) proposed the minimax concave penalty (MCP). The MCP penalty has been investigated in the regression setting by Breheny and Huang (2011) and in the graphical model setting by Mazumder, Friedman, and Hastie (2011).
The remainder of this article is organized as follows. In Section 2, we recall some basic properties of colored graphical Gaussian models and composite likelihood. In Sections 3.1 and 3.2, we derive the penalized likelihood function for the RCON and RCOR models and present the coordinate descent algorithm with the LASSO, SCAD, and MCP penalties to perform the estimation. In Section 3.3, we investigate the asymptotic behavior of the penalized composite likelihood estimate and establish its ORACLE property. In Section 4, we discuss ways of determining the color classes if they are not known a priori. In Section 5, simulation studies are presented to demonstrate the empirical performance of the method. In Section 6, we apply our method to a clustered microarray dataset to model the network linking individual genes.

RCON AND RCOR GRAPHICAL GAUSSIAN MODELS
The reader is referred to Højsgaard and Lauritzen (2008) for a detailed description of colored graphical Gaussian models. We recall here their main features. Let G = (V, E) be an undirected graph, where V = {1, 2, ..., p} is the set of vertices and E the set of edges. If X = (X_1, ..., X_p)^t follows the N_p(0, Σ) distribution for some positive definite covariance matrix Σ, it is a well-known result that X_i is independent of X_j given all the other variables X_{V\{i,j}} if and only if θ_{ij}, the ijth entry of Θ = Σ^{-1}, is 0 (Lauritzen 1996). For a given graph G, we therefore consider the cone P_G of positive definite matrices with zero ijth entry whenever the edge (i, j) does not belong to G. The graphical Gaussian model Markov with respect to G is the model

N_G = { N_p(0, Σ) : Θ = Σ^{-1} ∈ P_G }.    (1)

We note that, here and in the remainder of the article, we take μ to be equal to 0 without any loss of generality; if μ ≠ 0, we simply center our data. Given two variables X_i and X_j, let θ^{(ij)} be the 2 × 2 matrix (θ_{lk}), l, k ∈ {i, j}. The conditional covariance matrix of (X_i, X_j)^t given X_{V\{i,j}} is

(θ^{(ij)})^{-1},    (2)

and similarly, the conditional variance σ_{ii} of X_i given all the other variables is

σ_{ii} = 1/θ_{ii}.    (3)

The (p − 1)-dimensional vector −θ_{jj}^{-1} Θ_{j, V\{j}} of regression coefficients β_{ij} of X_j on the other variables has entries

β_{ij} = −θ_{ij}/θ_{jj}.    (4)

It follows that, for model (1), if (i, j) ∉ E, then β_{ij} = 0. Later in the article, we will use the convenient notation B_j for the p-dimensional vector of regression coefficients of X_j on X_i, i = 1, ..., j − 1, j + 1, ..., p, augmented by the entry 0 in the jth spot. Also for model (1), the conditional correlation coefficient of X_j and X_k given X_{V\{j,k}}, that is, the partial correlation coefficient between X_j and X_k, is ρ_{jk} = −θ_{jk}/√(θ_{jj} θ_{kk}), and, therefore, the partial correlation matrix P = (ρ_{ij})_{1 ≤ i,j ≤ p} and the concentration matrix Θ have the same zeros. Now, let V = {V_1, ..., V_k} form a partition of V = {1, ..., p} and let E = {E_1, ..., E_l} form a partition of the edge set E. If all the vertices belonging to an element V_i of V have the same color, we say that V is a coloring of the vertices of G. Similarly, if all the edges belonging to an element E_i of E have the same color, we say that E is a coloring of the edges of G and that (V, E) is a colored graph.
Consider model (1). If, for Θ = Σ^{-1} ∈ P_G, we impose the further restrictions that 1. if m is a vertex class in V, then for all i ∈ m, the θ_{ii} are equal; 2. if s is an edge class in E, then for all (i, j) ∈ s, the entries θ_{ij} of the concentration matrix are equal, then model (1) becomes a colored graphical Gaussian model called the RCON(V, E) model. We write θ = (θ_{V_1}, ..., θ_{V_k}, θ_{E_1}, ..., θ_{E_l}) for the vector of free parameters in Θ. Let us now define an RCOR model. Given V and E as above, if a model of the type (1) satisfies the two additional properties that 1. if m ∈ V, then for all i ∈ m, the θ_{ii} are equal; 2. if s ∈ E, then for all (i, j) ∈ s, the entries ρ_{ij} of the partial correlation matrix are equal, then this model becomes the RCOR(V, E) model. We write (θ_{V_1}, ..., θ_{V_k}, ρ_{E_1}, ..., ρ_{E_l}) for the vector of free parameters in (diag(Θ), P). We note that an RCON model forms a natural exponential family, while an RCOR model forms a curved exponential family. In the models defined above, the colorings of the vertices and of the edges are not related to each other, but we could impose further constraints linking these two colorings. For example, we could further assume that different edges linking two vertices belonging to the same two vertex color classes have to belong to the same edge color class. Such constraints only affect the partitioning of the vertices and edges; they do not affect the estimation.
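To fix ideas, assembling a concentration matrix that obeys RCON-type symmetry constraints, and the associated matrix P, can be sketched in a few lines. This is an illustrative helper in Python; the function name and the toy class layout are our own, not the paper's.

```python
import numpy as np

def rcon_precision(p, vertex_classes, edge_classes, theta_v, theta_e):
    """Build a concentration matrix Theta under RCON-type symmetry:
    all vertices in one class share a diagonal value, all edges in one
    class share an off-diagonal value.  (Illustrative sketch.)"""
    theta = np.zeros((p, p))
    for cls, val in zip(vertex_classes, theta_v):
        for i in cls:
            theta[i, i] = val
    for cls, val in zip(edge_classes, theta_e):
        for (i, j) in cls:
            theta[i, j] = theta[j, i] = val
    return theta

# Toy example: 4 nodes, 2 vertex classes, 1 edge class.
theta = rcon_precision(
    4,
    vertex_classes=[[0, 1], [2, 3]],
    edge_classes=[[(0, 1), (2, 3)]],
    theta_v=[2.0, 1.5],
    theta_e=[0.3],
)
# P as in the text: diag(Theta)^(-1/2) @ Theta @ diag(Theta)^(-1/2).
d = 1.0 / np.sqrt(np.diag(theta))
P = d[:, None] * theta * d[None, :]
```

Edges (0, 1) and (2, 3) belong to the same edge class and thus share the value 0.3 in Θ, while their entries in P differ because the vertex classes have different precisions.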

COMPOSITE LIKELIHOOD
The estimation of Gaussian graphical models has mainly been based on the likelihood method. An alternative method of estimation based on composite likelihood (henceforth abbreviated CL) has drawn much attention in recent years (Cox and Reid 2004; Varin 2008).
It has been demonstrated to possess good theoretical properties, such as consistency of the parameter estimates, and can be used to establish hypothesis testing procedures. Let Y = (Y_1, ..., Y_p)^T be a random vector in R^p. Let {f(y; θ), y ∈ Y, θ ∈ T} be a parametric model, with Y ⊆ R^p, T ⊆ R^q, p ≥ 1, and q ≥ 1. Let {A_i, i = 1, ..., m} be a set of events with associated likelihood functions L_i(θ; y) = f(y ∈ A_i; θ). Then, according to Varin (2008), see also Lindsay (1988), a composite likelihood (CL) is the weighted product of the likelihoods corresponding to each event,

L_c(θ; y) = Π_{i=1}^{m} L_i(θ; y)^{w_i},

where w_i, i = 1, ..., m, are positive weights. As the composite score function ∂ log L_c(θ; y)/∂θ is a linear combination of several likelihood score functions, its expectation is equal to zero under the usual regularity conditions (Varin 2008). Even though the composite likelihood is not a true likelihood, the maximum composite likelihood estimate is still consistent for the true parameter. The asymptotic covariance matrix of the maximum composite likelihood estimator takes the form of the inverse of the Godambe information

G(θ) = H(θ) V(θ)^{-1} H(θ),

where H(θ) = E{−∂² log L_c(θ; Y)/∂θ ∂θ^T} and V(θ) = var{∂ log L_c(θ; Y)/∂θ} are the sensitivity matrix and the variability matrix, respectively. Readers are referred to Cox and Reid (2004) and Varin (2008) for a more detailed discussion of the asymptotic behavior of the maximum composite likelihood estimator. Gaussian graphical models with added colored class constraints on the parameters are not closed exponential families in the sense of Mardia et al. (2009), so the maximum composite likelihood estimates for the RCOR and RCON models suffer a slight loss of efficiency compared to the maximum likelihood estimates. We recall, however, that in terms of computational complexity, composite likelihood estimation does not require large matrix inversions, which makes the computations quite fast.
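For the Gaussian setting studied in this article, the full-conditional composite log-likelihood (one factor per conditional distribution X_j | X_{V\{j}}, with unit weights w_i = 1) can be sketched as follows. The function name and interface are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_cl(theta, X):
    """Full-conditional composite log-likelihood for a zero-mean Gaussian
    with concentration matrix `theta` (unit weights).  Each factor is the
    conditional density X_j | rest ~ N(-sum_{i != j} (theta_ij/theta_jj) x_i,
    1/theta_jj)."""
    n, p = X.shape
    ll = 0.0
    for j in range(p):
        tjj = theta[j, j]
        # Regression coefficients of X_j on the other variables.
        beta = -theta[:, j] / tjj
        beta[j] = 0.0
        resid = X[:, j] - X @ beta
        ll += (0.5 * n * np.log(tjj)
               - 0.5 * tjj * np.sum(resid ** 2)
               - 0.5 * n * np.log(2 * np.pi))
    return ll
```

With Θ = I, each factor reduces to a standard normal density, which gives a quick sanity check of the implementation.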

THE RCON MODEL
Let X = (X_1, ..., X_p) be a random vector with distribution following the RCON(V, E) model. Following (2), (3), and (4), the conditional distribution of X_j given X_{V\{j}} can be written as

X_j | X_{V\{j}} ~ N( Σ_{i≠j} β_{ij} X_i , 1/θ_{jj} ).

Let x^{(1)}, ..., x^{(n)} be a sample from this RCON model with x^{(i)} = (x_{i1}, x_{i2}, ..., x_{ip})^T and let X be the n × p data matrix. The composite likelihood function obtained from the conditional distributions of X_j given X_{V\{j}}, j = 1, ..., p, is

L_c(Θ) = Π_{j=1}^{p} Π_{i=1}^{n} f(x_{ij}; Θ, x^{(i)}_{V\{j}}).    (7)

Up to a constant, the composite log-likelihood can also be written in matrix form as

ℓ_c(Θ) = Σ_{j=1}^{p} { (n/2) log θ_{jj} − (θ_{jj}/2) ||X^{(j)} − X B_j||² },

where X^{(j)} = (x_{1j}, x_{2j}, ..., x_{nj})^T is the jth column of the n × p data matrix and B_j is a p-vector with elements β_{ij} except for a zero at the jth position (see (4)). From (7) above, we see that each f(x_{ij}; Θ, x^{(i)}_{V\{j}}), i = 1, ..., n, depends only on the vector θ_j of free parameters in the jth row of Θ. Since ℓ_c(Θ) is a function of just the vector θ of free parameters, we will simply write ℓ_c(Θ) = ℓ_c(θ); the negative composite log-likelihood −ℓ_c(θ) is asymptotically convex around the true value θ_0. We propose to estimate the sparse RCON model by solving the minimization problem

min_θ Q(θ) = −ℓ_c(θ) + n Σ_{s=1}^{l} p_λ(|θ_{E_s}|),    (9)

where p_λ(·) is a penalty function, λ is the penalty parameter, and the penalty is on the off-diagonal parameters θ_{E_s}, s = 1, ..., l (there is no sparsity penalty on the reciprocal conditional variances θ_{ii}). The contribution of the composite likelihood is of the order of n, so the penalty term has to be of the order of n to compete. Such a penalization scheme encourages sparsity of the edge classes but not sparsity of the total number of edges. Several penalty functions are available. The LASSO penalty (Tibshirani 1996), p_λ(|θ|) = λ|θ|, increases linearly with the size of its argument. It is convex, the corresponding numerical algorithm is stable, and it is a classical tool widely used in many applications. However, the LASSO estimates may suffer from bias for large parameters.
Furthermore, the LASSO estimator may not be selection consistent unless a strong irrepresentable condition is satisfied. To avoid such problems, the smoothly clipped absolute deviation (SCAD, Fan and Li 2001) and the minimax concave penalty (MCP, Zhang 2010) have been proposed.
• The SCAD penalty (Fan and Li 2001) coincides with the LASSO penalty near the origin and then levels off; its first derivative, for θ > 0, is p'_λ(θ) = λ{ I(θ ≤ λ) + (aλ − θ)_+ / ((a − 1)λ) I(θ > λ) } for some a > 2.
• The MCP penalty gradually relaxes the penalization rate until, when θ > γλ, the rate of penalization decreases to zero. The penalty function is symmetric about 0 and, for any real θ > 0, takes the form p_λ(θ) = λθ − θ²/(2γ) if θ ≤ γλ, and p_λ(θ) = γλ²/2 if θ > γλ, for λ ≥ 0 and γ > 1.
In the literature, it has been shown that both the SCAD and MCP regression methods have the so-called oracle property, implying that the corresponding penalized estimator is consistent and as efficient as the maximum likelihood estimate obtained under the true subset model. To numerically minimize Q(θ), we can employ the coordinate descent algorithm, which proceeds by updating each parameter of the objective function one at a time (Tseng 2001; Friedman et al. 2007). The use of the LASSO, SCAD, and MCP penalties with the coordinate descent algorithm, and the convergence properties of the corresponding estimators, have been extensively investigated by Breheny and Huang (2011) and Mazumder, Friedman, and Hastie (2011).
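For reference, the SCAD derivative and the MCP penalty discussed above can be written down directly. The defaults a = 3.7 and γ = 3 are conventional choices from the cited literature, used here purely for illustration.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative of the SCAD penalty (Fan and Li 2001) at |t|:
    lam near the origin, linearly decaying on (lam, a*lam], zero beyond."""
    t = np.abs(t)
    return lam * ((t <= lam)
                  + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))

def mcp(t, lam, gamma=3.0):
    """MCP penalty (Zhang 2010): quadratic until |t| = gamma*lam,
    constant (gamma*lam^2/2) afterward."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)
```

Both functions vectorize over numpy arrays, so they can be applied to a whole vector of edge class parameters at once.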
• We first consider the LASSO penalty. Differentiating Q(θ) with respect to the edge class parameter θ_{E_s} (see the online supplementary file for the derivation) provides a soft-thresholding update (12) for θ_{E_s}, where S(z, λ) = sign(z)(|z| − λ)_+ is the soft-thresholding operator, C = (1/n) X^T X denotes the sample covariance matrix, and E_s^c = {(k, j) : k ≠ j and (k, j) ∉ E_s}. Given the edge color class E_s, define the edge adjacency matrix T^{E_s} with T^{E_s}_{kj} = 1 if (k, j) ∈ E_s and T^{E_s}_{kj} = 0 otherwise. The update for θ_{E_s} can then be written compactly in terms of the componentwise product ∘, the p × p matrix B with columns B_j, j = 1, ..., p, and the p × p diagonal matrix with entries θ_{jj}^{-1}.
• If the SCAD penalty is used, we use the method of one-step local linear approximation (Zou and Li 2008): the size of the penalty is set to the first derivative of the SCAD penalty function evaluated at an initial consistent estimate θ*_{E_s}. We therefore modify the LASSO updating equation (12) into the SCAD updating equation (13) by replacing λ with p'_λ(|θ*_{E_s}|), where p'_λ denotes the first derivative of the SCAD penalty function.
• If the MCP penalty is used, we can apply the univariate thresholding rule proposed in Breheny and Huang (2011) and Mazumder, Friedman, and Hastie (2011) to obtain the corresponding updating equation (14). Now, let us differentiate Q(θ) with respect to θ_{V_m}, the common value of the θ_{jj} for the vertex class V_m. Using (9) again, we obtain a likelihood equation (15) that is quadratic in θ_{V_m}, with coefficients involving |V_m|, the cardinality of V_m, and

q_j = Σ_{k≠j} Σ_{l≠j} θ_{kj} θ_{lj} C_{kl} = θ_{jj}² B_j^T X^T X B_j / n.

Since, clearly from its expression above, q_j is positive, the quadratic equation (15) has one unique positive root, which gives the update for θ_{V_m}. Alternating the updating scheme through all the θ_{E_s} and θ_{V_m} until convergence, we obtain the penalized sparse estimate of the concentration matrix under the RCON model.
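The soft-thresholding operator S(z, λ) and the one-parameter-at-a-time updates can be illustrated on an ordinary LASSO regression. This toy solver mirrors the alternating coordinate descent scheme; it is not the paper's exact θ_{E_s} update, which additionally pools entries within a color class.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z)(|z| - lam)_+ , the LASSO soft-thresholding rule."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Toy coordinate-descent LASSO, assuming the columns of X are
    standardized so that X[:, j] @ X[:, j] / n = 1.  Illustrates the
    alternating one-at-a-time updates (a sketch, not the paper's update)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j removed.
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r / n, lam)
    return beta
```

With orthogonal standardized columns the solver converges in a single sweep, which makes its behavior easy to verify by hand.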

THE RCOR MODEL
Consider an RCOR(V, E) model with vertex coloring V and edge coloring E. Recall that P = (ρ_{ij})_{1 ≤ i,j ≤ p} denotes the partial correlation matrix. Given an edge color class E_s, for all edges (i, j) ∈ E_s, the ρ_{ij} are equal and denoted ρ_{E_s}. Let ρ = (ρ_{E_s}, E_s ∈ E). Given a vertex color class V_m, for all vertices i ∈ V_m, the θ_{ii} are equal and denoted θ_{V_m}. Let D denote the diagonal matrix with entries θ_{jj}, j = 1, ..., p. We propose to estimate the sparse RCOR model by solving the minimization problem

min_{ρ,D} Q(ρ, D) = −ℓ_c(ρ, D) + n Σ_{s=1}^{l} p_λ(|ρ_{E_s}|).    (16)

First, we consider the LASSO penalty. Differentiating (16) with respect to ρ_{E_s}, we obtain a soft-thresholded estimate of the partial correlation, expressed in terms of P̄, the matrix P with the 1s on the diagonal replaced by 0s. For the SCAD and MCP penalties, updating equations similar to (13) and (14) can be obtained in a parallel manner.
Differentiating (16) with respect to θ_{V_m}, we obtain the corresponding updating equation for the vertex class parameters.

ASYMPTOTIC PROPERTIES
Although penalized estimators based on the SCAD or MCP penalty have been shown to possess the ORACLE property in the regression setting, this property had not been established for composite likelihood on colored graphical Gaussian models. We do so in Theorems 1 and 2. The two theorems are stated for the SCAD penalty; the asymptotic behavior of θ̂_{E_s} in (13) and (14) can be studied in the same way. Let z denote the index set of the nonzero components of the true parameter θ_0 and z^c its complement.

Theorem 1. Given the SCAD penalty function p_λ(θ), for a sequence λ_n such that λ_n → 0 and √n λ_n → ∞ as n → ∞, there exists a local minimizer θ̂ of Q(θ) with ||θ̂ − θ_0||_2 = O_p(n^{−1/2}). Furthermore, we have lim_{n→∞} P(θ̂_{z^c} = 0) = 1.
The proofs of Theorems 1 and 2 are given in the supplementary file. Next, we establish the asymptotic distribution of the estimator θ̂_z, where θ_z denotes the subvector of nonzero parameters in θ. Let θ_0 be the true value of the parameter. Define the matrix Σ_1 = diag{p''_{λ_n}(|θ_{j0}|); j ∈ z} and the vector b_1 = (p'_{λ_n}(|θ_{j0}|) sign(θ_{j0}); j ∈ z). Let H_zz and V_zz denote the submatrices of H(θ_0) and V(θ_0), respectively, corresponding to z.
Theorem 2. Given the SCAD penalty function p_λ(θ), for a sequence λ_n such that λ_n → 0 and √n λ_n → ∞ as n → ∞, the subvector θ̂_z of the root-n consistent estimator θ̂ has the following asymptotic distribution:

√n (H_zz + Σ_1) { θ̂_z − θ_{z0} + (H_zz + Σ_1)^{-1} b_1 } → N(0, V_zz) in distribution.

Under the assumptions of Theorem 2, the subvector θ̂_z of the root-n consistent estimator therefore has asymptotic covariance H_zz^{-1} V_zz H_zz^{-1}. The formulas for ∂²ℓ_c/∂θ ∂θ^T can be found in the proof of Lemma 1 in the online supplementary file. The corresponding estimate of the Hessian matrix is Ĥ_zz = −∂²ℓ_c/∂θ_z ∂θ_z^T evaluated at θ̂. To estimate V_zz, we use the sample covariance matrix of the composite score vectors, V̂_zz = (1/n) Σ_{i=1}^{n} S_{iz} S_{iz}^T evaluated at θ̂, where S_i = ∂ℓ_c(Y_i)/∂θ denotes the score vector for the ith observation and S_{iz} the subvector of S_i corresponding to θ_z. Combining Ĥ_zz and V̂_zz, we can compute standard error estimates for the penalized composite likelihood estimates. In practice, the bootstrap method can also be used to provide standard error estimates.
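Given per-observation composite score vectors and an estimate of the sensitivity matrix H_zz, the sandwich covariance H_zz^{-1} V_zz H_zz^{-1} can be computed as below. This is an illustrative sketch; the input names and shapes are our own assumptions.

```python
import numpy as np

def godambe_sandwich(scores, hessian):
    """Sandwich covariance H^{-1} V H^{-1} of a composite likelihood
    estimator.  `scores` is the n x q matrix of per-observation composite
    score vectors (mean zero at the estimate); `hessian` is the q x q
    sensitivity matrix H.  V is estimated by the sample second moment of
    the scores."""
    n = scores.shape[0]
    V = scores.T @ scores / n
    H_inv = np.linalg.inv(hessian)
    return H_inv @ V @ H_inv
```

Square roots of the diagonal of the returned matrix (suitably scaled by the sample size) give the standard error estimates described in the text.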

REMARKS ON THE DETERMINATION OF COLOR CLASSES
So far, we have assumed that the color classes are known. If they are not known, there exist strategies to determine them; we list some of them here. First, the fused LASSO penalty (Tibshirani 2005) can be used to collapse estimates of θ_{ij} that are close to each other. For illustration purposes, let us focus on the determination of the color classes for edges; the approach can be readily extended to obtain color classes for vertices. We first obtain an initial LASSO estimate of all the edges and rank them from smallest to largest. We then consider the following objective function with penalties on the distances between parameters that are adjacent in the order previously obtained,

Q(θ) = −ℓ_c(θ) + n Σ_{(i,j)} p_{λ_1}(|θ_{ij}|) + n Σ_{(i,j)} { p_{λ_2}(|θ_{ij} − a(θ_{ij})|) + p_{λ_2}(|θ_{ij} − b(θ_{ij})|) },

where p_{λ_1} and p_{λ_2} are two penalty functions (L_1 or SCAD), and a(θ_{ij}) and b(θ_{ij}) are the two edge parameters ordered immediately to the left and right of θ_{ij}. The resulting estimates from this graphical fused LASSO are clumped together: as λ_2 increases, more and more edges are fused and thus form color classes. By tuning the size of λ_2, we can vary the number of color classes. To solve this minimization problem, we can adapt the fused LASSO signal approximator (FLSA) algorithm, originally proposed for regression problems (Friedman et al. 2007), to the graphical model setting. The algorithm consists of iteratively applying the three following steps, with λ_1 fixed and λ_2 incremented from zero by a small amount δ at each cycle: • Coordinate descent step: compute ∂Q(θ)/∂θ_{ij}, using the subgradient at points where the objective function is not differentiable. By the Karush-Kuhn-Tucker (KKT) conditions, we solve for θ_{ij} while the other parameters are fixed at their current values.
• Fusion step: examine each θ_{ij} and its neighbors a(θ_{ij}) and b(θ_{ij}), tentatively fuse them into one parameter, and keep the fusion if doing so decreases the objective function.
• Smooth step: add a small increment δ to λ_2, which thus becomes λ_2 + δ, and repeat the coordinate descent and fusion steps.
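The qualitative effect of increasing λ_2 — neighboring estimates collapsing into shared values — can be mimicked by a greedy merge over the sorted initial estimates. This is a much cruder stand-in for the FLSA solver, offered for intuition only; the tolerance plays the role of the fusion strength.

```python
import numpy as np

def fuse_adjacent(theta_sorted, tol):
    """Greedy illustration of the fusion step: scan sorted edge estimates
    and merge each value into the current group if it lies within `tol`
    of the group mean, mimicking how a growing lambda_2 collapses
    estimates into color classes.  (Sketch, not the FLSA algorithm.)"""
    groups = [[theta_sorted[0]]]
    for t in theta_sorted[1:]:
        if abs(t - np.mean(groups[-1])) < tol:
            groups[-1].append(t)       # fuse into the current class
        else:
            groups.append([t])         # start a new class
    return [float(np.mean(g)) for g in groups]
```

Raising `tol` produces fewer, larger groups, paralleling the way larger λ_2 values yield fewer edge color classes.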
As an alternative strategy for the determination of color classes, we can perform spectral clustering (Ng, Jordan, and Weiss 2001; Qin and Rohe 2013) on the nodes. The algorithm consists of the following steps: • Obtain the standard LASSO estimate A of Θ, and form the diagonal matrix D with D_ii = Σ_l A_il + τ, where τ is a positive number chosen to make sure that the appropriate matrices are positive definite. Construct the normalized symmetric graph Laplacian L = D^{-1/2} A D^{-1/2}.
• Inspect the eigenvalues of L, take the k largest, and form the N × k matrix X whose columns are the corresponding eigenvectors of L. We note that one may use the Laplacian defined as I − L (Shi and Malik 2000) rather than L as we did here; the two versions of the Laplacian matrix have the same eigenvectors, but the eigenvalues change from λ_i to 1 − λ_i, so if I − L is used, one should choose the k smallest eigenvalues rather than the k largest. Form the matrix X̃ by normalizing each row of X to unit length, X̃_ij = X_ij / (Σ_{j'} X_{ij'}²)^{1/2}. • Run k-means on the rows of X̃. The clustering of the rows is equivalent to the clustering of the nodes.
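The steps above can be sketched end to end as follows. The tiny farthest-point-initialized k-means loop is a minimal stand-in for a library routine, and τ = 0.1 is an arbitrary illustrative default.

```python
import numpy as np

def spectral_clusters(A, k, tau=0.1, n_iter=50):
    """Spectral clustering of nodes from a nonnegative symmetric matrix A,
    following the steps in the text: regularized normalized Laplacian,
    top-k eigenvectors, row normalization, then k-means."""
    d = 1.0 / np.sqrt(A.sum(axis=1) + tau)
    L = d[:, None] * A * d[None, :]                # D^{-1/2} A D^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    X = vecs[:, np.argsort(vals)[::-1][:k]]        # k leading eigenvectors
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # row-normalize
    centers = [X[0]]                               # farthest-point init
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):                        # plain k-means iterations
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels
```

On a block-diagonal association matrix, the recovered labels coincide with the blocks, which is the behavior expected for well-separated clusters.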
The two methods discussed above can both provide the color classes for our proposed colored graph algorithm.

SIMULATED EXAMPLES WITHOUT PENALIZATION
We examine the performance of the unpenalized composite likelihood estimator on large matrices. First, we consider the RCON model. We simulate data under different scenarios with n varying from 250 to 1000 and p taking the values 40, 60, and 100. We include 30 different edge classes and 20 different vertex classes. We simulate multivariate Gaussian random vectors with entries of the sparse precision matrix given by θ_E = (0_25, 0.2591, 0.1628, −0.1934, 0.0980, 0.0518) and θ_V = (1.3180, 1.8676, 1.788004, 1.7626, 1.6550, 1.1538, 1.3975, 1.7877, 1.7090, 1.6931, 1.46313, 1.5131, 1.7084, 1.7344, 1.1441, 1.8059, 1.7446, 1.8522, 1.3146, 1.1001), where 0_p denotes a zero vector of length p. The values for the nonzero θ_{E_s} are uniformly sampled from −0.3 to 0.3, and the values for the θ_{V_m} are uniformly sampled from 1 to 2; these θ_E and θ_V are then used for all the 100 simulated datasets. We consider two different scenarios for the assignment of edges to the edge classes. In the balanced network scenario, each edge is randomly assigned to one of the s = 30 edge classes with equal probability, so the number of edges in each edge class varies from dataset to dataset but is on average p(p − 1)/(2s). In a similar manner, for each simulated dataset, each node is randomly assigned to one of the m = 20 vertex (or node) classes with equal probability, so the number of nodes in each node class varies from simulation to simulation but is on average p/m. For comparison purposes, we also simulate an unbalanced network and investigate the performance of our method: the first three edge classes have on average only p(p − 1)/(4s) edges each, the fourth class has on average 13p(p − 1)/(4s) edges, and all the other classes have on average p(p − 1)/s edges. The results are given in Table 1, where we compare both the absolute errors and the relative errors of the composite likelihood estimates with those of the naive estimates over the 100 simulated datasets.
The naive estimator estimates the edge class and vertex class parameters by simply averaging all the entries belonging to the same class in the inverse of the sample covariance matrix. Here "ae" stands for absolute error and "mre" for mean relative error. For example, the "ae" of θ̂_{E_s} is defined as ||θ̂_{E_s} − θ_{E_s,0}||_2, and the "mre" of θ̂_{E_s} is defined as the mean of |(θ̂_{E_s} − θ_{E_s,0})/θ_{E_s,0}|. The "mre" is calculated only for the nonzero subset of parameters. The first half of Table 1, above the line, is for the balanced network; the second half, below the line, is for the unbalanced network.
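The naive benchmark estimator is straightforward to state in code. This is an illustrative sketch; `K` would be the inverted sample covariance matrix, and vertex classes are handled the same way with diagonal entries.

```python
import numpy as np

def naive_class_estimates(K, edge_classes):
    """'Naive' estimator from the simulation study: for each color class,
    average the corresponding entries of the inverted sample covariance
    matrix K.  `edge_classes` is a list of lists of (i, j) index pairs."""
    return [float(np.mean([K[i, j] for (i, j) in cls]))
            for cls in edge_classes]
```

Unlike the composite likelihood estimator, this averaging ignores the likelihood entirely, which explains its larger errors in Tables 1 and 2.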
The proposed composite likelihood estimates consistently enjoy much smaller errors than the naive method across all settings. As also shown in Table 1, the absolute errors of the edge class parameters are slightly larger for the unbalanced network than for the balanced one. With regard to the mean relative errors for the nonzero edge class parameters, the two networks perform comparably.
Next, we investigate the empirical performance of the proposed composite likelihood estimator under the RCOR model. We simulate under different scenarios with n varying from 250 to 1000 and p taking the values 40, 60, and 100. We include 30 different edge classes and 20 different vertex classes. We simulate multivariate Gaussian random vectors with entries of the sparse partial correlation matrix given by ρ_E = (0_26, 0.1628, −0.1534, 0.0980, 0.0518) and with θ_V = (3.0740, 3.6966, 3.7772, 3.5475, 3.2841, 3.4699, 3.7235, 3.5987, 3.3313, 3.8183, 3.9236, 3.9008, 3.9011, 3.0470, 3.0139, 3.2072, 3.8438, 3.4823, 3.9373, 3.0125). The values for the nonzero ρ_{E_s} are uniformly sampled from −0.3 to 0.3, and the values for the θ_{V_m} are uniformly sampled between 3 and 4. We choose the range of 3 to 4 for the θ_{V_m} because we find that for large p, such as p = 100, the θ_{V_m} have to be large for the resulting matrix to be positive definite. This set of parameters is used for all the 100 simulated datasets. The assignment of edges to the edge color classes is done randomly with equal probability, and we conduct a similar random assignment of vertices to the vertex color classes. The simulation results are given in Table 2, where we provide both the absolute errors and the relative errors of the composite likelihood estimates and the naive estimates over the 100 simulated datasets. For both the estimated partial correlations and the conditional variances, the composite likelihood estimates yield consistently smaller errors than the naive estimates. This superior performance is consistent across all the sample sizes and all the dimensions of the matrices.

SIMULATED EXAMPLES WITH PENALIZATION
In our next set of calculations, we introduce penalization and examine the empirical performance of the penalized composite likelihood estimator. We simulate the RCON model using the same settings as in Table 1. We consider different scenarios with n = 250 or n = 500, and p = 40, 60, or 100. We use the penalized composite likelihood estimator to estimate the sparse matrix. The tuning parameter is selected by the Bayesian information criterion (BIC), BIC = −2ℓ(θ̂) + df log n, or by the composite likelihood Bayesian information criterion (COMP-BIC), COMP-BIC = −2ℓ_c(θ̂) + df log n, where df denotes the total number of nonzero edges and θ̂ denotes the penalized composite likelihood estimate (Gao and Song 2010). Although traditionally the degrees of freedom would be the number of parameters, we use the number of nonzero edges instead. The reason is that, for a large network with large p, the likelihood term grows with both n and p; for the penalty term to compete with the likelihood term, we multiply log n by the number of nonzero edges, which grows with p as well. As shown in the simulation results below, the model selection performance with the number of nonzero edges used as the degrees of freedom is very good: it yields small false positive rates and small false negative rates.
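The COMP-BIC criterion and the tuning-parameter selection it drives are simple to compute. In this sketch, the `fits` mapping from candidate λ values to fitted quantities is an interface of our own choosing, not the authors' code.

```python
import numpy as np

def comp_bic(cl_value, n_nonzero_edges, n):
    """COMP-BIC = -2 * composite log-likelihood + df * log(n), with the
    degrees of freedom taken to be the number of nonzero edges, as in
    the text."""
    return -2.0 * cl_value + n_nonzero_edges * np.log(n)

def select_lambda(fits, n):
    """Pick the tuning parameter minimizing COMP-BIC over a fitted path.
    `fits` maps lambda -> (composite log-likelihood, nonzero edge count);
    this interface is an illustrative assumption."""
    return min(fits, key=lambda lam: comp_bic(fits[lam][0], fits[lam][1], n))
```

A heavily penalized fit with a slightly lower composite log-likelihood but far fewer nonzero edges wins the criterion, which is the intended trade-off.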
For each setting, 100 simulated datasets are generated, and for each dataset we calculate the numbers of false negatives and false positives. The results are given in Table 3: the proposed method has satisfactory model selection properties, with very few false negative and false positive edges. For example, with n = 500 and p = 60, each simulated dataset has on average 1475 zero edges and 295 nonzero edges; the proposed method identifies an average of zero false negatives and 0.58 false positives. The sizes of the tuning parameters are also listed in Table 3. To compare the efficiency of our method using colored graphical Gaussian models with that of more classical methods, we also performed a model search using the unconstrained LASSO and the unconstrained SCAD. We used the GLASSO package (Friedman et al. 2008) to perform the unconstrained LASSO. For the unconstrained SCAD, we followed the standard procedure: we first find the LASSO estimate, then linearize the SCAD penalty function, evaluate it at the LASSO estimate, and use it as the penalty term in the GLASSO package. With n = 500 and p = 60, the LASSO has an average of 221.85 false negatives and 6.47 false positives, and SCAD has an average of 212.86 false negatives and 9.59 false positives. Another interesting phenomenon is that, for a fixed sample size, the symmetry-constrained approach performs better as p increases, whereas the performance of the LASSO and SCAD deteriorates as p increases. This is because increasing p increases the number of parameters for the LASSO and SCAD, but does not affect the number of parameters in the symmetry-constrained approach. On the contrary, with the same sample size and the same number of edge and vertex classes, increasing p actually provides more information about the edge class and vertex class parameters.
We further examine the empirical performance of the penalized composite likelihood estimator for model selection under an RCOR model. We consider different scenarios with n = 250 or n = 500, and p = 40, 60, or 100. From Table 4, one can see that the proposed method has satisfactory model selection properties, with very few false negative and false positive results. With n = 500 and p = 60, our approach has an average of 0 false negatives and 10.38 false positives. In comparison, the LASSO has an average of 212.32 false negatives and 32.72 false positives, and SCAD has an average of 212.10 false negatives and 33.31 false positives. These results exemplify that, if the data are generated from a clustered network, the symmetry-constrained approach, whether with an RCON or an RCOR model, fully uses the clustering structure in model selection and outperforms the unconstrained approach.

APPLICATION
We now apply our proposed method to a real biological dataset. The experiment was conducted to examine how GM-CSF modulates global changes in neutrophil gene expression (Kobayashi et al. 2005). Polymorphonuclear neutrophils (PMNs) were isolated from the venous blood of healthy individuals. Human PMNs (10^7) were cultured with and without 100 ng/mL GM-CSF for up to 24 h. The experiment was performed in triplicate, using PMNs from three healthy individuals for each treatment. In total 12,625 genes were monitored: each gene was measured nine times at time 0 and then six times at each of times 3, 6, 12, 18, and 24 h. At each of these five time points, three of the six measurements were obtained from the treatment group and the other three from the control group. We first proceed with a standard gene expression analysis. For each gene, we perform an ANOVA test on the treatment effect while accounting for the time effect. We rank the genes by their F statistics and select the top 1000 genes showing the most significant changes in expression between the treatment and control groups. Our goal is to study the network among these 1000 genes. We cluster these genes using the spectral clustering method of Qin and Rohe (2013). The number of clusters is chosen by inspecting the largest gap between the eigenvalues of the Laplacian matrix of the data matrix. There are five leading large eigenvalues followed by 995 small eigenvalues, so the number of clusters is chosen to be five. The genes clustered together can be viewed as a group of genes that share similar expression profiles. This imposes symmetry constraints on the network model: we assume that edges connecting the same pair of clusters, or edges linking genes within the same cluster, belong to the same edge class, and that vertices belonging to the same cluster belong to the same vertex class. Therefore, there are in total 15 edge-class and 5 vertex-class parameters to be estimated based on a 1000 × 1000 data matrix.
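The eigengap rule for choosing the number of clusters can be sketched as follows. This is a generic version based on the symmetric normalized Laplacian of a similarity matrix; the paper's implementation follows Qin and Rohe (2013) and may differ in detail, and the block-diagonal toy similarity matrix below is purely illustrative.

```python
import numpy as np

def choose_k_by_eigengap(similarity, k_max=10):
    """Return the number of clusters suggested by the largest gap
    among the leading eigenvalues of the symmetric normalized Laplacian."""
    d = similarity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    lap = np.eye(len(d)) - d_inv_sqrt[:, None] * similarity * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(lap))   # ascending order
    gaps = np.diff(eigvals[:k_max + 1])          # gap after the 1st, 2nd, ... eigenvalue
    return int(np.argmax(gaps)) + 1

# Toy example: three disconnected blocks of four nodes each.
blocks = np.kron(np.eye(3), np.ones((4, 4)))
k = choose_k_by_eigengap(blocks, k_max=5)        # k == 3
```

For a similarity matrix with c well-separated clusters, the Laplacian has c eigenvalues near zero followed by a jump, so the largest gap recovers c.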
We perform penalized symmetry-constrained SCAD estimation, with the tuning parameter selected using the BIC and the Comp-BIC criteria. Both criteria are minimized at the same submodel. The selected model has five nonzero edge classes, which comprise 335,196 nonzero edges. Figure 1 depicts the gene network among individual genes; some of the underlying clustering structure is evident from the plot. The model is sparse at the class level, containing only a few nonzero edge classes, yet the overall network is very dense, with more than 335,000 nonzero edges. Due to space limitations, we only show the estimated graph of the first 200 genes. For comparison, we also perform network estimation using the plain unconstrained SCAD; the selected submodel has 4446 edges, a much sparser network than the one obtained by the symmetry-constrained approach. Figure 2(a) shows the sparse network obtained by SCAD, again for the first 200 genes only. In Figure 2(a) there is a small group of genes with at least one edge; in Figure 2(b), we zoom into this group and depict the network among all the genes that have at least one edge. The validity of the symmetry-constrained approach depends on the underlying clustering structure. If such a clustering structure is supported by prior biological knowledge, the symmetry-constrained approach is the method of choice. For many biological systems, the sparsity assumption could be too constraining, and our proposed approach offers an alternative tool for modeling a potentially dense network.
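The tuning-parameter search by BIC can be sketched in its unconstrained form as follows. This is only an illustration of the selection mechanism: scikit-learn's `GraphicalLasso` stands in for the symmetry-constrained SCAD fit, the degrees-of-freedom count (nonzero upper-triangular entries plus the diagonal) is one common convention, and the Comp-BIC criterion in the text would replace the Gaussian log-likelihood with the composite likelihood.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def select_alpha_by_bic(X, alphas):
    """Fit a graphical lasso over a grid of penalties and return the value
    minimizing BIC = -2 * loglik + log(n) * (number of free parameters)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False, bias=True)
    best_alpha, best_bic = None, np.inf
    for alpha in alphas:
        K = GraphicalLasso(alpha=alpha).fit(X).precision_
        _, logdet = np.linalg.slogdet(K)
        loglik = 0.5 * n * (logdet - np.trace(S @ K))
        df = np.count_nonzero(np.triu(K, 1)) + p   # free edges + diagonal
        bic = -2.0 * loglik + np.log(n) * df
        if bic < best_bic:
            best_alpha, best_bic = alpha, bic
    return best_alpha

# Illustrative run on independent Gaussian data (no true edges):
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
best = select_alpha_by_bic(X, [0.1, 0.5])
```

The log(n) penalty on the parameter count is what drives BIC toward the sparser of two models with comparable fit.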

CONCLUSION
For symmetry-constrained RCON or RCOR graphical Gaussian models, the penalized composite likelihood based on conditional distributions offers a computationally convenient way to perform estimation and model selection while maintaining the efficiency of the estimator. When the graphical Gaussian model is parameterized in terms of edge and vertex classes, we have shown that the proposed penalized composite likelihood estimator thresholds the estimates of zero parameters to 0 with probability tending to 1, and that the estimates of the nonzero parameters are asymptotically multivariate normal, with the distribution corresponding to estimation under the true submodel. In the literature, high-dimensional network modeling has been mainly restricted to sparse models. When the actual network is dense, symmetry constraints can be imposed on the model to reflect the underlying symmetric structure and reduce the dimensionality of the model. This is a very useful dimension-reduction strategy for modeling high-dimensional dense networks with clusters.