Double shrinkage estimators for large sparse covariance matrices

Covariance matrices play an important role in many multivariate techniques, and hence good covariance estimation is crucial in this kind of analysis. In many applications a sparse covariance matrix is expected, due to the nature of the data or for simplicity of interpretation. Hard thresholding, soft thresholding, and generalized thresholding were developed to this end. However, these estimators do not always yield well-conditioned covariance estimates. To obtain estimates that are both sparse and well conditioned, we propose double shrinkage estimators: small covariances are shrunk towards zero, and the resulting covariance matrix is then shrunk towards a diagonal matrix. Additionally, a richness index is defined to evaluate how rich a covariance matrix is. According to our simulations, the richness index serves as a good indicator for choosing a relevant covariance estimator.


Introduction
Estimating variance-covariance matrices is at the core of many statistical methods, including principal component analysis, graphical models, and clustering. These methods are frequently applied to microarray data, social networks, functional images, data mining, climate studies, and many other multivariate problems of very high dimension. Recently, sparse covariance estimation has become popular because the sparsity assumption not only eases interpretation but also improves the quality of the covariance estimate. The sparsity assumption is also essential in some applications. For example, in genetics, correlation measures the linkage between two genetic loci, and zero correlation can be the consequence of long-run evolution without interference; the underlying correlation matrix should therefore be very sparse, with its non-zero off-diagonal elements representing interferences. Another example favouring a sparse covariance matrix is association studies, which treat variables as nodes and non-zero covariances as edges linking these nodes. Social networks and metabolic pathways are examples: it is believed that an individual participates in only a few social groups and that a protein directly interacts with only a few other proteins. In other words, these nodes are linked sparsely.
Other than the sample covariance (SC), there are many other approaches to covariance modelling. Chiu et al. [1] applied a matrix-logarithmic covariance model; Boik [2] used a spectral decomposition method; and Wong et al. [3] proposed a Bayesian model. To ensure positive definiteness, Pourahmadi [4] formulated the covariance matrix in terms of the Cholesky decomposition. Another branch of covariance estimation focuses on squeezing the sample eigenvalues together, because the eigenvalues of the SC always over-spread: the smallest eigenvalue is too small and the largest eigenvalue is too large. To this end, for instance, Haff [5], Yang and Berger [6], Daniels [7], and Daniels and Kass [8] proposed closed-form Bayesian covariance estimators with different priors on the eigenvalues. These methods retain good large-sample properties when the dimension of the matrix is fixed. Ledoit and Wolf [9] suggested a strictly positive definite estimator obtained by shrinking the SC matrix towards the identity matrix with little computational burden. This works well when p > n and when p increases with n; we introduce this method in Section 2. Following this line, modifications were made by Schäfer and Strimmer [10] and Fisher and Sun [11].
Many solutions have also been proposed to achieve sparsity. For likelihood-based methods, sparsity is the result of adding a LASSO penalty [12] to the likelihood; similar approaches are provided in, e.g., [13][14][15]. These approaches are called shrinkage methods, but they all require sophisticated algorithms to find the optimum. Alternatively, thresholding, i.e. setting small elements to 0, is more computationally efficient than the likelihood-based methods, and its properties have been investigated extensively by El Karoui [16] and Bickel and Levina [17]. Following this route, Rothman et al. [18] proposed a generalized thresholding method that combines thresholding and shrinking. The resulting covariance estimates are consistent and sparse under a setup in which both the sample size n and the dimension p go to infinity. Following Rothman et al. [18], Rothman [19] finds positive definite correlation estimates from a parameter space with a log-determinant barrier. Instead, Xiu et al. [20] and Liu et al. [21] developed algorithms based on the alternating direction method of multipliers [22] to find positive definite covariance/correlation estimates directly. It is worth noting that their algorithms require eigenvalue decompositions whenever the current covariance estimate is not positive definite, so they cost significantly more computation time than generalized thresholding methods.
In this study, we combine the idea of shrinking towards the diagonal [11] with the idea of generalized thresholding [18], i.e. shrinking small covariances towards zero, and we propose a richness index to guide the choice among shrinkage estimators. The rest of this article is organized as follows. In Section 2, the shrinking-towards-the-diagonal and generalized thresholding methods are reviewed. Section 3 presents the proposed double shrinkage estimators, a richness index and its estimator, and cross validation (CV) approaches for selecting the tuning parameters. In Section 4, we examine the performance of the random-splitting CV scheme, the richness index estimators, and the proposed covariance estimators. Conclusions are drawn in Section 5.

Preliminary
In this section, we review the shrinking-towards-the-diagonal and generalized thresholding methods. Before moving forward, we define notations and several frequently used functions. The (i, j)th element of a matrix A is a_ij, [A]_ij = a_ij; conversely, the collection of all a_ij's forms the matrix A, A = [a_ij]. Suppose that x_1, ..., x_n are n i.i.d. samples drawn from a p-dimensional multivariate normal N(0, Σ). The sample covariance matrix is Σ̄ = [σ̄_ij]. For a particular matrix B ∈ R^{p×p}, the Frobenius norm is defined as ‖B‖_F = √tr(B^T B), and the operator norm of B is defined as ‖B‖_2 = √ρ_1(B^T B), where ρ_1(·) denotes the largest eigenvalue of its argument matrix. In the context of sparse covariance estimation, Bickel and Levina [17] and Rothman et al. [18] discussed convergence in terms of the operator norm, while Ledoit and Wolf [9] discussed convergence in terms of the relative Frobenius norm, defined as ‖B‖_RF = ‖B‖_F/√p. In this article, we adopt the relative Frobenius norm because it leads to simple forms for inference.
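As a concrete plain-numpy illustration of the three norms above (a sketch; the function names are ours, not from the paper):

```python
import numpy as np

def frobenius_norm(B):
    # ||B||_F = sqrt(tr(B^T B))
    return np.sqrt(np.trace(B.T @ B))

def operator_norm(B):
    # ||B||_2 = sqrt(rho_1(B^T B)), rho_1 being the largest eigenvalue
    return np.sqrt(np.linalg.eigvalsh(B.T @ B)[-1])

def relative_frobenius_norm(B):
    # ||B||_RF = ||B||_F / sqrt(p) for a p x p matrix
    return frobenius_norm(B) / np.sqrt(B.shape[0])

B = np.diag([3.0, 4.0])
print(frobenius_norm(B))           # 5.0
print(operator_norm(B))            # 4.0
print(relative_frobenius_norm(B))  # 5/sqrt(2)
```

For a diagonal matrix the operator norm is simply the largest absolute diagonal entry, which makes the example easy to verify by hand.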

Shrinking towards diagonal
A covariance estimator guaranteed to have a reasonable condition number is desirable whenever the inverse of the covariance matrix enters an estimation procedure, e.g. linear regression or Gaussian graphical models. Ledoit and Wolf [9] proposed a well-conditioned estimator for high-dimensional covariance matrices. It is a convex linear combination of the SC matrix and a scaled identity matrix, Σ̄_LW = αμI + (1 − α)Σ̄, where I is the identity matrix, α ∈ [0, 1], and μ = Σ_{j=1}^p σ̄_jj/p ∈ R_+. Both α and μ can be estimated consistently. Heuristically, the estimator draws the SC Σ̄ towards the target matrix μI so that the eigenvalues are squeezed towards their average μ. Along this line, rather than using consistent estimators of μ and α, Schäfer and Strimmer [10] suggested using unbiased estimators. More recently, Fisher and Sun [11] extended this class of estimators to Σ̄_FS = (1 − α)D̄ + αΣ̄, where D̄ is the diagonal matrix consisting of the diagonal elements of Σ̄. We call this class of approaches 'shrinking towards the diagonal'. The approaches mentioned above are closely related to empirical Bayes methods [10,23]; from another point of view, they sacrifice unbiasedness to obtain an estimator with smaller mean squared error. The performance of Σ̄_LW and Σ̄_FS is comparable under some simulation scenarios [11]. However, from a data-analytic point of view, we note that the two matrices involved in Σ̄_FS have the same units element-wise, whereas the two matrices involved in Σ̄_LW do not, because μ = Σ_{j=1}^p σ̄_jj/p. Thus, unless all variables have the same unit or are normalized, we prefer Σ̄_FS for its straightforward interpretation. In addition, by applying Σ̄_FS, all estimators considered in the simulation studies yield identical variance estimates, so the interactions between the specified correlation structures and the applied covariance estimators can be revealed fairly.
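Both shrinkage targets are one-liners in numpy; the sketch below shows the two convex combinations (function names are ours, and the estimation of α and μ is omitted):

```python
import numpy as np

def lw_shrinkage(S, alpha):
    # Ledoit-Wolf style: alpha * mu * I + (1 - alpha) * S,
    # with mu the average of the diagonal of the sample covariance S
    mu = np.trace(S) / S.shape[0]
    return alpha * mu * np.eye(S.shape[0]) + (1.0 - alpha) * S

def fs_shrinkage(S, alpha):
    # Fisher-Sun style: (1 - alpha) * D + alpha * S,
    # with D the diagonal part of S (so the diagonal is unchanged)
    D = np.diag(np.diag(S))
    return (1.0 - alpha) * D + alpha * S

S = np.array([[4.0, 1.0],
              [1.0, 1.0]])
print(fs_shrinkage(S, 0.5))  # diagonal kept; off-diagonal halved
```

Note how `fs_shrinkage` leaves the variances intact while scaling every covariance by α, which is what makes the element-wise units argument above go through.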

Generalized thresholding
As a result of generalized thresholding [18], the covariance estimator is the minimizer of a penalized least squares problem for λ ≥ 0, and the variances (σ_ii's) are estimated by their corresponding sample variances. There are many choices of penalty term. Adopting the adaptive lasso penalty [24], p_{λ,γ}(σ_ij) = λ|σ_ij|^{−γ}, results in a class of generalized thresholding functions/estimators, g_{λ,γ}(x) = x(1 − (λ/|x|)^{γ+1})_+, for γ ≥ 0. Note that g_{λ,0}(x) = x(1 − λ/|x|)_+ is referred to as soft thresholding in the literature; g_{λ,1}(x) is closely related to the non-negative garrote [25]; and lim_{γ→∞} g_{λ,γ}(x) = xI(|x| > λ) is called hard thresholding. Moreover, as shown in [18], a generalized thresholding estimator satisfies three properties: (A.1) |g_λ(x)| ≤ |x|; (A.2) g_λ(x) = 0 for |x| ≤ λ; and (A.3) |g_λ(x) − x| ≤ λ. We further emphasize that, given λ and γ ∈ (0, ∞), the curve g_{λ,γ}(x) is bounded by g_{λ,0}(x) and g_{λ,∞}(x); that is, for any x, |g_{λ,0}(x)| ≤ |g_{λ,γ}(x)| ≤ |g_{λ,∞}(x)|, (1), where the two equalities hold simultaneously if and only if |x| < λ. The properties (A.1)-(A.3) delimit the function space of g, whereas (1) provides the magnitude of the deviation of g_{λ,γ}(x) from x with respect to γ; this property will be applied later. Note that the class of generalized thresholding operators defined in [18] is larger than the one defined above; we focus on this smaller class to ease interpretation.
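These thresholding rules are simple to implement element-wise; a vectorized numpy sketch (function names are ours):

```python
import numpy as np

def g_soft(x, lam):
    # g_{lam,0}(x) = x (1 - lam/|x|)_+ : soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def g_hard(x, lam):
    # g_{lam,inf}(x) = x 1(|x| > lam) : hard thresholding
    return np.where(np.abs(x) > lam, x, 0.0)

def g_adaptive(x, lam, gamma):
    # g_{lam,gamma}(x) = x (1 - (lam/|x|)^(gamma+1))_+ ,
    # interpolating between soft (gamma = 0) and hard (gamma -> inf)
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    keep = np.abs(x) > lam            # entries at or below lam become 0
    out[keep] = x[keep] * (1.0 - (lam / np.abs(x[keep])) ** (gamma + 1))
    return out

x = np.array([2.0, 0.3, -1.0])
print(g_soft(x, 0.5))         # [ 1.5  0.  -0.5]
print(g_hard(x, 0.5))         # [ 2.   0.  -1. ]
print(g_adaptive(x, 0.5, 1))  # garrote-like: [ 1.875  0.  -0.75]
```

Setting `gamma = 0` in `g_adaptive` reproduces `g_soft` exactly, and large `gamma` approaches `g_hard`, mirroring the limits stated above.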

Double shrinkage estimators
In general, neither positive definiteness nor well-conditioned-ness is guaranteed by generalized thresholding estimators. On the other hand, the estimators in Section 2.1 do not yield sparse estimates. A straightforward way to obtain a sparse and well-conditioned covariance estimator is to combine the ideas in Sections 2.1 and 2.2. Here, we hybridize shrinking towards 0 (generalized thresholding [18]) and shrinking towards the diagonal [11]. Define the double shrinkage function/estimator as d_{λ,γ,α}(σ̄_ij) = (1 − α)g_{λ,γ}(σ̄_ij) for i ≠ j, with d_{λ,γ,α}(σ̄_ii) = σ̄_ii. Several generalized thresholding functions and double shrinkage functions are shown in the left and right panels of Figure 1, respectively. Note that, since σ̄_ij is unbiased, the area bounded by g_{λ,γ}(σ̄_ij) and the identity function f(σ̄_ij) = σ̄_ij quantifies the bias due to generalized thresholding. The left panel of Figure 1 shows not only the relationship (1) but also that, among generalized thresholding functions defined through the adaptive lasso, the resulting absolute bias is a monotone decreasing function of γ. Consequently, the bias resulting from g_{λ,0}(σ̄_ij) is the upper bound for this class of estimators, whereas the bias resulting from g_{λ,∞}(σ̄_ij) is the lower bound. In addition, the second shrinkage step scales g_{λ,γ}(σ̄_ij) by the factor (1 − α) and therefore enlarges the bias further. This implies that, in order to control the bias, α should be sufficiently small, especially around the dominant contribution of the distribution of σ̄_ij.
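Putting the two steps together, a minimal sketch of the double shrinkage estimator (assuming the (1 − α) scaling of the thresholded off-diagonals as read from the definitions above; the function name is ours):

```python
import numpy as np

def double_shrinkage(S, lam, gamma, alpha):
    # Step 1: generalized thresholding of every entry,
    #         g_{lam,gamma}(x) = x (1 - (lam/|x|)^(gamma+1))_+
    ax = np.abs(S)
    safe = np.where(ax > lam, ax, np.inf)   # avoid dividing by small entries
    G = np.where(ax > lam, S * (1.0 - (lam / safe) ** (gamma + 1)), 0.0)
    # Step 2: shrink towards the diagonal by the factor (1 - alpha),
    #         keeping the sample variances on the diagonal
    out = (1.0 - alpha) * G
    np.fill_diagonal(out, np.diag(S))
    return out

S = np.array([[2.0, 1.0, 0.1],
              [1.0, 2.0, 0.0],
              [0.1, 0.0, 2.0]])
est = double_shrinkage(S, lam=0.5, gamma=0, alpha=0.2)
print(est)  # off-diagonal 1.0 -> 0.5 -> 0.4; 0.1 -> 0; diagonal kept at 2.0
```

With γ = 0 and α = 0.2, the covariance 1.0 is first soft-thresholded to 0.5 and then scaled to 0.4, while the entry 0.1 < λ is set exactly to 0, giving a sparse estimate with unchanged variances.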
Next, we offer a theorem addressing the convergence of the proposed estimator under three assumptions, (B.1)-(B.3). The first two assumptions are also made in [17,18], and the last quantifies the rate at which α is chosen. Although assumption (B.1) is too restrictive to apply in general, it is still meaningful in providing the magnitudes of λ and α in terms of n and p for research purposes. The proof is deferred to Appendix 1.

Choosing tuning parameters
For the proposed estimator, there are three tuning parameters: λ, γ, and α. In the following, we fix γ at 0, 1, and ∞ because the corresponding generalized thresholding counterparts are well studied in the literature. To choose the remaining tuning parameters, CV approaches have become standard. We first describe k-fold CV. For a data matrix X ∈ R^{n×p}, the rows of the data matrix are independent, and the covariance matrix of the elements within a row is Σ. We partition X into K sub-matrices with roughly equal numbers of rows, X[1], ..., X[K]. To emphasize the data matrix used for covariance estimation, we denote the SC estimated from a data matrix X by Σ̄(X) = [σ̄_ij(X)]. The loss for the proposed estimator is defined as L(λ, α) = E‖Σ̃ − Σ‖²_RF, where Σ̃ = [d_{λ,γ,α}(σ̄_ij)], and the (λ, α) that minimizes L(λ, α) is called optimal. However, Σ is unknown and thus we cannot estimate L(λ, α) directly. Instead, the k-fold CV criterion CV(λ, α; X) = K^{−1} Σ_{k=1}^K ‖[d_{λ,γ,α}(σ̄_ij(X[−k]))] − Σ̄(X[k])‖²_RF is applied to estimate the loss, where X[−k] denotes the data matrix with the kth sub-matrix removed. Note that, in k-fold CV, we estimate σ_ij twice, by d_{λ,γ,α}(σ̄_ij(X[−k])) and by σ̄_ij(X[k]). According to our simulations (not shown in this work), five-fold CV severely overestimates the optimal λ. This may be because only one-fifth of the data are used to estimate σ̄_ij(X[k]), so the estimation precision is relatively low and hence the best λ chosen by CV is not close enough to the optimal one.
Other than k-fold CV, Rothman et al. [18] suggested reserving some data as a testing set and using the rest as a training set: the training set is used to compute d_{λ,γ,α}(σ̄_ij) and the testing set is used to calculate the SC. This approach performs well if the sample size is large enough. On the other hand, Bickel and Levina [17] suggested random-splitting CV, which proceeds as follows. Denote by X_b the matrix whose rows are the result of the bth random permutation of the rows of X. Randomly split X_b into two sub-matrices X_b[1] and X_b[2] with roughly equal numbers of rows; the random-splitting CV estimator of L(λ, α) is then CV(λ, α; X) = B^{−1} Σ_{b=1}^B ‖[d_{λ,γ,α}(σ̄_ij(X_b[1]))] − Σ̄(X_b[2])‖²_RF, which is unbiased for the loss L(λ, α) up to a constant, for all b. The key point is that d_{λ,γ,α}(σ̄_ij(X_b[1])) and σ̄_ij(X_b[2]) are independent. The proof is deferred to Appendix 2.
In addition, since CV is a quadratic function of α, the unique minimizer of CV is α̂_{λ,γ} = 1 − [Σ_b Σ_{i≠j} g_{λ,γ}(σ̄_ij(X_b[1]))σ̄_ij(X_b[2])]/[Σ_b Σ_{i≠j} g²_{λ,γ}(σ̄_ij(X_b[1]))]. Moreover, we must force the α estimate to fall in [0, 1], and we thus truncate it, redefining α̂_{λ,γ} = min{1, max{0, α̂_{λ,γ}}}. Then λ̂ is chosen by grid search so that CV(λ, α̂_{λ,γ}; X) is minimized. In practice, λ need not be greater than the largest absolute sample covariance max_{i≠j}|σ̄_ij|. If the data are normalized (estimating a correlation matrix rather than a covariance matrix), it is proper to search for λ in the interval (0, 1). Hereafter, we abbreviate L(λ, α) to L(λ) because the best α is also a function of λ.
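The tuning loop can be sketched as follows; this is illustrative only (γ = 0, equal halves, off-diagonal entries only since the diagonal terms are constant in α, and the closed-form α̂ follows from minimizing a quadratic; function names are ours):

```python
import numpy as np

def sample_cov(X):
    # SC with divisor n (the Sigma-bar of Section 2)
    return np.cov(X, rowvar=False, bias=True)

def g_soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def random_split_cv(X, lam, B=10, seed=0):
    """Random-splitting CV for gamma = 0; returns (alpha_hat, cv_score)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    gs, ss = [], []
    for _ in range(B):
        perm = rng.permutation(n)
        S1 = sample_cov(X[perm[: n // 2]])   # feeds the shrinkage estimator
        S2 = sample_cov(X[perm[n // 2 :]])   # independent "target" SC
        off = ~np.eye(S1.shape[0], dtype=bool)
        gs.append(g_soft(S1[off], lam))
        ss.append(S2[off])
    g, s = np.concatenate(gs), np.concatenate(ss)
    den = float(np.sum(g * g))
    alpha = 1.0 if den == 0.0 else 1.0 - float(np.sum(g * s)) / den
    alpha = min(1.0, max(0.0, alpha))        # truncate to [0, 1]
    cv = float(np.sum(((1.0 - alpha) * g - s) ** 2)) / B
    return alpha, cv

# grid search over lambda
X = np.random.default_rng(1).standard_normal((50, 20))
grid = np.linspace(0.05, 0.8, 16)
scores = [random_split_cv(X, lam)[1] for lam in grid]
best_lam = grid[int(np.argmin(scores))]
```

For each candidate λ the best α is available in closed form, so the grid search is one-dimensional, which keeps the whole procedure cheap.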

Richness index
As emphasized, the second purpose of this work is choosing relevant estimators among those introduced in Section 2 and proposed in Section 3. We conjecture that the 'richness' of a covariance matrix can serve as an index for choosing estimators, because the possible improvement of the aforementioned estimators stems from the sparsity assumption [10]. Without loss of generality, we consider Σ as a correlation matrix, i.e. σ_11 = · · · = σ_pp = 1. For a particular correlation matrix Σ, let pD² = Σ_{i≠j} σ²_ij, the sum of squared correlations, and define the richness index as RI = D²/(D² + 1). For a sparse covariance model, D² is relatively small and so is RI, whereas for a rich covariance model, D² is relatively large and so is RI. For instance, consider the correlation models defined at the beginning of Section 4. As p → ∞, the ZR model gives D² = 2r²/p → 0; the MA model gives D² = 2r²(p − 1)/p → 2r²; and the compound symmetry (CS) model gives D² = r²(p − 1) → ∞. Consequently, if p → ∞ and p/n → c < ∞, then RI → 0 for the ZR model, RI → 2r²/(2r² + 1) for the MA model, and RI → 1 for the CS model. It appears that RI can serve as a classifier distinguishing the aforementioned covariance models into three classes, provided the parameter r is not extreme, e.g. r ≈ 0 or |r| ≈ 1.
In order to estimate RI, we utilize the fact that, under normality, E[σ̄²_ij] = σ²_ij + (σ_ii σ_jj + σ²_ij)/n, so the plug-in sum Σ_{i≠j} σ̄²_ij over-estimates pD² and must be corrected for sampling noise. Consequently, an estimator of the richness index is R̂I = D̂²/(D̂² + 1), where pD̂² = [n/(n + 1)][Σ_{i≠j} σ̄²_ij − p(p − 1)/n]. The performance of R̂I is shown in Section 4. Note that in practice we substitute the sample correlation matrix for Σ̄ in R̂I.
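A sketch of this estimator (the specific bias correction below follows the normal-theory moment above with unit variances; treat it as our reading of the construction, not a verbatim formula from the paper):

```python
import numpy as np

def richness_index(R, n):
    """Estimate RI = D^2/(D^2 + 1) from a p x p sample correlation
    matrix R computed from n observations."""
    p = R.shape[0]
    off_sq = float(np.sum(R ** 2)) - p        # sum_{i != j} r_ij^2
    # correct the plug-in sum for sampling noise, clipping at 0
    d2 = max((off_sq - p * (p - 1) / n) * n / (n + 1) / p, 0.0)
    return d2 / (d2 + 1.0)

# identity correlation (extreme sparse case): estimated RI is 0
print(richness_index(np.eye(10), n=50))   # 0.0
```

For a compound-symmetry correlation matrix with r = 0.5 and large n, the estimate approaches D²/(D² + 1) with D² ≈ r²(p − 1), matching the CS limit discussed above.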

Simulation studies
The simulation plan was as follows. For a given covariance matrix A, we factorized it by the Cholesky decomposition, A = LL^T. Next, a p-vector z of independent entries was sampled from the standard normal distribution or from the exponential distribution with mean 1. Each data set then consists of n independent transformed vectors x = Lz. The sample size was fixed at n = 50, with dimension p = 50f, where f = 0.6, 1, 2, 3, 4. Five covariance models, denoted ZR, MA, AR, BL, and CS, were applied, with parameter r = 0.5. Apart from the ZR and CS models, these models were also investigated by Bickel and Levina [17] and Rothman et al. [18]. Model ZR has two non-zero correlations, at positions (1, 2) and (2, 1). The MA covariance matrix is very sparse, as only 2(p − 1) of the p² − p off-diagonal elements are non-zero. Model AR is only approximately sparse [18]: entries farther away from the diagonal tend to be closer to 0 at an exponential rate, but are not exactly 0. Model BL was proposed by Bickel and Levina [17]; its correlations decrease linearly from the diagonal outward. Finally, the CS model has no zero elements unless r = 0. In the first simulation study, we explored the performance of random-splitting CV in terms of choosing λ. The sample size was 50 with dimension 100, and the number of random splits was B = 10. To estimate L(λ) = E‖Σ̃ − Σ‖²_RF, 200 independent data sets were generated and the expectation was approximated by Monte Carlo integration. That is, L(λ) is approximated by MCL(λ) = B₁^{−1} Σ_{j=1}^{B₁} ‖Σ̃(X_j) − Σ‖²_RF, where B₁ = 200 and the X_j's are independent data matrices generated from the same data-generating process. Moreover, define λ₀ = argmin_{λ≥0} MCL(λ) and λ̂ = argmin_{λ≥0} CV(λ). Three models (AR, BL, and CS) and three estimators (hard thresholding GT_∞, soft thresholding GT_0, and the double shrinkage estimator DS_∞) were applied; other estimators performed similarly and so are omitted. The results for normal data and for exponential data are similar, so we show only the former, in Figure 2.
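The data-generating step can be sketched as follows; the AR-type correlation used here, σ_ij = r^{|i−j|}, is an assumed concrete form consistent with the exponential-decay description above, not necessarily the paper's exact parameterization:

```python
import numpy as np

def generate_data(Sigma, n, seed=0):
    # factorize Sigma = L L^T (Cholesky), then each row is x = L z
    # with z a vector of independent standard normal entries
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, Sigma.shape[0]))
    return Z @ L.T

# an AR-type correlation matrix with r = 0.5 (illustrative form)
p, r = 50, 0.5
idx = np.arange(p)
Sigma = r ** np.abs(idx[:, None] - idx[None, :])
X = generate_data(Sigma, n=50)
print(X.shape)  # (50, 50)
```

Replacing `standard_normal` with centred, unit-variance exponential draws gives the non-Gaussian arm of the simulation.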
In Figure 2, the symbol O corresponds to MCL(λ₀) and the symbol X corresponds to CV(λ̂). It appears that the λ̂'s are much closer to the λ₀'s under the BL and CS covariance models than under the AR model. However, as shown in the next simulation study, this discrepancy does not, in general, result in poor covariance estimates.
Next, we examined the performance of the richness index estimator. Values of R̂I under the above covariance models with various dimensions p are summarized in Figure 3. For both Gaussian data (left panel) and exponential data (right panel), the richness index estimate successfully classifies the covariance models into three groups: ZR (RI → 0) has very low richness index estimates; MA (RI → 1/3) and AR have moderate richness index estimates; and BL and CS (RI → 1) have richness index estimates approaching 1. In the next simulation, we found that this grouping coincides with the choices of covariance estimators under these covariance models.
In the third simulation, we compare several covariance estimators: the SC, the FS estimator [11], generalized thresholding estimators (GT_γ, γ = 0, 1, ∞), the proposed double shrinkage estimators (DS_γ, γ = 0, 1, ∞), and the positive definite estimator proposed by Xiu et al. [20], named the XMZ estimator. Three measures were used to evaluate these estimators: the sample loss (SL), Σ_{i≠j} [d_{λ,γ,α}(σ̄_ij) − σ_ij]²; the true positive rate, TPR = (# true non-zero entries claimed to be non-zero)/(# true non-zero entries); and the false positive rate (FPR), defined analogously as the proportion of true zero entries claimed to be non-zero. First, we are interested in the effect of shrinking towards the diagonal, so we show the relative loss, the loss of DS_γ divided by the loss of GT_γ, in Figure 4 for Gaussian data and in Figure C1 for exponential data. They show that, in most cases, DS_γ outperforms GT_γ, except under the BL and CS models with Gaussian data. Second, we are interested in the choice of γ (0, 1, or ∞). For both Gaussian data (Figure 5) and exponential data (Figure C2), the losses of DS_0 and DS_1 are smaller than the loss of DS_∞. Under the MA and AR models, this tendency is strong when the sample size is small and vanishes when the sample size is large. So, under the ZR, MA, and AR models (RI < 1), using DS_0 may yield estimates with reasonably small loss compared with the other DS_γ's and GT_γ's. Last, we contrast the performance of the estimators FS, DS_0, and XMZ. From Figures 6 and C3, we draw the following conclusions. In terms of having the smallest loss: under the ZR model, the three methods are roughly equivalent; under the MA and AR models, DS_0 is slightly better than XMZ and much better than FS, especially when the dimension is larger than the sample size; under the BL and CS models, FS is the best.
Figure 4. Relative losses (estimated loss of DS_γ divided by estimated loss of GT_γ) under different covariance models, different dimensions, and different γ's, where data sets were generated from the normal distribution. Each loss was estimated from 100 independent replicates.

In general, DS_0 has a higher PDR (positive definite rate) than DS_γ, γ = 1 and ∞, under all situations. For a fixed γ, most of the PDRs of DS_γ are higher than or equal to those of GT_γ. Note that the PDR of DS_0 can be as low as 0.71 when the true covariance model is BL or CS (models with RI → 1). As suggested, DS_γ's are preferred when the covariance model has 0 ≤ RI < 1, and under the ZR, MA, and AR models all observed PDRs are 1. We also consider the median condition number (MCN) over 100 replicates. In most cases, the MCN of DS_0 is smaller than the MCNs of XMZ and FS. Last, we consider the TPRs and FPRs of these estimators. These rates vary among covariance models. Under most situations, TPR and FPR increase as γ becomes smaller. We observe that DS_0 typically has slightly higher TPR than XMZ, by up to 10%, but in extreme cases has much higher TPR, by up to 37%; these extreme cases occur when the BL model is applied. Finally, we notice that under the ZR model (having only two non-zero entries), the TPRs are extremely low, ranging from 0.00 to 0.30. This phenomenon reveals that DS_γ's, GT_γ's, and XMZ produce too many zeros. Therefore, we are not troubled by the observation that both TPR and FPR are 0, which merely says that the corresponding covariance estimate is a diagonal matrix.

Concluding remarks
In this work, we have proposed a class of double shrinkage estimators for large sparse covariance matrices: the SC is first shrunk element-wise towards zero and then shrunk towards a diagonal matrix. The performance of the double shrinkage estimators was examined via simulation under various covariance models. The results showed that for sparse (ZR and MA) or approximately sparse (AR) covariance models, double shrinkage estimators outperform the corresponding generalized thresholding estimators in terms of smaller losses. On the contrary, for rich (BL and CS) covariance models, the generalized thresholding estimators are better with Gaussian data. In addition, we observed that the performance of an estimator depends on how rich the target covariance matrix is, not merely on how sparse it is. This motivated us to define an index measuring the richness of a matrix. The richness index estimator defined in Section 3.2 and examined in Section 4 appears useful for choosing better estimators. Combining the information from our simulations, we conjecture that, for a given data set, if the richness index estimate is greater than 0.8, then the XMZ estimator (for sparse estimation) or the FS estimator (for non-sparse estimation) may perform well; otherwise, the proposed double shrinkage estimator with γ = 0 may perform better.
We advocate using R̂I as an indicator for choosing a covariance estimator; the suggested rule is summarized in Table 1. Note that since AR is approximately sparse and BL is less sparse than ZR and MA, sparseness is also a good indicator for estimator selection. However, we cannot estimate the sparseness of a covariance matrix before choosing a relevant estimator, whereas we can estimate the proposed richness index prior to choosing an estimator. Note also that the proposed richness index is a function of the relative Frobenius norm of the sample correlation matrix; consequently, a single huge covariance hardly affects the RI estimate when the dimension is large.
In some association studies, the covariance matrix is the final product, where zero covariances directly indicate no association, and hence positive definiteness is not required. On the other hand, in some discriminant analyses, the covariance estimate is part of a Gaussian density function and therefore must be positive definite. When positive definiteness is required, a mild modification can be imposed on the proposed shrinkage estimators: because the smallest eigenvalue of Σ̃ is a monotone increasing function of α, we can simply tune α upward until the estimator becomes positive definite, although, by doing so, the loss generally increases. Last, DS_0 is suggested when RI < 1 and, fortunately, our simulations show that the PDR of DS_0 is 1 when the underlying covariance models have RI < 1, hence the above modification is rarely required.