Bandwidth Selection for High-Dimensional Covariance Matrix Estimation

The banding estimator of Bickel and Levina and its tapering version of Cai, Zhang, and Zhou are important high-dimensional covariance estimators. Both estimators require a bandwidth parameter. We propose a bandwidth selector for the banding estimator that minimizes an empirical estimate of the expected squared Frobenius norm of the estimation error matrix. The ratio consistency of the bandwidth selector is established. We provide a lower bound for the coverage probability of the underlying bandwidth being contained in an interval around the bandwidth estimate. Extensions to bandwidth selection for the tapering estimator and threshold level selection for the thresholding covariance estimator are made. Numerical simulations and a case study on sonar spectrum data are conducted to demonstrate the proposed approaches. Supplementary materials for this article are available online.


INTRODUCTION
With the advances in modern data-collection technology, data of very high dimension are increasingly collected in scientific, socioeconomic, and financial studies; examples include microarray data, next-generation sequencing data, recordings of large networks, and financial observations on large portfolios. Suppose we observe independent and identically distributed p-dimensional random vectors X_1, . . . , X_n with an unknown covariance matrix Σ = var(X_1). The covariance is of great importance in multivariate analysis. The sample covariance S_n = n^{-1} ∑_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)^⊤ is a popular and valid estimator of Σ in conventional settings where the dimension p is fixed and the sample size n is relatively large. However, for high-dimensional data such that p/n → c ∈ (0, ∞], it is known that S_n is no longer consistent; see Bai and Yin (1993), Bai, Silverstein, and Yin (1998), and Johnstone (2001) for accounts of the issue.
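A quick numerical illustration of this inconsistency (a sketch of our own; the dimensions and seed are arbitrary choices): when p/n → c = 1 and the true covariance is the identity, the sample eigenvalues do not concentrate near 1 but spread over roughly the Marchenko–Pastur support [(1 − √c)², (1 + √c)²] = [0, 4].

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 200                       # p/n -> c = 1
X = rng.standard_normal((n, p))       # true covariance is the identity
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                     # sample covariance S_n

eig = np.linalg.eigvalsh(S)
# All population eigenvalues equal 1, yet the sample eigenvalues
# range over roughly [0, 4] when p is comparable to n.
print(round(eig.min(), 3), round(eig.max(), 3))
```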
There have been advances in constructing consistent covariance estimators for high-dimensional data via regularization methods that involve thresholding or truncation. Regularization based on the Cholesky decomposition has been considered in Wu and Pourahmadi (2003), Huang et al. (2006), and Rothman, Levina, and Zhu (2010) for estimating Σ and its inverse. Bickel and Levina (2008a) proposed banding the sample covariance S_n, which truncates all subdiagonal entries beyond a certain bandwidth to zero. Cai, Zhang, and Zhou (2010) investigated a tapering estimator, which can be viewed as a soft banding of the sample covariance, and demonstrated that it can attain the minimax optimal rate. For random vectors that do not have a natural ordering, so that the elements of Σ do not decay as they move away from the diagonal, Bickel and Levina (2008b) proposed a thresholding estimator, which was further developed by Rothman, Levina, and Zhu (2009) and others. Regularized estimation of Σ^{-1} has also been developed in Bickel and Levina (2008a), Cai, Liu, and Luo (2011), and Xue and Zou (2012).

Yumou Qiu is Assistant Professor, Department of Statistics, University of Nebraska-Lincoln, NE 68583-0963 (E-mail: yumouqiu@unl.edu). Song Xi Chen is Chair Professor, Department of Business Statistics and Econometrics, Guanghua School of Management and Center for Statistical Science, Peking University, Beijing 100651, China, and Professor, Department of Statistics, Iowa State University, Ames, IA 50011-1210 (E-mail: csx@gsm.pku.edu.cn). We thank the editors, the AE, and three anonymous referees for constructive comments and suggestions that have improved the presentation of the article. We also thank Feng Yi and Professor Hui Zou for sharing the code of their work. The research was partially supported by NSF grant DMS-1309210 and National Natural Science Foundation of China key grant 11131002. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/jasa.
The banding and tapering estimators require specifying the bandwidth that defines the number of subdiagonals not truncated to zero. For the thresholding estimator, a threshold level needs to be determined. Bickel and Levina (2008a, 2008b) and Cai, Zhang, and Zhou (2010) showed that the performance of these estimators depends crucially on the choice of the bandwidth or the threshold level. Bickel and Levina (2008a, 2008b) introduced cross-validation approximations to the Frobenius risk of estimation by repeatedly splitting the sample at random into two segments. One segment of the sample was used to estimate Σ and the other was employed to form cross-validation scores for the bandwidth and the threshold level selection, respectively. The conventional sample covariance was used to estimate Σ in the first segment. This can adversely affect the performance of the bandwidth or threshold level selection due to the sample covariance's known defects under high dimensionality. For banded covariances, Qiu and Chen (2012) proposed a method to select the bandwidth using a by-product of their test for the bandedness of Σ. Yi and Zou (2013) proposed a bandwidth selection for the tapering estimator by minimizing the expected squared Frobenius norm of the estimation error matrix for Gaussian distributed data.
In this article, we employ the Frobenius risk of the banding and the tapering estimators as the objective function, and define the underlying bandwidth as the smallest bandwidth that minimizes the objective function. By studying the objective function under a general distributional framework, we investigate the properties of the underlying bandwidth under a bandable covariance class that is better suited for the Frobenius norm. An estimator of the bandwidth is proposed by minimizing a nonparametric estimator of the objective function.
The use of the Frobenius norm, as Yi and Zou (2013) have noted, confers easier tractability than that based on the spectral norm. The ratio consistency of the proposed bandwidth estimator to the underlying bandwidth is established. We give a lower bound for the coverage probability of the underlying bandwidth being contained in an interval around the estimated bandwidth. Extensions to the tapering and thresholding estimators are considered.
The article is organized as follows. The new bandable covariance class and some needed assumptions are outlined in Section 2. Section 3 defines the underlying bandwidth and gives its properties. A ratio-consistent bandwidth estimator is constructed and its theoretical properties are investigated in Section 4. Section 5 provides an extension to the bandwidth selection for the tapering estimator. Section 6 extends to the threshold level selection for the thresholding estimator. Simulation results and a real data analysis are presented in Sections 7 and 8, respectively. Technical proofs are provided in the Appendix and in the online supplementary material.

BANDABLE CLASSES AND ASSUMPTIONS
Let X_1, X_2, . . . , X_n be independent and identically distributed (IID) p-dimensional random vectors with mean μ and covariance matrix Σ = (σ_{ij})_{p×p}. Throughout the article, ||·||_F, ||·||_{(2,2)}, and ||·||_{(1,1)} denote the Frobenius, spectral, and ℓ_1 norms of a matrix, respectively; and C, with or without subscripts, denotes a positive constant whose value may change on different occasions. We make the following assumptions.
Assumption 1 prescribes the mechanism governing the sample size and the dimensionality. The last part of Assumption 1 covers the "large p, small n" paradigm, where p can be much larger than n, as well as the case of p and n being of the same order. For the bandwidth selection, no specific relationship between n and p is needed. However, for the threshold level selection discussed in Section 6, a restriction of the form log p = o(n^{1/3}) is required. Assumption 2 is a version of the general multivariate model employed in Bai and Saranadasa (1996) and Qiu and Chen (2012), where {Z_{il}}_{l=1}^m may be viewed as the innovations of the data.

Bickel and Levina (2008a) considered the following "bandable" class of covariances:

U_1(α, C) = {Σ : max_{l_2} ∑_{|l_1−l_2|>k} |σ_{l_1 l_2}| ≤ Ck^{−α} for all k > 0, and 0 < ε_1 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ ε_1^{−1}}

for positive constants α, C, and ε_1. For a p × p matrix M = (m_{l_1 l_2})_{p×p}, let B_k(M) = (m_{l_1 l_2} I{|l_1 − l_2| ≤ k})_{p×p} be its banded version with a bandwidth k ∈ {0, . . . , p − 1}. Bickel and Levina (2008a) proposed B_k(S_n) as an estimator of Σ, where S_n is the sample covariance, and showed that, for k ∼ {log(p)/n}^{−1/(2α+2)}, the spectral risk of B_k(S_n) is O_p[{log(p)/n}^{α/(α+1)}] uniformly over U_1(α, C).

Cai, Zhang, and Zhou (2010) considered a slightly different class U_2(α, C), which replaces the restriction on the eigenvalues in U_1(α, C) with one on the diagonal elements. For U_2(α, C), Cai, Zhang, and Zhou (2010) proposed the tapering estimator T_k(S_n) = T^{(k)} ∘ S_n, where ∘ denotes the Hadamard product and T^{(k)} = (ω_{l_1 l_2}) is the weighting matrix with

ω_{l_1 l_2} = 1 for |l_1 − l_2| ≤ k, ω_{l_1 l_2} = 2 − |l_1 − l_2|/k for k < |l_1 − l_2| < 2k, and ω_{l_1 l_2} = 0 for |l_1 − l_2| ≥ 2k.

Note that ω_{l_1 l_2} decreases linearly in |l_1 − l_2| over k < |l_1 − l_2| < 2k. For easy algebraic manipulation, we use 2k as the effective bandwidth rather than k as in Cai, Zhang, and Zhou (2010). Cai, Zhang, and Zhou (2010) showed that, for k ∼ n^{1/(1+2α)},

E||T_k(S_n) − Σ||²_{(2,2)} = O{log(p)/n + n^{−2α/(1+2α)}},

which attains the minimax convergence rate over U_2(α, C). The banding and tapering estimators are not necessarily positive definite.
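For concreteness, the banding operator B_k(·) and the tapering weights can be sketched as follows (an illustrative sketch of our own; the function names are ours, and the weights follow the linear-decay description above with effective bandwidth 2k):

```python
import numpy as np

def band(M, k):
    """B_k(M): keep entries with |l1 - l2| <= k, set the rest to zero."""
    p = M.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, M, 0.0)

def taper_weights(p, k):
    """Tapering weights: 1 for |l1-l2| <= k, linear decay 2 - |l1-l2|/k
    for k < |l1-l2| < 2k, and 0 for |l1-l2| >= 2k."""
    i, j = np.indices((p, p))
    d = np.abs(i - j)
    if k == 0:                       # degenerate case: keep the diagonal only
        return (d == 0).astype(float)
    return np.clip(2.0 - d / k, 0.0, 1.0)

def taper(S, k):
    """T_k(S) = T^(k) o S, the Hadamard product with the weight matrix."""
    return taper_weights(S.shape[0], k) * S
```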
One way to mitigate the problem is to obtain the spectral decomposition of the covariance estimator and replace the negative and zero eigenvalues with small positive values, as suggested by Cai, Zhang, and Zhou (2010).
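The eigenvalue-repair step just described can be sketched as follows (the function name and the floor value eps are our illustrative choices, not the paper's):

```python
import numpy as np

def repair_pd(Sigma_hat, eps=1e-4):
    """Replace non-positive eigenvalues of a symmetric covariance
    estimate with a small positive value, keeping the eigenvectors."""
    Sym = (Sigma_hat + Sigma_hat.T) / 2      # enforce exact symmetry
    vals, vecs = np.linalg.eigh(Sym)
    vals = np.maximum(vals, eps)             # floor the spectrum at eps
    return vecs @ np.diag(vals) @ vecs.T
```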
It is clear from the analysis in Bickel and Levina (2008a) and Cai, Zhang, and Zhou (2010) that the convergence rates of the banding and the tapering estimators are critically dependent on the bandwidth k, whereas the bandwidth k depends on the unknown index parameter α of the bandable classes. However, estimating the index parameter is a challenging problem.
We shall consider another "bandable" matrix class that is better suited for bandwidth selection based on the Frobenius norm. To define the new "bandable" covariance class, let

h(k) = p^{−1} ∑_{l=1}^{p−k} σ²_{l,l+k}, for k ∈ {0, 1, . . . , p − 1},

be the (standardized) average of the squares of the kth subdiagonal entries.
For a fixed positive constant ν and the Δ in Assumption 2, define a covariance class

G(ν, q_{0p}) = {Σ : (i) ν^{−1} ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ ν; (ii) h(k) = o(k^{−1}) as k → ∞; (iii) nh(k) → ∞ for all k ≤ q_{0p}}.   (2.3)

The bounded largest and smallest eigenvalues in Part (i) replicate those in U_1(α, C). Part (ii) of (2.3) prescribes that h(k) diminishes to zero at a rate faster than k^{−1} for large k. It may be viewed as an analogue of the sparsity condition

max_{l_2} ∑_{|l_1−l_2|>k} |σ_{l_1 l_2}| ≤ Ck^{−α}   (2.4)

in U_1(α, C) and U_2(α, C). Note that for another covariance matrix class

F(β, M) = {Σ : |σ_{l_1 l_2}| ≤ M|l_1 − l_2|^{−β} for l_1 ≠ l_2},   (2.5)

Cai, Zhang, and Zhou (2010) established the minimax convergence result for the Frobenius norm under F(β, M) with β > 1. Hall and Jin (2010) also considered this class in their innovated higher criticism test. Part (iii) of (2.3) requires h(k) to maintain a sufficient amount of "energy" for smaller bandwidths, so that h(k) is at least of order n^{−1}. We note that h(k) actually starts with quite high "energy," since Part (i) implies that nh(0) = np^{−1} ∑_l σ²_{ll} → ∞. The reason for having the sample size n appear in Part (iii) is that the banding estimator depends on n. As shown in the next section, the criterion function for the bandwidth selection is based on the expected Frobenius norm of the estimation error matrix of the banding estimator, which inevitably involves n.
The main difference between G(ν, q_{0p}) and U_1(α, C) or U_2(α, C) is that the sparsity in G(ν, q_{0p}) is expressed in terms of h(k), whereas that in U_1(α, C) and U_2(α, C) is defined via ∑_{|j−l|>k} |σ_{jl}|. This difference reflects the different norms employed in these studies. As we use the Frobenius norm, it is natural to define the sparsity via h(k).
Two specific forms of h(k), which will be referred to repeatedly, are those that decay exponentially and polynomially fast as k → ∞. In the case of the exponential decay,

h(k) = C_p(k) θ^{−k} for some θ > 1;   (2.6)

in the case of the polynomial decay,

h(k) = C_p(k) k^{−β} for some β > 1.   (2.7)

In both cases, {C_p(q)}_{q=0}^{p−1} are sequences bounded within [C_1, C_2] for constants 0 < C_1 ≤ C_2. It can be shown that Part (iii) of (2.3) is satisfied under (2.6) or (2.7) with q_{0p} = log n/(2 log θ) for the exponential decay and q_{0p} = n^{1/(2β)} for the polynomial decay.
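The function h(k) and its exponential decay can be illustrated numerically (a sketch of our own; the normalization of h by p follows our reading of the definition, which matches nh(0) = np^{−1} ∑_l σ²_{ll}, and the AR(1)-type covariance is an assumed example):

```python
import numpy as np

def h(Sigma, k):
    """h(k) = p^{-1} * sum of squared entries on the k-th subdiagonal."""
    p = Sigma.shape[0]
    return np.sum(np.diag(Sigma, k) ** 2) / p

# Example: sigma_{ij} = theta^{-|i-j|} decays exponentially, so
# h(k) behaves like C_p(k) * (theta^2)^{-k}, i.e. (2.6) with base theta^2.
p, theta = 50, 1 / 0.7
i, j = np.indices((p, p))
Sigma = (1 / theta) ** np.abs(i - j)

# Successive ratios h(k+1)/h(k) hover near 0.7^2 = 0.49.
ratios = [h(Sigma, k + 1) / h(Sigma, k) for k in range(5)]
```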

UNDERLYING BANDWIDTH
In this section, we define the underlying bandwidth for the matrix class G(ν, q_{0p}). The properties of the underlying bandwidth are given, which provide the basis for its empirical estimation in the next section.
Consider the standardized squared Frobenius norm of B_k(S_n) − Σ,

Obj_B(k) = p^{−1} E||B_k(S_n) − Σ||²_F.

Compared with the spectral norm, the Frobenius norm is more tractable in the context of bandwidth estimation. The underlying bandwidth is

k_B = min{k | k = argmin_{0≤k<p} Obj_B(k)}.

As Obj_B(k) is defined on a discrete set, a minimizer exists, and we choose the smallest minimizer in the case of multiplicity.
We now analyze the properties of k_B for Σ ∈ G(ν, q_{0p}). Denote

M_n(k) = ∑_{q>k} h(q) + (np)^{−1} ∑_{|l_1−l_2|≤k} σ_{l_1 l_1} σ_{l_2 l_2},   (3.3)

and write Obj_B(k) = M_n(k) + Δ ∑_{q≤k} R(q) + tr(Σ²)/(np), where the R(q) are remainder terms attached to the kurtosis parameter Δ of Assumption 2. As tr(Σ²)/(np) is irrelevant to k, we only minimize M_n(k) + Δ ∑_{q≤k} R(q). For Gaussian distributed data, Δ = 0 and minimizing Obj_B(k) reduces to minimizing M_n(k). The first term of M_n(k) in (3.3) measures the bias caused by the banding estimation, and the second term penalizes larger k. Therefore, Obj_B(k) can be viewed as a penalized risk function of the bandwidth.
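When Σ is known, the penalized risk — a bias term summing h(q) over q > k plus the penalty (np)^{−1} ∑_{|l_1−l_2|≤k} σ_{l_1 l_1} σ_{l_2 l_2} — can be evaluated directly. The sketch below is based on our reading of (3.3); the function names M_n and k_tilde are ours:

```python
import numpy as np

def M_n(Sigma, k, n):
    """Oracle penalized risk: tail sum of h(q) for q > k (bias)
    plus (np)^{-1} * sum of sigma_ll1 * sigma_ll2 over |l1-l2| <= k."""
    p = Sigma.shape[0]
    bias = sum(np.sum(np.diag(Sigma, q) ** 2) for q in range(k + 1, p)) / p
    d = np.diag(Sigma)
    i, j = np.indices((p, p))
    penalty = np.sum(np.outer(d, d)[np.abs(i - j) <= k]) / (n * p)
    return bias + penalty

def k_tilde(Sigma, n):
    """Smallest minimizer of M_n(k) over k = 0, ..., p-1."""
    vals = [M_n(Sigma, k, n) for k in range(Sigma.shape[0])]
    return int(np.argmin(vals))   # np.argmin returns the first minimizer
```

For an exponentially decaying covariance, the minimizer balances the geometric bias tail against the O(k/n) penalty, so it grows logarithmically in n.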
The following lemma provides the basic properties of M n (k) and R(k) in Obj B (k).
Lemma 1 and (3.4) imply that M_n(k) is at least of order k/n. Since ∑_{q≤k} R(q) ≤ C/n for a constant C, M_n(k) is the leading order term of Obj_B(k) as k → ∞.
Let k̃_B denote the smallest minimizer of M_n(k).

Lemma 2. Under Assumptions 1 and 2, and for Σ ∈ G(ν, q_{0p}), k̃_B ∈ [k_{a,n}, k_{b,n}] and k_B ∈ [k_{a,n} − L, k_{b,n} + L] for a finite positive integer L, where k_{a,n} ≥ q_{0p}.

The lemma shows that k_B has a broader range than k̃_B. This is due to the uncertainty introduced by |Δ|R(k) in (3.4). The ranges given in Lemma 2 prepare for k̃_B/k_B → 1 as n → ∞, the key result of this section. Since k_{a,n} ≥ q_{0p} → ∞, it follows from Lemma 1 that M_n(k) is the leading order term of Obj_B(k) for k ∈ [k_{a,n} − L, k_{b,n} + L]. This suggests that we can minimize M_n(k) directly.
The main thrust of the article is to minimize an empirical estimator of M_n(k) to obtain an estimator of k̃_B, which may be viewed as a kind of M-estimation. As in M-estimation, a condition is needed to guarantee the existence of a unique and well-separated minimum of the objective function. Since M_n(k) is the leading order term of Obj_B(k), a condition that serves this purpose is that, for any small δ > 0 and n large enough,

inf_{|k − k̃_B| > δk̃_B} {M_n(k) − M_n(k̃_B)} ≥ Ck̃_B n^{−1}.   (3.6)

Condition (3.6) is similar to the second equation of (5.8) in van der Vaart (2000), except that (3.6) imposes a minimum rate of separation k̃_B n^{−1} between M_n(k) and M_n(k̃_B). The latter is because M_n(k̃_B) shrinks to zero at the rate k̃_B/n, as revealed by Lemma 1. The following lemma shows that under (3.6), k_B and k̃_B are ratio-equivalent.
As condition (3.6) is key to the M-estimation of the underlying bandwidth, we provide two sufficient conditions for (3.6) in the online supplementary material, showing that it is satisfied if h(k) decays either exponentially or polynomially.

Exponentially Decayed h(k)
It is shown in the Appendix that k_B ∼ log n/log θ under the exponential decay (2.6). A proof in the supplementary material shows that (3.6) is satisfied in this case.

Polynomially Decayed h(k)
If h(k) decays polynomially as specified in (2.7), then k_B ∼ n^{1/β}, as shown in the Appendix. If the diagonal entries of Σ are sufficiently regular, in the sense of a condition (3.7) involving a limit constant C_0, then (3.6) is satisfied. One such situation is when all the diagonal elements are equal. If the diagonal entries differ but are independent realizations from m super-populations for a fixed integer m, where F_h is the hth super-population distribution with mean φ_h and finite variance for h = 1, . . . , m and p_m = p/m, then it is shown in the online supplementary material that (3.7) is satisfied with C_0 = m^{−1} ∑_{h=1}^m φ²_h.

Now let us put our analysis in the context of existing results on the banding and tapering estimation. Recall that Bickel and Levina (2008a) found that if k ∼ {log(p)/n}^{−1/(2α+2)}, the spectral risk of the banding estimator is O_p[{log(p)/n}^{α/(α+1)}] uniformly for Σ ∈ U_1(α, C). Cai, Zhang, and Zhou (2010) showed that setting k ∼ n^{1/(2α+1)} leads to the minimax optimal rate of O_p{n^{−2α/(2α+1)} + log(p)/n} for the tapering estimator under the spectral norm for Σ ∈ U_2(α, C). For the Frobenius norm, they showed that the minimax rate for Σ ∈ U_2(α, C) is equivalent to the minimax rate for the smaller class F(β, M) in (2.5) with β > 1, and the bandwidth of the tapering estimator corresponding to the minimax optimal rate is k ∼ n^{1/(2β)}. By inspecting their proofs, it can be shown that the banding estimator with k ∼ n^{1/(2β)} can also attain the minimax lower bound under the Frobenius norm. Moreover, the minimax rate of the banding and tapering estimators under F(β, M) is attained at covariances with |σ_{lj}| = M|l − j|^{−β}, which coincide with the polynomial decay model (2.7) with h(k) of order k^{−2β}. We note that this minimax bandwidth rate of k ∼ n^{1/(2β)} is the rate of k_B under the polynomial decay, as shown in (A.8). Since k_B minimizes the Frobenius risk, the banding estimator with the bandwidth k_B should attain the minimax convergence rate under the Frobenius norm.

CONSISTENT BANDWIDTH ESTIMATOR
We consider in this section estimating the bandwidth for the banding estimator; a proposal for the tapering estimator will be given in Section 5. As outlined in the previous sections, there are two bandwidths, k_B and k̃_B, which are asymptotically equivalent to each other under (3.6). However, it is easier to estimate k̃_B than k_B, since M_n(k) is more readily estimated.
Clearly, if Δ = 0 as in the Gaussian case, then k_B = k̃_B. We do not target Obj_B(k) directly, due to its requiring estimation of R(k) and Δ.
According to (3.3), to estimate M_n(k), we need to estimate h(q) and g(q), respectively, where g(q) := ∑_{l=1}^{p−q} σ_{ll} σ_{l+q,l+q}. Estimators ĥ(q) and ĝ(q) are defined as linear combinations of U-statistics of different orders, with the first term being the dominating term in each case; here ∑* denotes summation over distinct subscripts and P_n^b = n!/(n − b)!. Let W(k) = ∑_{q>k} h(q) and V(k) = (np)^{−1} ∑_{|l_1−l_2|≤k} σ_{l_1 l_1} σ_{l_2 l_2} denote the two terms of M_n(k), and let Ŵ(k) and V̂(k) be their unbiased estimators constructed from ĥ(q) and ĝ(q). Then, an unbiased estimator of M_n(k) is

M̂_n(k) = Ŵ(k) + V̂(k).   (4.1)
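The structure of the selector can be sketched as follows. This is emphatically not the paper's exact U-statistic construction: the sketch below replaces ĥ(q) and ĝ(q) with crude bias-corrected moment estimates built from the sample covariance, under an assumed (3.3)-style objective, and the function name is ours:

```python
import numpy as np

def select_bandwidth(X):
    """Simplified plug-in version of minimizing an estimate of M_n(k):
    tail sum of bias-corrected squared subdiagonal averages plus a
    (np)^{-1} diagonal-product penalty."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    d = np.diag(S)

    def h_hat(q):
        s = np.diag(S, q)
        v = d[:p - q] * d[q:]                 # proxy for sigma_ll * sigma_{l+q,l+q}
        # s^2 overestimates sigma^2 by roughly (sigma_ll sigma_l'l' + sigma^2)/n
        return np.sum(s ** 2 - (v + s ** 2) / n) / p

    h_vals = np.array([h_hat(q) for q in range(p)])
    i, j = np.indices((p, p))
    outer = np.outer(d, d)
    kmax = min(p - 1, n)
    M = [h_vals[k + 1:].sum() + outer[np.abs(i - j) <= k].sum() / (n * p)
         for k in range(kmax + 1)]
    return int(np.argmin(M))
```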
The bandwidth estimator in (4.2) takes k̂_B as the smallest minimizer of M̂_n(k) over [k_{1,n}, k_{2,n}], where k_{1,n} = k_{a,n}/r_1 and k_{2,n} = min{r_2 k_{b,n}, n} for constants r_1, r_2 ≥ 1. Alternatively, we can minimize M̂_n(k) over the more conservative interval [0, n], which leads to the estimator in (4.3) at the cost of making the relationship between n and p more restrictive.
Theorem 1. Under Assumptions 1 and 2 and (3.6), if Σ ∈ G(ν, q_{0p}) and (k_{b,n} − k_{a,n})/k_{a,n} ≤ C, then for k̂_B given in (4.2), k̂_B/k̃_B → 1 in probability as n → ∞.

As k̃_B/k_B → 1 under (3.6), Theorem 1 implies that k̂_B is a ratio-consistent estimator of k_B. The same ratio consistency can be established for the bandwidth estimator (4.3) under Assumption 2, (3.6), and n = O(p^{1/3}); the latter is more restrictive than Assumption 1. The condition (k_{b,n} − k_{a,n})/k_{a,n} ≤ C assumed in Theorem 1 implies that k_{a,n} and k_{b,n} are of the same order. Derivations leading to (A.5) and (A.8) in the Appendix show that it is satisfied under both the exponential and polynomial decays of h(k).
In the following, we evaluate the estimation error of k̂_B relative to k̃_B by providing a lower bound on the probability of k̂_B being included in an interval around k̃_B. To this end, we need a condition on the behavior of M_n(k) in addition to (3.6).
Assumption 3. There exist a constant γ ≥ 1 and an integer τ ≥ 1 such that, for any small δ > 0, any τ < η < δk̃_B, and n large enough, M_n(k) − M_n(k̃_B) ≥ Cη^γ n^{−1} for all k with |k − k̃_B| ≥ η.

While (3.6) controls the separation between M_n(k) and M_n(k̃_B) when k deviates from k̃_B proportionally, Assumption 3 quantifies the separation for deviations of smaller order. In the following, we show that Assumption 3 is satisfied for both the exponential and polynomial decays of h(k); the proof is given in the online supplementary material.
If h(k) decays exponentially as in (2.6), then Assumption 3 holds with τ = 1 and γ = 1; if h(k) decays polynomially as in (2.7), then Assumption 3 holds with τ = 1 and γ = 1 + 1/β.

Theorem 2. Under Assumptions 1, 2, and 3 and (3.6), if Σ ∈ G(ν, q_{0p}), (k_{b,n} − k_{a,n})/k_{a,n} ≤ C, and log(k_{2,n}) ∑_{q>k_{1,n}} h(q) = o(1), then P(k̂_B ∈ [k̃_B − τ, k̃_B + τ]) is bounded below by a quantity converging to 1 whenever n^{2γ−1} p^{−1} is bounded.

The proof of Theorem 2 is given in the online supplementary material. Recall that k_{1,n} = k_{a,n}/r_1 and k_{2,n} = min{r_2 k_{b,n}, n} for r_1, r_2 ≥ 1. Derivations given in (A.7) and (A.9) show that log(k_{2,n}) ∑_{q>k_{1,n}} h(q) = o(1) under both the exponential and polynomial decays, respectively, for any positive constants r_1 and r_2. Since τ is usually unknown, [k̂_B − τ, k̂_B + τ] is not a confidence interval for k̃_B; we may call it a concentration interval. Theorem 2 shows that the probability that k̂_B is included in the interval converges to 1 if n^{2γ−1} p^{−1} is bounded. For Gaussian data, Δ = 0 and k_B = k̃_B; hence, the concentration interval is also one for k_B.

EXTENSION TO TAPERING ESTIMATION
The analysis we have made for the banding estimator can be extended to the tapering estimator of Cai, Zhang, and Zhou (2010). The underlying bandwidth for the tapering estimator T_k(S_n) can be defined via the standardized squared Frobenius risk

Obj_T(k) = p^{−1} E||T_k(S_n) − Σ||²_F.   (5.1)

Taking the expectation, the risk of the tapering estimation consists of a bias term involving ∑_{|l_1−l_2|>k} (1 − ω_{l_1 l_2})² σ²_{l_1 l_2} and variance terms (np)^{−1} ∑_{|l_1−l_2|≤k} σ_{l_1 l_1} σ_{l_2 l_2} + (np)^{−1} ∑_{k<|l_1−l_2|≤2k} ω²_{l_1 l_2} σ_{l_1 l_1} σ_{l_2 l_2}.

Table 1. Average and standard deviation (in parentheses) of the proposed bandwidth estimators and Bickel and Levina's CV estimators (BL) for the banding estimation under the covariance Design (A) with θ^{−1} = 0.7 and θ^{−1} = 0.9 in (7.2) for the standard normal and standardized t_5 innovations.

Table 3. Average and standard deviation (in parentheses) of the proposed bandwidth estimators for the tapering estimation under the covariance Designs (A) with θ^{−1} = 0.7 and θ^{−1} = 0.9, (B), and (C) with ξ = 0.5 and β = 1.5 in (7.2) and (7.3) for the standard normal and standardized t_5 innovations.
The underlying bandwidth of the tapering estimator is k_T = min{k | k = argmin_{0≤k<p/2} Obj_T(k)}. Similar to the banding case, minimizing Obj_T(k) is equivalent to minimizing it without the k-free constant term. Just as M_n(k) is the dominant term of Obj_B(k), a term N_n(k) dominates Obj_T(k), and the minimization of Obj_T(k) can be carried out by minimizing N_n(k).
Denote by ω_q the tapering weight for |l_1 − l_2| = q. Using ĥ(q) and ĝ(q) from the previous section, we define an estimator N̂_n(k) of N_n(k), noting that the tapering estimator uses 2k as the effective bandwidth. Denote by k̂_T the smallest minimizer of N̂_n(k). An analysis of the bandwidths k_T and k̃_T may be carried out in a similar fashion to that for k_B and k̃_B for the banding estimator, and the ratio convergence of k̂_T to k_T may be established under certain conditions. We will evaluate the empirical performance of k̂_T in the simulations and the case study in Sections 7 and 8.
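An oracle version of the tapering risk can be sketched as follows, using the weight structure with effective bandwidth 2k and the bias-plus-variance decomposition described above. The function name N_n is ours, and the exact form is our reading of the risk expansion (a sketch, assuming known Σ):

```python
import numpy as np

def N_n(Sigma, k, n):
    """Oracle tapering risk: down-weighted-entry bias plus a weighted
    (np)^{-1} diagonal-product penalty, with effective bandwidth 2k."""
    p = Sigma.shape[0]
    i, j = np.indices((p, p))
    d = np.abs(i - j)
    if k == 0:
        w = (d == 0).astype(float)
    else:
        w = np.clip(2.0 - d / k, 0.0, 1.0)   # 1 inside, linear decay, 0 beyond 2k
    bias = np.sum(((1.0 - w) * Sigma) ** 2) / p
    dd = np.outer(np.diag(Sigma), np.diag(Sigma))
    penalty = np.sum((w ** 2) * dd) / (n * p)
    return bias + penalty
```

Minimizing this over k with ĥ and ĝ plugged in for the population quantities gives k̂_T.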

EXTENSION TO THRESHOLDING ESTIMATION
Both the banding and tapering estimators require the variables in X to have a natural ordering such that the correlation decays as two variables become further apart. For covariances without such an ordering, Bickel and Levina (2008b) proposed the thresholding estimator under the following covariance class:

max_{l_1} ∑_{l_2=1}^p |σ_{l_1 l_2}|^q ≤ c_0(p),   (6.1)

for a q ∈ (0, 1) and some positive function c_0(p). For any p × p matrix M = (m_{l_1 l_2})_{p×p}, the thresholding operator at level t is D_t(M) = (m_{l_1 l_2} I{|m_{l_1 l_2}| ≥ t})_{p×p}. Let η^{(1)}_{l_1 l_2} and η^{(2)}_{l_1 l_2} denote certain upper tail probability functions, and define Obj_D(t, Σ) = E{||D_{t_n}(S_n) − Σ||²_F}.
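The hard-thresholding operator is elementary to implement (a sketch; keeping the diagonal untouched is our choice here, since thresholding targets the off-diagonal entries):

```python
import numpy as np

def hard_threshold(M, t):
    """D_t(M): zero every entry with |m| < t; the diagonal is kept."""
    out = np.where(np.abs(M) >= t, M, 0.0)   # new array; M is untouched
    np.fill_diagonal(out, np.diag(M))
    return out
```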
The proof is given in the online supplementary material. The sub-Gaussian condition in the theorem is required in order to apply moderate deviation results. However, if a standardization is used, the sub-Gaussian assumption can be relaxed: the standardization allows moderate deviation results for self-normalized statistics, which require weaker assumptions, as shown in Jing, Shao, and Wang (2003).
From Proposition 3, it is seen that the leading order term of Obj_D(t, Σ) can be used as a substitute for Obj_D(t, Σ). Under Assumption 2, it can be shown that g²_{l_1 l_2} = σ_{l_1 l_1} σ_{l_2 l_2} + σ²_{l_1 l_2} + f_{l_1 l_2}, where the term f_{l_1 l_2} vanishes for Gaussian data. For simplicity, we focus on normally distributed data in this section, so that Δ = 0 and g²_{l_1 l_2} = σ_{l_1 l_1} σ_{l_2 l_2} + σ²_{l_1 l_2}. Therefore, to estimate g²_{l_1 l_2}, it suffices to estimate σ_{l_1 l_2}.
Note that η^{(1)}_{l_1 l_2}, η^{(2)}_{l_1 l_2}, and t_n are continuous and differentiable functions, so Obj_D(t, Σ) is continuous and differentiable with respect to t. Therefore, the minimum of Obj_D(t, Σ) exists on any closed interval [0, B] for B > 0. Define the underlying threshold level as

t_0(Σ) = argmin_{t∈[0,B]} Obj_D(t, Σ).   (6.4)

Before presenting an algorithm to find an estimate of t_0(Σ), we review the cross-validation (CV) approach proposed in Bickel and Levina (2008b), which was designed to approximate the Frobenius risk Obj_D(t, Σ). They proposed splitting the original sample randomly into two groups of sizes n_1 and n_2, repeated N times.
In the vth split, let S_1^v and S_2^v be the sample covariances based on the two subsamples, respectively. The estimated Frobenius risk with respect to t is

R̂_D(t) = N^{−1} ∑_{v=1}^N ||D_{t_n}(S_1^v) − S_2^v||²_F.   (6.5)

A similar approach was used in Bickel and Levina (2008a) to select the bandwidth for the banding estimator, and has also been used for the adaptive thresholding estimator. Due to the inconsistency of S_2^v, R̂_D(t) is an unreliable estimate of Obj_D(t, Σ), which may result in unstable threshold selection, as revealed in our simulation study.
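The split-sample CV score (6.5) can be sketched as follows (a sketch of our own; the grid of candidate levels and the threshold being applied directly at level t, rather than through a scaled t_n, are simplifying assumptions):

```python
import numpy as np

def cv_threshold(X, t_grid, N=50, rng=None):
    """Bickel-Levina style CV: random splits, threshold S_1^v at each
    candidate level, score against S_2^v in squared Frobenius norm."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    n1 = int(n * (1 - 1 / np.log(n)))        # split size used in the paper
    scores = np.zeros(len(t_grid))
    for _ in range(N):
        perm = rng.permutation(n)
        A, B = X[perm[:n1]], X[perm[n1:]]
        S1 = np.cov(A, rowvar=False, bias=True)
        S2 = np.cov(B, rowvar=False, bias=True)
        for m, t in enumerate(t_grid):
            D = np.where(np.abs(S1) >= t, S1, 0.0)
            scores[m] += np.sum((D - S2) ** 2)
    return t_grid[int(np.argmin(scores))]
```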
We propose an iterative procedure for selecting the threshold level t that makes use of the derived expressions for the Frobenius risk in Proposition 3. We use Obj_D(t, D_{t_{n,BL}}(S_n)) with t_{n,BL} = 2t̂_{BL}√{(log p)/n} as an initial estimate of Obj_D(t, Σ), where D_{t_{n,BL}}(S_n) is the thresholding estimator of Σ with Bickel and Levina's threshold selector t̂_{BL}. In the computation of Obj_D(t, D_{t_{n,BL}}(S_n)), all the g_{l_1 l_2}, η^{(1)}_{l_1 l_2}, and η^{(2)}_{l_1 l_2} appearing in (6.3) are replaced by their estimates implied by D_{t_{n,BL}}(S_n). Then, the selected threshold level in the first iteration is

t̂_1 = argmin_{t∈[0,B]} Obj_D(t, D_{t_{n,BL}}(S_n)),   (6.7)

which may be viewed as a refinement of Bickel and Levina's approach.
Having acquired t̂_{h−1} for h ≥ 2, the hth iterative threshold estimator is

t̂_h = argmin_{t∈[0,B]} Obj_D(t, D_{t̂_{n,h−1}}(S_n)),   (6.8)

where t̂_{n,h−1} = 2t̂_{h−1}√{(log p)/n}. Simulations given in the next section demonstrate that the algorithm tends to converge within five iterations and has superior performance over Bickel and Levina's CV method.
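The iteration can be sketched as follows. The exact risk expressions in (6.3) involve the tail probability functions η^{(1)}, η^{(2)}; under Gaussianity, one concrete per-entry risk consistent with that description is E[(s·1{|s| ≥ t} − μ)²] for s ~ N(μ, g²/n), which splits into μ² on the thresholded-out event plus the truncated variance on the tail event. This is our own plug-in sketch, not the paper's exact estimator; all function names are ours:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def phi(x):
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def entry_risk(mu, v, t):
    """E[(s*1{|s|>=t} - mu)^2] for s ~ N(mu, v): mu^2 * P(|s| < t)
    plus v * E[z^2; |s| >= t] via truncated normal moments."""
    sd = sqrt(v)
    a, b = (-t - mu) / sd, (t - mu) / sd
    inside = Phi(b) - Phi(a)
    tail_var = v * (1.0 - inside - a * phi(a) + b * phi(b))
    return mu * mu * inside + tail_var

def plugin_risk(t, Sigma_hat, n):
    """Sum of per-entry risks over the off-diagonal, with the current
    estimate Sigma_hat playing the role of Sigma."""
    p = Sigma_hat.shape[0]
    d = np.diag(Sigma_hat)
    r = 0.0
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            mu = Sigma_hat[i, j]
            v = (d[i] * d[j] + mu * mu) / n   # g^2_{l1 l2}/n under Gaussianity
            r += entry_risk(mu, v, t)
    return r

def iterate_threshold(S, n, t0, grid, steps=5):
    """Refine the threshold: re-threshold S at the current level, then
    re-minimize the plug-in risk over the grid."""
    t = t0
    for _ in range(steps):
        Sig = np.where(np.abs(S) >= t, S, 0.0)
        np.fill_diagonal(Sig, np.diag(S))
        t = min(grid, key=lambda u: plugin_risk(u, Sig, n))
    return t
```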

SIMULATION RESULTS
We report results of simulation studies designed to evaluate the empirical performance of the proposed bandwidth and threshold-level estimators for the banding, tapering, and thresholding covariance estimators. We also compared with the cross-validation estimator of Bickel and Levina (2008a, 2008b) and the SURE approach of Yi and Zou (2013).
The random generation of the diagonal elements makes the column series {X_{i1}, . . . , X_{ip}} under Design (C) nonstationary; a similar design was considered in Cai, Liu, and Xia (2013). When evaluating the thresholding estimator, normally distributed data were generated for the covariance structure (A) in (7.2) with θ = 0.7^{−1} and 0.9^{−1}, as well as for a block diagonal covariance, Design (D) in (7.4), whose two diagonal blocks follow structure (A) with θ = 0.3^{−1} and 0.9^{−1}, respectively. To mimic the "large p, small n" paradigm, we chose n = 40, 60 and p = 40, 200, 400, and 1000. We considered the more conservative bandwidth estimator in (4.3), which has a wider search region. For the banding estimation, comparison was made with the cross-validation approach of Bickel and Levina (2008a, 2008b). Similar to (6.5), the empirically estimated Frobenius risk with respect to the bandwidth k is

R̂(k) = N^{−1} ∑_{v=1}^N ||B_k(S_1^v) − S_2^v||²_F,

and the bandwidth estimator is k̂_{BL} = argmin_k R̂(k). Following Bickel and Levina (2008b), we chose n_1 = n(1 − 1/log n) and the number of random splits N = 50. We chose B = 2.5 in (6.6), (6.7), and (6.8) in the algorithm for the threshold levels.
All the simulation results reported in this section were based on 500 replications. Tables 1, 2, and 3 report the averages and standard deviations of the proposed bandwidth estimators for both the banding and tapering estimation, and those of Bickel and Levina's (2008a) CV bandwidth estimator (BL), under both the Gaussian and standardized t_5 innovations with the covariance Designs (A), (B), and (C) in (7.2) and (7.3).
It is observed from Tables 1 and 2 that the proposed bandwidth estimator had smaller bias and standard deviation than Bickel and Levina's CV estimator in almost all cases in the simulations. The bias and standard deviation of the proposed bandwidth selector were consistently less than 0.5 for larger p, which may be viewed as confirming the finding in Theorem 2 that the underlying bandwidths are within O_1 = [k̂_B − 1, k̂_B + 1] with overwhelming probability. It is also observed that as p increased, both the bias and the standard deviation of the proposed bandwidth estimator were reduced; this was not necessarily the case for the CV bandwidth selector.
Comparing the results for the banding and tapering estimators in Tables 1 and 3 under Design (A), we found that the underlying k_B and k_T were more responsive to increases in the sample size n than to increases in the dimension p. This may be understood from the fact that the penalty term (np)^{−1} ∑_{|l_1−l_2|≤k} σ_{l_1 l_1} σ_{l_2 l_2} in the objective function decreases as n increases; although there is a division by p in the penalty term, it is absorbed as part of the averaging. As a result, the underlying bandwidths were not sensitive to p for a given covariance design. Under both the standardized normal and t_5 innovations, it was found that k_B = k̃_B and k_T = k̃_T for all the (p, n) combinations under the covariance Designs (A)-(C). This was not necessarily the case for more skewed data, for instance the standardized Gamma(0.1, 1) innovation (Figure 3).

Table 4 reports the averages and standard deviations of the threshold levels selected by the proposed iterative approach and Bickel and Levina's (2008b) CV method. It shows that the threshold level selected in the first iteration was already better than that of the CV method, having smaller bias and being less variable. The second iteration improved on the first significantly, and the improvement continued as the iteration went on; convergence was largely established within five iterations.
In addition to evaluating the performance of the bandwidth estimation, we also computed the estimation loss for Σ with the estimated bandwidths, along with Bickel and Levina's (BL) approach and Cai and Yuan's (2012) (CY) adaptive blocking estimation. Let Σ̂_{k̂_B} and Σ̂_{k̂_T} be the banding and tapering estimators with the proposed bandwidth selection, respectively; and let Σ̂_{k̂_{BL}} and Σ̂_{CY} be the banding estimator with BL's bandwidth selection and Cai and Yuan's adaptive blocking estimator, respectively. For each covariance estimator, say Σ̂, we recorded the spectral loss ||Σ̂ − Σ||_{(2,2)} and the Frobenius loss ||Σ̂ − Σ||_F. Figure 1 displays boxplots of the estimation losses under Design (A) with θ = 0.7^{−1} and Design (B) with ξ = 0.5 and β = 1.5, for Gaussian innovations.
We observe from Figure 1 that the estimation losses of Σ̂_{k̂_{BL}} suffered from large variance under both the spectral and Frobenius norms, likely caused by the large variation of BL's bandwidth estimator shown in Tables 1 and 2. The estimation errors of Σ̂_{CY} were quite large in terms of the Frobenius norm; while its relative performance improved under the spectral norm, the errors were still larger than those of the banding and tapering estimators with the proposed bandwidth selection under the covariance Designs (A) and (B). Overall, we observe a significant advantage of the covariance estimation with the proposed bandwidth selection: the losses of the banding and tapering estimators with the proposed bandwidths were substantially less than those of Σ̂_{CY} and Σ̂_{k̂_{BL}}. Although Σ̂_{k̂_{BL}}'s median loss was less than that of Σ̂_{CY} in most cases, it was much more variable. In contrast, the banding and tapering estimators with the proposed bandwidths had the smallest medians and variation. We also observe that the estimation loss of the tapering estimator was smaller than that of the banding estimator under Design (A). This is because the h(k) function decays gradually as the bandwidth k increases, so the tapering estimator fits these covariance structures better than the banding estimator. Under Design (B), however, the advantage of the tapering estimator over the banding estimator was much reduced.
We also compared the proposed method (4.3) with the fixed and the change-point methods of Qiu and Chen (2012) designed for banded covariances. We considered k̂_{0.5,0.06} for the fixed estimator, and the change-point estimator was applied to bandwidths whose p-values for the banded test were larger than 10^{-10}. Figure 2 reports the bias and standard deviation of these three methods for covariance Design (A) with the Gaussian distributed innovation. This covariance design prescribes exponentially decaying off-diagonals with the speed of decay controlled by θ; the covariance under this regime is not banded but bandable. We observe that the proposed estimators were much more accurate than those of Qiu and Chen (2012), with much smaller bias and standard deviation. For the covariance design with θ = 0.7^{-1}, both the fixed and the change-point estimators overestimated the underlying bandwidth. For the covariance design with θ = 0.9^{-1}, we found dramatic underestimation and overestimation by the fixed and the change-point estimators, respectively. The inferior performance of Qiu and Chen's methods confirms that they are not suitable for "bandable" covariances.
The relative performance of the proposed bandwidth selection for the tapering estimator to that of the SURE of Yi and Zou (2013) is displayed in Figure 3. The figure plots the differences in the absolute bias and the standard deviation between the SURE and the proposed bandwidth selection under covariance Design (A) with θ = 0.7^{-1}. The comparison was made under the Gaussian innovation (κ = 0) and the standardized Gamma innovations with κ = 6, 12, 20, and 60. We recall that κ measures the excess kurtosis over that of the Gaussian. We observed that the performance of SURE and the proposed method were largely comparable for smaller κ and larger n (n = 60). As κ got larger, so that the data deviated more from the Gaussian, the performance of SURE was adversely affected. The standard deviation and the bias of the proposed bandwidth estimates were largely stable with respect to the changing κ. It is noted that SURE is proposed under Gaussianity, whereas the proposed bandwidth estimation is largely nonparametric. This explains why the proposed method outperformed SURE for the Gamma distributed innovations.

EMPIRICAL STUDY
In this section, we report an empirical study on a sonar spectrum dataset by conducting the banding and tapering covariance estimation with the proposed bandwidth selection methods. Gorman and Sejnowski (1988a, 1988b) and Yi and Zou (2013) analyzed the same data, which are publicly available at the University of California Irvine Machine Learning Repository. The dataset collects the so-called sonar returns, which are the amplitudes of signals bounced off an object, essentially the return signal strength over time. The sonar returns were collected by bouncing signals off a metal cylinder and a cylindrically shaped rock, respectively, positioned on a sandy ocean floor. The dataset contains 208 returns, 111 of them from the metal cylinder and 97 from the rock. A data preprocessing step based on the Fourier transform was applied to obtain the spectral envelope of each sonar return, and each spectral envelope comprised 60 numerical readings in the range 0.0 to 1.0, with each reading representing the energy within a particular frequency band. Hence, the data dimension is p = 60, and there were two samples of sizes 111 and 97, respectively. Gorman and Sejnowski (1988) analyzed the dataset by neural networks, aiming to classify the sonar targets into two groups. Yi and Zou (2013) found a quite obvious decay among the entries of the sample covariance along the off-diagonals. They estimated the covariance matrices for the metal and the rock groups by their SURE-tuned tapering estimation method. Their analysis suggested an effective bandwidth of 34 for the tapering estimator in the rock group.
We consider estimating the covariance matrices by the banding and tapering estimators with the proposed bandwidth selection. The estimated h(k) for the rock and metal groups are displayed in the upper panel of Figure 4, from which we see that h(k) decays rapidly as the bandwidth k increases, indicating a potentially bandable structure of the covariance. The estimated Frobenius losses M̂_n(k) and N̂_n(k) for both groups are displayed in the two lower panels of Figure 4 for the banding and tapering estimators, respectively. These graphs show that the bandwidths minimizing the Frobenius losses of the banding estimation were 26 and 37 for the rock and metal groups, respectively. These were quite different from the estimates of 35 and 44 for the two groups prescribed by the CV method of Bickel and Levina. For the tapering estimation, the proposed approach selected bandwidths of 17 and 25 for the two groups, and hence the effective bandwidths were 34 and 50, respectively. This respected the ordering that k̂_B lies between k̂_T and 2k̂_T. Although the SURE method produced similar bandwidth estimates of 16 and 25 for the two groups, the CV method for the tapering estimation gave bandwidths 28 and 26, respectively. These again were sharply different from the bandwidth estimates of the proposed method.
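The bandwidth selection applied above minimizes an empirical estimate of the Frobenius risk over k. The following is a simplified sketch of that idea, not the paper's exact estimator M̂_n(k): it plugs in a Gaussian-style moment approximation Var(s_ij) ≈ (s_ii s_jj + s_ij²)/n and the crude debiased estimate s_ij² − Var̂(s_ij) of σ_ij², then minimizes the resulting risk proxy; all names here are ours.

```python
import numpy as np

def select_bandwidth(X):
    """Choose the banding bandwidth k minimizing a plug-in estimate of
    the Frobenius risk  sum_{|i-j|<=k} Var(s_ij) + sum_{|i-j|>k} sigma_ij^2,
    i.e., variance paid on kept entries plus squared bias on truncated ones."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    d = np.diag(S)
    var_s = (np.outer(d, d) + S**2) / n      # approximate Var(s_ij) (Gaussian-style)
    sig2_hat = S**2 - var_s                  # crude debiased estimate of sigma_ij^2
    lag = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    risks = np.array([var_s[lag <= k].sum() + sig2_hat[lag > k].sum()
                      for k in range(p)])
    return int(np.argmin(risks)), risks
```

Under a fast off-diagonal decay, the bias term shrinks quickly in k while the variance term keeps growing, so the risk proxy is minimized at a small bandwidth.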

DISCUSSION
Cai and Yuan (2012) proposed an adaptive covariance estimator through a block thresholding approach for normally distributed data with the covariance matrix class U_1(α, C). They showed that such an adaptive estimator can achieve the minimax convergence rate under the spectral norm. The approach of Cai and Yuan (2012) is "data-driven" up to an initial block size k_0 and a thresholding parameter λ, which were set to log p and 6, respectively. The initial block size k_0 functions similarly to the bandwidth in the banding and tapering estimation. While fixing the initial block size k_0 attains simplicity, it may be less responsive to different underlying covariance structures.
The block thresholding estimator can attain the minimax rate of convergence, and so can the tapering estimator of Cai, Zhang, and Zhou (2010). It is important and reassuring to have minimax properties. However, the minimax rate tends to be less sensitive to the underlying covariance structure when the matrix class under consideration is large, for instance the U_1(α, C) class. As shown in Section 3, the rates of the underlying bandwidth k_B, which minimizes the Frobenius risk of the banding estimation, are quite responsive to the different forms of sparsity of Σ. Specifically, the exponential and polynomial decays lead to different rates for k_B. This responsiveness can produce smaller estimation error. Our simulation study showed that the banding and the tapering estimators with the proposed bandwidths outperformed the block thresholding estimator consistently under the Frobenius norm for all three covariance designs used in the simulation, which was also the case under the spectral norm for covariance Designs (A) and (B). For the third covariance design (Design (C)), the performance of the CY estimator was comparable to those of the banding and tapering estimators.
It can be shown that the banding estimation can also attain the minimax convergence rate under the Frobenius norm at k_B, the underlying bandwidth that minimizes the Frobenius risk. Under the matrix class considered in Theorem 1, the difference between Obj_B(k̂_B) and Obj_B(k_B) is negligible compared with Obj_B(k_B), as revealed by Corollary 1 in the online supplementary material. This leads to the belief that the banding estimation with the estimated bandwidth k̂_B should also attain the minimax rate under the Frobenius norm for the matrix class G(ν, q_{0p}). Confirming this theoretically would be an interesting future research topic, given the limited space available for this article. Yi and Zou (2013) considered the bandwidth selection for the tapering estimator for Gaussian distributed data. The proposed method is nonparametric and hence more widely applicable, which may explain its better performance in the case of the Gamma distributed innovations.

Rate of k_B Under the Exponential Decay Subclass
Suppose h(q) = C(q)θ^{-q} for θ > 1 and {C(q)}_{q=0}^{p-1} ⊂ [C_1, C_2]. Consider the two equations a/n = C_1 θ^{-k} and b/n = C_2 θ^{-k}, which represent the intersections of two horizontal lines at heights a/n and b/n with the lower and upper bound functions of h(k), respectively. The solutions in k are, respectively, s_{a,n} = (log n − log a + log C_1)/log θ and s_{b,n} = (log n − log b + log C_2)/log θ.
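The stated solutions follow by taking logarithms; for the first equation,

```latex
\frac{a}{n} = C_1\,\theta^{-k}
\;\Longrightarrow\;
\log a - \log n = \log C_1 - k\log\theta
\;\Longrightarrow\;
k = s_{a,n} = \frac{\log n - \log a + \log C_1}{\log\theta},
```

and the same manipulation of b/n = C_2 θ^{-k} gives s_{b,n} = (log n − log b + log C_2)/log θ. Both grow logarithmically in n, matching the log n rate for k_B under exponential decay.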
To prove Theorem 1, we first calculate the variances of (p − q)ĥ(q) and ĝ(q). To this end, we introduce some notation. For q = 0, …, p − 1, define … = X_{il}X_{kl}X²_{j,l+q} + X_{i,l+q}X_{k,l+q}X²_{jl}.

Proof of Theorem 1
Let S_0 = [k_{1,n}, k_{2,n}]. For any δ > 0 and every n, define S_1 = {k : |k − k_B| ≥ δk_B} ∩ S_0. Then, if k̂_B ∈ S_1, we have sup_{k∈S_1} {M̂_n(k_B) − M̂_n(k)} ≥ 0. For the term on the right side of the inequality, by (3.6) we have inf_{k∈S_1} {M_n(k) − M_n(k_B)} ≥ C k_B n^{-1}. Therefore, by Chebyshev's inequality, the probability on the right side of (A.10) can be bounded by a constant times a sum whose last term, since k_{b,n} = o(n), is of smaller order than np^{-1}. Noting that n = O(p) by Assumption 1, we have P(|k̂_B − k_B| ≥ δk_B) = o(n/p) → 0 for any δ > 0, which leads to the conclusion that k̂_B/k_B → 1 in probability as n → ∞.

SUPPLEMENTARY MATERIALS
The online supplementary materials contain additional proofs. [Received October 2013. Revised June 2014.]