Robust Covariance Matrix Estimation for High-Dimensional Compositional Data with Application to Sales Data Analysis

Abstract Compositional data arises in a wide variety of research areas when some form of standardization and composition is necessary. Estimating covariance matrices is of fundamental importance for high-dimensional compositional data analysis. However, existing methods require the restrictive Gaussian or sub-Gaussian assumption, which may not hold in practice. We propose a robust composition adjusted thresholding covariance procedure based on Huber-type M-estimation to estimate the sparse covariance structure of high-dimensional compositional data. We introduce a cross-validation procedure to choose the tuning parameters of the proposed method. Theoretically, by assuming a bounded fourth moment condition, we obtain the rates of convergence and signal recovery property for the proposed method and provide the theoretical guarantees for the cross-validation procedure under the high-dimensional setting. Numerically, we demonstrate the effectiveness of the proposed method in simulation studies and also a real application to sales data analysis.


Introduction
Compositional data analysis gives a powerful multivariate statistical approach for the analysis of high-dimensional data, especially when some form of standardization and composition is necessary.In this compositional framework, raw data are transformed into compositional data after calculating their relative proportions or percentages, and the sum of compositional data is constrained to the be some constant.This data structure is widely used in many research fields such as biological sciences (Tsilimigras and Fodor 2016;Gloor et al. 2017), business and economics (Fry, Fry, and McLaren 2000;Arata and Onozaki 2017), and geological sciences (Thomas and Aitchison 2005).As the rate of data collection increases exponentially, the need to analyze data under a compositional framework increases as well.The comparison of raw data across experimental units can be misleading when raw totals may vary greatly from experimental unit to experimental unit or be accidentally large.To facilitate the comparison between experimental units, researchers usually normalize the data matrix into a matrix of proportions or percentages.For example, a sales data analysis can provide helpful insights to design data-driven marketing strategies, and the comparison of sales data of various products/categories can be biased by different scales of sales across channels or store locations.In practice, compositions are commonly used in sales data to effectively compare the sales of products through different sales channels or at different store locations.
The constant-sum structure in compositional data leads to the so-called negative bias problem (also known as the constantsum problem) in the covariance or correlation matrix (Aitchison 1982).Most of existing multivariate statistical methods do not take into account the special nature of compositional spaces (i.e., the sum is constrained to be a constant), and they can lead to inappropriate inferences and spurious findings.This problem has been demonstrated in the statistical literature over the past four decades (Aitchison 1982), and we will also illustrate this issue in the empirical analysis of sales data in Section 5.As a result, alternative multivariate statistical methods have been developed to incorporate the special constant-sum structure of compositional data, including compositional regression analysis, logcontrast principal component analysis, and many others (Aitchison 1982(Aitchison , 1983;;Aitchison and Bacon-Shone 1984).In the last decade, new multivariate statistical methods such as highdimensional regression analysis (Lin et al. 2014;Srinivasan, Xue, and Zhan 2021) and large covariance matrix estimation (Cao, Lin, and Li 2019;Li et al. in press) have been proposed for the analysis of compositional data in the high-dimensional setting, where the number of samples can be smaller or much smaller than the number of variables.
Estimating covariance matrices is of fundamental importance for multivariate statistical methods since covariance matrices characterize the important statistical relationships between variables.The accurate estimation of the covariance matrix is essential for a variety of downstream statistical methods such as clustering, discriminant analysis, and principal components analysis, especially for high-dimensional data.Recently, Cao, Lin, and Li (2019) proposed the composition adjusted thresholding (COAT) procedure to estimate the sparse covariance matrix of compositional data in the highdimensional setting.However, existing methods including Lin et al. (2014), Cao, Lin, andLi (2019), Srinivasan, Xue, and Zhan (2021), and Li et al. (in press) require the restrictive Gaussian or sub-Gaussian assumption on the data-generating distribution for compositional data.In real-world applications, we often need to deal with nonnormal and heavy-tailed data.For example, the issue of nonnormal and heavy-tailed data is demonstrated in the application to sales data analysis in Section 5.The violation of Gaussian or sub-Gaussian assumption can greatly diminish the effectiveness of existing methods (Huber 1964(Huber , 1981;;Fan, Li, and Wang 2017;Avella-Medina et al. 2018).
In this article, we study the robust estimation of the large covariance matrix for high-dimensional compositional data to handle potential deviations from the normality.Synthesizing robust estimation and compositional analysis provides an promising approach without requiring the restrictive Gaussian or sub-Gaussian assumption.We propose a robust composition adjusted thresholding based on Huber-type M-estimation, denoted by M-COAT, to estimate the sparse covariance matrix for high-dimensional compositional data.M-COAT is a two-step regularized estimation procedure.In the first step, we combine both strengths of the centered log-ratio (clr) transformation and Huber-type M-estimation to derive a pilot estimator of the covariance matrix for compositional data.In the second step, given this robust pilot estimate, we employ a positive-definite thresholding scheme to induce sparsity and ensure the positive-definiteness of our thresholding covariance matrix estimator, which is vital for many downstream analysis.The numerical properties of the proposed M-COAT method are demonstrated in simulation studies and a real application to the store-scanner Dominick's Finer Foods dataset.
Theoretically, we study the asymptotic properties for the proposed M-COAT method under the high-dimensional setting where the dimension is nearly on the exponential order of the sample size.We assume a bounded fourth moment condition, allowing for a richer class of distributions beyond the Gaussian or sub-Gaussian assumption.The proposed estimator achieves the same convergence rates under both the spectral norm and Frobenius norm as the COAT does under the sub-Gaussian assumption.Also, we prove the correct sparsity and sign consistency properties for the proposed M-COAT method.
The proposed M-COAT method includes two tuning parameters: H in the piece-wise Huber loss function ψ H (z) = min{max(−H, z) H} and λ in the thresholding estimation.We introduce a cross-validation procedure using the squared Frobenius loss to choose both H and λ.The proposed crossvalidation procedure enjoys the theoretical guarantees under the high-dimensional setting.We prove that the tuning parameters selected by this cross-validation procedure perform comparably to the oracle tuning parameters that are based on the true covariance matrix under an elliptical distribution assumption.To the best of our knowledge, this result makes a separate contribution to the literature, as it fills the gap of choosing the tuning parameters with theoretical guarantees for robust large covariance matrix estimation via Huber-type M-estimation (Fan, Li, and Wang 2017;Avella-Medina et al. 2018).
The rest of this article is organized as follows.In Section 2 we first introduce the compositional framework and then present the methodological details for the proposed M-COAT method and its cross-validation procedure.In Section 3 we outline the theoretical results including convergence rates, correct sparsity, sign consistency, and a guarantee for cross-validation under the compositional and nonnormal framework in high dimensions.Sections 4 and 5 evaluate the numerical performance of the proposed M-COAT through simulation studies and an application to sales data analysis.Finally we conclude with a summary of our results in Section 6.The proofs and additional numerical results are presented in a supplementary file.

Methodology
We first describe the compositional framework (Aitchison 1982;Aitchison and Bacon-Shone 1984) in Section 2.1.After these preliminaries, we present the proposed two-step M-COAT method in Section 2.2 and a new cross-validation procedure to choose the tuning parameters of the M-COAT in Section 2.3.

The Compositional Framework
Following the compositional framework (Aitchison 1982;Aitchison and Bacon-Shone 1984;Cao, Lin, and Li 2019), we will use latent variables and the centered log-ratio (clr) transformation to account for the special constant-sum structure of compositional data.
Let W = (W 1 , . . ., W p ) with W k > 0, k = 1, . . ., p, be the vector of latent variables known as the basis.We consider the following normalization of W to obtain the vector of compositional variables X = (X 1 , . . ., X p ), where (1) for j = 1, . . ., p.With this normalization procedure, X lies on the simplex A p in R p , which is known as the Aitchison Simplex (Aitchison 1982) and is defined as follows: One critical issue arising from this simplex is that the compositional vector X is non-identifiable.To combat this issue, we define Y = log(W) = (log(W 1 ), . . ., log(W p )) T as the log-basis and assume that Y follows a distribution F Y with mean vector μ and covariance matrix = (σ ij ) p×p .The basis covariance matrix is the parameter of interest that characterizes the statistical relationship in the compositional data.
We will study the robust estimation of the sparse basis covariance matrix with the iid high-dimensional compositional data X 1 , . . ., X n , where p can be much larger than n.However, it is infeasible to use W or X for estimating the basis covariance matrix (Cao, Lin, and Li 2019).The clr transformation provides a promising approach by transforming the compositional data away from A p , which is defined as for j = 1, . . ., p, where g(X) = ( p j=1 X j ) 1/p is the geometric mean of X.The magnitude of each Z j is not restricted to (0, 1], but Z is still rank-deficient due to the simple fact that p j=1 Z j = 0.The covariance matrix of Z, denoted by , is also known as the centered log-ratio covariance matrix.By definition, we know that Z = AY with A = I p×p − (1/p)1 p 1 T p .Thus, we have that = A A T .As shown in Proposition 1 of Cao, Lin, and Li (2019), two covariance parameters and are asymptotically indistinguishable if they belong to a class of sparse covariance matrices.More specifically, the sparse basis covariance parameter is approximately identifiable by estimating the sparse covariance parameter .The sparse estimation of with the clr transformed data Z 1 , . . ., Z n serves as an effective proxy for the sparse estimation of .

The Proposed Method
The proposed M-COAT method consists of two steps.In the first step, we use the Huber-type M-estimation to construct a pilot estimator denoted by ˆ H based on the clr transformed data Z 1 , . . ., Z n .In the second step, we use a positive-definite thresholding scheme to obtain a positive-definite sparse covariance matrix estimator denoted by T λ ( ˆ H ). The resulting sparse covariance estimator T λ ( ˆ H ) can approximately estimate the desired basis covariance parameter .
The pilot estimator ˆ H plays an important role in accounting for both robustness and the compositional constraint before progressing into the thresholding estimation.As will be shown through numerical analysis, an accurate pilot estimator is necessary to ensure that large errors from the pilot step do not unduly influence thresholding.Cao, Lin, and Li (2019) used the sample covariance matrix of the clr transformed data Z 1 , . . ., Z n to construct the pilot estimator in the COAT method.If the logbasis Y satisfies the Gaussian or sub-Gaussian assumption, the sample covariance matrix of Z 1 , . . ., Z n performs very well to construct the pilot estimator in the high-dimensional regime.However, the nonnormal, heavy-tailed, and leptokurtic distribution of the log-basis Y will negatively impact the accuracy of the pilot estimator for the COAT, which we will demonstrate in the simulation studies in Section 4.
In the presence of nonnormal and heavy-tailed data, we need to construct a robust pilot estimator from Z 1 , . . ., Z n .In the literature, Huber's M-estimation (Huber 1981) provides a broad class of robust estimators in many statistical problems.Recently, Avella-Medina et al. ( 2018) explored the Huber-type M-estimation of the large covariance matrix.We extend the Huber-type M-estimation to construct a robust pilot estimator for high-dimensional compositional data in the sequel.
Recall that is the centered log-ratio covariance matrix of Z = clr(X) = (Z 1 , . . ., Z p ) .Since sparse and sparse are asymptotically indistinguishable, we use the Huber-type Mestimation to estimate each entry of = (γ uv ) p×p , where for any u, v = 1, . . ., p.In view of (4), it is sufficient to provide the robust estimation for μ z uv and μ z u with Z i = (Z i1 , . . ., Z ip ) (i = 1, . . ., n), for any u, v = 1, . . ., p, respectively.Given the Huber function ψ H (z) = min{max(−H, z), H} at level H, we construct the Huber-type M-estimators of μ z uv and μ z u by solving the solutions μz u and μz uv to the following estimating equation, respectively: and Here, H is a tuning parameter that needs to be chosen properly.Therefore, the robust pilot estimator can be constructed as follows: The robust pilot estimator ˆ H enjoys theoretical guarantees which are explored in Section 3. In addition, ˆ H outperforms other rank-based methods (Xue andZou 2012, 2014) through the simulation studies in Section 4.
Next, given the robust pilot estimator ˆ H in the first step, we use a positive-definite thresholding procedure to construct a sparse estimator ˆ for the basis covariance parameter in the second step.Due to the rank-deficiency of Z, the rank of ˆ H is at most p − 1.Nonetheless, both the covariance parameter of interest and the proxy covariance parameter are positive definite.The positive-definiteness of ˆ is essential to avoid the potential identifiability issue or degeneracy in downstream analysis.To this end, we use the positive-definite 1 -penalization (Xue, Ma, and Zou 2012) to compute the sparse estimator ˆ as where ε > 0 is an arbitrarily small constant, λ is a tuning parameter, and | • | 1,off denotes the 1 -norm of the off-diagonal entries.The constraint that ε > 0 ensures that the resulting covariance estimator is positive-definite.Note that ε can be set as small as possible, and it is not a tuning parameter.Through this constraint, we recover the optimal positive-definite estimate of , even though ˆ H itself is rank-deficient.We also note that ( 8) is a convex but nonsmooth function that can be efficiently solved by an alternating direction method of multipliers (ADMM) as outlined in Algorithm 1 of Xue and Zou (2012).We summarize the details about the proposed M-COAT method in Algorithm 1.
Algorithm 1: The proposed M-COAT method.
Input. the compositional data matrix X and tuning parameters H & λ.
Step 1. use Huber-type M-estimation to construct a robust pilot estimator where μz uv and μz u are solved from estimating equations Step 2. use an alternating direction method of multipliers (ADMM) to solve the positive-definite thresholding estimator ˆ from min Output: a positive-definite sparse basis covariance estimator ˆ .

A Cross-Validation Procedure
The effectiveness of our proposed M-COAT method lies in the proper selection of H in the Huber loss function and λ in the positive-definite thresholding.The tuning parameter λ is usually chosen by cross-validation that enjoys theoretical guarantees (Bickel and Levina 2008;Cai and Liu 2011;Cao, Lin, and Li 2019).The tuning parameter H affects the accuracy of the pilot estimator ˆ H and thus the sparse basis covariance estimator ˆ .The proper choice of H involves a tradeoff between bias and robustness.Unlike the classical robust statistical methods (Huber 1981), in order to alleviate the bias, we need to allow H to grow as the sample size increases in the high-dimensional setting (Avella-Medina et al. 2018).However, Avella-Medina et al. ( 2018) do not provide a procedure with theoretical guarantees to choose this critical tuning parameter for Huber-type M-estimation.Specifically, Avella-Medina et al. ( 2018) used a conservative choice by assuming the true distribution to be a Student t-distribution with five degrees of freedom.In what follows, we fill this gap with a cross-validation procedure with theoretical guarantees.Thus, this is a separate contribution to the literature of Huber-type Mestimation with high-dimensional data.
We propose a cross-validation procedure using the squared Frobenius norm to select both H and λ for the proposed M-COAT method.We randomly partition the sample into two subsamples of size n 1 and n 2 = n − n 1 and repeat this random partition K times, where n 1 n 2 n and K is a fixed integer.Let ˆ k 1,H be the robust pilot estimator defined as (7) using the training dataset with n 1 observations in the kth split, and T λ ( ˆ k 1,H ) is the corresponding M-COAT estimator.Let ˆ k 2 be the sample covariance matrix computed from the hold-out dataset with n 2 observations in the kth split.
For any given tuning parameters H and λ, we define the squared Frobenius loss to assess the discrepancy between To find a data-driven choice of (λ, H), we optimize (9) by searching over a set of candidate parameters (λ 1 , . . ., λ J 1 ) × (H 1 , . . ., H J 2 ), where J 1 denotes the size of the grid for λ and J 2 denotes the size of the grid for H.More specifically, we choose the pair ( λ, Ĥ) that minimizes (9) as follows: ( λ, Ĥ) = argmin λ∈(λ 1 ,...,λ J 1 ),H∈(H 1 ,...,H J 2 ) R(λ, H). (10) The final estimator of the sparse basis covariance parameter is given by ˆ = T λ( ˆ Ĥ ).In Section 3, we will prove that the data-driven choice ( λ, Ĥ) as in (10) performs comparably to the oracle tuning parameters that are based on the true covariance matrix under an elliptical distribution assumption.
In practice, we may define the grid points for λ to be evenly spaced between a small value near 0 and a large value such that the estimated covariance matrix would be close to a diagonal matrix.As for the choice of H, we define the grid according to the theoretical order (see Theorem 4 in the next section).Although we need to choose two parameters λ and H, it is very efficient to solve the penalized covariance estimation for each given (λ, H) pair, so the overall computing time is still feasible for large-scale real applications.We may also use a more efficient C/C++ implementation to reduce the computing time of the proposed method, and parallel computing can further reduce the computing time.

Theoretical Properties
In this section, we study the theoretical properties of our proposed M-COAT method and cross-validation procedure.We first derive the desired convergence rate and signal recovery properties in Section 3.1.Further, we provide a new theoretical guarantee for using the cross-validation to simultaneously choose H and λ in Section 3.2.
We consider the high-dimensional setting that log p/n → 0, where the dimension p is on a nearly exponential order of the sample size n.Also, we consider the following class of covariance matrices defined by the q norm with q ∈ [0, 1): U(0, s 0 (p), M) denotes the class of sparse covariance matrices and U(q, s 0 (p), M) with q ∈ (0, 1) denotes the class of approximately sparse covariance matrices.U(q, s 0 (p), M) was first introduced by Bickel and Levina (2008) and then widely explored in the large covariance matrix estimation (Rothman, Levina, and Zhu 2009;Cai and Liu 2011;Xue, Ma, and Zou 2012;Cao, Lin, and Li 2019).

Convergence Rates and Signal Recovery
We study the convergence rates and signal recovery of the M-COAT without requiring the Gaussian or sub-Gaussian assumption.Instead, we assume that the first four moments of Z are finite, that is, max 1≤u≤p E[Z u ] 4 = κ 2 < ∞.Our theoretical analysis allows for a richer class of heavy-tailed distributions such as the Laplace distribution and t-distributions with five or more degrees of freedom.
Theorem 1 and Corollary 1 present the rate of convergence of our proposed M-COAT estimator under the spectral norm and under the Frobenius norm, respectively.
Theorem 1. Suppose that log p/n → 0 and max 1≤u≤p E[Z u ] 4 = κ 2 < ∞.Then, uniformly on U(q, s 0 (p), M) with q ∈ [0, 1), if Then, uniformly on U(q, s 0 (p), M) with q ∈ [0, 1), if The above rates of convergence can be decomposed into two parts, the first matching the optimal rate proposed by Cai and Liu (2011) and the additional s 0 (p) p denoting the necessary additional error from using Z instead of the unattainable Y.An interesting feature of this rate is that in high dimensions the identification error s 0 (p) p vanishes as the dimensionality increases.Instead, for large p, the above rate is dominated by log p n which matches the rates on unconstrained data studied by Cai and Liu (2011) and Bickel and Levina (2008).
As shown in Theorem 1 and Corollary 1, the proposed M-COAT achieves the same convergence rates under the bounded fourth moment condition as the COAT (Cao, Lin, and Li 2019) does under the sub-Gaussian assumption.
Next, we present the signal recovery and sign consistency of the M-COAT in Theorems 2 and 3, respectively.We show that the proposed M-COAT estimator attains desirable guarantees for the recovery of the support of the true signal and also the sign of the true signal with high probability.

Cross-validation Guarantees
To derive the theoretical guarantees for the cross-validation procedure, we need an additional assumption on the distribution by assuming that Y i is drawn from the broad class of elliptical distributions.Specifically, we assume that each Y i ∈ R p follows an elliptical distribution such that Y i ∼ E p (μ, , φ), where E p (μ, , φ) is a p-dimensional elliptical distribution with the location parameter μ ∈ R p , positive-definite shape matrix , and density generator φ (Cambanis, Huang, and Simons 1981;Tyler 1987).By Cambanis, Huang, and Simons (1981), Y i is equivalently represented as where u i be a scalar random variable, and ξ i ∼ N (0, I p 0 ×p 0 ).This is not an overly restrictive assumption as commonly used heavy-tailed distributions such as the t-distribution and Laplace distribution are special examples of this class of elliptical distributions (Goes, Lerman, and Nadler 2020).
To demonstrate the effectiveness of our proposed crossvalidation procedure, we define the oracle tuning parameters λo and Ĥo as the optimal choices that has the oracle information about as follows: Theorem 4 shows that the error of the M-COAT estimator computed using the selected tuning parameters λ and Ĥ is close to the optimal error with the oracle tuning parameters λo and Ĥo .Following Bickel and Levina (2008) and Cai and Liu (2011), we just need to study the K = 1 setting in Theorem 4, and this result holds for the general K.
Theorem 4. Let J = J 1 × J 2 .Suppose that Y 1 , . . ., Y n follows the iid elliptical distribution E p (μ, , φ) with ∈ U(q, s 0 (p), M), Eu 4 i < ∞, log p n → 0, and p( and Note that Theorem 4 includes both settings: fixed J and diverging J.When J 1 and J 2 are fixed, in view of p → ∞, the condition that (log J) 3 = o(n(log p/n) 1−q/2 ) naturally holds.When J 1 or J 2 diverges, the upper bound on J is required such that the size of tuning set is proper with respect to the dimension and sample size.
We would like to point out that Avella-Medina et al. ( 2018) did not provide any theoretical guarantee to choose the tuning parameter for Huber-type M-estimation.In the current literature, Bickel and Levina (2008) and Cai and Liu (2011) provided the theoretical guarantees for their cross-validation procedure to choose the thresholding parameter under the Gaussian assumption.More specifically, Bickel and Levina ( 2008)  assumed a diverging size of the grid under the Gaussian assumption, and Cai and Liu (2011) assumed a fixed size of the grid under the Gaussian assumption.See Theorem 4 of Bickel and Levina (2008) and Theorem 5 of Cai and Liu (2011).To the best of our knowledge, Theorem 4 fills a gap to choose the tuning parameters with theoretical guarantees for the robust estimation of large covariance matrices, which makes a separate contribution to the literature.

Simulation Studies
We compare the numerical performance of the proposed M-COAT against the COAT and also its rank-based alternatives in simulation studies with different covariance structures and sparsity patterns.The rank-based alternatives use the adjusted Spearman's or Kendall's rank correlation as the pilot estimator (Xue andZou 2012, 2014;Avella-Medina et al. 2018), and we denote them by S-COAT and K-COAT, respectively.The tuning parameters for K-COAT, S-COAT and COAT were selected in a similar manner as M-COAT; however, we replace the Huberbased pilot estimator ˆ k 1,H used in (9) with the adjusted Kendall rank-based, the adjusted Spearman rank-based, and the standard covariance estimators accordingly.
We consider the following data generating process to compare the numerical performance.We generate Y through a Gaussian scale mixture framework.Gaussian scale mixtures fall into the class of elliptical distributions and is useful for efficiently generating elliptically distributed data (Andrews and Mallows 1974; Müller and Richter 2019; Goes, Lerman, and Nadler 2020; Li et al. in press).Thus, Y i = μ + u i 1/2 ξ i where the heavy- tailed behavior is captured by the choice of u i .We note that ξ i,k ∼ N (0, I p×p ) and set μ = 0 and denotes the true covariance matrix of Y.When u i = 1 for all i = 1, . . ., n, we have that Y follows a Gaussian distribution.We explore two heavy-tailed settings where u i ∼ Laplace(0, 1) and u i ∼ t 5 for all i = 1, . . ., n.
We consider two different covariance structures for : the autoregressive (AR) covariance structure with = (0.7 |i−j| ) p×p , and a sparse covariance structure with a blockdiagonal matrix = bdiag(A 1 , A 2 ), where p = p 1 + p 2 , A 1 = B + εI p 1 and A 2 = 4I p 2 .Here, B is a symmetric matrix whose lower triangle entries are independent and drawn uniformly from [−3, −1.5] ∪ [1.5, 3] with probability 0.15 and are zero otherwise.Let ε = max(−λ min , 0) + .01where λ min is the minimum eigenvalue of B to ensure positive-definiteness.Both covariance structures have served as the benchmarks in the literature such as Bickel and Levina (2008), Cai and Liu  (2011), Xue, Ma, and Zou (2012), and Cao, Lin, and Li (2019 1 and 2, respectively.The results for the Gaussian error setting across both covariance models is found in the supplementary materials.
As shown in both tables, regardless of the covariance structure, the M-COAT attains the smallest estimation errors and the COAT has the largest estimation errors under both the Laplace and t 5 distributions.The COAT uses the sample covariance matrix as the pilot estimator, which is inappropriate for nonnormal and heavy-tailed distributions.As a result, the COAT has much higher estimation errors than other methods across all settings.The rank-based estimators S-COAT and K-COAT are comparable to one another.The M-COAT provides further improvement over S-COAT and K-COAT, especially with the sparse covariance structure in Table 2.As for the selection performance, the M-COAT attains similar TPRs as S-COAT and K-COAT but achieves the significantly lower FPRs.The COAT has the slightly better TPRs but much worse FPRs than other methods across all settings.In summary, the M-COAT outperforms the COAT and also other robust estimators based on rank correlation in terms of both estimation and selection performances.

Application to Sales Data Analysis
In this section, we examine the numerical performance of the proposed M-COAT using the Dominick's Finer Foods scanner dataset, which is downloaded from Chicago Booth's Kilts Center for Marketing.This dataset contains a large amount of scanner data from 1989 to 1994 across multiple locations of the Dominick's Finer Foods store chain in the Chicago area.In this sales dataset, the total sales per store vary greatly due to store location.Besides, sales data is high-dimensional because of the large product variety, and can often be nonnormal.Thus, to effectively study the statistical associations of sales data, normalization is a required preprocessing step, and the compositional data analysis is then necessary.
This dataset consists the sales data of individual products as well as their associated departments (e.g., FISH, MEAT, GROCERY, DAIRY, and so on).Thus, researchers can perform the analysis of sales data at the department level or at the product level.In what follows, we first conduct the analysis of department-level sales data to motivate the importance of compositional analysis and robust estimation in Section 5.1, and we explore the high-dimensional compositional analysis of product-level sales data in Section 5.2.We demonstrate the effectiveness of our proposed approach via this real-world application.We also show that our method is a powerful tool for cross-selling analytics to measure the linkage between the sales of various products and provides a more accurate way of visualizing shoppers' behavior.

Department-Level Sales Analysis
Department-level sales were retrieved from the customer count files in the scanner dataset.We preprocessed the sales data by aggregating across years by store, and we formed a dataset consisting of the total sales of p = 28 departments across n = 70 different stores.
Through the exploratory analysis, we computed the kurtosis measures of the sales data at the department level.We plotted the department-level kurtosis values in Figure 1 for both the raw sales data, W, and the clr-transformed sales data, Z.As shown in Figure 1, it is evident that about 25% of the departmentlevel observations are heavy-tailed or leptokurtic (i.e., kurtosis is greater than 3) in terms of raw counts and about half of them are heavy-tailed after the clr transformation.This suggests the importance of robust estimation.
Next, we demonstrate the need for the compositional analysis.We computed the naive thresholding estimator on the raw count data, the COAT on the compositional data, and the M-COAT on the compositional data.All these covariance estimators were normalized to correlation matrices for a better visualization.In the heatmaps, the departments are ordered by hierarchical clustering using the estimated correlations.The thresholding estimator, the COAT, and the M-COAT are all visualized in Figure 2. When the compositional structure is ignored, as shown in Figure 2, the naive thresholding estimator leads to a very dense correlation structure, and it fails to identify any meaningful dependence structure between departments.The COAT improves the naive thresholding and captures the block structure after performing the hierarchical clustering using the estimated correlations.M-COAT provides the most meaningful result that reveals rich associations of sales data between departments.The COAT and M-COAT estimates share some similarities (such as the dependencies among the core departments of FISH, MEAT, GROCERY, and DAIRY) in their heatmaps.Compared with COAT, M-COAT demonstrates a better capability to capture an interconnected market by identifying the significant associations between the core departments and other departments such as the food-to-go (FTG) sections and BAKERY.

Product-Level Sales Analysis
This dataset also provides in-depth weekly sales data for individual products within each store.First, we aggregate the weekly sales data by year in order to capture larger scale economic factors that influence sales per year, and aid in our interpretation.Next, we select the top 10 products for each of the 27 departments and only keep those products that were sold across all 8 years for further investigation.Now, we have p = 173 unique products across n = 70 different stores, which falls into the high-dimensional setting.
We used the proposed M-COAT method to robustly estimate the high-dimensional covariance matrices for the yearly sales of each product and provide insights on cross-selling patterns.Figure 3 plots the heatmaps of the estimated covariance matrix for year 1989 and 1995. Figure 4 shows the heatmaps of the submatrices about five specific categories from the estimated covariance matrix for year 1989 and 1995, including cookies (red), front-end candies (purple), frozen entrees (black), frozen dinners (green), soda (yellow), and tuna (blue).Figure 5 shows the sales of paper towels (dark red) and dish detergent (dark blue) from the estimated covariance matrix for year 1989 and 1995.It is worth pointing out that an economic recession affected the United States the early 1990s (1989)(1990)(1991)(1992) and the mid-nineties (1994)(1995)(1996) was an era of strong economic growth.By comparing these heatmaps, we can have the following insights about product sales and consumer behavior during or after the economic recession.
• First, as shown in Figure 3, the estimated covariance structure in 1995 shows many more dependencies than in 1989.
It is likely that more products were sold together in the midnineties thanks to the improved economic prospects in the post-recession years.Given the significant amount of sparsity in the estimated covariance structure in 1989, products were likely purchased individually or in smaller groups due to the economic downturn and a jobless recovery during the recession years.• Second, Figure 4 shows the different co-purchasing behavior in 1989 versus 1995.Products such as tuna and frozen entrees are often considered the essential grocery products, while soda, cookies and candies are generally sweets or treats.In 1989, we see that cookies and soda products are positively correlated, suggesting they are often bought together, but negatively correlated to the frozen meals.In 1995, after the recession, the sales of sweets are more positively correlated with meals.• Third, another interesting finding discovered by the M-COAT procedure is the effect of brand loyalty.As shown in Figure 5, in both 1989 and 1995, Cascade brand product sales are positively correlated with one another, but negatively correlated with other dish detergent product sales.This suggests some form of brand loyalty where an individual who purchases Cascade products is more likely to purchase other Cascade products.Notably, brand loyalty may develop over time.In 1989, there does not appear to be a strong dependence structure among paper towel varieties.However, in 1995, we see that the Bounty brand product sales are positively correlated with each other, and a similar pattern occurs for the sales of Scott brand towels.

Conclusion
In this article we propose the M-COAT procedure to robustly estimate the covariance structure of high-dimensional compositional data.The proposed method attains improved accuracy in the presence of nonnormal and heavy-tailed distributions by employing Huber-type M-estimation.We also introduce a cross-validation procedure to choose the tuning parameters of the proposed M-COAT.Without requiring Gaussian or sub-Gaussian assumption, we obtain the rates of convergence and signal recovery property under the high-dimensional setting.We provide the theoretical guarantees for the cross-validation procedure, which makes a separate contribution to the literature of Huber-type M-estimation with high-dimensional data.The finite-sample properties of the proposed M-COAT are demonstrated in simulation studies.Finally, we apply the proposed M-COAT in a real application to the Dominick's Finer Foods scanner dataset and show how the M-COAT can give insights about product sales and consumer behavior.

Figure 1 .
Figure 1.Plots of the kurtosis for department-level gross sale count data and clr transformed data.When kurtosis is greater than 3, the distribution is leptokurtic (heavytailed).

Figure 2 .
Figure 2. Heatmaps for the thresholding estimator on raw count data (on the top left panel), the COAT procedure (on the top right panel), and the M-COAT procedure (on the bottom panel).The colors of the heatmap represent the correlation of the total sales between departments.Positive correlations are visualized by the orange color, and negative correlations are visualized by the purple color.The departments are ordered by hierarchical clustering using the estimated correlations.

Figure
Figure Heatmaps for the sparse covariance matrices of the product-level sales data in 1989 (left) and 1995 (right), estimated by the M-COAT procedure.The color of the circular dots indicates which of the 27 departments the product is from.Products from the same department are given the same color dot label on the rows and columns.

Figure 4 .
Figure 4. Heatmap for a subset categories of products labeled by brand.The plots shows the correlation matrix in the year 1989 (left) during the recession and the year 1995 (right) after the recession.Products from the same category are given the same color on their labels.

Figure 5 .
Figure 5. for products in the categories of dish detergents and paper towels labeled by brand.The plots shows the correlation matrix in the year 1989 (left) during the recession and the year 1995 (right) after the recession.Products from the same category are given the same color on their labels.

Table 1 .
Comparison of estimation performance for the AR covariance structure over 100 independent repetitions under the Laplace and t 5 distributions.

Table 2 .
Comparison of estimation and selection performance for the sparse covariance structure over 100 independent repetitions under the Laplace and t 5 distributions.