A Cluster-Based Outlier Detection Scheme for Multivariate Data

Detection power of the squared Mahalanobis distance statistic is significantly reduced when several outliers exist within a multivariate dataset of interest. To overcome this masking effect, we propose a computer-intensive cluster-based approach that combines a reweighted version of Rousseeuw's minimum covariance determinant method with a multi-step cluster-based algorithm that initially filters out potential masking points. Simulation studies show that the new method outperforms the most robust existing procedures for outlier detection. Additional comparisons on real data are given. Supplementary materials for this article are available online.


INTRODUCTION
Let a sample of n p-dimensional data values be denoted as x = (x_1, . . . , x_n). The squared Mahalanobis distance statistic for x_i is

T_i^2 = (x_i − x̄)' S^{−1} (x_i − x̄), (1)

where x̄ and S are the usual sample mean vector and covariance matrix based on x. If T_i^2 exceeds the limit L_{1−α} given by

L_{1−α} = ((n − 1)^2 / n) B(1 − α; p/2, (n − p − 1)/2), (2)

then x_i is determined to be an outlier. Equation (2) assumes the x_i to be independent with the same multivariate normal distribution. Assuming no outliers, if n tests are performed using Equation (2) and each test is of size α, where α > 0 is chosen small, the probability that at least one test incorrectly identifies an outlier is γ = 1 − (1 − α)^n. Even for moderately sized samples, γ can be unacceptably close to 1. To address this issue, for any n, γ > 0 is chosen small and α in Equation (2) becomes α = 1 − (1 − γ)^{1/n}. Here, B(1 − α; p/2, (n − p − 1)/2) is the (1 − α)-quantile of the beta distribution with parameters p/2 and (n − p − 1)/2 (Tracy, Young, and Mason 1992; Wierda 1994). Tracy, Young, and Mason (1992) also identified the T_i^2 in Equation (1) to be dependent. Hence, the approach implementing the beta quantile described above is approximate. We address this deficiency later in the article.
When only one outlier exists, the above-described method is effective at detecting its existence. The primary objection to the use of Equations (1) and (2) comes from the fact that when multiple outliers are present, significant masking occurs and the power to detect them becomes undesirably small.
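As a concrete illustration of the classical procedure of Equations (1) and (2), the test can be sketched in Python. This is a minimal sketch of the standard method, not the authors' code; the function name and the use of SciPy's beta quantile are our own.

```python
import numpy as np
from scipy import stats

def mahalanobis_outlier_test(x, gamma=0.05):
    """Classical squared Mahalanobis distance outlier test.

    x: (n, p) data matrix; gamma: target family-wise error rate.
    Uses the per-test size alpha = 1 - (1 - gamma)**(1/n) and the
    beta-quantile cutoff L = ((n-1)**2/n) * B_{1-alpha}(p/2, (n-p-1)/2).
    """
    n, p = x.shape
    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)          # usual sample covariance matrix
    Sinv = np.linalg.inv(S)
    diff = x - xbar
    T2 = np.einsum('ij,jk,ik->i', diff, Sinv, diff)   # squared distances
    alpha = 1.0 - (1.0 - gamma) ** (1.0 / n)
    limit = (n - 1) ** 2 / n * stats.beta.ppf(1 - alpha, p / 2, (n - p - 1) / 2)
    return T2, limit, T2 > limit
```

With a single gross outlier this test signals reliably; the masking problem described next appears only when several outliers inflate S simultaneously.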
A natural approach to overcome the masking effect is to substitute robust estimators of the mean vector and covariance matrix into Equation (1). Many authors have proposed robust estimators of the mean and covariance matrix for multivariate data. The minimum volume ellipsoid (MVE) method, the minimum covariance determinant (MCD) method, and estimators generated using trimming approaches are the most commonly adopted for detecting multivariate outliers. Jobe and Pokojovy (2009) developed an original multi-step, cluster-based algorithm for detecting outliers in individual multivariate observations collected sequentially over time. In this article, we propose a new multi-step, cluster-based algorithm for detecting multivariate outliers from an unordered sample x. Our new method implements the recently developed high breakdown estimators of the multivariate mean vector μ and covariance matrix Σ described by Cerioli (2010). These new high breakdown estimators, referred to as finite sample reweighted MCD (FSRMCD) estimators, are a modification of the reweighted MCD estimators of Hawkins and Olive (1999) and Rousseeuw and van Driessen (1999).
Our new algorithm can be summarized as follows. From a sample of interest x, FSRMCD estimators are used to identify a bandwidth for nonparametrically estimating the multivariate probability density with the well-known kernel method of Parzen (1962). Application of the modal EM (MEM) algorithm, outlined by Li, Ray, and Lindsay (2007), to the nonparametrically estimated multivariate probability density function identifies local maxima. MEM determines an ascending path for each x_i to a local maximum. All x_i assigned to the same local maximum are considered to be members of the same cluster. Li, Ray, and Lindsay (2007) developed and named this cluster assignment modal association clustering (MAC). The largest cluster is identified and filtered using a symmetry measure. The remaining data points from this cluster are used to construct reweighted estimators of the mean vector μ and covariance matrix Σ. With these robust estimators, squared Mahalanobis distances d_i^2, whose form is given in Equation (9), are calculated for each x_i from x. If the maximum d_i^2 exceeds a limit L_1, then at least one outlier is detected and each d_i^2 is then compared to a second limit L_2. If d_i^2 exceeds L_2, x_i is identified as an outlier. We determine the limits L_1 and L_2 by simulation for selected sample size n and dimension p such that the family-wise and per-comparison error rates are controlled at a selected level γ > 0. We refer to our new method as FSRMCD-MAC. The ability to detect outliers with the FSRMCD-MAC method is compared to that of the highly recommended IRMCD method developed by Cerioli (2010). For selected (n, p) combinations, the proportions of good points identified as outliers by the FSRMCD-MAC method, typically referred to as swamping rates, are compared to the rates produced by the IRMCD method. Outlier detection performance of both the IRMCD and FSRMCD-MAC methods is also compared on real data given by Flury and Riedwyl (1988).

Finite Sample Reweighted MCD
For multivariate data x, the MCD method attempts to find the subset of h points such that the determinant of the covariance matrix computed from that subset is minimal among the determinants of the covariance matrices computed from all other subsets of h points contained in x. The MCD robust estimators are then the corresponding mean and appropriately scaled covariance matrix calculated from the identified subset I_MCD:

μ̂_MCD = (1/h) Σ_{i ∈ I_MCD} x_i,  Σ̂_MCD = k_MCD (1/h) Σ_{i ∈ I_MCD} (x_i − μ̂_MCD)(x_i − μ̂_MCD)', (3)

where k_MCD is a proportionality constant ensuring unbiasedness and consistency of the estimators if the data x are iid normal (Croux and Haesbroeck 1999). Davies (1987), Rousseeuw and Leroy (1987), and Lopuhaä and Rousseeuw (1991) showed that h = ⌊(n + p + 1)/2⌋ should be used to obtain the largest breakdown point for the MCD estimators, where ⌊·⌋ denotes the integer part. We briefly outline the FSRMCD algorithm here. First, raw estimates μ̂_MCD and Σ̂_MCD given by Equation (3) are computed using the proportionality constant k_MCD = (h/n) / P(χ²_{p+2} < χ²_{p,h/n}) · s_MCD, with s_MCD the small-sample correction factor proposed by Pison, Van Aelst, and Willems (2002). Each point then receives the weight w_i = 1 if its raw squared distance d²_MCD,i does not exceed D_{p,0.975}, and w_i = 0 otherwise, where D_{p,0.975} stands for an approximation of the 0.975-quantile of the distribution of d²_MCD,i obtained by Hardin and Rocke (2005). The FSRMCD estimators are then the resulting reweighted estimates μ̂_FSRMCD and Σ̂_FSRMCD of Equation (4), where k_FSRMCD = (⌊0.975n⌋/n) / P(χ²_{p+2} < χ²_{p,0.975}) as proposed by Cerioli (2010) and m = Σ_{i=1}^{n} w_i. In the present article, we apply the FSRMCD estimators implemented in the mcd-routine of the FSDA toolbox (www.riani.it/matlab.htm) under Matlab. We have chosen this method over the fast-MCD algorithm of Rousseeuw and van Driessen (1999) since it is less biased for both small and large samples, owing to a more accurate small-sample bias-correction factor for the raw MCD obtained by Pison, Van Aelst, and Willems (2002) and an asymptotic adjustment by Hardin and Rocke (2005).
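The FSRMCD estimators themselves are implemented in the FSDA toolbox for Matlab. As a rough, hedged Python analogue of the raw-plus-reweighted MCD idea only (scikit-learn's MinCovDet applies a chi-square-based reweighting, not Cerioli's finite-sample corrections), the following sketch shows the mechanics:

```python
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

# Plant 10 shifted points among 100 five-dimensional observations.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))
x[:10] += 6.0

# Raw MCD plus reweighting; support_fraction=None uses the default
# h = (n + p + 1)/2 subset size.
mcd = MinCovDet(support_fraction=None, random_state=0).fit(x)
mu_rw, Sigma_rw = mcd.location_, mcd.covariance_   # reweighted estimates
d2 = mcd.mahalanobis(x)                            # robust squared distances

# Flag points beyond the 0.975 chi-square cutoff used for reweighting.
cutoff = stats.chi2.ppf(0.975, df=5)
print(np.sum(d2 > cutoff))
```

Because the subset search ignores the planted points, the robust distances of the shifted observations are large and all of them are flagged, in contrast to the classical distances, which would be masked.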

FSRMCD-Based Bandwidth Selection for Multivariate Density Estimation
Given a dataset x = (x_1, . . . , x_n) drawn independently from some underlying distribution with an unknown density function f in a p-dimensional Euclidean space, a nonparametric estimate of f at each point y is desired. We have chosen to base our multivariate density estimation on the well-known method of Parzen (1962). Throughout the following sections, H will denote the bandwidth or window size matrix.
Let the kernel K : R^p → R be a function satisfying certain regularity and integrability properties. The estimator of f at y is then given by

f̂_H(y) = (1/n) Σ_{i=1}^{n} K_H(y − x_i),  K_H(u) = |H|^{−1/2} K(H^{−1/2} u). (5)

The quality of the estimator thus depends on the choice of the kernel K and the bandwidth H. A detailed treatment of this problem can be found in Scott (1992, pp. 125-195). For our purposes, we chose K to be the p-variate standard normal density.
Selecting an optimal bandwidth matrix is a more difficult problem than estimating the probability density function f itself, and it is especially complex in multivariate settings. Since our density-based clustering algorithm looks for the modes of f, that is, points y where a local maximum of f is attained (satisfying ∇f(y) = 0 if f is smooth), it is important that ∇f be estimated well. Chacón, Duong, and Wand (2011) established a general theorem that justifies the following choice under the multivariate normality assumption:

H = (4/(p + 4))^{2/(p+6)} n^{−2/(p+6)} Σ̂. (6)

This is analogous to the usual Scott's rule-of-thumb bandwidth for estimating f. Because the underlying distribution can differ substantially from the normal distribution due to contamination, Σ̂ should be replaced with a robust covariance estimator. In this article, we use the Σ̂_FSRMCD described earlier, which produces the robust normal reference rule of thumb given by Equation (6).
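A minimal sketch of the Parzen estimator of Equation (5) with a full bandwidth matrix, together with a normal-reference bandwidth of the form of Equation (6), may clarify the mechanics. The function names are ours, and the sketch takes any covariance estimate as input; plugging in a robust estimate is what guards against oversmoothing.

```python
import numpy as np

def gaussian_kde_H(x, H):
    """Parzen estimator f_hat(y) = (1/n) sum_i K_H(y - x_i) with a full
    bandwidth matrix H and the p-variate standard normal kernel."""
    n, p = x.shape
    Hinv = np.linalg.inv(H)
    const = 1.0 / (n * np.sqrt((2.0 * np.pi) ** p * np.linalg.det(H)))
    def f_hat(y):
        diff = y - x                                   # (n, p)
        q = np.einsum('ij,jk,ik->i', diff, Hinv, diff) # quadratic forms
        return const * np.exp(-0.5 * q).sum()
    return f_hat

def normal_reference_H(Sigma, n, p):
    # Normal-reference bandwidth tuned for gradient estimation, in the
    # spirit of Equation (6); substituting a robust Sigma avoids the
    # oversmoothing caused by contamination-inflated covariances.
    return (4.0 / (p + 4.0)) ** (2.0 / (p + 6.0)) * n ** (-2.0 / (p + 6.0)) * Sigma
```

For clean data the sample covariance works in `normal_reference_H`; under contamination a high breakdown estimate such as the FSRMCD covariance is the intended input.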

Introduction to the Method
Our approach for identifying multivariate outliers rests on an assumption that a random sample of n p-dimensional data vectors comes from a finite mixture of multivariate normal densities with the biggest component representing the bulk of points. All data vectors not in this bulk, whether they are scattered or form clusters, are identified as masking points or potential outliers. The purpose of implementing multivariate probability density estimation and cluster identification is to be able to distinguish between the bulk and other components of the mixture. We propose to exploit a high breakdown robust estimator to address the oversmoothing effect that usually arises if Scott's rule-of-thumb bandwidth based on the standard covariance matrix is employed. The MAC method presented by Li, Ray, and Lindsay (2007) uses multiple bandwidths, each a function of the S matrix from the complete set of data, to estimate the multivariate probability density function. Even though S has an undesirably low breakdown point, their method is effective in the practical sense of identifying modes and associated clusters. We modify the method of Li, Ray, and Lindsay (2007) by choosing the FSRMCD estimator instead of S because it has a much higher breakdown point and does not lead to oversmoothed density estimators even if the whole set of data is not multivariate normal. We multiply the FSRMCD covariance matrix estimator by a correction factor to form a single bandwidth H. Once the filtered central bulk is found, d_i^2's for each of the n data vectors are produced. Based on these d_i^2's, a decision is made as to whether a point is an outlier or not. Details of our proposed outlier detection algorithm follow.
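For a common Gaussian kernel, the MEM ascent of Li, Ray, and Lindsay (2007) reduces to an iteratively reweighted mean, and MAC groups points by the mode they ascend to. The sketch below is our own simplified rendering (the function names and the rounding heuristic used to match modes are assumptions, not the authors' implementation):

```python
import numpy as np

def mem_ascend(y0, x, Hinv, tol=1e-8, max_iter=500):
    """Ascend from y0 to a local mode of the Gaussian kernel density
    estimate. With a common Gaussian kernel the MEM update is the
    weighted mean y <- sum_i w_i x_i, w_i proportional to K_H(y - x_i)."""
    y = np.array(y0, dtype=float)
    for _ in range(max_iter):
        diff = y - x                                        # (n, p)
        logw = -0.5 * np.einsum('ij,jk,ik->i', diff, Hinv, diff)
        w = np.exp(logw - logw.max())                       # stable weights
        w /= w.sum()
        y_new = w @ x
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def mac_clusters(x, Hinv, decimals=2):
    """Modal association clustering: every point ascends to a mode, and
    points sharing the same (rounded) mode form one cluster."""
    modes = np.array([mem_ascend(xi, x, Hinv) for xi in x])
    _, labels = np.unique(np.round(modes, decimals), axis=0, return_inverse=True)
    return labels, modes
```

On well-separated groups, every point in a group converges to that group's mode, so the labels recover the clusters; the largest label class plays the role of the central bulk in the algorithm below.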

Steps of the Algorithm
In this section, we describe our FSRMCD-MAC algorithm.
Step 1. Compute the FSRMCD estimates μ̂_FSRMCD and Σ̂_FSRMCD from the full sample x as described earlier.
Step 2. Similar to Equation (6), define the bandwidth matrix

H = c²_{n,p} (4/(p + 4))^{2/(p+6)} n^{−2/(p+6)} Σ̂_FSRMCD, (7)

where c²_{n,p} is a correction factor that improves performance; see Table S1 in the supplement. For example, c²_{n,p} = 0.7532 + 37.51 n^{−0.9834} for p = 5 in supplemental Table S1. Use linear interpolation to compute c²_{n,p} if the particular p is not given in the table.
Step 3. Estimate the multivariate probability density via f̂_H given by Equation (5), using H from Equation (7) and the p-variate standard normal density as kernel K.
Step 4. Apply the modal association clustering (MAC) described by Li, Ray, and Lindsay (2007) to f̂_H. Among the clusters of points from x indexed by I_1, . . . , I_r ⊂ {1, . . . , n}, the largest one, say I_{i*}, and its corresponding mode m are selected. In case of ambiguity, we pick the cluster with the smaller index i. If the largest cluster contains fewer than p + 1 points, take x as a single cluster, that is, I_{i*} = {1, . . . , n}.
Step 5. Compute the symmetry measure β(x_i) for each i ∈ I_{i*}. This measure will be used to filter out skewness, which may dilute the effectiveness of our method. Note that 0 < β(x_i) ≤ 1.
Step 6. All i ∈ I_{i*} with β(x_i) < β_θ are peeled. The threshold β_θ is based on a zσ-confidence interval for φ_{I_{p×p}}(ξ), where bias and var denote the asymptotic bias and variance of f̂(ξ) from Equation (5) if the underlying density f is standard normal and H = κ I_{p×p}, κ > 0 (see Härdle et al. 2004, p. 71). Simulations suggested taking z = 1. The remaining points are then indexed by B ⊂ I_{i*}.
Step 7. Compute the robust estimates μ̂_robust and Σ̂_robust of Equation (8) from the points indexed by B, and obtain the robust squared Mahalanobis distances

d_i^2 = (x_i − μ̂_robust)' Σ̂_robust^{−1} (x_i − μ̂_robust), i = 1, . . . , n. (9)

See Section 3.4 for the decision rule using the d_i^2. A Matlab implementation of the algorithm can be found in the compute signals.m file of the supplementary materials.

Correction Factor, Unbiasedness, and Implementation
To improve the performance of the FSRMCD-based bandwidth, we introduced a correction factor c²_{n,p}; see Equation (7). For p = 1, 5, 10, 15, simulations were conducted for different values of n to obtain the particular values of the scaling factor that lead to the optimal performance of the FSRMCD-MAC algorithm. For each p, the method of least squares was used to fit a curve of the form α_1 + α_2 n^{α_3} to the observed optimal scaling factors. The best-fit curves are listed in supplemental Table S1. For example, for p = 5, c²_{n,p} = 0.7532 + 37.51 n^{−0.9834}.
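The fitted-curve evaluation and the linear interpolation in p prescribed in Step 2 can be sketched as follows. The p = 5 coefficients are those quoted from supplemental Table S1; the p = 10 row used in the test is a made-up placeholder purely to exercise the interpolation.

```python
import numpy as np

def c2(n, p, table):
    """Correction factor c^2_{n,p} from fitted curves a1 + a2 * n**a3.

    table maps each tabulated p to its (a1, a2, a3) coefficients; for a p
    not in the table, the factor is linearly interpolated between the
    values of the neighboring tabulated p's at the same n.
    """
    ps = np.array(sorted(table))
    vals = np.array([table[q][0] + table[q][1] * n ** table[q][2] for q in ps])
    return float(np.interp(p, ps, vals))
```

For a tabulated p this just evaluates the fitted curve; for an intermediate p it returns the interpolated value between the two neighboring curves.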
Since the FSRMCD-MAC estimators given in Equation (8) are based on μ̂_FSRMCD and Σ̂_FSRMCD, which are affine equivariant, μ̂_robust and Σ̂_robust have the property of affine equivariance. Hence, μ̂_robust is unbiased. On the other hand, simulations showed that Σ̂_robust is slightly biased. This is not essential in the framework of outlier detection. However, if Σ̂_robust is used to estimate the covariance Σ, Σ̂_robust should be multiplied by a bias correction factor c_FSRMCD-MAC given in supplemental Table S2. For example, choosing n = 60 and p = 5, c_FSRMCD-MAC = 1.0407.

FSRMCD-MAC Decision Rule
It is helpful to frame outlier detection as a hypothesis testing problem. The null hypothesis is that no outliers exist, that is, for some x = (x_1, . . . , x_n) all x_i come from the same distribution and are independent. Thus, H_0i: x_i ∼ N(μ, Σ) for all i = 1, . . . , n. If every H_0i is accepted, we conclude there is no contamination. Likewise, if at least one of the H_0i is rejected, at least one x_i from the data matrix x is deemed to be an outlier. If each H_0i is tested individually, we can control the per-comparison error rate by choosing the same test size α close to 0 for each test. Hence, a high proportion of contaminated x_i will be detected with a small per-comparison error rate. However, when all x_i ∼ N(μ, Σ) with μ and Σ known, the test statistics are independent and thus the probability that at least one H_0i is rejected becomes γ = 1 − (1 − α)^n. This probability is frequently referred to as the family-wise error rate, and if the per-comparison error rate α is controlled to be small, γ can be unacceptably large, perhaps close to 1 even for moderately sized samples. Similar to what Cerioli (2010) proposed, the FSRMCD-MAC decision rule has two testing steps. The first testing step controls the family-wise error rate and the second testing step controls the per-comparison error rate. This two-step testing procedure is much like Fisher's protected F-test. These two testing steps are not to be confused with Steps 1 to 8 of the algorithm. Consider each test statistic d_i^2 given in Equation (9) for x = (x_1, . . . , x_n). In testing Step 1 of the FSRMCD-MAC decision rule, using an identified threshold L_{1,1−γ} that controls the family-wise error rate, if max_{i=1,...,n} d_i^2 > L_{1,1−γ}, at least one x_i is determined to be an outlier and the family-wise error rate is controlled at a reasonably selected level γ. Then, testing Step 2 is executed. If the maximum does not exceed L_{1,1−γ}, the FSRMCD-MAC decision rule concludes no outliers exist in the sample x.
Consider testing Step 2. Each d_i^2 given in Equation (9) for x is compared to a threshold L_{2,1−γ} constructed to control the per-comparison error rate. If d_i^2 > L_{2,1−γ}, x_i is determined to be an outlier; otherwise x_i is judged not to be an outlier. The determination of L_{1,1−γ} and L_{2,1−γ} is described below.
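The two testing steps amount to a simple gate: no individual flags are raised unless the maximum distance first signals. A minimal sketch (function name ours):

```python
import numpy as np

def two_step_decision(d2, L1, L2):
    """Two-step rule: signal only if max d2_i exceeds L1 (family-wise
    control); then flag each individual i with d2_i > L2."""
    if d2.max() <= L1:
        return np.zeros(len(d2), dtype=bool)   # no outliers declared
    return d2 > L2
```

Note that a point with d_i^2 between L_2 and L_1 can be flagged, but only when some other point (or itself) has pushed the maximum past L_1.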
For selected (n, p) combinations, 5000 sets of n p-dimensional multivariate standard normal data vectors were generated. Let d²_ij be the squared Mahalanobis distance statistic produced by Equation (9) from the ith item in the jth sample of size n. For each of the j = 1, . . . , 5000 sets of n data vectors, d²_max,j = max_{i=1,...,n} d²_ij was recorded and L_{1,1−γ} identified as the (1 − γ)·5000th ranked d²_max,j. Jensen, Birch, and Woodall (2007) recommended this approach for determining an initial threshold L_{1,1−γ} that guarantees a family-wise error rate of γ because of dependencies among the d²_ij within the jth set of n data points. Second, for each of the j = 1, . . . , 5000 sets of n independently generated data vectors, a single d²_ij, denoted by d²_rand,j, was randomly selected. Since the d²_ij within each set of n independently generated data vectors are dependent, only a single d²_ij was randomly selected from each set. The threshold L_{2,1−γ} was then identified as the (1 − γ)-quantile of the distribution of d²_rand,j when no outliers exist. See supplemental Table S3 for the limits L_1 and L_2. For example, choosing n = 60 and p = 5, L_{1,0.983} = 33.8628 and L_{2,0.983} = 18.1691.
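The Monte Carlo calibration of the two limits can be sketched generically. The function name is ours, and `stat_fn` is a placeholder for the robust distances of Equation (9); the test below substitutes known-parameter chi-square distances simply to check the mechanics.

```python
import numpy as np

def simulate_limits(stat_fn, n, p, gamma, reps=5000, seed=0):
    """Calibrate L1 (family-wise, via the max statistic) and L2
    (per-comparison, via one randomly chosen distance per sample).

    stat_fn(x) must return the n squared distances for an (n, p) sample.
    One distance per sample is used for L2 because the n distances within
    a sample are dependent.
    """
    rng = np.random.default_rng(seed)
    d2_max = np.empty(reps)
    d2_rand = np.empty(reps)
    for j in range(reps):
        x = rng.standard_normal((n, p))      # clean null samples
        d2 = stat_fn(x)
        d2_max[j] = d2.max()
        d2_rand[j] = d2[rng.integers(n)]     # one randomly chosen d2
    return np.quantile(d2_max, 1 - gamma), np.quantile(d2_rand, 1 - gamma)
```

With the true robust statistic plugged in, the returned quantiles play the roles of L_{1,1−γ} and L_{2,1−γ}.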
The simulations outlined here and assessed in Sections 4.1 and 4.2 were carried out on a computer cluster with 32 dual quad-core nodes, GigE connectivity for internode communication, 768 GB of distributed memory, and 15 TB of shared storage using Matlab R2010A under a 64-bit Linux kernel. Pseudorandom numbers were produced using the ziggurat algorithm implemented in the function mvnrnd of Matlab. The maximum average run time over all selected scenarios ranged between 0.78 sec and 7.87 sec depending on p and n.

Outlier Detection Power
We implemented extensive simulations to calculate outlier detection power and swamping rates over a wide range of outlier counts and shift magnitudes for each of the (n, p) combinations in supplemental Table S3. We denote by r the count of p-dimensional "bad" or "outlying" points in a sample of size n and by λ a shift size indicator. For each of the 18 (n, p) pairs in supplemental Table S3, simulations were run for the 72 (r, λ) combinations letting r = τn, where τ = 0.05, 0.10, . . . , 0.30 and λ = 1.6, 1.8, . . . , 3.8. For a given (n, p) combination, each of the 72 simulations generated 5000 independent sets of n p-dimensional random vectors. In a given set of random vectors, n − r of the vectors were generated from N(0_{p×1}, I_{p×p}) and r of the vectors were generated from N(λe_{p×1}, I_{p×p}), where e_{p×1} is a p × 1 vector of ones and I_{p×p} is the p × p identity covariance matrix. Using the two-step FSRMCD-MAC method and Cerioli's IRMCD method, the percent of the r vectors identified as outliers and the percent of the n − r vectors incorrectly determined to be outliers were recorded for each set of n p-dimensional random vectors. The average percent of the r vectors identified as outliers over the 5000 sets of n p-dimensional vectors is the detection power for the selected (n, p, r, λ). Likewise, the swamping rate is the average percent of the n − r vectors incorrectly identified as outliers over the same 5000 sets. So, 18 × 6 × 12 = 1296 simulations, each of size 5000, were performed for both methods, producing corresponding swamping and detection power rates. Choosing the mean vector λe_{p×1} is our way of specifying a shift in the distribution from a mean vector of 0_{p×1}. The noncentrality parameter ncp = (μ_1 − μ_0)' Σ^{−1} (μ_1 − μ_0) is sometimes used to index a shift producing outliers. Here, μ_0, μ_1, and Σ = Σ_0 = Σ_1 are the mean vectors and covariance matrices of the first and second groups of points. It can easily be seen that ncp = λ²p.
So, for a given p, a large λ corresponds to a large ncp, that is, a large shift.
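One cell of the power/swamping grid described above can be sketched directly (the detector argument is a placeholder; the test plugs in a simple known-parameter rule rather than either of the paper's methods):

```python
import numpy as np

def power_swamping(detect_fn, n, p, tau, lam, reps=500, seed=0):
    """Estimate detection power and swamping rate for one (n, p, tau, lam).

    detect_fn(x) must return a boolean outlier flag for each of the n rows.
    Good points come from N(0, I); bad points from N(lam * e, I).
    """
    rng = np.random.default_rng(seed)
    r = int(round(tau * n))
    power = swamp = 0.0
    for _ in range(reps):
        good = rng.standard_normal((n - r, p))
        bad = rng.standard_normal((r, p)) + lam        # shift by lam * e
        flags = detect_fn(np.vstack([good, bad]))
        power += flags[n - r:].mean()                  # bad points flagged
        swamp += flags[:n - r].mean()                  # good points flagged
    return power / reps, swamp / reps
```

Averaging each cell over many replications gives the power and swamping surfaces compared in the figures.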
Since we desired to compare the FSRMCD-MAC outlier detection performance with that of the IRMCD method proposed by Cerioli (2010), our comparisons were based on the same family-wise error rate γ, where γ is close to zero. Cerioli (2010) noted the published IRMCD performance used a stated γ = 0.01. For notational purposes, we refer to this γ as γ_nominal. Because the IRMCD method is based on approximate probability distributions and the test statistics are not independent, the actual family-wise error rate γ_actual for the IRMCD method is usually larger than γ_nominal = 0.01. Based on simulations, the γ_actual values for a stated γ_nominal = 0.01 are given in our Table 1 and can also be found in Table 1 of Cerioli (2010). For the FSRMCD-MAC method, limits matching the γ_actual of the IRMCD method were chosen. This ensures both methods have the same family-wise error probability γ_actual. Note that for p = 15 and n = 40, we discovered γ_actual = 0.105 where Cerioli (2010) reported 0.084. Consider two example comparisons for n = 60 and p = 5. Letting γ_nominal = 0.01, the actual family-wise error rate used in the FSRMCD-MAC method was γ_actual = 0.017. Notationally, in each configuration (1 − τ)n observations come from N(0_{5×1}, I_{5×5}) and τn from N(λe_{5×1}, I_{5×5}). Figure 1(a) illustrates the percent of outliers detected by the FSRMCD-MAC and IRMCD methods for τ = 0.05 and selected λ's. The swamping rates never exceeded 1.5% for FSRMCD-MAC and 1% for IRMCD. Figure 1(b) illustrates the percent of outliers detected by the FSRMCD-MAC and IRMCD methods for τ = 0.20 and selected λ's. The swamping rates never exceeded 2.6% for FSRMCD-MAC and remained below 1% for IRMCD. Hence, in both scenarios the power improvements due to implementation of the FSRMCD-MAC method were consistent and significant. Although the FSRMCD-MAC method produced swamping rates up to 2.6% compared to 1% for the IRMCD method, both methods had the same experiment-wise error rate.
Since swamping rates only have meaning when at least one outlier has been signaled, a method with a somewhat larger swamping rate but consistently and significantly higher outlier detection probability may be preferred by analysts. We thoroughly address the logic of this choice later in Section 4.2.
Continuing, two example comparisons for n = 200 and p = 5 were also evaluated. As before, letting γ_nominal = 0.01, the family-wise error rate used in the FSRMCD-MAC method was γ_actual = 0.011. Figure 2(a) illustrates the percent of outliers detected by the FSRMCD-MAC and IRMCD methods for τ = 0.05 and selected λ's. The swamping rates never exceeded 1.2% for FSRMCD-MAC and 1% for IRMCD. Figure 2(b) illustrates the percent of outliers detected by the FSRMCD-MAC and IRMCD methods for τ = 0.20 and selected λ's. The swamping rates never exceeded 1.3% for FSRMCD-MAC and remained below 1% for IRMCD. Cerioli (2010) also evaluated IRMCD performance for outlier configurations where outliers were scattered from central to more extreme locations. In the following two examples, uncontaminated data were generated from N(0_{5×1}, I_{5×5}). For n = 200, p = 5, and γ_nominal = 0.01, the first example generated contaminated data from N(0_{5×1}, ψI_{5×5}), ψ = 2, 4, . . . , 12, and τ = 0.20. Figure 3(a) illustrates the outlier detection performance of the FSRMCD-MAC and IRMCD methods. The actual family-wise error rate used in the FSRMCD-MAC method was γ_actual = 0.011. The swamping rates for the FSRMCD-MAC method did not exceed 1.6% and those of the IRMCD method did not exceed 1%. Second, choosing n = 200, p = 5, γ_nominal = 0.01, and τ = 0.20, we considered contaminated data coming from a p = 5 dimensional multivariate t-distribution with ζ degrees of freedom, ζ = 1, 2, . . . , 6. Figure 3(b) illustrates the outlier detection performance of the FSRMCD-MAC and IRMCD methods for this second example. The actual family-wise error rate used in the FSRMCD-MAC method was γ_actual = 0.011. Swamping rates for the FSRMCD-MAC method did not exceed 1.3% and those of the IRMCD method did not exceed 1%.

Outlier Detection Performance Comparisons for Additional (n, p) Combinations
In this section, the outlier detection performances of the IRMCD and FSRMCD-MAC methods for the complete set of 18 (n, p) combinations outlined in Table 1 are compared and analyzed. For all (n, p) combinations except those with p = 15, our comparisons are based on the outlier detection performances over the 72 (r, λ) combinations outlined earlier. Since for large λ's we noticed both methods almost always produced a proportion of detected outliers equal to 1, we restricted our comparisons to those whose noncentrality parameter (ncp) did not exceed 150. In terms of λ and p, ncp = λ²p ≤ 150. For p = 5 or 10, λ ≤ 3.8 satisfies our restriction; for p = 15, λ ≤ 3.0 does. Except for (n = 40, p = 10) and (n = 60, p = 15), the six power graph comparisons of FSRMCD-MAC to IRMCD for each of the other sixteen (n, p) combinations presented in Table 1 resemble those of Figures 1(a), 1(b), or 2(b). In other words, the outlier detection power values of the FSRMCD-MAC method are almost always larger than the corresponding power values produced by the IRMCD approach for each of the six r values considered across the complete range of λ's selected. For 16 of the 18 (n, p) pairs from Table 1, the FSRMCD-MAC approach is clearly superior to the IRMCD method if only power of detection is considered. Swamping rates for IRMCD were almost always at or a bit lower than 1% for the 18 scenarios in Table 1. For the FSRMCD-MAC approach with n = 200 or 400 and p = 5 or 10, swamping rates were almost always at or lower than 2.6%. However, for the other (n, p) situations, the FSRMCD-MAC swamping rates sometimes increased above 15%, especially when the detection power of the FSRMCD-MAC method greatly exceeded that of IRMCD. Consider now one of the scenarios (n = 40, p = 10) where the FSRMCD-MAC method was not almost always uniformly better with regard to detection power.
The outlier detection power was uniformly higher for FSRMCD-MAC when contamination was 25% or 30% (r = 10 or 12). If the contamination level was 10% or 15%, the outlier detection power of IRMCD was uniformly higher. When contamination levels were 5% or 20%, at some values of λ the detection power values favored FSRMCD-MAC and for other λ's the IRMCD method was superior. Swamping rates for the IRMCD method remained at or slightly lower than 1% for almost all (r, λ) pairs. The swamping rates for the FSRMCD-MAC approach remained at or below 2% for all r and smaller λ's. For λ larger than 2.0 and all r, the swamping rates for FSRMCD-MAC were mostly from 2% to 8%, with a few in excess of 10%. The other scenario (n = 60, p = 15), where the detection power levels from the FSRMCD-MAC approach were not uniformly larger than those produced by IRMCD, can be summarized as follows. For 5%, 25%, and 30% contamination levels and almost all λ, detection power levels from the FSRMCD-MAC method were almost always larger, and sometimes much larger, than those produced by the IRMCD approach. However, for 10%, 15%, and 20% contamination levels and most λ levels, detection power values from the IRMCD method were larger. Almost all swamping rates for the IRMCD method were less than 1.2%. FSRMCD-MAC swamping rates stayed below 3.9% when contamination levels were from 5% to 20%. Five FSRMCD-MAC swamping rates exceeded 15% and three were from 5% to 8% when contamination levels were at 25% or 30%; other FSRMCD-MAC swamping rates were under 4%.
If only detection power determines the outlier detection method to use, FSRMCD-MAC is a reasonable choice for 16 of the 18 (n, p) pairs noted above. A natural question arises from this summary: which outlier detection method should be chosen when both swamping and detection power are important? For n = 40, 60, 90, 125, and 200 and p = 5, a strong case can be made for choosing the FSRMCD-MAC approach because, although its swamping rates were somewhat larger than those of IRMCD, its detection power levels were much larger; see Figures 1 and 2. Of course, this is somewhat subjective: how one weights swamping versus detection power determines which method to choose. We propose a model that weights swamping and detection power in a proportional fashion to guide which method to use for outlier detection.
Let c equal the positive loss of identifying a good item as bad, that is, as an outlier. Let mult × c equal the positive gain or savings of identifying a bad or outlying item as such, with mult > 0. This is the notion of proportionality we choose to use in modeling the savings/loss aspect of outlier detection. Vardeman and Jobe (1999, p. 9) stated an "engineering rule of 10" that is often used in decision making related to repair savings/loss. In short, the rule says that correctly identifying a part that needs repair and repairing it costs 1/10 of what repairing it at the next stage of production would cost. On the other hand, if a good part is repaired, the loss is the cost of unnecessarily repairing the part.
For a selected (n, p) pair, based on 5000 simulations of the FSRMCD-MAC method, we let Po_i and So_i be the observed proportions of contaminated items correctly detected and of noncontaminated items identified as contaminated for the ith (r, λ) pair, i = 1, 2, . . . , cnt. Similarly, based on 5000 simulations of the IRMCD method, we let PI_i be the observed proportion of contaminated items correctly detected and SI_i the proportion of noncontaminated items identified as contaminated. For all n, when p = 5 or 10, cnt = 72, and when p = 15, cnt = 48. Letting r denote the number of bad, contaminated, or outlying items in a sample of n items and n − r the number of good or uncontaminated items, an estimated per-item savings corresponding to application of the FSRMCD-MAC method for a given (n, p, r, λ) can be written as

mult · c · (r/n) · Po_i − c · ((n − r)/n) · So_i.

The estimated change in per-item savings SVG_i, FSRMCD-MAC approach minus IRMCD method, for the ith (r, λ) pair from a selected (n, p) pair becomes

SVG_i = mult · c · (r/n)(Po_i − PI_i) − c · ((n − r)/n)(So_i − SI_i).

Since comparing SVG_i to 0 informs which method should be used for outlier detection, we let, without loss of generality, c = 1. If SVG_i exceeds 0, then for the ith (r, λ) combination FSRMCD-MAC is preferred over IRMCD. If SVG_i is less than zero, IRMCD is preferred for the ith (r, λ) combination. In practice, however, r and λ are unknown. To address this matter, for each (n, p) pair considered, we propose computing the estimated average based on the cnt SVG_i's and a 90% two-sided confidence interval for the average SVG. For each (n, p), the cnt SVG_i's come from a preselected grid equally spaced over a selected (r, λ) space. We assume the preselected grid of cnt points comes from a population of (r, λ)'s that occur uniformly over the two-dimensional space {1, 2, . . . , ⌊0.3n⌋} × [1.6, 150/p]. Writing SVG_i = a'd_i with a = (mult, 1)' and d_i = ((r/n)(Po_i − PI_i), −((n − r)/n)(So_i − SI_i))', the sample average SVG becomes a'd̄, where d̄ is the estimated mean vector based on the cnt column vectors d_i.
Continuing, the 90% confidence interval for the true average SVG becomes

a'd̄ ± Φ^{−1}(0.95) · sqrt(a'Sa / cnt), with a = (mult, 1)',

where Φ^{−1}(0.95) ≈ 1.645 is the 95th percentile of the standard normal distribution and S is the estimated covariance matrix based on the cnt column vectors d_i. For each of the 18 (n, p) pairs described above, we identified the range of mult values that makes the lower bound exceed 0; these mult values inform the user to implement FSRMCD-MAC. Likewise, we identified the range of mult values that makes the upper bound less than 0; these mult values inform the user to implement IRMCD. Values of mult that guarantee a negative lower bound and positive upper bound were also found; these values inform the analyst that using FSRMCD-MAC instead of IRMCD provides no advantage. These ranges are given in the supplemental materials. Practically, it seems mult is almost always much larger than 1, suggesting FSRMCD-MAC is to be used for most cases considered. For smaller λ values, especially those less than 3, our proportional decision model shows the FSRMCD-MAC method frequently produces larger savings per item for individual applications than IRMCD for certain ranges of mult. Additional work is needed to explore and analyze the change in savings per item from any one application of FSRMCD-MAC compared to IRMCD.
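Under one plausible reading of the savings equations above (with c = 1 and a plain normal confidence interval for the mean of the SVG_i's), the comparison can be sketched as follows; the function name and argument layout are ours.

```python
import numpy as np

def savings_comparison(Po, So, PI, SI, r, n, mult, z=1.645):
    """Per-item savings change SVG_i (FSRMCD-MAC minus IRMCD) under the
    proportional loss model with c = 1, plus a two-sided 90% normal
    confidence interval for the mean SVG over the (r, lambda) grid.

    Po, So, PI, SI, r: arrays over the cnt grid points.
    """
    svg = mult * (r / n) * (Po - PI) - ((n - r) / n) * (So - SI)
    mean = svg.mean()
    half = z * svg.std(ddof=1) / np.sqrt(len(svg))   # z = Phi^{-1}(0.95)
    return mean, (mean - half, mean + half)
```

A positive lower bound argues for FSRMCD-MAC, a negative upper bound for IRMCD, and an interval straddling zero for indifference, mirroring the mult-range analysis above.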

DATA ANALYSIS
In this section, we compare outlier detection performance of the FSRMCD-MAC method to that of the IRMCD method based on application to the same data. The data we selected for this comparison were introduced by Flury and Riedwyl (1988). Six variables were measured on each of 200 Swiss banknotes, 100 being authentic and 100 forged.
We now turn to a contextually interesting question: how well does each outlier detection method separate genuine from forged bills? We consider the full configuration of 200 Swiss notes, 100 genuine and 100 forged. Since the number of bills belonging to the genuine group is 100, which is less than h = ⌊(n + p + 1)/2⌋ = ⌊(200 + 6 + 1)/2⌋ = 103, the MCD estimators break down. Another way to evaluate whether the MCD estimators break down is to consider the breakdown point: the number of counterfeit bills, 100, exceeds the breakdown point ⌊(n − p + 1)/2⌋ = ⌊(200 − 6 + 1)/2⌋ = 97. Thus, the assumptions regarding breakdown are violated and the MCD estimators used by both methods break down. Using the simulated γ_actual = 0.0118, the FSRMCD-MAC analysis is presented in Figure 4(a); likewise, using γ_nominal = 0.01, the IRMCD analysis is presented in Figure 4(b). In Figure 4, index = 1, . . . , 100 corresponds to genuine notes and index = 101, . . . , 200 to forged notes.
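The breakdown arithmetic for this configuration is easy to verify; the formulas below are the ones quoted in the text:

```python
# Breakdown check for the Swiss banknote configuration: n = 200 notes,
# p = 6 measurements per note, 100 of the notes forged.
n, p = 200, 6
n_forged = 100

h = (n + p + 1) // 2          # MCD subset size: floor((200+6+1)/2) = 103
breakdown = (n - p + 1) // 2  # max contaminated items tolerated: 97

# The genuine group (100) is smaller than h = 103, and the forged group
# (100) exceeds the breakdown point 97, so the MCD estimators break down.
print(h, breakdown, n_forged > breakdown)
```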
Both methods have the same swamping rate: FSRMCD-MAC incorrectly identifies five genuine notes as counterfeit, and IRMCD likewise incorrectly identifies five. However, the detection power of FSRMCD-MAC is 100 out of 100, whereas that of the IRMCD method is 15 out of 100. Although both methods employ the reweighted MCD estimators, FSRMCD-MAC uses the reweighted MCD to build an estimated multivariate probability density function, which the cluster analysis then uses to find a central bulk of data; from this bulk a mean vector and covariance matrix are computed for the robust test statistic given in Equation (9). It seems the FSRMCD-MAC method benefits from exploiting the reweighted MCD estimator as a bandwidth matrix for the MEM and MAC algorithms, constructing a cluster analysis that filters out unusual data values. In contrast, IRMCD uses the MCD estimator to construct a test statistic directly.

CONCLUSION
We have constructed and presented the FSRMCD-MAC algorithm for detecting multiple outliers. The FSRMCD-MAC algorithm was shown to detect, with high probability, multiple outliers that may occur among multivariate normal data. Similar to Cerioli (2010), FSRMCD-MAC uses two testing steps to identify outliers: the first testing step assures a selected familywise error rate, and the second maintains a selected per-comparison error rate. For a familywise error rate corresponding to γ_nominal = 0.01, comparisons based on extensive simulations show the FSRMCD-MAC method regularly outperforms the power of IRMCD, the leading multivariate outlier detection method proposed by Cerioli (2010), over a wide range of (n, p, r, λ) combinations. Further, for many of the (n, p, r, λ) combinations the FSRMCD-MAC method produces swamping rates that are quite competitive with those produced by IRMCD.

We also proposed a model, weighting swamping and detection power in a proportional fashion, to guide which method to use for outlier detection. Supplemental Table S4 summarizes our decision rule for choosing FSRMCD-MAC or IRMCD. This rule uses a multiplicative model that is a function of the savings and loss that result from outlier detection decisions. Across the grid represented in Supplemental Table S4, as long as the savings from detecting an outlier is more than 4.5 times the loss of identifying a good item as an outlier, FSRMCD-MAC is preferred for all 18 scenarios considered; if the ratio is at least 2.02, FSRMCD-MAC is preferred for 15 of the 18 scenarios.

The Swiss banknote example presented in Section 5 demonstrated the strength of the FSRMCD-MAC method over the IRMCD approach in a practical setting. Our simulation method for determining the cut-off limits required in the FSRMCD-MAC two testing steps produced decision-rule limits that are accurate and do not depend on asymptotic assumptions.
Further, our method accounts for the dependencies among the test statistics that are not addressed by the IRMCD method.
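The familywise control mentioned above rests on the per-test size adjustment given in the Introduction, α = 1 − (1 − γ)^{1/n}. A minimal sketch of that arithmetic, with illustrative values of γ and n:

```python
# Per-test size alpha that yields a selected familywise error rate gamma
# across n tests (treated as independent), per the Introduction:
#   gamma = 1 - (1 - alpha)**n   =>   alpha = 1 - (1 - gamma)**(1/n)
gamma = 0.01   # nominal familywise error rate used in the comparisons
n = 200        # e.g., the Swiss banknote sample size
alpha = 1 - (1 - gamma) ** (1 / n)
print(alpha)                    # far smaller than gamma
print(1 - (1 - alpha) ** n)     # round-trip recovers gamma
```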

SUPPLEMENTARY MATERIALS
Contents of the supplementary files include a zip-archive of the Matlab files and a PDF with four additional tables. The zip-archive contains Matlab files implementing the FSRMCD-MAC algorithm as well as auxiliary routines; see each file for a detailed description. The tables referred to as Tables S1 through S4 in the text of the article are labeled Tables 1 through 4 in the supplement.

[Received February 2013. Revised September 2014.]