False Discovery Versus Familywise Error Rate Approaches to Outlier Detection

Abstract Outliers, in general, are observations that deviate substantially from a base distribution. This study deals with outlier detection approaches for large samples from continuous univariate distributions. Investigated are the properties of a practical newer outlier detection approach based on use of a false discovery rate (FDR) method in conjunction with a robustly estimated Tukey g-and-h base distribution. The properties of a boxplot-type outlier detection approach that controls the familywise error rate (FWER) are compared with those of the newer FDR approach. These options are compared in terms of error rates and of the effect of moving the outliers gradually further from the center of the base distribution, using 5% and 1% FDRs or FWERs. Two microarray datasets are used as examples where the assumed null distributions do not fit the data well. In such cases, the proposed estimated Tukey g-and-h null distribution approach leads to superior outlier detection performance. Supplementary materials for this article are available online.


Introduction
This article deals with outlier detection approaches for larger univariate samples, which are becoming more common in practice. Outlier detection has a long history with applications in many areas. Barnett and Lewis (1994) provided an excellent and extensive summary of the outlier detection area, with a shorter introduction given by Iglewicz and Hoaglin (1993). In general, a base distribution is assumed for the data, followed by a search for outliers deviating from this base distribution. Frequently, the normal distribution is used as the base distribution; Barnett and Lewis (1994) discussed almost 50 outlier detection methods for the normal base distribution. More recently, a false discovery rate (FDR) approach has become increasingly popular for testing multiple hypotheses. This approach can also be used for outlier detection as an alternative to the commonly used familywise error rate (FWER) methods. In this article, the effectiveness of a popular FWER method is compared to an FDR approach for detecting outliers. Additionally, the quite general Tukey g-and-h distributional family, with estimated parameters, is investigated as a practical alternative to an assumed base distribution for outlier detection.
The Tukey (1977a) basic boxplot is the most popular tool for labeling observations as outliers. The probability properties of the basic boxplot outlier detection procedure were studied by Hoaglin, Iglewicz, and Tukey (1986); the procedure was converted to a formal outlier detection rule, for normally distributed base distributions, by Hoaglin and Iglewicz (1987), and further generalized for use with a variety of common base distributions by Banerjee and Iglewicz (2007). This relatively simple boxplot procedure, denoted as BP, is used here as the studied FWER approach. In this article, a simulation comparison is made between BP and an FDR approach (Benjamini and Hochberg 1995), denoted as BH, for detecting potential outliers. An important part of this investigation is to observe the effects on BP and BH of moving the "outliers" gradually further from the bulk of the data. This provides insight into the advantages and disadvantages of using a FWER versus an FDR approach for identifying unusual observations. The standard boxplot outlier detection method labels all observations outside the interval

[Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)]    (1)

as deserving further investigation, where Q1 is the sample lower quartile and Q3 the upper quartile, respectively. Here, these outside observations will be labeled as outliers. The popular boxplot labeling rule uses k = 1.5, though k = 3 is also considered to detect "far out" observations. Banerjee and Iglewicz (2007) obtained the values of k and its relation to sample size n for several distributional families and provided a simple solution for k by controlling the some-outside rate per sample, defined as

Pr[at least one xi ∈ region outside the cutoff in (1) under the null distribution] ≤ α.    (2)
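As an illustration, the interval rule in (1) can be sketched in a few lines of Python. This is a hedged sketch, not the authors' implementation; the quartile convention used here is Python's "inclusive" method, which may differ slightly from the letter-value quartiles traditionally paired with boxplots.

```python
import statistics

def boxplot_outliers(data, k=1.5):
    """Label observations outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# A clearly separated point is flagged; a clean sample is not.
print(boxplot_outliers([1, 2, 3, 4, 5, 100]))  # -> [100]
print(boxplot_outliers([1, 2, 3, 4, 5]))       # -> []
```

Larger k widens the fences, which is how the some-outside rate in (2) can be controlled as n grows.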
Here, it is assumed that the sample size is sufficiently large so that Q1, M (the sample median), and Q3 are close to F−1(.25), F−1(.5), and F−1(.75), respectively, where F is the distribution of the data. For a symmetric distribution, it is also reasonable to assume that Q3 − M = M − Q1. Then for symmetric base distributions, (2) can be replaced by

Pr[X(n) > Q3 + k(Q3 − Q1)] ≤ α/2,

which can be estimated by solving

Pr[X(n) > F−1(.75) + k(F−1(.75) − F−1(.25))] = α/2

for k. Here α = 0.05 is used. Kimber (1990) studied this rule for skewed distributions and replaced (1) by the skewness-adjusted interval

[Q1 − 2k(M − Q1), Q3 + 2k(Q3 − M)].

When only upper (or lower) outliers are of interest, then only the upper (or lower) cutoff is used. Hence, for one-sided upper-end outlier detection, the outlier rejection region becomes x > Q3 + 2k(Q3 − M). The BH method was introduced to deal with multiple hypotheses testing; it controls the expected proportion of false positive discoveries. This method can also be used to detect outliers. Usually BH deals with n null hypotheses with decision rule D applied to each of these null hypotheses to decide whether to reject or not reject. As a simple illustrative example, given n statistics z1, z2, . . . , zn to test the null hypotheses H01, H02, . . . , H0n, the rule D rejects H0i if |zi| ≥ c, where c indicates the rejection boundary. In this case, the rejected observations can be viewed as outliers. More generally, for a random sample from distribution F with outliers from distribution G, each observation zi can be viewed as a test statistic with null hypothesis H0i that zi is from F. Hence, the nonnull zi that are rejected by D can be viewed as outliers.
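The BH step-up rule itself is easy to state in code. The following sketch (an illustration, not the authors' implementation) rejects the k hypotheses with the smallest p-values, where k is the largest rank satisfying P(k) ≤ (k/n)α.

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: return the set of rejected indices."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices by ascending p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / n:
            k = rank  # largest rank meeting the step-up condition
    return set(order[:k])

# The three smallest p-values clear their stepped thresholds 0.01, 0.02, 0.03.
print(bh_reject([0.001, 0.02, 0.03, 0.5, 0.9]))  # -> {0, 1, 2}
```

Applied to outlier detection, the rejected indices correspond to the observations declared outliers.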
In typical hypothesis testing, the null distribution is assumed to be known. Such a distribution is here called the theoretical null distribution, and is frequently assumed to be normal, possibly after a transformation. So, if the investigators feel that the null distribution is, say, a Student t7, then t7 will be used as the theoretical null distribution. Such theoretical null distributions may not fit the data well, as discussed by Efron (2006) and Xu, Iglewicz, and Chervoneva (2014). As an alternative, Efron (2005) suggested use of an empirically estimated normal null distribution. Denote the null density function by f0(z) and the proportion of null cases by π0, where π0 is assumed to be large (π0 ≥ 0.9). The null distribution f0(z) is assumed to be N(μ0, σ0). When the theoretical null distribution is another distribution, it needs to be transformed to a normal distribution first. This Efron approach makes the theoretical null distribution more general, as it provides the opportunity to use the data to estimate the unknown parameters μ0 and σ0. Such an estimated normal null distribution, with μ0 and σ0 estimated from the data, will be called the empirical null distribution.
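A crude way to see what an empirical null does is to fit μ0 and σ0 from the center of the data only. The sketch below uses median and interquartile-range matching as a stand-in for locfdr's central-matching or maximum-likelihood fitting; this simple estimator is an assumption for illustration, not Efron's actual procedure.

```python
import statistics

def empirical_null(z):
    """Estimate (mu0, sigma0) of a central normal from robust location/scale."""
    mu0 = statistics.median(z)
    q1, _, q3 = statistics.quantiles(z, n=4, method="inclusive")
    # For N(mu, sigma), IQR = 2 * Phi^{-1}(0.75) * sigma, about 1.349 * sigma.
    sigma0 = (q3 - q1) / (2 * statistics.NormalDist().inv_cdf(0.75))
    return mu0, sigma0

mu0, sigma0 = empirical_null([-2.0, -1.0, 0.0, 1.0, 2.0])
print(mu0)  # -> 0.0
```

Because only the central portion of the sample drives the fit, a modest fraction of outliers in the tails barely moves μ0 and σ0.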
The approach given by Xu, Iglewicz, and Chervoneva (2014), of using an estimated far richer Tukey (1977b) g-and-h distribution as the base distribution, will be further investigated and compared with use of a theoretical and also an empirical null. The four-parameter g-and-h distribution is a monotone transformation of standard normal quantiles zp, for h ≥ 0, to g-and-h quantiles TA,B,g,h(zp). For g = 0,

TA,B,0,h(zp) = A + B zp exp(h zp^2/2),    (6)

and for g ≠ 0,

TA,B,g,h(zp) = A + B [(exp(g zp) − 1)/g] exp(h zp^2/2),    (7)

with A controlling location, B scale, g skewness, and h elongation. Consider several symmetric cases of (6) corresponding to A = 0, B = 1, g = 0 and increasing h. For example, h = 0 yields the standard normal distribution; h = 0.065 approximates the logistic distribution; h = 0.109 approximates the double exponential distribution; and h = 0.97 approximates the Cauchy distribution. Notice that as h increases, so does the thickness of the distribution tail areas. For simplicity, the Tukey g-and-h distribution will henceforth be referred to as the g-and-h distribution. By changing the values of g and h, the g-and-h distribution can approximate a wide range of common distributions including the Cauchy distribution (Martinez and Iglewicz 1984), other t distributions, chi-square distributions, and gamma distributions, among others. Available methods to estimate the g-and-h distribution parameters include: a moment-based method (Martinez and Iglewicz 1984); a letter value (LV) approach (Hoaglin 1985); a numerical maximum likelihood method (NMLE) (Rayner and MacGillivray 2002); and two quantile least squares methods (Xu, Iglewicz, and Chervoneva 2014). The LV and NMLE methods work well when no outliers or other contaminants exist; however, they are not resistant to a moderate proportion of contaminated observations. The robust quantile least squares (rQLS) method for estimating the g-and-h parameters, introduced by Xu, Iglewicz, and Chervoneva (2014), is used here to fit the g-and-h parameters.
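The transformation in (6) and (7) is straightforward to code. The sketch below maps a standard normal quantile to the corresponding g-and-h quantile, using the g = 0 branch to avoid the 0/0 limit in (7).

```python
import math

def gh_quantile(zp, A=0.0, B=1.0, g=0.0, h=0.0):
    """Tukey g-and-h transform of a standard normal quantile zp (h >= 0)."""
    core = zp if g == 0 else (math.exp(g * zp) - 1.0) / g  # -> zp as g -> 0
    return A + B * core * math.exp(h * zp * zp / 2.0)

# With A = 0, B = 1, g = h = 0 the transform is the identity (standard normal).
print(gh_quantile(1.5))  # -> 1.5
```

Since the transform is monotone for h ≥ 0, it can also be inverted numerically to convert observations back to z-scores, which is how p-values are obtained later in the article.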
The rQLS estimator minimizes the sum of squared differences between estimated sample quantiles, based on the biweight-trimmed sample, and population quantiles. Specifically, let x(i), i = 1, . . . , n, be the ith order statistic of the sample xi, i = 1, . . . , n, from the g-and-h distribution. The predicted ith order statistic is defined as the quantile ξpi(θ) of the g-and-h distribution with parameter vector θ, where pi = (i − 1/3)/(n + 1/3). Then the weight corresponding to x(i) is based on the biweight function and is given by

wi = w((x(i) − ξpi(θ))/c), where w(u) = (1 − u^2)^2 for |u| < 1 and w(u) = 0 otherwise,

and c is a constant of the biweight function. Effectively all observations with weight 0 are trimmed from the sample used in estimation. This method reduces the effects of the outliers and estimates the g-and-h parameters using the central portion of the data rather than the entire sample. The g-and-h distribution has been applied in many areas because it fits a great variety of unimodal distributions well and is easy to use in simulations. Among more recent references are Badrinath and Chatterjee (1988, 1991), Babbel (2002, 2005), Kafadar and Phang (2003), Field (2004), He and Raghunathan (2006), Dutta and Perry (2007), Walters and Ruscio (2009), Jimenez and Arunachalam (2011), and Strandberg and Iglewicz (2013).
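The trimming weights and plotting positions used by rQLS can be sketched as follows. The exact scaling of the residual inside the biweight (here simply division by the constant c) is an assumption for illustration.

```python
def biweight(u):
    """Tukey biweight: full weight at u = 0, decreasing smoothly to 0 at |u| >= 1."""
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def plotting_position(i, n):
    """p_i = (i - 1/3) / (n + 1/3), the quantile level of the ith order statistic."""
    return (i - 1.0 / 3.0) / (n + 1.0 / 3.0)

print(biweight(0.0))  # -> 1.0
print(biweight(2.0))  # -> 0.0  (observation effectively trimmed)
```

Observations whose standardized residual exceeds 1 in absolute value get weight 0 and therefore drop out of the quantile least squares fit, which is what makes the estimator resistant to a moderate proportion of contamination.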
The purpose of this article is to further study a number of aspects of outlier detection for larger samples, including the effect of a varying proportion of outliers on outlier detection quality, the performance of the BH method, the effect of the outlier distribution location on BP and BH performance, and the effect of the choice of base distribution on outlier detection. The article is organized as follows. Section 2 summarizes an extensive simulation study. Two relatively large microarray datasets are considered in Section 3. These are analyzed using BP and BH with several choices for base distributions, leading to further insight in determining the number of expressed genes. Conclusions are presented in Section 4.

Simulation Study
A simulation study is first performed to compare BH methods based on estimated g-and-h, theoretical, and empirical normal null distributions. As the Student t distribution appears commonly in the microarray literature, this distribution is used here as a base distribution. So, random samples generated from a g-and-h approximation of the t distribution with 10 degrees of freedom (t10), given by g-and-h parameters (0, 1.02, 0, 0.052), and its variants are used in this simulation study. The following scenarios were simulated to evaluate how well the BH method performed in identifying true outliers (a false null hypothesis is correctly rejected) and false outliers (a true null hypothesis is rejected by chance). The simulation was repeated 1000 times for each scenario based on using n = 10,000 or 1000 random observations, as specified.
Four sample generating scenarios are considered in Table 1. In SM1, the theoretical null distribution is assumed to be t distributed with 10 degrees of freedom (t10) and random samples x1, x2, . . . , xn (n = 10,000) are generated from the g-and-h distribution with parameters (0, 1.02, 0, 0.052) as a g-and-h approximation to t10. In scenario SM2, it is assumed that the theoretical t distribution is not valid and the sample distribution is skewed and more or less dispersed than a t distribution. To simulate such a scenario, random samples x1, x2, . . . , xn (n = 10,000) were generated from the g-and-h distribution (0, 0.7, 0.15, 0.052). This distribution has the same elongation as the t10 approximation in SM1 but is right skewed and less dispersed. Scenario SM3 consists of the same n = 10,000 random observations from SM2 plus contamination with 500 (5%) additional observations from N(5.82, 0.5), which is located at the 99.9968% quantile of the base distribution. Scenario SM4 again consists of 10,000 random observations from SM2, with 500 additional observations, 250 from N(−3.19, 0.5) and 250 from N(5.82, 0.5), which are located at the 0.0032% and 99.9968% quantiles of the base distribution, respectively.
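A scenario such as SM3 can be sketched by pushing standard normal draws through the g-and-h transform and appending Gaussian contamination. Parameter values follow the text; treating the second argument of N(·, ·) as a standard deviation is an assumption of this sketch.

```python
import math
import random

def gh_transform(z, A, B, g, h):
    """Tukey g-and-h transform of a standard normal draw z."""
    core = z if g == 0 else (math.exp(g * z) - 1.0) / g
    return A + B * core * math.exp(h * z * z / 2.0)

def simulate_sm3(n=10000, m=500, seed=0):
    """n null draws from g-and-h (0, 0.7, 0.15, 0.052) plus m outliers from N(5.82, 0.5)."""
    rng = random.Random(seed)
    base = [gh_transform(rng.gauss(0.0, 1.0), 0.0, 0.7, 0.15, 0.052) for _ in range(n)]
    outliers = [rng.gauss(5.82, 0.5) for _ in range(m)]
    return base + outliers

sample = simulate_sm3(n=1000, m=50)
print(len(sample))  # -> 1050
```

SM4 would follow the same pattern with half of the contaminating draws taken from N(−3.19, 0.5).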
The theoretical null distributions in SM1-SM4 are assumed to be t distributed with 10 degrees of freedom. The g-and-h parameters are estimated using the rQLS method on the original generated sample. Due to the simplicity of calculations, the theoretical and empirical normal null distributions are estimated from the transformed sample to compare with the g-and-h null distribution. Hereafter, they are referred to as the theoretical null distribution and the empirical null distribution, respectively. The transformation zi = Φ−1(Ft10(xi)), where Ft10(·) is the t10 cumulative distribution function and Φ the standard normal distribution function, is used to transform x1, x2, . . . , xn from the assumed t10 distribution to z1, z2, . . . , zn of the standard normal distribution. The FDR and local FDR results were obtained using the R package "locfdr," which allows estimating π0 for the theoretical null distribution N(0, 1) and (μ0, σ0, π0) for the empirical null distribution from the transformed random sample z1, z2, . . . , zn.
The following steps are used to identify outliers for the g-and-h estimated BH method:

1. For every observation, calculate the z-score based on the rQLS estimated g-and-h parameters by solving for zp in Equation (6) or (7).
2. Convert each z-score to its corresponding p-value Pi, i = 1, . . . , n. Find the largest k such that P(k) ≤ (k/n)α, where α is an a priori selected discovery rate (here α = 0.05 or α = 0.01) and P(k) is the kth ranked p-value.
3. Declare as outliers the observations corresponding to P(i), i = 1, . . . , k. This step is implemented using the "p.adjust" function in R with the "BH" option.

Table 1 presents the comparison of the BH decision rule with discovery rate 1% and 5% (denoted as BH1 and BH5 hereafter, respectively) based on the g-and-h null distribution estimated from the original sample, plus the theoretical and empirical null distributions from the transformed sample. This table contains the number of false outliers and the number of true outliers for the BH1 and BH5 methods. The results indicate that when the theoretical null distribution is valid, the g-and-h estimated null distribution approach performs roughly as well as the assumed theoretical null for both the BH1 and BH5 methods. However, when the theoretical null distribution is not valid, that is, the theoretical null distribution is assumed to be t10 when the true null distribution is skewed and less dispersed, the estimated g-and-h null distribution performs better than the theoretical and the empirical null distributions in terms of control of the FDR. In this uncontaminated case (SM2), BH5 detects 0.1 and 24.3 false outliers on average based on the estimated g-and-h null distribution and the empirical null distribution, respectively. That is, the empirical null method detects noticeably more false outliers than the estimated g-and-h null approach (24.3 vs. 0.1).
When outliers are on both sides (SM4), BH5 detects 494.0 true outliers with 29.4 false outliers based on the estimated g-and-h null distribution, compared to 250.5 true outliers with 2.3 false outliers based on the theoretical null and 490.5 true outliers with 118.9 false outliers based on the empirical null distribution. The one-sided outlier scenario (SM3) shows a similar pattern to the two-sided outlier case, especially for the empirical null approach. Overall, the estimated g-and-h null approach works well for all Table 1 cases, while both the theoretical and empirical nulls do not. Figure S1 in the online supplementary materials illustrates the histogram and the density curve of π0f0(x) for the estimated g-and-h distribution (lower panel), the theoretical (upper left), and the empirical (upper right) null distributions from one of
the simulated examples of SM2, where the true null distribution is skewed and less dispersed. The estimated g-and-h distribution fits the data well on both sides. Since the sample is less dispersed than its theoretical distribution, N(0, 1) does not fit the transformed sample on either side. The empirical null distribution fits the data better than does N(0, 1) but does not take account of the skewness. A sensitivity analysis is summarized in Table 2, where BP and BH are compared as the outlier distribution is moved further from the base distribution. Here the g-and-h base distribution is (0, 1, 0, 0.1), or (g, h) = (0, 0.1). This distribution has slightly lighter tails than the double exponential approximation with (g, h) = (0, 0.109). The outlier generating normal distribution, N(μ, σ), has mean ranging from 8.9 to 81.1, with σ = 1 or σ = 0.25. For n = 10,000 and m = 500 outliers from N(8.9, 1), BP identifies 0 false outliers, but only 3.4 true outliers out of 500. This is clearly unsatisfactory. BH5, on the other hand, identifies all 500 true outliers at a cost of 26.3 false outliers. The performance of BH1 falls in between, with the procedure identifying 490.5 true outliers out of 500 at a cost of identifying only 5.3 false outliers. For outliers coming from N(81.1, 1), all three procedures identify the 500 true outliers but at a cost of 0, 5.4, and 26.5 false outliers for BP, BH1, and BH5, respectively. In conclusion, BH is superior at identifying true outliers when the outlier generating distribution is a moderate distance from the base distribution, at a modest cost of identifying false outliers, while BP does well when the outlier distribution is far away from the base distribution. Similar results are observed for the σ = 0.25 case and when n = 1000 with 50 added outliers. Two similar tables with standard normal and (g, h) = (0.1, 0) base distributions are given in the online supplementary materials as Tables S1 and S2, respectively. Overall, results of Tables S1 and S2 in the online supplementary materials are similar to those of Table 2 in the text. Table 2 also reports the "some-outside rate," defined as the probability that a sample containing n random null observations leads to the identification of at least one false outlier. The some-outside rate is controlled by the BP method for random samples with no outliers. So, in simulating, say, 1000 samples of n = 10,000 random observations, BP leads to about 50 of the 1000 samples containing at least one false outlier. As these roughly 50 samples will contain few false outliers, this simulation of 1000 × 10,000 = 10,000,000 observations will lead to roughly 100 false outliers. In that sense, BP is a very conservative procedure. From Table 2, n = 10,000, it is observed that the some-outside rate is between 3.5% and 3.9% for BP, while it is 100% for BH5. Thus, BH5 identifies far more false outliers among the 10,000,000 sampled uncontaminated observations. Table 3 contains results based on different proportions of outliers, with samples generated from the g-and-h distribution (g, h) = (0.1, 0), n = 10,000 regular observations, and outliers coming from N(μ, 1), with μ = 4.9 and 10.1, respectively. The numbers of outliers are 0, 50, 100, 250, and 500, representing 0%, 0.5%, 1%, 2.5%, and 5% of n, respectively. The case with N(10.1, 1) outlier distribution results in the detection of all outliers, making BP the best of the three considered methods for this distant outlier case. The more interesting case deals with outliers generated from N(4.9, 1). Then, BH5 is a better choice as it identifies the highest proportion of true outliers. That leads to: 26.3 out of 50 identified outliers, or 52.6%; 67.6 out of 100, or 67.6%; 211.6 out of 250, or 84.6%; and 386.2 out of 500, or 77.2%.
So the proportion of true outliers detected by BH5 depends on the number of outliers in the sample, with a higher proportion typically detected as the number of outliers increases, up to a point after which the detected proportion of true outliers seems to decrease.
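The some-outside rate discussed above can itself be approximated by Monte Carlo. The sketch below uses illustrative sample sizes and repetition counts, not those of the study, and a fixed k rather than one calibrated to n.

```python
import random
import statistics

def some_outside_rate(n=200, reps=500, k=1.5, seed=1):
    """Fraction of null N(0,1) samples in which the boxplot rule flags >= 1 point."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
        lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        hits += any(v < lo or v > hi for v in x)
    return hits / reps

# With k = 1.5 and n = 200, most normal samples contain at least one "outside"
# point, which is why k must grow with n to control the some-outside rate.
rate = some_outside_rate()
```

Increasing k shrinks this rate toward zero, mirroring how Banerjee and Iglewicz (2007) calibrate k to n.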
Although this study deals with larger sample cases, a limited investigation was performed for smaller sample sizes. Table S3 in the online supplementary materials provides some results for n = 100. These indicate that the BH5 method, based on the estimated g-and-h base distribution, performs reasonably well even for n = 100, identifying roughly 4.5 true outliers out of 5, on average, at a cost of less than one false outlier out of 100 regular observations. In general, the rQLS estimates of the g-and-h parameters are less efficient for smaller sample sizes, which is expected to make the set FDRs somewhat less reliable. Similarly, the smaller sample boxplot FWER approach will be affected by the approximation of the sample quartiles by population quartiles, in addition to the less efficient estimates of the g-and-h parameters. This may not be a problem in exploratory investigations. For example, an interim data investigation of 300 human temperature readings may include an outlier value of 139, leading to extra care in data recording. In summary, the discussed FDR g-and-h outlier method works well for larger samples and should also provide reasonable results for more moderate sample-size cases.

Data Examples
Two datasets are used in this section to illustrate the application of outlier detection based on the g-and-h null distribution to relatively large sample size microarray experiments. Here, the "outliers" are genes of interest, whose expression is significantly increased or decreased as compared to other genes. The first set of observations is called the Swirl dataset. The Swirl experiment (Smyth 2005) used zebrafish as a model to identify genes with differential expression in the Swirl mutant zebrafish compared to wild-type zebrafish. The raw data contain 8448 gene expression measurements from four Swirl mutant and four wild-type zebrafish. The raw data were transformed by normalization; a linear model and empirical Bayes analysis were then applied to the normalized data, and the final output is a univariate statistic, which theoretically should be t distributed with seven degrees of freedom. The second dataset is the HIV data (Efron 2013), which is also microarray data consisting of n = 7680 genes comparing four HIV-positive subjects with four healthy controls. The data used here contain the two-sample t statistics with six degrees of freedom.
For both datasets, the "locfdr" function is used to estimate the empirical null distribution following the transformation of the t statistics to z values, with degrees of freedom 7 and 6, respectively. Since the rQLS method has the restriction h > 0, the original Student t statistics are used to estimate the g-and-h null distribution parameters. Figure 1 shows the histograms of the Swirl data and π0f0(z) for the theoretical null distribution N(0, 1), the empirical null distribution, and the estimated g-and-h null distribution. It is apparent that the actual null distribution is more dilated than N(0, 1) and left skewed. The empirical null fits the tail of the data poorly, and the estimated π0 is greater than 1, which does not make practical sense. On the other hand, the estimated g-and-h distribution fits the data fairly well. From Table 4 with the BH5 method, it is observed that use of the theoretical null leads to detection of 156 expressed genes; this number changes to 0 for the empirical null and 43 for the estimated g-and-h distribution approach. Note that the empirical null method did not find any differentially expressed genes for this dataset. The corresponding numbers of differentially expressed genes for the HIV data example were 18, 107, and 13 for the theoretical null, empirical null, and estimated g-and-h distribution approaches, respectively. Table 4 also presents the number of outliers found by the BP and BH1 methods. These results indicate that, for the Swirl dataset, the BP method based on normal distribution assumptions cannot efficiently detect outliers.
Additionally, Table 5 shows the top-ranked 20 genes identified by BH5 based on the estimated g-and-h null distribution and their corresponding ranks based on the theoretical null distribution. Note that the skewness estimated by the g-and-h distribution changed the rank of gene "BMP2" from number 1 based on the theoretical null distribution to number 7 based on the estimated g-and-h distribution. It also moves the genes ranked 2 and 3 under the theoretical null to ranks 10 and 13, respectively, among other reorderings. In summary, the choice of null distribution can have a profound effect on the importance rankings of the expressed genes. Figure 2 presents the HIV data. The histogram of transformed z-values indicates that these data are slightly right skewed and less dispersed than the standard normal distribution; however, the tail is heavier than the estimated empirical null density curve. The estimated g-and-h parameters (0.126, 0.812, 0.053, 0.133) confirm that it is a right skewed, less dispersed (B < 1), and heavier tailed distribution compared to t(6), whose g-and-h approximation parameters are (0, 1.043, 0, 0.091). Again, from Table 4, the numbers of BH5 discoveries based on the theoretical, empirical, and estimated g-and-h distributions are 18, 107, and 13, respectively.

Although the illustrated examples deal with two microarray datasets, there are many other types of situations where the proposed outlier testing methods can be fruitful. As an example, the proposed methods can be applied to larger clinical trials, especially when studying non-endpoint measurements, where outliers can indicate incorrect recordings, laboratory readings, or patients with more extreme reactions to treatment.

Conclusions
This article deals with relatively simple outlier detection methods designed for larger univariate samples. Such methods have the potential for use in a variety of statistical applications. Investigated are the FWER BP outlier detection method and also the far less common false discovery BH outlier detection approach. In addition, the g-and-h distribution, estimated by the robust rQLS method, is proposed to be used as the base distribution for larger sample datasets. This is shown to be a superior approach as data often do not follow the assumed theoretical null distribution, or even a symmetry assumption. When the data show skewness or heavier or lighter tails than normal, the theoretical null or empirical normal null distribution is no longer valid. We have seen that a robustly estimated g-and-h distribution produces superior results in such situations. The flexibility of the fitted g-and-h distribution makes it an appropriate choice as a practical null distribution. In addition, the robust estimation method, rQLS, reduces the impact of outliers when estimating the g-and-h null distribution parameters. The studied robustly estimated g-and-h parameter null distribution approach is especially useful when the null distribution is unknown.
The simulation results suggest that the location of the outliers has an impact on the performance of the BP and BH methods. In larger samples, more nonoutliers may appear as potential outliers. In such cases, BH will identify more true outliers with a controlled number of false outliers. On the other hand, the BP method identifies a reasonable number of true outliers and fewer false outliers. In some cases, where the outliers are observed to be far away from the bulk of the data, the BP method identifies a similar number of true outliers as the BH false discovery method but with fewer false outliers. Overall, the BH5 method is superior in practice to the FWER BP method, except when the outliers are far away from the true null distribution.
These conclusions are reinforced by the two relatively large sample microarray data examples discussed in this article. The graphs in Figures 1 and 2 show that the theoretical and empirical null distributions do not fit the data well, especially at the tails, while the robustly estimated g-and-h distribution provides a far better fit. Thus, using the estimated g-and-h as a base distribution is expected to yield more accurate identification of relevant genes that may be of special interest for further scientific study.
Although the proposed BH5 approach coupled with an estimated g-and-h base distribution is designed for larger sample sizes, it should perform reasonably well for more moderate values of n, especially when used in exploratory data investigations. Although a limited smaller sample-size study is reported here, the more moderate sample-size issue needs further investigation.

Supplementary Materials
The online supplementary materials contain an additional figure and tables.