Bayesian screening for feature selection

ABSTRACT Biomedical applications such as genome-wide association studies screen large databases with high-dimensional features to identify rare, weakly expressed, important continuous-valued features for subsequent detailed analysis. We describe an exact, rapid Bayesian screening approach with attractive diagnostic properties based on a Gaussian random mixture model. The approach focuses on the missed discovery rate (the probability of failing to identify potentially informative features) rather than on the false discovery rate ordinarily used with multiple hypothesis testing. The method provides the likelihood that a feature merits further investigation, as well as distributions of the effect magnitudes and of the proportion of features with the same expected responses under alternative conditions. Important features of the method include the dependence of the critical values on clinical and regulatory priorities and direct assessment of the diagnostic properties.


Introduction
Many investigations, especially those using modern data collection, generate large numbers of continuous-valued data measurements ("features") per experimental unit. Most of these features are unlikely to be relevant, and obtaining large numbers of good observational units is often difficult. Classical methods such as discriminant analysis are unlikely to be useful for identifying relevant features when there may be thousands of measurements and few observational units (Donoho and Jin 2009b). Typical investigations include biomedical applications such as genomics based on gene microarrays (Ahdesmäki and Strimmer 2010) and genome-wide association studies (GWAS) (Barnett et al. 2017), metabolomics (Franceschi 2012a, 2012b), drug safety (Gould 2008, 2013, 2018), industrial applications, and, recently, economic applications (Cai and Sun 2017a). The effects of rare and weakly expressed important features may be obscured by the noise due to features that are unrelated to the attributes being studied. Such investigations exemplify large-scale inference (LSI), which addresses two basic issues: determining which of a large collection of potential explanatory variables or regressors should be retained in a model, and identifying which elements of very large observation vectors, based on relatively few independent observations, merit further consideration (Donoho and Jin 2015). This paper addresses the second issue.
Screening an extensive series of measurements to identify relatively rare features that express potentially interesting differences ('signals') entails two basic aims: detection, to determine if an ensemble of features distinguishes between the attributes being studied, and identification, to identify which features might be informative. Depending on the number of features and the sparsity and weakness of the signals, it may not be possible to detect, let alone identify, potential signals (Donoho and Jin 2004, 2008, 2009b). Subsequent analyses become more efficient and informative when the data dimensionality is reduced by discarding non-informative features. There is a substantial literature on the properties of methods for detecting signals from large databases when the signals are rare and weak. The missed discovery rate (MDR) becomes important in this setting because of the difficulty of identifying true "signals" using conventional tools such as the FDR (False Discovery Rate) that may be more effective in the rare signal setting when the signals are not weak (Donoho and Jin 2015; Klaus and Strimmer 2013; Li et al. 2019). The gene expression data set described by Golub et al. (1999) provides a suitable illustrative example. The goal is to find genes of acute lymphoblastic leukemia (ALL) that are differentially expressed from genes of acute myeloid leukemia (AML). Each feature represents a gene as measured by gene microarray technology. A successful screening procedure reduces a large collection of genes to a manageable number of potentially "interesting" genes while retaining an acceptably small probability of missing the "interesting" genes. This well-known data set has been used in various publications, and we use it to show the utility of our method in the Examples section.

Conventional screening methods
Frequentist-based screening methods have been studied extensively, including variations on the positive False Discovery Rate (pfdr) (Storey 2002, 2003, 2007; Storey and Tibshirani 2003; Storey et al. 2004, 2019), an empirical Bayes method, ashr (Stephens 2017; Stephens et al. 2019), that enables estimation of the FDR as well as the corresponding effect sizes, and methods tailored to the use of penalty functions to address issues arising in the evaluation of high-dimensional data (Fan et al. 2014; Fan and Lv 2010; Fan and Song 2010). Decision-theoretic approaches also have been described, differing largely in how the loss functions are defined (Longford 2013, 2014; Wu and Pena 2013).
The "Higher Criticism" strategy, suggested originally by John Tukey for determining whether an aggregate collection of hypothesis test results provides sufficient evidence to conclude that at least some of the tests may reflect real differences, has been addressed extensively in the literature (Donoho and Jin 2004, 2008, 2009b; Ingster 1999; Jin and Ke 2016; Li and Siegmund 2015). Different statistics can be used to implement the strategy, including a "Higher Criticism" (HC) statistic (Donoho and Jin 2015) and its variations, the Berk-Jones (BJ) statistic (Berk and Jones 1979; Wellner and Koltchinskii 2003; Zhang and Wu 2018), and the Average Likelihood (AL) statistic (Walther 2013). HC statistics also can be used to identify features for further evaluation. Whether individual features can be identified, or whether one can even decide if any feature differences can be detected, depends on the sparsity of individual signals and on their strength, which can be clarified using a phase diagram (Cai et al. 2007; Donoho and Jin 2004; Ingster 1999; Ji and Jin 2012; Jin and Ke 2016; Klaus and Strimmer 2013; Xie et al. 2011). Detectability is not an issue for the examples considered here.
The HC and BJ statistics are examples of a general class of statistics called ϕ-divergence statistics (Jager and Wellner 2007). Let X_1, ..., X_N denote the elements of an observation vector that are independently distributed on the real line according to F(x). The objective is to test H_0: F = F_0 against H_1: F ≠ F_0, where F_0 denotes the cdf of a uniform (0, 1) distribution. The empirical cdf of any observation vector is F_N(x) = N^(-1) Σ_{i=1}^{N} 1(X_i ≤ x). Let p_i, i = 1, ..., N, denote 1-sided or 2-sided p-values [1 − F_0(x_i) or 2(1 − F_0(|x_i|))] obtained by testing H_0 against H_1 for each of the X_i. Following Zhang et al. (2017), let p_(1) ≤ p_(2) ≤ ... ≤ p_(N) denote the order statistics, define a supremum domain R in terms of specified constants k_0, k_1, α_0, and α_1, and consider a general family of statistics for testing H_0: p_i ~ U(0, 1) given by suprema over R of a function f of i/N and p_(i), where the definition of f depends on the statistic in the family. The ϕ-divergence statistics of Jager and Wellner (2007) for 2-sided tests are members of this family.
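For concreteness, the HC statistic of Donoho and Jin (2004), in one commonly used form, is a member of this family; the following is a standard expression, not a reproduction of the paper's display equations:

```latex
\mathrm{HC}_N^{*} \;=\; \max_{1 \le i \le \alpha_0 N} \sqrt{N}\,
\frac{i/N - p_{(i)}}{\sqrt{p_{(i)}\bigl(1 - p_{(i)}\bigr)}},
\qquad 0 < \alpha_0 \le 1 ,
```

i.e., the largest standardized discrepancy between the ordered p-values and their expectations under the uniform null, maximized over the smallest α_0 N p-values.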

Bayesian screening methods
Bayesian feature selection methods have been described in the literature, primarily aimed at identifying genes whose expression levels differ among different experimental conditions, either in terms of feature expression intensity or gene counts. With few exceptions, these methods use MCMC calculations, which may limit their practicability for very large feature sets. Do et al. (2005) describe a Dirichlet process mixture model for the probability that a gene is expressed differently under alternative conditions. Posterior inference for the model is carried out using MCMC simulation. Lewin et al. (2006) use an ANOVA model to represent the logarithm of the expression of genes evaluated using microarrays under alternative experimental conditions as normally distributed with a mean that depends on an overall effect for each gene, a gene-specific differential effect for the condition, and an effect corresponding to the array that is a quadratic spline function of the expression level. The gene- and condition-specific variances of the expression levels are assumed to be exchangeable. MCMC calculations are used to obtain posterior distributions of the parameters. Differential gene expressions are identified by sufficiently large values of the posterior probability that the gene-specific differential effect for the condition exceeds a predefined critical value.
In a later paper, Lewin et al. (2007) use a three-component mixture at the parameter level to model separately the populations of over-expressed, under-expressed, and non-differentially expressed genes. The choice of the mixture prior can have a large effect on the classification of genes, and simulations cannot provide general guidance as to the best model, so model checking methods are needed. This approach also requires MCMC calculations, for which software is available. Yu et al. (2008) use a hierarchical random effect model and posterior distributions of Bayes factors to select differentially expressed genes without assuming exchangeability of Bayes factor distributions across genes. The method is based on computing, for each gene, the relative probability of incorrectly concluding the null hypothesis of no differential effect for the gene. Conjugate prior distributions for the model parameters, which may include historical information, are used to construct gene-specific posterior distributions for the Bayes factor and to calculate critical values for the probability that there is no differential effect. In general, determination of the critical value will require MCMC calculations. Yu et al. (2015) subsequently proposed "confident difference criterion" Bayesian gene selection algorithms, one based on the standardized differences between two mean expression values among genes and the other incorporating as well the differences between two variances. These methods evaluate the posterior probability of a gene having different gene expressions between alternative conditions and declare a gene to be differentially expressed if this posterior probability is large. Gibbs sampling is needed to obtain samples from the posterior distributions of the parameters under hierarchical normal models with conjugate priors. Van de Wiel et al.
(2013) focus on negative binomial count models in a Bayesian generalized linear model setting and propose a hybrid full Bayes and empirical Bayes approach in which priors of crucial parameters are estimated and arbitrary parametric and non-parametric priors are allowed for other parameters. The calculations are based on Integrated Nested Laplace Approximations (INLAs) for latent Gaussian models (Rue et al. 2009) that avoid the need for MCMC calculations. Dadaneh et al. (2018) describe methods for detecting differentially expressed genes that model high-throughput sequencing count data using a gamma-negative binomial process or a beta-negative binomial process (BNBP). Posterior distributions of the model parameters are obtained using Gibbs sampling. Detection of differentially expressed genes is based on tests of a null hypothesis that the posterior distributions of the gene-specific regularized (by gamma or beta processes) negative binomial shape parameters are the same across different experimental conditions. Instead of standard hypothesis testing methods, their approach measures the distances between gene-specific posterior distributions using symmetric Kullback-Leibler divergence.
Fisher and Mehta (2015a,b) describe an approach for feature selection, the "Bayesian Ising Approach" (BIA), that can be used to obtain posterior probabilities corresponding to L2-penalized regression models without the need for MCMC calculations. Their method relates an n-dimensional vector y of responses to an n × p matrix of features X (p ≫ n) by y = Xβ + η, with the aim of identifying the nonzero elements of β using L2-penalized regression, i.e., by minimizing U(β) = (y − Xβ)′(y − Xβ) + λβ′β, where λ is a penalty factor. The metric used is the expected value of the posterior probability that β ≠ 0 for each element of β. Instead of the MCMC calculations that a conventional Bayesian approach would use, Fisher and Mehta propose an alternative that implements statistical physics ideas for studying high-dimensional regression based on "magnetizations of an Ising model with weak couplings". A key quantity for this purpose is the value of a natural scale parameter λ* such that the method will identify nonzero β values accurately when λ ≪ λ*, but will be unreliable when λ ≫ λ*. A plot of the estimated values of the expected probability that β ≠ 0 against a range of values of λ*/λ for each feature, referred to as a feature selection path, provides a way to identify relevant features. The purpose is to screen variables (features) rapidly to identify variables with low correlation to condition states (e.g., treatment) that can be removed before undertaking more intensive calculations. Rate-limiting factors include the need to compute high-dimensional correlation matrices and the fact that correlations can strongly affect the posterior probabilities.
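The L2-penalized criterion U(β) above has the familiar closed-form ridge solution β̂ = (X′X + λI)⁻¹X′y, which is the quantity the BIA machinery approximates without MCMC. The following sketch (simulated data and an arbitrary penalty value, not Fisher and Mehta's algorithm) shows the criterion being minimized:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # p >> n, as in the BIA setting
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = 2.0                  # three truly nonzero coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Minimizer of U(beta) = (y - X beta)'(y - X beta) + lam * beta'beta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_hat = ridge(X, y, lam=1.0)
# Sweeping lam and tracking the largest |beta_hat| entries mimics
# the idea of a feature selection path
top = np.argsort(-np.abs(beta_hat))[:3]
```

Sweeping λ over a grid and plotting each coordinate of β̂ against λ*/λ would give a rough analogue of the selection paths described above.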

Proposed approach
The method presented here is motivated by previous work on drug safety for Bayesian screening of adverse effects (Gould 2008, 2013, 2018) based on random mixtures of distributions of event counts. The features in safety applications typically are counts of events generated by single-parameter binomial or Poisson probability functions recorded in parallel test-control arms. The same method also applies to screening based on gene counts if binomial likelihoods are replaced with negative binomial likelihoods. The present paper extends the application scope of the approach to multi-parameter distributions of continuous feature values with particular emphasis on normally distributed features. The method retains the useful statistical properties and computational efficiency previously demonstrated for the Bayesian paradigm, particularly with regard to inferences about the mixture parameters and effect sizes. We also describe the diagnostic properties of the method. Although the mathematical models used here are superficially similar, the present approach has some fundamental differences from previous approaches. First, the focus is not on detecting features that differ between experimental conditions with control of the false discovery rate (FDR), as is the case for most current methods. Instead, the focus is on providing a statistically valid tool that subject matter experts can use to identify and, if appropriate, remove from further consideration features that are unlikely to differ materially between experimental conditions, so that the missed discovery rate (MDR) becomes the important metric. Reducing the dimensionality of the problem of identifying important features improves the efficiency and effectiveness of methods specifically aimed at identifying the important genes/features, including those described briefly above (Fisher and Mehta 2015a).
Secondly, when the experimental conditions can be identified formally as "control" or "test", the posterior distributions of metrics expressing differences between the conditions can be conditional on the 'control' feature response, because the issue addressed is whether the process generating the expected response of a feature to the "test" condition is the same as (or differs from) the process generating the expected feature response to the 'control' condition. Thirdly, the strategy for determining the prior distributions of key model parameters takes the apparently novel approach of explicitly incorporating clinical and regulatory requirements about acceptable error rates. Finally, the calculations are exact (non-asymptotic) and rapid because they do not require MCMC calculations. The paper is structured as follows: We first introduce the method and define the underlying generative model. We then describe the Bayesian screening calculations and provide simulated and real-life examples including comparisons with HC-based approaches (Donoho and Jin 2015; Walther 2013; Zhou et al. 2019) and an FDR-focused screening approach (Storey 2002, 2003, 2007; Storey and Tibshirani 2003; Storey et al. 2004, 2019). A comparison is also provided for the real-life example using a Bayesian approach with conveniently usable software that does not require MCMC computations (Fisher and Mehta 2015a). The comparisons are by no means exhaustive and, in the interest of maintaining focus and computational practicability, do not include other Bayesian methods described above or other methods that have been described in the literature (Fan et al. 2014; Fan and Lv 2008, 2010; Fan and Song 2010). Finally, we discuss the benefits and limitations of the method and opportunities for future research. An annotated example of the use of the R code and a listing of the R code are provided in the Supplemental information.

Data
The data consist of collections of observations on N feat features reflecting the responses to a control or a test treatment.
Y_T = {Y_Ti} denotes the feature-specific means of the observations on the test treatment and Y_C = {Y_Ci} the corresponding means on the control treatment; s²_Ci and s²_Ti denote estimates of the within-feature variance for each treatment that are independent of each other and of the observed responses; and u_Ti = m_Ti s²_Ti and u_Ci = m_Ci s²_Ci denote the sums of the residual squared errors for the responses to feature i on the test and control treatments, respectively, where m_Ti and m_Ci denote the corresponding degrees of freedom. The pooled residual sum of squares and degrees of freedom for feature i are u_i = u_Ti + u_Ci and m_i = m_Ti + m_Ci. It is convenient to think of C and T as denoting control and test treatments, respectively, but this distinction is not essential. C and T could refer to two phenotypes, and the objective could be to identify genes that are expressed differently between the two phenotypes (section 3.2 provides an example).
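As a minimal sketch with hypothetical replicate-level data for one feature, the per-arm and pooled residual quantities defined above can be computed as follows; note that m s² with the usual unbiased variance estimate equals the residual sum of squares:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical replicate-level responses for one feature under each arm
y_C = rng.normal(4.0, 1.0, size=5)
y_T = rng.normal(4.0, 1.0, size=5)

m_C, m_T = len(y_C) - 1, len(y_T) - 1        # degrees of freedom per arm
u_C = m_C * np.var(y_C, ddof=1)              # u_Ci = m_Ci * s^2_Ci
u_T = m_T * np.var(y_T, ddof=1)              # u_Ti = m_Ti * s^2_Ti

u_pooled = u_C + u_T                         # pooled residual sum of squares
m_pooled = m_C + m_T                         # pooled degrees of freedom
```

With 5 replicates per arm this gives m_i = 8, matching the degrees of freedom used in the simulated-data example later in the paper.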

Model
The calculations are described in terms of normally distributed outcomes for computational convenience although, as the Appendix makes clear, the assumption of normality is not essential. The observed mean feature values Y_Ci and Y_Ti are assumed to have normal likelihoods with respective means μ_Ci and μ_Ti and precisions a_Ci η_i and a_Ti η_i (η_i = 1/σ²_i, where σ²_i denotes the feature-specific residual variance), i = 1, ..., N_feat. Let a_C = {a_Ci, i = 1, ..., N_feat} and a_T = {a_Ti, i = 1, ..., N_feat} denote the vectors of precision multipliers.
The means μ_Ci and μ_Ti have normal prior distributions with respective expected values θ_Ci and θ_Ti and precision b_0 η_i, and the precision parameters η_i are assumed to have a gamma prior distribution with parameters ζ_1/2 and ζ_2/2. The value of θ_Ti is a random mixture θ_Ti = (1 − γ_i) θ_Ti^(C) + γ_i θ_Ti^(T) of the values that would apply under two conditions: that the same process generated the expected responses of feature i to the test and control treatments (γ_i = 0), or that different processes did (γ_i = 1). The mixture indicator variable has a Bernoulli prior distribution with parameter π_0, which is generated from a beta (ξ_1, ξ_2) prior density. The quantity π_0 is a parameter of the model for which a posterior distribution will be generated based on the observations, as opposed to conventional methods, which assume that π_0 is a fixed quantity whose value must be specified or estimated.
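A minimal forward simulation of this generative model can be sketched as follows. The hyperparameter values are illustrative assumptions only, and γ_i is coded so that γ_i = 0 corresponds to the same-process condition (so the same-process proportion is π_0):

```python
import numpy as np

rng = np.random.default_rng(2)
N_feat = 1000
# Hypothetical hyperparameter values, chosen only for illustration
zeta1, zeta2, b0, a = 12.0, 10.0, 2.0, 5.0
theta_C, delta = 4.0, 4.0            # delta: shift under a different process
xi1, xi2 = 8.0, 2.0

pi0 = rng.beta(xi1, xi2)             # P(same process), itself random
gamma = rng.binomial(1, 1.0 - pi0, size=N_feat)   # 1 = different process
eta = rng.gamma(zeta1 / 2.0, 2.0 / zeta2, size=N_feat)  # precision 1/sigma^2

theta_T = theta_C + delta * gamma    # mixture of the two process means
mu_C = rng.normal(theta_C, 1.0 / np.sqrt(b0 * eta))
mu_T = rng.normal(theta_T, 1.0 / np.sqrt(b0 * eta))
Y_C = rng.normal(mu_C, 1.0 / np.sqrt(a * eta))
Y_T = rng.normal(mu_T, 1.0 / np.sqrt(a * eta))
```

Note that the gamma prior with parameters ζ_1/2 and ζ_2/2 (shape, rate) corresponds to `rng.gamma(zeta1/2, 2/zeta2)` in NumPy's shape/scale parameterization.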
The parameters of the distribution of the difference or ratio metric corresponding to any feature depend on the process parameters (the values of θ_Ti^(C) and θ_Ti^(T)) and on scientific, clinical, or regulatory requirements. The approach extends a previously described method for count data (Gould 2018) to realizations of arbitrary likelihoods.
The value of θ_Ti^(C), the a priori mean of the expected response of feature i to the test treatment when the same process generates the expected responses to the control and test treatments given the responses of feature i to the control treatment, is assumed to be the same as θ̃_Ci, the mean of the posterior density of μ_Ci that incorporates the observed response Y_Ci and the mean θ_Ci of the prior density of μ_Ci. This assumption is tenable because equality of processes means that θ_Ti = θ_Ci and, in the absence of information about θ_Ti, the best a priori estimate of θ_Ti for feature i given the control treatment responses is θ̃_Ci when the same process generates the expected responses to the two treatments.
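Under the normal likelihood and normal prior above, θ̃_Ci is the standard precision-weighted (normal-normal conjugate) posterior mean; the common factor η_i cancels from the weights. A sketch with illustrative values:

```python
def posterior_mean(theta_prior, y_obs, b0, a):
    """Precision-weighted posterior mean of mu_Ci under the normal-normal
    conjugate model: prior precision b0*eta, likelihood precision a*eta
    (eta cancels from the weights)."""
    return (b0 * theta_prior + a * y_obs) / (b0 + a)

# Illustrative values: prior mean 4.0 pulled toward the observation 5.0
theta_tilde = posterior_mean(theta_prior=4.0, y_obs=5.0, b0=2.0, a=5.0)
# theta_tilde = (2*4 + 5*5) / 7 = 33/7, between the prior mean and the data
```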

Prior distribution parameters
The metric M_i (the difference or the ratio of the expected responses) expresses the effect of the test treatment on the response of feature i. A key aspect of the method is that regulatory and clinical requirements, along with the response of feature i to the control treatment, determine the prior distributions of μ_Ti and M_i. M_i should be unlikely to exceed a critical value M_crit, so that a priori P(M_i > M_crit) ≤ ϕ_1 (e.g., ϕ_1 = 0.1 or 0.2). The biological and regulatory requirements for the screening application are expressed by (1) and (2). Figure 1 illustrates the stochastic ordering relationship. Section D of the Appendix describes how the requirements expressed by (1) and (2) determine the parameters of the prior distribution of M.
The focus of what follows is on the arithmetic difference between the expected responses of a feature to the control and test treatments so that henceforth the text often refers to D i and d crit instead of M i and M crit .

Screening calculations
The calculations produce three key results: (a) the posterior probability that the same process generated the C and T expectations for feature i, P_SPi = P(γ_i = 0) ["SP" means "same process"]; (b) values of the posterior cdf of M_i, for assessing the magnitude of the effect of the test treatment relative to the control; and (c) the posterior distribution of the proportion π_0 of the features for which the same process generated the expected values for the control and test treatments.
The P_SPi values express the strength of the current evidence as to whether the same process generated the expected responses to the control and test treatments for feature i. A P_SPi value less than a predefined constant, say ω, would imply that different processes generated the expected feature responses, i.e., that the i-th feature is potentially informative. When there are two conditions (A and B, say) and there is no inherent reason to regard one as a control and one as a test, then the screening process can be run twice, first with A regarded as the test and B as the control, and then with B regarded as the test and A as the control (see section 3.2 for an example).

Diagnostic properties
The practical utility of the method depends on its specificity, sensitivity, and positive and negative predictive values. These values are determined by assumptions about the true distributions of expected values of the feature responses to the control and test treatments and the number of replicated observations on the test and control treatments for each feature. The values of the parameters of the marginal densities of the observed responses of the features to the test and control treatments can be specified subjectively, reflecting a desire to evaluate the robustness of the method, e.g., by considering 'worst cases', or can be based explicitly on the findings from the current (or another completed) experiment.
Suppose that the posterior density of the mixing parameter π_0 determined from the current trial, provided by (A6), also describes the predictive distribution of this parameter in a new trial. The diagnostic properties can be calculated for each combination of control parameters (ϕ_0, ϕ_1, M_crit) used for the analysis of the current trial (each combination provides a posterior density of π_0) and for various sample sizes for the future trial.
Simulation of the screening process as it applies to a current experiment proceeds as follows. The current experiment provides vectors of observed feature mean responses Y_C and Y_T, values of the pooled within-feature residual squared errors, u_C and u_T with, respectively, m_C and m_T degrees of freedom (df), and the posterior density of π_0. The posterior expectations of θ_C, θ_T, ζ_1, and ζ_2 based on the responses to the control and test treatments define predictive distributions of future observations. It would be advisable to confirm that the marginal densities of the responses corresponding to the control treatment with the posterior expected parameter values fit the observed data acceptably using, e.g., the Kolmogorov-Smirnov or Cramér-von Mises tests.
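Such a goodness-of-fit check might be sketched as follows using the Kolmogorov-Smirnov test; the data and the candidate marginal density (a normal distribution with stand-in parameter values) are illustrative assumptions, not the paper's fitted marginal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical observed control-treatment feature means
y_C = rng.normal(4.0, 0.5, size=500)

# Kolmogorov-Smirnov test of the observed responses against the marginal
# density implied by the posterior expected parameter values (stand-in
# values mean = 4.0, sd = 0.5 used here for illustration)
ks_stat, ks_p = stats.kstest(y_C, "norm", args=(4.0, 0.5))
# A small ks_p would suggest the assumed marginal fits the data poorly
```

`scipy.stats.cramervonmises` could be substituted for `kstest` to carry out the Cramér-von Mises version of the same check.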
S replications of the current experiment are to be done, with a_Ci ≡ a_Ti ≡ a, i = 1, ..., N_feat, for each replicate. Each replication starts with N_feat observations v drawn from a beta (m/2, ζ_1/2) density, which provide N_feat simulated residual sums of squares u = ζ_2 v/(1 − v). N_feat random values π_0i, i = 1, ..., N_feat, are drawn from the posterior density of π_0 based on the current experiment, and corresponding realizations γ̃_i are drawn from Bernoulli distributions with parameters π_0i, i = 1, ..., N_feat. Separate values are obtained for each replicate, and the simulated feature responses for the s-th replicate are generated from the model with these values. The values of p^(s)_SPi, s = 1, ..., S [Appendix eqn (A8)], are the probabilities that the same process generates the expected responses of feature i to the control and test treatments. Asserting that a feature may be affected by the test treatment differently from the control treatment depends on the magnitude of this probability, and whether this identification is correct or not depends on the corresponding value of γ̃_i. The d_ab values, a, b ∈ {0, 1}, defined as the proportions of features for which γ̃_i = a and the screening procedure reaches decision b, estimate the probabilities that the screening process correctly identifies the features that are and are not affected by the test treatment. The averages of these values over the S replicates provide the basis for evaluating the diagnostic properties of the screening method; let H denote the 2 × 2 matrix consisting of these averages. The diagnostic properties of the screening procedure, considered as functions of the specifications (M_crit, ϕ_0, ϕ_1, ψ), are functions of the elements of H. Neither the sensitivity nor the specificity depends on the marginal distribution of the comparison metric M (difference D or ratio R). However, the FDR (False Discovery Rate) and MDR (Missed Discovery Rate) can be sensitive to the event prevalence. It is easy to show that FDR = (1 + UR)^−1 and MDR = (1 + V/R)^−1, where R = P(different processes)/P(same process), U = Sensitivity/(1 − Specificity), and V = Specificity/(1 − Sensitivity).
The quantities U and V are independent of the probability that the same process generates the expected responses, but R is not. The FDR increases and the MDR decreases as R decreases, so that the FDR could be large when few features are affected differently by the test and control treatments. Good control of the MDR requires high sensitivity, regardless of the specificity. Figure 2 illustrates the relationships.
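The closed forms FDR = (1 + UR)⁻¹ and MDR = (1 + V/R)⁻¹ can be checked directly against the defining ratios FDR = FP/(FP + TP) and MDR = FN/(FN + TN); the sensitivity, specificity, and prevalence values below are arbitrary illustrations of the rare-signal setting:

```python
def fdr_mdr(sensitivity, specificity, prev_diff):
    """FDR and MDR from the closed forms in the text, where
    R = P(different)/P(same), U = Se/(1-Sp), V = Sp/(1-Se)."""
    R = prev_diff / (1.0 - prev_diff)
    U = sensitivity / (1.0 - specificity)
    V = specificity / (1.0 - sensitivity)
    return 1.0 / (1.0 + U * R), 1.0 / (1.0 + V / R)

# Rare signals: only 0.5% of features truly differ
fdr, mdr = fdr_mdr(sensitivity=0.9, specificity=0.95, prev_diff=0.005)
# Even with good sensitivity and specificity, the FDR is large here
# while the MDR stays small, as discussed in the text
```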

Comparison with conventional screening
Conventional screening approaches such as the pfdr method (Storey 2002, 2003, 2007; Storey and Tibshirani 2003; Storey et al. 2004, 2019) identify potentially important features such as differentially expressed genes by repeated significance testing, with corrections to the significance levels to control the effect of multiple testing on the FDR. A feature is "flagged" as potentially important if a corrected significance level (often called a "q-value") is less than a specified critical value q_crit. The value of q_crit is chosen to control the FDR at an acceptable level.
The Bayesian screening method described here characterizes individual features in terms of the posterior probability that the expected response to a nominal test intervention or treatment is generated by the same process that generated the nominal control intervention or treatment values, as opposed to a process that centers the distribution of expected 'test' values at a clinically, biologically, or regulatorily important difference value d_crit. The posterior probabilities, along with posterior distributions of the test-control differences, provide the strength of evidence for deciding whether a particular feature should be 'flagged' for further investigation. For example, the determination of whether a feature should be pursued may depend on the clinical importance of the feature in the context of product safety evaluation.
Since the Bayesian approach is selective with respect to the magnitude of the difference that is important, while the conventional approach is not, the conventional approach can identify more potential candidates for follow-up than the Bayesian approach does. In particular, an observed test-control difference that is smaller than (M_crit =) d_crit and sufficiently precise because it is the mean of many replications may yield a q-value less than q_crit, while the posterior probability that the expected feature response to the 'test' treatment is generated by the same process as the expected response to the 'control' treatment may not be small. The 'real data' example in the next section illustrates this phenomenon.
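This phenomenon, statistical significance without practical importance, is easy to reproduce: with many replications, a true difference far smaller than d_crit still yields a very small p-value. The data and threshold below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
d_crit = 4.0                                  # hypothetical importance threshold
# Many replicates of a small (practically unimportant) true difference of 0.2
control = rng.normal(4.0, 1.0, size=5000)
test = rng.normal(4.2, 1.0, size=5000)

t_stat, p_value = stats.ttest_ind(test, control)
observed_diff = test.mean() - control.mean()
# p_value is tiny, so a q-value-based screen would flag this feature,
# yet observed_diff is far below d_crit
```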

Simulated data
This example illustrates the application of the method to a large dataset with known control and treatment feature identifiers. Two sets of data (S = 1 replication) were generated, each consisting of the average of a_Ci = a_Ti ≡ a = 5 measurements of each of N_feat = 100,000 features. Each of the pooled residual sums of squares corresponding to the features (the u values) has m = 8 degrees of freedom. Setting the prior parameter values to θ_C = 4, b_0 = 2, ζ_1 = 12, and ζ_2 = 10 implies that the marginal density of v = u/(u + ζ_2) is a beta density with parameters (4, 6). The marginal expected value of the standard deviation σ = √(u/m) is √(ζ_2/m) B((m+1)/2, (ζ_1 − 1)/2)/B(m/2, ζ_1/2) = 0.945, so that the standard error for any of the mean Y values is σ/√5 = 0.423. Y_C values for both sets of data were drawn from a N(4, 0.423) distribution. The values of Y_T for both sets of data were drawn from a N(θ_T, 0.423) distribution, where θ_T = θ_C + δγ with δ = 4. The first set of data assumed that the values of γ were generated from a Bernoulli (0.25) distribution (E(γ) = 0.25), so that θ_T = 8 for about a quarter of the features (25,153 in fact) and θ_T = 4 for the remaining features. The second set of data, which illustrates a situation in which signals are rare, assumed that the values of γ were generated from a Bernoulli (0.005) distribution (E(γ) = 0.005), so that θ_T = 8 for about 500 features (517 in fact) and θ_T = 4 for the remaining features. These prior parameter values were used only to generate the simulated data and did not affect any of the subsequent calculations. The simulated data provide the 'truth' about the processes that generated the expected responses to the 'control' and 'test' interventions. The values of b_0 and δ required to satisfy (1) and (2) were determined as described in section D of the Appendix. The value of ζ_1 was set equal to 2 for the screening calculations.
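The first simulated data set can be reproduced in outline as follows; this is a sketch using the stated parameter values, and the random seed (and hence the exact signal count) will differ from the 25,153 reported above:

```python
import numpy as np

rng = np.random.default_rng(5)
N_feat = 100_000
theta_C, delta, se = 4.0, 4.0, 0.423          # values from the example

gamma = rng.binomial(1, 0.25, size=N_feat)    # E(gamma) = 0.25 case
Y_C = rng.normal(theta_C, se, size=N_feat)
Y_T = rng.normal(theta_C + delta * gamma, se, size=N_feat)

n_signal = int(gamma.sum())                   # roughly a quarter of features
```

Replacing 0.25 with 0.005 in the Bernoulli draw reproduces the rare-signal case with roughly 500 affected features.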
The computations took about 4 minutes for each case on a laptop computer. Figure 3 displays the posterior densities of π 0 as functions of ϕ 0 and ϕ 1 when d crit = 4 for both examples. The range of values of π 0 with nontrivial posterior density values clearly depends on the expected value of γ that was used for the simulations. These ranges depend on the values of ϕ 0 and ϕ 1 for both examples. The probability that the same process generated the feature responses to the control and test treatments decreased as ϕ 0 increased for each fixed value of ϕ 1 , and as ϕ 1 increased for each fixed value of ϕ 0 . Figure 3 suggests that setting ϕ 0 to a relatively high value, e.g., 0.99 or greater, and setting ϕ 1 to 0.3 or greater, would lead to a posterior density for π 0 centered close to the true (usually unknown) value.
Low values of the posterior probability (p SP ) that the same process generated the control and test expected feature values do not necessarily imply that large difference values will have high posterior probabilities. Table 1 illustrates this point by displaying the values of p SP that are less than 0.3 when E(γ) = 0.005, d crit = 4, ϕ 0 = 0.99, and ϕ 1 = 0.2 for the features for which the posterior probability that the true difference exceeds 4 is less than 0.5.

Table 1. Features for which p SP < 0.3 and p post (D > 4) < 0.5 when E(γ) = 0.005, d crit = 4, ϕ 0 = 0.99, and ϕ 1 = 0.2. The numbers in the fifth through tenth columns are the p SP values corresponding to the (ϕ 0 , ϕ 1 ) combinations.

Table 1 (like Figure 3) also demonstrates the profound effect of the value of ϕ 0 on the value of p SP . Table 2 summarizes the diagnostic properties of the screening process for both simulation cases as a function of the design parameters; specificity ≈ 1 for all cases. Increasing ϕ 0 , ϕ 1 , and ω clearly increased the sensitivity and decreased the MDR. There was not much effect on the FDR.
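The diagnostic summaries of the kind reported in Table 2 can be computed directly from the flagged set and the simulation truth. A minimal sketch follows, assuming (consistently with the abstract's definition) that the MDR is the proportion of truly affected features that the screen fails to flag, i.e., the complement of the sensitivity:

```python
import numpy as np

def diagnostics(flagged, truth):
    """Sensitivity, specificity, FDR, and MDR from boolean arrays.

    flagged: features the screen retained; truth: features actually affected.
    MDR here is the missed discovery rate, P(not flagged | truly affected).
    """
    flagged = np.asarray(flagged, bool)
    truth = np.asarray(truth, bool)
    tp = np.sum(flagged & truth)
    fp = np.sum(flagged & ~truth)
    fn = np.sum(~flagged & truth)
    tn = np.sum(~flagged & ~truth)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    fdr = fp / (tp + fp) if tp + fp else 0.0
    mdr = fn / (tp + fn) if tp + fn else float("nan")
    return sens, spec, fdr, mdr
```

For example, `diagnostics([True, True, False, False], [True, False, True, False])` returns 0.5 for all four quantities.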
For purposes of comparison with conventional frequentist-based methods, the well-known positive fdr (pfdr) method implemented by the qvalue function (Storey 2003; Storey et al. 2004, 2019; Storey and Tibshirani 2003), an empirical Bayes method implemented using the ashr function (Stephens et al. 2019), and the Higher Criticism approach implemented using the SetTest function (Zhang and Wu 2018) also were applied to the simulated data. Table 3 provides values of the sensitivity, specificity, FDR, and MDR for these procedures. The diagnostic properties are functions of the critical criterion for the pfdr and empirical Bayes approaches, so they are reported corresponding to the minimum and maximum of a set of critical values. The Higher Criticism results are omitted from Table 3 because the assumption of an essentially constant density of empirical p-values towards the upper end of the range (values near 1) required by the method does not appear to apply for the 1-tail significance levels. Table 3 suggests that the FDR and MDR values obtained using the pfdr or empirical Bayes approaches can be sensitive to the true value of π 0 and to the critical value. The findings from Table 2 suggest that the Bayesian screening approach may be relatively insensitive to the true value of π 0 and to the values of the tuning parameters ϕ 0 , ϕ 1 , and ω. This may not be surprising because the conventional and Bayesian approaches address different questions, as pointed out in Section 2.6. In addition, the findings in Table 3 would not be unexpected in view of the relationship between FDR, MDR, and prevalence displayed in Figure 2.
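The qvalue/pfdr machinery referenced above is R software. For readers who want a self-contained frequentist baseline to compare against, the closely related Benjamini-Hochberg step-up rule (the standard FDR-controlling comparator, not the exact pfdr procedure the paper used) can be sketched as:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Flag features with the Benjamini-Hochberg step-up rule.

    Controls the FDR at level alpha when the p-values are independent:
    find the largest k with p_(k) <= alpha*k/n and flag those k features.
    """
    p = np.asarray(pvals, float)
    n = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    flagged = np.zeros(n, bool)
    flagged[order[:k]] = True
    return flagged
```

With `pvals = [0.001, 0.002, 0.003, 0.5, 0.9]` and `alpha = 0.05`, the first three features are flagged and the last two are not.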
A question arose in the review of the manuscript about the effect of the variability of the responses on the diagnostic properties of the method. We addressed this question by repeating the simulations just described with the variance increased by a factor of 10. Table 4 displays the effect on the diagnostic properties. There appears to be little effect on the FDR when the prevalence of features that are affected by the test treatment is not rare (E(γ) = 0.25) although, consistently with the effect on power in conventional frequentist hypothesis tests, the sensitivity decreases and the MDR increases. When few features are affected by the test treatment (E(γ) = 0.005), the MDR and sensitivity do not appear to be affected greatly, but the FDR does increase substantially.

Real data
This example uses the well-known data regarding gene expression differences between two forms of leukemia [ALL = Acute Lymphoblastic Leukemia and AML = Acute Myeloid Leukemia] (Golub et al. 1999). The expression values were obtained from 27 samples of childhood ALL from the Dana Farber Cancer Institute leukemia data bank and 11 adult AML samples from the Cancer and Leukemia Group B leukemia cell bank plus an additional independent set of 20 ALL and 14 AML adult and childhood leukemias obtained from various sources. The unknown transformations of the original expression values for the 7128 genes provided by the 72 patients were downloaded from http://web.stanford.edu/~hastie/CASI_files/DATA/leukemia_big.csv. The analysis of the data can be, and has been, approached in many ways. The analyses described here use the average of the 72 expression values for each gene and each type of leukemia along with the usual within-gene residual sum of squares to illustrate an application of the screening approach for identifying genes that seem to be expressed differently in ALL and AML. Although the present analyses do not do so, typical data mining methods would be applied in practice with part of the data used to identify genes with potentially different expressions, and the remainder to validate the choice. This could be done with different splits of the data to provide some assurance about the robustness of the chosen set of genes. Another issue of interest, not addressed here, would be whether the ALL-AML gene differences were expressed differently for adult and childhood leukemias.
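The per-gene summaries the screen consumes (group means and the pooled within-gene residual sum of squares) can be sketched on a toy matrix with the same layout assumed for leukemia_big.csv: rows are genes and columns are samples, with 47 ALL columns (27 + 20) and 25 AML columns (11 + 14). The synthetic values here are placeholders for the real downloaded data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 100                                    # 7128 in the real file
expr_all = rng.normal(0.0, 1.0, (n_genes, 47))   # 47 ALL samples
expr_aml = rng.normal(0.0, 1.0, (n_genes, 25))   # 25 AML samples

# Per-gene means for each leukemia type
y_all = expr_all.mean(axis=1)
y_aml = expr_aml.mean(axis=1)

# Pooled within-gene residual sums of squares, m = (47-1) + (25-1) = 70 df
u = ((expr_all - y_all[:, None]) ** 2).sum(axis=1) + \
    ((expr_aml - y_aml[:, None]) ** 2).sum(axis=1)

diff = y_all - y_aml      # candidate effect measure (ALL - AML)
```

With the real file, `expr_all` and `expr_aml` would be the appropriately labeled columns of the downloaded matrix.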
Examination of the distributions of the mean differences between the feature responses for the two forms of leukemia suggests that true mean differences larger than a critical value (d crit ) of 0.5 or 0.6 are highly unlikely; it is not clear that true differences less than 0.5 would be meaningful. Consequently, values of d crit of 0.5 and 0.6 were used for this illustrative example. Values of ϕ 0 of 0.95 and 0.99, and values of ϕ 1 of 0.1, 0.2, and 0.3 were chosen to define the clinical/regulatory requirements. Figure 4 displays posterior densities of π 0 for the differences ALL-AML and AML-ALL when d crit = 0.5. The displays for d crit = 0.6 (not shown) are similar. Differential gene expression seems unlikely for either difference. The value of ϕ 0 strongly affected the posterior density of π 0 . These are subtle effects because most of the probability content of the distribution of π 0 is very close to 1, so that very few genes are likely to be expressed differently in ALL and AML regardless of the values of ϕ 0 , ϕ 1 , and d crit .
The finding here of a small number of genes that appear to be expressed differently between the ALL and AML types is consistent with the results obtained by the higher criticism approach reported by Donoho and Jin (2009a). Table 5 displays the genes whose p SP values exceeded ω = 0.6 for any of the differences ALL-AML and AML-ALL when ϕ 0 = 0.95 or 0.99 and ϕ 1 = 0.1, 0.2, and 0.3, and also displays the posterior probabilities that the true difference exceeds specified values. These posterior probabilities are important because observed differences with small p SP values do not necessarily lead to non-negligible posterior probabilities of meaningful differences. The values of p SP generally were near 1 (>0.99) for almost all of the genes when ϕ 0 ≤ 0.95. In other words, almost no genes would have been detected by the screening process if ϕ 0 ≤ 0.95.

Application of the pfdr method yielded different results primarily because the frequentist and Bayesian approaches address different questions, as noted in section 2.6. Table 6 displays the number of genes flagged by the qvalue program as a function of the q crit values. Figure 5 displays posterior probability trajectories produced by the Bayesian approach described by Fisher and Mehta (2015a) as a function of the ratio λ*/λ for the "best" 500 features, the "best" 20 features, and the "best" 12 features. The evidence for selecting any feature as "relevant" is fairly weak, yet a few features appear to be a bit less weak than most. Table 7 lists the summary statistics for genes flagged by the Bayes screening approach described here with d crit = 0.5, ϕ 0 = 0.99, ϕ 1 = 0.2, and ω = 0.6, the pfdr approach with q crit = 1 × 10^−8, the Berk-Jones statistic implemented in an early version of the SetTest program (Zhang and Wu 2018), and the 20 'best' features identified using the Fisher and Mehta (2015a) Bayesian approach. Of the 30 features in this table, 12 were flagged by all of the methods, 3 were flagged by 3 methods, 7 by 2, and 8 features were flagged by only one method. The qvalue approach flagged 25 features, HC flagged 17, Fisher and Mehta flagged 18, and the Bayesian screening method flagged 19.

Discussion
This paper describes Bayesian screening of continuous-valued features using a Gaussian mixture model that extends a Bayesian screening method for adverse event counts. The computations produce three key quantities: the probability for each feature that the same process generates the expected responses to the C and T interventions/categories, the posterior distribution of the true T-C difference for each feature, and the posterior distribution of the proportion of features (the value of π 0 ) for which the C and T expected responses are generated by the same process. Between-group effect magnitudes can be calculated as differences or ratios. The assumption of normality is algebraically and computationally convenient, but not necessary. Similar calculations can be carried out with other distributional models, although possibly with a greater computational burden.
Bayesian screening was evaluated and contrasted in two scenarios that may be encountered in practice, namely when "treatment effects" on the feature values are "common" and when they are "rare", with the γ i drawn from Bernoulli (0.25) and Bernoulli (0.005) distributions, respectively. For example, the "common" scenario may correspond to expression data from DNA-based studies, whereas the "rare" scenario is more germane to expression data from RNA-based studies, which enable identification of rare or weakly expressed genes (Liu et al. 2015).

Table 5. Values of p SP (as functions of ϕ 0 and ϕ 1 ) and posterior probabilities (as functions of d) that differences (D i ) exceed specified values when d crit = 0.5. The probabilities for D i correspond to ϕ 0 = 0.99 and ϕ 1 = 0.2.

The performance of the Bayesian method was also compared with the performance of a conventional screening method based on a multiplicity-adjusted hypothesis testing paradigm, a conventional screening method based on the "Higher Criticism" principle, and another Bayesian method. The diagnostic properties were similar, but not identical, although with a suitable choice of design parameters and critical values, the approaches identified nearly identical features as being differentially expressed. This also underscores the potential operating range of Bayesian screening, which was not primarily developed to control the FDR.

The practical applicability of any screening method depends on its flexibility and computational efficiency. Although Bayesian methods can be computationally demanding, the calculations required by the procedure described here turn out to be very rapid when the assumption of normality is tenable, even for large datasets. The analyses carried out on a simulated set of 100,000 features took about 4 minutes and the analyses carried out on a gene microarray data set with 7128 features took about 30 seconds, both on a standard laptop computer. All of the necessary quantities are calculated from closed-form expressions so that exact solutions are obtained when the assumptions about the data generating model are correct.

Table 7. Summary statistics for genes flagged using the qvalue program when q crit = 1 × 10^−8, the HC approach, the Fisher and Mehta (F&M) Bayesian approach, and by the Bayesian screening approach described here when d crit = 0.5, ϕ 0 = 0.99, ϕ 1 = 0.2, and ω = 0.6. X denotes a flagged gene, e.g., gene 760 was flagged by all 4 methods, whereas gene 1630 was flagged only by the qvalue and Bayes approaches.

The method relies on design parameters that are chosen to incorporate explicitly knowledge and requirements specified by scientists, clinicians, and regulators. Evaluation of suitable ranges of these parameters provides insights into the diagnostic properties, such as the sensitivity, MDR, and FDR of feature detection for a problem at hand. In this way, the application requirements determine the diagnostic properties in terms that are meaningful to the users. This makes the method flexible and data adaptable, and this information can be further leveraged in prospective study planning.
The main objective of feature screening is the identification of potentially informative features as measured by the MDR. Therefore, it can be applied in learn-and-confirm situations. The learning step separates non-informative from potentially informative features. Subsequent analyses, and perhaps trials or experiments, are needed to confirm that these features are truly informative.
The calculations described here have assumed that the feature expressions are independent so that no initial reduction was performed to remove redundancies. However, independence often does not apply, and correlations can affect the sensitivity for detecting important features. Various approaches have been described for incorporating correlations into the analyses, including some of the Bayesian approaches described above. A recently described, computationally convenient deterministic dimension reduction approach (Millstein et al. 2020) implemented in the R package partition (Barrett and Millstein 2020) could be used initially to identify subsets of features that can be combined into new composite features. These possibilities are not pursued further here, since consideration of correlation reduction methods is outside the scope of this paper.
There are several potential future developments of our current work. Developing feature screening with respect to a time-to-event endpoint would be a natural next step because of a strong focus on survival analysis in oncology (Cristescu et al. 2018) and neuroscience biomarker research (Li et al. 2017). Another interesting area to explore would be to evaluate the utility of the Bayesian screening as a step prior to the development of a prediction algorithm (classifier). This would be useful, e.g. in responder/no-responder identification from high-dimensional biomedical data.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The author(s) reported that there is no funding associated with the work featured in this article.

The conditional (on π 0 ) posterior probability function of γ i is given by (A3). The expectation of (A3) with respect to (A2) provides a corresponding 'unconditional' posterior probability. This approach incorporates π 0 as an explicit model parameter. A number of strategies for determining the value of π 0 without the assumption of a prior distribution have been described in the hypothesis-testing literature (Genovese and Wasserman 2004; Lai 2017; Langaas and Lindqvist 2005; Liang 2016; Storey 2003; Wen 2017; Yu and Zelterman 2017).
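The two-component mixture form of this posterior probability can be sketched as follows. This is a simplified illustration, not the paper's exact expression: it assumes Gaussian marginal likelihoods for the observed difference under the two components (the paper's exact calculations yield Student-t marginals), with `delta` and `scale` as illustrative parameters.

```python
from statistics import NormalDist

def posterior_same_process(d_i, pi0, delta, scale):
    """P(gamma_i = 0 | data, pi0) for a two-component Gaussian mixture.

    Assumes the observed difference d_i has a N(0, scale) marginal
    likelihood when gamma_i = 0 and a N(delta, scale) marginal likelihood
    when gamma_i = 1 (a simplification of the paper's t marginals).
    """
    f0 = NormalDist(0.0, scale).pdf(d_i)
    f1 = NormalDist(delta, scale).pdf(d_i)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)
```

A difference near 0 yields a probability near 1 (same process), a difference near `delta` yields a probability near 0, and at the midpoint the posterior simply returns the prior π 0.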
Low values of (A3) [or its unconditional version] are useful for identifying the features that are affected by the test treatment differently from the control. However, they do not provide a perspective on the magnitude of the difference, and it is possible to have a low posterior probability that γ i = 0 with only a modest test-control difference. This perspective can be provided by considering the posterior distribution of an appropriate metric (say M) for the test-control difference. Suppose that θ̃ i (θ̃ Ci or θ̃ Ti ) denotes the parameter of the posterior distribution of θ i , and that M has cdf F M . Then, the conditional posterior cdf of M given γ i and the values of θ̃ Ci or θ̃ Ti (which depend on the observed feature values x Ci and x Ti ) follows.

B. Normally Distributed Feature Values
The observed mean feature values y Ci and y Ti are assumed to have normal likelihoods with respective means μ Ci and μ Ti and precisions a Ci η i and a Ti η i (η i = 1/σ² i , where σ² i denotes the feature-specific residual variance); a C and a T refer (usually) to the sample sizes, i = 1, . . ., N feat . The precisions may differ among the features (although exchangeability of the precisions among features is assumed), but not between the test and control interventions within a feature. The corresponding observed residual sums of squares are denoted by u Ci and u Ti ; the quantities η i u Ci and η i u Ti are assumed to have chi-square likelihoods with respective degrees of freedom m Ci and m Ti , i = 1, . . ., N feat . The means μ Ci and μ Ti have normal prior distributions with respective expected values θ (C) Ti ≡ θ C and θ (T) Ti and common precision b 0 η i ; and (because of exchangeability) the precision parameters η i are assumed to have a gamma prior distribution with parameters ζ 1 /2 and ζ 2 /2. Expression (A1) describes the mixture attribute of the process. How the values of θ (C) Ti and θ (T) Ti are determined is key to the screening process. For the moment, these values are taken as given. Posterior inferences about the θ i are a main objective of the screening process.
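The conjugate structure of this model makes the posterior updates closed-form. The sketch below applies the standard normal-gamma algebra to one feature and one intervention; it is our illustration of the generic conjugate update under the stated likelihoods and priors, not the paper's exact code or notation.

```python
def normal_gamma_update(y_bar, a, u, m, theta0, b0, zeta1, zeta2):
    """Conjugate update for one feature under the appendix model.

    Model: y_bar ~ N(mu, 1/(a*eta)), mu ~ N(theta0, 1/(b0*eta)),
           eta*u ~ chi-square(m), eta ~ Gamma(zeta1/2, zeta2/2).
    Returns posterior parameters (theta_post, b_post, zeta1_post, zeta2_post)
    so that mu | eta, data ~ N(theta_post, 1/(b_post*eta)) and
    eta | data ~ Gamma(zeta1_post/2, zeta2_post/2).
    """
    b_post = b0 + a
    # Precision-weighted combination of the data mean and the prior mean
    theta_post = (a * y_bar + b0 * theta0) / b_post
    # eta picks up m df from u, plus 1 from the marginal of y_bar
    zeta1_post = zeta1 + m + 1
    # Rate picks up u plus the prior-data disagreement term
    zeta2_post = zeta2 + u + a * b0 * (y_bar - theta0) ** 2 / b_post
    return theta_post, b_post, zeta1_post, zeta2_post
```

For example, with y_bar = 5, a = 5, u = 8, m = 8, θ0 = 4, b0 = 2, ζ1 = 12, ζ2 = 10, the posterior mean is (5·5 + 2·4)/7 = 33/7 and the posterior gamma shape parameter is (12 + 8 + 1)/2.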
The calculations are based on the product of the likelihoods and prior densities for feature i, i = 1, . . ., N feat .

Differences
Under the normality assumptions used here, the quantity has a central t density with ζ 1 degrees of freedom. The residual sums of squares (the u Ci or u i values) on which the screening process is conditioned provide a particularly convenient way to specify the ratio ζ 1 /ζ 2 , the a priori expected value of η i . The expected value of u⁻¹ i is η i /(m i − 2), so that E(η i ) can be estimated by the average, Ũ, of (m i − 2)/u i over the features. The calculations that follow replace λ with λ̃ = √(b 0 Ũ/2). If Ψ denotes a standard normal or central t cdf, then requirement (1) in the text, namely that P(D ≤ d crit | δ = 0) ≥ ϕ 0 , determines the value of b 0 as b 0 = 2(Ψ⁻¹(ϕ 0 )/d crit )²/Ũ. The value of δ i is determined by requirement (2) and can be expressed simply as δ i ≡ δ = d crit (1 − Ψ⁻¹(ϕ 1 )/Ψ⁻¹(ϕ 0 )). Table A1 provides typical values for d crit λ̃ = Ψ⁻¹(ϕ 0 ) and δ/d crit as functions of ϕ 0 and ϕ 1 when Ψ denotes a normal cdf and when Ψ denotes the cdf of a t distribution with 8 degrees of freedom.
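These two design formulas can be evaluated directly. The sketch below uses the normal-cdf choice of Ψ (with a Student-t Ψ, as in Table A1, the quantiles change slightly); the function name and arguments are ours.

```python
from statistics import NormalDist

def design_constants(phi0, phi1, d_crit, U_tilde):
    """b0 and delta implied by requirements (1) and (2), normal-cdf version.

    b0    = 2 * (PsiInv(phi0)/d_crit)**2 / U_tilde
    delta = d_crit * (1 - PsiInv(phi1)/PsiInv(phi0))
    """
    q0 = NormalDist().inv_cdf(phi0)   # PsiInv(phi0)
    q1 = NormalDist().inv_cdf(phi1)   # PsiInv(phi1), negative when phi1 < 0.5
    b0 = 2.0 * (q0 / d_crit) ** 2 / U_tilde
    delta = d_crit * (1.0 - q1 / q0)
    return b0, delta
```

For instance, with ϕ 0 = 0.99, ϕ 1 = 0.2, d crit = 4, and Ũ = 1, δ = 4·(1 + 0.8416/2.3263) ≈ 5.447 and b 0 = 2·(2.3263/4)² ≈ 0.676.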

Ratios
When μ (C) Ti is unlikely to be negative, the statement "μ (T) Ti /μ (C) Ti < r crit " is essentially the same as the statement "μ (T) Ti − r crit μ (C) Ti < 0", so that simple expressions can be obtained for probability statements about the ratio μ (T) Ti /μ (C) Ti .

Requirement (2) implies that ρ i = r crit − (r crit − 1)Ψ⁻¹(ϕ 1 )/Ψ⁻¹(ϕ 0 ) ≡ ρ. The quantity η i can be replaced with its a priori expectation ζ 1 /ζ 2 , which can be estimated as described above, in which case Ψ denotes a central t cdf. Table A2 provides typical values for κ and ρ satisfying requirements (1) and (2) as functions of ϕ 0 and ϕ 1 when Ψ denotes a normal cdf, for r crit = 2 and 4. When μ (C) Ti could be negative, the probability statements are not simple. Following the development by Kamerud et al. (1978), the density function for the ratio Z = μ
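The ratio-scale analogue of requirement (2) is likewise a one-line computation. As before, this sketch uses the normal-cdf choice of Ψ, mirroring the entries of Table A2; the function name is ours.

```python
from statistics import NormalDist

def rho_for_ratio(phi0, phi1, r_crit):
    """rho = r_crit - (r_crit - 1) * PsiInv(phi1)/PsiInv(phi0), normal Psi.

    Ratio-scale analogue of requirement (2); since PsiInv(phi1) is negative
    for phi1 < 0.5, rho exceeds r_crit in the typical settings of the text.
    """
    q0 = NormalDist().inv_cdf(phi0)
    q1 = NormalDist().inv_cdf(phi1)
    return r_crit - (r_crit - 1.0) * q1 / q0
```

For example, with ϕ 0 = 0.99 and ϕ 1 = 0.2, ρ ≈ 2.362 when r crit = 2 and ρ ≈ 5.085 when r crit = 4.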