A comparative study on high-dimensional Bayesian regression with binary predictors

Abstract Bayesian regression models have been widely studied and adopted in the statistical literature. Many studies consider the development of reliable priors to select the relevant variables and derive accurate posterior predictive distributions. Moreover, in the context of small high-dimensional data, where the number of observations is very small with respect to the number of predictors, sparsity is assumed and many parameters can be set to values close to zero without affecting the fit of the model. The aim of this work is to develop a comparative analysis to empirically evaluate the performance of several Bayesian regression approaches in these contexts. In this study we assume that the predictors can be expressed only as binary variables coding the presence or the absence of a particular characteristic of the system. This binary structure is often present in real studies, in particular in laboratory experimentation and in very high-dimensional genome-wide association studies.


Introduction
Many research fields are nowadays characterized by the high-dimensionality of the system under study. High-dimensionality refers to the situation where there are very few observations but a large set of variables representing the system. This is often termed the "small n, large p" problem, where n indicates the number of observations and p the number of variables. In regression problems characterized by "small n, large p", the p-dimensional vector of regression coefficients of the true underlying model that generates the data usually has many components that are exactly zero or negligibly small. In this case only a few variables have a meaningful relation with the response variable of the system and can be considered relevant. This sparsity condition and the high-dimensionality of the system lead to many statistical challenges, due to the failure or the inapplicability of classical inference methods for regression models such as maximum likelihood or least squares estimation (Johnstone and Titterington 2009; Fan, Lv, and Qi 2011; Ye, Yang, and Yang 2018; Izbicki, Lee, and Pospisil 2019). For example, with large p some of the columns of the $n \times p$ design matrix $X$ are likely to be nearly collinear, leading to near-singularity of the Gram matrix $X^TX$. In this situation the inverse of the matrix has some very large eigenvalues and the classical least squares method performs poorly in estimating the regression coefficient vector.
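To make this concrete, the following minimal R sketch (with illustrative dimensions of our choosing, not taken from the paper) shows that when $p > n$ the Gram matrix $X^TX$ is rank-deficient, so the ordinary least squares estimate is not even computable without some form of regularization.

```r
# Illustrative sketch: rank deficiency of the Gram matrix when p > n.
set.seed(1)
n <- 50; p <- 200                                 # "small n, large p"
X <- matrix(rbinom(n * p, size = 1, prob = 0.1), n, p)
G <- crossprod(X)                                 # p x p Gram matrix X^T X
qr(G)$rank                                        # at most n = 50, far below p
# solve(G) would fail here: shrinkage or penalization is needed instead
```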
The literature on "small n, large p" problems and the sparsity condition is a developing area of Statistics, both from a theoretical and an empirical perspective (Tibshirani 1996; Candes and Tao 2007; Bühlmann and van de Geer 2011; El Karoui et al. 2013; Frigessi et al. 2016). In recent decades many approaches have been developed in the context of machine learning, classical statistics or Bayesian statistics: among others, we can recall forms of penalization and variable selection (Tibshirani 1996; Zou and Hastie 2005; Zou 2006; Fan and Lv 2010; Polson, Scott, and Windle 2014; Chang, Tzeng, and Chen 2017; Griffin and Brown 2017; Piironen and Vehtari 2017b; Hu et al. 2018; Li and Yao 2018).
In Bayesian regression models in particular, the objective is to select the set of relevant variables, i.e., those with associated non-zero coefficients, and to derive accurate posterior predictive distributions. Bayesian penalized regression techniques for the analysis of high-dimensional data include, among others, the Bayesian Lasso (Park and Casella 2008; Hans 2009) and the Bayesian Ridge (Hsiang 1975), which are built starting from the corresponding frequentist concepts and adapting them to the Bayesian perspective.
To address sparsity, many authors have developed priors able to shrink toward zero the coefficients associated with non-relevant predictors. Griffin and Brown (2010) propose a generalization of the Double-Exponential prior distribution for regression problems, namely the Normal-Gamma prior; Carvalho, Polson, and Scott (2010) propose the Horseshoe prior, based on scale mixtures of Normals, comparing the performance of their approach with other existing approaches in the literature, on both simulated and real data; Johnson and Rossell (2012) introduce two classes of non-local prior densities for the regression coefficients of high-dimensional models, describing their properties and developing simulation studies where the number of potential predictors is of the same order of magnitude as the number of observations; Polson and Scott (2012) propose Lévy processes to generate penalty functions, generalizing the class of local-global shrinkage rules based on scale mixtures of Normals in both small and high-dimensional contexts; Griffin and Brown (2013) develop prior distributions reflecting some characteristics of both Ridge and g-prior regression for variable selection and shrinkage when different signal-to-noise ratio structures are assumed. There are many reviews in the literature on Bayesian predictive model assessment and selection which also discuss the connections among different approaches; see for example (Vehtari and Ojanen 2012; Mallick and Yi 2013; Piironen and Vehtari 2017a; Forte, Garcia-Donato, and Steel 2018; Fragoso, Bertoli, and Louzada 2018).
While many of these studies are developed assuming that the model predictors are continuous, the behavior of Bayesian regression models with binary predictors has not been thoroughly explored. Whilst it is true that the important characteristics of normal theory regression depend on the eigenvalues and eigenvectors of the matrix $X^TX$, the binary nature of the predictors is likely to influence the matrix operations. Moreover, from the applications point of view there is considerable interest in such a binary assumption for the predictors, since this condition frequently occurs in many laboratory experiments, for example in biochemistry, where the presence or absence of a component affects the results of the experimentation (Hibbert and Armstrong 2009; Pickett et al. 2011; Anzanello and Fogliatto 2014), or in genetic studies with the presence or absence of SNP alleles in GWAS (Johndrow, Orenstein, and Bhattacharya 2017).
The aim of the present study is two-fold. Firstly, we want to empirically compare several well-known and frequently adopted prior distributions for Bayesian regression models whose predictors are expressed only as binary variables. Secondly, we address a drug discovery problem aimed at modeling and predicting the biological activity of molecules relevant in treating a particular disease. Assuming high-dimensionality and sparsity of the data, we develop several simulation studies to investigate the behavior of the selected models when, instead of the standard Gaussian assumptions, we consider Bernoulli distributions for the predictors. Simulations are developed assuming different sample sizes n and numbers of predictors p, while maintaining the "small n, large p" condition. We evaluate the models with respect to different key aspects, such as predictive performance and variable selection capacity. These aspects are in fact very important when confronting real complex domains, as we will show in the paper by analyzing the biochemical high-dimensional system considered in the drug discovery problem.
The paper is organized as follows. In Sec. 2 we present the Bayesian multiple regression model and we introduce the prior distributions considered in the analysis. We also describe the characteristics of the simulation studies and the performance indicators adopted to compare the models. In Sec. 3 we present the results of the simulations both from the predictive performance and the variable selection perspectives. In Sec. 4 we introduce the real biochemical study motivating the paper and the results of the Bayesian regression analysis developed from it. Finally, in Sec. 5 we give some discussion and concluding remarks.

Materials and methods
In the following, we present the methodological context and the set-up. We introduce the general regression model and then shortly describe the prior structure for the different Bayesian methods relevant to the paper. We also present the set-up of the simulation study and the performance indicators used for the comparison. We would like to emphasize that our aim is to summarize the methods under comparison, not to provide a comprehensive review of the literature.

Bayesian regression model
Consider a standard multiple linear regression model
$$y = X\beta + \epsilon, \qquad (1)$$
where $y = (y_1, y_2, \dots, y_n)^T$ is the vector of the $n$ observations on a response variable, $X$ is an $n \times p$ design matrix on $p$ predictors, whose $i$-th row contains the values of the $p$ predictors associated with the $i$-th observation, $\beta$ is a vector of $p$ unknown regression coefficients and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ is the error vector with independent Normal-distributed components $\epsilon_i \sim N(0, \sigma^2)$, $i = 1, \dots, n$. Predictors may be continuous or discrete. In this paper we assume that the predictors are represented as binary variables coding the presence or the absence of a particular characteristic of the system under study; this binary structure is quite common in real case studies, especially in laboratory experimentation. When the number of predictors $p$ is large and exceeds the number of observations $n$ ($p \gg n$), the model in Eq. 1 is usually referred to as the high-dimensional regression model. Under this framework it is frequently the case that most regression coefficients $\beta_j$ are zero and that only a small number of predictors influences the response, these being the relevant ones for the regression. To identify and select the relevant non-zero predictors in the regression, a great variety of variable selection methods have been proposed, both in Bayesian and in non-Bayesian contexts. In this work we focus on the Bayesian literature, and in particular on Bayesian regularization methods, which are formulated through the specification of priors inducing shrinkage of the parameters in the model. We focus on the class of "one-group" priors as defined by Polson and Scott (2010), which simultaneously select and estimate the relevant regression coefficients while encouraging shrinkage of those that are close to zero. These priors have been proven able to select the relevant predictors for the regression while obtaining reliable estimates of the associated parameters. Moreover, it has been shown that they are computationally efficient in high-dimensional settings (Carvalho, Polson, and Scott 2010).
Considering the Bayesian specification of the linear regression model, prior distributions for the parameters and hyper-parameters of the model are introduced in a hierarchical form as proposed by Park and Casella (2008):
$$y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n),$$
$$\beta_j \mid \sigma^2, \lambda^2, \tau_1^2, \dots, \tau_p^2 \sim N(0, \sigma^2 \lambda^2 \tau_j^2), \quad j = 1, \dots, p,$$
$$\tau_j^2 \sim f, \quad j = 1, \dots, p, \qquad \lambda^2 \sim g, \qquad \sigma^2 \sim h,$$
where $f$, $g$ and $h$ are densities supported on $(0, \infty)$. We refer to this as the class of one-group global-local shrinkage priors. Each $\tau_j^2$ is called a local variance component, while $\lambda^2$ is the global variance, which corresponds to the regularization parameter of the frequentist methods; see Fan and Lv (2010). The local variance components are used to model the sparsity characteristics and to control the amount of shrinkage in the coefficient estimates; the global variance is used to determine the overall sparsity; see Polson and Scott (2012). The computation of the posterior distribution for the parameter vector $\beta$ can be implemented with an efficient Gibbs sampler (Park and Casella 2008). For this aim, we need the full conditional distributions of $\beta$, $\sigma^2$, $\lambda^2$ and $\tau_1^2, \dots, \tau_p^2$. The full conditional distribution for $\beta$ is a multivariate Normal with mean $A^{-1}X^Ty$ and variance $\sigma^2 A^{-1}$, where $A = X^TX + D_\tau^{-1}$ and $D_\tau = \lambda^2\,\mathrm{diag}(\tau_1^2, \dots, \tau_p^2)$. Sampling from this distribution can be difficult when $p$ is large due to the need for matrix inversion; many methods have been proposed to overcome this drawback, see for example Chakraborty, Mallick, and Bhattacharya (2016), who present an algorithm based on data augmentation with linear computational complexity. The local variance components have conditionally independent posteriors, hence $(\tau_1^2, \dots, \tau_p^2)$ can be updated in a block by slice sampling when the conditional posteriors are not available in closed form (Polson, Scott, and Windle 2014). Different priors for $\sigma^2$ lead to different regression structures in terms of error distribution; usually $\sigma^2$ follows an improper prior distribution proportional to $1/\sigma^2$, while the distribution assumed for $\tau_1^2, \dots, \tau_p^2$ leads to different prior structures for the regression coefficients $\beta$. In this paper we consider, as special examples of this class of shrinkage priors, some of the most frequently used priors in the sparse high-dimensional setting: the Double-Exponential, the Horseshoe, the Normal-Gamma, and the Ridge priors.
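The following R sketch is a minimal implementation of the update for $\beta$ described above, via a Cholesky factorization of $A$; the function name and the Cholesky route are ours, not the sampler actually used in monomvn.

```r
# Gibbs update beta | . ~ N(A^{-1} X^T y, sigma2 * A^{-1}),
# with A = X^T X + D_tau^{-1} and D_tau = lambda2 * diag(tau2_1, ..., tau2_p).
draw_beta <- function(X, y, sigma2, lambda2, tau2) {
  p <- ncol(X)
  A <- crossprod(X) + diag(1 / (lambda2 * tau2), p)        # X^T X + D_tau^{-1}
  R <- chol(A)                                             # A = R^T R, O(p^3)
  mu <- backsolve(R, forwardsolve(t(R), crossprod(X, y)))  # A^{-1} X^T y
  # R^{-1} z with z ~ N(0, I_p) has covariance A^{-1}
  as.vector(mu + sqrt(sigma2) * backsolve(R, rnorm(p)))
}
```

The explicit $O(p^3)$ factorization of the $p \times p$ matrix $A$ is what becomes prohibitive for large $p$, which is precisely the drawback addressed by the linear-complexity algorithm of Chakraborty, Mallick, and Bhattacharya (2016).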
The Double-Exponential prior

A prior for the regression coefficients $\beta_j$ can be obtained by considering the Double-Exponential distribution
$$\pi(\beta_j \mid \sigma^2, \lambda) = \frac{\lambda}{2\sqrt{\sigma^2}} \exp\left\{-\frac{\lambda|\beta_j|}{\sqrt{\sigma^2}}\right\}.$$
This prior leads to the Bayesian analogue of the Lasso penalization (Tibshirani 1996) and is obtained by assuming that the hyper-parameters $\tau_1^2, \dots, \tau_p^2$ follow a joint Exponential prior distribution depending on the hyper-parameter $\lambda$,
$$\pi(\tau_1^2, \dots, \tau_p^2) = \prod_{j=1}^{p} \frac{\lambda^2}{2} \exp\{-\lambda^2 \tau_j^2 / 2\}.$$
Generally, this assumption can be simplified by assuming that the prior for $\tau_j^2$ is an Exponential distribution with mean 1, for $j = 1, \dots, p$ (Park and Casella 2008). A hyper-prior for the parameter $\lambda$ is obtained by considering the class of Gamma priors on $\lambda^2$ with parameters $r > 0$ and $d > 0$. Using this prior leads to what is generally named the Bayesian Lasso regression.
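The scale-mixture construction can be verified empirically; the short R sketch below (with $\sigma^2 = 1$ and an arbitrary value of $\lambda$ chosen by us) simulates from the hierarchy and overlays the Double-Exponential density.

```r
# Empirical check of the Normal scale-mixture representation (sigma^2 = 1):
# tau_j^2 ~ Exp(rate = lambda^2 / 2) and beta_j | tau_j^2 ~ N(0, tau_j^2)
# imply beta_j ~ Double-Exponential with density (lambda / 2) exp(-lambda |b|).
set.seed(1)
lambda <- 2
tau2 <- rexp(1e5, rate = lambda^2 / 2)
beta <- rnorm(1e5, mean = 0, sd = sqrt(tau2))
plot(density(beta), main = "Normal scale mixture vs Double-Exponential")
curve(lambda / 2 * exp(-lambda * abs(x)), add = TRUE, lty = 2)
```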

The Horseshoe prior
This prior for the regression coefficients $\beta_j$ can be obtained by assuming independent standard half-Cauchy priors for all the local parameters,
$$\tau_j \sim C^+(0, 1), \quad j = 1, \dots, p,$$
where $C^+(0, 1)$ denotes a standard half-Cauchy distribution (Gelman 2006). Polson and Scott (2012) proposed a half-Cauchy prior also for the global parameter $\lambda$, leading to a proper posterior. They argue against the use of the inverse-Gamma prior for the global variance, since it can lead to improper posteriors. By changing the value of the global parameter $\lambda$, different levels of sparsity can be achieved: large values of $\lambda$ lead to very little shrinkage, while values of $\lambda$ approaching zero favor complete shrinkage.

The Normal-Gamma prior
A prior for the regression coefficients $\beta_j$ can be obtained by considering the Normal-Gamma distribution, which arises by assuming that each hyper-parameter $\tau_j^2$ is independently Gamma distributed with shape $k$ and rate $1/(2\gamma^2)$, i.e., $\tau_j^2 \sim \mathrm{Ga}(k, 1/(2\gamma^2))$. The prior density function for $\beta_j$ therefore takes the form
$$\pi(\beta_j) \propto |\beta_j|^{k - 1/2}\, K_{k - 1/2}(|\beta_j| / \gamma),$$
where $K$ is the modified Bessel function of the third kind. The error variance $\sigma^2$ can also be incorporated to form a conjugate prior. Since the parameters $k$ and $\gamma$ can have associated prior distributions, the marginal distribution of $\beta_j$ is affected by these prior choices, in such a way that smaller values of the shape parameter $k$ lead to a larger amount of shrinkage for the $\beta_j$ (Griffin and Brown 2010). A natural hyper-prior for $k$ is the Exponential distribution with mean 1, anchoring it on the Lasso, while the hyper-prior for $\gamma$ conditioned on $k$ is given by the inverse of the Gamma distribution with parameters $r > 0$ and $d > 0$ in Griffin and Brown (2010).

The Ridge prior
The simple and long-standing Ridge prior for the regression coefficients $\beta_j$ can be expressed in the hierarchical form
$$\beta_j \mid \sigma^2, \tau^2 \sim N(0, \sigma^2 \tau^2), \qquad \tau^2 \sim \mathrm{IG}(r, d),$$
with $\tau_j^2 = \tau^2$ for $j = 1, \dots, p$, and where $\mathrm{IG}(r, d)$ indicates the inverse Gamma distribution with parameters $r > 0$ and $d > 0$. The penalty parameter $\tau^2$ determines the amount of shrinkage: smaller values lead to small prior variation and to more shrinkage of the coefficients toward zero (Hsiang 1975).
Many authors have developed studies providing conditions on the priors which ensure posterior concentration at the minimax rate. These studies give general guidelines for choosing a shrinkage prior for estimation under a "nearly black" sparsity assumption, meaning that almost all the entries of the coefficient vector in the regression model are zero; they include conditions for the Horseshoe, the Normal-Gamma and other shrinkage priors (van der Pas, Kleijn, and van der Vaart 2014; van der Pas, Salomond, and Schmidt-Hieber 2016).

Simulation study
We develop a simulation study to evaluate the behavior of Bayesian high-dimensional regression models in different "small n, large p" scenarios. This simulation study compares the performance of the model under different prior distributions, in terms of standard indicators used in modeling contexts. We focus on the Bayesian Lasso, the Bayesian Horseshoe, the Bayesian Normal-Gamma and the Bayesian Ridge regression. In the following, we present the set-up of the simulations, describing the data generative process, the analyzed scenarios and the performance indicators used for the comparison.
The statistical analyses are implemented using the R-project free software environment for statistical computing; we use the R-package monomvn to fit the regression models (Gramacy and Pantaleo 2010; Gramacy 2019). The specific parameters used for the selected regression approaches are reported in the Appendix.
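For reference, a minimal call sequence might look as follows; it assumes the blasso() interface documented for the monomvn package (Gramacy 2019), where the case argument selects the Ridge ("ridge"), Horseshoe ("hs") and Normal-Gamma ("ng") variants, and the toy dimensions are ours.

```r
library(monomvn)
set.seed(1)
n <- 100; p <- 200
X <- matrix(rbinom(n * p, size = 1, prob = 0.1), n, p)
beta <- c(rep(5, 10), rep(0, p - 10))
y <- as.vector(X %*% beta + rnorm(n))
# Bayesian Lasso (default case), Horseshoe and Normal-Gamma fits
fit_bl <- blasso(X, y, T = 1000, verb = 0)
fit_hs <- blasso(X, y, T = 1000, case = "hs", verb = 0)
fit_ng <- blasso(X, y, T = 1000, case = "ng", verb = 0)
# posterior probability that each coefficient is non-zero
pip <- colMeans(fit_hs$beta != 0)
which(pip >= 0.5)   # "discoveries" under the 0.5 threshold adopted later
```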

Set-up
The simulations are based on the linear regression model in Eq. 1, where each error component $\epsilon_i$ has a zero-mean Normal distribution with $\sigma^2 = 1$. The design matrix $X$ has independent elements generated from Bernoulli distributions, i.e., $X_{ij} \sim \mathrm{Ber}(\pi)$ with $\pi = 0.1$, for $i = 1, \dots, n$ and $j = 1, \dots, p$. This setting represents the situation in which a process is described by $p$ independent binary predictors and a quite strong form of sparsity in the design. The simulations consider an increasing number of predictors, $p = 200$, 500, 1000, 2000 and 5000.
Following Griffin and Brown (2010), only $k^\star = 10$ regression coefficients are non-zero and they are placed evenly throughout the vector of regression coefficients: using the modulo operation, one non-zero entry is placed every $p/10$ positions, producing 10 equally spaced spikes. To represent also opposite effects, we swap half of the coefficients to their opposite value, so that the spikes alternate between $+\beta^\star$ and $-\beta^\star$. We simulate data using different values of $\beta^\star$ to identify how this could affect the results; in particular we select $\beta^\star = 1$ and $\beta^\star = 5$. To understand the effect of a set of coefficients with different orders of magnitude, we also generate a simulation in which the vector of regression coefficients has 10 non-zero values placed in the first positions of the vector, with different values: $\beta^T = [1\ 1\ 1\ 1\ 5\ 5\ 5\ 5\ 10\ 10\ 0 \cdots 0]$, with $\beta_j = 0$ for $j = 11, \dots, p$. The signal-to-noise ratio is derived from the definition provided in Hastie, Tibshirani, and Friedman (2001), resulting in values ranging from approximately 1 for $\beta^\star = 1$, to 4.7 for $\beta^\star = 5$ and 5.3 for $\beta^\star = \{1, 5, 10\}$. From these data generative processes, we simulate $N = 2000$ observations (the full dataset) and we randomly select $n = 100$, 200 and 500 sample points (corresponding to 5%, 10% and 25% of the full dataset, respectively) as the training set to estimate the models. The remaining $(N - n)$ data points are used as the test set on which to evaluate the predictive performance of the various Bayesian methods. Following this set-up, we derive a total of 36 scenarios characterized by different values of $\beta^\star$, $n$ and $p$. Each simulation is repeated for 50 MonteCarlo runs to test the robustness of the approaches to changes in the training and test data. The same data are then used by all the models to make reliable comparisons. In each simulation, the MCMC algorithm was run for 1000 iterations with the first 100 discarded as burn-in.
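A compact R sketch of this generative process follows; the exact spike-placement rule is our reading of the description above (one non-zero coefficient every $p/10$ positions, with alternating signs), so it should be taken as an assumption rather than the authors' exact code.

```r
# Data generative process for one simulation scenario (sketch).
make_scenario <- function(n, p, b_star, k_star = 10, prob = 0.1, sigma = 1) {
  beta <- rep(0, p)
  spikes <- seq(p / k_star, p, by = p / k_star)  # 10 equally spaced positions
  beta[spikes] <- b_star * c(1, -1)              # half swapped to opposite sign
  X <- matrix(rbinom(n * p, size = 1, prob = prob), n, p)
  y <- as.vector(X %*% beta + rnorm(n, sd = sigma))
  list(X = X, y = y, beta = beta)
}
dat <- make_scenario(n = 100, p = 200, b_star = 5)
```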
We also propose four very challenging scenarios where the absolute value $|\beta^\star|$ of the coefficients of the active variables is less than 1. Specifically, we set the non-zero coefficients of the relevant variables to $|\beta^\star| = 0.1$ and $|\beta^\star| = 0.5$, with $n = 100$ and $p = \{200, 5000\}$. The results of these simulations are presented as Supplemental Material.

Performance indicators
We use several measures to compare the performance of the Bayesian regressions built with the selected priors.
We derive the predictive performance of the models by means of the Predictive Mean Square Error,
$$\mathrm{PMSE} = \frac{1}{N - n} \sum_{i=1}^{N - n} (y_i - \hat{y}_i)^2,$$
where $y_i$ is the actual value of the response and $\hat{y}_i$ represents the corresponding value estimated by the regression model, for each of the $(N - n)$ observations of the test set. The smaller this value, the better the model is at producing accurate predictions.
We then derive three measures to evaluate the variable selection performance of the models: the Total Number of Discoveries, TND, the False Discovery Rate, FDR, and the Sensitivity, S. In this paper we rate as "discoveries" those predictors whose estimated posterior probability of being different from zero is greater than or equal to 0.5. From this, TND represents the number of individual components of the regression coefficient vector $\beta$ which have not been shrunk to zero with high probability. FDR is then calculated as the ratio between the number of false discoveries and TND, while S is defined as the ratio between the number of true discoveries and the true number of non-zero regression coefficients (in our simulations $k^\star = 10$). There are other approaches in the literature for variable selection when using shrinkage priors: for a review and some critical discussions we refer to (Brown, Vannucci, and Fearn 2002; Li and Pati 2017; Piironen and Vehtari 2017a). Our approach is perhaps the simplest, although it does not take into account the correlations among predictors, which are nevertheless very relevant.
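These measures can be computed directly from the matrix of posterior draws, as in the following sketch (function and argument names are ours):

```r
# TND, FDR and Sensitivity from a T x p matrix B of posterior draws of beta.
selection_measures <- function(B, beta_true, thresh = 0.5) {
  pip  <- colMeans(B != 0)                 # posterior P(beta_j != 0)
  disc <- pip >= thresh                    # "discoveries"
  TND  <- sum(disc)
  FDR  <- if (TND > 0) sum(disc & beta_true == 0) / TND else 0
  S    <- sum(disc & beta_true != 0) / sum(beta_true != 0)
  c(TND = TND, FDR = FDR, S = S)
}
```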
The average values of all the performance measures are calculated over the MonteCarlo runs, and a paired two-sample t-test between all pairs of models is conducted to check whether the differences in performance between each pair of models are significant. The best models are highlighted in the results.

Predictive performance
Tables 1, 2 and 3 show the performance of the approaches in terms of prediction capacity, i.e., the values of PMSE, for the different values of $\beta^\star$ described in Sec. 2.2.1. The best methods (if any) are highlighted in bold. For $\beta^\star = 1$, we can note that when $n$ is 100 or 200 all the regressions show similar prediction errors except for the Bayesian Ridge. When $n$ increases to 500, the Horseshoe regression seems to perform better than the other approaches, highlighting its ability to give more accurate predictions. Moreover, when comparing the results of the same approach for different values of $n$, we can see that the lower the ratio between $n$ and $p$, the worse the predictive accuracy of each method. This trend is not surprising, as it is known that prediction performance can be poor in situations in which both $n/p$ and the signal-to-noise ratio are very low. For example, considering the Horseshoe regression, the correlation between the actual values of the test set and the corresponding predicted values ranges, on average, from $r_{HS} = 0.36$ for $p = 200$ to $r_{HS} = 0.21$ for $p = 5000$. This means that even if the Horseshoe regression shows a better predictive performance compared with the other approaches, its predicted values can be poor and the estimated model may not be the best alternative in these particular circumstances. When $n$ increases, the correlations range, on average, from $r_{HS} = 0.60$ to $r_{HS} = 0.20$ when $n = 200$, and from $r_{HS} = 0.67$ to $r_{HS} = 0.40$ when $n = 500$, showing a more reliable performance of the model.
When the values of the actual non-zero coefficients in the regression carry a larger signal, i.e., $\beta^\star = 5$, the best Bayesian regression in terms of PMSE is almost always the one developed with the Horseshoe prior. Moreover, here the correlation between the actual values of the test set and the predicted values obtained using the Horseshoe regression ranges from $r_{HS} = 0.99$ when $p = 200$ and 500, to $r_{HS} = 0.97$ when $p = 1000$, and decreases to around $r_{HS} = 0.43$ when $p = 2000$ and 5000. This situation is confirmed for $n = 200$ and 500, with very high values of correlation for $n/p \geq 0.2$. This suggests that the approach is very accurate in prediction when high-dimensionality, sparsity and binary predictors characterize the system under study and the ratio $n/p$ is not very low. The accuracy decreases when the number of predictors increases or the number of observations decreases. This behavior is shared by all the approaches, highlighting how small high-dimensional data can be very difficult to analyze and interpret.
The last simulation shows the case in which the non-zero regression coefficients take different values. Here, what emerges is that again the Horseshoe regression shows the best PMSE performance compared to the other approaches, at least when $n = 100$, but these differences in behavior reduce as $n$ increases. Again, correlations between the actual values of the test set and the predicted values obtained with the Horseshoe regression are very high when the ratio $n/p$ is equal to or greater than 0.2. Finally, the high variability of the results obtained from the MonteCarlo runs does not allow one to detect the best approach for most simulations when $n = 200$ and 500.

Variable selection performance
We report in Table 4 the median value of TND (over the 50 MonteCarlo runs) for each Bayesian regression, the different values of $n = \{100, 200, 500\}$ and the different values of the non-zero simulated coefficients $\beta^\star$. Recalling that in each simulation the number of non-zero coefficients is $k^\star = 10$, we can see that the Horseshoe regression usually makes fewer "discoveries" than the other regressions, confirming its shrinkage properties.
To understand the accuracy in correctly estimating the non-zero regression coefficients, we show in Figures 1, 2 and 3 the FDR and S measures for the different values of $\beta^\star$.
For $\beta^\star = 1$, we can note in Figure 1 (left) that when $n = 100$ all the methods show the same behavior for $p = 200$, with both FDR and S tending toward 1. This means that they find many discoveries, as shown in Table 4, where the true non-zero coefficients are part of the set but many other "false positive" coefficients are also selected. This behavior is attenuated for $p = 500$ and 1000, in particular for the Horseshoe, the Normal-Gamma and the Bayesian Lasso regressions, where FDR and S are around 0.5 and 0.2. Noting that for these simulations the value of TND is quite small, this implies that half of the discoveries are correct but they are only a few relative to the true non-zero coefficients. For $p \gg n$, i.e., $p = 2000$ and 5000, FDR and S tend toward 1 and 0 respectively, highlighting that all the regressions estimate a set of non-zero coefficients which are not correctly identified. Most of these findings are still valid when $n = 200$ (Figure 1, middle) and $n = 500$ (Figure 1, right), but we can see that the Horseshoe regression has values of FDR and S toward 0 and 1, respectively, when $p$ is relatively small. This means that the Horseshoe regression is able to find the true non-zero coefficients with high accuracy, since the median TND is close to $k^\star = 10$ (see Table 4). Moreover, even if the Sensitivity of the Bayesian Ridge seems to outperform the other approaches, the high values of FDR achieved in all the simulations and the high TND mean that it tends to select more predictors with associated non-zero coefficients, most of which are not correctly identified. It is worth noting that in this simulation all the models show difficulties in detecting the true non-zero coefficients for the regression, especially when the ratio between $n$ and $p$ is small, below 0.20.
In Figure 2 we report the FDR and S measures achieved by the Bayesian regressions when $\beta^\star = 5$. When $n = 100$ and $p = 200$ both measures tend toward 1, meaning that again all the methods identify many non-zero coefficients, among which the true ones are also collected; but for $p = 500$ and 1000, FDR tends toward 0 while S is still high, leading to more accurate variable selection. Again, when $p$ increases the models show a worse behavior, confirming the poor accuracy in finding the true non-zero coefficients for the regression. When $n = 200$ (Figure 2, middle) and $n = 500$ (Figure 2, right), the performance improvement of the Horseshoe regression is still confirmed, especially when $p = 1000$ and 2000, with small values of FDR and high values of S. Differences in the values of Sensitivity are reduced, as can be seen from Figure 2 (bottom), but now, when the ratio between $n$ and $p$ is quite high, all the models are able to detect the true non-zero coefficients in the regression (see, for example, values of S > 0.8 for $n = 100$ combined with $p = 200$, 500 and 1000, for $n = 200$ with $p = 500$ and 1000, and for $n = 500$ with $p = 1000$).

Table 4 (excerpt). Median TND over the 50 MonteCarlo runs for each method (BL = Bayesian Lasso, HS = Horseshoe, NG = Normal-Gamma, BR = Bayesian Ridge) under the three coefficient settings; entries and rows not recoverable from the source are left blank.

              beta* = 1          beta* = 5          beta* = {1,5,10}
  n     p     BL  HS  NG  BR     BL  HS  NG  BR     BL  HS  NG  BR
  100   200   39  34  38  53     30  36  30  23     30  37  32  17
  100   500    4   3   4  14     12  13  13  11      9  10  10   7
  100  1000    2   2   2   8     10   9   9  10      6   6   6   6
  100  2000    2   1   1   9      4   4   3  10      4   3   3   8
  100  5000    8   5   6  22      7   6   7  20      8   6   8  19
  200   500   14  10  13  29     24  14  18  42     22  12  16  36
  200  1000    9   6   8  22     12  10  12  18     12   8  10  20
  200  2000    5   2   5  19      8   5   7  16      6   3   5

In Figure 3 we report the performance measures achieved by the Bayesian regressions for variable selection when $\beta^\star$ assumes different values, $\beta^\star = \{1, 5, 10\}$. In this simulation we can confirm the same behavior already discussed in the previous scenarios, with values of S that are comparable among the selected models. It is worth noting how the Horseshoe regression almost always shows the best FDR performance for $p \leq 2000$ and all values of $n$. This means that the use of this prior seems to achieve an efficient degree of shrinkage, leading to a model that is accurate both in prediction and in the capacity of deriving the predictors actually affecting the response.

Summary of the simulation studies
To summarize how the compared methods behave with respect to the different scenarios, we calculate for how many of the three performance indicators previously analyzed (PMSE, FDR and S) each approach produces the best result. A graphical representation of the results achieved by all the methods in the 36 scenarios of the simulation studies is proposed in Figure 4. Light colors mean that the method performs poorly in comparison with the other approaches, achieving the best performance with respect to none or only one of the indicators. Darker colors mean that the method outperforms the other approaches with respect to two or all three performance indicators, showing itself to be the best choice for modeling the data of the specific set-up. First, we want to highlight that none of the methods manages to be the best with respect to all the indicators. Usually, approaches that have high performance in Predictive Mean Square Error also have good performance in False Discovery Rate. Sensitivity, on the contrary, is comparable among the selected models, reaching very high values when p is relatively small and very low values when p is large.
Our empirical conclusions based on the simulation studies are therefore the following: (1) shrinkage approaches can be useful for dealing with sparse data, as they discover very few non-zero parameters while obtaining acceptable prediction errors, at least when the ratio n/p is not very low; (2) the performance results do not differ much between the considered approaches, but the Horseshoe prior seems to perform better under some specific conditions; (3) the Horseshoe prior seems to be a good prior choice for deriving reliable models with small high-dimensional data when the predictors are expressed as binary variables. It is worth noticing that in univariate normal regressions the Horseshoe prior has already been shown to possess many attractive theoretical properties; see for example (Datta and Ghosh 2013; van der Pas, Kleijn, and van der Vaart 2014; Bhadra et al. 2019). In particular, as demonstrated by van der Pas, Kleijn, and van der Vaart (2014), the performance of the Horseshoe is very good when the vector of coefficients is assumed to be sparse in the "nearly black" sense, meaning that almost all of its entries are zero, as in our simulation studies.

Convergence
With respect to the specified set-up used in the simulations, convergence problems are observed especially for low values of the ratio n/p and when the actual non-zero coefficients of the regression are small, i.e., $\beta^\star = 1$. Many papers in the literature confront this issue, providing diagnostics for convergence (Cowles and Carlin 1996) or stopping rules for the iterations (Jones et al. 2006; Flegal, Haran, and Jones 2008; Vats, Flegal, and Jones 2019). In order to better evaluate convergence, several diagnostics are therefore derived. In particular, we calculate Geweke's diagnostic (Geweke 1992) and the Monte Carlo standard error (Jones et al. 2006) for a subset of scenarios characterized by different values of $n$, $p$ and $\beta^\star$. Moreover, a graphical representation of the value of the estimates across iterations (trace plot) is derived. These convergence results are reported in the Supplemental Material.
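Both diagnostics are available in standard R packages (coda and mcmcse); the sketch below applies them to an illustrative autocorrelated chain rather than to the actual model output.

```r
library(coda)
set.seed(1)
chain <- mcmc(as.vector(arima.sim(list(ar = 0.9), n = 1000)))  # toy AR(1) chain
geweke.diag(chain)         # Geweke (1992) z-score: early vs late chain segments
mcmcse::mcse(as.vector(chain))  # Monte Carlo standard error (Jones et al. 2006)
traceplot(chain)           # trace plot of the sampled values across iterations
```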
We are aware that 1000 MCMC iterations may not be sufficient for reaching convergence, so we develop a set of additional simulations in which the MCMC algorithm is run for a number of iterations T equal to 5000, 10000 and 15000, with burn-in t equal to 1000. This increase improves convergence also for those scenarios characterized by low values of the n/p ratio. Some difficulties are again detected for $\beta^\star = 1$, but for $\beta^\star = 5$ and 10 almost all the approaches reach convergence, in particular when the Horseshoe and the Normal-Gamma priors are adopted. The results of these simulations are provided in the Supplemental Material.
In this paper we deal with small high-dimensional data, which lead to many statistical challenges. We decided to limit the computational time due to the number of analyzed scenarios (36) and the number of MonteCarlo replications (50). To quantify the computational cost of the simulations, we report in the Supplemental Material the CPU and total elapsed time (in seconds) when T = 1000 and t = 100 (the baseline values of T and t used in the paper) and when T = 5000, 10000, 15000 and t = 1000 (the increased values of T and t used in the supplemental simulations conducted to reach convergence). Time refers to a single MCMC run, and from the results we can see how the procedures increase their computational time, making it difficult to develop a large number of scenarios for the comparison.

Real data example: lead molecule optimization
We estimate the selected Bayesian regression models to confront a drug discovery problem aimed at the design of small molecules whose level of biological activity is relevant in treating certain diseases. The current problem involves the identification of the best molecules, commonly known as lead molecules, with very high values of associated biological activity. In this work, the therapeutic target of interest is an enzyme that in humans is involved in several inflammation pathways (Pickett et al. 2011). The data were originally generated and analyzed by Pickett et al. (2011), and subsequently investigated by (Borrotti et al. 2014; Giovannelli et al. 2017; Slanzi et al. 2018). The whole experimental space we are dealing with consists of a library of 2500 molecules that have been synthesized in the laboratory. Only a subset of them, namely N = 1704 molecules, is classified as active and has an associated experimental response value representing the level of molecular activity, varying from 3.7 to 8. In this dataset, each molecule is represented by the presence or absence of a set of 22272 fragments, each encoded as a binary variable where 1 represents the presence and 0 the absence of the specific fragment composing the molecule. Fragments are therefore the p predictors in the regression model. A preliminary analysis revealed linear dependencies among them, and this multicollinearity allows us to reduce the total number of possible molecular fragments to 4059. We estimate the Bayesian regression models, where the predictors are the set of p = 4059 binary variables representing the fragments, on a random sample of n = 140 observations as the training set out of all the possible N = 1704 observations. The remaining (N - n) = 1564 observations are considered as the test set. For this real data example, the ratio between n and p is less than 0.03, comparable with the lowest ratio n/p in the previous simulations. Moreover, we observed that when the parameters n and p in the simulations are similar to those chosen in this real data sample, convergence occurs for almost all the approaches as the number of iterations increases. Therefore we run our analyses by setting T = 15000 and t = 1000; convergence results are presented in the Supplemental Material.
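The paper's preliminary collinearity reduction is not described in detail; one standard possibility, shown here as a hedged sketch, is a rank-revealing pivoted QR decomposition that keeps a maximal full-rank subset of fragment columns.

```r
# Keep a maximal set of linearly independent columns of a binary design matrix
# (one possible way to reduce 22272 fragments to a full-rank subset).
drop_dependent_columns <- function(X) {
  d <- qr(X)                              # pivoted QR of the design matrix
  keep <- sort(d$pivot[seq_len(d$rank)])  # indices of independent columns
  X[, keep, drop = FALSE]
}
```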
In Table 5 we report the summary statistics of the actual response value distribution for the whole set of experimental test points (molecules) and the corresponding predicted values achieved with the selected Bayesian models.
From this table we can see how the models derive predicted values of activity in a smaller interval than the actual values. This can represent a drawback, since one of the main aims of a lead optimization problem is to derive the molecules with predicted activity in the top right tail of the value distribution. The Horseshoe regression and the Normal-Gamma regression seem to produce more accurate predictions in the top right tail, since their summary statistics are close to those derived from the experimental values. The estimated density distributions of the predicted values obtained by all the approaches are presented in Figure 5. From the graph, we can see that all the approaches have multi-modal distributions with peaks corresponding to high values of predicted activity, with the Horseshoe regression exhibiting a peak toward the top values of predicted activity.
Calculating the rank correlation between the actual experimental activities and the corresponding predicted values, we obtain $\rho_{BL} = 0.701$, $\rho_{HS} = 0.721$, $\rho_{NG} = 0.713$ and $\rho_{BR} = 0.721$ for the Bayesian Lasso, Horseshoe, Normal-Gamma and Ridge regression, respectively. This indicates that all the approaches show good performance in predicting response values with the same ordering as the real experimentation. To understand how the different regressions perform in predicting the activity of the molecules belonging to the top right tail of the response distribution, i.e., the lead molecules, we report in Figure 6 the actual values of the activity and the corresponding predictions based on the Bayesian approaches for all the molecules with an activity value greater than the 99th percentile of the response distribution, i.e., values greater than 7.5. The experimental points, i.e., the molecules, are ordered by increasing actual values of the activity; the best molecule has a response value equal to 8. We can see that the Horseshoe regression seems to have the best ability in predicting the response values of the lead molecules, producing predictions with the smallest absolute errors. In Table 6 we report the values of the absolute prediction errors achieved by all the approaches for the actual best experimental points. From the table we can see that the minimum absolute prediction error for the best molecule (the experimental point with activity value equal to 8) is achieved by the Horseshoe regression, as is the minimum prediction error for the whole set of selected points. Moreover, calculating the mean and the median of the predicted values, again the Horseshoe regression presents the minimum errors, confirming that this approach seems to be more accurate in prediction when high-dimensionality, sparsity and binary predictors characterize the system under study. Finally, to understand how these different Bayesian regressions perform in selecting important predictors for the problem, we report in Figure 7 the probability that each estimated regression coefficient is different from zero, $\hat{\beta}_i \neq 0$ for $i = 1, \dots, 4059$. In this figure the horizontal colored line represents the 0.5 probability threshold, highlighting which coefficients can be considered relevant in the regression, with estimated value different from zero, and therefore which predictors, i.e., fragments composing the molecules, are selected as affecting the activity response. In particular, the Bayesian Lasso regression has 4 coefficients with probability over the threshold, while the Horseshoe regression, the Normal-Gamma regression and the Bayesian Ridge have 5 coefficients with probability higher than 0.5 of being different from zero. In Table 7 we show the relevant fragments selected by each Bayesian regression model, ranked in accordance with the probability of the corresponding regression coefficient being different from zero. The values of the associated $\hat{\beta}_i$ estimates are also reported. The non-zero coefficients with high probability have the highest estimated values, and therefore the presence/absence of the corresponding fragments can be considered relevant in affecting the activity of the molecules. It is worth noticing that two of the fragments (F80 and F14) are detected with high probability by all the approaches, and they are present with high frequency in the best molecules.
The Horseshoe regression then selects two further fragments (F177 and F144), which are detected also by the Normal-Gamma regression and the Bayesian Ridge, respectively. All these results provide high confidence in considering the Horseshoe prior a good alternative for building high-dimensional Bayesian regressions with binary predictors when the number of observations is much lower than the set of all the possible variables characterizing the system. Moreover, the relevant fragments, i.e., those with a high probability of having an associated coefficient estimated as different from zero, are informative for the biochemical problem under study. Last, we would like to highlight that these results are achieved using fixed values of the number of MCMC samples and of the burn-in parameter, T = 15000 and t = 1000. We observe convergence for almost all the regressions, with predictive performances that are quite satisfactory.

Figure 6. Actual values of the Activity and the corresponding predictions based on the Bayesian approaches for all the best experimental points, i.e., molecules with Activity greater than 7.5: Exp = real experimental molecules; BL = Bayesian Lasso; HS = Horseshoe regression; NG = Normal-Gamma regression; BR = Bayesian Ridge.

Discussion
In this work we developed a comparative analysis to study the performance of a class of Bayesian regressions with binary predictors, in terms of predictive accuracy and efficiency in variable selection. We conducted simulation studies and a real data analysis to investigate the issue of choosing a good model, in particular when the ratio n/p is very small and the signal-to-noise ratio is modest.
We notice that, with regard to the predictive power of the models, they perform almost in the same way, producing accurate predictions in particular when the number of predictors is not very high with respect to the number of observations. This predictive power tends to decrease as the ratio between the number of observations and the number of variables decreases. We also notice almost the same trend for all the different methods; however, the Horseshoe regression generally presents good performance for all the different values of n considered here. A different situation emerges if we consider the variable selection measures: we show how these measures evolve through increasing values of p for different values of n. Moreover, all the regression models show comparable performance and they are affected by the true value of the non-zero coefficients of the regression. Confident that the results of this simulation can be helpful when choosing the structure of the regression model to adopt in a particular problem, we develop a comparison using a set of real experimental data from the drug discovery process. Focusing in particular on the predictions achieved for the experimental activity of the molecules belonging to the best subset of points, we show how the Horseshoe regression can produce accurate predictions while selecting a very small set of fragments affecting the response of the experimentation. We would like to emphasize that one of the aims of this work was to empirically compare the performance of Bayesian model selection under different priors when binary predictors are supposed to characterize the data and simple assumptions are made on them. There remains considerable interest in the influence of the collinearity of predictors on model selection and in considering both categorical (binary) and continuous variables; these aspects will be developed in future research.