A semi-parametric Bayesian approach for detection of gene expression heterosis with RNA-seq data

Heterosis refers to the superior performance of a hybrid offspring over its two inbred parents. Although heterosis has been widely observed in agriculture, its molecular mechanism is not well studied. Recent advances in high-throughput genomic technologies such as RNA sequencing (RNA-seq) facilitate the investigation of heterosis at the gene expression level. However, it is challenging to identify genes exhibiting heterosis using RNA-seq data because high-dimension of hypotheses tests are conducted with limited sample size. Furthermore, detecting heterosis genes requires testing composite null hypotheses involving multiple mean expression levels instead of testing simple null hypotheses as in differential expression analysis. In this manuscript, we formulate a statistical model with parameters directly reflecting heterosis status, and develop a powerful test to detect heterosis genes. We employ a Bayesian framework where the RNA-seq count data are modeled through a Poisson-Gamma mixture with Dirichlet processes as priors for the distributions of the parameters of interest, the fold changes between each parent and the hybrid. Markov Chain Monte Carlo sampling with Gibbs algorithm is utilized to provide posterior inference to detect heterosis genes while controlling false discovery rate. Simulation results demonstrate that our proposed method outperformed other methods utilized to detect gene expression heterosis.


Introduction
Heterosis, also called hybrid vigor, describes the phenotypic improvement of a hybrid offspring over its two inbred parents. Heterosis was documented by [7] and has been widely utilized in growing agricultural crops, such as rice [30], to increase development rates and grain yields. In China, hybrid rice is estimated to be planted on more than 50% of the rice farmland, and produces 10-20% more than inbred varieties [6]. However, the mechanism of heterosis is not yet well studied [5].
Researchers have speculated that genes which are differentially expressed between hybrid offspring and its two inbred parents, or gene expression heterosis, might be responsible for phenotypic heterosis [14,27]. The recent development of high-throughput genomic technologies, such as microarray and RNA-sequencing (RNA-seq), allow researchers to measure the expression levels for tens of thousands of genes simultaneously. Then gene expression heterosis can be studied by comparing expression levels between the hybrid offspring and its two inbred parents for all expressed genes. More specifically, it is of particular interest to test for each gene if it exhibits high-parent heterosis (HPH), i.e. the mean expression level of the hybrid is greater than both parental means, or lowparent heterosis (LPH), i.e. the mean expression level of the hybrid is less than both parental means.
For both microarray and RNA-seq technologies, tens of thousands of genes are simultaneously measured for their expression levels. However, due to the high cost of such experiments, sample sizes are usually small. This introduces the 'small n, large p' problem, where n refers to the sample size and p refers to the number of variables (genes). The power for hypothesis testing in such settings is often low after adjusting for multiple testing errors. To utilize information from other genes, hierarchical models and Bayesian methods have been employed to borrow information across genes. These strategies have been established in differential expression analysis, such as the widely applied moderated-t test for microarray data [26] and baySeq [13] for RNA-seq data. Differential expression analysis aims to identify genes whose expression levels change across treatments or conditions. Hence, the null hypothesis is no change, and is a simple null case. However, for detecting HPH or LPH genes, the null hypotheses involve the mean expression levels for three conditions in a composite null. Therefore, the well-developed differential expression analysis methods are not directly applicable for the detection of heterosis genes.
Only a few methods have been proposed to detect gene expression heterosis. In 2014, Ji et al. [15] constructed an empirical Bayesian framework to detect gene expression heterosis with microarray data where gene expression measurements were modeled as continuous variables. They proposed a normal hierarchical model, which allows information to be borrowed across genes for estimating mean and variance parameters. They applied an empirical Bayes procedure to first estimate model hyperparameters, and then obtain the posterior distributions for gene-specific parameters, based on which heterosis is evaluated. Nowadays, RNA-seq technologies instead of microarray are widely applied for gene expression studies. Generally, RNA-seq count data are modeled with a negative binomial (NB) distribution [1,20]. Based on the work of [15], Niemi et al. [23] proposed an empirical Bayes approach for estimating gene expression heterosis with RNA-seq count data based on an NB hierarchical model in 2015, where heterosis was evaluated by comparing one model parameter with the absolute value of another model parameter. In 2019, Landau et al. [17] developed a general hierarchical model for RNA-seq count data and a fully Bayesian analysis with parallelized Markov Chain Monte Carlo (MCMC) algorithm to improve the computational efficiency. They also showed that the empirical Bayes approach can be an approximation of a fully Bayesian analysis if accurate hyperparameter estimates can be obtained. Both methods [17,23] are based on the assumption that gene-specific parameters are independent and arise from given parametric distributions. However, the distributions of parameters across all genes are not guaranteed to follow the assumed parametric distributions in practice. Empirical distributions of parameters could be irregular and vary between studies [18]. Therefore, under these circumstances, it is hard to model the empirical distribution across all genes with given parametric methods. In addition, both methods [17,23] did not assess the controlling of false discovery rate (FDR), which has been the choice of error criterion in RNA-seq data analysis, where tens of thousands of hypotheses tests are simultaneously conducted.
To avoid unrealistic parametric assumptions and to take FDR control into consideration, we propose to use nonparametric Bayesian methods. The Dirichlet process (DP) mixture model is one popular nonparametric Bayesian method, and such a modeling method has been used for differential expression analyses when comparing two different conditions. For instance, Do et al. [8] utilized DP mixtures to model the mean expression levels of genes for each of two conditions with microarray data in 2005. Liu et al. [18] chose DP mixtures for modeling the distribution of fold change parameters of a treatment condition with respect to a reference condition for RNA-seq data in 2015. In 2019, Bi and Liu [3] modified the base distribution of the DP prior used in [18], in order to guarantee that the model is invariant regardless of which treatment group is set to be the reference condition.
Building on the work of [18], we capture RNA-seq data with a Poisson-Gamma mixture that is equivalent to an NB model. We treat the hybrid offspring as the reference treatment, as heterosis status is determined by comparing the hybrid genotype with two parental lines. In addition, we parameterize our model so that we have model parameters corresponding to the fold changes between the mean expression levels of the hybrid offspring versus each parental line separately. We then construct a semi-parametric Bayesian approach and use posterior results for detection of gene expression heterosis while controlling FDR.
The rest of this manuscript is organized as follows. Section 2 introduces our proposed semi-parametric Bayesian approach and prior models, then applies the MCMC sampling scheme for posterior inference and FDR estimation. Section 3 provides an algorithm for improving computational efficiency grounded on a division of the data. In Section 4, we conduct several simulation studies with NB distributions and compare the results of our approach to the method in [23]. In Section 5, we analyze a real maize dataset and identify heterosis genes with our proposed method. Section 6 provides a summary and some discussion of our work.

Method
In this section, we first introduce our modeling framework, specify the prior models we adopted, then provide the MCMC sampling method for posterior inference and FDR estimation.

Model
We consider gene expression heterosis experiments that involve three genotypes: the hybrid offspring genotype, and the two parental inbred lines. Although the offspring genotype is generated by crossing the two parental lines, plants for the three genotypes are grown together in the same environment to provide samples for gene expression heterosis studies. Suppose that a completely randomized design with independent biological replicates for each genotype has been used for the gene expression heterosis experiments. For RNA-seq experiments with biological replicates in each treatment (genotype) group, the NB distribution has been commonly employed for modeling the RNA-seq count data [1,13,20]. Notice that the NB distribution has no conjugate prior and introduces computational difficulties in Bayesian hierarchical modeling. We re-parameterize the NB model with Poisson-Gamma mixtures that make Bayesian hierarchical modeling much easier.
Consider an RNA-seq heterosis experiment that measures G genes. Let Y gij denote the observation for gene g from biological replicate j of genotype i, where g = 1, . . . , G, i = 1, 2, 3, (i = 1 denotes hybrid offspring, i = 2 denotes parental line 1, and i = 3 denotes parental line 2), j = 1, . . . , n i , and n i is the number of biological replicates in treatment i. Then count data Y gij can be modeled using a Poisson-Gamma mixture model as below, where S ij denotes a normalization factor accounting for nuisance technical effects such as sequencing depths across the replicates [1], λ gij is the conditional expression mean from replicate j in treatment i for gene g, α g denotes the shape parameter that corresponds to the reciprocal of the dispersion parameter in the NB model for gene g, β g refers to the rate parameter for hybrid offspring, the product of β g and ρ g1 is the rate parameter for parental line 1, and the product of β g and ρ g2 is for parental line 2. In fact, the marginal distribution of Y gij is NB with dispersion parameter 1/α g and mean parameter α g /β g , α g /(β g ρ g1 ), and α g /(β g ρ g2 ) for the hybrid, parental line 1, and parental line 2, respectively. Note that the mean ratio of offspring over parental line 1 is ρ g1 , which is referred to as the fold change parameter between hybrid offspring versus parental line 1 for gene g. Similarly, ρ g2 denotes the fold change parameter between hybrid offspring versus parental line 2.
As shown in (1) and (2), under our unique parameterization for heterosis detection, conditions for HPH and LPH are expressed by comparing each of the two parameters with a constant instead of comparing three means with each other, which simplifies the problem. In addition, using the fold change parameters ρ g1 and ρ g2 make interpretation more straightforward.

Prior specification
Since our primary focus is the fold change parameters ρ g1 and ρ g2 , it is crucial to choose appropriate prior distributions for them. To provide maximal flexibility, we propose to use nonparametric Bayesian modeling with DP to model the prior distributions for ρ g1 and ρ g2 . A DP is a family of stochastic processes whose realizations are probability distributions. In other words, a DP is a distribution over distributions. DP is specified by a base distribution F 0 and a positive real number M called the concentration parameter. For a given measurable set , a random probability distribution F is drawn from a DP if for any measurable finite partition of , denoted by . We denote F as F ∼ DP(M, F 0 ). The base distribution represents the mean of the process, while the concentration parameter illustrates how strong the discretization is.
Next we will utilize a DP for modeling the fold change parameters. Here we illustrate the DP modeling procedure for ρ g1 (fold change between hybrid offspring and parental line 1) as an example, the same procedure is applied to ρ g2 . Following [18], a mixture of a point mass at one and a Gamma distribution is used as the base distribution of the DP prior for ρ g1 . This can be written as for gene g, g = 1, . . . , G, where p 0 is the proportion of equivalently expressed genes between the hybrid and parent 1, δ {x} represents point mass at x. Throughout this manuscript, we set p 0 = 0.5 so that no prior preference is given to either differential expression or equivalent expression between hybrid offspring and parental line 1. We set the concentration parameter M = 1, a common choice in applications [8,12,16].
We assign an exponential distribution for the prior of α g , and a Gamma distribution for the prior of β g , where r, a 0 , b 0 and α 0 , β 0 are hyperparameters. Also, we set r = 0.01, a 0 = 0.1, b 0 = 0.1, α 0 = 0.1, β 0 = 0.1 to have non-informative priors so that the inference for α g and β g primarily relies on the observed data. All priors for α g , β g , ρ g1 and ρ g2 are set to be independent. Because we apply nonparametric priors for the fold change parameters and parametric priors for other parameters, the method we propose is a semi-parametric Bayesian approach.

Markov Chain Monte Carlo simulation
With the priors specified, the posterior distributions can be derived via multiplying the priors by the likelihood function. We adopt an MCMC [29] based sampling method to draw samples from the posterior distribution. More specifically, we utilize the Gibbs algorithm to perform MCMC when conjugate priors are utilized.
In DP mixture modeling procedure, MCMC sampling methods are generally based on integrating F over its DP prior (3), where the sequence of ρ g1 's follows a Pólya urn scheme [4,9], that is, where ρ −g1 is the vector (ρ 11 , . . . , ρ G1 ) after deleting ρ g1 . Then, the most straightforward way to draw samples from our model is to update ρ 11 through ρ G1 iteratively. However, this approach is inefficient. Since in RNA-seq experiments, it is likely that many genes share the same or very similar ρ g1 , but this method cannot change ρ g1 for multiple genes simultaneously. A change to the ρ g1 for genes in such a group occurs with a low probability. Thus, converging to the posterior distribution may take a long time [21]. Due to this computational efficiency issue of the MCMC algorithm, configuration indicators are used here as in [18]. Suppose K is the number of distinct values in (ρ 11 , . . . , ρ G1 ) and let the distinct values be denoted by ρ * 1 , . . . , ρ * K . Define ξ = (ξ 1 , . . . , ξ G ) as the configuration indicators by Then, the prior model for ρ g1 is re-parameterized with ρ * k and ξ g as below, where ρ * k and ξ g have independent priors and CRP stands for Chinese Restaurant Process, which is a random distribution with the full conditional distribution of ξ g written as where K (−g) is the number of unique values in (ρ 11 , . . . , ρ G1 ) after deleting ρ g1 , and n (−g) k is the number in (ρ 11 , . . . , ρ G1 ) who equal ρ * k after deleting ρ g1 . The MCMC sampling scheme uses Gibbs sampling algorithm to update each of the following parameters: (1) λ gij 's, (2) β g 's, (3) α g 's, (4) ρ g1 's and (5) ρ g2 's, where the update of ρ g1 's and ρ g2 's utilizes the configuration indicators as shown above.
The detailed derivations of the full conditionals for each parameter are provided in Web Appendix A. The posterior samples for both ρ g1 and ρ g2 are then used for further inference.

Bayesian FDR estimation
In gene expression heterosis studies, a massive number of hypotheses tests are conducted, each related to a gene. Therefore, the number of false significant results needs to be controlled for such multiple testing procedure. As in other genomic studies, we choose to control FDR, defined as the expected proportion of false positives among the discoveries [2], in RNA-seq data analysis. In a Bayesian framework, we are able to construct procedures for estimating FDR through Bayesian FDR [11,22] by using posterior probability.
where N denotes the total number of posterior samples used for inference. We conclude the gene exhibits HPH or LPH when the estimated 1 −v g is less than a critical value c * , which can be chosen based on a desired level of FDR, γ , .
Then the Bayesian FDR controlled at γ can be estimated by .

Data division
The method we proposed is based on the MCMC sampling scheme that updates parameters iteratively among genes. Not surprisingly, such a procedure is quite time consuming, especially when the total number of genes is huge. In order to improve the computational efficiency, we consider a strategy that divides the raw dataset into several small datasets, applies our proposed method independently to the smaller datasets using parallel computing and then combines the posterior samples together for further inference. Assume we have G genes, and we randomly divide them into m groups, so that each group has an approximately equal number of genes. We assess our proposed method with and without this data division strategy in simulation studies.

Simulation studies
In this section, we carry out several simulation studies to evaluate our proposed semiparametric approaches, SBA (without data division) and SBA_div (with data division), and compare them to the empirical Bayes method in [23] (eBayes_Laplace and eBayes_Normal, depending on the parametric prior assumption). Landau et al. [17] proposed a fully Bayesian analysis and also showed that their fully Bayesian method could be well approximated by the empirical Bayes method in [23]. In addition, the fully Bayesian method is more time consuming and requires more computational resources than the empirical Bayes method. Hence, we only include the empirical Bayes method [23] but not the fully Bayesian method [17] in our simulation studies. Converting RNA-seq count data into continuous data and applying the approach proposed in [15] is also an option, but Niemi et al. [23] have already demonstrated in their simulation studies that such approach had inferior performance to their method, thus we also omit the comparison with the method developed in [15].
To imitate the real RNA-seq data, gene-specific mean and dispersion parameters were estimated from a real maize dataset [28]. We conducted two simulation studies, A and B, which differed in how fold change parameters were simulated. For each simulation study, 32 datasets were generated independently, and test performance for each method under comparison was assessed by averaging results over the 32 datasets. Each dataset contained 3000 genes, 3 genotypes and 3 replicates per genotype, and was simulated based on NB models with estimated pairs of mean and dispersion parameters. For our SBA and SBA_div methods, posterior probabilities were estimated by 5000 posterior samples after 3000 iterations burn-in. Convergence was checked by Gelman-Rubin criteria [10].

Simulation A
We estimated the gene-specific mean from one treatment group in [28]'s maize dataset, as well as the dispersion parameters across two treatments. We randomly sampled 3000 out of 27,819 pairs of mean and dispersion parameters without replacement, to use as geometric means across three genotypes (μ g ) and dispersion parameters (φ g ) for gene g = 1, . . . , 3000. The RNA-seq count data for the hybrid offspring were generated from NB(μ g * , φ g ) for gene g. Then, 1500 out of the 3000 genes were randomly selected, and ρ g1 for these genes were set to be 1, which means that count data for parental line 1 were also drawn from NB(μ g * , φ g ). The remaining 1500 genes were simulated to have fold change parameters ρ g1 set to be 0.125, 0.25, 4, or 8, thus we had 375 genes for each value of ρ g1 . Then RNA-seq count data for parental line 1 were drawn from NB(μ g * /ρ g1 , φ g ). The count data for parental line 2 were generated similarly while ρ g2 was generated independently of ρ g1 . Note that μ * g = μ g (ρ g1 ρ g2 ) 1/3 such that the geometric mean of the hybrid and two parental lines is μ g .

Simulation B
Similar to Simulation A, 3000 genes were drawn from NB(μ g , φ g ), where pairs of μ g and φ g were sampled from the estimates from the same maize data. Again, 1500 out of 3000 genes were randomly selected to have fold changes ρ g1 = 1 between hybrid and parental line 1. For the remaining 1500 genes, we simulated ρ g1 from the following distribution, log(ρ g1 ) ∼ 0.5Normal(− log(4), 1) + 0.5Normal(log(4), 1). The fold change parameters between the hybrid and parent 2, ρ g2 , were generated in the same way independently of ρ g1 .

Simulation results for detecting gene expression heterosis
Different normalization methods may affect the performance of the methods under comparison. To avoid the impact of different normalization methods, we set normalization factor S ij = 1 for all methods in both simulation studies.
We first evaluate the performances of different methods with the receiver operating characteristic (ROC) curve, which is the plot of the true positive rate (TPR) against the false positive rate (FPR). For each simulated dataset, TPR and FPR were calculated by ranking heterosis genes via posterior probabilities. Then, given each FPR level, the average TPRs over 32 simulated datasets were calculated, leading to the ROC curves shown in Figure 1. We only plotted the ROC curves within the region where FPR is below 0.1, which is often of primary interest in practice. The partial area under curve (AUC) values were calculated as well, which is the proportion of the total area in the region where FPR is no larger than 0.1. The average AUC values and the standard deviations across simulated datasets are presented in the legends.
As indicated in Figure 1, our proposed methods (SBA and SBA_div) generated higher ROC curves and greater AUC values than the empirical Bayes method proposed in [23], under both simulation settings A and B. To implement SBA_div, we randomly divided the 3000 genes into 5 groups, with 600 genes in each group, then applied our SBA method independently to the 5 groups. Therefore, Figure 1 demonstrates that our proposed methods outperformed the empirical Bayes method in terms of the ability to correctly ranking true heterosis genes.
We also evaluated the FDR estimation method described in Subsection 2.4 using the posterior probabilities for each method. FDR plots for Simulations A and B are presented in Figure 2. Given each nominal level of FDR, the actual observed FDRs were estimated by averaging the proportion of false discoveries among declared heterosis genes across 32 simulated datasets. A well-performing method would control the FDR close to or below nominal level. As shown in Figure 2, our proposed methods (SBA and SBA_div) controlled FDR, while FDR was not controlled for the empirical Bayes method in [23].
In Figure 2, the FDR curves for our proposed methods are below the Y = X line, indicating that our methods are conservative. For further study of the FDR control, we checked the actual FDR, the number of declared heterosis genes, and the number of truly declared heterosis genes for each nominal level of FDR. The results for HPH or LPH in Simulations A and B are presented in Table 1 and Web Tables 1-3 in Web Appendix B respectively. The empirical Bayes methods identified more true heterosis genes than our methods. However, they also generated many more false positives than desired and resulted in liberal actual FDR. Based on the simulation results, our proposed methods generated higher ROC curves compared with the empirical Bayes method in [23]. Furthermore, our methods controlled FDR, and hence provided a reliable list of genes exhibiting HPH or LPH at a desired level of FDR. All in all, our proposed methods worked better than the empirical Bayes method proposed in [23] under both simulation settings.

Number of groups
In this subsection, we studied how the SBA_div method works as the number of groups, m, varies. If we randomly divide the G = 3000 genes into m groups, m = 5, 10, or 25, the ROC curves and FDR plots are shown in Figures 3 and 4. In Simulation A, the results based on different m's did not differ too much, indicating that we could choose a relatively large m to receive more computational efficiency. In Simulation B, smaller m led to slightly better results, which was as expected. All choices of divisions controlled FDR well across all simulation settings. Table 2 provides the computational time needed for each method. The computational time for each simulation was calculated on a cluster node that was equipped with two 8-core 2.6GHz Intel Haswell E5-2640 v3 processors. Our SBA method with random division (SBA_div) and the empirical Bayes methods [23] (eBayes_Laplace and eBayes_Normal) could be parallelized to increase efficiency, and the parallelization was done across 16 cores. We could notice that the computational time based on the data division of our proposed method (SBA_div) was comparable to the empirical Bayes methods. As the number of divisions increased, the computational time decreased. However, as indicated in Figures 1 and  3, a larger number of divisions led to slightly worse results but was still better than the empirical Bayes methods.

Real data analysis
We applied our proposed methods to a real RNA-seq heterosis dataset published by [24]. This data studies gene expression heterosis between parental lines, B73 and Mo17, and the hybrid genotype (B73×Mo17). We used the same criterion as in [23] to filter genes with low abundance. More specifically, we kept genes with an average count equal to or greater than one and with no more than two zero read counts within the four biological replicates for each genotype, and 28,943 genes were left for gene expression heterosis analysis.   Table 3 provides the number of heterosis genes detected by different methods when controlling FDR at 0.1 or 0.05. The eBayes_Laplace and eBayes_Normal methods detected more LPH genes than our proposed methods. However, based on our simulation results that FDR control was very liberal for the empirical Bayes methods, the list of declared heterosis genes may include more false positives than desired.
Although the eBayes_Laplace and eBayes_Normal methods detected nearly the same number of HPH genes when controlling FDR at 0.1 or 0.05, the lists of HPH genes detected by the empirical Bayes method [23] were different from what our method identified. Venn diagrams of detected HPH and LPH genes when controlling at different FDR levels are presented in Web Figures 1-2 in Web Appendix C respectively. Again, the HPH genes detected from [23] might not be reliable due to their failure of FDR control based on our simulation results. Without knowing the true heterosis genes at the moment, more biological experiments are needed to validate these results.

Discussion
Gene expression heterosis has been hypothesized to help account for phenotypic heterosis, such as grain yields increment. Thus, identifying heterosis genes is a crucial issue, and may have a strong impact on biology and genetics. Existing methods for detecting gene expression heterosis with RNA-seq data require parametric assumptions [17,23]. We proposed a novel model within a semi-parametric Bayesian framework so that heterosis is directly modeled by our model parameters. We adopted an MCMC sampling scheme to provide posterior inference for detecting gene expression heterosis. Our method provides a more flexible way that avoids the dependence on parametric assumptions. From the simulation studies, we demonstrated that our proposed method outperformed the empirical Bayes method in [23], in terms of ranking heterosis genes and FDR control. Therefore, our method offers a reliable way to detect gene expression heterosis for RNA-seq experiments.
Throughout the process of building our semi-parametric Bayesian modeling framework, we considered the two inbred parents to be independent, and modeled the fold change parameters between the hybrid offspring and each parental line, ρ g1 and ρ g2 , separately. Consider parental line 1 as an example: we set the hybrid offspring as the reference condition, and modeled the distribution of gene-specific fold change parameters between the hybrid and parental line 1 using a DP prior. In a typical heterosis study containing two parents and one hybrid, the hybrid offspring is naturally selected as reference, thus fold changes ρ g1 and ρ g2 can be viewed as effects of each parental line on the hybrid. If there is biological knowledge that ρ g1 and ρ g2 may be correlated, the two parameters may be modeled jointly.
Our proposed model assumes that the hybrid offspring and two inbred parents share the same dispersion parameter, which aligns with popular methods for RNA-seq data analysis such as edgeR [20,25], DESeq [1] and DESeq2 [19]. Our method can be extended to a more flexible model that assumes different dispersion parameters α gi for the hybrid offspring and two parental lines, where i = 1, 2, 3 denotes hybrid offspring, parental line 1, and parental line 2 respectively. Then the full conditional distributions for λ gij , β g , α gi , ξ g and ρ * k can be modified easily. However, adding more parameters would introduce additional steps in the MCMC sampling scheme and hence increase computational complexity.
The DP priors depend on the base distribution F 0 and the concentration parameter M. We used a mixture of two components as F 0 : a point mass at one and a Gamma distribution. The choice of the point mass component is due to the high frequency of estimated fold changes that lie in the small range around 1 based on real data. The choice of the Gamma distribution as the second component is because the Gamma distribution ensures conjugacy that facilitates computation. In the DP priors, the concentration parameter M is commonly chosen as M = 1 in applications [8]. We also checked the simulation results with various values of M (M = 0.2, 0.5, 2, 5, 10 or 20), where the results remained nearly the same for different M.
We specified p 0 = 0.5 so that no prior preference is given to either differential expression or equivalent expression between hybrid offspring and either parental line. To investigate the robustness of setting p 0 = 0.5 under different simulation scenarios, we conducted more simulations by varying the proportion of genes having fold change 1 between hybrid offspring and each parental line. Simulation results (presented in Web Appendix D) show that using this prior of p 0 is robust under all settings. Our proposed methods (SBA and SBA_div) performed better than the empirical Bayes method in terms of both ROC curves and FDR control under all simulation scenarios.
Although our proposed semi-parametric Bayesian method provides a reliable approach for the detection of gene expression heterosis, computational complexity might be an issue. In order to improve the efficiency, we also provided an algorithm based on a division of the data. The choice of number of groups, m, is a trade-off between efficiency and accuracy. According to the simulation results, a larger number of divisions m led to lower accuracy, but still outperformed the current empirical Bayes methods with comparable computational time. Additional discussion about data division can be found in Web Appendix D.
When performing our proposed method on the real data, the heterosis genes detected with different number of divisions are not exactly the same. Part of the reason is due to the randomness of MCMC. If we run another MCMC using a different seed, the heterosis genes detected by the two MCMCs are not necessarily the same. In addition, whether the Markov chains are long enough to get accurate results could also be a potential problem. We checked the effective sample size for each gene. Genes that were detected to be heterosis genes by all numbers of divisions had effective sample sizes greater than genes that had different declared heterosis status by different numbers of divisions. So for those genes with a low effective sample size, we may need to run longer chains. Based on simulation checking, running the Markov chains longer do increase the percentage of overlapping genes, as expected. However, running longer chains is more time consuming. Therefore, it is also a trade-off between efficiency and accuracy, and we will let the users decide which one is more important for a practical application.