Sequential Bayes Factors for Sample Size Reduction in Preclinical Experiments with Binary Outcomes

Abstract Preclinical studies are an integral part of pharmaceutical drug development, yet traditional methods for designing and analyzing these studies can be inefficient and wasteful. Even worse, when the units of study are animals, ethical concerns can arise. The 3Rs initiative was established for the ethical treatment of animals through the replacement, reduction, and refinement of animal experiments. In this article, we focus on the reduction aspect of the 3Rs initiative through the use of sequential Bayes factors. Sequential Bayes factors have the potential to help design more efficient experiments that can be analyzed sequentially, reducing the average number of animals needed in preclinical studies. As an added bonus, sequential Bayes factors provide a means of quantifying evidence both for and against the null hypothesis, a characteristic not shared by traditional preclinical trial analysis methods. Illustrations highlighting the success of sequential Bayes factors are provided for two real seven-day preclinical experiments in rats, as well as through extensive simulation studies.


Introduction
In pharmaceutical drug development, candidate drugs must first be assessed for potential safety and toxicity issues before they can be administered to humans in clinical trials. These preliminary assessments, or preclinical studies, must be well planned and designed in order to ensure safe clinical drug development. Preclinical experiments are carried out either in vitro, in vivo, or both, with the latter case of in vivo experiments typically requiring experimentation in animals. Thus, ethical issues arise over the use of animals in preclinical experiments, and in particular, with regard to the justified number of animals used. If too few animals are used, the conclusions of the experiment may not be valid; if too many are used, we needlessly waste resources and, worse, unethically experiment on more animals than needed. Recognizing these issues, the 3Rs initiative was established (Russell and Burch 1959) for the ethical treatment of animals. The 3Rs initiative aims to provide more humane animal research through replacement, reduction, and refinement of animal experiments. In this article, we focus on the reduction aspect of the 3Rs, that is, reducing the number of animals used per experiment.
In the planning phase of an experiment, the purpose of a prospective design analysis (e.g., power analysis) is to facilitate the design of a study that ensures a sufficiently high probability of detecting an effect if it exists. With regard to preclinical animal experiments, this is akin to choosing a large enough fixed sample size to run the experiment with. A sizeable literature exists on the use of frequentist power analysis in the null hypothesis significance testing (NHST) paradigm to facilitate the design of informative animal experiments (Charan and Kantharia 2013; Arifin and Zahiruddin 2017). However, not as much attention has been given to the use of Bayesian statistics, even though it may present better options. For example, the recent paper by Bradstreet (2021) showed that, under the Bayesian paradigm, future predictions of efficacy could be used to stop an animal experiment early if the probability of meeting the experiment's efficacy goal is judged to be low based on the posterior predictive distribution. Notably, the Bayesian framework provides convenient ways to incorporate historical or external knowledge. In clinical study settings, this feature has been widely applied to design seamless adaptive phase I/II or II/III studies (Berry et al. 2010). Preclinical experiments, however, commonly have little existing knowledge, which explains to some degree the unpopularity of Bayesian designs. Nonetheless, we do not view the dearth of applications of Bayesian statistics as problematic, but rather as an opportunity to explore novel ideas in the preclinical space.
In this article, we propose the novel idea of using sequential Bayes factors (Schönbrodt et al. 2017; Schönbrodt and Wagenmakers 2018) to reduce the number of animals used in preclinical experiments. The use of sequential Bayes factors allows experiments to be carried out one animal (or some predetermined cohort size m) at a time in order to assess the evidence for a given experimental model or hypothesis. The sequential nature of the test allows the experiment to optionally stop early if the weight of evidence, after n animals, is judged to be adequate.
The remainder of this article is organized as follows. In Section 2 we introduce the concept of a Bayes factor for quantifying model evidence, and in Section 3 we go a step further and explain how Bayes factors can be calculated sequentially to produce optional stopping rules. Section 4 highlights the performance of the sequential Bayes factor framework on synthetic examples, and Section 5 demonstrates its efficiency on two real datasets from a seven-day animal study in rats. Lastly, Section 6 finishes with some discussion of potential future work and concluding remarks.

Bayes Factors
At its simplest, Bayesian statistics allows prior opinion about an experiment to be updated into a posterior opinion through consideration of the data. A natural extension of this idea is to judge your posterior beliefs about one hypothesis, given the data collected, versus another hypothesis. More mathematically, we might be interested in the following expression:

Pr(M1 | Data) / Pr(M0 | Data) = [Pr(Data | M1) / Pr(Data | M0)] × [Pr(M1) / Pr(M0)], (1)

where M1 and M0 denote the effect model (a.k.a., the alternative hypothesis) and null model (a.k.a., the null hypothesis), respectively. In simpler terms, Equation (1) can be thought of as

posterior odds = Bayes factor × prior odds. (2)

Here, the posterior odds represent the relative beliefs about the models (hypotheses) after seeing the data, while the prior odds represent the relative beliefs before seeing the data. The term Pr(Data | M1)/Pr(Data | M0) is the Bayes factor, and it describes how beliefs are to be updated (Kass and Raftery 1995). Bayes factors quantify the relative evidence in the data with respect to whether the data are better predicted by one model/hypothesis (e.g., a null hypothesis, "there is no effect in the population") or a competing model/hypothesis (e.g., "there is a nonzero effect in the population"), and range in value from 0 to ∞. Consistent heuristic rules have been given based on Bayes factors, and Table 1 shows the commonly referenced ones suggested in Lee and Wagenmakers (2013).

Table 1. Evidence categories for Bayes factors (Lee and Wagenmakers 2013).
  > 100        Extreme evidence for M1
  30-100       Very strong evidence for M1
  10-30        Strong evidence for M1
  3-10         Moderate evidence for M1
  1-3          Anecdotal evidence for M1
  1            No evidence
  1/3-1        Anecdotal evidence for M0
  1/10-1/3     Moderate evidence for M0
  1/30-1/10    Strong evidence for M0
  1/100-1/30   Very strong evidence for M0
  < 1/100      Extreme evidence for M0

Note, unlike frequentist hypothesis testing, Bayes factors allow one to quantify the evidence in favor of the null hypothesis, or, more simply put, the ability to accept the null hypothesis. This is a completely foreign concept under the frequentist hypothesis testing paradigm, where one can only reject the null hypothesis (i.e., accept the alternative hypothesis) or fail to reject the null hypothesis (which is not the same as accepting the null hypothesis). Thus, Bayes factors provide a very useful and practical alternative to frequentist hypothesis testing.
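As a concrete toy illustration of Equation (2), consider comparing a point null M0: θ = 0.5 against an alternative M1 that places a uniform Beta(1, 1) prior on θ for binary data. This particular comparison is our own illustration (not from the article), chosen only because both marginal likelihoods are available in closed form; a minimal Python sketch:

```python
import math

def bayes_factor_point_null(x, n, theta0=0.5):
    """Toy Bayes factor BF10 for x successes in n Bernoulli trials:
    M1 places a uniform Beta(1, 1) prior on theta, while M0 fixes
    theta = theta0. Both marginal likelihoods are closed form here."""
    # Under M1, integrating the binomial likelihood against a Beta(1, 1)
    # prior gives a marginal likelihood of 1 / (n + 1) for any x.
    m1 = 1.0 / (n + 1)
    # Under the point null M0, the marginal likelihood is just the
    # binomial probability of observing x at theta0.
    m0 = math.comb(n, x) * theta0**x * (1 - theta0) ** (n - x)
    return m1 / m0

bf = bayes_factor_point_null(x=9, n=10)
print(round(bf, 2))  # 9.31, i.e., moderate evidence for M1 per Table 1
```

Multiplying this Bayes factor by the prior odds then yields the posterior odds, exactly as in Equation (2).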

Sequential Bayes Factors
Choosing to conduct an experiment at a fixed sample size can lead to more animals being experimented upon than necessary, when fewer animals could have been used to arrive at the same conclusion. To mitigate this concern, a sequential experiment could be conducted in which optional stopping of the experiment is evaluated each time a cohort of animals is added to the experiment. Optional stopping refers to the practice of peeking at data and then, based on the results, deciding whether or not to continue an experiment. In the context of an ordinary null hypothesis significance testing analysis, optional stopping is discouraged, because it necessarily inflates Type I error rates (false positives) above nominal values. However, this is not a concern from the Bayesian point-of-view. In contrast to p-values, the interpretation of Bayes factors does not depend on stopping rules (Rouder 2014). This property allows researchers to use flexible research designs without the requirement of special, and ad-hoc, corrections (DeMets and Lan 1994) for early "peeking at the data." Thus, from a Bayesian point-of-view, we are able to carry out a sequential Bayes factor analysis (Schönbrodt et al. 2017; Schönbrodt and Wagenmakers 2018) that allows the Bayes factor to be calculated every time (or at a predetermined number of time points) that an animal is used in an experiment. Based on prespecified thresholds of Bayes factor values, say in favor of the alternative or null hypothesis, we could stop the experiment early once our prespecified level of evidence has been achieved. Of course, if neither level of evidence is achieved, then we simply carry out the experiment based on the full sample size budget. Thus, this proposal for early stopping has the potential to reduce the number of animals needed for preclinical experiments. Additionally, practitioners concerned with the effect of stopping rules or multiplicities can simulate the Type I error rate under a sequential Bayes factor analysis to see if it is sufficiently controlled.
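The optional-stopping scheme described above can be sketched as a generic loop. The function below is our own hypothetical illustration (the names bf_fn, sample_fn, and the toy "evidence" rule in the usage example are assumptions, not the authors' implementation):

```python
def run_sequential(bf_fn, sample_fn, bf_bar=30.0, n_max=10, n_min=1):
    """Sketch of a sequential Bayes factor experiment: enroll one animal
    per arm at a time, recompute the Bayes factor, and stop early once it
    crosses bf_bar (evidence for H1) or 1/bf_bar (evidence for H0)."""
    control, treated = [], []
    bf = 1.0
    for n in range(1, n_max + 1):
        control.append(sample_fn(arm=0))  # one new animal on each arm
        treated.append(sample_fn(arm=1))
        bf = bf_fn(control, treated)      # evidence after n animals/arm
        if n >= n_min and (bf >= bf_bar or bf <= 1.0 / bf_bar):
            return ("H1" if bf >= bf_bar else "H0", n, bf)
    return ("inconclusive", n_max, bf)    # full sample size budget used

# Toy usage with a made-up "evidence" rule (not a real Bayes factor):
toy_bf = lambda c, t: 2.0 ** (sum(t) - sum(c))
decision, n_used, bf = run_sequential(toy_bf, lambda arm: arm)
print(decision, n_used)  # H1 5 -- the toy evidence crosses 30 at n = 5
```

Any genuine Bayes factor computation, such as the binary-outcome one developed in Section 4.1, can be dropped in for bf_fn.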

Illustrative Examples
We illustrate the performance of the sequential Bayes factor approach with simulated data based on realistic hypothetical examples encountered in preclinical studies. In general, the sequential Bayes factor methodology is applicable to nearly any outcome type and statistical hypothesis test. Here, we focus on the case of binary outcomes since (a) it is one of the most common types of response data encountered by scientists conducting preclinical animal experiments, and (b) of all the potential outcome types we explored, binary outcomes seemed to yield the greatest benefit from the sequential Bayes factor methodology in the small sample size setting encountered in preclinical experiments. Although the method is not limited to two-sample comparisons, in Sections 4.2 and 4.3 we restrict our attention to simulated examples comparing only two groups of animals to help facilitate understanding of the sequential Bayes factor method and its operating characteristics. We leave the case of comparing multiple groups of animals to the two real preclinical datasets investigated in Section 5.

The General Framework
A common endpoint of preclinical animal experiments is a binary outcome that can be used to compare the proportion of successes, say, in the control group versus the treated group.
For sake of example, consider a two-arm randomized preclinical study where animals are assigned to either the control or treatment group. Let Y be the binary outcome of interest, and T = 0 or 1 be the arm indicator for the control and treatment groups, respectively. Furthermore, let θ0 = E(Y|T = 0) and θ1 = E(Y|T = 1) be the probability of a successful outcome in the control and treatment group, respectively. Now consider a study that aims to test the one-sided null hypothesis H0: θ1 ≤ θ0 versus the alternative hypothesis H1: θ1 > θ0. Priors are assigned to θt such that θt ∼ Beta(at, bt) for t = 0, 1. Moreover, it is assumed that θ0 and θ1 are independent a priori. Then the prior odds of H1 to H0 is

B_pr = Pr(θ1 > θ0) / Pr(θ1 ≤ θ0). (3)

At the time of the analysis, let nt be the number of animals on arm T = t and xt = Σ_{i=1}^{nt} Yi be the corresponding sum of the outcomes. Then, the posterior distribution of θt is

θt | D ∼ Beta(at + xt, bt + nt − xt), (4)

and, by prior independence, the joint posterior distribution of θ1 and θ0 is

p(θ0, θ1 | D) = p(θ0 | D) p(θ1 | D), (5)

where D = (n0, x0, n1, x1). The posterior odds of H1 to H0 is

B_po = Pr(θ1 > θ0 | D) / Pr(θ1 ≤ θ0 | D). (6)

Lastly, recalling Equations (1) and (2), we have that the Bayes factor is B_po/B_pr. Sequential analysis can proceed as before, where optional stopping can be performed once the bar for evidence, based on the Bayes factor, has been reached. It is worth noting that Equations (3) and (6) are not specific to the case of binary outcomes, but can also be used to derive, say, the Bayes factor for continuous (i.e., nonbinary) outcomes as well.
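Because Pr(θ1 > θ0) has no convenient elementary closed form for two independent Beta distributions, one simple way to evaluate the prior and posterior odds is by Monte Carlo sampling. The sketch below is our own illustration with made-up interim counts (n0, x0, n1, x1 are hypothetical); it estimates the prior odds and posterior odds and forms the Bayes factor as their ratio:

```python
import random

def odds_theta1_gt_theta0(a0, b0, a1, b1, draws=200_000, seed=7):
    """Monte Carlo estimate of Pr(theta1 > theta0) / Pr(theta1 <= theta0)
    for independent theta0 ~ Beta(a0, b0) and theta1 ~ Beta(a1, b1)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(a1, b1) > rng.betavariate(a0, b0)
               for _ in range(draws))
    p = hits / draws
    return p / (1.0 - p)

# Noninformative Beta(1, 1) priors on both arms.
a0 = b0 = a1 = b1 = 1
# Hypothetical interim data: x_t responders out of n_t animals on arm t.
n0, x0, n1, x1 = 7, 2, 7, 6

b_pr = odds_theta1_gt_theta0(a0, b0, a1, b1)          # prior odds
b_po = odds_theta1_gt_theta0(a0 + x0, b0 + n0 - x0,   # posterior odds
                             a1 + x1, b1 + n1 - x1)
print(round(b_po / b_pr, 1))  # the Bayes factor B_po / B_pr
```

With symmetric Beta(1, 1) priors the prior odds are 1 by construction, so the Bayes factor here essentially equals the posterior odds.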

A Small Example
Suppose there are two groups of animals receiving either an experimental treatment or control (i.e., no treatment). The purpose of the experiment is to judge whether or not the treatment induces a certain favorable outcome in the animals at a higher proportion as compared to the animals in the control group. The animals who respond favorably to either the treatment or control are called responders, while those who do not respond favorably are called nonresponders. For sake of argument, we also assume that the experiment requires that the animals be sacrificed at the end of the study in order to assess each animal's response status, and so reducing the number of animals that need to be used is critical from an ethical point-of-view. Since the data are simulated, we define the distributions for the outcomes of the treated and control groups to be Bernoulli distributed but with different proportion parameters θ1 and θ0, respectively. Thus, the two distributions of outcomes are different, and the hope is that we will be able to detect the treatment effect with our experiment. Now, again let T = 0 or 1 be the arm indicator for the control and treatment groups, respectively. Under the typical frequentist hypothesis testing framework, we could perform a pre-experiment power analysis (Wang and Chow 2007) to judge the number of animals, Nt on arm T = t, needed for the experiment. In this example, we will assume that the true proportion of responders for the treatment and control groups is θ1 = 0.7 and θ0 = 0.25, and that the total sample size per group will be Nt = 10. Note, the choice of 10 animals is not arbitrary; however, preclinical experiments in the pharmaceutical industry are not typically powered to test a specific hypothesis, and so sample sizes are often chosen due to budgetary or ethical constraints. Our choice of 10 animals reflects a real limit on the number of animals typically encountered in a preclinical experiment. Simulating data under these assumptions, Table 2 shows the counts of responders and nonresponders for the treatment and control groups.
From a classical frequentist null hypothesis significance testing (NHST) point-of-view, a Fisher's exact test (Fisher 1954) would be appropriate for testing this sort of data. Although underpowered, running a Fisher's exact test under this scenario results in the correct conclusion, say, at the 5% α-level, that the proportion of responders is higher in the treatment group (p-value = 0.002), but at the expense of a total of 20 animals. However, if we apply the proposed sequential Bayes factor analysis (using noninformative priors, e.g., θ0 ∼ Beta(1, 1) and θ1 ∼ Beta(1, 1)), with a stopping criterion of either 10 animals per group or a Bayes factor of 30 (log Bayes factor of 3.4), then we conclude that there is very strong evidence of a treatment effect after a total of 14 animals (Figure 1). The sequential stopping rule results in the correct conclusion, and a total of six animal lives are saved. Note that a different test could have been used for conducting the NHST; however, the point of this exercise is to facilitate the comparison of a sequential method versus a nonsequential one, rather than a comparison of the distributional assumptions made under each test.
Moreover, under these simulation settings, we can also calculate the frequentist operating characteristics of the sequential Bayes factor method to see how it manages such things as Type I and Type II errors.
Here we choose our decision-making cutoffs using a Bayes factor of 30, that is, strong evidence in favor of a treatment effect if the calculated Bayes factor is greater than 30, or strong evidence in favor of no treatment effect if the calculated Bayes factor is less than 1/30. Otherwise, we stop once a total of 10 animals per group has been reached and declare the result inconclusive. Rerunning the simulation 1000 times, under the assumption of a treatment effect, we observe that 67.3% of all simulations hit the correct M1 threshold (i.e., the true positive rate), 0.02% hit the wrong M0 threshold (i.e., the false negative rate), and the remaining 32.9% of simulations stopped at a sample size of 10 animals per group and remained inconclusive with respect to the a priori set of evidentiary thresholds. Moreover, of the 67.3% of studies that reached a conclusion, the average number of animals sacrificed per group was 5.5, thus saving the lives of about eight animals on average. If we repeat this simulation experiment under the assumption of no treatment effect, then we have that 11.6% of all simulations hit the correct M0 threshold (i.e., the true negative rate), 0.9% hit the wrong M1 threshold (i.e., the false positive rate), and the remaining 88.1% of simulations stopped at a sample size of 10 animals per group and remained inconclusive with respect to the a priori set of evidentiary thresholds. This large proportion of inconclusive studies suggests that either the evidentiary bars would need to be lowered for such a small sample size, or the maximum sample size would need to be increased if decisions based on such high evidentiary bars are desired.
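Operating characteristics of this kind can be reproduced in spirit with a short simulation. The sketch below is our simplified reimplementation, not the authors' code: it uses a crude grid approximation for Pr(θ1 > θ0 | D) and fewer replications than the article for speed, so the resulting rates will differ somewhat from the reported 67.3%/0.02%/32.9%.

```python
import random

def prob_t1_gt_t0(a0, b0, a1, b1, m=200):
    """Grid approximation of Pr(theta1 > theta0) for independent
    Beta(a0, b0) and Beta(a1, b1) variables (avoids special functions)."""
    grid = [(i + 0.5) / m for i in range(m)]
    w0 = [t ** (a0 - 1) * (1 - t) ** (b0 - 1) for t in grid]
    w1 = [t ** (a1 - 1) * (1 - t) ** (b1 - 1) for t in grid]
    s0, s1 = sum(w0), sum(w1)
    acc, p = 0.0, 0.0
    for v0, v1 in zip(w0, w1):
        acc += v0 / s0        # running Pr(theta0 <= t)
        p += (v1 / s1) * acc  # accumulate Pr(theta1 = t, theta0 <= t)
    return p

def simulate(theta0, theta1, bar=30.0, n_max=10, reps=500, seed=3):
    """Tally stopping decisions for a two-arm sequential Bayes factor
    design with Beta(1, 1) priors (so the prior odds equal 1)."""
    rng = random.Random(seed)
    tally = {"H1": 0, "H0": 0, "inconclusive": 0}
    for _ in range(reps):
        x0 = x1 = 0
        result = "inconclusive"
        for n in range(1, n_max + 1):
            x0 += rng.random() < theta0   # one new control animal
            x1 += rng.random() < theta1   # one new treated animal
            p = prob_t1_gt_t0(1 + x0, 1 + n - x0, 1 + x1, 1 + n - x1)
            # Posterior odds / prior odds (= 1); guard against saturation.
            bf = p / max(1.0 - p, 1e-12)
            if bf >= bar or bf <= 1.0 / bar:
                result = "H1" if bf >= bar else "H0"
                break
        tally[result] += 1
    return tally

print(simulate(theta0=0.25, theta1=0.7))  # decision counts out of 500 runs
```

Varying bar, n_max, and the true proportions in this harness is a quick way for practitioners to check, as suggested above, whether the Type I error rate is sufficiently controlled for their own design.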
To further understand the general characteristics of sequential Bayes factors, we run another simulation where the treatment and control group measurements again follow Bernoulli distributions with proportion parameters θ1 = 0.7 and θ0 = 0.25, respectively. Here, as before, we test the null hypothesis H0: θ1 ≤ θ0 versus the alternative hypothesis H1: θ1 > θ0, for a total of 100 simulations. For each simulation, we allow the maximum sample size per group to be Nt = 100 animals. Although in practice we would never advocate for such a high number of animals to be used, we allow the sample sizes to be large in the simulation as it helps elucidate the pros and cons of the sequential method and its operating characteristics. Likewise, we only run 100 simulations so that the sequential trajectories of the Bayes factors in the following figures are easier to understand and visualize. In this example, increasing the number of simulations had no significant impact on the reported summary of operating characteristics. Simulating data under the alternative hypothesis, H1, we obtain the distribution of sample size (when the experiment stops) shown in Figure 2, based on setting the level of evidence to be 30. As we can see from Figure 2, the mean number of animals per group needed to achieve a correct conclusion (a true positive) is around 10 with such a high bar of evidence. Fewer animals can clearly be used if the evidentiary bar is lowered; however, this may increase the false negative rate. In this example, setting the level of evidence for the Bayes factor somewhere between 5 and 10 seems optimal, since fewer animals will be used while still keeping the false negative rate around zero.
On the other hand, simulating data under the null hypothesis, H0, leads to a much different sample size distribution (Figure 3). For the null hypothesis simulation we let θ0 = 0.25 and θ1 = 0.1. Of the 100 simulations, 7 led to an inconclusive result when the evidence bar was set at 30. These inconclusive simulations show that even when there is a large amount of data collected, that is, Nt = 100, there is still a chance that the maximum number of animals could be used without reaching a definitive conclusion. However, this is to be expected since the evidence bar is very high. Clearly, from Figure 3, if the evidence bar were lowered, the average number of animals used would decrease substantially, as would the number of correct conclusions. In this simulation, using a Bayes factor of 30 leads to an average sample size of 39.9 animals to reach the correct conclusion, that is, to accept H0.
Lastly, it is worth noting that as the sample size grows without bound, the Bayes factor will converge either to zero (if H0 is true) or to infinity (if an effect is present) (Morey and Rouder 2011; Rouder et al. 2012; Schönbrodt et al. 2017). Thus, every sequential Bayes factor will converge toward (and cross) the correct boundary for a large enough sample size. This convergence property can be seen in both Figures 2 and 3, as almost all of the Bayes factor trajectories are headed toward infinity (in the case of H1) or toward zero (in the case of H0), respectively.

Operating Characteristics
To better understand the operating characteristics of sequential Bayes factors in the binary outcome setting, we conduct a simulation study comparing the one-sided null hypothesis H0: θ1 ≤ θ0 versus the alternative hypothesis H1: θ1 > θ0. Let θ0 and θ1 represent the proportion of successes in the control and treatment groups, respectively. Then, the effect size, d, can be defined as d = φ1 − φ0, where φi = 2 arcsin(√θi) for i = 0, 1 (Cohen 1988). For an effect size of this form, Cohen (1988) suggested that a d near 0.2 is a small effect, a d near 0.5 is a medium effect, and a d near 0.8 is a large effect. In our simulation, we fix θ0 = 0.25 and vary θ1 ∈ {0.35, 0.49, 0.64}, which results in d = 0.22 (small effect), d = 0.50 (medium effect), and d = 0.81 (large effect). For each of the three effect sizes, we also set the bar for evidence of the Bayes factor at either 3 (moderate evidence), 10 (strong evidence), or 30 (very strong evidence). Moreover, it can be argued that no Bayesian analysis is ever conditional on a fixed parameter value, and that the appropriate thing to do would be to assign probability distributions for the parameters that are used to generate the sampling distributions (Gelfand and Wang 2002). We decided to run the simulations both ways, and have reported the results based on distributions with fixed parameters in Tables 3 and 4, and the results from assigning probability distributions for the parameters in the supplementary materials. In our case, there was no appreciable difference between the two simulation approaches.
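The arcsine-transformed effect sizes quoted above follow directly from Cohen's formula and can be verified in a few lines:

```python
import math

def cohen_effect_size(theta1, theta0):
    """Cohen's effect size d = phi1 - phi0 for two proportions,
    where phi = 2 * arcsin(sqrt(theta)) (Cohen 1988)."""
    phi = lambda theta: 2.0 * math.asin(math.sqrt(theta))
    return phi(theta1) - phi(theta0)

for t1 in (0.35, 0.49, 0.64):
    print(round(cohen_effect_size(t1, 0.25), 2))  # 0.22, 0.5, 0.81
```

The arcsine transform is convenient here because it roughly stabilizes the variance of estimated proportions, so the same d thresholds apply across the range of θ.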
Besides running the sequential Bayes factor analysis, we also tested the hypotheses within the simulation using the standard null hypothesis significance testing (NHST) paradigm, since this would be the typical approach used in practice. Under the NHST framework, we chose to test the hypotheses using a Fisher's exact test based on the full Nt = 10 animals for each simulation. Although more prevalent in the clinical trials literature, we also viewed group sequential designs (Wassmer and Brannath 2016) as the most competitive comparator for the sequential Bayes factor method, since they allow for prespecified interim looks at the data while also maintaining prespecified Type I and Type II error rates. With that in mind, we tested the hypotheses within the simulations using group sequential designs based on Pocock boundaries (Pocock 1977) and O'Brien-Fleming boundaries (Fleming, Harrington, and O'Brien 1984). The main difference between the two types of group sequential designs is that the Pocock boundary method maintains a constant α-level across all of the interim looks, while O'Brien-Fleming boundaries start with very low α-levels at early interim looks and allow progressively larger α-levels at later interim looks. Here we assume that the number of interim looks for the group sequential designs is the same as for the sequential Bayes factor method, which in this case is after every animal. We acknowledge that in the clinical trials setting taking an interim look after every subject is probably not what occurs in practice; however, we (a) are clearly not in the clinical trial setting, and (b) do not believe that a fair comparison can be made between sequential methods if the same number of interim looks is not used. We make use of the R package gsDesign (Anderson 2021) for all group sequential design calculations, which we also recommend for readers interested in learning more about group sequential design methods.
The operating characteristics under the assumption of the alternative hypothesis are summarized in Table 3. Here, each simulation setting was run 1000 times, and the maximum number of animals in each group was capped at 10 to reflect real sample sizes commonly encountered in preclinical studies. In all simulations, we set the priors such that θ0 ∼ Beta(1, 1) and θ1 ∼ Beta(1, 1) in order to be as noninformative as possible.
The average number of animals per group, Nt, increases with a larger bar for evidence but decreases as the effect size, d, grows. Intuitively this makes sense. The lower the bar for evidence, the easier it should be to stop the study, which ultimately leads to a reduction in sample size. Likewise, small effect sizes are hard to detect, and thus require more animals on average than larger effect sizes. The rate at which simulations resulted in inconclusive results (i.e., the Bayes factor did not hit an evidence bar in favor of either the null or alternative hypothesis) also decreased as a function of effect size, but grew with an increasing evidence bar, which again makes sense. With a low evidentiary bar it is easy to make a decision one way or another; however, it becomes increasingly difficult to accumulate enough proof of an effect (i.e., sample size) when the bar for evidence is high. In the terminology of null hypothesis significance testing, it is clear that the power (i.e., the true positive rate) is much higher under the sequential Bayes factor approach than under the NHST approach at a fixed maximum sample size of 10 animals per group. This true positive and false negative rate relationship suggests that the sequential Bayes factor approach is a much more powerful test, as compared to the NHST approach, when the alternative hypothesis of a treatment effect is true under these simulation conditions. Moreover, the sequential Bayes factor approach achieves this greater power with far fewer animals, since the largest average number of animals was estimated at 6.1, as compared to 10 animals under the NHST approach. Similar results hold when comparing the sequential Bayes factor method to the group sequential design methods. Compared to sequential Bayes factors, the group sequential designs used a larger number of animals on average while also having higher error rates. However, under this scenario, group sequential designs based on the Pocock boundary did seem preferable to the NHST method, which suggests a potential benefit of sequential testing methods in general.
Similarly, under the scenario of the null hypothesis (i.e., the truth is no treatment effect), we see that the sequential Bayes factor approach performs as well as the NHST and group sequential design approaches with regard to error rates (Table 4), but with far fewer animals used. Again, with a larger bar of evidence needed, the average sample size increases for the sequential Bayes factor method, although the rates of true negatives and false positives remain extremely good. On the other hand, the group sequential design methods require almost the full available sample size and are thus very comparable to the NHST strategy, with perhaps the Pocock boundary method performing slightly better than the NHST method.

Seven-Day Animal Studies
In the early stages of preclinical development, many short-duration experiments are conducted to assess the suitability of a candidate drug for progression to larger preclinical experiments. The following historical data were taken from two real preclinical experiments (i.e., what follows is a retrospective analysis of data from two completed studies). Due to issues of confidentiality, all references that might be used to identify the candidate drugs/studies have been removed from the datasets. Both studies (referred to herein as dataset 1 and dataset 2) were conducted in rats, and in each study the rats were allocated randomly to one of four groups: control, low dose, medium dose, and high dose. Rats in the control group were given no dose of the drug, while rats in the three other groups were administered either a low, medium, or high dose of the drug at the start of the study, and such things as vitals and laboratory work were collected over a seven-day period. On the eighth day, all of the rats were euthanized and their final data collected. Both studies collected organ measurements of the rats prior to treatment and at end of study, and these organ measurements were then used to classify each rat as having either a favorable outcome (i.e., a responder) or a non-favorable outcome (i.e., a nonresponder). Response status was considered an important outcome of the studies as it served as a surrogate endpoint for efficacy. Here, an increased proportion of responders in the dose groups, as compared to the control group, was considered a positive finding. Thus, the hypotheses of interest were to test the one-sided null hypothesis H0: θi ≤ θcontrol versus the alternative hypothesis H1: θi > θcontrol, where θcontrol and θi, for i = 1, 2, 3, correspond to the proportion of responders in the control group and the low, medium, and high dose groups, respectively.
After data collection, each study in the original analysis was analyzed under the NHST framework. The counts of responders and nonresponders, by group, are shown in Tables 5 and 7.
Dataset 1 contained 8, 6, 12, and 6 animals in the control, low dose, medium dose, and high dose groups, respectively. Testing the hypothesis that the control group differs from each dose group, via a Fisher's exact test, yielded significant results, at an α = 0.05 significance level, for the comparison of the control and medium dose groups (p-value = 0.018), as well as for the control and high dose group comparison (p-value = 0.028). However, these conclusions came at the expense of a total of 32 animals. On the other hand, an analysis using the sequential Bayes factor approach would have come to the same conclusion, but with fewer animals euthanized. Here we can conduct a sequential Bayes factor analysis where we test all three dose groups against the control group and sequentially add one animal per group as the sequential test continues. Because this is a retrospective study of the data, we use the time stamps from when the animals were first recorded into the study as a means of choosing the order in which to sequentially add more animals. Lastly, we can also take advantage of the underlying monotonicity assumption that as the dose increases there should be an increase in efficacy (i.e., a dose effect). Incorporating this assumption into the sequential testing scheme allows us to stop adding animals at higher doses if a lower dose shows signs of a treatment effect. Carrying out the sequential Bayes factor analysis, with a very strong evidence bar of 30, yields the results for dataset 1 shown in Table 6.
As we can see from Table 6, the same hypothesis testing conclusion is achieved using the sequential Bayes factor approach as with the NHST, but with far fewer animals used. In fact, under the NHST approach, a total of 32 rats were euthanized, while only 16 rats would have been used under the sequential Bayes factor approach. This total savings of 16 rats is by no means a trivial amount. Here we recognize that stopping after three animals might not seem practical from a scientific point-of-view, and so, if desired, one can also define a lower limit on the minimum number of animals tested per group. This simply implies that the Bayes factors calculated before the minimum sample size per group has been achieved are ignored. Similarly, one could instead choose to test more than one animal at a time (say, m animals at a time per group) in order to reduce the chance of stopping early at a small sample size. Since this analysis is a retrospective study with no real-world implications, we continue under the assumption that three rats per group is sufficient.
One could reasonably argue that the order in which the animals are chosen can have a large impact on the decision to stop early for a particular group. To explore the impact of this ordering, we can rerun the previous sequential Bayes factor analysis after permuting the order in which the animals entered the study within each group. We run 100 such simulations and find that the average number of animals needed, across simulations, for the control, low dose, medium dose, and high dose groups is 6.95, 5.77, 4.97, and 4.31, respectively. In total, the average number of animals increases from 16 rats to 22 rats, which still represents a substantial number of animals saved. Likewise, over the 100 simulations, the minimum number of animals needed is 3 for each of the control, low dose, medium dose, and high dose groups, while the maximum is 8, 6, 7, and 6, respectively. The maximum total of 27 animals represents the worst-case scenario, which is still less than the total sample size of 32 animals.
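The permutation check described above can be sketched as follows. The per-group response counts below are hypothetical stand-ins (the article's actual response data appear in Table 5), and the Beta(1, 1) priors and function names are assumptions for the example; only a single control-versus-high-dose comparison is shown for brevity.

```python
import random
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf10(s_t, f_t, s_c, f_c, a=1.0, b=1.0):
    """Two-proportion Bayes factor under Beta(a, b) priors."""
    log_m1 = (log_beta(a + s_t, b + f_t) - log_beta(a, b)
              + log_beta(a + s_c, b + f_c) - log_beta(a, b))
    log_m0 = log_beta(a + s_t + s_c, b + f_t + f_c) - log_beta(a, b)
    return exp(log_m1 - log_m0)

def stopping_n(dose, ctrl, bar=30.0, min_n=3):
    """Sample size per group at which the sequential test stops."""
    for n in range(min_n, min(len(dose), len(ctrl)) + 1):
        s_d = sum(dose[:n])
        s_c = sum(ctrl[:n])
        bf = bf10(s_d, n - s_d, s_c, n - s_c)
        if bf >= bar or bf <= 1.0 / bar:
            break
    return n

random.seed(1)
# Hypothetical responses: 1/8 responders in control, 5/6 in high dose.
ctrl = [1] + [0] * 7
high = [1] * 5 + [0]
sizes = []
for _ in range(100):
    random.shuffle(ctrl)
    random.shuffle(high)
    sizes.append(stopping_n(high, ctrl))
print(min(sizes), sum(sizes) / len(sizes), max(sizes))
```

Summarizing the minimum, mean, and maximum stopping sample sizes over the permutations mirrors the worst-case accounting reported above.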
Rerunning the same sets of analyses for dataset 2 highlights the fact that the sequential Bayes factor analysis can still reduce the total number of animals used in an experiment even when there appears to be no treatment effect present. With a larger number of animals used, dataset 2 contained 8, 12, 12, and 12 animals in the control, low dose, medium dose, and high dose groups, respectively (Table 7). Testing the pairwise hypotheses via a Fisher's exact test yielded no significant results at an α = 0.05 significance level.
Even though there does not appear to be a clear dose effect in Table 7, we continue to assume an underlying dose effect when running our sequential Bayes factor analysis. Again, carrying out the sequential Bayes factor analysis, with a very strong evidence bar of 30, yields the results shown in Table 8 for dataset 2.
Yet again, the sequential Bayes factor approach led to a reduction in the number of rats used for the study: it required only 20 animals, as compared to the 44 animals used under the NHST approach. Moreover, unlike the NHST approach, the sequential Bayes factor approach has the highly favorable quality of being able to accept the null hypothesis when there is substantial evidence to do so. The NHST paradigm only allows one to fail to reject the null hypothesis, which is not equivalent to actually accepting the null hypothesis. This ability to accept the null hypothesis also allows the sequential Bayes factor analysis to stop early when there is evidence in favor of the null, which was the case in Table 8, and which ultimately led to the reduction in the number of rats used.
Here, the Bayes factor corresponding to the high dose group falls below the reciprocal evidence bar of 1/30 for the null hypothesis and therefore allows the study to stop early given the presumed dose effect.
To assess the impact of the order in which the animals enter the study, we again permute the order in which the animals entered the study within each group. After running 100 such simulations, we find that the average number of animals needed for the control, low dose, medium dose, and high dose groups is 7.76, 8.71, 7.80, and 7.68, respectively. In total, the average number of animals increases from 20 rats to 31.95 rats, which is still a substantial savings when compared to the 44 animals used in the NHST approach. Likewise, over the 100 simulations, the minimum number of animals needed is 3 for each of the control, low dose, medium dose, and high dose groups, while the maximum is 8, 9, 9, and 9, respectively. The maximum total of 35 animals represents the worst-case scenario, which is still less than the total sample size of 44 animals.

Discussion
In this article, we explored the utility of Bayesian methods for designing and analyzing preclinical experiments, with a focus on sequential Bayes factors. The use of sequential Bayes factors was shown to effectively reduce the average number of animals used in preclinical experiments with a binary endpoint. Although not designed with frequentist operating characteristics in mind, in many cases sequential Bayes factors can achieve desirable Type I and Type II error rates. Likewise, the sequential Bayes factor approach is capable of quantifying the evidence in favor of the null hypothesis, a feature that is lacking from the frequentist null hypothesis significance testing paradigm. However, the sequential Bayes factor approach is not without its criticisms. Although sequential Bayes factors have the potential to reduce the number of animals used in preclinical studies, the sequential approach can easily extend the length of time it takes to run the preclinical study. This article was written in the spirit of the 3Rs initiative and is therefore aimed at the ethical welfare of the animals rather than at designing the fastest preclinical study. On the other hand, it could be argued that the reduced cost of using fewer animals could offset some of the cost of running a longer preclinical study. Furthermore, although we focused on the case of small animal (e.g., rat) studies, it could be argued that the sequential Bayes factor approach is well suited for large animal studies (e.g., horse or nonhuman primate), where the cost of running those studies is significantly higher and resources are limited, so that a reduction in the number of animals is greatly needed.
Moreover, a general criticism of Bayesian statistics is that it requires the specification of prior knowledge in the form of a prior distribution for the data model. In practice, what is often done is to claim that no prior information exists and to therefore use an "uninformative" prior distribution. Given the very small sample sizes used in preclinical pilot studies, the sequential Bayes factor methodology may not provide much benefit in those settings. However, pilot studies have the unique potential to serve as a basis of information for eliciting an informed prior distribution to be used in subsequent, larger good laboratory practice (GLP) studies. In larger GLP studies this sequential Bayes factor methodology could be invaluable in reducing the number of animals used, and likewise, using information from pilot studies to inform priors could greatly improve the sequential Bayes factor analysis. An additional alternative to uninformative priors is the use of nonlocal priors (Johnson and Cook 2009; Johnson and Rossell 2010), which, for example, might offer some sample size advantages under the null hypothesis; this should be considered an area for further research.
It is worth mentioning that the strengths of Bayesian approaches, in the context of preclinical experiments, extend further still. For example, in preclinical experiments, estimating the treatment effect is often as important as reaching a yes/no conclusion about whether the treatment is effective. Bayesian approaches can provide a more straightforward description of the treatment effect (e.g., in the form of "the probability that the treatment will outperform the control"), especially when the sample size is limited. When a series of preclinical experiments is conducted to evaluate the same or similar treatments, applying Bayesian models (e.g., hierarchical models) allows intuitive information sharing across the experiments (Walley et al. 2016), and thus more accurate and efficient inference. We will explore these Bayesian applications in our future work.
Another direction for future work would be, instead of using Bayes factors, to create a sequential decision function based on estimation of the effect size with a certain probability threshold, for example Pr(π ≡ π_T − π_C > δ | data) ≥ γ. Here, π is the effect size, δ the desired effect size, and γ some prespecified probability that could be used for an optional stopping decision. Although there are still evidence thresholds to be set, namely δ and γ, it could be argued that this decision function might be simpler to use for practitioners who find it easier to think on the probability scale than on the Bayes factor scale.
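Under independent beta priors, this posterior probability is easy to approximate by Monte Carlo. The sketch below is an illustration of the idea only; the interim counts, the Beta(1, 1) priors, the number of draws, and the threshold γ = 0.95 are all assumptions made for the example.

```python
import random

def prob_effect_exceeds(s_t, n_t, s_c, n_c, delta=0.0,
                        draws=20000, seed=7):
    """Monte Carlo estimate of Pr(pi_T - pi_C > delta | data)
    under independent Beta(1, 1) priors on each proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # Posterior draws: Beta(1 + successes, 1 + failures).
        pi_t = rng.betavariate(1 + s_t, 1 + n_t - s_t)
        pi_c = rng.betavariate(1 + s_c, 1 + n_c - s_c)
        if pi_t - pi_c > delta:
            hits += 1
    return hits / draws

# Hypothetical interim data: 9/10 responders vs. 1/10.
p = prob_effect_exceeds(9, 10, 1, 10, delta=0.0)
print(p)  # close to 1 for these data
gamma = 0.95
print("stop for efficacy" if p >= gamma else "continue")
```

The same function with a negative δ and the inequality reversed could serve as a futility rule, giving a stopping scheme stated entirely on the probability scale.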
Lastly, although not the focus of this article, a practical concern when working with preclinical data is the potential to encounter confounding variables (e.g., time, analysts, and lab environment), but these issues can be handled by working with a more sophisticated model that accounts for them. For example, a regression model for proportions or count data could be used, with the confounding variables treated as covariates in the model. The biggest hurdle to overcome is then the actual Bayes factor calculation; however, there are many available R packages for doing so with more complex models, such as the BFpack package (Mulder et al. 2021).
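To make the covariate-adjustment idea concrete, the sketch below fits a logistic regression with a dose indicator and a confounding covariate by plain gradient ascent. It only illustrates how a confounder enters the model as a covariate; the simulated data, coefficient values, and learning rate are assumptions, and the Bayes factor computation itself would be carried out with tools such as the BFpack package in R rather than by this code.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Fit logistic regression by gradient ascent on the mean
    log-likelihood. X: rows of features (no intercept column).
    Returns [intercept, coefficients...]."""
    p = len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            mu = 1.0 / (1.0 + math.exp(-z))
            r = yi - mu  # residual drives the gradient
            grad[0] += r
            for j, xj in enumerate(xi):
                grad[j + 1] += r * xj
        w = [wj + lr * g / len(y) for wj, g in zip(w, grad)]
    return w

random.seed(3)
# Hypothetical data: column 1 = dose indicator, column 2 = a lab
# "batch" confounder; the true dose effect on the logit scale is 2.0.
X, y = [], []
for i in range(80):
    dose = i % 2
    batch = random.random()
    logit = -1.0 + 2.0 * dose + 0.5 * batch
    pr = 1.0 / (1.0 + math.exp(-logit))
    X.append([dose, batch])
    y.append(1 if random.random() < pr else 0)
w = fit_logistic(X, y)
print(w)  # dose coefficient w[1] should come out clearly positive
```

The fitted dose coefficient is then the confounder-adjusted treatment effect on which a Bayes factor for "effect versus no effect" would be based.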

Figure 1. The left panel shows the calculated log(Bayes factor) as a function of sample size for the data in Table 2 under a hypothetical ordering in which the data were collected. Here the dashed line corresponds to a Bayes factor of 30, suggestive of very strong evidence of a treatment effect. The right panel shows the proportion of responses for the treatment and control groups under the same hypothetical ordering.

Figure 2. The distribution of the sample size when the experiment stops under the alternative hypothesis for 100 simulations. The solid gray lines show the 100 trajectories of the individual simulated experimental runs. All of these trajectories terminate at a Bayes factor of 30.

Figure 3. The distribution of the sample size when the experiment stops under the null hypothesis for 100 simulations. The solid gray lines show the 100 trajectories of the individual simulated experimental runs. Here, the majority of the trajectories (93 of them) terminate at a Bayes factor of 1/30, while a few (7 of them) do not.

Table 2. Response status for the 20 animals assigned to either the treatment or control group based on the simulated data.

Table 3. Operating characteristics based on different effect sizes for the sequential Bayes factor method, group sequential designs, and the frequentist null hypothesis significance testing (NHST) method. Here, Avg. N_t is the average sample size per group, and Inc. stands for inconclusive.

Table 4. Operating characteristics under the null hypothesis for the sequential Bayes factor method, group sequential designs, and the frequentist null hypothesis significance testing (NHST) method.
NOTE: Here, Avg. N_t is the average sample size per group, and Inc. stands for inconclusive.

Table 5. Response status for the 32 rats assigned to either the control, low dose, medium dose, or high dose group.

Table 6. Results of running the sequential Bayes factor method for dataset 1. Here, the subscripts CL, CM, and CH denote the Bayes factors for comparing the control group to the low, medium, and high dose groups, respectively.

Table 7. Response status for the 44 rats assigned to either the control, low dose, medium dose, or high dose group.

Table 8. Results of running the sequential Bayes factor method for dataset 2. Here, the subscripts CL, CM, and CH denote the Bayes factors for comparing the control group to the low, medium, and high dose groups, respectively.