SMART Binary: New Sample Size Planning Resources for SMART Studies with Binary Outcome Measurements

Abstract Sequential Multiple-Assignment Randomized Trials (SMARTs) play an increasingly important role in psychological and behavioral health research. This experimental approach enables researchers to answer scientific questions about how to sequence and match interventions to the unique, changing needs of individuals. A variety of sample size planning resources for SMART studies have been developed, enabling researchers to plan SMARTs for addressing different types of scientific questions. However, relatively limited attention has been given to planning SMARTs with binary (dichotomous) outcomes, which often require larger sample sizes than studies with continuous outcomes. Existing resources for estimating sample size requirements for SMARTs with binary outcomes do not consider the potential to improve power by including a baseline measurement and/or multiple repeated outcome measurements. The current paper addresses this issue by providing sample size planning simulation procedures and approximate formulas for two-wave repeated measures binary outcomes (i.e., two measurement times for the outcome variable, before and after intervention delivery). The simulation results agree well with the formulas. We also discuss how to use simulations to calculate power for studies with more than two outcome measurement occasions. Results show that having at least one repeated measurement of the outcome can substantially improve power under certain conditions.


Introduction
Adaptive interventions (also known as dynamic treatment regimens) play an increasingly important role in various domains of psychology, including clinical (Véronneau et al., 2016), organizational (Eden, 2017), educational (Majeika et al., 2020), and health psychology (Nahum-Shani et al., 2015). Designed to address the unique and changing needs of individuals, an adaptive intervention is a protocol that specifies how the type, intensity (dose), or delivery modality of an intervention should be modified based on information about the individual's status or progress over time.
As an example, suppose the adaptive intervention in Figure 1 was employed to reduce drug use among youth with cannabis use disorder attending intensive outpatient programs. This example is based on research by Stanger and colleagues (2019) but was modified for illustrative purposes. In this example, youth are initially offered standard contingency management (financial incentives for documented abstinence) with technology-based working memory training (a commercially available digital training program to improve working memory for youth, involving 25 sessions with eight training tasks per session). As part of the intervention, drug use is monitored weekly via urinalysis and alcohol breathalyzer tests over 14 weeks.
Then, at week 4, youth who test positive or do not provide drug tests are classified as non-responders and are offered enhanced (i.e., higher-magnitude) incentives; otherwise, youth continue with the initial intervention.
This example intervention is "adaptive" because time-varying information about the participant's progress during the intervention (here, response status) is used to make subsequent intervention decisions (here, to decide whether to enhance the intensity of the incentives or continue with the initial intervention). Figure 1 shows how this adaptive intervention can be described with decision rules: a sequence of IF-THEN statements that specify, for each of several decision points (i.e., points in time at which intervention decisions should be made), which intervention to offer under different conditions. Note that this adaptive intervention includes a single tailoring variable: specifically, response status at week 4, measured based on drug tests. Tailoring variables are pieces of information used to decide whether and how to intervene (here, whether or not to offer enhanced incentives).
Importantly, an adaptive intervention is not a study design or an experimental design; it is an intervention design. Specifically, an adaptive intervention is a set of decision rules that can be used to guide practice long after the randomized trial is completed (Collins, 2018; Nahum-Shani & Almirall, 2019). However, in many cases, investigators have scientific questions about how to best construct an adaptive intervention; that is, how to select and adapt intervention options at each decision point to achieve effectiveness and scalability. Sequential multiple assignment randomized trials (SMARTs; Lavori & Dawson, 2000; Murphy, 2005) are increasingly employed in psychological research to empirically inform the development of adaptive interventions (for a review of studies see Ghosh et al., 2020). A SMART is an experimental design that includes multiple stages of randomization and that can be used to provide information for choosing among potential adaptive interventions. Each stage is intended to provide data for use in addressing questions about how to intervene, and under what conditions, at a particular decision point. A SMART is not itself an adaptive intervention; instead, multiple potential adaptive interventions are embedded in the SMART, and the SMART provides randomized evidence about which are likely to be more or less effective.
Consider the working memory training SMART in Figure 2, which was designed to collect data to empirically inform the development of an adaptive intervention for youth with cannabis use disorders (Stanger et al., 2019). This trial was motivated by two questions: in the context of a 14-week contingency management intervention, (a) is it better to initially offer a technology-based intervention that focuses on improving working memory or not? and (b) is it better to enhance the magnitude of incentives or not for youth who do not respond to the initial intervention? These questions concern how to best intervene at two decision points: the first is at program entry and the second is at week 4. Hence, the SMART in Figure 2 includes two stages of randomization, corresponding to these two decision points. Specifically, at program entry youth with cannabis use disorders were provided a standard contingency management intervention and were randomized either to be offered working memory training or not. Drug use was monitored weekly via urinalysis and alcohol breath tests over 14 weeks. At week 4, those who were drug positive or did not provide drug tests were classified as early non-responders and were re-randomized either to enhanced incentives or to continuation of the initial intervention, whereas those who were drug negative were classified as early responders and continued with the initial intervention option (i.e., responders were not re-randomized).
The multiple, sequential randomizations in this example SMART give rise to four "embedded" adaptive interventions (see Table 1). One of these adaptive interventions, labeled "Enhanced working memory training" and represented by cells D+E, was described earlier (Figure 1). Many SMART designs are motivated by scientific questions that concern the comparison between embedded adaptive interventions (Kilbourne et al., 2018; Patrick et al., 2020; Pfammatter et al., 2019). For example, is it better, in terms of abstinence at week 14, to employ the "Enhanced working memory training" adaptive intervention (see Table 1; also represented by cells D, E in Figure 2), or the "Enhanced incentives alone" adaptive intervention (represented by cells A, B in Figure 2)?
Both adaptive interventions offer enhanced incentives to non-responders while continuing the initial intervention for responders, but the former begins with working memory training whereas the latter does not.
The comparison between embedded adaptive interventions is often operationalized using repeated outcome measurements collected in the course of the trial (Dziak et al., 2019; Nahum-Shani et al., 2020), such as weekly abstinence over 14 weeks measured via weekly drug tests. Repeated outcome measurements in a SMART have both practical and scientific utility (Dziak et al., 2019; Nahum-Shani et al., 2020). They can be leveraged not only to make more precise comparisons of end-of-study outcomes, but also to estimate other quantities, such as the area under the curve (AUC; see Almirall et al., 2016), phase-specific slopes, and delayed effects (see Nahum-Shani et al., 2020). Dziak and colleagues (2019) and Nahum-Shani and colleagues (2020) provide guidelines for analyzing data from SMART studies in which the repeated outcome measurements are either continuous or binary. However, although sample size planning resources for SMART studies with numerical repeated outcome measurements have been proposed (e.g., by Seewald et al., 2020), such resources have yet to be developed for binary repeated outcome measurements. The current paper seeks to close this gap by developing resources for planning SMART studies with binary repeated outcome measurements.
We begin by reviewing existing sample size planning resources for SMARTs with only an end-of-study binary outcome (i.e., no repeated measurements). We then extend this approach to include a pre-randomization baseline assessment (here called the pretest for convenience) and show that this can increase power for comparing adaptive interventions in terms of an end-of-study outcome (i.e., an outcome measured after randomization, which we refer to as the posttest). In this paper, we provide simulation functions in R to estimate sample size requirements, or power for a given sample size, in a SMART with binary outcomes and two or more measurement occasions.
In the special case of two occasions, we also derive an asymptotic sample size formula which agrees well empirically with the simulation results in the reasonable scenarios considered. We separately consider how to use simulations, constructed appropriately for the SMART context, to calculate power for studies with more than two outcome measurements; an example simulation is given in Appendix 1. It was not practical to derive useful formulas for more than two measurement times. We show by simulations, however, that adding more outcome measurements beyond pretest and posttest may or may not lead to substantial gains in power, depending on the scenario. Nonetheless, these additional measurements may be useful in answering highly novel secondary research questions, such as about delayed effects (see Dziak et al., 2019;Nahum-Shani et al., 2020). It is convenient to start by reviewing the derivation of power and sample size formulas, and then noticing where approximations can reasonably be made and where simulations might be more beneficial.

Sample Size Planning for Binary SMART
Suppose that in the process of planning the working memory training SMART (Figure 2), investigators would like to calculate the sample size required for comparing the 'enhanced working memory training' and the 'enhanced incentives alone' adaptive interventions (see Table 1). Note that the working memory training SMART is considered a "prototypical" SMART (Ghosh et al., 2020; Nahum-Shani et al., 2022). A prototypical SMART includes two stages of randomization, and the second-stage randomization is restricted to individuals who did not respond to the initial intervention. That is, only non-responders (to both initial options) are re-randomized to second-stage intervention options. More specifically, the first randomization stage involves randomizing all experimental participants to first-stage intervention options. Next, response status is assessed. Individuals classified as responders are not re-randomized and typically continue with the initial intervention option. Individuals classified as non-responders are re-randomized to second-stage intervention options. Here, response status is a tailoring variable that is integrated into the SMART by design; that is, this tailoring variable is included in each of the adaptive interventions embedded in this SMART (see Table 1).

Notation and Assumptions
Let $A_1$ denote the indicator for the first-stage intervention options, coded +1 for working memory training and −1 for no working memory training; let $R$ denote the response status, coded 1 for responders and 0 for non-responders; and let $A_2$ denote the indicator for the second-stage intervention options among non-responders, coded +1 for enhanced incentives and −1 for continuing without enhanced incentives. Throughout, we use upper-case letters to represent a random variable and lower-case letters to represent a particular value of that random variable.
Each of the four adaptive interventions embedded in the working memory training SMART (Figure 2) can be characterized by a pair of numbers $(a_1, a_2)$, each +1 or −1. We write that a participant in a SMART study "follows" or "is compatible with" an adaptive intervention $(a_1, a_2)$ if this participant's first-stage intervention is $a_1$, and if furthermore this participant either responds ($R = 1$) to the first-stage intervention, or else does not respond ($R = 0$) and hence is offered second-stage intervention $a_2$. Notice that this definition includes responders who do not actually receive $a_2$, as long as they did receive $a_1$; the intuition is that they might have received $a_2$ if they had not responded. Thus, unlike in an ordinary randomized trial, the same participant may "follow" more than one of the adaptive interventions being considered; simple statistical approaches to handle this design feature are discussed further by Nahum-Shani and coauthors (2012) and Lu and coauthors (2016).
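The notion of "following" an embedded adaptive intervention can be made concrete in code. The sketch below (Python, for illustration only; the function name and coding conventions are ours, not from the paper's released R code) returns the embedded adaptive interventions consistent with one participant's observed trajectory:

```python
def compatible_ais(a1, r, a2=None):
    """Return the embedded adaptive interventions (a1, a2) consistent
    with one participant's observed SMART trajectory.

    a1: first-stage option (+1 or -1)
    r:  response status (1 = responder, 0 = non-responder)
    a2: second-stage option for non-responders (+1 or -1); None for responders
    """
    if r == 1:
        # Responders never receive a2, so they are consistent with BOTH
        # adaptive interventions that begin with their observed a1.
        return [(a1, +1), (a1, -1)]
    # Non-responders are consistent only with the single adaptive
    # intervention matching both observed options.
    return [(a1, a2)]

# A responder to working memory training (a1 = +1) follows two AIs:
print(compatible_ais(+1, 1))      # [(1, 1), (1, -1)]
# A non-responder re-randomized to enhanced incentives follows one:
print(compatible_ais(-1, 0, +1))  # [(-1, 1)]
```

This is the design feature that motivates the weighted-and-replicated analysis mentioned above: responders' data are re-used (replicated) for both compatible adaptive interventions.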
Let $i = 1, \ldots, n$ denote study participants. We assume that, for each $i$, the binary outcomes $Y_{i,t}$ are observed at time points $t = 1, \ldots, T$. Let $R_i(a_1)$ denote the potential outcome of the response status variable (see the accessible introduction in Marcus et al., 2012) for person $i$ if that person is offered an adaptive intervention with initial option $a_1$. Let $Y_{i,t}(a_1)$ or $Y_{i,t}(a_1, a_2)$ denote the potential outcome at time $t$ for person $i$ if offered an adaptive intervention defined by intervention options $(a_1, a_2)$. It is assumed that if $R_i(a_1) = 1$, then $Y_{i,t}(a_1, -1) = Y_{i,t}(a_1, +1)$; i.e., no one is affected by a part of the adaptive intervention they did not actually receive, although they may still provide information about the effect of an earlier part of the adaptive intervention. Of course, for individuals with $R_i(a_1) = 0$, $Y_{i,t}(a_1, -1)$ need not equal $Y_{i,t}(a_1, +1)$.
For the remainder of the manuscript, we assume that the investigator's goal is to compare a pair of embedded adaptive interventions $d = (a_1, a_2)$ and $d' = (a_1', a_2')$ in terms of outcome probability at end of study. We start by reviewing the $T = 1$ case (final, end-of-study outcome only), then extend to $T = 2$ (baseline outcome and final outcome), and then explore $T = 3$ via simulations, using a flexible method that also allows for higher $T$. We assume for most of the paper that the logit link is being used, and that the estimand of interest, $\beta$, is the log odds ratio of the end-of-study outcome between a pair of adaptive interventions. Throughout, we assume that the investigator wishes to choose a sample size $n$ to achieve adequate power to test the null hypothesis $\beta = 0$. Similar to Kidwell and colleagues (2019) and Seewald and colleagues (2020), we assume that the pair of embedded adaptive interventions being compared differ in at least the first-stage intervention option $a_1$. We also assume that no baseline covariates are being adjusted for. In general this is a conservative assumption, because adjusting for baseline covariates sometimes improves power and usually does not worsen it (Kidwell et al., 2018).
Recall that the asymptotic sampling variance of a parameter estimate is inversely proportional to the sample size. Across a very wide range of models, the required sample size to test a null hypothesis $\beta = 0$ with power $B$ and two-sided level $\alpha$ can be written as

$$n \ge \frac{(z_{1-\alpha/2} + z_B)^2\, V}{\beta^2}, \qquad (1)$$

where $z_B = \Phi^{-1}(B)$ is the normal quantile corresponding to the desired power, $\beta$ is the parameter of interest, and $V$ is a quantity such that, for a given sample size $n$, $\mathrm{Var}(\hat{\beta}) = V/n$ is its sampling variance; see Derivation 1 in Appendix 2. The main challenge is to find a formula for $V$ which fits the model and design of interest, and which can be calculated from intuitively interpretable quantities for which reasonable guesses could be elicited from a subject matter expert. In this paper we assume that the parameter of interest is the log odds ratio between outcomes for a comparison of two embedded adaptive interventions differing at least in the first intervention option. That is, the null hypothesis is $\beta = 0$, where

$$\beta = \mathrm{logit}\!\left(p^{(d)}\right) - \mathrm{logit}\!\left(p^{(d')}\right),$$

with $p^{(d)} = P\left(Y(d) = 1\right)$ the expected value of the binary end-of-study outcome for a participant who follows embedded adaptive intervention $d$. Other estimands of interest, such as the probability ratio, are also possible.
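This general relationship is straightforward to compute once $V$ has been specified. A minimal sketch using only the Python standard library (the function name is ours, and the inputs are whatever values a planner has elicited):

```python
import math
from statistics import NormalDist

def required_n(beta, V, power=0.80, alpha=0.05):
    """Sample size n >= (z_{1-alpha/2} + z_power)^2 * V / beta^2,
    where Var(beta_hat) is approximately V / n asymptotically."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z**2 * V / beta**2)

# Example: a log odds ratio of log(2) with a (hypothetical) V = 30:
print(required_n(math.log(2), 30))  # 491
```

As the formula implies, the recommendation scales linearly in $V$ and inversely with $\beta^2$, so halving the detectable effect roughly quadruples the required sample size.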

Even after the parameter of interest has been defined and a proposed true value for it has been elicited, more information is still needed to estimate a sample size requirement. These pieces of information could be described as nuisance parameters, although some may be of secondary research interest in their own right. Specifically, let $r_d = P\left(R(a_1) = 1\right)$ be the probability that an individual given adaptive intervention $d$ will be a responder. We assume that $r_d$ depends only on $a_1$ and not on $a_2$, because the second-stage intervention is not assigned until after response status is assessed, but it is still convenient to use the $d$ subscript, with the understanding that $r_d$ and $r_{d'}$ will be the same for adaptive interventions having the same $a_1$. In Appendix 2, we also make consistency assumptions under which $p^{(d)}$ and $r_d$ can be expressed in terms of observable quantities. $p^{(d)}$ is taken marginally over $R$, representing the overall average success probability for non-responders who receive the $a_1$ and $a_2$ intervention stages as well as for responders who receive $a_1$ only. Thus, it is not the same as the mean response only of individuals who are observed to receive both $a_1$ and $a_2$.
Let $p_{NR}^{(d)} = P\left(Y(d) = 1 \mid R(a_1) = 0\right)$ and $p_{R}^{(d)} = P\left(Y(d) = 1 \mid R(a_1) = 1\right)$ denote the end-of-study outcome probabilities for non-responders and responders, respectively, given intervention and response status. These parameters represent expected values which are conditional on $R$. They can be elicited from investigators by asking them to specify the hypothesized probabilities that $Y = 1$ in the six cells A–F in Figure 2. For adaptive intervention $d = (a_1, a_2)$, $p_{NR}^{(d)}$ corresponds to the probability that $Y = 1$ for someone who did not respond to first-stage intervention option $a_1$ and was then offered second-stage intervention option $a_2$. Also, $p_{R}^{(d)}$ corresponds to the probability that $Y = 1$ for someone who responded to $a_1$. Because responders are not affected by intervention option $a_2$, $p_{R}^{(d)}$ is equal for any two adaptive interventions having the same $a_1$, although $p_{NR}^{(d)}$ is potentially different for each adaptive intervention.
Although $p_{R}^{(d)}$ in particular depends only on the first phase (first component) of $d$, it is still convenient to apply the shorthand superscript $d$ here instead of $(a_1)$, because the adaptive intervention as a whole is assumed to be the target of inference in the analysis.
In the next section, we discuss two options for calculating sample size. The first option requires eliciting hypothetical values of the $p_{NR}^{(d)}$ and $p_{R}^{(d)}$ parameters, which are the end-of-study outcome probabilities conditional on both the intervention options and response status. These elicited values determine the quantities $V_d$ and $V_{d'}$ derived in Appendix 2, which can be interpreted as the variances of $Y(d)$ conditional on $R = 0$ or $R = 1$, plus an extra quantity that can be interpreted as the effect of response status.
These expressions lead to a sample size recommendation for a pairwise comparison of two adaptive interventions differing at least on the stage-1 option. Specifically,

$$n \ge \frac{(z_{1-\alpha/2} + z_B)^2 \left(V_d + V_{d'}\right)}{\beta^2}, \qquad (2)$$

where $\beta$ is the true log odds ratio between the adaptive interventions.
Appendix 2 describes how we derived the expression above, using standard causal assumptions, from a sandwich covariance formula of the form $\mathrm{Var}(\hat{\boldsymbol{\theta}}) = n^{-1}\mathbf{B}^{-1}\mathbf{M}\mathbf{B}^{-1}$. It is assumed that the target contrast can be written as $\beta = \mathbf{c}^\top \boldsymbol{\theta}$ for some vector $\mathbf{c}$, so that $\mathrm{Var}(\hat{\beta}) = \mathbf{c}^\top \mathrm{Var}(\hat{\boldsymbol{\theta}})\,\mathbf{c}$.
In the case of the logistic regression model, this would be true for a pairwise log odds ratio. For a pairwise comparison between adaptive interventions $d$ and $d'$, the researcher would set $c_d = +1$, $c_{d'} = -1$, and the other entries of $\mathbf{c}$ to zero. After some algebra, the sandwich covariance therefore implies Equation (2). Details are given in Appendix 2.
It appears at first that formula (2) requires specifying hypothetical values for all probabilities, both conditional on $R$ and marginal over $R$, because $V_d$ and $V_{d'}$ depend on both sets of probabilities. However, in practice only the conditional probabilities $p_{NR}^{(d)}$ and $p_{R}^{(d)}$ for each adaptive intervention and the response rate $r_d$ need to be specified, because the marginal probabilities can then be computed by taking expectations: $p^{(d)} = (1 - r_d)\, p_{NR}^{(d)} + r_d\, p_{R}^{(d)}$. Note that although $p^{(d)}$ can be computed from $p_{NR}^{(d)}$, $p_{R}^{(d)}$, and $r_d$, additional assumptions would be needed to compute $p_{NR}^{(d)}$ and $p_{R}^{(d)}$ from $p^{(d)}$ and $r_d$.
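As a worked example of this conversion (all elicited values hypothetical): with response rate $r = .45$, responder probability $.60$, and non-responder probability $.40$, the marginal probability is $.45 \times .60 + .55 \times .40 = .49$. In code:

```python
import math

def marginal_p(p_resp, p_nonresp, r):
    """Marginal success probability: p = r*p_R + (1 - r)*p_NR."""
    return r * p_resp + (1 - r) * p_nonresp

# Hypothetical elicited values for two embedded adaptive interventions:
p_d  = marginal_p(p_resp=0.60, p_nonresp=0.40, r=0.45)  # 0.49
p_dp = marginal_p(p_resp=0.50, p_nonresp=0.30, r=0.45)  # 0.39
# Marginal log odds ratio between the two adaptive interventions:
beta = math.log(p_d / (1 - p_d)) - math.log(p_dp / (1 - p_dp))
print(round(beta, 3))  # 0.407
```

The reverse direction is not available: knowing only $p^{(d)}$ and $r_d$ leaves the split between responder and non-responder probabilities undetermined, as noted above.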
Kidwell and colleagues (2018) provide an alternative formula, which (in terms of our notation) assumes that the conditional outcome variances are bounded above by the corresponding marginal variances; that is, $\mathrm{Var}\left(Y(d) \mid R = 0\right) \le \sigma_d^2$ and $\mathrm{Var}\left(Y(d) \mid R = 1\right) \le \sigma_d^2$, for each of $d$ and $d'$, where $\sigma_d^2 = p^{(d)}\left(1 - p^{(d)}\right)$ is the marginal variance. Under these variance assumptions, and under the further simplifying assumption that the proportion of responders is equal in the two adaptive interventions being compared ($r_d = r_{d'} = r$), the approximate required sample size simplifies to

$$n \ge 2\,(2 - r)\,(z_{1-\alpha/2} + z_B)^2 \left(\frac{1}{\sigma_d^2} + \frac{1}{\sigma_{d'}^2}\right) \frac{1}{\beta^2}. \qquad (3)$$

The sample size formula above is equivalent to a sample size formula for a two-arm RCT with binary outcome, multiplied by the quantity $2 - r$, which Kidwell and colleagues (2018) interpreted as a design effect. In practice, this formula requires eliciting hypothetical values for the marginal outcome probabilities $p^{(d)}$ and $p^{(d')}$ for each adaptive intervention of interest, and the response rate $r$. Based on these parameters, one can calculate the variance $\sigma_d^2 = p^{(d)}\left(1 - p^{(d)}\right)$ for each adaptive intervention and the log odds ratio $\beta = \log\!\left[\dfrac{p^{(d)}/\left(1 - p^{(d)}\right)}{p^{(d')}/\left(1 - p^{(d')}\right)}\right]$.
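This marginal-probabilities-based calculation can be sketched directly in code (Python; the input probabilities and response rate are hypothetical illustration values). Setting the response rate to zero also illustrates the conservative bound discussed below:

```python
import math
from statistics import NormalDist

def n_mpb(p_d, p_dprime, r, power=0.80, alpha=0.05):
    """Marginal-probabilities-based (MPB) sample size: a two-arm
    binary-outcome formula scaled by the design effect (2 - r)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    beta = math.log(p_d / (1 - p_d)) - math.log(p_dprime / (1 - p_dprime))
    inv_var = 1 / (p_d * (1 - p_d)) + 1 / (p_dprime * (1 - p_dprime))
    return math.ceil(2 * (2 - r) * z**2 * inv_var / beta**2)

print(n_mpb(0.49, 0.39, r=0.45))  # hypothetical inputs
print(n_mpb(0.49, 0.39, r=0.0))   # r = 0: conservative upper bound
```

Note how the design effect $2 - r$ shrinks toward 1 as the response rate grows: responders contribute information to two embedded adaptive interventions at once, so higher response rates reduce the required sample size.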
Both formula (2) and formula (3) require that the proportion of responders be elicited.
Kidwell and colleagues (2019) note that setting $r = 0$ provides a conservative upper bound on the required sample size, but the resulting approximation is very pessimistic and may lead to an infeasibly high recommendation.
Both formula (2), which we describe here as a conditional-probabilities-based (CPB) formula, and formula (3), which we describe as a marginal-probabilities-based (MPB) formula, have advantages and disadvantages. The marginal formula requires additional assumptions but requires fewer parameters to be elicited. Furthermore, the marginal probabilities relate directly to the marginal log odds ratio of interest for comparing embedded adaptive interventions. In other words, since the hypothesis concerns the comparison of two embedded adaptive interventions, it may be more straightforward for many investigators to specify parameters that describe the characteristics of these adaptive interventions, rather than their corresponding cells. However, other researchers may find the conditional probabilities for each cell comprising the adaptive interventions of interest more intuitive to elicit, as they directly correspond to the randomization structure of the SMART being planned. In the following section, we extend both formulas to settings with a baseline measurement of the outcome.

Sample Size Requirements for Pretest and Posttest: Two Measurement Times
Power in experimental studies can often be improved by considering a baseline (pre-randomization) assessment as well as the end-of-study outcome (see Benkeser et al., 2021; Vickers & Altman, 2001). These are sometimes described as a pretest and a posttest; here, we refer to them as $Y_1$ and $Y_2$. The pretest is assumed to be measured prior to the initial randomization and is therefore causally unrelated to the randomly assigned interventions. The pretest could either be included as a covariate, or else modeled as a repeated measure in a multilevel model; we assume the latter approach in the sample size derivations. Below we provide formulas that are similar to (2) and (3) but take advantage of the additional information from the baseline measurement.
Let $\mu_1 = E(Y_1)$ be the expected value for the baseline measurement of the outcome at the beginning of the study. Here, neither $Y_1$ nor $\mu_1$ is indexed by adaptive intervention $d$, because $Y_1$ is measured prior to randomization. Let $\mu_2^{(d)} = E\left(Y_2(d)\right)$ be the expected value for the end-of-study measurement of the outcome for an individual given adaptive intervention $d$. Then by Derivation 4 in Appendix 2, the approximate required sample size can be written as

$$n \ge \frac{(z_{1-\alpha/2} + z_B)^2\, \mathbf{c}^\top \mathbf{B}^{-1} \mathbf{M} \mathbf{B}^{-1} \mathbf{c}}{\beta^2}, \qquad (4)$$

where the formulas for $\mathbf{B}$, $\mathbf{c}$, and $\mathbf{M}$ are derived in Appendix 2. The derivation comes from a sandwich covariance formula, as in the posttest-only case, and follows the general ideas of Lu and colleagues (2016) and Seewald and colleagues (2020). In this derivation, $\mathbf{S}_d$ is a $2 \times 2$ diagonal matrix with entries $\mathrm{Var}(Y_1)$ and $\mathrm{Var}\left(Y_2^{(d)}\right)$, $\boldsymbol{\rho}_d$ is the $2 \times 2$ within-person correlation matrix between $Y_1$ and $Y_2^{(d)}$, and $\mathbf{V}_d = \mathbf{S}_d^{1/2} \boldsymbol{\rho}_d \mathbf{S}_d^{1/2}$. Under some assumptions (see Appendix 2), $\mathbf{M}$ can be approximated by a sum over the compared adaptive interventions in which non-responder contributions are weighted by $4(1 - r_d)$ and responder contributions by $2$, reflecting the inverse probabilities of the treatment sequences under equal randomization.
A formula like (4) can be implemented in code but provides little intuitive understanding.
However, under the further assumption that the variance is independent of response status given the adaptive intervention received, equation (4) simplifies to the following:

$$n \ge 2\,(2 - r)\left(1 - \rho^2\right)(z_{1-\alpha/2} + z_B)^2 \left(\frac{1}{\sigma_d^2} + \frac{1}{\sigma_{d'}^2}\right) \frac{1}{\beta^2}, \qquad (5)$$

where $\rho$ is the pretest–posttest correlation. The key to the simplifications used in deriving (5) is that $\mathbf{B}$ and $\mathbf{M}$ can each be expressed as an "arrowhead" matrix, i.e., a matrix which is all zeroes except for the main diagonal, the first row, and the first column, and which therefore can be inverted by simple algebra, using the formula of Salkuyeh and Beik (2018). Details are given in Appendix 2.
Although in practice it is very unlikely that the variance will be independent of response status, we use this approximation to generate a formula that is more interpretable and accessible. The performance of this formula is evaluated later in the simulation studies, where the variance and response status are dependent. Expression (4) is again a CPB formula, and expression (5) is an MPB formula. If the pretest provides no information about the posttest, so that $\rho = 0$, then expression (5) simplifies to expression (3), the sample size formula of Kidwell and colleagues (2019). In other words, using an uninformative pretest ($\rho = 0$) is approximately the same as ignoring the pretest.
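To illustrate the role of the pretest, the sketch below scales expression (3) by a pretest–posttest efficiency factor of $(1 - \rho^2)$. This is an approximation consistent with the property that the two-wave MPB recommendation reduces to expression (3) when $\rho = 0$; the exact expression should be taken from Appendix 2. All input values are hypothetical:

```python
import math
from statistics import NormalDist

def n_mpb_pretest(p_d, p_dprime, r, rho, power=0.80, alpha=0.05):
    """Two-wave MPB sample size sketch: the posttest-only MPB formula
    scaled by the pretest-posttest efficiency factor (1 - rho^2).
    An approximation for illustration; see Appendix 2 for the exact form."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    beta = math.log(p_d / (1 - p_d)) - math.log(p_dprime / (1 - p_dprime))
    inv_var = 1 / (p_d * (1 - p_d)) + 1 / (p_dprime * (1 - p_dprime))
    n_posttest_only = 2 * (2 - r) * z**2 * inv_var / beta**2
    return math.ceil((1 - rho**2) * n_posttest_only)

# rho = 0: an uninformative pretest gives the posttest-only recommendation.
print(n_mpb_pretest(0.49, 0.39, r=0.45, rho=0.0))
# rho = .5: a moderately informative pretest cuts the requirement by 25%.
print(n_mpb_pretest(0.49, 0.39, r=0.45, rho=0.5))
```

The quadratic dependence on $\rho$ means that weakly correlated pretests buy little: $\rho = .3$ saves only about 9% of the sample, whereas $\rho = .5$ saves 25%.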

Beyond Pretest and Posttest: More than Two Measurement Times
For a SMART with more than two measurement times (i.e., more than pretest and posttest), an easily interpretable formula is not possible without making assumptions that would be unrealistic in the binary case. Seewald and colleagues (2020) provide both a general and a simplified sample size formula for comparing a numerical, end-of-study outcome in longitudinal SMARTs. However, the simplified formula relies on the assumption of homoskedasticity across embedded adaptive interventions and measurement occasions, and exchangeable correlation between measurement occasions. In a binary setting, these simplifying assumptions are less realistic because two binary random variables cannot have equal variance unless they also have either equal (e.g., .20 and .20) or exactly opposite means (e.g., .20 and .80). Determining sample size requirements via simulations would be a feasible alternative in this setting (see Appendix 1).
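This constraint follows from the fact that a Bernoulli variance $p(1-p)$ is symmetric about $p = .5$, so two binary variables can share a variance only when their means are equal or complementary. A quick numerical check (Python, for illustration):

```python
from math import isclose

def bern_var(p):
    """Variance of a Bernoulli(p) random variable."""
    return p * (1 - p)

# Equal variance holds for equal or complementary means...
assert isclose(bern_var(0.20), bern_var(0.20))
assert isclose(bern_var(0.20), bern_var(0.80))
# ...but fails otherwise, so assuming homoskedasticity across occasions
# and adaptive interventions is restrictive for binary outcomes.
assert not isclose(bern_var(0.20), bern_var(0.35))
print(bern_var(0.20), bern_var(0.35))
```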
However, if the investigator prefers not to use simulations, then we propose using the two-measurement-occasion formulas as approximations for planning SMARTs with more than two measurement occasions. Simulations shown in Appendix 1 suggest that the resulting sample size estimates would be reasonable. Although taking more measurement occasions into account might provide somewhat higher predicted power, this would depend on the assumed and true correlation structure and the design assumptions of the SMART. The power could also depend on assumptions concerning the shape of change trajectories within the first and second stages of the design (e.g., linear, quadratic, etc.), which might become difficult to elicit. Therefore, although more sophisticated power formulas might be developed, they might offer diminishing returns relative to a simpler formula or a simulation. In the next section we discuss the use of simulations to calculate power for settings with more than two measurement times and to investigate the properties of the sample size formulas described earlier.

Simulation Experiments
In order to test whether the proposed sample size formulas work well, it is necessary to simulate data from SMART studies with repeated binary outcome measurements. Furthermore, simulation code can be relatively easily extended to situations in which the simplifying assumptions of the formulas do not apply. Below we discuss two simulation experiments. The first is designed to assess performance of the power formulas. This is done by comparing, for fixed sample sizes, the power estimated based on the sample size formulas to the power calculated from simulations. The second is designed to assess the performance of the sample size formulas as well as to investigate the extent of reduction in required sample size obtainable by taking pretest into account. This is done by comparing, for a fixed target power, estimates of the required sample sizes given by the various formulas to simulated sample size requirements.

Simulation Experiment 1: Performance of Power Formulas
A factorial simulation experiment was performed based on a SMART design with two measurement times. This experiment investigates the ability of the sample size formulas to choose a sample size which is large enough to achieve 0.80 power under specified assumptions.
All simulation code is available online at https://github.com/d3labisr/Binary_SMART_Power_Simulations or via https://d3lab.isr.umich.edu/software/. The experiment is designed to answer the following questions: First, do the proposed sample size formulas accurately predict power compared to the power estimated via simulations? Second, how much does the estimated power change by using the CPB approach in expression (2), versus the MPB approach in expression (3)? Third, to what extent does using a pretest result in efficiency gains (i.e., higher power for a given sample size) when comparing adaptive interventions based on repeated binary outcome measurements? Fourth, if the pretest is to be used in the model, is there a relative advantage or disadvantage to including the pretest as a covariate (and only the posttest as an outcome), versus modeling both the pretest and the posttest in a repeated measures model? We used simulations to answer these questions under a scenario with hypothesized true parameters described below.

Methods
Data were simulated to mimic a prototypical SMART study, similar to the working memory training SMART in Figure 2. Randomization probabilities were set to be equal (50% each) for first-stage intervention options for each simulated participant, as well as for second-stage intervention options for each simulated non-responder. We assume there are two outcome measurement occasions: a baseline measurement before randomization (pretest) and an end-of-study outcome measurement (posttest). 10,000 datasets were simulated and analyzed per scenario (combination of effect size and sample size).
We assumed that the contrast of interest is the end-of-study log odds of drug use between the "enhanced working memory" (+1, +1) and the "enhanced incentives alone" (−1, +1) adaptive interventions (Table 1). Also, the data were simulated under the assumption of no attrition (study dropout). In practice a researcher should inflate the final estimate of required sample size to protect against a reasonable estimate of attrition probability.
We compared the power predictions obtained by using the different formulas available for $V$ with simulated power estimates. Specifically, we considered power calculated from expression (1) using the CPB and MPB estimates of $V$, which correspond to the sample size recommendations in expressions (4) and (5), respectively. We generated samples of either $n = 300$ or $n = 500$, in which the true correlation structure was either independent ($\rho = 0$) or correlated, with correlation $\rho = .3$ or $\rho = .5$. The datasets were simulated using the approach described below.
Steps in Simulating Datasets. We first generated a random dummy variable Y₀ for baseline abstinence with probability Pr(Y₀ = 1) = .40. Next, A₁ was randomly assigned to +1 or −1 with equal probability. Then, the response indicator R was generated as a random binary variable (0 or 1) such that the log odds of R = 1 was set to −.62 + βA₁ + .5Y₀. The intercept −.62 was chosen to give an overall response rate of about 56% in the A₁ = +1 arm and 33% in the A₁ = −1 arm, or about 45% overall. Thus, we assume that roughly half of participants are responders, with an advantage to those receiving working memory training. The correlation between Y₀ and R was about .23.
Finally, the end-of-study outcome Y₁ was generated. For convenience, A₂ and A₁ × A₂ were set to have zero effect, and the effect of A₁ was set so that the marginal odds ratio between a pair of adaptive interventions differing on A₁ would be approximately 1.5, 2, or 3, depending on the condition. These values fall within the ranges which would be considered small, medium, and large, respectively, by Olivier, May and Bell (2017). The conditional expected value for the final outcome is given by the model
logit Pr(Y₁ = 1 | Y₀, A₁, R, A₂) = η₀ + η₁Y₀ + η₂A₁ + η₃R + η₄A₂ + η₅A₁A₂.
The coefficients for A₂ and the A₁ × A₂ interaction were set to zero for simplicity, and the other values were determined by trial and error to give the desired marginal quantities; they are provided in Table 2.
Computation of Marginal Correlation for Formulas. Although the two-wave power formulas take the marginal pretest-posttest correlation as an input, this parameter was not directly specified in the simulation code, because a simulation requires fully conditional models to be specified. Therefore, for purposes of calculating power via the formula for a given condition, we used the average marginal correlation estimate obtained by applying the weighted and replicated analysis (marginal over R) to the simulated datasets generated for that condition.
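As a concrete illustration, the data-generating steps above can be sketched in a few lines of Python. The values −.62, .5, and the 40% baseline rate come from the text; the remaining coefficients (the first-stage effect on response and the posttest model coefficients) are illustrative placeholders, not the values used in the paper.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_smart(n, seed=0):
    """Simulate one two-wave SMART dataset following the steps in the text.

    Y0 ~ Bernoulli(.40) is the pretest; A1 is the first-stage assignment
    (+1/-1, equal probability); R is response status with
    logit Pr(R = 1) = -.62 + b*A1 + .5*Y0, where b = 0.7 is a placeholder;
    A2 is randomized (+1/-1) only for non-responders. The Y1 coefficients
    below are also placeholders, with the A2 and A1*A2 effects set to zero
    as in the simulation design.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y0 = int(rng.random() < 0.40)
        a1 = rng.choice([+1, -1])
        r = int(rng.random() < expit(-0.62 + 0.7 * a1 + 0.5 * y0))
        a2 = 0 if r == 1 else rng.choice([+1, -1])  # second randomization only for non-responders
        y1 = int(rng.random() < expit(-0.3 + 1.0 * y0 + 0.4 * a1 + 0.5 * r))
        rows.append((y0, a1, r, a2, y1))
    return rows

rows = simulate_smart(500)
response_rate = sum(r for (_, _, r, _, _) in rows) / len(rows)
```

Each simulated dataset would then be analyzed with the weighted and replicated estimator, and power estimated as the proportion of replications in which the null hypothesis is rejected.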

Results
With respect to the first motivating question (do the proposed sample size formulas accurately predict power compared to the power estimated via simulations?), the results of the simulations (Table 3) are very encouraging. At least under the conditions simulated, the proposed sample size formulas do predict power accurately compared to the power estimated via simulations. As would be expected, power is higher when the effect size is larger and/or the sample size is larger.

The second motivating question concerns the extent to which the estimated power changes by using the CPB approach in Expression (2) versus the MPB approach in Expression (3). The results indicate that the MPB and the CPB formulas are equivalent in the one-wave (posttest-only) case. However, these formulas differ slightly from each other in the pretest-posttest scenarios, with the MPB approach being slightly conservative and the CPB approach being sometimes slightly overly optimistic.
The third question motivating this experiment concerns the extent to which using a pretest results in efficiency gains when comparing adaptive interventions. The results indicate that power is often higher with a pretest-posttest model than with a posttest-only model, although this depends on the within-subject correlation. There is no difference in power between these approaches when the pretest-posttest correlation is negligible (0.06), and only a very small difference when the correlation is small (0.3), but there is a large difference when the correlation is sizable (0.6). For example, with an odds ratio of 2 and a sample size of 200, the one-wave approach has unacceptably low power of 65%, while the two-wave approach has much better power of 85%.
Finally, the fourth motivating question concerns the relative advantage or disadvantage of including the pretest as a covariate versus as a measurement occasion in a repeated-measurement model. For purposes of calculating power for comparing adaptive interventions, the working independence analysis was found to be exactly equivalent to a posttest-only analysis, and the covariate-adjusted analysis was essentially equivalent to the exchangeable analysis. Therefore, we focus on comparing results for the non-independent repeated-measures analysis versus the posttest-only analysis. Because the simulated power with a pretest covariate was approximately the same as the simulated power with repeated measures, they are represented by the same column under the Two-Waves heading. This near equivalence may result from the intervention options being randomized in the current settings; had there been confounding, the two models might have handled it differently, leading to differences in power and accuracy.

Simulation Experiment 2: Performance of Sample Size Formulas
This simulation was intended to study the ability of the sample size formulas to choose a sample size which is large enough to achieve a specified power (set here to .80) under specified assumptions, but not so large as to undermine the feasibility of the study. The questions were analogous to the first three questions of the previous experiment. First, do the proposed sample size formulas give similar sample size predictions to those obtained from simulations? Second, how much does the estimated sample size change by using the CPB sample size formulas (2) and (4) versus the MPB sample size formulas (3) and (5)? Third, to what extent can the required sample size be reduced, under given assumptions, by taking the pretest-posttest correlation into account?

Method
Ordinarily, Monte Carlo simulations do not directly provide a needed sample size, but only an estimated power for a given sample size. However, by simulating various points of a power curve and interpolating, it is practical to use simulations to approximate the required sample size. We take the inverse normal (probit) transform of power, Φ⁻¹(power), to be approximately linearly related to N over the range of interest, based on the form of Equation (1) and the fact that the sampling variance is inversely proportional to N. That is, we assume Φ⁻¹(power) ≈ a + bN for some a and b. Therefore, using the same scenarios as in the previous experiment, we performed simulations for several sample sizes in the range of interest and fit a probit model relating the estimated power to sample size. The needed sample size is then roughly estimated as N = (Φ⁻¹(.80) − a)/b. We simulated and analyzed 2,000 datasets per effect size scenario at each point of a grid of 10 potential sample sizes.
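The interpolation step can be sketched as follows. The grid of simulated power values below is hypothetical, and the fit is an ordinary least-squares regression on the probit scale, as described above.

```python
from statistics import NormalDist

nd = NormalDist()

# Hypothetical simulated power estimates over a grid of sample sizes.
grid = [(100, 0.42), (200, 0.65), (300, 0.80), (400, 0.89), (500, 0.94)]

# Least-squares fit of probit(power) = a + b*N over the range of interest.
xs = [n for n, _ in grid]
ys = [nd.inv_cdf(p) for _, p in grid]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

# Solve a + b*N = probit(.80) for the approximate required sample size.
n_required = (nd.inv_cdf(0.80) - a) / b
```

In a real application, each power estimate on the grid would itself come from a batch of simulated and analyzed datasets, and the fitted line would only be trusted within the simulated range of N.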

Results
The first question motivating this simulation experiment focused on whether the proposed sample size formulas provide similar sample size predictions to those obtained from simulations. Consistent with the results of the first simulation experiment, the results of the current experiment indicate that they do.
The second motivating question concerns the extent to which the estimated sample size changes by using the CPB versus the MPB sample size formulas. As in the first simulation experiment, we found the MPB and CPB approaches to be practically equivalent in the posttest-only case. In the pretest-posttest case, the MPB approach was found to be slightly conservative and the CPB approach slightly anticonservative, probably making the MPB approach the safer choice.
Finally, the third question motivating this experiment concerned the extent to which the required sample size can be reduced by taking the pretest-posttest correlation into account. The results indicate that it can: as would be expected from the previous simulation experiment, the required sample size for adequate power can be reduced dramatically (possibly by hundreds of participants) by employing a pretest-posttest approach instead of a posttest-only approach.

Discussion
The current manuscript addresses an important gap in planning resources for SMART studies by providing new sample size simulation code, as well as approximate asymptotic sample size formulas, for SMARTs with binary outcomes.

Systematic investigation of the extent of efficiency gained per added measurement occasion is needed to better assess the tradeoff between adding measurement occasions versus adding participants to the study in terms of power for a given hypothesis.
For the pretest-posttest case, we provided both simple asymptotic formulas and simulation code. Simulations have the advantage of being more easily adapted to different designs or situations, and do not require as many simplifying approximations as the asymptotic formulas do, although of course both require assumptions about parameter values.

Limitations and Directions for Future Research
Careful consideration of assumptions, preferably with sensitivity analyses, is still important for sample size planning. It would not be reasonable to argue that planning sample size to achieve exactly .80 power (and no more) is the best approach in general. More conservative sample size approaches may provide more capacity to handle unexpected situations such as higher than anticipated attrition. However, in some cases, an unreasonably high estimated sample size requirement would make it difficult to justify the conduct of a study given realistic funding or participant recruitment constraints. Hence, calculating predicted power with as much precision as possible, for a given set of assumptions, is desirable.
In this paper we have used the ordinary Pearson correlation coefficient, even for describing the relationship between binary variables. This is valid and convenient, and it follows the way correlation is operationalized in, for instance, generalized estimating equations (Liang & Zeger, 1986). However, alternative measures are available, such as the tetrachoric correlation (Bonett & Price, 2005), which could optionally be explored. One limitation which might be encountered when choosing parameters for simulations is that very high correlations can lead to complete separation (parameter unidentifiability arising when frequentist estimates of certain conditional probabilities are exactly zero or one). This is a limitation of binary data, but it can be avoided in simulations by not specifying very high correlations, and in practice by either simplifying the analysis model or using priors.
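To see what complete separation looks like in practice, consider the toy fit below (our own illustration, using plain gradient ascent on the logistic log-likelihood rather than the paper's weighted estimator): with perfectly separated data the maximum likelihood estimate does not exist, so the slope estimate simply keeps growing the longer the optimizer runs.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_slope(x, y, steps, lr=0.5):
    """Gradient ascent on the log-likelihood of the model logit Pr(y=1) = b*x."""
    b = 0.0
    for _ in range(steps):
        grad = sum((yi - expit(b * xi)) * xi for xi, yi in zip(x, y))
        b += lr * grad
    return b

# Perfectly separated binary data: y = 1 exactly when x = +1.
x = [-1, -1, -1, +1, +1, +1]
y = [0, 0, 0, 1, 1, 1]

b_short = fit_slope(x, y, steps=100)
b_long = fit_slope(x, y, steps=10_000)
# b_long exceeds b_short: the estimate diverges instead of converging.
```

Standard software typically reports a very large coefficient with an enormous standard error in this situation, which is the practical symptom to watch for.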
This paper has assumed that sample size calculations would be motivated by a primary aim or hypothesis.

Note (Table 2). The conditional regression parameters refer to Expression (6). For simplicity, one coefficient is set to 1 and the coefficients of A₂ and A₁ × A₂ are set to 0. This leads to an average percentage of responders across arms of 45%, with responder proportions of 56.5% and 33.5% for the +1 and −1 levels of A₁. Because of a small remaining indirect effect via R (i.e., correlations between pretest, response variable, and posttest), the lowest level of correlation considered here is still not exactly zero (about 0.06), despite specifying zero parameters for the relevant conditional effects.

Note (Table 3). "MPB" = marginal-probabilities-based (Expression 3); "CPB" = conditional-probabilities-based (Expression 5). In all conditions, the proportion of responders was set to approximately 0.565 given A₁ = +1 and 0.336 given A₁ = −1; this difference is the reason why the pre-post correlation Cor(Y₀, Y₁) could not be set to exactly zero. The odds ratio shown is for the pairwise comparison of the (+, −) to the (−, −) adaptive interventions, which is equivalent here to the effect of A₁. For simplicity of interpretation, A₂ and the A₁ × A₂ interaction were set to have no effect on Y₁. The simulated power shown for the two-wave model uses the covariate adjustment approach (pretest as covariate); the repeated measures approach had approximately the same power, or in some conditions about 0.005% higher.

Note (Table 4). "MPB" = marginal-probabilities-based (Expression 3); "CPB" = conditional-probabilities-based (Expression 5). The data-generating model settings are the same as those used for Table 3.

Figure 1
An example adaptive intervention to reduce drug use among youth with cannabis use disorder attending intensive outpatient programs

Figure 2
Working Memory Training (WMT) SMART Study
Notes: WMT denotes Working Memory Training. Circled R denotes randomization. All participants additionally received contingency management.

Appendix 1 Simulation Illustrating Three-Wave Analysis
In this appendix we assume that there are two different follow-up times per participant, Y₁ and Y₂, instead of a single end-of-study (posttest) outcome, so that there are now three measurement occasions per participant. For simplicity we assume here that both follow-up times occur after the second treatment phase. Therefore, the variables for a given individual in the study would be observed in the following order: pretest Y₀, initial randomization A₁, tailoring variable R, second randomization A₂ for nonresponders, first follow-up Y₁, second follow-up Y₂.
This results in a somewhat different and simpler setting compared to that of Seewald (2020), who considered (in the linear modeling case) a mid-study outcome taken at about the same time as R and preceding the second randomization.
In the current setting, the second outcome Y₂ can potentially depend on any of Y₀, A₁, R, A₂, and Y₁, making a very wide range of different DAGs and simulation scenarios possible. For simplicity, we chose scenarios in which Y₂ depends on Y₁ but is conditionally independent of most or all of the other preceding variables given Y₁.
Specifically, we continue to assume the same parameters used in the "high" correlation setting from Simulations 1 and 2 when simulating Y₀ and Y₁. We then further assume one of two scenarios for the relationship of Y₂ to the preceding variables. In the "no delayed effect" scenario, Y₂ depends on the preceding variables only through Y₁, and is conditionally independent of all other variables given Y₁, as in a Markov chain. Thus, the effect of A₁ on Y₂ is mediated entirely by Y₁. In the "delayed effect" scenario, A₁ has an effect on Y₂ which is only partly mediated by Y₁.
The conditional effect of Y₁ on Y₂ was set to be weaker in the delayed effect scenario, so that the total effect of A₁ on Y₂ (i.e., the direct effect conditional on Y₁ plus the indirect effect mediated through Y₁) would be comparable between scenarios. In particular, the resulting odds ratio for the contrast of interest, still assumed to be (+1, −1) versus (−1, −1), was 3.0 for Y₁ and 2.0 for Y₂.
We assume that the estimand of interest is the comparison of embedded adaptive interventions on the final outcome, where the final outcome is interpreted as either the early follow-up Y₁ or the later follow-up Y₂, in order to compare the simulated power for each. We fit one-wave models to predict Y₁ alone or Y₂ alone. We also fit two-wave models to predict Y₁ or Y₂ separately, adjusting for Y₀ and assuming an exchangeable correlation structure (equivalent to AR-1 for the two-wave model). These models consider only two of the available measurement occasions. Finally, we fit three-wave models to predict Y₂ adjusting for Y₀ and Y₁, by applying methods similar to Lu and colleagues (2016) and Dziak and colleagues (2020) and using working assumptions of either independence, AR-1, or exchangeable correlation structure. In the three-wave models, we assumed a separate piecewise linear trajectory from Y₀ to Y₁ and from Y₁ to Y₂ for each embedded adaptive intervention.
Each scenario was replicated in 10,000 datasets for each of the sample sizes N = 300 and N = 500. Simulated power for each model in each scenario is shown in Table 5. Power for the models using Y₁ as the final outcome was very high, and much higher than for those using Y₂ as the final outcome. However, this is not surprising, because the effect size for Y₁ was also higher. More interesting is the power comparison among the five models for Y₂ (the rightmost five columns).
In the no-delayed-effect scenario, power was clearly higher for methods which used information from Y₀ to predict Y₂ (i.e., "Y₂ adjusted for Y₀," working AR-1, and working exchangeable) than for those which ignored Y₀ ("Y₂ only" and working independence). However, there was very little additional benefit in using Y₁, possibly because Y₁ is on the causal chain between Y₀ and A₁ on one side and Y₂ on the other. Also, as expected, power was higher for a working correlation that approximately fit the data-generating model (AR-1) than for one which did not (exchangeable). Although neither structure corresponded exactly to the data-generating model, the exchangeable working structure made the unhelpful assumption that Corr(Y₀, Y₂) = Corr(Y₁, Y₂). In contrast, in the delayed effect scenario, it made little difference which model was used. This was presumably because, in this scenario, Y₀ and Y₁ had relatively little value for predicting Y₂ once A₁ was accounted for.
There are many other possible data-generating models that could be explored in a three-wave simulation. For instance, we did not explore power for detecting an effect of A₂, or whether power might differ depending on the order and timing of the measurements. However, it appears that at least in some circumstances, a two-wave (Y₀ → Y₂) model provides about as much benefit as a three-wave (Y₀ → Y₁ → Y₂) model with less complexity, assuming that contrasts in expected values for Y₂ are of primary interest. Of course, for more complicated estimands (e.g., for studying whether the effect is delayed), more than two waves would be needed.

Note (Table 5). In all of these conditions, the average estimated odds ratio for the effect of A₁ was set to 3.0 for Y₁ and 2.0 for Y₂, in terms of the pairwise comparison of the (+, −) to the (−, −) adaptive interventions, which is equivalent here to the effect of A₁. For simplicity of interpretation, A₂ and the A₁ × A₂ interaction were set to have no effect. The conditions differ in the relationship of the simulated late follow-up Y₂ to the baseline assessment Y₀ and initial treatment A₁. The simulated decay in effect size over time between Y₁ and Y₂ is intended to be analogous to that found in many real-world clinical trials.

Derivation of Expression (1)
The approximate power to reject the null hypothesis is typically found by setting
power = Φ( |Δ| / √Var(Δ̂) − z₁₋α/₂ ).    (7)
This formula can be derived for a generic Wald z-test as follows. By the central limit theorem, for a sufficiently large sample size N, Var(Δ̂) is inversely proportional to N, so that there is some quantity σ̃² for which we can write Var(Δ̂) ≈ σ̃²/N for large N. Thus, Expression (7) can be rewritten as
power = Φ( |Δ|√N/σ̃ − z₁₋α/₂ ),
and solving this for N expresses the sample size needed for a given power as in Expression (1) (as in, e.g., Kidwell et al., 2019; Seewald et al., 2020).
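In code, the resulting generic relationship can be sketched as follows. The symbol names are ours, and sigma2 stands for the hypothetical large-sample variance scale factor, which Expressions (2)–(5) re-express in elicitable terms; the example values are purely illustrative.

```python
import math
from statistics import NormalDist

nd = NormalDist()

def required_n(delta, sigma2, power=0.80, alpha=0.05):
    """Generic Wald-test sample size.

    With Var(delta_hat) ~ sigma2 / N, setting
        power = Phi(|delta| * sqrt(N / sigma2) - z_{1 - alpha/2})
    and solving for N gives
        N = sigma2 * (z_{1 - alpha/2} + z_power)^2 / delta^2.
    """
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return sigma2 * (z_alpha + z_power) ** 2 / delta ** 2

# Illustrative example: detecting a log odds ratio of log(2)
# with a hypothetical variance scale factor of 16.
n = required_n(delta=math.log(2), sigma2=16.0)
```

As the derivation suggests, the required N scales linearly in the variance factor and inversely with the square of the effect size.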

Derivation of Expression (2)
Suppose a researcher is considering the sample size required for a two-stage restricted (Design II) SMART analyzed with weighted and replicated estimating equations (see, e.g., Lu et al., 2016; Kidwell et al., 2019; Seewald et al., 2020), either with a single endpoint or with repeated measures. Suppose further that the contrast for which power is being planned is the comparison of two embedded adaptive interventions (i.e., dynamic treatment regimens), d and d′, in terms of end-of-study outcomes. We assume that the difference in effectiveness will be measured by a log odds ratio after fitting a logistic model.
Following the potential outcomes framework, we consider each individual participant i to have had, in advance, a potential response status Rᵢ(d) and a potential outcome Yᵢ(d) for each possible adaptive intervention d = (a₁, a₂) which that participant could receive. As usual, some of these will be unobserved counterfactual outcomes. As a special consequence of the design of the study, Rᵢ(d) actually depends only on a₁ and not on a₂. Furthermore, if Rᵢ(d) = 1, then the second-stage option a₂, which that participant would otherwise have received, is not assigned, or at least not revealed or used. It may seem counterintuitive to be making inferences for the effect of d = (a₁, a₂) as a whole, when it is known that the second component is irrelevant to some study participants. However, the analysis is intended to provide generalizable information on which adaptive intervention would be best on average for future patients, considering both responders and nonresponders (Lu et al., 2016; Dziak et al., 2019; Nahum-Shani et al., 2020). When writing expectations, we suppress the subscript i in Rᵢ(d) and Yᵢ(d) and simply write R(d) and Y(d).
In the case of a significance test of the log odds ratio, it is assumed that the log odds of the outcome follows a logistic model in A₁ and A₂, where A₁ and A₂ are effect-coded except that A₂ = 0 for responders. We also assume that randomizations will be done with equal probability. As a result, responders are compatible with two of the four embedded adaptive interventions and have a weight of 1/(1/2) = 2, while nonresponders are compatible with only one of the four adaptive interventions and have a weight of 1/(1/4) = 4. These weights can be thought of as inverse propensity weights, or equivalently as adjustments in a weighting and replication approach (see Lu and colleagues, 2016).
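The weighting-and-replication step described above can be sketched as follows; the row layout (y0, a1, r, a2, y1) is our own illustrative convention, not the paper's code.

```python
def weight_and_replicate(rows):
    """Expand SMART data for a weighted-and-replicated analysis.

    Each input row is (y0, a1, r, a2, y1). Responders (r == 1) were
    randomized only once (probability 1/2 of matching a given adaptive
    intervention), so they get weight 1/(1/2) = 2 and are replicated
    under both second-stage options. Non-responders were randomized
    twice (probability 1/4), so they get weight 1/(1/4) = 4 and appear
    once, under their observed (a1, a2).
    """
    out = []
    for (y0, a1, r, a2, y1) in rows:
        if r == 1:
            for a2_ai in (+1, -1):  # responder is consistent with both (a1, +1) and (a1, -1)
                out.append({"y0": y0, "a1": a1, "a2": a2_ai, "y1": y1, "w": 2})
        else:
            out.append({"y0": y0, "a1": a1, "a2": a2, "y1": y1, "w": 4})
    return out

rows = [(0, +1, 1, 0, 1), (1, -1, 0, +1, 0)]
expanded = weight_and_replicate(rows)
# The responder contributes two weighted replicates, the non-responder one.
```

The expanded, weighted dataset is then analyzed with a (working independence) weighted logistic regression, as described in the next subsection.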
We are assuming the estimand Δ of interest is a linear combination Δ = c′β of the regression coefficients. To derive Expression (2) from Expression (1), it is necessary to have a value for the variance scale factor σ̃² = N·Var(Δ̂). However, σ̃² itself is an abstract quantity which would be impractical to elicit from a clinician or substantive researcher, and which would not be reported in the literature. Thus, it must be re-expressed in terms of other quantities. Under our assumption that Δ = c′β, we have Var(Δ̂) = c′Cov(β̂)c. This means the remaining problem is to re-express Cov(β̂) in terms of quantities that can be easily interpreted and elicited.
Suppose that, as in Kidwell and colleagues (2019), we only consider data from the final measurement time (i.e., no baseline observation). Then Equation (8) reduces to an estimating equation which resembles an ordinary (independence) weighted logistic regression.
In order to proceed in computing this sandwich formula, we make three further assumptions. The first two are consistency assumptions (see Seewald et al., 2019; see, e.g., Hernán & Robins, 2020 for a more general discussion of consistency) needed in order for the results of the proposed study to be identified and interpretable. The third is a working assumption (similar to those of Seewald and colleagues, 2020) made in order to allow a tractable expression for the variance in terms of elicitable quantities. Let d be an embedded adaptive intervention defined by intervention options (a₁, a₂).
1. Consistency assumption for R: The observed response R for a given individual consistent with adaptive intervention d equals that individual's potential outcome R(d). This assumption is required in order for the model quantities to be identifiable. We use the notation R(d) for convenience, but it might be more precise to write R(a₁), because R(d) depends only on the a₁ component of adaptive intervention d, not the a₂ component, which would not even be used except after observing nonresponse. That is, if the embedded adaptive interventions are listed as (+1, +1), (+1, −1), (−1, +1), and (−1, −1), then R(+1, +1) = R(+1, −1) by construction, and R(−1, +1) = R(−1, −1).
Nonetheless, the notation R(d) is less cumbersome than a more precise alternative would be. We write the expectation of R(d) as r(d) = E(R(d)).

2. Consistency assumption for Y: The observed outcome Y for a given individual who is consistent with adaptive intervention d, and who has response status R(d), equals that individual's potential outcome Y(d). This is required in order for the model quantities to be identifiable. In the current context Y(d) is a binary variable with expected value p(d), and it can therefore be shown to have variance p(d)(1 − p(d)).
3. Cross-world independence assumption for responders: For two adaptive interventions d and d′ with the same initial intervention option but different second intervention options, the posttest potential outcomes Y(d) and Y(d′) are treated as independent among responders. This assumption cannot be checked, because for a given responder only one of Y(d) or Y(d′) can ever be observed.
However, it seems reasonable under the assumption that intervention option a₂ is randomized. It does not matter whether the variable being conditioned on is written as R(d) or R(d′), because the two adaptive interventions are assumed to entail the same initial intervention option.
As in Seewald and colleagues (2020), the bread of the sandwich, i.e., the naïve model-based covariance matrix for the longitudinal regression parameters before adjusting for the special features of the SMART design, is the inverse of a weighted sum of squares, B, summed over participants and the adaptive interventions with which they are consistent.
Recall that responders have a 1/2 chance of being compatible with a given adaptive intervention (depending only on the randomization of A₁) and have a weight of 2, while nonresponders have a 1/4 chance (depending on the randomization of both A₁ and A₂) and have a weight of 4. Because of this, the expected values of the inverse probability weights W(d) are 1 by the law of iterated expectations. The weights and design codes do not depend on the observed data but only on the true parameters (recall that we are assuming no baseline covariates), so they can be treated as constants. That is, assuming consistency assumption 1 in Section 4.1, and using the weight W(d) given in (2), a given cross-product term in (6) simplifies accordingly. As an aside, it may seem strange that the exponent of the Bernoulli variance p(d)(1 − p(d)) in the resulting expression is positive, not negative, given that it arises from the model-based covariance estimate. All else being equal, the higher p(d)(1 − p(d)) is, the smaller the error variance of the effect of interest will be. In that sense, p(d)(1 − p(d)) behaves opposite to how an error variance would behave in a linear model. However, this does not constitute a problem with the model or the estimator; it is instead a result of the basic properties of the logistic link function. Roughly speaking, when the mean (i.e., the probability parameter) of a Bernoulli variable is near 0 or near 1, the variance of the observed values will be very small, because most observed values of the variable will be identical (0 or 1, respectively).
However, under those circumstances the error variance of the odds ratio will be large, because either the odds ratio or its inverse is tending toward infinity. When the probability is instead near 1/2, the variance of the probability estimate is large (because the observed values are roughly evenly distributed between 0's and 1's, none of them near the mean). However, this is the circumstance in which the odds ratio has the lowest variance. In general, for the mean p̄ of n draws of a Bernoulli random variable with probability p, we have Var(p̄) = p(1 − p)/n, but by the delta method Var(log(p̄/(1 − p̄))) ≈ 1/(np(1 − p)). Intuitively, the equivalent of the error variance in logistic regression models is 1/(p(1 − p)), which determines the variance of the estimand of interest (the log odds ratio), rather than p(1 − p), which determines the variance of Y itself.
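A quick Monte Carlo check of this point (our own illustration, using the simple unweighted sample proportion) shows the reversal directly: the log odds is more variable exactly where the raw proportion is less variable.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def mc_var_logit(p, n, reps=4000, seed=1):
    """Monte Carlo variance of logit(p_hat), where p_hat is the mean of n
    Bernoulli(p) draws (replications with p_hat of exactly 0 or 1 are skipped)."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        k = sum(rng.random() < p for _ in range(n))
        if 0 < k < n:
            vals.append(logit(k / n))
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

n = 200
v_mid = mc_var_logit(0.5, n)    # delta method predicts 1/(n * .25) = 0.020
v_edge = mc_var_logit(0.9, n)   # delta method predicts 1/(n * .09) ~ 0.056
# v_edge > v_mid even though Var(p_hat) is largest at p = 0.5.
```

This is the sense in which p(1 − p) plays the role of an inverse, rather than a direct, variance factor for the log odds ratio.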
Returning to the expression for B, it is still necessary to handle the sum over adaptive interventions d. It is most convenient for purposes of derivation to have a specific dummy code (indicator) for each adaptive intervention, instead of dummy or effect codes with a general intercept term. Therefore, with the four embedded adaptive interventions, the design vector x(d) for adaptive intervention d has a 1 in the dth position and 0 elsewhere. Thus, for a given d, the matrix x(d)x(d)′ has only one nonzero entry, specifically the dth diagonal entry.
This implies that B can be treated as a diagonal matrix whose dth diagonal entry is proportional to v(d) = p(d)(1 − p(d)). Therefore, B⁻¹ is a diagonal matrix whose dth diagonal entry is proportional to 1/v(d). Of course, this assumes that p(d) is neither 0 nor 1.
Letting the notation ⊗ represent the outer product of a vector with itself, the "meat" of the sandwich (the empirical covariance of the estimating function) is the expected outer product of the weighted estimating function. With the indicator coding above, a typical entry of this matrix is a weighted cross-product of residual terms for a pair of adaptive interventions d and d′.
We can substitute the potential outcomes for the observed outcomes above by assumption 2, because only participants consistent with adaptive intervention d will have a nonzero weight W(d). To simplify the expression for the meat further, there are three cases to consider.
• Case One: If d = d′, the entry is the variance of the weighted residual for a single adaptive intervention. In this context we use r to denote a particular value, 0 or 1, of the random variable R(d), although we caution that we follow other literature elsewhere in the paper in also using the same symbol to indicate Pr(R = 1) in cases where all response probabilities are equal.
• Case Two: If d ≠ d′ but the two adaptive interventions recommend the same first-stage intervention option, then W(d)W(d′) is nonzero for responders and zero for nonresponders.
(Note that in this case the response probabilities r(d) and r(d′) are equal; response probability is assumed to depend only on the first intervention option, as the second assignment would not even occur for responders.) Therefore, by assumption 2 (consistency), the entry reduces to a cross-world covariance of posttest potential outcomes among responders. It is difficult to evaluate or elicit this cross-world covariance, so we make the simplifying assumption 3, which says that it can be treated as zero because of randomization.
• Case Three: If d ≠ d′ and the adaptive interventions differ on the first-stage recommendation, then W(d)W(d′) = 0 in all cases, so that the cross-product can safely be ignored.
The quantities appearing in the diagonal entries are not the same as the conditional variances for nonresponders and responders to adaptive intervention d, although they could be considered adjusted forms of the conditional variances, which add a positive term as described below. In this sense, the second term in (9) can be interpreted as the extra variability contributed by the tendency of responders and nonresponders to have different average outcomes under a given adaptive intervention for non-randomized reasons, above and beyond the effect of the second intervention option. The expressions for the three cases can be combined so that the meat is a diagonal matrix; substituting the resulting expression for Var(Δ̂) into Equation (1) leads to Equation (2).

Derivation of Expression (4)
Suppose that a baseline measurement or pretest is being considered, so that there are two observations per subject. Suppose the measurement times are coded as 0 and 1. We still solve the estimating equations given in Expression (8), but now we consider the adaptive-intervention-specific expected value as a vector p(d) = [p₀(d), p₁(d)] rather than a scalar p(d). By the randomization design, the baseline measurement does not depend on the adaptive intervention that will be assigned, i.e., p₀(d) = p₀ regardless of d. Therefore, the mean vector can be written in terms of design matrices defined in the following subsection.
The outcome of interest is still assumed to be a log odds ratio comparing expected end-of-study outcomes under two adaptive interventions differing at least in the first intervention option, that is, logit p₁(d) − logit p₁(d′), although now it is assumed that these will be estimated using a longitudinal model which takes within-subject correlation into account. Suppose also that the weighted estimating equations of Expression (8) will be used. The variance of the log odds ratio can be derived indirectly by noticing that Δ is a linear combination of the regression coefficients with a contrast vector c having +1 in the position corresponding to adaptive intervention d, −1 in the position corresponding to adaptive intervention d′, and 0 everywhere else; for example, such a contrast can be used for comparing the "enhanced working memory training" to the "enhanced incentives alone" adaptive interventions (Table 1). This leads to the sample size recommendation (4). However, Expression (4) is not useful on its own as a sample size formula unless the variance term can be expressed in terms of quantities that have intuitive scientific or practical meaning; abstract variance components are impractical to elicit from investigators seeking to design a SMART.
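For example, the variance of such a contrast can be computed as a quadratic form in the covariance matrix of the estimated per-intervention log odds. The covariance values below are purely illustrative, not taken from the paper.

```python
def quad_form(c, cov):
    """Compute c' Cov c, the variance of the linear combination c' beta_hat."""
    k = len(c)
    return sum(c[i] * cov[i][j] * c[j] for i in range(k) for j in range(k))

# Hypothetical 4x4 covariance matrix of the per-adaptive-intervention log
# odds estimates, ordered as (+,+), (+,-), (-,+), (-,-); values illustrative.
cov = [
    [0.040, 0.010, 0.000, 0.000],
    [0.010, 0.045, 0.000, 0.000],
    [0.000, 0.000, 0.042, 0.012],
    [0.000, 0.000, 0.012, 0.048],
]

# Contrast comparing the first intervention (+,+) with the third (-,+).
c = [1, 0, -1, 0]
var_delta = quad_form(c, cov)  # = cov[0][0] + cov[2][2] - 2 * cov[0][2]
```

The block-diagonal zeros here reflect Case Three above: estimates for adaptive interventions that differ in the first-stage option share no participants.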
Several parameterizations are possible for constructing the design matrix. In practice, it may be good to use effect coding (see Dziak and colleagues, 2020). However, for purposes of the current derivation, a simpler parameterization is more convenient: it has one parameter for the common pretest expected value and one parameter for the posttest expected value of each adaptive intervention, that is, a log odds parameter equal to logit(p₀) for the pretest and to logit(p₁(d)) for the posttest under adaptive intervention d. The parameters in this parameterization are thus the log odds corresponding to the probabilities p₀ and p₁(d).
In this parameterization, the pretest parameter is not exactly an intercept, because it does not appear in the expressions for the posttest means. Let ρ be the correlation between the baseline outcome and the outcome at the final time point, assumed for simplicity to be the same for each embedded adaptive intervention.
Analogously to Section 4.1, we make four assumptions in order to approximate the bread matrix $\mathbf{B}$ and the meat matrix $\mathbf{M}$. The first two are consistency assumptions needed for identifiability, while the third and fourth are working assumptions used to provide a more tractable simplified formula for the variance.
1. Consistency assumption for response status: same as Assumption 1 in Section 4.1.

2. Consistency assumption for the outcomes: The observed posttest (end-of-study) outcome for individuals whose treatment sequence is consistent with adaptive intervention $d$ equals the potential outcome under $d$, written $Y_2^{(d)}$ for short. It is a binary variable with expected value $\mu_2^{(d)}$ and variance $\mu_2^{(d)}(1-\mu_2^{(d)})$. The observed pretest outcome (the baseline value of the outcome variable, before the intervention) for a given individual is a constant regardless of the adaptive intervention which that individual will receive, and can therefore be written simply as $Y_1$. Its expected value is $\mu_1$ and its variance is $\mu_1(1-\mu_1)$.

3. Within-subject correlation: Pretest and posttest residuals for each adaptive intervention are correlated within person at some nonnegative value, $\mathrm{Corr}(Y_1, Y_2^{(d)}) = \rho \geq 0$. This marginal correlation is here assumed to be the same for each adaptive intervention for purposes of deriving the formulas. Note that the correlation conditional on a value of the response status $R$ need not be the same as the marginal correlation, but it is also assumed nonnegative.

4. Cross-world independence assumption for responders: For two adaptive interventions $d$ and $d'$ with the same initial intervention option but different second-stage options, cross-products of residuals from the two counterfactual worlds that involve a posttest residual have zero expectation among responders:
$$E\{(Y_2^{(d)}-\mu_2^{(d)})(Y_2^{(d')}-\mu_2^{(d')}) \mid R = 1\} = 0 \quad\text{and}\quad E\{(Y_1-\mu_1)(Y_2^{(d')}-\mu_2^{(d')}) \mid R = 1\} = 0. \tag{10}$$
This last working assumption is somewhat unrealistic, but relaxing it would make the sample size formula much more complicated. Further, simulations presented in the main paper show that the simpler sample size formulas obtained by using this assumption perform very well in a fairly realistic scenario.
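In power simulations, assumptions like these must be turned into a data-generating mechanism. One convenient (but by no means the only) way to generate binary pretest/posttest pairs with a chosen within-person correlation is a latent-normal (Gaussian copula) threshold model; note that the latent correlation supplied below only approximates the realized binary (phi) correlation, and all parameter values are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2024)
inv_cdf = NormalDist().inv_cdf

def correlated_binary(n, p_pre, p_post, rho_latent):
    """Draw n (pretest, posttest) pairs by thresholding a latent bivariate
    normal at the quantiles that match the two marginal probabilities."""
    cov = [[1.0, rho_latent], [rho_latent, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y_pre = (z[:, 0] < inv_cdf(p_pre)).astype(int)
    y_post = (z[:, 1] < inv_cdf(p_post)).astype(int)
    return y_pre, y_post

y_pre, y_post = correlated_binary(100_000, 0.30, 0.50, 0.50)
phi = np.corrcoef(y_pre, y_post)[0, 1]   # realized (phi) correlation
```

In a full SMART simulation, a generator like this would be applied separately within each response-status and treatment-sequence cell.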
Generalizing the single-time-point results, the bread matrix $\mathbf{B}$ is again a weighted sum of contributions, one block per adaptive intervention. Notice that the contribution for one adaptive intervention has an analogous structure to that for another, but with the zeroes placed differently in the matrix; continuing in this way, all of $\mathbf{B}$ can be written out. To proceed from here, we need expressions for the conditional moments of the baseline outcome given response status. Recall that the parameterization provides $E(Y_1) = \mu_1$ and $E(Y_2^{(d)}) = \mu_2^{(d)}$. Therefore, let $\mu_{1,\mathrm{NR}} = E(Y_1 \mid R = 0)$ and $\mu_{1,\mathrm{R}} = E(Y_1 \mid R = 1)$. In particular, under the unrealistic assumption that future responder status is not related to the baseline outcome, the corresponding weighted baseline variance terms would each simplify to $\mu_1(1-\mu_1)$; otherwise, they will be somewhat larger due to an added positive term.
Then, the "meat" of the sandwich is the middle matrix $\mathbf{M}$. Following Seewald and colleagues (2020), we expand $\mathbf{M}$ as a double sum over pairs of adaptive interventions, consisting of within-intervention terms and cross-intervention terms (expression (11)). Consider terms in the first summation. Using iterated expectation, each such term can be written as an expectation over response status of weighted residual cross-products. Using the consistency assumptions, and the definition of the weights $W^{(d)}$, which give zero weight to participants not consistent with a given adaptive intervention $d$, we can replace the observed outcomes with the potential outcomes under $d$. Next, consider an off-diagonal term in (11), specifically one involving two different adaptive interventions $d$ and $d'$. If adaptive interventions $d$ and $d'$ do not recommend the same initial intervention option, then no participant can be consistent with both, and this cross-product is zero because $W^{(d)} W^{(d')}$ is zero.
However, if adaptive interventions $d$ and $d'$ recommend the same initial intervention option, then $W^{(d)} W^{(d')}$ can be nonzero for responders. In this case, using iterated expectation, the off-diagonal cross-product involves the quantity $\eta = E\{(Y_2^{(d)}-\mu_2^{(d)})(Y_2^{(d')}-\mu_2^{(d')}) \mid R = 1\}$. This is not necessarily zero in practice, but it is very hard to calculate or to interpret, and so we propose to use expression (10) in Assumption 4 to treat it as zero for purposes of deriving the power formula.
Under this assumption, the extra terms in (11) disappear, so that $\mathbf{M}$ simplifies considerably. We now have expressions, although not very simple ones, for $\mathbf{B}$ and $\mathbf{M}$. One could therefore substitute $\boldsymbol{\Sigma} = \mathbf{B}^{-1}\mathbf{M}\mathbf{B}^{-1}$ into equations (6). However, this approach could be challenging in practice, because for each adaptive intervention $d$ it requires values for the response-status-specific moments in addition to the marginal covariance matrix, and those conditional quantities do not have an intuitive interpretation themselves as quantities of scientific interest.
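The substitution described here is mechanical once $\mathbf{B}$ and $\mathbf{M}$ are available. The sketch below uses small invented placeholder matrices purely to show the sandwich assembly $\boldsymbol{\Sigma} = \mathbf{B}^{-1}\mathbf{M}\mathbf{B}^{-1}$ and the contrast variance $\mathbf{c}^{\top}\boldsymbol{\Sigma}\mathbf{c}$; none of the numbers come from the derivation.

```python
import numpy as np

# Placeholder bread (B) and meat (M) matrices for a model with one pretest
# parameter and four posttest parameters (one per adaptive intervention).
B = np.diag([2.0, 1.2, 1.2, 1.2, 1.2])
M = np.diag([2.5, 1.8, 1.8, 1.8, 1.8])
M[0, 1:] = M[1:, 0] = 0.4                  # pretest-posttest cross terms

B_inv = np.linalg.inv(B)
Sigma = B_inv @ M @ B_inv                  # asymptotic covariance of beta-hat

# Contrast: posttest log odds of adaptive intervention 1 minus that of 2.
c = np.array([0.0, 1.0, -1.0, 0.0, 0.0])
var_delta = c @ Sigma @ c
```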
This issue is not insurmountable, because $\mathbf{B}$ and $\mathbf{M}$ can be computed indirectly from the $\beta$ and $\mu$ parameters, while the $\mu$ parameters can be computed from conditional cell probabilities $q$ and the response rates $r$. Lastly, the $q$ parameters can be elicited indirectly as conditional probabilities for outcomes in the appropriate cells of the design. For example, for adaptive intervention $d = (+1,+1)$, the nonresponder probability $q_{\mathrm{NR}}^{(d)}$ should equal the expected outcome probability in the observed cell with $a_1 = +1$, $R = 0$, and $a_2 = +1$. Similarly, the responder probability $q_{\mathrm{R}}^{(d)}$ would equal the expected outcome probability in the observed cell with $a_1 = +1$ and $R = 1$. There are only six observed cells, despite the eight possible combinations of $d = 1, 2, 3, 4$ and $R = 0, 1$, but this is reasonable because $q_{\mathrm{R}}^{(d)}$ will equal $q_{\mathrm{R}}^{(d^{\star})}$ for a pair of adaptive interventions $d$ and $d^{\star}$ differing only on the second-stage option. That is, under reasonable elicitation assumptions, $P(Y_2^{(a_1, a_2)} = 1 \mid R^{(a_1)} = 1)$ should not depend on $a_2$: the hypothetical choice of second-stage intervention option does not affect the outcome for responders, who never receive this part of the adaptive intervention. Nonetheless, even though the needed variance can be calculated using expression (12), a simpler formula would be desirable.
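The elicitation chain described above ends in simple arithmetic: a marginal end-of-study probability is a response-rate-weighted mixture of the elicited conditional cell probabilities. A minimal sketch, with all numeric values hypothetical:

```python
def marginal_mu(r, q_responder, q_nonresponder):
    """Marginal end-of-study success probability for one embedded adaptive
    intervention: a response-rate-weighted mixture of the conditional cell
    probabilities (all inputs elicited; values below are hypothetical)."""
    return r * q_responder + (1 - r) * q_nonresponder

# Two adaptive interventions sharing a1 = +1 share the responder cell and
# differ only in the nonresponder cell:
mu_plus_plus = marginal_mu(r=0.4, q_responder=0.7, q_nonresponder=0.5)
mu_plus_minus = marginal_mu(r=0.4, q_responder=0.7, q_nonresponder=0.4)
```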

Derivation of Expression (5)
If one is willing to assume $q_{\mathrm{R}}^{(d)} \approx q_{\mathrm{NR}}^{(d)} \approx \mu_2^{(d)}$ for each $d$, i.e., that the variance is approximately independent of response status, and also that the response rate $r$ is the same for each first-stage option, then the meat entry for each adaptive intervention simplifies as $4(1-r)\sigma_d^2 + 2r\sigma_d^2 = 2(2-r)\sigma_d^2$, where $\sigma_d^2 = \mu_2^{(d)}(1-\mu_2^{(d)})$. Then $\mathbf{c}^{\top}\boldsymbol{\Sigma}\mathbf{c}$ can be shown to equal the variance expression appearing in formula (5).
Combining this with expression (1) leads to the sample size recommendation given in expression (5).
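The resulting recommendation can be evaluated with a generic z-based calculator of the form $N \geq (z_{1-\alpha/2} + z_{\mathrm{power}})^2\, v / \Delta^2$. In this sketch the variance factor `v` is a user-supplied placeholder standing in for the expression derived above, and the effect size is hypothetical.

```python
from math import ceil, log
from statistics import NormalDist

def smart_sample_size(delta, v, alpha=0.05, power=0.80):
    """Generic z-based requirement: N >= (z_{1-alpha/2} + z_{power})^2 * v / delta^2.
    delta: targeted log odds ratio; v: per-person variance factor (here a
    user-supplied placeholder for the expression derived in the text)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * v / delta ** 2)

# Hypothetical effect: end-of-study success probabilities 0.55 vs. 0.40.
delta = log(0.55 / 0.45) - log(0.40 / 0.60)
n_required = smart_sample_size(delta, v=20.0)
```

As expected, a larger targeted log odds ratio yields a smaller required sample size.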

Alternative to Expression (5)
It was noted earlier that the working independence assumption (10) is somewhat unrealistic. This is because it implies that the baseline residual is independent of the posttest residual from the other counterfactual world, that is, $E\{(Y_1-\mu_1)(Y_2^{(d')}-\mu_2^{(d')}) \mid R = 1\} = 0$. Such cross-world independence is standard after randomization, but less intuitive before randomization. It would be much more parsimonious to make the assumption of cross-world independence only for the final outcomes and not for the baseline outcomes, i.e., to assume
$$E\{(Y_2^{(d)}-\mu_2^{(d)})(Y_2^{(d')}-\mu_2^{(d')}) \mid R^{(a_1)} = 1\} = 0 \tag{13}$$
instead of (10). Therefore, it is worth considering what the $2 \times 2$ cross-world residual matrix would equal if equation (13) were true but equation (10) were not; that is, if post-randomization potential outcomes were independent across counterfactual worlds but pre-randomization potential outcomes were identical among them. In this case, because the posttest residuals are still independent, the lower right corner of the matrix is still zero. However, the off-diagonal entries are not zero. In this subsection we only need to consider pairs of adaptive interventions agreeing on the first-stage intervention option $a_1$, because other pairs have weight zero in (11); therefore, we can at least assume $R^{(d)} = R^{(d')}$. The off-diagonal entries can then be written in terms of $\mu_{1,\mathrm{R}} = E(Y_1 \mid R = 1)$ and $\mu_{2,\mathrm{R}}^{(d)} = E(Y_2^{(d)} \mid R = 1)$, and of $\rho^{\star}$, the pretest-posttest correlation conditional on $R = 1$, assumed here to be the same for each adaptive intervention.
However, this expression is likely to be more difficult for a substantive researcher to use.
Fortunately, the exact error variance of the pretest mean (or intercept) does not matter very much to the contrasts of interest in this paper, which focuses on pairwise end-of-study contrasts between adaptive interventions differing in initial intervention option . Accordingly, in our simulations the power formula still performs quite well, despite working assumption (10).
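The simulation check mentioned here can be illustrated in miniature. The sketch below estimates power by Monte Carlo for a deliberately simplified setting: an ordinary two-group comparison of end-of-study binary outcomes via a Wald test on the log odds ratio, without the SMART weighting and replication. Sample sizes and probabilities are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)

def simulated_power(n_per_group, p1, p2, alpha=0.05, n_sims=2000):
    """Fraction of simulated trials in which the Wald z-test on the log
    odds ratio (with a 0.5 continuity adjustment) rejects at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        y1 = rng.binomial(n_per_group, p1)
        y2 = rng.binomial(n_per_group, p2)
        a, b = y1 + 0.5, n_per_group - y1 + 0.5
        c, d = y2 + 0.5, n_per_group - y2 + 0.5
        lor = np.log(a / b) - np.log(c / d)
        se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        rejections += abs(lor / se) > z_crit
    return rejections / n_sims

power_estimate = simulated_power(300, 0.55, 0.40)
```

A full SMART power simulation would replace the two binomial draws with simulated trial trajectories (first-stage assignment, response, re-randomization) and the Wald test with the weighted longitudinal analysis described in this appendix.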

Alternative Formulas with Identity Link Function (Linear Model)
In this paper we have used an approach similar to logistic regression, with a logit link function.
Other options would be to use a log link or an identity link. Although the identity link is generally not recommended for binary data, it is helpful to derive formulas for this case, because this was the link used for linear modeling by Seewald and colleagues (2020), and it therefore allows the formulas to be compared directly with theirs. This would involve using the linear (identity) mean function in Equation (1) instead of the logit link function. Accordingly, the estimand $\Delta$ of interest in Equation (1) would now be a difference in probabilities (risk difference) rather than a log odds ratio. The estimand $\Delta$ could still be written as a linear combination of the new coefficients, using the same contrast vector as before, and the power and sample size formulas (1) and (2) would still apply with the new variance expression. In the cross-sectional (single-wave) case, with $Y$ as the end-of-study binary outcome, the residual variance still has the same form $\sigma_d^2 = \mu^{(d)}(1-\mu^{(d)})$, because that is dictated by the distributional form of $Y$ and not by the link function or estimand of interest. However, the derivative of the inverse link function is now simply $1$ instead of $\mu^{(d)}(1-\mu^{(d)})$. Therefore, the simplified form of Equation (8) is the weighted least squares equation. The bread matrix simplifies to a diagonal matrix whose $d$th diagonal entry is proportional to $1/\sigma_d^2$, exactly the inverse of the corresponding entry of the bread matrix obtained before for the logistic link function. Thus $\sigma_d^2$ is used here where $(\sigma_d^2)^{-1}$ was used in (6). Intuitively, this discrepancy occurs because we are now studying probabilities (which are constrained above and below, and therefore have more sampling variance for common events than for rare ones) rather than odds (which are based on ratios and therefore have more sampling variance for rare events than for common ones).
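With the identity link and the standard SMART weights (2 for responders, 4 for re-randomized nonresponders), the weighted least squares solution for one adaptive intervention's end-of-study probability is simply a weighted mean over consistent participants. The simulation sketch below illustrates this; the response rate, the conditional outcome probabilities, and the simplification that the outcome depends only on response status are all assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 40_000
r = 0.4                                      # assumed first-stage response rate

a1 = rng.choice([-1, 1], size=n)             # first-stage randomization
responder = rng.random(n) < r
a2 = rng.choice([-1, 1], size=n)             # second stage (nonresponders only)

# Consistency with the adaptive intervention (a1 = +1, a2 = +1): responders
# need only the right a1; nonresponders also need the right a2.
consistent = (a1 == 1) & (responder | (a2 == 1))
w = np.where(responder, 2.0, 4.0) * consistent

# Simplified outcome model: success probability depends only on response status.
p_true = np.where(responder, 0.7, 0.5)
y = rng.binomial(1, p_true)

mu_hat = (w @ y) / w.sum()                   # weighted least squares estimate
# Target: marginal mean r*0.7 + (1 - r)*0.5 = 0.58
```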
The meat matrix is now also diagonal. Therefore, $\boldsymbol{\Sigma}$ becomes a diagonal matrix whose entries are simply $4(1-r)\sigma_{d,\mathrm{NR}}^2 + 2r\sigma_{d,\mathrm{R}}^2$, where $\sigma_{d,\mathrm{NR}}^2$ and $\sigma_{d,\mathrm{R}}^2$ denote the conditional variances of the outcome for nonresponders and responders, respectively.

However, in the two-wave case the derivative matrix takes a different form than before, and the bread matrix again becomes a weighted sum of cross-products over adaptive interventions. The cross-product within adaptive intervention $d$ involves the weighted pretest-posttest covariance terms $4(1-r)\,\mathrm{Cov}(Y_1, Y_2^{(d)} \mid R = 0) + 2r\,\mathrm{Cov}(Y_1, Y_2^{(d)} \mid R = 1)$. If we assume the conditional covariances are approximately equal to the marginal covariance $\rho\sigma_1\sigma_d$, then the above approximates $2(2-r)\rho\sigma_1\sigma_d$, so that each block of the meat matrix is approximately $2(2-r)$ times the marginal within-person covariance matrix; this leads to expression (15). Notice also that if $\rho = 0$, then expression (15) simplifies to the cross-sectional equivalent, expression (14).
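The reduction noted here can be checked numerically. The functional form below, in which the two-wave factor equals the cross-sectional factor $2(2-r)\sigma^2$ shrunk by $(1-\rho^2)$, is an assumption of this sketch patterned after the continuous-outcome case of Seewald and colleagues (2020), not a restatement of expression (15).

```python
def v_cross_sectional(r, sigma2):
    """Cross-sectional variance factor for one adaptive intervention:
    2(2 - r) * sigma^2 (see the simplification above)."""
    return 2.0 * (2.0 - r) * sigma2

def v_two_wave(r, sigma2, rho):
    """Assumed two-wave form: the cross-sectional factor times (1 - rho^2)."""
    return v_cross_sectional(r, sigma2) * (1.0 - rho ** 2)

# With rho = 0 the two-wave factor reduces to the cross-sectional one,
# mirroring the reduction of expression (15) to expression (14).
```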