Testing Hypotheses of Covariate-Adaptive Randomized Clinical Trials

Covariate-adaptive designs are often implemented to balance important covariates in clinical trials. However, the theoretical properties of conventional hypothesis tests are usually unknown under covariate-adaptive randomized clinical trials. In the literature, most studies are based on simulations. In this article, we provide a theoretical foundation for hypothesis testing under covariate-adaptive designs based on linear models. We derive the asymptotic distributions of the test statistics for testing both treatment effects and the significance of covariates under null and alternative hypotheses. Under a large class of covariate-adaptive designs, (i) the hypothesis test comparing treatment effects is usually conservative, in the sense that its Type I error is smaller than the nominal level; (ii) the hypothesis test comparing treatment effects is usually more powerful than under complete randomization; and (iii) the hypothesis test for the significance of covariates is still valid. The class includes most of the covariate-adaptive designs in the literature, for example, Pocock and Simon's marginal procedure and the stratified permuted block design. Numerical studies are also performed to assess the corresponding finite-sample properties. Supplementary material for this article is available online.


INTRODUCTION
In clinical trials, it is usually important to balance treatment arms with respect to key covariates. To do this, many covariate-adaptive designs have been proposed in the literature. One natural idea is to stratify first and then randomize separately within each stratum, as in the stratified permuted block design. To deal with many covariates, Taves (1974) and Pocock and Simon (1975) introduced the minimization method, which attempts to minimize the weighted sum of marginal differences between the numbers of patients in the treatment groups over all covariates. Recently, Hu and Hu (2012) discussed some limitations of these classical designs, proposed a generalized family of covariate-adaptive designs, and obtained their theoretical properties. For more discussion of handling covariates in clinical trials, see McEntegart (2003), Zhang et al. (2007), Rosenberger and Sverdlov (2008), Hu, Zhu, and Hu (2014), and references therein.
Both the stratified permuted block design and Pocock and Simon's marginal procedure are extensively implemented in clinical studies. The stratified permuted block design has been used in a number of clinical trials, including Iacono et al. (2006) and Jakob et al. (2012). The use of Pocock and Simon's marginal method has also increased in recent decades. According to Taves (2010), Pocock and Simon's marginal procedure was implemented in over 400 clinical trials from 1989 to 2008. Some recent examples include Anderson et al. (2000), Gridelli et al. (2003), Krueger et al. (2007), Molander et al. (2007), and Ohtori et al. (2012).
Even though many covariate-adaptive designs have been proposed and implemented in clinical trials, the discussion of statistical inference associated with those methods is limited. In practice, conventional tests are often employed without consideration of the covariate-adaptive randomization scheme. It remains a concern whether conventional tests are still valid under covariate-adaptive designs. It is now generally accepted that covariates used in trial design should also be incorporated in inference procedures. Based on simulation studies, Forsythe (1987) suggested that all covariates used in the minimization method should be included in the analysis to achieve a valid test. Shao, Yu, and Zhong (2010, p. 347) pointed out that "one way to obtain a valid test procedure is to use a correct model between outcomes and covariates, including those used in randomization." However, in practice, often not all covariate information used in randomization can be fully used in inference procedures. In a clinical trial described in Anderson et al. (2000), Pocock and Simon's marginal procedure was implemented to balance allocation over three covariates: clinical center, performance status, and disease extent. A continuous primary endpoint was compared between two treatment groups using a two-sample t-test, without adjusting for covariate effects at all. In practice, some randomization covariates are omitted in the final analysis because: (i) it is difficult to incorporate some covariates in the analysis model, for example, investigation sites; and (ii) adjusting for too many covariates usually requires complicated modeling techniques.
There have been doubts about the validity of statistical inference for covariate-adaptive designs, especially covariates that are fully or partially omitted in inference procedures. Birkett (1985) and Forsythe (1987) raised concerns about validity of unadjusted analysis under covariate-adaptive designs. They found that the two-sample t-test is conservative in terms of small Type I error if Taves' minimization is used to allocate patients to treatment through simulation studies. They also found that the t-test is less powerful for minimization than complete randomization for small treatment difference, but more powerful if a larger treatment difference exists. Feinstein and Landis (1976) and Green and Byar (1978) studied inference problems for stratified randomization for binary responses. Shao, Yu, and Zhong (2010) theoretically proved that, under the covariate-adaptive biased coin procedure, the two-sample t-test is conservative, by assuming that the response primarily follows a simple homogeneous linear model. More discussions can be found in Simon (1979), Tu, Shalay, and Pater (2000), Aickin (2009), and so on.
In the literature, the results of statistical inference for covariate-adaptive designs are restricted in several aspects. (i) Conclusions are drawn primarily by simulation, and theoretical work is very limited. Shao, Yu, and Zhong (2010) proved the property of the two-sample t-test under a covariate-adaptive biased coin design, which is a stratified design that applies the biased coin method (Efron 1971) within each stratum and is not commonly used in practice. (ii) Only the two-sample t-test is discussed, where no covariate information is incorporated in the final analysis. In practice, often only a subset of the randomization covariates is used in the final statistical inference. The corresponding theoretical properties remain unknown. (iii) All studies focus on hypothesis testing for comparing treatment effects. There is very little, if any, discussion about inference on covariates under covariate-adaptive designs in the literature. In view of the importance of inference on covariates in clinical and medical studies (for example, in some personalized medicine and biomarker discovery studies), we also want to study inference properties for covariates in covariate-adaptive clinical trials.
Over the past several decades, scientists have identified more and more biomarkers (Ashley et al. 2010; Li et al. 2010; Lipkin et al. 2010; etc.) that may be linked with certain diseases in the fields of translational research (genomics, proteomics, and metabolomics). Based on these biomarkers, we would like to develop personalized medicine that provides better treatment regimens for patients based on their individual characteristics (which could be biomarkers or other covariates). Balancing treatment allocation for influential covariates has become more and more important in today's clinical trials (Hu 2012). Therefore, it is essential to study the theoretical behavior of hypothesis tests for both treatment effects and covariates under covariate-adaptive randomized clinical trials.
In this article, we establish a theoretical foundation for conventional testing under a linear model framework. For a large family of covariate-adaptive designs, we derive the asymptotic distributions of the test statistics for testing both the treatment effects and the significance of covariates under null and alternative hypotheses. We find that: (i) the hypothesis test comparing treatment effects is usually conservative, in the sense that its Type I error is smaller than the nominal level; (ii) the hypothesis test comparing treatment effects is usually more powerful than under complete randomization; and (iii) the hypothesis test for the significance of covariates is still valid. This article is organized as follows. The general framework is described in Section 2, and the theoretical results are given in Section 3. Section 4 presents numerical studies that were performed to assess Type I errors and compare power. Some concluding remarks and recommendations are given in Section 5.

HYPOTHESIS TESTING UNDER COVARIATE-ADAPTIVE DESIGNS
In this section, we study hypothesis testing based on a linear model framework for covariate-adaptive designs. Suppose two treatments, 1 and 2, are studied under a covariate-adaptive randomized clinical trial, and $\mu_1$ and $\mu_2$ are parameters measuring the main effects of treatments 1 and 2, respectively. Let $N$ be the total number of patients enrolled in the study. Let $I_i$ be the assignment of the $i$th patient, that is, $I_i = 1$ for treatment 1 and $I_i = 0$ for treatment 2, $i = 1, 2, \ldots, N$. Conditional on the treatment assignment $I_i$, the following linear model is assumed for the response of the $i$th patient $Y_i$,
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p} + \gamma_1 Z_{i,1} + \cdots + \gamma_q Z_{i,q} + \varepsilon_i, \qquad (1)$$
where
• the $X_{i,k}$'s and $Z_{i,j}$'s are discrete or continuous covariates which are independent and identically distributed as $X_k$ and $Z_j$, $k = 1, \ldots, p$ and $j = 1, \ldots, q$;
• both the $X_{i,k}$'s and $Z_{i,j}$'s are used in the randomization procedure, but only the $X_{i,k}$'s are used in the final statistical inference;
• all covariates are independent of each other, and $E X_k = 0$ and $E Z_j = 0$ for all $k$ and $j$; and
• the $\varepsilon_i$'s are independent and identically distributed random errors with mean zero and variance $\sigma_\varepsilon^2$, and are independent of $X_k$ and $Z_j$, $k = 1, \ldots, p$ and $j = 1, \ldots, q$.
Notice both X i,k and Z i,j are assumed to be scalars in model (1). If X i,k (or Z i,j ) is a discrete covariate, X i,k (or Z i,j ) is a scalar that can take several values corresponding to different categories. In practice, a vector is usually used to represent a discrete covariate with multiple categories. For example, a covariate having three categories can be coded as a two-dimensional vector with values of (0, 0), (1, 0), and (0, 1). In this article, X i,k (or Z i,j ) is assumed to be a scalar for simplicity, but all the results can be extended to the situation where discrete covariates with multiple categories are represented by vectors.
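Let $\mathbf{x}_i = (X_{i,1}, \ldots, X_{i,p})$ and $\mathbf{z}_i = (Z_{i,1}, \ldots, Z_{i,q})$, and let $\boldsymbol{\beta}_X$ and $\boldsymbol{\gamma}$ be the corresponding column vectors of coefficients. Then model (1) can be written as
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \mathbf{x}_i \boldsymbol{\beta}_X + \mathbf{z}_i \boldsymbol{\gamma} + \varepsilon_i.$$
The working model of inference would be
$$E[Y_i] = \mu_1 I_i + \mu_2 (1 - I_i) + \mathbf{x}_i \boldsymbol{\beta}_X, \qquad (2)$$
or in the matrix form, $E[Y] = X\beta$, where $Y = (Y_1, \ldots, Y_N)^T$, $X$ is the $N \times (p + 2)$ design matrix whose $i$th row is $(I_i, 1 - I_i, \mathbf{x}_i)$, and $\beta = (\mu_1, \mu_2, \boldsymbol{\beta}_X^T)^T$. The ordinary least squares method is used to obtain the estimate of $\beta$, which has the explicit form
$$\hat\beta = (X^T X)^{-1} X^T Y.$$
When model (2) is constructed to study data from a covariate-adaptive randomized clinical trial, the primary interest is usually to compare treatment effects between different groups. To compare the treatment effects $\mu_1$ and $\mu_2$, the following hypothesis testing is used:
$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_A: \mu_1 \neq \mu_2. \qquad (3)$$
The test statistic for (3) is
$$T = \frac{L \hat\beta}{\sqrt{\hat\sigma^2 \, L (X^T X)^{-1} L^T}}, \qquad (4)$$
where $L = (1, -1, 0, \ldots, 0)$ and $\hat\sigma^2 = (Y - X\hat\beta)^T (Y - X\hat\beta)/(N - p - 2)$ is the model-based estimate of the error variance. If $|T| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th quantile of a standard normal distribution, we will reject the null hypothesis; otherwise we accept the null hypothesis.
In many situations, we consider general forms of hypothesis testing for the significance of covariates (Cheung et al. 2014). Let $C$ be an $m \times (p + 2)$ matrix of rank $m$ with $m < (p + 2)$, where the entries of the first two columns are all zeros, and let $c_0$ be a given $m$-dimensional vector. Our hypothesis would be
$$H_0: C\beta = c_0 \quad \text{versus} \quad H_A: C\beta \neq c_0. \qquad (5)$$
The test statistic for hypothesis testing (5) is
$$T_F = \frac{(C\hat\beta - c_0)^T \left[ C (X^T X)^{-1} C^T \right]^{-1} (C\hat\beta - c_0)}{\hat\sigma^2}.$$
If $T_F > \chi^2_m(1 - \alpha)$, where $\chi^2_m(1 - \alpha)$ is the $100(1 - \alpha)$th percentile of a $\chi^2$ distribution with $m$ degrees of freedom, we will reject the null hypothesis; otherwise we accept the null hypothesis.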
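A special case of testing (5) is evaluating the significance of a single covariate (biomarker). This is usually important in personalized medicine (Zhu, Hu, and Zhao 2013). Without loss of generality, we consider the hypothesis testing for $\beta_1$, the coefficient of $X_1$. To test the significance of $\beta_1$, the hypothesis would be
$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0. \qquad (7)$$
The test statistic for hypothesis testing (7) can be reduced to
$$T_1 = \frac{\ell \hat\beta}{\sqrt{\hat\sigma^2 \, \ell (X^T X)^{-1} \ell^T}},$$
where $\ell = (0, 0, 1, 0, \ldots, 0)$, $\hat\beta$ is the ordinary least squares estimate under the working model (2), and $\hat\sigma^2$ is the model-based estimate of the error variance. If $|T_1| > Z_{1-\alpha/2}$, where $Z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th quantile of a standard normal distribution, we will reject the null hypothesis; otherwise we accept the null hypothesis.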
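The tests described in this section can be illustrated with a small numerical sketch. The code below fits the working model by ordinary least squares with design columns $(I_i, 1 - I_i, X_{i,1}, \ldots, X_{i,p})$ and evaluates a t-type statistic for a single contrast. It assumes the usual form of such a statistic (contrast estimate divided by its model-based standard error, with the residual variance estimated on $N - p - 2$ degrees of freedom); the function names `solve_inverse`, `working_model_fit`, and `contrast_stat` are hypothetical helpers introduced only for illustration.

```python
import math

def solve_inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def working_model_fit(y, assign, covs):
    """OLS fit of y on (I, 1 - I, covariates).

    Returns (beta_hat, sigma2_hat, inverse of X'X), where sigma2_hat is the
    residual sum of squares divided by N minus the number of columns.
    """
    rows = [[float(a), 1.0 - a] + list(x) for a, x in zip(assign, covs)]
    n, d = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(d)]
    inv = solve_inverse(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(d)) for i in range(d)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, rss / (n - d), inv

def contrast_stat(beta, sigma2, inv, contrast):
    """t-type statistic: contrast estimate over its model-based standard error."""
    d = len(beta)
    est = sum(c * b for c, b in zip(contrast, beta))
    var = sigma2 * sum(contrast[i] * inv[i][j] * contrast[j]
                       for i in range(d) for j in range(d))
    return est / math.sqrt(var)
```

With no covariates in the working model, the contrast `[1, -1]` reduces to the pooled two-sample t statistic; with covariates, the contrast `[0, 0, 1, 0, ...]` gives the statistic for the coefficient of the first covariate.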
In clinical trials, covariate-adaptive designs are usually based on discrete covariates (Taves 2010). If a continuous covariate is to be used in randomization, a continuous-to-discrete conversion needs to be performed to break the continuous covariate down into a discrete variable with several categories. Let $C = \{j : Z_j \text{ is continuous}, j = 1, \ldots, q\}$ and $C^* = \{k : X_k \text{ is continuous}, k = 1, \ldots, p\}$. If $k \in C^*$ or $j \in C$, the covariate-adaptive design is applied with respect to the discrete variables $d_k^*(X_k)$ or $d_j(Z_j)$, where $d_k^*$ and $d_j$ are discrete functions. In this case, define $\delta_{i,k}^* = X_{i,k} - E(X_{i,k} \mid d_k^*(X_{i,k}))$ and $\delta_{i,j} = Z_{i,j} - E(Z_{i,j} \mid d_j(Z_{i,j}))$. Let $\tilde X_k = d_k^*(X_k)$ if $k \in C^*$ and $\tilde X_k = X_k$ otherwise, and define $\tilde Z_j$ analogously. Here, $\tilde X_{i,k}$ and $\tilde Z_{i,j}$ are the $i$th observations of covariates $\tilde X_k$ and $\tilde Z_j$, $k = 1, \ldots, p$ and $j = 1, \ldots, q$; $\tilde X_{i,k}$ and $\tilde Z_{i,j}$ are used in the covariate-adaptive randomization process.
We further define all levels of imbalance between patients in the two treatments. Suppose $\tilde X_k$ has $s_k^*$ levels and $\tilde Z_j$ has $s_j$ levels, resulting in $\prod_{k=1}^{p} s_k^* \prod_{j=1}^{q} s_j$ strata in total. For convenience, we use $(t_1, t_2, \ldots, t_p, r_1, r_2, \ldots, r_q)$ to denote the stratum formed by patients who have the same covariate profile $(\tilde X_1 = x_1^{t_1}, \ldots, \tilde X_p = x_p^{t_p}, \tilde Z_1 = z_1^{r_1}, \ldots, \tilde Z_q = z_q^{r_q})$, and use $(k; t_k)$ to denote the margin formed by patients with $\tilde X_k = x_k^{t_k}$, and similarly $(j; r_j)$ to denote the margin formed by patients with $\tilde Z_j = z_j^{r_j}$. Then let
• $D_N$ be the difference between the numbers of patients in treatment groups 1 and 2 in total, that is, the number in group 1 minus the number in group 2;
• $D_N(k; t_k)$ and $D_N(j; r_j)$ be the differences between the numbers of patients in the two treatment groups on the margins $(k; t_k)$ and $(j; r_j)$, respectively;
• $D_N(t_1, t_2, \ldots, t_p, r_1, r_2, \ldots, r_q)$ be the difference between the numbers of patients in the two treatment groups within the stratum $(t_1, t_2, \ldots, t_p, r_1, r_2, \ldots, r_q)$.
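The three levels of imbalance defined above can be computed directly from the treatment assignments and the discretized covariates. The following is a minimal sketch; the function name `imbalances` and the representation of a covariate profile as a tuple of levels are illustrative choices, not notation from the article.

```python
from collections import defaultdict

def imbalances(assign, covariates):
    """Compute overall, marginal, and within-stratum differences.

    assign: list of 1 (treatment 1) or 0 (treatment 2).
    covariates: list of tuples of discrete covariate levels, one tuple per patient.
    Each difference is (number in group 1) minus (number in group 2).
    """
    overall = 0
    marginal = defaultdict(int)   # key: (covariate index, level)
    stratum = defaultdict(int)    # key: full covariate profile
    for a, cov in zip(assign, covariates):
        sign = 2 * a - 1          # +1 for treatment 1, -1 for treatment 2
        overall += sign
        stratum[tuple(cov)] += sign
        for k, level in enumerate(cov):
            marginal[(k, level)] += sign
    return overall, dict(marginal), dict(stratum)
```

For example, with assignments `[1, 0, 1, 1]` and profiles `[(0, 0), (0, 1), (1, 0), (1, 1)]`, the overall difference is 2, while the margin formed by the first covariate at level 0 is balanced.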
These differences play important roles in properties of statistical inference for covariate-adaptive designs (see Section 3 for details).

THEORETICAL PROPERTIES OF HYPOTHESIS TESTING
Two types of hypothesis testing will be considered in this section, one is comparing main treatment effects between two groups, and the other is testing significance of covariates. The testing hypotheses are performed based on the working model (2) when data are generated from the true model (1). Properties of this hypothesis testing are studied under both null hypothesis and alternative hypothesis. A test is said to be (asymptotically) conservative, if the true Type I error is smaller than the significance level under the null hypothesis.
First we consider the hypothesis tests of comparing treatment effects. We have the following main theorem.
Proof. To prove Theorem 3.1, the asymptotic properties are studied for both the numerator and the denominator of test statistic T. We first consider the numerator of test statistic (4), In the online supplementary materials it is proved that, if the conditions (A) and (B) are satisfied for a covariate-adaptive design, then Further, for the denominator of test statistic T, we could also show that, Then, it follows from Slutsky's Theorem that, under H 0 : The detailed proofs of Theorem 3.1 and two related lemmas are in the online supplementary materials.
In Theorem 3.1, theoretical properties of hypothesis testing for treatment effects are obtained under covariate-adaptive designs. In a covariate-adaptive randomized clinical trial, covariate information is incorporated in the design process to reduce imbalance at different levels (within-stratum, within-covariate-margin, and overall). Two mild conditions on covariate-adaptive designs are assumed to derive the asymptotic distribution of the test statistic for comparing treatment effects. Condition (A) states that the overall imbalance is bounded in probability, and condition (B) requires that the marginal imbalances are all bounded in probability. These conditions are satisfied by various covariate-adaptive designs (see, e.g., Corollary 3.1 and Theorem 3.3). Under these conditions, it can be shown in the proof that the numerator $\hat\mu_1 - \hat\mu_2$, that is, the estimate of the difference between treatment effects, has variance $4\sigma_\delta^2/N$, which is smaller than the model-based variance estimate $4\sigma_z^2/N$ in the denominator. The reduction in the variance of $\hat\mu_1 - \hat\mu_2$ can be attributed to balanced distributions of covariate profiles across treatment groups, which enhance the comparability of the treatment groups and reduce the variability of the estimated treatment effects. When a linear model that omits covariates is used for inference, the covariate-adaptive randomization scheme is ignored and an inflated model-based variance estimate is used, which results in a variance smaller than 1 in the asymptotic distribution under the null hypothesis and thus a conservative Type I error smaller than the nominal level. Furthermore, power performance can also be assessed based on the asymptotic distribution of the test statistic under the alternative hypothesis.
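To make the size of this conservativeness concrete, the sketch below computes the limiting Type I error of a two-sided normal test when the statistic is asymptotically normal with mean zero and variance equal to a ratio below 1. It assumes, following the discussion above, that this ratio is $\sigma_\delta^2/\sigma_z^2$; the function name and the example ratio of 0.5 are illustrative assumptions, not values from the article.

```python
import math
from statistics import NormalDist

def asymptotic_type1_error(var_ratio, alpha=0.05):
    """Limiting rejection rate of |T| > z_{1-alpha/2} when T ~ N(0, var_ratio).

    var_ratio is assumed to be the true variance of the numerator divided by
    the model-based variance estimate, so 0 < var_ratio <= 1.
    """
    if not 0 < var_ratio <= 1:
        raise ValueError("var_ratio must lie in (0, 1]")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    # P(|T| > z) with T = sqrt(var_ratio) * standard normal
    return 2 * (1 - std_normal.cdf(z / math.sqrt(var_ratio)))
```

When the ratio equals 1 the nominal level is recovered; when half of the variance is removed by balancing omitted covariates, the limiting Type I error falls well below 5%.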
Now consider the power of the hypothesis test (3). Under the alternative hypothesis, the power is The power of test (3) under complete randomization (under the same setting as described in Section 2) is From the power expressions for both covariate-adaptive designs and complete randomization above, we can conclude that the limiting power under covariate-adaptive designs could be smaller than that under complete randomization when δ is relatively small, and it is usually larger than complete randomization when δ is large. This conclusion agrees with some simulation studies about two-sample t-test in the literature (Forsythe 1987;Shao, Yu, and Zhong 2010) for some covariate-adaptive designs. Our conclusion is more general, which can be applied to linear models under a large family of covariate-adaptive designs.
The following theorem shows that hypothesis tests regarding significance of covariates can still achieve correct Type I error in covariate-adaptive designs, even though the power would be influenced if not all covariates are incorporated in the analysis model.
Theorem 3.2. Under the same conditions as in Theorem 3.1: Hence, the hypothesis testing (5) can achieve correct Type I error.
Proof. The test statistic can be written as We first have, under H 0 : and under H A : Further, by the weak law of large numbers and independence of covariates, Then it follows that, By defining $M = \mathrm{diag}(1/2, 1/2, \tilde M)$ and $C = [0_{m \times 2}, \tilde C]$, where $0_{m \times 2}$ is an $(m \times 2)$ matrix with all entries equal to zero, we have By the central limit theorem and the fact that $C M^{-1} C^T = \tilde C \tilde M^{-1} \tilde C^T$, we can prove the results. The detailed proofs of Theorem 3.2 are in the online supplementary materials.
From Theorems 3.1 and 3.2, it can be seen that the overall difference and the marginal differences play important roles in statistical inference for covariate-adaptive designs. There is no constraint on within-stratum imbalance. For the stratified permuted block design, the difference within any stratum is at most half the block size. Since the number of strata is finite, the overall and marginal differences are bounded by a constant, and thus conditions (A) and (B) are satisfied. Hu and Hu (2012) proposed a large class of covariate-adaptive designs, which satisfy conditions (A) and (B) under certain conditions. Here we summarize these results in the corollary.
Corollary 3.1. Both Theorems 3.1 and 3.2 hold under the following covariate-adaptive designs: (i) stratified permuted block designs; and (ii) the class of covariate-adaptive designs proposed by Hu and Hu (2012).
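The boundedness argument for the stratified permuted block design can be checked with a short sketch. The code below assigns patients within each stratum using randomly permuted blocks that contain equal numbers of the two treatments; the function name and the encoding of a stratum as a hashable key are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_permuted_block(strata, block_size=4, rng=None):
    """Assign treatments (1 or 0) by permuted blocks within each stratum.

    strata: list of hashable stratum labels, one per patient in arrival order.
    Each block holds block_size // 2 ones and block_size // 2 zeros.
    """
    if block_size < 2 or block_size % 2 != 0:
        raise ValueError("block_size must be a positive even number")
    rng = rng or random.Random()
    pending = defaultdict(list)  # unused assignments of the current block, per stratum
    assign = []
    for s in strata:
        if not pending[s]:
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            rng.shuffle(block)
            pending[s] = block
        assign.append(pending[s].pop())
    return assign
```

Because every completed block is balanced, the within-stratum difference never exceeds half the block size, and the overall difference is at most that bound times the number of strata, whatever the sample size.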
The theoretical properties of Pocock and Simon's marginal procedure have remained unknown for decades. The next theorem demonstrates that the marginal differences and the overall difference are bounded in probability for Pocock and Simon's marginal procedure.
Proof. Because here we consider only the problem of imbalances for covariates, we can assume that $q = 0$ and all covariates are discrete, for otherwise we can combine $(X_1, \ldots, X_p)$ and $(Z_1, \ldots, Z_q)$ together as a new covariate vector and consider $W_n$ instead. We denote $D_n = [D_n(k; t_k)]_{t_k = 1, \ldots, s_k^*;\, k = 1, \ldots, p}$ to be the collection of all marginal imbalances after $n$ assignments. Further, at the stage $n$, when the $n$th patient falls within margins $(1; t_1), \ldots, (p; t_p)$, define
$$\mathrm{Imb}_n^{(1)} = \sum_{k=1}^{p} w_k \left[ D_{n-1}(k; t_k) + 1 \right]^2 \quad \text{and} \quad \mathrm{Imb}_n^{(2)} = \sum_{k=1}^{p} w_k \left[ D_{n-1}(k; t_k) - 1 \right]^2,$$
which are the weighted "potential" imbalances that would be caused if the $n$th patient were assigned to treatment 1, 2, respectively. Here $w_k > 0$, $k = 1, \ldots, p$, are weights placed within a covariate margin. Then, under Pocock and Simon's randomization procedure, the probability of assigning the $n$th patient to treatment 1 is $q$ if $\mathrm{Imb}_n^{(1)} > \mathrm{Imb}_n^{(2)}$, $1 - q$ if $\mathrm{Imb}_n^{(1)} < \mathrm{Imb}_n^{(2)}$, and $1/2$ if $\mathrm{Imb}_n^{(1)} = \mathrm{Imb}_n^{(2)}$, where $0 < q < 1/2$. By defining $\Lambda_{n-1}(t) = \sum_{k=1}^{p} w_k D_{n-1}(k; t_k)$ and a norm-like function of $D_n$ by $V(D_n) = \sum_{k=1}^{p} \sum_{t_k=1}^{s_k^*} w_k D_n^2(k; t_k)$, we can verify that $(D_n)_{n \geq 1}$ is a Markov chain with So, there exist a bounded set C in R p and a constant b such that The above drift condition implies that the sequence $(D_n)_{n \geq 1}$ is a positive (
Remark. If we consider complete randomization as a special case of covariate-adaptive design, it does not satisfy the condition on the overall imbalance in Theorem 3.1, because the overall imbalance $D_N = \sum_{i=1}^{N} (2I_i - 1) = O_p(N^{1/2})$ by the independence of the $I_i$ and the central limit theorem. The numerical study in the next section shows that the test under complete randomization is not conservative.
Based on Theorem 3.2, we can see that hypothesis testing of covariates is still valid in the sense of Type I error under all covariate-adaptive designs. The regression method can be directly used to test significance of prognostic factors with a working model only containing partial covariate information. On the other hand, power will decrease by omitting influential covariates in the working model. Consider the noncentral parameter in (12): it would increase with σ 2 z reduced, so power would increase with more influential covariates in the model. Therefore, it is helpful to incorporate more influential covariates, if possible, to obtain a more powerful test.
Hence, the hypothesis testing (7) can achieve correct Type I error. (ii) Under $H_A: \beta_1 \neq 0$, consider a sequence of local alternatives, that is, $\beta_1 = \delta_{\beta_1}/\sqrt{N}$ for a fixed $\delta_{\beta_1}$, then where $\sigma_1^2 = \mathrm{var}(X_1)$. Hence, the power will increase as more covariates are incorporated into the model. Corollary 3.2 gives an important special case of testing covariates, where only a single coefficient is considered.
According to Theorems 3.1 and 3.2, we should select a model with only influential covariates. It is known that including too many unnecessary variables in the model increases the variance of the estimates and affects the statistical results. Hence, if only influential variables are incorporated in the model, it will not only reduce unnecessary variation, but will also give valid inference. Some numerical studies on model selection are performed in Section 4.

NUMERICAL STUDY
Case 1: Testing Treatment Effects. First we consider simulations to study the Type I error of hypothesis testing for comparing treatment effects under three designs: Pocock and Simon's marginal procedure, stratified permuted block design, and complete randomization. For each type of design, both a continuous case and a discrete case are considered. The following linear model (including two covariates $Z_1$ and $Z_2$) is assumed for the responses $Y_i$,
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \beta_1 Z_{i,1} + \beta_2 Z_{i,2} + \varepsilon_i,$$
where $\varepsilon_i$ is distributed as $N(0, 1)$ and $\beta_1 = \beta_2 = 1$. No difference in treatment effects is assumed to study Type I error, that is, $\mu_1 = \mu_2$. For the discrete case, $Z_1$ follows Bernoulli($p_1$) and $Z_2$ follows Bernoulli($p_2$); for the continuous case, both $Z_1$ and $Z_2$ follow the normal distribution $N(0, 1)$. If the covariates $Z_1$ and $Z_2$ are continuous, they are discretized into Bernoulli variables $\tilde Z_1$ and $\tilde Z_2$ with the probabilities $p_1$ and $p_2$ to be used in randomization. More specifically, if $Z_1 < Z_{(p_1)}$, where $Z_{(p_1)}$ is the $p_1$ quantile of the standard normal distribution, then $\tilde Z_1 = 0$; otherwise $\tilde Z_1 = 1$. The original variables (without discretization) are used in the statistical inference procedures.
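The discretization rule for continuous covariates described above can be sketched in a few lines. The function name `discretize` is a hypothetical helper; the rule itself (0 below the $p$ quantile of the standard normal distribution, 1 otherwise) follows the text.

```python
from statistics import NormalDist

def discretize(values, p):
    """Map each continuous value to 0 if it lies below the p quantile of N(0, 1), else 1."""
    cutoff = NormalDist().inv_cdf(p)
    return [0 if v < cutoff else 1 for v in values]
```

The discretized values would be passed to the randomization procedure, while the original continuous values are kept for the analysis model.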
To carry out simulations, the biased coin probability 0.75 and equal weights are used for Pocock and Simon's marginal procedure, and the block size 4 is used for stratified permuted block design. The significance level is α = 0.05 and sample size N is 100, 200, or 500. The hypothesis tests include the two-sample t-test (t-test), the linear model with a single covariate Z 1 (lm(Z 1 )), the linear model with a single covariate Z 2 (lm(Z 2 )), and the linear model with both covariates Z 1 and Z 2 (lm(Z 1 , Z 2 )). By choosing (p 1 , p 2 ) = (0.5, 0.5), the simulation results for Pocock and Simon's marginal procedure, stratified permuted block design and complete randomization are demonstrated in Table 1.
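A minimal sketch of a Pocock and Simon type marginal procedure with a biased coin is given below. It assumes that the potential imbalance for each treatment is the weighted sum of squared marginal differences after a hypothetical assignment, which is one common choice; the function name, argument names, and this particular imbalance measure are assumptions for illustration and may differ in detail from the implementation used in the simulations.

```python
import random
from collections import defaultdict

def pocock_simon(covariates, weights=None, p_favored=0.75, rng=None):
    """Sequentially assign treatments (1 or 0) by marginal minimization.

    covariates: list of tuples of discrete levels, one tuple per patient.
    weights: one positive weight per covariate (equal weights by default).
    p_favored: probability of choosing the arm with the smaller potential imbalance.
    """
    if not 0.5 < p_favored <= 1.0:
        raise ValueError("p_favored must lie in (0.5, 1]")
    rng = rng or random.Random()
    if not covariates:
        return []
    weights = weights or [1.0] * len(covariates[0])
    margin_diff = defaultdict(int)  # D(k; t_k): group 1 count minus group 2 count
    assign = []
    for cov in covariates:
        # Weighted potential imbalance if this patient went to treatment 1 or 2.
        imb1 = sum(w * (margin_diff[(k, t)] + 1) ** 2
                   for k, (w, t) in enumerate(zip(weights, cov)))
        imb2 = sum(w * (margin_diff[(k, t)] - 1) ** 2
                   for k, (w, t) in enumerate(zip(weights, cov)))
        if imb1 > imb2:
            p1 = 1.0 - p_favored
        elif imb1 < imb2:
            p1 = p_favored
        else:
            p1 = 0.5
        a = 1 if rng.random() < p1 else 0
        assign.append(a)
        for k, t in enumerate(cov):
            margin_diff[(k, t)] += 2 * a - 1
    return assign
```

Setting `p_favored=0.75` with equal weights corresponds to the settings quoted above; setting it to 1 gives a deterministic minimization rule apart from ties.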
In each simulation, the Type I error under covariate-adaptive randomization is also examined with the bootstrap t-test described in Shao, Yu, and Zhong (2010). To do the test, bootstrap samples $(Y_1^{*b}, Z_{1,1}^{*b}, Z_{1,2}^{*b}), \ldots, (Y_N^{*b}, Z_{N,1}^{*b}, Z_{N,2}^{*b})$, $b = 1, 2, \ldots, B$, are generated independently as simple random samples with replacement from $(Y_1, Z_{1,1}, Z_{1,2}), \ldots, (Y_N, Z_{N,1}, Z_{N,2})$. The covariate-adaptive procedure used on the original data is then applied to the covariates of each bootstrap sample, from which the bootstrap analogues of the treatment assignments, $I_1^{*b}, \ldots, I_N^{*b}$, can be obtained.
The bootstrap estimator of the variance of $\bar Y_1 - \bar Y_2$ is then the sample variance of $\hat\theta^*(b)$, $b = 1, 2, \ldots, B$, denoted by $\hat v_B$, where $\hat\theta^*(b)$ is the bootstrap analogue of $\bar Y_1 - \bar Y_2$ computed from the $b$th bootstrap sample and its treatment assignments. Then the bootstrap t-test has the form $(\bar Y_1 - \bar Y_2)/\sqrt{\hat v_B}$. Shao, Yu, and Zhong (2010) showed that the bootstrap t-test can maintain the nominal Type I error under the covariate-adaptive biased coin design. B = 500 is used in all of the following simulations.
Several conclusions can be drawn from Table 1. First, the Type I error is close to 5% under the full model lm(Z 1 , Z 2 ). This coincides with the theoretical results in Section 3, when no randomization covariate is omitted in the construction of the final analysis model. Second, under both Pocock and Simon's marginal procedure and the stratified permuted block design, the two-sample t-test, lm(Z 1 ), and lm(Z 2 ) are all conservative. Among these three tests, the two-sample t-test is the most conservative, with the smallest Type I error. Furthermore, the Type I error of the bootstrap t-test (BS-t) is close to the nominal level of 5% under both Pocock and Simon's marginal procedure and the stratified permuted block design. Under complete randomization, the Type I error is close to 5% for all four tests. We also tried different values of (p 1 , p 2 ); similar results were obtained and are not shown here.
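The bootstrap steps above can be sketched as follows. The code resamples (response, covariates) pairs with replacement, re-runs a user-supplied randomization procedure on the resampled covariates, and uses the sample variance of the resulting differences in group means as the variance estimate. The function name `bootstrap_t`, the `randomize` callback, and the choice to skip replicates in which one arm is empty are assumptions made for illustration.

```python
import math
import random
import statistics

def bootstrap_t(y, covs, assign, randomize, n_boot=500, rng=None):
    """Bootstrap t-type statistic for comparing two treatment means.

    randomize: function taking a list of covariate tuples and returning a list
    of assignments (1 or 0); it stands in for the covariate-adaptive procedure.
    """
    rng = rng or random.Random()
    n = len(y)

    def mean_diff(ys, a):
        g1 = [v for v, ai in zip(ys, a) if ai == 1]
        g2 = [v for v, ai in zip(ys, a) if ai == 0]
        if not g1 or not g2:
            return None  # difference undefined when an arm is empty
        return statistics.fmean(g1) - statistics.fmean(g2)

    observed = mean_diff(y, assign)
    if observed is None:
        raise ValueError("both treatment groups must be non-empty")
    thetas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        y_star = [y[i] for i in idx]
        cov_star = [covs[i] for i in idx]
        a_star = randomize(cov_star)                 # re-run the randomization
        theta = mean_diff(y_star, a_star)
        if theta is not None:                        # assumption: skip degenerate replicates
            thetas.append(theta)
    if len(thetas) < 2:
        raise ValueError("too few valid bootstrap replicates")
    v_hat = statistics.variance(thetas)              # bootstrap variance estimate
    return observed / math.sqrt(v_hat)
```

The returned value would be compared with a standard normal quantile, for example rejecting at the 5% level when its absolute value exceeds about 1.96.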
Case 2: Power Comparison. Now we compare power for different hypothesis testing methods under Pocock and Simon's marginal procedure and complete randomization. The same model as in Case 1 is used, except that a difference exists between the treatment effects μ 1 and μ 2 , that is, μ 1 − μ 2 ≠ 0. Sample size N = 100 and (p 1 , p 2 ) = (0.5, 0.5) are used in simulations. Several hypothesis testing methods for treatment effects are compared under covariate-adaptive randomization. All the power results are given in Figure 1, from which several conclusions can be drawn. The two-sample t-test is less powerful than (lm(Z 1 )) and (lm(Z 2 )), and all three of these methods are less powerful than (lm(Z 1 , Z 2 )) under Pocock and Simon's marginal procedure. The bootstrap t-test has power similar to that of the full model with correct model specification (lm(Z 1 , Z 2 )) when covariates are discrete. However, the bootstrap t-test is less powerful than (lm(Z 1 , Z 2 )) if covariates are continuous. Furthermore, the power of each test can also be compared between covariate-adaptive randomization and complete randomization. For example, the two-sample t-test has smaller power under Pocock and Simon's marginal procedure than under complete randomization when |μ 1 − μ 2 | is relatively small, due to conservativeness, but has larger power when |μ 1 − μ 2 | becomes larger. We also tried stratified permuted block randomization (not reported here); all tests considered have power similar to that under Pocock and Simon's marginal procedure. Based on Table 1 and Figure 1, we can see that the theoretical properties in Theorem 3.1 hold for sample sizes around 100. We further consider the power comparison for small sample sizes of 32 and 64. The simulated power is reported in Table 2 under the same model as in Case 1. In Pocock and Simon's marginal procedure, the probability of biased coin assignment is 0.8 and equal weights are assigned to the two covariates.
The hypothesis testing is based on the model (lm(Z 1 , Z 2 )) with both covariates Z 1 and Z 2 ; see Table 2.
Case 3. The hypothesis testing (7) for significance of covariates is studied under Pocock and Simon's marginal procedure and complete randomization. Under the same model as in Case 1, we set β 1 = 0, β 2 = 1 and μ 1 = μ 2 . To run simulations, the biased coin probability 0.75 and equal weights are used for Pocock and Simon's marginal procedure. The significance level is α = 0.05 and the sample size N is 100, 200, or 500. Hypothesis testing of the significance of β 1 is performed based on the two models with or without Z 2 , that is, lm(Z 1 , Z 2 ) and lm(Z 1 ). The Type I error results are given in Table 3, from which it can be seen that the tests of β 1 are valid in terms of Type I error for both Pocock and Simon's marginal procedure and complete randomization. Figure 2 reports the power of testing β 1 under the two randomization methods, which indicates that the model lm(Z 1 , Z 2 ) with Z 2 in the analysis is more powerful than lm(Z 1 ). This agrees with Theorem 3.2.
Case 4: Model Selection. So far, all simulation results are based on linear models with up to two covariates. Now we consider a linear model with more than two covariates. In this situation, variable selection techniques can be applied to select a subset of covariates that influence the outcomes, to be used in inference procedures. Suppose the outcomes $Y_i$ follow the following model with five covariates,
$$Y_i = \mu_1 I_i + \mu_2 (1 - I_i) + \beta_1 X_{i,1} + X_{i,2} \beta_2^T + \beta_3 X_{i,3} + Z_{i,1} \beta_4^T + \beta_5 Z_{i,2} + \varepsilon_i,$$
where $X_1$ is a binary covariate taking the values 0 and 1 with probability 0.5 each, $X_2$ is a discrete variable with four possible values (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1) with equal probabilities 0.25, $Z_1$ is a discrete covariate with three possible values (0, 0), (0, 1), and (1, 0) with equal probabilities 1/3, and $X_3$ and $Z_2$ are standard normal variables, which are discretized to Bernoulli variables with probability 0.5 for the randomization process. It is also assumed that $\beta_1 = 3$, $\beta_2 = (2, 3, 4)$, $\beta_3 = 2$, $\beta_4 = (0, 0)$, and $\beta_5 = 0$, so that only $X_1$, $X_2$, and $X_3$ have effects on the outcome $Y$. The error $\varepsilon_i$ is normally distributed with a mean of 0 and a standard deviation of 2. Here only Pocock and Simon's marginal procedure is considered, since there would be too many strata (96 in total) if the stratified permuted block design were implemented.
To correct the conservative Type I error, one way is to incorporate all randomization covariates into the analysis model (lm5), so that no information used in randomization is lost in hypothesis testing. A more efficient approach according to Theorem 3.1 is to incorporate only the covariates that are influential on outcomes, that is, constructing the model with X 1 , X 2 , and X 3 (lm3). In Table 4, model selection with BIC by a stepwise algorithm is used to select the analysis model for statistical inference. The algorithm is realized with the "stepAIC" function in R by specifying k = log(N ). Stepwise selection can be implemented backward, forward, or in both directions. Here we report only the results for backward selection (backward). Similar results are obtained from the other two stepwise selection methods (not shown). These methods automatically arrive at a final model from a set of candidate models with different combinations of covariates, based on which treatment effects can be compared. The results of Type I error of lm3, lm5, and backward are given in Table 4. The results of the two-sample t-test (t-test) and the bootstrap t-test (BS-t) are also included. Power comparison results are shown in Figure 3. From the results of Table 4 and Figure 3, hypothesis testing for treatment effects based on conventional stepwise model selection techniques has valid Type I error and has power similar to that of hypothesis testing with all randomization covariates used in the analysis, as long as the model selection techniques are able to identify the subset of all influential covariates. It is worth pointing out that the bootstrap t-test is less powerful than lm3, lm5, and backward based on Figure 3.
Case 5: Model Misspecification. In Section 2, the underlying response model (1) is assumed to have a form of linearly additive covariate effects in addition to treatment effects.
Based on the model, theoretical properties of hypothesis testing are studied under covariate-adaptive designs. Theorem 3.1 shows that valid Type I error can be obtained by using a linear model incorporating all randomization covariates (the full model). This is within expectation since the full model has the correct model specification that coincides with the assumed model. However, the underlying response model is often unknown in practice and it is possible that covariate effects are not linearly additive on responses. For example, a covariate may have a nonlinear form or have interaction effect with other covariates. Under these scenarios, the full model with all randomization covariates in a linearly additive pattern no longer has the correct model specification. In this section, a nonlinear response model with covariate effects in an exponential function is assumed to investigate properties of hypothesis testing for comparing treatment effects. Other situations of model misspecification can be studied similarly and will be left for future research projects. The following model is assumed for the responses Y. Two covariates Z 1 and Z 2 have a nonlinear effect on the response Y in addition to treatment effects.
where $\varepsilon_i$ is distributed as $N(0, 4)$. $Z_1$ and $Z_2$ follow the normal distribution $N(0, 1)$ and are discretized into Bernoulli variables with probability 0.5 for use in the randomization procedures. Three randomization methods are investigated: Pocock and Simon's marginal procedure, stratified permuted block design, and complete randomization. The two-sample t-test (t-test), the linear model incorporating both covariates (lm($Z_1$, $Z_2$)), and the bootstrap t-test (BS-t) are used to compare treatment effects. The results for Type I error and power are given in Tables 5 and 6, respectively. From Table 5, the two-sample t-test is conservative under covariate-adaptive designs. Because of model misspecification, the Type I error of the linear model (lm($Z_1$, $Z_2$)) deviates from the nominal level of 5%, and the deviation is particularly pronounced under the stratified permuted block design. The bootstrap t-test, however, tends to be robust under covariate-adaptive designs, with Type I error closer to 5%. Under complete randomization, the Type I error should be close to 5% as the sample size increases, by the central limit theorem.

Table 6. Simulated power under model (15) for Pocock and Simon's marginal procedure (PS), stratified permuted block design (SPB), and complete randomization (CR); simulation based on 10,000 runs and number of patients N = 100

Regarding the power comparison in Table 6, the model (lm($Z_1$, $Z_2$)) performs better than the two-sample t-test under all three designs because it uses more covariate information. In addition, the power of the model (lm($Z_1$, $Z_2$)) is slightly larger than that of the bootstrap t-test under covariate-adaptive designs.
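The conservativeness of the unadjusted two-sample t-test under covariate-adaptive designs, reported in Table 5, can be illustrated with a small Monte Carlo sketch. The example does not reproduce model (15). It assumes, purely for illustration, a linear response with two binary covariates each having coefficient 3, unit error variance, no treatment effect, block size 4, N = 100, and 1,000 replications; the function names are hypothetical.

```python
import numpy as np

T_CRIT = 1.984  # approximate two-sided 5% critical value of t with 98 df

def complete_randomization(z, rng):
    """Assign each patient to arm 0 or 1 by a fair coin."""
    return rng.integers(0, 2, size=len(z))

def stratified_permuted_block(z, rng, block=4):
    """Permuted blocks of the given size within each of the 4 strata of (z1, z2)."""
    strata = z[:, 0] * 2 + z[:, 1]
    assign = np.empty(len(z), dtype=int)
    template = np.array([1] * (block // 2) + [0] * (block // 2))
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        n_blocks = -(-len(idx) // block)  # ceiling division
        seq = np.concatenate([rng.permutation(template) for _ in range(n_blocks)])
        assign[idx] = seq[:len(idx)]      # last block may be incomplete
    return assign

def t_statistic(y, a):
    """Pooled-variance two-sample t statistic comparing arm 1 with arm 0."""
    y1, y0 = y[a == 1], y[a == 0]
    n1, n0 = len(y1), len(y0)
    sp2 = ((n1 - 1) * y1.var(ddof=1) + (n0 - 1) * y0.var(ddof=1)) / (n1 + n0 - 2)
    return (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n0))

def type_one_error(design, n=100, reps=1000, beta=3.0, seed=0):
    """Rejection rate of the t-test when there is no treatment effect."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        z = rng.integers(0, 2, size=(n, 2))
        a = design(z, rng)
        y = beta * z.sum(axis=1) + rng.normal(0.0, 1.0, size=n)
        rejections += abs(t_statistic(y, a)) > T_CRIT
    return rejections / reps

rate_cr = type_one_error(complete_randomization)
rate_spb = type_one_error(stratified_permuted_block)
print("complete randomization:", rate_cr)
print("stratified permuted block:", rate_spb)
```

Under these assumptions the stratified design balances the influential covariates across arms, so the difference in means varies less than the pooled variance estimate implies, and the rejection rate falls below the nominal level.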

DISCUSSION
In this article, we studied the theoretical properties of hypothesis testing under general covariate-adaptive designs based on a linear model framework. We derived the corresponding asymptotic distributions of the test statistics under both null and alternative hypotheses. Instead of focusing on a specific covariate-adaptive design, we studied the problem in terms of imbalance measures at different levels (overall, marginal, within-stratum). The conclusions therefore apply to a broad range of designs, including stratified permuted block design and Pocock and Simon's procedure. For example, to apply Theorems 3.1 and 3.2 to a specific covariate-adaptive randomized clinical trial, one need only check conditions (A) and (B). The results in this article provide new insights into the balance and efficiency of clinical trials, and the framework can be used to study properties of other statistical methods under covariate-adaptive designs.
We studied hypothesis testing within a linear regression framework, in which the patient response is a continuous variable. In practice, the endpoints of clinical trials may be variables of other types. For instance, a clinical trial can be designed to compare success rates between a new medication and a standard treatment, in which case the response is a binary variable. In the literature, Feinstein and Landis (1976) and Green and Byar (1978) studied the properties of unadjusted tests under stratified randomization for binary responses, and concluded that the Type I error decreases when stratified randomization is used rather than unstratified randomization. The results for other types of covariate-adaptive design, including Pocock and Simon's procedure, remain unknown. As future research, we will study the behavior of conventional statistical inference for different types of response based on generalized linear models, survival models, and so on. There are other kinds of covariate-adaptive randomization procedures in the literature, including Zelen (1974), Wei (1978), Begg and Iglewicz (1980), and Atkinson (1982). Theorems 3.1 and 3.2 may not apply to these designs, because it is unknown whether conditions (A) and (B) hold for them.
Based on Theorems 3.1 and 3.2, incorporating all randomization covariates in the final analysis model achieves valid hypothesis testing for treatment effects and covariates. In practice, model selection techniques can also be used, as in Case 4 of Section 4, to target the subset of all influential covariates, based on which valid tests can be obtained. However, sometimes not all important randomization covariates are used in the inference step for a covariate-adaptive design. The actual Type I error may then differ from the nominal level when comparing treatment effects, and an adjustment is necessary to achieve a valid test. If no covariate is incorporated in the final analysis model, a bootstrap t-test as described in Shao, Yu, and Zhong (2010) can be implemented to restore the Type I error, and this method is shown to be more powerful than the conventional two-sample t-test under covariate-adaptive designs. Similar bootstrap adjustment methods can be considered to correct the variance estimation and to make $\tau = 1$ in Theorem 3.1. We leave this as a future research project.
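The model selection route mentioned above (Case 4 of Section 4) can be sketched as backward elimination with a BIC criterion. This is an illustrative stand-in for the R "stepAIC" call with $k = \log(N)$, not a reimplementation of it. The coefficients and error standard deviation are those of Case 4, while the function names, the sample size of 200, the intercept, and the use of complete randomization for the treatment indicator are assumptions made to keep the example short.

```python
import numpy as np

def bic(y, columns):
    """BIC (up to an additive constant) of an OLS fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    n = len(y)
    return n * np.log(resid @ resid / n) + X.shape[1] * np.log(n)

def backward_bic(y, treatment, terms):
    """Backward elimination over covariate terms; the treatment indicator is always kept.

    terms maps a covariate name to its design columns (1-D or 2-D array),
    so a multi-level factor is kept or dropped as a whole.
    """
    kept = list(terms)

    def score(names):
        return bic(y, [treatment] + [terms[t] for t in names])

    best = score(kept)
    while kept:
        trials = {t: score([u for u in kept if u != t]) for t in kept}
        drop = min(trials, key=trials.get)
        if trials[drop] >= best:  # no single deletion lowers the BIC
            break
        kept.remove(drop)
        best = trials[drop]
    return kept

rng = np.random.default_rng(0)
n = 200                                    # assumed sample size
treatment = rng.integers(0, 2, n)          # assumed: complete randomization
x1 = rng.integers(0, 2, n)
x2 = np.eye(4)[rng.integers(0, 4, n)][:, 1:]   # three dummy columns
z1 = np.eye(3)[rng.integers(0, 3, n)][:, 1:]   # two dummy columns
x3 = rng.normal(size=n)
z2 = rng.normal(size=n)

# Case 4 coefficients: only X1, X2, and X3 affect the outcome; no treatment effect.
y = 3 * x1 + x2 @ np.array([2.0, 3.0, 4.0]) + 2 * x3 + rng.normal(0, 2, n)

selected = backward_bic(y, treatment, {"X1": x1, "X2": x2, "X3": x3, "Z1": z1, "Z2": z2})
print("selected covariates:", selected)
```

Treatment effects would then be tested in the linear model containing the treatment indicator and the selected covariates.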
The proposed properties of hypothesis testing for covariate-adaptive designs can be generalized in several ways. First, all covariate-adaptive designs considered in this article are based on discrete covariates. We may consider covariate-adaptive designs (Lin and Su 2012; Ma and Hu 2013) that directly use continuous covariates without discretization. However, related theoretical work is limited in the literature. Second, one important assumption used to derive the theoretical results is independence between covariates. A similar idea may be applied to dependent covariates by incorporating their correlation structure. Third, the proposed properties are based on clinical trials with two treatments, and can be generalized to multiple treatments (Tymofyeyev, Rosenberger, and Hu 2007). These topics are left for future research.

SUPPLEMENTARY MATERIALS
The supplementary materials contain detailed proofs of all theorems in the main article and intermediate lemmas. [Received September 2013. Revised April 2014.]