Tests for homogeneity of risk differences in stratified design with correlated bilateral data

ABSTRACT Correlated bilateral data arise from stratified studies involving paired body organs in a subject. When it is desirable to conduct inference on the scale of risk difference, one needs first to assess the assumption of homogeneity in risk differences across strata. For testing homogeneity of risk differences, we herein propose eight methods derived respectively from weighted-least-squares (WLS), the Mantel-Haenszel (MH) estimator, the WLS method in combination with inverse hyperbolic tangent transformation, and the test statistics based on their log-transformation, the modified Score test statistic and Likelihood ratio test statistic. Simulation results showed that four of the tests perform well in general, with the tests based on the WLS method and inverse hyperbolic tangent transformation always performing satisfactorily even under small sample size designs. The methods are illustrated with a dataset.


Introduction
Bilateral data can arise from studies involving paired body organs or body parts in the same subjects. The two data points within a subject are usually correlated, which should be accounted for in the design and analysis. Stratification is usually adopted to control for potential confounding effects and/or to improve statistical efficiency. A sensible method for the analysis of such data is to obtain a summary effect measure across strata, assuming the homogeneity of the chosen effect measure across strata. Thus, assessment of homogeneity becomes a necessary step in statistical analysis.
As an example, consider a double-blinded randomized clinical trial by Mandel et al. [8]. The objective of the study was to assess effects of two antibiotics, i.e. cefaclor and amoxicillin, for the treatment of otitis media with effusion(OME). A total of 214 children were randomized to one of the two antibiotics. At the end of 14-day treatment period, the number of cured ears in each child was recorded. The number of OME-free ears equals 2 if both ears are cured, equals 1 if only one ear is cured, and equals 0 if neither is cured. As age is  0  8  1 1  6  3  0  1  1  2  2  6  1  1  0  2  8  2  1 0  5  3  6   Total  18  15  22  9  4  7 an important factor for the outcome, stratified analysis may be appropriate. Data from 75 children with bilateral disease are stratified by age group and presented in Table 1.
Rosner [11] considered bilateral data without stratification and proposed two chi-square based on a dependent and an independent models for assessing whether the proportion of affected ears (or eyes) is the same among different treatment groups. Simulation results show that the chi-square test based on the dependent model perform well for highlycorrelated observations obtained from two ears of an individual. Tang et al. [14] proposed eight procedures for testing the equivalence of risk difference in such correlated data, assuming the risk differences are homogeneous among strata. Tang et al. [13] investigated various confidence interval estimators for risk difference based on Wald-type statistics, Fieller theorem for ratios, likelihood ratio statistic, score statistics and bootstrap resampling method. Sample size formulas for testing difference between two proportions for bilateral studies with binary outcomes has been considered in Qiu et al. [10].
For the stratified correlated bilateral data, Pei et al. [9] considered ten homogeneity test procedures for risk ratios and derived three sample size formulas based on three tests that performed well. Simulation results show that four test procedures based on bootstrapresampling generally produce empirical type I errors close to the pre-specified significant level when sample sizes are not large. Although the four test procedures can usually well control their type I errors around the pre-chosen nominal level under moderate sample sizes, it could be very time-consuming due to the resampling procedure.
Despite risk difference is regarded as a preferred effect measure due to its ease of interpretation, test of homogeneity in risk differences has rarely been considered, with exception of Tang and Qiu [12] who developed a modified Score test. A disadvantage of this test is that it requires an iterative algorithm to obtain the common risk difference.
The primary objective of this article is to propose reliable test procedures for testing homogeneity of risk differences for stratified correlated bilateral data. Specifically, we propose and evaluate eight new tests for the homogeneity of risk differences for the design of studies involving correlated bilateral data, i.e. the test statistic derived from the weighted-least-squares (WLS) method, the test statistic based on the Mantel-Haenszel (MH) estimator, the test statistic based on the WLS method and tanh −1 (x) transformation, and the test statistics based on their log-transformation, the new modified Score test statistic and Likelihood ratio test statistic. Sample size formulae based on recommended tests are derived to achieve the pre-specified power at a given significance level.
The rest of the article is organized as follows. Section 2 describes data structure and the statistical model. Eight new tests for homogeneity of risk differences across strata are developed and assessed in Section 3. Sample size formulae are derived and evaluated on the basis of the proposed homogeneity tests in Section 4. Section 5 illustrates the methods with the data mentioned above. The article concludes in Section 6.

Data structure and the statistical model
Let m hij be the number of subjects having h ears improved in the ith treatment within the jth stratum, and p hij be the corresponding probability, i = 0,1, h = 0,1,2 and j = 1, 2, . . . , J. Let m +ij = 2 h=0 m hij , m ++j = 1 i=0 m +ij , z ilkj = 1 (i = 0, 1, l = 1, . . . , m +ij , k = 1, 2, j = 1, 2, . . . , J) if the kth ear of the lth individual in the ith treatment group from the jth stratum showed improvement, and 0 otherwise. In this article, we only consider the case of each patient having two observations. Following Rosner [11], assume that the probability of a response at one side given a response at the other side was proportional to the success rate, and Pr[z ilkj = 1] = λ ij , then It is easily shown that the correlation between z il1j and z il2j is represents a measurement of dependence between two ears of the same child from the jth stratum. Specifically, if R j = 1, the two ears are completely independent; whilst if R j λ ij = 1, the ears are completely dependent. Then, we have ij for i = 0,1 and j = 1, 2, . . . , J. Under the above notations, the frequencies and probabilities for the jth stratum in the stratified bilateral data design can be summarized into Table 2. Let The log-likelihood function of the parameters λ 0 , λ 1 and R under the observed data m = {(m 00j , m 10j , m 20j , m 01j , m 11j , m 21j ): j = 1, 2, . . . , J} is given by where C is a constant that does not involve the unknown parameters. Let δ j = λ 0j − λ 1j (j = 1, 2, . . . , J) be the risk difference between two treatments in the jth stratum. We are interested in testing the homogeneity of the risk differences across strata, i.e. H 0 : δ 1 = δ 2 = · · · = δ J = δ.
Under the assumption δ j = δ, the log-likelihood function becomes is the log-likelihood function in the jth stratum.

Tests for the homogeneity of risk differences across strata
In this section, we consider eight test statistics for the homogeneity testing: As shown by Rosner [11], Tang et al. [13] and Tang et al. [14], statistical methods that ignore the presence of intraclass correlation could lead to inflated significance levels. Thus, we derive the tests under the dependent model only.

Test statistic derived from the weighted-least-squares (WLS) method
Note that under the independence model (i.e. R j = 1 for j = 1, 2, . . . , J), the sample estimate of 1). In this article, we only consider the sample estimateλ ij since R j is unknown. Hence, the estimate of δ j can be given bŷ and the variance estimate ofλ ij can be given bŷ for i = 0, 1, j = 1, . . . , J. Assume that data obtained from two treatment groups in the same stratum are independent, then the estimate of Var(δ j ) can be obtained byσ 2 ij . Therefore, following Fleiss et al. [3] and Lui [6] (Section 2.2.2), the weighted least square (WLS) estimator for δ can be given byδ wls = J j=1ŵ jδj / J j=1ŵ j , whereŵ j = 1/σ 2 j . Thus, the weighted-least-square test statistic is given by

Test statistic based on the Mantel-Haenszel (MH) estimator
We now apply the Mantel-Haenszel estimator for a common risk difference (Greenland and Robins [4]) given bŷ with variance estimator given bȳ Similarly, the WLS estimator for δ is given byδ wls = J j=1w jδj / J j=1w j withw j = 1/σ 2 j . Therefore, this leads us to consider a weighted least square test statistic with the Mantel-Haenszel estimator, which is given by Following Lui & Chang [7], T mh is asymptotically distributed as chi-square distribution with J−1 degrees of freedom when all m +ij (i = 0, 1; j = 1, 2, . . . , J) are sufficiently large.

Log-transformation of test statistics T wls , T mh and T twls
Following Fisher [2], to improve the normal approximation of a chi-square distribution, a logarithmic transformation can be used to the weighted-least-square test statistic T wls . Therefore, the test statistic is given by which is asymptotically distributed as a standard normal distribution. We can also apply the log-transformation on the test statistics T mh and T twls , which are denoted as T lmh and T ltwls , respectively. Obviously, they are also asymptotically distributed as standard normal when all m ++j → ∞ (j = 1, 2, . . . , J).

The modified score test statistic
Based on the dependence model, the constrained maximum likelihood estimateδ,λ 1j and R j of parameters δ, λ 1j and R j under H 0 satisfy the following equations: where l(m; δ, λ 1 , R) is given by (2). Applying the theory of Rao's score test results in a test given by where S j is the score function which is given by (A.1) and I j (1, 1) is given by (A.2) (see Web Appendix A), respectively. However, it is rather difficult to use the above statistic to test H 0 because no closed forms exist for the MLEs of δ, λ 1j and R j for j = 1, 2, . . . , J. We propose the following modified Score test statistic: whereδ * = J j=1 ω jδj / J j=1 ω j is a consistent but non-iterative estimator of δ with the weight ω j = 1/(1/m +0j + 1/m +1j ) andδ j is given by Equation (4) (j = 1, 2, . . . , J). Based on the consistent but non-iterative estimatorδ * ,λ * 1j andR * j are obtained by solving the following equations: for j = 1, 2, . . . , J. Closed form solutions to the estimatesλ * 1j andR * j are not available, we use an iterative algorithm to find the solutions. Following Tang et al. [13], Fisher scoring method can be used to compute the estimatesλ * 1j andR * j for j = 1, 2, . . . , J. Following Tarone [16], T * sc is asymptotically distributed as chi-square distribution with J−1 degrees of freedom when all m +ij (i = 0, 1, j = 1, 2, . . . , J) are sufficiently large.

Performance of tests for homogeneity
The tests above are derived by assuming large sample size. We now investigate the performance of the various homogeneity test procedures via simulation studies. We first considered balanced (i.e. 3 (values of δ)× 2 (values of (λ 11 , λ 12 , λ 13 )) × 5 (values of (ρ 1 , ρ 2 , ρ 3 )) = 30 parameter combination settings. For each configuration and each stratum, K = 5000 random samples were generated from the product of J (J = 2 and 3) independent trinomial distributions. That is, (m 00j , m 10j , m 20j ) was generated from the trinomial distribution with total size m +0j and cell probabilities (p 00j , p 10j , p 20j ), and then (m 01j , m 11j , m 21j ) was generated from the trinomial distribution with total size m +1j and cell probabilities (p 01j , p 11j , p 21j ) for j = 1, . . . , J under the model with J = 2 and the model with J = 3, respectively. Table 3 an 4 show the empirical sizes of the various test procedures developed in this article and the modified Score test (denoted as T † sc ) proposed in Tang & Qiu [12] based on the models with J = 2 and J = 3, respectively.
To assess the performance of empirical powers of various test procedures, we considered the above sample size designs with the following parameter settings: (δ 1 , δ 2 ) = Table 3. The empirical sizes (percent) for the homogeneity testing H 0 : δ 1 = δ 2 at 0.05 significance level based on the dependence model.          Table 5, and (m +01 ,m +02 ,m +03 ,m +11 ,m +12 ,m +13 ) = (50, 50, 50, 50, 50, 50) in Table 6, respectively. Results in Tables 3 to 6 show that the newly proposed tests (i.e. T wls , T mh , T twls , T lwls , T lmh , T ltwls and T * sc ) performed satisfactorily in terms of type I error rates close to the nominal level. As expected, the test statistic T twls based on inverse hyperbolic transformation successfully improved the statistic in terms of type I errors even under small sample sizes. The likelihood ratio test did not perform well for testing homogeneity of risk differences for stratified correlated bilateral data even under moderate sample sizes (e.g. (m +01 , m +02 , m +11 , m +12 ) = (30,50,30,50), (m +01 , m +02 , m +03 , m +11 , m +12 , m +13 ) = (30,40,50,30,40,50)). Therefore, the likelihood ratio test is not recommended for testing of risk difference homogeneity for this stratified bilateral data. Compared with the modified Score test proposed in Tang & Qiu [12], the modified Score test procedure (T * sc ) performs slightly better even though it does not require iterative computation. The empirical powers of T * sc , T † sc and T l in general are greater than the other tests under the same parameter settings. It is noteworthy that the test procedures based on the statistics T wls , T mh , T twls and T ltwls usually produce satisfactory empirical Type I error rates even under small sample sizes. Therefore, they are strongly recommended for practical applications.

Sample size determination
Sample size calculation is a crucial step in the design of a study. In this section, we derive the approximate sample size formulae that achieve the desired power level for a test at a significance level based on the aforementioned recommended test statistics. Let m +1j /m +1 = r j , m +0j /m +1j = κ, m +1 = J j=1 m +1j in the following sections.

Sample size formula based on T wls
Under the alternative hypothesis H 1 , T wls is asymptotically distributed as non-central chi-square distribution with J−1 degrees of freedom and non-central parameter τ 1 . Let χ 2 J−1,1−β (τ 1 ) be the 100(1 − β) percentile of this non-central chi-square distribution, and χ 2 J−1,α be the 100α percentile of chi-square distribution with J−1 degrees of freedom. Thus, the non-central parameter τ 1 can be obtained by solving the equation J−1,α . Therefore, the sample size for achieving the power of 1 − β at α-level can be given by: where W 1j is given by (B.2) (see Web Appendix B for details).

Sample size formula based on T mh
Under the alternative hypothesis H 1 , T mh is asymptotically distributed as non-central chisquare distribution with J−1 degrees of freedom and non-central parameter τ 2 . Similarly, the sample size for achieving the power of 1 − β at α-level based on T mh are denoted as: where the non-central parameter τ 2 is obtained by solving the equation given by (B.4), respectively, and W 3j is given by (B.5) (see Web Appendix B for details).

Sample size formula based on T twls
Similarly, under the alternative hypothesis H 1 , T twls is asymptotically distributed as noncentral chi-square distribution with J−1 degrees of freedom and non-central parameter τ 3 . Therefore, the sample size for achieving the power of 1 − β at α-level is given by: where the non-central parameter τ 3 is obtained by solving the equation and W 4j is given by (B.7) (see Web Appendix B for details).

Sample size formula based on T * sc
It is obvious that the asymptotic powers of T * sc is given by Pr(T * sc ≥ χ 2 J−1,1−α | H 1 ). To find the sample size that can achieve the desired power 1 − β, we can set The approximate sample size m +1 that is required to achieve the desired power of 1 − β at level α with δ j = δ for some j ∈ {1, 2, . . . , J} can be obtained by solving the above equation. However, the sample size does not have the closed form due to no closed forms for the constrained MLEsλ * 1j andR * j of nuisance parameters λ 1j and R j (j = 1, 2, . . . , J). Therefore, we consider the following algorithm to find the solutions numerically.
Step 3: If the approximate probability p * (m +1 ) given in Step 2 is less than (greater than) the given power 1 − β, repeat Step 1 and Step 2 via larger (smaller) m +1 .

Evaluation of sample size formulas
In this section, we evaluate the proposed methods for sample size determination by examining the respective empirical powers under parameter settings of δ j , λ 1j and ρ j for j = 1, 2, . . . , J at α = 0.05 and 80% power. For the model with J = 2, the parameter settings are same as those for the empirical power study in Section 3.7. For the model with J = 3, (δ 1 , δ 2 , δ 3 ) = (−0.1, 0.05, 0.1), (0.0, 0.1, 0.15) and (0.1, 0.15, 0.2), and the other parameters are same as those for the empirical power study. Tables 7 and 8 report the estimated sample sizes based on the formulas (15)-(17) and the proposed procedure based on Score test in Section 4.4 for the model with J = 2 and J = 3, respectively.
From the simulation study, it is interesting to find that the performance of sample sizes based on T * sc , T wls and T twls are satisfactory in the sense that their estimated power are close to the pre-specified power, and are hence recommended. Especially, sample size estimator Table 7. Estimated sample sizes for m +1 and corresponding estimated powers (in parentheses) based on T sc , T wls , T mh and T twls with 80% power at 0.05 significance level for J = 2.
Par. n sc n wls n mh n twls n sc n wls n mh n twls κ = 1.0, (r 1 , r 2 ) = (0.5, 0.5) κ = 1.0, (r 1 , r 2 ) = (0. 4    based on T * sc usually smaller than those based on the tests T wls , T mh and T twls . The results can also be shown in the simulation studies for the empirical powers given by Tables 5 and 6.

Illustrative example
We now illustrate the methods using the data arose from a clinical trial for the treatment of otitis media with effusion (OME) described in Section 1. First, we investigate whether a significant difference among the three age groups, i.e. we consider the following homogeneity testing: H 0 : δ 1 = δ 2 = δ 3 versus H 1 : δ i = δ j for some i = j, i, j ∈ {1, 2, 3}.

Conclusion and discussion
In this article, we considered the homogeneity testing of risk difference for stratified bilateral correlated data. Eight new test statistics are derived and evaluated, including the test statistic derived from the weighted-least-square (WLS) method (T wls ), the test statistic based on the Mantel-Haenszel (MH) estimate (T mh ), the test statistic based on WLS method and inverse hyperbolic tangent transformation (T twls ), and the test statistics based on their log-transformation (i.e. T lwls , T lmh , T ltwls ), the modified score statistic (T * sc ) and the likelihood ratio test statistic (T l ). Simulation results show that all tests except T l can well control their empirical type I error rates in general. It is noteworthy that the test procedures based on the statistics T wls , T mh , T twls and T ltwls always perform satisfactory even under small sample size designs. Therefore, they are preferred to practical applications. When the sample size is not small (e.g. (m +01 , m +02 , m +11 , m +12 ) = (50, 50, 50, 50) or (m +01 , m +02 , m +03 , m +11 , m +12 , m +13 ) = (50, 50, 50, 50, 50, 50)), all test procedures give type I error rates closer to the nominal level with higher powers.
We also derived approximate sample size formulas based on the recommended tests T wls , T mh , T twls and T * sc for testing the homogeneity of risk differences to guarantee a prespecified power of a test at a given significance level. Simulation results suggested that the sample size formulas based on T * sc , T wls and T twls are satisfactory as the empirical powers are close to the pre-specified power at a given nominal level, and hence be recommended to practical applications. It is interesting that the sample size estimator based on T * sc is smallest and has satisfactory performance.
The methods proposed in the article should be useful in the design and analysis of stratified studies with bilateral data. A common risk difference can be obtained when there is no evidence to reject the null hypothesis of homogeneity in risk differences.On the other, if this hypothesis is rejected, one may need to presented stratum-specific risk difference.
In this article, we only consider the case of each patient having exactly two observations. Future research should consider cases where not all subjects provide two observations.