Homogeneity test of relative risk ratios for stratified bilateral data under different algorithms

Medical clinical studies of paired body parts often involve stratified bilateral data. The correlation between responses from paired parts should be taken into account to avoid biased or misleading results. This paper aims to test whether the relative risk ratios across strata are equal under the optimal algorithms. Based on different algorithms, we obtain the desired global and constrained maximum likelihood estimations (MLEs). Three asymptotic test statistics (i.e. T_L, T_SC and T_W) are proposed. Monte Carlo simulations are conducted to evaluate the performance of these algorithms with respect to the mean-square errors of the MLEs and the convergence rate. The empirical results show that the Fisher scoring algorithm is usually better than the other methods, since it converges efficiently for the global MLEs and yields lower mean-square errors for the constrained MLEs. The three test statistics are compared in terms of type I error rate (TIE) and power. Among these statistics, T_SC is recommended on account of its robust TIEs and satisfactory power.


Introduction
In medical clinical studies, observations from patients' paired body parts (e.g. eyes, ears, and arms) are usually collected as paired data. The paired outcome for each patient is no response, a unilateral response, or bilateral responses. Data from all patients can be summarized in a contingency table. The correlation between responses from paired parts should be taken into account to avoid biased or misleading results. In clinical practice, research subjects can often be distinguished by control variables (e.g. age, gender), which give rise to stratified data. Although questions involving treatment-by-stratum interaction are often secondary in most multi-center trials, they remain important as preparatory work for the overall and subgroup analyses. For a stratified bilateral design with two groups, the interaction can be tested by comparing ratios across strata. If the ratios are not significantly different, the effect of stratum is negligible. The relative risk ratio, odds ratio and risk difference are often used to quantify the strength of an association. Generally, the relative risk ratio is more intuitive than the odds ratio. Walter [22] pointed out that the population risks of some diseases are rather small, so that risk differences between groups are less dramatic. Thus, the relative risk ratio can be effectively used for the homogeneity test in stratified bilateral data.
Several models have been proposed for analyzing correlated paired data. Rosner introduced a constant R model with the assumption that the conditional probability of a response at one side of the paired body parts, given a response at the other side, is R times the unconditional probability [13]. Under Rosner's model, several asymptotic and exact tests have been discussed [4,7,14,20]. However, Dallal pointed out that Rosner's model could give a poor fit if the characteristic was almost certain to occur bilaterally with widely varying group-specific prevalence [2]. He assumed that each group had a constant conditional probability γ, and derived the likelihood ratio statistic for testing prevalence equality. M'lan and Chen presented three objective Bayesian methods for analyzing bilateral data under Dallal's model [5]. Donner proposed a model assuming that the correlation coefficient is a fixed constant ρ within each group [3]. Thompson proved that Donner's model makes full use of single- and two-organ data to optimize the power of a study [21]. Pei et al. applied the model to stratified paired data and assumed that the correlation coefficients of responses are the same for all subjects in the two groups of each stratum [10]. More details can be found in [6,9,19].
Recently, the homogeneity test has been a matter of continuing concern for stratified correlated bilateral data. Under Rosner's model, Tang and Qiu presented score and Wald-type statistics to test differences of proportions in stratified bilateral-sample designs [18], where the constrained maximum likelihood estimations (MLEs) were provided by Tang et al. [20]. Pei et al. used the fmincon function of MATLAB to search for constrained MLEs, since there are no closed-form solutions [11]. Further, they proposed five statistics for the homogeneity test of proportion ratios across strata under Donner's model. Ma and Liu proposed a two-step approach to calculate the global MLEs by the Newton-Raphson method and a third-order polynomial [6]. Shen and Ma also used the two-step method to obtain the global MLEs; combining the Newton-Raphson method and the Fisher scoring algorithm, they introduced a new two-stage procedure to obtain the constrained MLEs [15]. They then investigated three statistics for testing homogeneity of risk differences of two proportions. Based on the work of Pei et al. [11], Zhuang et al. used a two-step approach involving the Fisher scoring algorithm for the global and constrained MLEs to test a common ratio of two proportions across strata [23]. From the above results, we observe that a fast and efficient algorithm is crucial for obtaining MLEs under different hypotheses. However, there has been little research comparing different algorithms in order to obtain more accurate homogeneity test results.
Under Donner's model, this paper aims to obtain the optimal MLEs by comparing several algorithms and proposes various test statistics based on these MLEs. The remainder of the work is organized as follows. In Section 2, we review the data structure and Donner's model. In Section 3, the global and constrained MLEs are obtained through the Fisher scoring algorithm, the two-step method, and the two-stage procedure. Three test statistics are proposed for the homogeneity test of relative risk ratios across strata according to the optimal MLEs. Different algorithms are compared with respect to the mean-square errors of the MLEs and the convergence rate in Section 4. The performances of these test statistics are investigated in terms of the type I error rate (TIE) and power. Two real examples are provided to illustrate the proposed methods in Section 5. Conclusions and further work are given in Section 6.

Test methods
This section first derives the global and constrained MLEs by the Fisher scoring algorithm, the two-step method and the two-stage procedure. Then, likelihood ratio, score and Wald-type test statistics are proposed based on the optimal MLEs.

Global MLEs based on Fisher scoring algorithm
There are 3J unknown parameters in total in Donner's model. For the jth stratum, initial values of π_ij, ρ_j (i = 1, 2) are first specified. Writing θ_j = (π_1j, π_2j, ρ_j)^T, the Fisher scoring update at iteration t is θ_j^(t+1) = θ_j^(t) + I^(-1)(θ_j^(t)) U(θ_j^(t)), where U is the score vector and I is a 3 × 3 Fisher information matrix (see supplementary material). Repeat the process until the result converges.
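The iteration scheme can be illustrated with a deliberately simple example. The sketch below is ours, not the paper's: it applies the same update p ← p + I(p)^(-1) U(p) to a plain binomial proportion, where the score U and expected information I are known in closed form, and all names are illustrative.

```python
def fisher_scoring_binomial(x, n, p0=0.5, tol=1e-10, max_iter=100):
    """Fisher scoring for the MLE of a binomial proportion p (0 < x < n).

    Toy illustration of the update p <- p + I(p)^{-1} * U(p), where
    U(p) = x/p - (n-x)/(1-p) is the score and I(p) = n/(p(1-p)) is the
    expected Fisher information. The fixed point is the closed-form
    MLE x/n, so convergence is easy to verify.
    """
    p = p0
    for _ in range(max_iter):
        score = x / p - (n - x) / (1 - p)   # U(p)
        info = n / (p * (1 - p))            # I(p)
        p_new = p + score / info
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

print(fisher_scoring_binomial(37, 100))
```

For this one-parameter toy the update lands on x/n in a single step; in Donner's model the analogous update is a genuine 3 × 3 matrix iteration per stratum.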

Global MLEs based on two-step method
The two-step method combines a third-order polynomial with the Newton-Raphson algorithm. The detailed procedure is as follows. (i) Given the current estimate ρ_j^(t) (t = 0, 1, 2, . . .), the score equation for π_ij can be simplified to a third-order polynomial in π_ij, where i = 1, 2; the real roots of this polynomial are given in the supplementary material. Substituting ρ_j^(t) into the polynomial and solving for the real roots yields the tth approximations of π_ij (i = 1, 2). (ii) Given these approximations, ρ_j^(t+1) is obtained by the Newton-Raphson algorithm. Repeat (i)-(ii) until convergence.
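As a rough illustration of step (i), the sketch below extracts the real roots of a cubic that lie in the admissible interval (0, 1). The polynomial coefficients are arbitrary toy values, not the paper's score-equation coefficients, and the bracketing-plus-bisection scheme stands in for the closed-form roots in the supplementary material.

```python
def real_roots_in_unit_interval(coeffs, grid=1001, tol=1e-12):
    """Find real roots of c3*x^3 + c2*x^2 + c1*x + c0 inside (0, 1).

    Illustrative stand-in for the two-step method's root extraction:
    the admissible root of the third-order polynomial plays the role
    of an updated probability estimate. Sign changes are bracketed on
    a grid, then each bracket is refined by bisection.
    """
    c3, c2, c1, c0 = coeffs
    f = lambda x: ((c3 * x + c2) * x + c1) * x + c0
    roots = []
    xs = [i / grid for i in range(grid + 1)]
    for a, b in zip(xs, xs[1:]):
        if f(a) * f(b) < 0:                 # strict sign change in (a, b)
            lo, hi = a, b
            while hi - lo > tol:
                mid = (lo + hi) / 2
                if f(lo) * f(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append((lo + hi) / 2)
    return roots

# (x - 0.3)(x - 0.7)(x - 2) = x^3 - 3x^2 + 2.21x - 0.42: roots 0.3, 0.7 in (0, 1)
print(real_roots_in_unit_interval((1.0, -3.0, 2.21, -0.42)))
```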

Constrained MLEs based on two-stage procedure
The two-stage procedure differs from the two-step method in Section 3.1.2. Firstly, the MLE of δ is obtained by the Newton-Raphson algorithm. Then, with this estimate of δ fixed, the MLEs of π_1j and ρ_j (j = 1, . . . , J) are obtained by the Fisher scoring algorithm.
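The nested structure — update one block of parameters with the other held fixed, then alternate — can be sketched on a much simpler model. The toy below is our construction, not Donner's model: X1 ~ Bin(n1, p) and X2 ~ Bin(n2, d·p) with relative risk d, and bisection on each one-dimensional score equation stands in for the Newton-Raphson and Fisher scoring updates.

```python
def alternating_mle(x1, n1, x2, n2, tol=1e-10, max_iter=200):
    """Alternating-maximisation sketch for a relative-risk parameterisation.

    Toy model (not Donner's model): X1 ~ Bin(n1, p), X2 ~ Bin(n2, d*p).
    Mimics the two-stage idea: update d with p held fixed, then update
    p with d held fixed, until both stabilise. Each one-dimensional
    score equation is solved by bisection (the log-likelihood is
    concave in each coordinate, so the score is decreasing).
    """
    def solve(score, lo, hi):
        # bisection for the root of a decreasing score function
        for _ in range(200):
            mid = (lo + hi) / 2
            if score(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    p, d = 0.5, 1.0
    for _ in range(max_iter):
        p_old, d_old = p, d
        # Stage 1: update d given p  (root of dL/dd)
        d = solve(lambda d_: x2 / d_ - (n2 - x2) * p / (1 - d_ * p),
                  1e-12, (1 - 1e-12) / p)
        # Stage 2: update p given d  (root of dL/dp)
        p = solve(lambda p_: x1 / p_ - (n1 - x1) / (1 - p_)
                  + x2 / p_ - (n2 - x2) * d / (1 - d * p_),
                  1e-12, (1 - 1e-12) / max(d, 1.0))
        if abs(p - p_old) < tol and abs(d - d_old) < tol:
            break
    return p, d

p_hat, d_hat = alternating_mle(30, 50, 20, 40)
print(p_hat, d_hat)  # joint optimum satisfies p = x1/n1 and d*p = x2/n2
```

Because the toy likelihood is smooth and concave in each coordinate, the alternating scheme converges to the joint maximum, which is what the two-stage procedure relies on in the constrained problem.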

Likelihood ratio test
The likelihood ratio test statistic can be constructed from the global and constrained MLEs as T_L = 2(ℓ_1 − ℓ_0), where ℓ_1 is the log-likelihood evaluated at the global MLEs of π_1j, π_2j, ρ_j, and ℓ_0 is the log-likelihood evaluated at the constrained MLEs of π_1j, ρ_j and the common δ. Following Silvey [16], the likelihood ratio statistic is asymptotically distributed as a chi-square distribution with J−1 degrees of freedom under H_0. Thus, H_0 should be rejected if the value of the test statistic T_L is larger than χ²_{J−1,1−α} at the significance level α, where χ²_{J−1,1−α} is the 100(1 − α)th percentile of the chi-square distribution with J−1 degrees of freedom.
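The rejection rule can be sketched as follows. The two log-likelihood values in the usage line are made-up numbers, and the chi-square survival function is evaluated by a standard regularized incomplete-gamma series rather than any table from the paper.

```python
import math

def chi2_sf(x, k):
    """P(chi2_k > x) = Q(k/2, x/2) via the lower incomplete gamma series."""
    if x <= 0:
        return 1.0
    a, s = k / 2.0, x / 2.0
    # P(a, s) = s^a e^{-s} / Gamma(a) * sum_{n>=0} s^n / (a(a+1)...(a+n))
    term = 1.0 / a
    total = term
    n = a
    for _ in range(1000):
        n += 1.0
        term *= s / n
        total += term
        if term < total * 1e-16:
            break
    lower = total * math.exp(a * math.log(s) - s - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def likelihood_ratio_test(loglik_full, loglik_null, df):
    """T_L = 2(l_full - l_null) and its chi-square p-value on df = J - 1.

    l_full is evaluated at the global MLEs, l_null at the constrained
    MLEs under H0; reject H0 at level alpha when the p-value < alpha.
    """
    t = 2.0 * (loglik_full - loglik_null)
    return t, chi2_sf(t, df)

# hypothetical log-likelihood values for J = 2 strata (df = 1)
t_l, p_val = likelihood_ratio_test(-120.3, -122.7, df=1)
print(round(t_l, 3), round(p_val, 4))
```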

Wald-type test
Let β = (δ_1, δ_2, . . . , δ_J, π_11, π_12, . . . , π_1J, ρ_1, ρ_2, . . . , ρ_J) and let C be a contrast matrix such that the null hypothesis H_0: δ_1 = δ_2 = · · · = δ_J = δ is equivalent to Cβ^T = 0. The Wald-type statistic is then T_W = (Cβ̂^T)^T (C I_3^{-1} C^T)^{-1} (Cβ̂^T), where β̂ collects the global MLEs and the Fisher information matrix I_3 is the same as that of the score test. The statistic can be simplified in terms of a_j, the jth diagonal element of the inverse of I_3 (see supplementary material). Similarly, the test statistics T_SC and T_W are asymptotically distributed as chi-square with J−1 degrees of freedom under H_0. Although T_L, T_SC and T_W are asymptotically equivalent, they may take very different values. For example, the likelihood ratio statistic T_L is scale-invariant, and T_SC does not require calculation of the global MLEs. More details can be found in Agresti [1].
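Under the simplifying assumption that the estimates δ̂_j are asymptotically independent with variances a_j (a sketch assumption, not the paper's exact simplification), the quadratic form with the differencing contrast matrix collapses to an inverse-variance weighted sum of squares about the weighted mean, which is cheap to compute:

```python
def wald_homogeneity(deltas, variances):
    """Wald-type homogeneity statistic for independent estimates.

    Sketch under the simplifying assumption that the J estimates
    delta_j are asymptotically independent with variances a_j. For C
    the usual (J-1) x J differencing contrast matrix, the quadratic
    form (C d)' (C A C')^{-1} (C d) with A = diag(a_j) then equals the
    inverse-variance weighted sum of squares about the weighted mean.
    """
    w = [1.0 / a for a in variances]                     # weights 1/a_j
    d_bar = sum(wi * di for wi, di in zip(w, deltas)) / sum(w)
    return sum(wi * (di - d_bar) ** 2 for wi, di in zip(w, deltas))

# two strata (illustrative values): reduces to (d1 - d2)^2 / (a1 + a2)
print(wald_homogeneity([1.05, 1.20], [0.004, 0.006]))
```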

Algorithm selection
In Section 3, we provided several algorithms to obtain the global or constrained MLEs: the global MLEs based on the Fisher scoring algorithm and the two-step method, denoted by Algorithms 1 and 2, and the constrained MLEs based on the Fisher scoring algorithm and the two-stage procedure, denoted by Algorithms 3 and 4. Next we investigate the performance of these algorithms for the global and constrained MLEs with respect to the mean-square error (MSE) and the convergence rate. Let N_i1 = · · · = N_iJ = m (i = 1, 2).

Mean-square errors
MSEs can be used to evaluate the stability and accuracy of Algorithms 1, 2, 3 and 4. If an algorithm produces MLEs with lower MSEs, then it performs better than the other algorithms.
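The evaluation recipe — simulate replicates at known true parameter values, re-estimate, and average the squared errors — can be sketched on an estimator whose MSE is known exactly, so the simulation itself can be checked. The binomial MLE below is a stand-in for the paper's algorithms; the sample sizes echo the m = 30, 50, 100 settings.

```python
import random

def mse_of_estimator(p_true, m, n_rep=10000, seed=2024):
    """Monte Carlo MSE of an estimator against its true value.

    Illustrative version of the comparison in this section: here the
    estimator is the closed-form binomial MLE x/m, whose exact MSE is
    p(1-p)/m, so the simulated value can be sanity-checked and should
    shrink as m grows.
    """
    rng = random.Random(seed)
    se = 0.0
    for _ in range(n_rep):
        x = sum(rng.random() < p_true for _ in range(m))  # Bin(m, p) draw
        se += (x / m - p_true) ** 2
    return se / n_rep

for m in (30, 50, 100):
    print(m, mse_of_estimator(0.3, m))  # roughly 0.21/m
```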
The MSEs of the global MLEs can be evaluated by the differences from the corresponding true parameter values in one stratum. Take m = 30, 50, 100. For each m, true values of 1000 parameter sets (δ_1, π_11, ρ_1) are randomly generated under the alternative hypothesis H_a, and 10,000 replicates are randomly produced for each parameter setting. The global MLEs of (δ_1, π_11, ρ_1) for every sample can be calculated based on Algorithms 1 and 2. The convergence accuracy is defined as the difference between two successive iterations and is fixed at 1 × 10^−10 in these algorithms. The MSEs of Algorithms 1 and 2 show no significant difference, as shown in Table 2. That is to say, the two algorithms produce essentially identical global MLEs. Moreover, their MSEs become smaller as m increases.
Take m = 30, 50, 100 and J = 2, 3, 4. For every combination of J and m, true values of 1000 parameter sets (δ, π_1j, ρ_j, j = 1, . . . , J) are randomly generated under the null hypothesis H_0, and 10,000 samples are randomly produced under every set of parameters. The constrained MLEs of δ, π_1j and ρ_j can be obtained by Algorithms 3 and 4 with convergence accuracy 1 × 10^−10. Their MSEs are shown in Table 3. The results show that Algorithm 3 generally performs better than Algorithm 4. For example, the MSEs of δ and π_1j (j = 1, . . . , J) from Algorithm 3 are obviously smaller than those from Algorithm 4 in most cases, while the MSEs of ρ_j are close to each other under the two algorithms.

Convergence rate
Convergence rate is an important index for evaluating the efficiency of an iterative algorithm. It is defined as the speed at which a convergent sequence approaches its local optimum (if the local optimum exists). An algorithm can be said to have a higher convergence rate if it needs fewer iterations for a specified convergence accuracy, or less time to finish a certain number of iterations. Since Algorithms 1 and 2 produce the same MSEs, we further compare their convergence rates for solving the global MLEs from one stratum, including the convergence frequency, iteration number and time cost. Let m = 30, 50, 100, and ρ_1 = 0.2, 0.4, 0.6, 0.8. For every setting of m and ρ_1, true values of 1000 parameter sets (π_11, δ_1) are randomly generated from the alternative hypothesis H_a, and 10,000 replicates are randomly produced for each configuration. Table 4 shows the proportions of Algorithms 1 and 2 failing to converge within 100 iterations for the 1000 × 10,000 replicates when the convergence accuracy is 1 × 10^−10 or 1 × 10^−15. Obviously, Algorithm 1 works better than Algorithm 2. The failure probability of Algorithm 1 decreases as m increases. However, the failure probabilities of both algorithms decrease as ρ_1 increases. Under the same parameter setting, we calculate the average number of iterations over all 10,000 replicates. Figure 1 provides the frequency distributions of the average iteration numbers for Algorithms 1 and 2. We observe that the parameter settings have a marked effect on these two algorithms. For example, there is little difference between them when m = 30, ρ_1 = 0.4 and m = 100, ρ_1 = 0.2. However, for m = 30, 50, 80 and ρ_1 = 0.8, the iteration averages of Algorithm 1 are obviously smaller than those of Algorithm 2. On the other hand, we fix the iteration number at 50 and record the time cost. The average time cost distributions of the algorithms are shown in Figure 2.
The results show that the average time of Algorithm 1 is smaller than that of Algorithm 2 for the specified number of iterations. Thus, Algorithm 1 takes less time than Algorithm 2 to obtain the global MLEs.
Based on the above results, Algorithms 1 and 2 produce the same global MLEs when the number of iterations is large enough, but Algorithm 1 has a better convergence rate than Algorithm 2. Moreover, the MSEs of the MLEs from Algorithm 3 are smaller than those from Algorithm 4. Therefore, it is advisable to choose Algorithms 1 and 3 to obtain the optimal global and constrained MLEs required by the homogeneity test in this paper.

Comparison of test statistics
In this section, we calculate the global MLEs from Algorithm 1 and the constrained MLEs from Algorithm 3 to construct test statistics. The performance of test statistics is investigated in terms of TIEs and power. All tests are conducted at the significance level α = 0.05.

Empirical type I error rates
TIE is the probability of rejecting a null hypothesis that is true. According to Tang et al., a test is robust if its empirical TIE is between 0.04 and 0.06; liberal if it is greater than 0.06; conservative if it is less than 0.04 [20]. Power is the probability of correctly rejecting a null hypothesis when it is false in a statistical test. A good test should not only have robust TIEs but also make power as high as possible.
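Both quantities reduce to a rejection fraction over simulated replicates. The sketch below uses a simple two-proportion z-test as a stand-in for the paper's statistics (the test, sample sizes and proportions are our illustrative choices): with p1 = p2 the returned fraction is an empirical TIE, to be compared with the 0.04-0.06 robustness band, and with p1 ≠ p2 it is an empirical power.

```python
import math, random

def rejection_rate(p1, p2, n, n_rep=10000, seed=7):
    """Empirical rejection rate of a pooled two-proportion z-test at nominal 0.05.

    Stand-in for the simulations in this section: with p1 = p2 the data
    satisfy H0 and the returned fraction estimates the TIE; with
    p1 != p2 it estimates the power.
    """
    rng = random.Random(seed)
    z_crit = 1.959963984540054  # 97.5% standard normal quantile
    rejects = 0
    for _ in range(n_rep):
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        q1, q2, pooled = x1 / n, x2 / n, (x1 + x2) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(q1 - q2) / se > z_crit:
            rejects += 1
    return rejects / n_rep

tie = rejection_rate(0.3, 0.3, 200)  # H0 true: should sit near 0.05
print("TIE:", tie, "robust" if 0.04 <= tie <= 0.06 else "not robust")
print("power:", rejection_rate(0.3, 0.45, 200))
```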
Next we calculate the empirical TIEs of the test statistics under different parameter settings. Let m = 30, 50 or 70 in J = 2, 4 or 6 strata, and δ = 1, 1.1, 1.2. The other parameters ρ^T = (ρ_1, . . . , ρ_J) and π^T_1 = (π_11, . . . , π_1J) are listed in Table 5. For each parameter setting, 10,000 replicates are randomly generated from the null hypothesis H_0. We add 0.5 to each cell whenever the test statistics are undefined, following Agresti [1]. The empirical TIEs are computed as the number of rejections of H_0 divided by 10,000. Tables G1−G3 in the supplementary material provide the empirical TIEs of all test statistics, where values bigger than 0.06 or smaller than 0.04 are marked in bold. We observe that the empirical TIEs of T_SC are usually close to 0.05, while T_W is conservative. Thus, T_SC is more robust than the other tests. Figure 3 further confirms the better performance of T_SC through 1000 random parameter settings involving δ, ρ_j and π_1j (j = 1, . . . , J) under H_0 for J = 2, 4, 6, with 10,000 replicates randomly generated for each setting to calculate the empirical TIE. As shown in Figure 3, the medians for T_SC are close to 0.05, T_L is relatively liberal, and T_W is relatively conservative. Thus, T_SC is more robust than the likelihood ratio and Wald-type tests. There is an obvious improvement in the empirical TIEs of T_W as J increases. The likelihood ratio, score and Wald-type tests all have robust TIEs when the sample size m ≥ 50 in our simulations.

Empirical power
We calculate the empirical power of the test statistics with the parameters listed in Table 5. A total of 10,000 samples are randomly generated from H_a under every set of parameters. Empirical power is computed as the number of rejections of H_0 divided by 10,000. Tables G4−G6 in the supplementary material show that the likelihood ratio and score tests have close empirical power under the same configurations, but T_W always produces the smallest empirical power. Figures 4−6 further reflect how the empirical power changes with the parameters for J = 2, 4, 6, where the power is calculated by fixing the other components of δ at 1 and varying δ_p. The test statistics T_L and T_SC always have higher empirical power, and T_W produces the lowest power. As m increases, the powers of the test statistics become close to each other.
For example, when m = 90, the power values are very close. A larger difference between 1 and δ_p brings greater power. Moderately and highly correlated data (i.e. ρ_j = 0.5 and ρ_j = 0.8) yield higher power than mildly correlated data (i.e. ρ_j = 0.2), and the power difference between the tests becomes more obvious as ρ_j increases. Moreover, increasing the number of strata produces higher power, which means that all the test statistics work well in multi-stratum cases.
Since T_SC has robust empirical TIEs and satisfactory power, it is recommended for investigating the equality of relative risk ratios for stratified bilateral data based on Algorithms 1 and 3.

Real examples
In this section, we introduce two real examples to illustrate the aforementioned methods based on Algorithms 1 and 3.
The first example concerns a double-blinded randomized clinical trial conducted by Mandel et al. [8]. In this trial, children who had otitis media with effusion (OME) were stratified by age and then randomly assigned to the Cefaclor or Amoxicillin treatment group. We conduct the homogeneity test with the data from the 2−5 yr and >5 yr groups shown in Table 6. The global and constrained MLEs derived by Algorithms 1 and 3 are shown in Table 7. The estimated relative risk ratios are δ_1 = 0.6205/0.5881 = 1.0551 and δ_2 = 0.8840/0.8341 = 1.0598 under H_a, and δ = 1.0579 under H_0. Table 8 shows the three test statistics and their p-values. The values of the likelihood ratio, score and Wald-type test statistics are 1.6018 × 10^−4, 1.5873 × 10^−4 and 1.6069 × 10^−4, respectively, which are very close, and the three tests' p-values are all equal to 0.9899. At the significance level 0.05, there is no evidence to reject H_0: δ_1 = δ_2 = δ. In other words, the ratios between the Cefaclor and Amoxicillin treatments for children older than 2 years are clinically equivalent.
The second example comes from Postlethwaite et al. [12]. Patients with diffuse scleroderma were randomly allocated to control or treatment groups. The disease durations (early- or late-phase) are treated as two strata. Disease improvement was evaluated with the modified Rodnan skin score (MRSS) proposed by Tang et al. [17] and converted to the binary variable shown in Table 9. In this example, we test H_0: δ_1 = δ_2 = δ versus H_a: δ_1 ≠ δ_2. The MLEs are shown in Table 10, from which the estimated ratios under H_a are derived; the test statistics and p-values are reported in Table 11. T_L has the largest value, 1.2636, with corresponding p-value 0.2609. The Wald-type test T_W = 1.1050 has the smallest value, and the score test takes the value T_SC = 1.2631. The p-values of T_SC and T_W are 0.2611 and 0.2932, respectively. All p-values are greater than 0.05. As a result, we fail to reject the null hypothesis, i.e. there is no treatment-by-stratum interaction.

Conclusion
This paper mainly studies homogeneity test of relative risk ratios from stratified correlated bilateral data under Donner's model. For the unknown parameters, we obtain their global and constrained MLEs under several algorithms, including the global MLEs based on Fisher scoring algorithm (Algorithm 1) and two-step method (Algorithm 2), the constrained MLEs based on Fisher scoring algorithm (Algorithm 3) and two-stage procedure (Algorithm 4). Likelihood ratio (T L ), score (T SC ) and Wald-type (T W ) statistics are proposed to test the equality of relative risk ratios under the optimal algorithms.
The simulation results show that Algorithms 1 and 2 produce the same global MLEs with respect to MSEs. However, Algorithm 1 requires fewer iterations and less time than Algorithm 2.
Compared with Algorithm 4, Algorithm 3 obtains the constrained MLEs with lower MSEs. Thus, Algorithms 1 and 3 are selected to obtain the global and constrained MLEs, respectively. Under these optimal algorithms, we investigate the performance of all the tests in terms of empirical TIEs and power. We observe that the score test T_SC is generally robust, the likelihood ratio test T_L is relatively liberal, and the Wald-type test T_W is usually conservative. Moreover, T_L and T_SC have close power, and T_W always produces the lowest power. Therefore, T_SC is recommended among the proposed test statistics.
The statistics proposed in this paper are based on asymptotic methods for the homogeneity test, and may not be suitable for small sample sizes. Exact tests will be considered in our future work.

Disclosure statement
No potential conflict of interest was reported by the author(s).