How reliable are the multiple comparison methods for odds ratio?

ABSTRACT The homogeneity tests of odds ratios are used in clinical trials and epidemiological investigations as a preliminary step of meta-analysis. In recent studies, the severity or mortality of COVID-19 in relation to demographic characteristics, comorbidities, and other conditions has been widely discussed by interpreting odds ratios and using meta-analysis. According to the homogeneity test results, a common odds ratio summarizes all of the odds ratios in a series of studies. If the aim is not to find a common odds ratio, but to find which of the sub-characteristics/groups is different from the others or is under risk, then the implementation of a multiple comparison procedure is required. In this article, the focus is placed on the accuracy and reliability of the homogeneity of odds ratio tests for multiple comparisons when the odds ratios are heterogeneous at the omnibus level. Three recently proposed multiple comparison tests and four homogeneity of odds ratios tests with six adjustment methods to control the type-I error rate are considered. The reliability and accuracy of the methods are discussed in relation to COVID-19 severity data associated with diabetes on a country-by-country basis, and a simulation study to assess the powers and type-I error rates of the tests is conducted.


Introduction
It has recently become very popular to investigate the mortality or severity of COVID-19 in relation to demographics, clinical characteristics, or signs and symptoms to understand the impacts of COVID-19 more clearly. Not only in COVID-19 studies, but also in general meta-analysis projects, the odds ratio of death or severity for a patient with a comorbidity (diabetes, hypertension, cardiac disease, etc.) or a symptom (fever, cough, fatigue, etc.) has been investigated. Under the assumption that the odds ratios across studies are homogeneous, the results of several studies are aggregated via a meta-analysis. Before calculating a common odds ratio, the homogeneity of the studies should be tested. Studies directly focusing on testing the homogeneity of the odds ratios date back to the first half of the 1950s. While Woolf's [39] test was the first in the literature to test the homogeneity of the logarithm of odds ratios, it was found to be very conservative by Gavaghan et al. [16]. Breslow and Day [6] used the Mantel-Haenszel estimator instead of the conditional maximum likelihood estimate. Tarone [34] suggested an adjusted version of the Breslow-Day (BD) test. Because the BD statistic is based on the Cochran-Mantel-Haenszel odds ratio estimator and this estimator is not efficient, Almalik and van den Heuvel [26] suggested using the Tarone test. However, studies that did find a difference between the BD and Tarone tests reported it only in the fourth decimal place [24]. Reis et al. [30] discussed the limitations of the asymptotic chi-square tests and did not recommend using them when most of the expected values were less than five. The Zelen test [45] was recommended to overcome this problem, but it was found to be biased and inconsistent [18,26]. Yusuf et al. [44] proposed a chi-square-based method, called the Peto method, that is identical to the asymptotic Zelen test [16,30].
The DerSimonian-Laird [11] statistic, the likelihood ratio (LR) test of a mixed logistic model [1,3], and the conditional maximum likelihood score statistic [16] have also been used to test the homogeneity of the odds ratios. The DerSimonian-Laird statistic based on the natural logarithm of the odds is equivalent to the Woolf statistic [16]. Because of its simple calculation, the use of the BD test has been recommended instead of the mixed logistic model and score tests [3,16,30]. All of these methods are calculated under the assumption of homogeneity of the odds ratios.
There are many studies that have compared the properties of the homogeneity tests of odds ratios in the literature. Jones et al. [23] compared the power of seven tests of homogeneity of the odds ratio for balanced and unbalanced designs; as a result of their simulation study, they suggested using the BD statistic for non-sparse tables. Paul and Donner [27] compared the performance of nine tests for the homogeneity of odds ratios according to the data designs (balanced, mildly unbalanced, severely unbalanced, and within-strata unbalanced) and the number of strata. Because of its simple calculation and power performance, they recommended using the Tarone test in practice; additionally, they recommended the Woolf test for balanced or mildly unbalanced designs. Reis et al. [30] conducted a Monte Carlo simulation to compare the performance of six asymptotic tests for the homogeneity of odds ratios; the BD and Pearson chi-square tests were slightly better than the other tests for non-small sample sizes. Gavaghan et al. [16] compared the performance of the Peto, Woolf, DerSimonian-Laird, and BD statistics and the score test, and suggested using the BD statistic in the meta-analysis of pain studies. Their simulation study showed that while the Woolf statistic under-estimated the degree of heterogeneity, the DerSimonian-Laird statistic over-estimated it. Bagheri et al. [3] compared the likelihood ratio test of a mixed logistic model, the DerSimonian-Laird statistic, and the BD test with equal and non-equal sample sizes. They concluded that the BD test was the most powerful of these three tests, and that studies with more strata had higher power. Wei and Lai [37] discussed the effects of small sample size on the power of homogeneity tests. They suggested using U-statistics (U3 and WU3), which have higher power than the other tests, and concluded that the sample size had a positive effect on the power.
When the number of strata and sample size increased, the power of the U-statistics improved.
These studies in the literature did not agree on a particular test that could be used directly in practice. The main reason for this was the design spaces of the simulation studies: none of them considered an extensive simulation space that could account for real-world scenarios, including a wide combination of the number of studies, true odds ratios, different combinations of sample sizes, and distributions of the cell counts across the resulting contingency tables. The aim of this study is to produce new knowledge on the power and type-I error performances of a large set of tests under an extensive simulation space. In this way, results with a higher likelihood of generalizability are provided.
A meta-analysis is used to pool independent studies focused on the same question, and it is required that all available studies are reported. The heterogeneity of these studies (effect sizes) is tested with the I-squared and related statistics. If heterogeneity is observed, it is important to consider a strategy for handling its sources. There are different types of heterogeneity in a meta-analysis, such as clinical heterogeneity (differences in participant characteristics (gender, age group, race, etc.), in the types or timing of the outcome measurements, and in intervention characteristics), methodological heterogeneity (trial design and quality), and statistical heterogeneity (differences in treatment effects between trials) [15]. Gagnier et al. [15] discussed that clinical and methodological heterogeneity can cause significant statistical heterogeneity and affect the results.
COVID-19 meta-analyses aim to investigate the relationship between mortality/severity and comorbidities across strata defined by demographic characteristics (gender, age group, race, Hispanic origin, etc.). Such studies aim to reveal differences between the odds ratios across the strata (demographics). For example, the risk of death for a patient with hypertension may be markedly higher in some races than in others, or the relationship between mortality and hypertension may not be statistically significant in some races. When the odds ratios of COVID-19 data are heterogeneous, it is not appropriate to calculate a common odds ratio, and it is important to detect the odds ratio(s) that cause the heterogeneity among all the considered odds ratios. In this case, a multiple comparison procedure is needed. To produce reliable results in such a crucial area, it is essential to understand the power and type-I error behaviors of the multiple comparison procedures used for heterogeneous odds ratios.
Even though there are many methods to test the homogeneity of odds ratios, there is limited literature on the multiple comparison procedures to be applied when the odds ratios are heterogeneous. Yilmaz and Aktas Altunay [43] suggested using the BD-based least significant difference (LSD), chi-square-based LSD, and adjusted BD tests for multiple comparisons of odds ratios. They used these tests to compare six COVID-19 mortality data sets from China, and the study was limited to real-life data. Their numerical application showed that the Bonferroni and Dunn-Šidák adjustment methods were very conservative when comparing the odds ratios, whereas their proposed methods were less conservative. They also recommended the use of the BD-based and chi-square-based LSD methods for sparse tables.
In this article, the focus is placed on the use of the homogeneity of odds ratio tests for multiple comparisons when the odds ratios are heterogeneous. Specifically, we focused on COVID-19 data to obtain accurate and reliable inferences when the odds ratios are heterogeneous in meta-analysis studies on COVID-19. Following the simulation study results of Bagheri et al. [3], Gavaghan et al. [16], Jones et al. [23], Paul and Donner [27], and Reis et al. [30], the BD, Tarone, Woolf, and Peto homogeneity of odds ratio test statistics were considered. The Bonferroni, Dunn-Šidák, Holm, Hochberg, Hommel, and Benjamini-Hochberg adjustments were used to control the type-I error rate of the multiple comparison tests. We also considered the BD-based LSD, chi-square-based LSD, and adjusted BD tests for multiple comparisons [43]. In total, 27 methods were taken into consideration and compared in terms of seven measures: the any-pair power, all pairs power, positive predictive value, true negative rate (TNR), per-comparison error rate, family-wise error rate, and false discovery rate, across different numbers of strata, sample sizes, sample size designs (equal, within-center inequality, among-centers inequality), and table structures (balanced, imbalanced). With such an extensive numerical study, clearer and more reliable results were obtained on the power and error characteristics of multiple comparison tests for odds ratios than in previous studies.
The contributions of this study are that we (1) demonstrated the importance of heterogeneous odds ratios in COVID-19 studies and discussed the use of multiple comparison procedures to obtain reliable results, (2) examined the performance of the multiple comparison procedures for odds ratios in terms of power and error rates using an extensive simulation space that covered a wide range of realistic scenarios that can occur in practice, (3) examined the performance of the homogeneity tests of odds ratios in the multiple comparison procedure and discussed the effect of the adjustment methods on the tests, and (4) identified methods that can be used under different data compositions and areas of practice. In Section 2, the methods to test the homogeneity of odds ratios and the multiple comparison procedures are presented. In Sections 3 and 4, the results of numerical studies with COVID-19 and synthetic data are presented. In Section 5, general recommendations and conclusions are given.

Methods
In this section, the methods to test the homogeneity of odds ratios and the multiple comparison methods are introduced.

Test methods
Consider K different strata in which the association between two binary variables X and Y is investigated. Let n_{ijk} be the number of observations in the ith row, jth column, and kth stratum, where i, j = 1, 2 and k = 1, . . . , K, and let n_{..k} be the total number of observations in the kth stratum. The 2 × 2 × K study design is summarized in Table 1.
The sample odds ratio in the kth stratum is θ̂_k = (n_{11k} × n_{22k}) / (n_{12k} × n_{21k}), where k = 1, . . . , K. The null hypothesis for the homogeneity of odds ratios is H_0 : θ_1 = θ_2 = · · · = θ_K against H_1 : θ_i ≠ θ_j for at least one pair (i, j), where i, j = 1, . . . , K (i ≠ j). The Peto, BD, Tarone, and Woolf methods were used to test the null hypothesis of the equality of several odds ratios. The Mantel and Haenszel [25] odds ratio is

θ̂_MH = [Σ_{k=1}^{K} n_{11k} n_{22k} / n_{..k}] / [Σ_{k=1}^{K} n_{12k} n_{21k} / n_{..k}].

Yusuf et al. [44] proposed an alternative to the MH method for pooling odds ratios across the strata, referred to as the Peto method. The Peto statistic is

X²_Peto = Σ_{k=1}^{K} (n_{11k} − E_k)² / V_k − [Σ_{k=1}^{K} (n_{11k} − E_k)]² / Σ_{k=1}^{K} V_k,

where the expected frequency E_k and its variance V_k in the kth stratum are

E_k = n_{1.k} n_{.1k} / n_{..k},   V_k = n_{1.k} n_{2.k} n_{.1k} n_{.2k} / [n_{..k}² (n_{..k} − 1)].
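For illustration, the stratum-specific odds ratios and the MH pooled estimate can be computed directly from the cell counts. A minimal Python sketch with hypothetical counts:

```python
# Hypothetical cell counts for K = 3 strata; each table is [[n11, n12], [n21, n22]].
strata = [
    [[20, 80], [10, 90]],
    [[15, 85], [12, 88]],
    [[30, 70], [10, 90]],
]

# Stratum-specific sample odds ratios: (n11 * n22) / (n12 * n21).
odds_ratios = [(t[0][0] * t[1][1]) / (t[0][1] * t[1][0]) for t in strata]

# Mantel-Haenszel pooled odds ratio:
# sum_k(n11k * n22k / n..k) / sum_k(n12k * n21k / n..k).
totals = [t[0][0] + t[0][1] + t[1][0] + t[1][1] for t in strata]
num = sum(t[0][0] * t[1][1] / n for t, n in zip(strata, totals))
den = sum(t[0][1] * t[1][0] / n for t, n in zip(strata, totals))
or_mh = num / den

print([round(v, 3) for v in odds_ratios])  # per-stratum estimates
print(round(or_mh, 3))                     # pooled MH estimate
```

Note that the pooled MH estimate always lies between the smallest and largest stratum-specific odds ratios.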
The Breslow-Day (BD) test is used to test the homogeneity of the odds ratios across the K strata [5,6]. The BD test statistic is

X²_BD = Σ_{k=1}^{K} (n_{11k} − μ̂_k(θ̂_MH))² / σ̂²_k(θ̂_MH),

where μ̂_k(θ̂) and σ̂²_k(θ̂) are the expected value and the variance of n_{11k} under the assumption of homogeneous odds ratios, respectively. The BD formula uses the MH odds ratio to generate the expected values via the conditional maximum likelihood method.
The Tarone statistic is an adjusted version of the BD test statistic [34]:

X²_T = X²_BD − [Σ_{k=1}^{K} (n_{11k} − μ̂_k(θ̂_MH))]² / Σ_{k=1}^{K} σ̂²_k(θ̂_MH).

Here, μ̂_k(θ̂) and σ̂²_k(θ̂) are the same expected value and variance of n_{11k} as in the BD test statistic.
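As an illustration, the BD statistic (with the optional Tarone correction) can be sketched as follows. The expected value of n_{11k} under a common odds ratio ψ is the admissible root of a quadratic. This is a minimal sketch with my own helper names, not the authors' implementation; ψ would typically be the MH estimate, and zero cells are not handled:

```python
import math

def bd_expected(n11, n12, n21, n22, psi):
    """Expected value of n11 under a common odds ratio psi: the admissible
    root of (1 - psi)*mu^2 + ((N - r - c) + psi*(r + c))*mu - psi*r*c = 0,
    where r, c are the first row/column totals and N is the table total."""
    r, c, N = n11 + n12, n11 + n21, n11 + n12 + n21 + n22
    if abs(psi - 1.0) < 1e-12:
        return r * c / N
    a, b, cc = 1.0 - psi, (N - r - c) + psi * (r + c), -psi * r * c
    disc = math.sqrt(b * b - 4 * a * cc)
    for mu in ((-b + disc) / (2 * a), (-b - disc) / (2 * a)):
        if max(0.0, r + c - N) < mu < min(r, c):
            return mu
    raise ValueError("no admissible root")

def breslow_day(strata, psi, tarone=False):
    """BD homogeneity statistic (optionally with the Tarone correction)."""
    stat = dev = var_sum = 0.0
    for (n11, n12), (n21, n22) in strata:
        r, c, N = n11 + n12, n11 + n21, n11 + n12 + n21 + n22
        mu = bd_expected(n11, n12, n21, n22, psi)
        var = 1.0 / (1/mu + 1/(r - mu) + 1/(c - mu) + 1/(N - r - c + mu))
        stat += (n11 - mu) ** 2 / var
        dev += n11 - mu
        var_sum += var
    return stat - dev ** 2 / var_sum if tarone else stat
```

The resulting statistic would be compared with a chi-square quantile with K − 1 degrees of freedom; when all strata share the same table, both versions are zero.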
The Woolf statistic is

X²_W = Σ_{k=1}^{K} w_k (ln θ̂_k − ln θ̄)², with ln θ̄ = Σ_{k=1}^{K} w_k ln θ̂_k / Σ_{k=1}^{K} w_k,

where the weights are w_k = [1/n_{11k} + 1/n_{12k} + 1/n_{21k} + 1/n_{22k}]^{−1}, k = 1, . . . , K. All of these methods are used to determine whether there are statistically significant differences between independent odds ratios calculated from 2 × 2 tables. The Peto, BD, Tarone, and Woolf test statistics follow the chi-square distribution with K − 1 degrees of freedom.
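The Woolf statistic above can be sketched in a few lines; the counts here are hypothetical, and zero cells (which in practice require a continuity correction) are not handled:

```python
import math

def woolf_statistic(strata):
    """Woolf homogeneity statistic: weighted sum of squared deviations of the
    log odds ratios from their weighted mean; df = K - 1 under H0."""
    log_ors, weights = [], []
    for (n11, n12), (n21, n22) in strata:
        log_ors.append(math.log(n11 * n22 / (n12 * n21)))
        weights.append(1.0 / (1/n11 + 1/n12 + 1/n21 + 1/n22))
    mean = sum(w * lo for w, lo in zip(weights, log_ors)) / sum(weights)
    return sum(w * (lo - mean) ** 2 for w, lo in zip(weights, log_ors))
```

With identical strata the statistic is zero, and it grows as the log odds ratios spread apart.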

The multiple comparison procedures for the odds ratio
When the methods presented in Section 2.1 indicate the presence of heterogeneity in the odds ratios, it is important to determine which of the groups are different from the others. For this purpose, the Peto, BD, Tarone, and Woolf tests were applied to each pair of studies or strata. The null hypothesis for the multiple comparisons of the odds ratios is H_0 : θ_i = θ_j against H_1 : θ_i ≠ θ_j for each pair (i, j), i, j = 1, . . . , K (i ≠ j). Because performing many comparisons inflates the error rates, different methods have been proposed to adjust the type-I error.
• Bonferroni Adjustment: This method is the most popular but also the most conservative one [13]. The Bonferroni method controls the family-wise error rate. Let m = K(K − 1)/2 be the number of simultaneously tested hypotheses. The Bonferroni adjusted significance level is α* = α/m.
• Dunn-Šidák Adjustment: The Šidák [31] method is slightly more powerful than the Bonferroni method [12]. The Dunn-Šidák adjusted significance level is α* = 1 − (1 − α)^{1/m}.
• Holm Adjustment: The Holm [20] sequential adjustment is also based on the Bonferroni method, but it is less conservative. First, the p-values of the m tests are ranked from smallest to largest. Starting from the smallest p-value (i = 1), each p_i is compared with the adjusted level α_i = α/(m − i + 1). The comparison continues until p_i ≥ α_i; that hypothesis and all of the remaining ones are considered non-significant.
• Hochberg Adjustment: The Hochberg [19] sequential adjustment is very similar to the Holm adjustment. For this method, the p-values of the m tested hypotheses are ranked from largest to smallest, and the procedure starts from the largest p-value (i = 1), comparing p_i with α_i = α/i. The comparison continues until p_i < α_i; that hypothesis and all of those with smaller p-values are considered significant.
• Hommel Adjustment: The Hommel [21] method is also less conservative than the Bonferroni method, and it is slightly more powerful than the Hochberg method. First, the p-values of the m tests are ranked from smallest to largest. Let j be the size of the largest subset of hypotheses satisfying j = max{i ∈ {1, . . . , m} : p_{m−i+k} > kα/i for k = 1, . . . , i}. If no such j exists, all of the hypotheses are rejected; otherwise, a hypothesis is rejected when p_i ≤ α/j [40].
• Benjamini-Hochberg Adjustment: The Benjamini and Hochberg [4] method is suggested to control the false discovery rate. It is less conservative than the other methods and gives better results when a large number of hypotheses are tested [7]. First, the p-values of the m tested hypotheses are ranked from smallest to largest. Let j = max{i : p_i ≤ αi/m}. All of the hypotheses with p_i ≤ p_j are rejected; if no such j exists, none of the hypotheses is rejected.
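As a sketch, the Bonferroni, Holm, and Benjamini-Hochberg rules can equivalently be expressed as adjusted p-values (reject when the adjusted p-value is at most α). This form should mirror the adjusted p-values returned by R's p.adjust() for these three methods, although the function here is my own:

```python
def adjust(pvals, method):
    """Adjusted p-values for the Bonferroni, Holm (step-down), and
    Benjamini-Hochberg (step-up) procedures."""
    m = len(pvals)
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvals]
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adj = [0.0] * m
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):              # smallest p first
            running = max(running, (m - rank) * pvals[i])
            adj[i] = min(1.0, running)
        return adj
    if method == "bh":
        running = 1.0
        for rank, i in enumerate(reversed(order)):    # largest p first
            running = min(running, m * pvals[i] / (m - rank))
            adj[i] = running
        return adj
    raise ValueError("unknown method: " + method)
```

The cumulative max (Holm) and cumulative min (BH) enforce the monotonicity of the step-down and step-up procedures, respectively.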
In addition to the tests of the homogeneity of the odds ratios, the BD-based LSD, the chi-square-based LSD, and the adjusted BD tests can be used for multiple comparisons [43].
To avoid confusion with other adjustment methods of the BD test, the adjusted BD test will be referred to as the YA test from now on.
Zwinderman and Bossuyt [46] and Van den Ende et al. [35] reported that odds ratios are more useful when converted to log values. Armistead [2] discussed the limitations of measures of association and noted that taking the natural logarithm of the odds ratio makes it symmetric above and below one, with ln(1) = 0. The BD-based and chi-square-based LSD test methods are based on the difference between two log-odds ratios. Assume that θ_i is the odds ratio of the ith stratum and θ_j is the odds ratio of the jth stratum, where i, j = 1, . . . , K. Let δ be the absolute difference between these two log-odds ratios, δ = |ln θ̂_i − ln θ̂_j|. For the BD-based LSD test, the common standard error is computed from the expected cell counts of the BD test, and δ is compared with the critical difference ORDIF in Equation (10). The null hypothesis is rejected if the difference is greater than or equal to the critical value, δ ≥ ORDIF.
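On the log scale, an LSD-type pairwise comparison reduces to checking whether δ exceeds a critical difference. The sketch below uses the ordinary Woolf-type variance of a log odds ratio computed from observed counts; the paper's ORDIF in Equation (10) is built from expected counts instead, so treat this as an approximation with hypothetical names:

```python
import math

def log_or_and_var(table):
    """Log odds ratio and its Woolf-type variance from observed counts."""
    (n11, n12), (n21, n22) = table
    return (math.log(n11 * n22 / (n12 * n21)),
            1/n11 + 1/n12 + 1/n21 + 1/n22)

def lsd_compare(table_i, table_j, z=1.96):
    """Compare two strata on the log-odds scale: reject if delta >= ORDIF."""
    li, vi = log_or_and_var(table_i)
    lj, vj = log_or_and_var(table_j)
    delta = abs(li - lj)
    ordif = z * math.sqrt(vi + vj)  # critical difference
    return delta, ordif, delta >= ordif
```

Working on the log scale makes the comparison symmetric: swapping the two strata changes only the sign of the difference, not |δ|.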
The chi-square-based LSD test, in which the expected values are based on the chi-square approach, is also used for multiple comparison. The expected values (E_{11k}, E_{12k}, E_{21k}, E_{22k}), where k = 1, . . . , K, are calculated as in the ordinary chi-square test, and the standard error for stratum i is computed from them. Then, the ORDIF is calculated and the remaining steps of the first method are followed.
The YA test is based on the BD test; to compute it, expected values based on the overall MH estimate are used. The calculated chi-square value is compared with the chi-square critical value for the chosen α and df = 1.

COVID-19 studies
The effects of COVID-19 on human health have been widely investigated, and recent studies in China, Europe, and America have revealed a relationship between disease severity or mortality and comorbidities (diabetes, hypertension, cardiovascular disease, liver injury, etc.) for COVID-19 [9,41]. These studies were followed by meta-analyses [9,38,42]. de Almeida-Pititto et al. [9] applied several meta-analyses for diabetes, hypertension, and cardiovascular disease, and for the use of ACE/ARB, in COVID-19 mortality and severity cases. Their study indicated high heterogeneity for ACE/ARB; hence, they calculated a common odds ratio based on a random-effects meta-analysis. Wong et al. [38] collected data from Asia, applied a meta-analysis of COVID-19 severity cases associated with liver injury, discussed the heterogeneity of the data, and applied a subgroup meta-analysis to minimize the heterogeneity. Yang et al. [42] also used a meta-analysis to describe the risk of hypertension, diabetes, respiratory system disease, and cardiovascular disease in COVID-19 severity cases. In these studies, the heterogeneity of the data was tested with the Q-statistic and related measures, and a random-effects meta-analysis was then applied due to the presence of moderate to high heterogeneity. The risk associated with comorbidities in severity or mortality may differ depending on age group, region, Hispanic origin, etc., and the risk in some groups may be notably higher than in others. When the odds ratios of COVID-19 data are homogeneous, a common (Mantel-Haenszel) odds ratio is calculated. On the other hand, if the test results indicate a difference in the odds ratios, it is neither suitable nor reliable to calculate a common odds ratio, and it is important to detect the different ones. For instance, de Almeida-Pititto et al. [9] concluded that further studies were needed to detect the risk association in different age groups, but they did not provide any further analysis. In that case, applying multiple comparison procedures is strongly suggested to make reliable inferences.
In previous studies [8,9,17,33,36], the risk of developing diabetes in severe patients was compared with the same risk in non-severe patients, and the relationship between the severity of COVID-19 (severe/non-severe) and diabetes was presented. Severity was described as ICU admission or the need for mechanical ventilation [9].
The aim of this study is to investigate the risk of developing diabetes for different levels of disease severity, while investigating if there is a statistically significant difference between different countries. With this purpose, data sets from Greece [17], Italy [8], China [36], and France [33] were used to determine COVID-19 severity with regard to diabetes, as presented in Table 2.
The BD test results indicated that the odds ratios were not homogeneous (χ² = 9.743, df = 3, p-value = 0.021). Although the odds ratios for Greece, France, and Italy were close, the odds ratio for China was considerably higher than the others. In this case, multiple comparison tests were needed to detect the differences between the odds ratio of China and those of the other countries. To assess the behavior of the tests for such an obvious difference between the estimated odds ratios among the countries, the multiple comparison tests were applied, and their results are summarized in Table 3.
According to the multiple comparison results in Table 3, there were no statistically significant differences between those of Greece and Italy, or between the odds ratios of Italy and France, as expected. The YA test was the only test that found a statistical difference between the odds ratios of Greece and France. All of the tests with Benjamini-Hochberg adjustment methods, the Peto test with all of the adjustment methods, and the YA test method indicated the difference between the odds ratios of China and the other countries, except for the Benjamini-Hochberg adjusted Woolf test between the odds ratios of China and France. The BD-based LSD and chi-square-based LSD tests, and also the Bonferroni, Holm, Hochberg, and Hommel adjusted BD, Tarone, and Woolf tests did not indicate any statistically significant difference between China and Greece, Italy, or France even though the odds ratio of China was obviously greater than the other odds ratios.
This observation from the COVID-19 data not only strongly showed the necessity of a multiple comparison procedure for the heterogeneous odds ratios in COVID-19 studies, but also demonstrated the importance of relying on the most suitable test to make inferences. The results showed that the different multiple comparison procedures indicated different results, even when the tests recommended in the previous literature were used. Thus, a detailed simulation study needs to be performed to discuss the reliability and accuracy of the methods.

Simulation study
A simulation study was conducted to compare the performance of the multiple comparison methods introduced in Section 2.

Simulation design
Odds ratios were simulated considering K = 3, 5, 7 strata. Three scenarios were considered for the alternative hypothesis space: P1 corresponds to the case where only one odds ratio is different, P2 to the case where more than one odds ratio is different, and F to the case where all of the odds ratios are different.
Three different sample size designs, given by Bagheri et al. [3], were used: equal, within-center inequality, and among-centers inequality (see Table 2 in Bagheri et al. [3]). The marginal probabilities were set to π_{.1k} = π_{.2k} = 0.50 and π_{1.k} = π_{2.k} = 0.50 for the balanced design, and to π_{.1k} = 0.25 and π_{1.k} = 0.25 for the imbalanced design, where k = 1, . . . , K. All of the simulation scenarios are given in Table 4. In total, 72 different simulation scenarios were run, with results based on 5000 replications. E1, WI1, and AI1 represent the small sample size for the equal (E), within-center inequality (WI), and among-centers inequality (AI) designs; the medium sample size is represented by E2, WI2, and AI2, and the large sample size by E3, WI3, and AI3.
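For a given design, each simulated table can be drawn from a multinomial distribution whose cell probabilities match the target margins and odds ratio; p_{11} is the admissible root of a quadratic in the margins and θ. This is a sketch under my own parameterization, not necessarily the authors' exact generator:

```python
import math, random

def cell_probs(row1, col1, theta):
    """Cell probabilities (p11, p12, p21, p22) with row-1 margin `row1`,
    column-1 margin `col1`, and odds ratio `theta`."""
    if abs(theta - 1.0) < 1e-12:
        p11 = row1 * col1  # independence case
    else:
        a = 1.0 - theta
        b = (1.0 - row1 - col1) + theta * (row1 + col1)
        c = -theta * row1 * col1
        disc = math.sqrt(b * b - 4 * a * c)
        roots = ((-b + disc) / (2 * a), (-b - disc) / (2 * a))
        p11 = next(p for p in roots
                   if max(0.0, row1 + col1 - 1.0) < p < min(row1, col1))
    return p11, row1 - p11, col1 - p11, 1.0 - row1 - col1 + p11

def sample_table(n, probs, seed=42):
    """Draw n observations into the four cells (simple multinomial sampling)."""
    rng = random.Random(seed)
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=probs, k=n):
        counts[cell] += 1
    return [counts[:2], counts[2:]]
```

For the balanced design with π_{1.} = π_{.1} = 0.50 and θ = 2, for example, the resulting cell probabilities reproduce the target odds ratio exactly.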
A total of 27 different methods, consisting of four tests each combined with six adjustment methods, plus three additional multiple comparison methods, were considered.
The simulation software was developed in R version 3.6.1 by the author. The BreslowDayTest() and WoolfTest() functions of the DescTools package [32] were used to perform the BD, Tarone, and Woolf tests, and the p.adjust() function of the stats package [28] was used to apply the adjustment methods. The significance level was set to α = 0.05. Table 5 summarizes the possible outcomes in hypothesis testing [4], where m_0 is the number of true null hypotheses and m = K(K − 1)/2 is the number of pairwise comparisons among the K odds ratios.

Measures to evaluate the tests
In Table 5, U is the number of hypotheses that were correctly declared non-significant and S is the number of hypotheses that were correctly declared significant. R is the total number of hypotheses declared significant. V is the number of type-I errors and T is the number of type-II errors.
The measures used to assess the tested hypotheses were divided into power measures and error measures. The former are the any-pair power (ANPP), all pairs power (APP), positive predictive value (PPV), and true negative rate (TNR). The latter are the per-comparison error rate (PCER), family-wise error rate (FWER), and false discovery rate (FDR).
The ANPP is defined as the probability of identifying at least one true difference between the pairs, and the APP is the probability of detecting all of the significant pairs [22,29]. The TNR is the proportion of correctly declared non-significant hypotheses (TNR = U/(m − R)). The PPV, or precision, is the proportion of correctly declared significant hypotheses (PPV = S/R). The FWER is the probability of having at least one type-I error over the comparisons, and the PCER is the probability of observing a type-I error in any comparison [10]. The FDR is the proportion of falsely declared significant hypotheses (FDR = V/R) [4].
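Per replication, these measures follow directly from the Table 5 counts. A minimal sketch (the function name and the per-replication ANPP/APP indicators, which would be averaged over replications, are my own formulation; TNR is computed as U/(m − R) as defined in the text):

```python
def measures(U, S, T, V):
    """Measures for one replication from the Table 5 counts:
    U correctly non-significant, S correctly significant,
    T type-II errors (missed pairs), V type-I errors (false positives)."""
    m = U + S + T + V              # number of pairwise hypotheses
    R = S + V                      # hypotheses declared significant
    return {
        "PPV": S / R if R else float("nan"),
        "FDR": V / R if R else 0.0,
        "TNR": U / (m - R) if m > R else float("nan"),
        "ANPP": 1.0 if S > 0 else 0.0,           # at least one true pair found
        "APP": 1.0 if S > 0 and T == 0 else 0.0,  # all true pairs found
    }
```

For example, with U = 4, S = 3, T = 1, and V = 2, the replication counts toward the ANPP but not the APP, since one true difference was missed.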
In the simulation study, m odds ratio pairs were compared using the multiple comparison procedures. Table 5 was created for each replication of each scenario mentioned in Section 4.1, and the mean values of the assessment measures were then calculated. The ANPP, APP, PPV, TNR, PCER, FWER, and FDR measures were used for the scenarios with P1- and P2-type alternative hypotheses. Because at least one true non-significant hypothesis is required to calculate the TNR, PCER, FWER, and FDR, and the F-type alternative hypothesis is defined as the case where all of the true odds ratios are different, only the ANPP, APP, and PPV were calculated for the F-type alternative hypotheses.

Simulation results
The values of the power and error measures for all of the methods and scenarios are presented in Tables 6-9. Not all of the results are tabulated here due to limited space; some of them are given in Tables 1-28 of the Supplemental Material.

ANPP
The ANPP results for the case where only one odds ratio was different are summarized in Tables 6 and 7 (see Tables 1-4 of the Supplemental Material for the cases where more than one of or all of the odds ratios were different). The ANPP values ranged between 0 and 1, and high values indicate high power, which is desirable. The results are interpreted below by the number of strata, sample size design, and method.
• Results based on the number of strata: The probability of identifying at least one true difference between the pairs was not impacted by the number of strata for the large sample sizes. As the number of strata increased from 3 to 7, when only one of the odds ratios was different and the sample size was small, the ability of the tests to identify at least one difference between the pairs decreased by 24% on average for all of the tests except the BD-based and chi-square-based LSD tests and the YA test.
• Results based on the number of different odds ratios: We compared the ANPP values by the number of different odds ratios (P1-, P2-, and F-type alternative hypotheses). The probability of identifying at least one true difference between the pairs was not notably impacted by changes in the number of different odds ratios for the large sample sizes (E3, WI3, or AI3). For the tables with 3 strata, the ability of the tests to identify at least one difference between the pairs was higher when all of the odds ratios were different than when only one of the odds ratios was different. For the tables with 5 or 7 strata, in general, the ability of the tests to identify at least one difference between the pairs increased as the number of different odds ratios increased.
• Results based on the sample size design: For all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test, the ability of the tests to identify at least one difference between the pairs increased by 24.5% on average when the sample size increased. Nevertheless, the BD-based and chi-square-based LSD tests had their highest values when the sample size was small.
• Results based on the multiple comparison methods: For all of the scenarios, when the sample size was large, the probability of identifying at least one true difference between the pairs for all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test was over 0.930. For the tables with 5 or 7 strata, when more than one of or all of the odds ratios were different and the sample size was medium, the ANPP values of these multiple comparison methods were similar and over 0.828 (Tables 1-4 of the Supplemental Material).

APP
The APP results are summarized in Tables 5-10 of the Supplemental Material. The APP values ranged between 0 and 1, and high values of the APP were expected. The APP values were mostly lower than those of the ANPP. For the tables with 5 and 7 strata, the APP values of all of the methods were close to zero when more than one of or all of the odds ratios were different, except for the YA test in some scenarios (Tables 5-6 of the Supplemental Material). Thus, a comparison of the APP by strata was possible for only some of the scenarios.
• Results based on the number of strata: When only one of the odds ratios was different, the probability of identifying all of the pairs that were actually different decreased by 17.6% on average as the number of strata increased, for all of the tests except the YA test.
• Results based on the number of different odds ratios: In general, when the number of different odds ratios increased, the power of the methods decreased. When more than one of or all of the odds ratios were different, the ability of the tests to identify all of the pairs that were actually different was mostly close to zero, except for the YA test.
• Results based on the sample size design: The probability of identifying all of the pairs that were actually different for all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test increased when the sample size increased.
• Results based on the multiple comparison methods: Considering the ability of the tests to identify all of the pairs that were actually different, the YA test performed better than the other tests in most of the scenarios. The APP values of all of the tests except the YA test were very low.

PPV
The PPV results are summarized in Tables 11-14 of the Supplemental Material. All of the PPV values for the case where all of the odds ratios were different were found to be 1. High PPV values are desired. The PPV values of the BD-based and chi-square-based LSD tests could not be calculated for the large sample size design because no significant differences were declared during the simulation runs.
• Results based on the number of strata: The proportion of comparisons in which the tests correctly declared differences between the odds ratios was not impacted by the number of strata, except in some of the scenarios of the BD-based and chi-square-based LSD tests and the YA test.
• Results based on the number of different odds ratios: In general, when the number of different odds ratios increased, the ability of all of the tests to correctly declare differences between the odds ratios increased by 1.4% on average.
• Results based on the sample size design: The proportion of comparisons in which the tests correctly declared differences between the odds ratios was not affected by changes in the sample size design, except in some of the scenarios of the BD-based and chi-square-based LSD tests and the YA test. The lowest values of the YA test were found for the tables with among-centers inequality.
• Results based on the multiple comparison methods: High PPV values were found in most of the scenarios and for most of the methods. In general, the ability of the YA test to detect the true differences between the odds ratios was mostly slightly lower than that of the other methods, but its average value was 0.870. The PPV of the YA test was lower than 0.50 for the tables with 5 or 7 strata, one different odds ratio (P1-type), and an among-centers inequality design (AI).

TNR
The TNR results are summarized in Tables 15-18 of the Supplemental Material. As with the PPV, high TNR values are desired.
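To make the PPV and TNR concrete, both can be computed from the classification of the pairwise decisions in a single simulation run. The following is a minimal sketch; the truth and rejection patterns below are hypothetical and not taken from the study.

```python
import numpy as np

# Hypothetical outcome of one simulation run over m = 6 pairwise comparisons:
# truth[i] is True when the pair of odds ratios really differs,
# reject[i] is True when the test declared the pair different.
truth  = np.array([True, True, False, False, False, False])
reject = np.array([True, False, False, False, True, False])

tp = np.sum(truth & reject)    # correctly declared significant
fp = np.sum(~truth & reject)   # falsely declared significant
tn = np.sum(~truth & ~reject)  # correctly declared non-significant

ppv = tp / (tp + fp)  # proportion of declared differences that are real
tnr = tn / (tn + fp)  # proportion of true non-differences kept non-significant
```

In the simulation study, these per-run quantities are then averaged over the runs of each scenario.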
• Results based on the number of strata: When only one of the odds ratios was different (P1-type) and the sample size was small (E1, WI1, or AI1), the probability of the tests correctly declaring non-significant differences between the odds ratios increased by 16.3% on average as the number of strata increased from 3 to 7. When more than one odds ratio was different, the ability of the tests to correctly declare non-significant differences between the odds ratios that were actually not different did not vary much with the number of strata.
• Results based on the number of different odds ratios: In all of the scenarios, the ability of the tests to correctly declare non-significant differences between the odds ratios was higher when only one of the odds ratios was different than when more than one of the odds ratios was different.
• Results based on the sample size design: For all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test, the ability of the tests to correctly declare non-significant differences between the odds ratios increased by 17.3% on average when the sample size increased. In general, the highest TNR values of all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test were observed in the large sample size design. On the other hand, the highest TNR values of the BD-based and chi-square-based LSD tests were observed in the within-center inequality design with a small sample size.
• Results based on the multiple comparison methods: The Benjamini-Hochberg adjusted tests showed better performance for the tables with 3 strata. The chi-square-based test showed better performance for the tables with 3 or 5 strata and the within-center inequality design with a small sample size. The YA test mostly showed a much higher ability to correctly declare non-significant differences between the odds ratios for all of the scenarios with 5 and 7 strata.

PCER
The PCER results are summarized in Tables 19-22 of Supplemental Material. The PCER values were expected to be around α = 0.05.
• Results based on the number of strata: The change in the number of strata had only a slight effect (around 0.001 ± 0.015) on the probability of observing a type-I error in any comparison. When only one of the odds ratios was different and the table had among-centers inequality with a medium or large sample size, the YA test was the only method that diverged from α = 0.05 as the number of strata increased.
• Results based on the number of different odds ratios: While the YA test mostly performed better for the tables with more than one different odds ratio than for the case with only one different odds ratio, the change in the number of different odds ratios had only a slight effect on the probability of observing a type-I error in any comparison for the other methods.
• Results based on the sample size design: The probability of observing a type-I error in any comparison for the adjusted BD, Tarone, Woolf, and Peto tests was not impacted by the sample size design. The probability of observing a type-I error in any comparison for the BD-based and chi-square-based LSD tests converged to α when the table had the within-center inequality design with a small sample size. The PCER value of the YA test diverged from α when the table had an among-centers inequality design with a large sample size, followed by a medium sample size.
• Results based on the multiple comparison methods: When one or more than one odds ratio was different and the table had an among-centers inequality design with a medium or large sample size, the Benjamini-Hochberg adjusted BD, Tarone, and Woolf tests showed the best PCER performance. When only one of the odds ratios was different and the table had the within-center inequality design with a medium or large sample size, the Benjamini-Hochberg adjusted Peto test showed slightly better PCER performance. The BD-based and chi-square-based LSD tests showed the best PCER performance for the within- or among-centers inequality designs with a small sample size. The YA test showed the best PCER performance for the tables with 3 strata, only one different odds ratio, and an equal and small sample size; for the tables with 5 strata, only one different odds ratio, and an equal or within-center inequality design with a small sample size; and for the tables with 7 strata with medium or large sample sizes.

FWER
The FWER results for the case where only one odds ratio was different are summarized in Tables 8 and 9 (see Tables 23 and 24 of the Supplemental Material for the cases where more than one of the odds ratios were different). The FWER values are also expected to be around α = 0.05.
• Results based on the number of strata: In all of the scenarios, the probability of having at least one type-I error over the scenarios of all of the adjusted BD, Tarone
Abbreviations: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; BDLSD, BD-based LSD; CSLSD, chi-squared-based LSD.
• Results based on the multiple comparison methods: In most of the scenarios, the Benjamini-Hochberg adjusted tests controlled the type-I error rate better than the others. In all of the scenarios, the YA test was the method that diverged most from α = 0.05.
Considering that the FWER is the probability of having at least one type-I error over all of the comparisons and the PCER is the probability of observing a type-I error in any single comparison, both the PCER and the FWER are functions of the number of type-I errors (V). For each simulation run in which there was no type-I error, V = 0. Because these measures were averaged over the runs, this produced low PCER and FWER values. When the PCER and FWER were low, this indicated that the tests controlled the type-I error rate well. Except for the BD-based and chi-squared-based LSD tests and the YA test in some scenarios, low PCER and FWER values were found for the tests. We also found that the PCER values were less than or equal to the FWER values, in line with the literature.
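Under the common definitions PCER = E(V)/m and FWER = P(V ≥ 1), the ordering PCER ≤ FWER can be checked with a small Monte Carlo sketch. The all-null setting and the number of comparisons below are illustrative assumptions, not the study's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative all-null setting: every pairwise null hypothesis is true,
# so every rejection at level alpha is a type-I error.
n_runs, m, alpha = 10_000, 10, 0.05
p = rng.uniform(size=(n_runs, m))  # raw p-values, uniform under H0
V = np.sum(p < alpha, axis=1)      # number of type-I errors per run

pcer = V.mean() / m                # per-comparison error rate, E(V)/m
fwer = np.mean(V > 0)              # family-wise error rate, P(V >= 1)

# Without any adjustment, PCER stays near alpha while FWER inflates
# towards 1 - (1 - alpha)**m.
```

Averaging V (or the indicator V > 0) over runs in which V = 0 is what pulls both measures down, as noted above.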

FDR
The FDR results are given in Tables 25-28 of the Supplemental Material. A low FDR value is expected from a good method. The FDR was calculated as FDR = V/R, where R was the number of hypotheses that were declared significant. Because the BD-based and chi-square-based LSD tests did not indicate significant differences between any pair of the odds ratios tested, no significant difference was obtained during the simulation runs and the number of hypotheses declared significant was 0. Consequently, the proportion of falsely declared significant hypotheses of the BD-based and chi-square-based LSD tests could not be calculated for the large sample size tables. Because FDR = 1 − PPV, the FDR results were consistent with the PPV results.
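The identity FDR = 1 − PPV holds whenever at least one hypothesis is declared significant (R > 0); a minimal sketch with hypothetical counts:

```python
# Hypothetical counts from one simulation run:
# V = falsely declared significant, S = correctly declared significant.
V, S = 2, 6
R = V + S  # hypotheses declared significant

# FDR (and PPV) are undefined when R == 0, which is why they could not
# be calculated for the LSD tests in the large sample size tables.
fdr = V / R if R > 0 else float("nan")
ppv = S / R if R > 0 else float("nan")

assert fdr == 1 - ppv  # FDR = 1 - PPV when R > 0
```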

Discussion
In meta-analysis studies, comparisons of the odds ratios may lead researchers to detect statistical heterogeneity. In some specific studies, instead of pooling different studies, the results are pooled over a third factor (e.g. age, gender, country), and this may cause clinical heterogeneity. Even though Fletcher [14] argued that judgments about clinical heterogeneity are qualitative and do not involve any calculations, the clinical heterogeneity among the participant characteristics can be detected by comparing the odds ratios. In this study, the focus was placed on meta-analysis studies in which the results are pooled over a third factor and the odds ratios are non-homogeneous. The necessity for a multiple comparison procedure was discussed for the case where the null hypothesis of homogeneity of the odds ratios is rejected.
The results showed that some of these methods controlled the type-I error rates at the desired level, while some of them were more powerful than others. By considering the power and type-I error performance together, some promising tests were identified for the considered scenarios. The tests recommended in light of the main findings of the study are summarized in Table 10.
The Breslow-Day, Tarone, Woolf, and Peto tests were suggested for testing the homogeneity of more than three odds ratios. Multiple comparison procedures are applied when the odds ratios are heterogeneous. We discussed the performance of these tests within the multiple comparison procedure, using adjustment methods to control the type-I and type-II errors. The BD-based and chi-square-based LSD tests and the YA test were specifically proposed as multiple comparison tests of odds ratios; accordingly, they behaved differently from the other tested methods. Unexpectedly, even though the BD-based and chi-square-based LSD tests are multiple comparison tests, their performance was below that of the others.
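As an illustration of the step-up adjustment that performed well in this study, the Benjamini-Hochberg adjusted p-values can be sketched as follows; the raw p-values are hypothetical and the function is a generic implementation, not the authors' code.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (generic sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

# Hypothetical raw p-values for the 6 pairwise comparisons of 4 strata
p_raw = [0.001, 0.012, 0.020, 0.056, 0.300, 0.440]
p_adj = benjamini_hochberg(p_raw)
# a pair is declared different when its adjusted p-value is below 0.05
```

The same adjustment can be combined with any of the pairwise homogeneity tests (BD, Tarone, Woolf, or Peto) by feeding in the corresponding raw p-values.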
The simulation space of this numerical study covered the COVID-19 data given in Section 3. The data had 4 strata, and only the odds ratio for China was greater than the others (only one different odds ratio). While the studies from Italy and France had large sample sizes, those from Greece and China were smaller. The simulation study results showed that the tests with Benjamini-Hochberg adjustment were more powerful and controlled the error rates for the tables with 4 strata, only one different odds ratio, and an among-centers inequality design. The Benjamini-Hochberg adjusted tests indicated a difference between the odds ratio for China and those of the other countries, which was consistent with the simulation study results. Only the Woolf test with Benjamini-Hochberg adjustment found the difference between the odds ratios for China and France to be slightly non-significant (p = 0.056). As the simulation results showed that the Bonferroni, Holm, Hochberg, and Hommel adjusted BD, Tarone, and Woolf tests and the BD-based and chi-square-based LSD tests were not suitable for the multiple comparisons of odds ratios for such tables, these tests did not indicate any difference between the odds ratios. The YA test, which was less conservative than the others, indicated a statistically significant difference between the odds ratios for Greece and France. Because it is difficult to design a simulation study by considering the heterogeneity of odds ratios and the difference between the sample sizes in each stratum, this study was limited in that the maximum number of strata was 7. Even though the theoretical minimum is 2, working with a larger number of studies makes the meta-analysis more powerful and reliable.
Abbreviations: TOR, true odds ratio; P1, only one of the odds ratios is different; P2, more than one of the odds ratios is different; F, all of the odds ratios are different; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.
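A pairwise Woolf-type comparison such as the China-France one above can be sketched as a z-test on the difference of two log odds ratios. The 2x2 counts below are hypothetical, and the continuity correction for zero cells is omitted for brevity.

```python
import math

def log_or(a, b, c, d):
    """Log odds ratio and its Woolf variance for a 2x2 table [[a, b], [c, d]]."""
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def pairwise_woolf(t1, t2):
    """Two-sided z-test for equality of the odds ratios of two strata."""
    l1, v1 = log_or(*t1)
    l2, v2 = log_or(*t2)
    z = (l1 - l2) / math.sqrt(v1 + v2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical severe/non-severe counts with and without diabetes in two strata
z, p = pairwise_woolf((30, 70, 10, 90), (15, 85, 12, 88))
```

The raw p-values from all pairwise calls would then be passed through a multiplicity adjustment such as Benjamini-Hochberg before declaring any pair of strata different.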

Disclosure statement
No potential conflict of interest was reported by the author(s).