Revisiting the analysis pipeline for overdispersed Poisson and binomial data

Overdispersion is a common feature in categorical data analysis and several methods have been developed for detecting and handling it in generalized linear models. The first aim of this study is to clarify the relationships among various score statistics for testing overdispersion and to compare their performances. In addition, we investigate a principled way to correct finite sample bias in the score statistic caused by estimating regression parameters with restricted likelihood. The second aim is to reconsider the current practice for handling overdispersed categorical data. Although the conventional models are based on substantially different mechanisms for generating overdispersion, model selection in practice has not been well studied. We perform an intensive numerical study for determining which method is more robust to various overdispersion mechanisms. In addition, we provide some graphical tools for identifying the better model. The last aim is to reconsider the key assumption for deriving the score statistics. We study the meaning of testing overdispersion when this assumption is violated, and we analytically show the conditions for which it is not appropriate to employ the current statistical practices for analyzing overdispersed data.


Introduction
Overdispersion is an important issue when analyzing data with Poisson and binomial generalized linear models (GLMs) [1,3,4]. Hinde and Demétrio [15] mentioned that if overdispersion is ignored, standard errors are seriously underestimated and inferences for regression coefficients are problematic. Liang [19] considered overdispersion in the context of testing for homogeneity with many strata and emphasized the importance of testing overdispersion because substantial efficiency gains can be made when homogeneity holds.
Researchers have considered a random intercept model as a way of generating overdispersion [15,19,21]. The random intercept, or more generally a random effect, often represents an unmeasured covariate or neglected heterogeneity. The random effect has a variance equal to zero in the case of no overdispersion, and a positive variance indicates overdispersion. In the literature, variance component testing refers to testing whether the random effect has zero variance. Two widely used approaches for variance component testing are the likelihood ratio (LR) test and the score test. The LR statistic requires specifying the random effect distribution and requires estimates from both the null and alternative models, whereas the score statistic does not require specifying the random effect distribution and requires estimates only under the null model. For these reasons, using the score statistic for testing overdispersion has gained popularity in the statistics literature. However, score statistics have been proposed in different forms: Liang [19], Lin [21], Dean [11], and Zhang and Lin [37] employed asymptotically normal tests, whereas Verbeke and Molenberghs [31] used a mixture of chi-squares. In this analysis, we start by clarifying the relationships among various score statistics in the context of overdispersion testing.
For Poisson and binomial GLMs, Dean [11]'s score test is popular and is known to be reliable. For example, O'hara Hines [23] concluded that Dean [11]'s score test is more dependable than competing tests when the data contain a large number of small expected counts. When constructing the score statistic, the regression parameters are replaced by their maximum likelihood estimates (MLEs) under the null. To mitigate the finite sample bias caused by using MLEs, Dean [11] modified the score statistic so that its numerator has mean closer to zero under the null. However, this modification was not derived in a principled way. Furthermore, Dean [11] corrected only the numerator of the score statistic, and the denominator remained the same. Thus, the asymptotic variance term of Dean [11]'s modified score statistic does not take into account the uncertainty introduced by the modification term, so the asymptotic normal approximation still may not be satisfactory in finite samples. To overcome this limitation, we study a more principled way of deriving the modified score statistic from the restricted likelihood [5,13,25]. The restricted likelihood provides a likelihood justification for both Dean [11]'s correction and an approximate asymptotic variance that accounts for the modification term. We further consider a parametric bootstrap approach to check whether the finite sample performance of the proposed score statistic can be improved further. We compare the performances of several score statistics in terms of their type I errors and powers for various binomial and Poisson GLMs.
In addition to conducting a comparative study, we critically investigate the current statistical practices for analyzing overdispersed Poisson and binomial data. Two approaches are widely used to deal with overdispersion. The first is the so-called quasi-likelihood approach, which incorporates an additional dispersion parameter into the Poisson and binomial variances. The second is to use distributions that allow for overdispersion; for Poisson and binomial data, the negative binomial and beta-binomial distributions are popular choices, respectively. For these two approaches, two interesting questions arise: first, which method is more robust to various overdispersion generating mechanisms, and second, how to identify which method is more appropriate for a given dataset. To address these questions, we investigate the type I errors and powers of the two approaches under various random effect models, and we introduce some graphical methods to identify the better approach.
In deriving the score tests, the independence of observed covariates and the random intercept was assumed. However, in many observational studies, the distribution of the random effect is likely to depend on observed covariates. We study the implications of testing overdispersion analytically when the independence assumption is violated, and we explain why it is very dangerous to employ the current statistical practices for analyzing overdispersed data in that situation.
The remainder of this paper is organized as follows. In Section 2, we review the score statistics and derive the modified score statistic from the restricted likelihood. We also clarify the relationships among the score statistics. A comparative study of score statistics is given in Section 3. Sections 4 and 5 investigate two approaches for overdispersed Poisson and binomial data under various overdispersion generating models, and simple graphical methods are proposed to identify the better approach. Section 6 studies the meaning of testing overdispersion when the independence between observed covariates and the random intercept is violated. Through two real examples, we illustrate the proposed method and pipeline for analyzing overdispersed categorical data in Section 7. Concluding remarks are provided in Section 8. We provide the R-package OST (overdispersion score test) to implement the proposed score test.

Score tests for overdispersed categorical data
In this section we investigate the relationship among various score tests proposed in the literature.

A review of score statistics and their relationships
We consider the following six score test statistics: Cox [9]'s overdispersion test, Liang [19]'s homogeneity test, Dean [11]'s score test, Commenges et al. [8]'s score test, Lin [21]'s score test, and Verbeke and Molenberghs [31]'s score test. Although these score tests can be applied more generally, we restrict them to testing overdispersion in Poisson or binomial GLMs.
Suppose that the data are composed of n observations with responses y_i and p × 1 covariate vectors x_i associated with the fixed effect β. To generate overdispersion, we introduce the random effect v_i into the regression models. For Poisson data, y_i | v_i is assumed to follow a Poisson distribution with the conditional mean specified by

log μ_i = log E(y_i | v_i) = x_i^T β + v_i. (1)

For binomial data, we assume that y_i | v_i ~ Bin(m_i, p_i) with

logit(p_i) = x_i^T β + v_i. (2)

The model specification is completed by assuming that the random effects v_i are generated as independent and identically distributed (iid) from some distribution G with E(v_i) = 0 and Var(v_i) = τ ≥ 0. Testing overdispersion corresponds to testing H_0: τ = 0. Under the null hypothesis, the model becomes an ordinary GLM. The marginal likelihood for (β, τ) takes the form L(β, τ) = Π_i ∫ f(y_i | v_i; β) dG(v_i), where f(y_i | v_i; β) is the Poisson or binomial distribution of y_i given v_i. A big advantage of using score statistics is that a full specification of G is not required and they are computed only with parameter estimates under the null hypothesis. Therefore, standard software that can fit GLMs is sufficient to evaluate the score statistics.
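The random intercept mechanism is easy to check empirically. The following Python sketch (with arbitrary illustrative parameters, not the paper's simulation design) generates Poisson responses with and without a normal random intercept and compares the sample variance-to-mean ratio:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
beta0 = 1.0  # intercept-only linear predictor (illustrative choice)

# No random effect: ordinary Poisson, Var(y) = E(y)
y_null = rng.poisson(np.exp(beta0), size=n)

# Random intercept v_i ~ N(0, tau): the marginal variance exceeds the mean
tau = 0.5
v = rng.normal(0.0, np.sqrt(tau), size=n)
y_over = rng.poisson(np.exp(beta0 + v), size=n)

ratio_null = y_null.var() / y_null.mean()   # close to 1
ratio_over = y_over.var() / y_over.mean()   # clearly above 1
print(ratio_null, ratio_over)
```

Without the random effect the ratio stays near one; with Var(v_i) = 0.5 the marginal variance clearly exceeds the mean, which is exactly the overdispersion the score tests target.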
where E denotes the expectation at τ = 0 and β is replaced by its MLE β̂ under H_0. Liang [19] proposed using T_L = S_L/√I, where I = I_ττ − I_τβ (I_ββ)^(−1) I_τβ^T. Liang [19] showed that this statistic is obtained straightforwardly by using L'Hôpital's rule under some regularity conditions. Cox [9] used the same statistic in the case of no covariates and outlined a method for generalizing it to regression models. Commenges et al. [8] also derived a score test based on (3) for model (2); their score statistic is a special case of T_L. As an extension of Dean and Lawless [10]'s test for overdispersion in Poisson GLMs, Dean [11] proposed a general overdispersion testing method for y_i following the natural exponential family, where θ_i = θ_i(x_i; β). Poisson and binomial GLMs are special cases of (4). Dean [11] derived the score statistic S_D for testing H_0: τ = 0 by using a Taylor expansion, with primes denoting differentiation with respect to θ_i and the explicit form of b_i depending on the choice of distribution. Since S_L = S_D for the natural exponential family, Liang [19]'s and Dean [11]'s statistics are equivalent.

Dean's score test for overdispersion: Poisson case
For the Poisson case, the explicit form of Dean [11]'s score statistic for model (1) is

T^P_D1 = Σ_i {(y_i − μ̂_i)² − y_i} / √(2 Σ_i μ̂_i²),

where μ̂_i denotes the value of μ_i at β = β̂. Cameron and Trivedi [6] used the same statistic for testing overdispersion in a Poisson regression model versus a negative binomial regression model. To correct the finite sample bias of β̂ in the numerator of T^P_D1, Dean [11] noted that E[Σ_i {(y_i − μ̂_i)² − y_i}] ≈ −Σ_i ĥ_i μ̂_i, where ĥ_i is the ith diagonal element of H_1 = W_1^{1/2} X (X^T W_1 X)^(−1) X^T W_1^{1/2}, X is the covariate matrix for which x_i^T is the ith row, and W_1 is a diagonal matrix with (W_1)_ii = μ̂_i. Using this, Dean [11]'s modified score statistic is

T^P_D2 = Σ_i {(y_i − μ̂_i)² − y_i + ĥ_i μ̂_i} / √(2 Σ_i μ̂_i²),

and she argued that this statistic converges very quickly to normality.
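For intuition, under an intercept-only Poisson model the MLE reduces to μ̂_i = ȳ and the hat values to ĥ_i = 1/n, so both statistics can be computed in a few lines. This Python sketch is illustrative only; it assumes the intercept-only simplification and is not the paper's OST implementation:

```python
import numpy as np

def dean_poisson_stats(y):
    """Dean-type score statistic T_D1 and its bias-corrected version T_D2
    for an intercept-only Poisson model (muhat = ybar, h_i = 1/n)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu = np.full(n, y.mean())
    h = np.full(n, 1.0 / n)            # hat-matrix diagonal, intercept-only case
    denom = np.sqrt(2.0 * np.sum(mu ** 2))
    t_d1 = np.sum((y - mu) ** 2 - y) / denom
    t_d2 = np.sum((y - mu) ** 2 - y + h * mu) / denom
    return t_d1, t_d2

rng = np.random.default_rng(1)
y_eq = rng.poisson(3.0, size=500)                          # equidispersed
y_ov = rng.poisson(np.exp(1.0 + rng.normal(0, 0.7, 500)))  # overdispersed
print(dean_poisson_stats(y_eq), dean_poisson_stats(y_ov))
```

For equidispersed data the statistic behaves like a standard normal draw; the random intercept data push it far into the right tail.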

Dean's score test for overdispersion: binomial case
In order to test H_0: τ = 0 for model (2), the explicit form of Dean [11]'s score statistic T^B_D1 involves the weight matrix W_1, a diagonal matrix with W_1i as its ith diagonal element. Note that our formula for V_2, the latter term of the variance V, is not the same as Dean [11]'s: the factor of 1/4 in that term of Dean [11]'s formula is a typo and should be removed. As in the Poisson case, Dean [11] corrected the finite sample bias of β̂ in the numerator of T^B_D1, yielding the modified score statistic T^B_D2.

Alternative tests for detecting overdispersion
To derive the score statistic for testing overdispersion, Lin [21] considered the second-order Taylor expansion of f(y_i | v_i; β) at v_i = 0 for GLMs with random effects. This technique is the same as that for deriving the marginal quasi-likelihood [5]. To test H_0: τ = 0, the proposed statistic T_Lin was referred to χ²_1, the chi-square distribution with one degree of freedom; the relationship T_Lin = T_L² holds. Verbeke and Molenberghs [31] considered a score statistic of the form (5), evaluated at the estimates under H_0. When the unconstrained estimate satisfies τ̂ > 0, T_VM equals T_L²; otherwise Σ_i ∂ℓ_i/∂τ ≤ 0 at τ = 0, and T_VM = 0. Thus, the statistic (5) can be reexpressed accordingly. By the results of [26], it can be shown that T_VM follows the distribution 0.5χ²_0 + 0.5χ²_1. One might be curious about why the asymptotic distributions for the score test statistics differ. There are three distributions: N(0, 1) (normal), χ²_1 (chi-square), and 0.5χ²_0 + 0.5χ²_1 (a mixture of chi-square distributions). The differences are determined by the choice between one-sided and two-sided tests. Under H_0, for the asymptotically normally distributed score statistic, the type I error remains the same regardless of the choice of a one-sided or two-sided test as long as the tests have the same significance level. Using χ²_1 is equivalent to using a two-sided test with the asymptotically normally distributed score statistic, while using the mixture of chi-square distributions corresponds to performing a one-sided test because it involves τ̂ > 0 only. Lin [21] used χ²_1 (i.e. a two-sided test), whereas Liang [19], Dean [11] and Commenges et al. [8] used one-sided tests with an N(0, 1) statistic, and Verbeke and Molenberghs [31] used 0.5χ²_0 + 0.5χ²_1. Whether a one-sided or two-sided test is used influences the power of the test. In general, a one-sided test has larger power for overdispersion testing because the score statistic is positive with high probability under H_1.
Accordingly, Zhang and Lin [37] noted that the two-sided score test has the correct size under the null hypothesis but is subject to some loss of power. Hall and Praestgaard [12] argued that the power of Lin's test can be improved by concentrating the test in the directions given a priori by the constraint that the covariance matrix of the random effects is valid (the variance component of the random effects is nonnegative). In addition, Verbeke and Molenberghs [31] demonstrated why the one-sided score test should be strongly favored for variance component testing.
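The equivalences among the three reference distributions are easy to verify numerically. The sketch below converts a single normal score statistic t into the three p-values; by construction the mixture p-value coincides with the one-sided normal p-value, and the χ²_1 p-value is twice it:

```python
import math

def pvalues(t):
    """p-values for the same score statistic t under the three references:
    one-sided N(0,1), two-sided chi^2_1 (for t^2), and 0.5*chi^2_0 + 0.5*chi^2_1."""
    p_one_sided = 0.5 * math.erfc(t / math.sqrt(2.0))  # P(Z > t)
    p_chisq1 = math.erfc(abs(t) / math.sqrt(2.0))      # P(chi^2_1 > t^2)
    p_mixture = 0.5 * p_chisq1 if t > 0 else 1.0       # point mass 0.5 at zero when t <= 0
    return p_one_sided, p_chisq1, p_mixture

p1, p2, pm = pvalues(1.8)
print(p1, p2, pm)
```

At t = 1.8 the one-sided (and mixture) p-value is about 0.036, so the one-sided test rejects at the 5% level while the two-sided χ²_1 test, with p-value about 0.072, does not, illustrating the power loss discussed above.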

Modification of the score statistics based on restricted likelihood
Dean [11]'s finite sample correction often improves the convergence of the original score statistic to normality, but our numerical study in Section 3 shows that its performance is still not satisfactory. This is because the asymptotic variance used by Dean [11] does not take into account the uncertainty caused by the modification term in the numerator of the score statistic.
To overcome this limitation of Dean [11], we investigate the restricted likelihood-based score test as a principled way of adjusting for the finite sample bias caused byβ. In the context of generalized linear mixed models (GLMMs), Breslow and Clayton [5] developed a normal theory procedure based on restricted likelihood for variance component testing. The details of the score function and the approximate Fisher information matrix for GLMMs were provided in their analysis, and some theoretical properties of the score statistic for GLMMs were studied [21]. We apply Breslow and Clayton [5]'s results to our overdispersion setting.
Following Breslow and Clayton [5]'s derivation steps, the restricted likelihood-based score statistic for Poisson data is T^P_BC, given in (6). The numerator of (6) is the same as that of Dean [11]'s modified score statistic; using the restricted likelihood automatically corrects the finite sample bias due to β̂. However, the denominator is different from Dean [11]'s. If we ignore the matrix H in the denominator of (6), T^P_BC becomes T^P_D2. If we ignore H in both the numerator and denominator of (6), T^P_BC becomes T^P_D1. Similarly, the restricted likelihood-based score statistic for binomial data is T^B_BC. Its numerator is the same as that of Dean [11]'s modified score statistic, while its denominator differs from Dean [11]'s. For the binomial case, even if H is ignored in the denominator of T^B_BC, T^B_BC is not equivalent to T^B_D2. In addition, we consider the parametric bootstrap approach for testing overdispersion with T^P_BC (or T^B_BC). The implementation proceeds as follows. For a given dataset, we fit the ordinary GLM corresponding to the null hypothesis. Then, we generate simulated data from the fitted model B times and compute T^P_BC (or T^B_BC) for each dataset. These bootstrap statistics are denoted by T^{P*}_{BC,b} and T^{B*}_{BC,b} for b = 1, ..., B, and the bootstrap p-value is the proportion of bootstrap statistics that exceed the observed statistic. In our application, B is 1000. Compared with computing the p-value from the asymptotic normal distribution, the parametric bootstrap approach is computationally intensive. However, Sinha [27] mentioned that the parametric bootstrap test has empirical type I errors much closer to the nominal ones and a higher power than asymptotic score tests for variance component testing in GLMMs. Yang et al. [35] employed the parametric bootstrap approach to improve the underestimated nominal significance level of the asymptotic score tests for a generalized Poisson model.
However, since the existing literature considered only score tests that did not correct the finite sample bias caused by estimating β, it is worth investigating the numerical performance of the parametric bootstrap approach when the restricted likelihood-based score statistic is used. This performance is discussed in the numerical study presented later.
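The bootstrap recipe above can be sketched generically. The snippet below uses a simple intercept-only dispersion statistic as a stand-in for T^P_BC (an assumption for illustration; the actual statistic is the restricted likelihood-based one): fit the null Poisson model, regenerate B null datasets, and compare the observed statistic to the bootstrap distribution.

```python
import numpy as np

def dispersion_stat(y):
    # Score-type statistic: large values indicate Var(y) > E(y) (intercept-only sketch)
    mu = y.mean()
    return np.sum((y - mu) ** 2 - y) / np.sqrt(2 * len(y) * mu ** 2)

def parametric_bootstrap_pvalue(y, B=1000, rng=None):
    """Fit the null Poisson model (muhat = ybar), simulate B null datasets,
    and return the bootstrap p-value for the observed statistic."""
    rng = rng or np.random.default_rng(0)
    t_obs = dispersion_stat(y)
    mu_hat = y.mean()
    t_boot = np.array([dispersion_stat(rng.poisson(mu_hat, size=len(y)))
                       for _ in range(B)])
    return (1 + np.sum(t_boot >= t_obs)) / (B + 1)

rng = np.random.default_rng(7)
y_null = rng.poisson(2.0, size=200)
y_over = rng.poisson(np.exp(0.7 + rng.normal(0, 0.8, 200)))
print(parametric_bootstrap_pvalue(y_null), parametric_bootstrap_pvalue(y_over))
```

Because the reference distribution is simulated under the fitted null model, no asymptotic normal approximation is needed, which is the point of the parametric bootstrap variant discussed above.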

A note on score tests for overdispersion in binary data
We demonstrate below why the above score tests for overdispersion cannot be applied to binary (i.e. 0-1) data. For simplicity, consider the logistic regression model with only an intercept. Under this model, the numerator terms of the statistics are exactly zero (since y_i² = y_i for binary y_i), and V_2 is exactly zero as well because each of its terms contains the factor m_i(m_i − 1) = 0. Therefore, regardless of whether H_0 holds or not, T^B_D1 has the form 0/0, T^B_D2 has the form of a positive constant divided by zero, and T^B_BC becomes zero. This implies that these statistics are meaningless for testing overdispersion with binary data. The problem persists even when there are covariates. In a numerical study, which is not shown in this paper for brevity, T^B_D1 never rejects H_0 because it always remains small and positive, T^B_D2 always rejects H_0, and T^B_BC always accepts H_0. For binomial data, the sign of overdispersion can be detected by estimating the overdispersion parameter φ, where Var(y_i) = φ m_i p_i(1 − p_i). φ is often estimated by the method of moments (MME) with Pearson residuals, φ̂ = Σ_i (y_i − m_i p̂_i)²/{m_i p̂_i(1 − p̂_i)} / (n − p). Overdispersion is suspected when φ̂ shows strong evidence that φ > 1. However, judging overdispersion based on φ̂ is only valid for binomial data (m_i > 1) and is not applicable to binary data, because φ̂ is forced toward one regardless of the underlying dispersion; for the intercept-only binary model, Σ_i (y_i − p̂)²/{p̂(1 − p̂)} = n exactly. Similar conclusions have been expressed in several ways in the literature. For example, McCullagh [22] pointed out that the deviance and Pearson residuals are uninformative regarding a lack of fit for binary data. Skrondal and Rabe-Hesketh [28] illustrated the impossibility of detecting overdispersion for binary data. Instead of overdispersion testing, under the name of goodness of fit testing, various methods for detecting model misspecification have been developed for binary GLMs. Azzalini et al. [2] developed formal methods to assess the assumptions of binary GLMs: a pseudo likelihood ratio test was considered, and its statistical significance was obtained by simulating the sampling distribution under the fitted null model.
Le Cessie and van Houwelingen [17] developed a goodness of fit test for binary regression models with a canonical link function based on nonparametric kernel methods. Since these methods are not purely designed for testing overdispersion, we do not consider them further in our comparison study.
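The impossibility for binary data can be checked directly: with an intercept-only binary fit, p̂ = ȳ and Σ(y_i − p̂)² = n p̂(1 − p̂), so Pearson's statistic equals n no matter how much heterogeneity generated the data. A small Python check (the Beta mixing distribution is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Binary data with strong unobserved heterogeneity in the success probability
p_i = rng.beta(0.5, 0.5, size=n)   # illustrative mixing distribution
y = rng.binomial(1, p_i)

p_hat = y.mean()                   # intercept-only logistic MLE
pearson_x2 = np.sum((y - p_hat) ** 2) / (p_hat * (1 - p_hat))
phi_hat = pearson_x2 / (n - 1)     # p = 1 estimated parameter (intercept)
print(pearson_x2, phi_hat)
```

Even though the data are generated with extreme heterogeneity, φ̂ = n/(n − 1) sits essentially at one, illustrating why moment-based overdispersion estimates carry no information for binary responses.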

A comparative study for score tests
We first compare the three score tests, T_D1 (Dean [11]'s score statistic), T_D2 (Dean [11]'s modified score statistic), and T_BC (the restricted likelihood-based score statistic), under various random effect distributions. This simulation study investigates the empirical type-I error rates and powers of the score statistics for testing overdispersion in Poisson and binomial GLMs. For each τ, we fit the corresponding GLM with the generated sample (x_i, y_i) for i = 1, ..., 100, and we then calculate T_D1, T_D2, and T_BC. Then, we calculate the proportion of cases for which each score statistic exceeds the critical value z_{1−α} (the 1 − α quantile of N(0, 1)). The simulation results for the type-I error and power are based on 5000 replications. We set the nominal level to α = 0.05.
The results of the empirical type-I error rate (when τ = 0) for the three score tests are shown in Figure 1 for a normally distributed v_i. The results for the other distributions of v_i are similar, so they are provided in the supplementary information. For Poisson GLMs and binomial GLMs with m_i = 5, T_D2 and T_BC generally maintain their type-I error well, and T_D1 tends to be conservative when β_0 is small. When β_0 increases, the powers of all methods increase quickly. Figure 1(g-i) shows different patterns when m_i = 2: T_D2 suffers from an inflated type-I error, and T_BC and T_D1 perform similarly. Overall, when the tests are performed based on asymptotic theory, T_BC shows superior type-I error control relative to the other score tests. Without relying on the asymptotic distribution of the test statistics, we can calibrate the critical value for each test by using Monte Carlo samples generated from the null model. When the three tests have the same type-I error under the null, their powers are numerically very close to each other. This is expected because the numerators of the three test statistics are similar except for the minor modification term that improves the convergence to normality. The results when v_i is normal are provided in the supplementary information, and the results for the other random effect distributions are omitted for brevity.
In our numerical studies, although T_BC exhibits the best performance among the three score tests in terms of type-I error control and power, it is not always satisfactory. For example, T_BC becomes conservative for the binomial case with m_i = 2. Therefore, we utilize the parametric bootstrap approach to see whether the performance of T_BC improves. For the Poisson case and the binomial case with m_i = 5, T_BC and the parametric bootstrap approach show very similar patterns. However, for the binomial case with m_i = 2, the parametric bootstrap approach is quite effective, and the power is much improved (results are in the supplementary information). Therefore, based on our limited numerical studies, we recommend using T_BC for testing overdispersion in Poisson and binomial GLMs with moderate or large m_i's. For binomial GLMs with small m_i's, the parametric bootstrap approach can be used as an alternative to T_BC.
Next, reflecting the request of an anonymous referee, we investigate the effect of misspecified link functions on the power of the proposed score test T_BC. We generate the binomial data using the probit or complementary log-log link function. The details of the simulation setting and results are provided in the supplementary information. Even if the link function is misspecified, T_BC maintains the type-I error well with binomial denominator m_i = 5 and is conservative with m_i = 2. Interestingly, at the same level of τ, the powers of the score test under the misspecified links are larger than those under the logit link.

Two approaches dealing with overdispersed Poisson data
Suppose that y_i ~ Poi(μ_i). Since Var(y_i) = μ_i, the variance is completely determined by the mean function. To be more flexible, the quasi-Poisson (QP) approach introduces an additional dispersion parameter φ so that Var(y_i) = φμ_i. If φ = 1, there is no overdispersion; if φ > 1, overdispersion occurs. For Poisson GLMs, φ is not in the likelihood function, so an extra step is required to estimate it. In practice, the MME is widely used, such as φ̂ = Dev/(n − p), where Dev denotes the deviance statistic, or the analogous estimator based on Pearson's statistic [14]. Note that estimating φ does not affect the regression parameter estimates; only the standard errors are affected. The new standard errors are obtained by multiplying the original standard errors by √φ̂.
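As a sketch of the QP adjustment (intercept-only model for simplicity, with a Pearson-based MME as one common choice), the standard error is inflated by √φ̂:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000
y = rng.poisson(np.exp(1.0 + rng.normal(0, 0.6, n)))   # overdispersed counts

# Intercept-only Poisson fit: muhat = ybar, beta0hat = log(ybar)
mu_hat = y.mean()
se_poisson = np.sqrt(1.0 / (n * mu_hat))               # model-based SE of beta0hat

# Method-of-moments dispersion estimate from Pearson residuals (p = 1 parameter)
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - 1)

# Quasi-Poisson: inflate the standard error by sqrt(phi_hat), not phi_hat
se_qp = np.sqrt(phi_hat) * se_poisson
print(phi_hat, se_poisson, se_qp)
```

For these overdispersed data φ̂ comes out well above one, so the naive Poisson standard error is too small and the QP-adjusted one is noticeably wider.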
The second approach for overdispersed Poisson data is to use a distribution that allows overdispersion. A popular choice is negative binomial (NB) regression, which has Var(y_i) = μ_i + αμ_i² (> μ_i). The NB model can be viewed as a Poisson generalized linear mixed model with saturated random effects following a gamma distribution [14]. Since the NB is a true distribution, the likelihood approach is available, and MLEs are easily obtained with the glm.nb function in the R-package MASS [30].
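The Poisson-gamma mixture view of the NB model is easy to verify by simulation: if v ~ Gamma with mean 1 and variance α, then y | v ~ Poisson(μv) is marginally NB with Var(y) = μ + αμ². A Python sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000
mu, alpha = 2.0, 0.5

# v ~ Gamma(shape=1/alpha, scale=alpha) has E(v) = 1 and Var(v) = alpha;
# mixing Poisson(mu * v) over v yields the NB with Var(y) = mu + alpha*mu^2
v = rng.gamma(1.0 / alpha, alpha, size=n)
y = rng.poisson(mu * v)

print(y.mean(), y.var())   # approx mu = 2 and mu + alpha*mu^2 = 4
```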
In this section, we compare the performances of these two approaches under various mechanisms generating overdispersed Poisson data and make a general recommendation for analyzing them.

A numerical study for quasi-Poisson and negative binomial approaches
We compare the type I errors and powers of the two approaches when the QP approach is true and when the NB model is true, respectively. The R-function rnbinom can generate y_i from the NB with mean μ_i and variance μ_i + α_i μ_i². To generate the QP data, we set log(μ_i) = β_0 + β_1 x_1i + 0.5x_2i and α_i = (φ − 1)/μ_i, which leads to Var(y_i) = φμ_i. For φ, we use two or five. To generate the NB data, the same mean model is used and α_i = 2 or 5. The x_1i's and x_2i's are generated independently from U(0, 1), β_0 is zero, and three sample sizes, n = 50, 100, 500, are considered. To study how the NB (QP) model behaves when data are generated from the QP (NB) model, we look at the type I error when β_1 = 0 and the power when β_1 = 1 based on 5000 replications for each approach. The results are summarized in Table 1. When data are generated from the QP model, the QP model shows slightly better control of the type I error than the NB model. The NB model has a higher power than the QP model, but this comes at the cost of a slightly inflated type I error. When data are generated from the NB model, the NB model shows slightly better type I error control and higher power. Table 1 also shows the mean-squared error (MSE) of β̂_1 when β_1 = 1. When data are generated from the QP model, the MSE for the QP model is generally smaller than that of the NB model. Even when data are generated from the NB model, the MSE for the QP model is smaller than that of the NB model when the sample size is 50 or 100. This is related to the fact that the NB model must estimate an extra parameter α, which is not in the QP model. However, in large samples (n = 500), the MSE for the NB model is smaller than that of the QP model. In conclusion, it is very difficult to recommend either the QP model or the NB model as the first option when there is no subject matter knowledge favoring one of the two models.
Likewise, Ver Hoef and Boveng [32] argued that there is no general solution for determining whether the QP model is better than the NB model and vice versa. Therefore, to be on the safe side, it is worth developing methods to discriminate between the QP and NB models, which will be discussed in Section 4.3.
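The trick of setting α_i = (φ − 1)/μ_i to make NB draws mimic the quasi-Poisson variance can itself be verified by simulation. The sketch below (with an arbitrary illustrative mean model) checks that the pooled Pearson-type ratio recovers φ:

```python
import numpy as np

rng = np.random.default_rng(9)
n_obs = 100000
phi = 3.0
mu_i = np.exp(0.5 + rng.uniform(0, 1, n_obs))   # illustrative mean model

# Observation-specific NB shape r_i = 1/alpha_i with alpha_i = (phi - 1)/mu_i
# gives Var(y_i) = mu_i + alpha_i * mu_i^2 = phi * mu_i, the quasi-Poisson variance
alpha_i = (phi - 1.0) / mu_i
r_i = 1.0 / alpha_i
p_i = r_i / (r_i + mu_i)          # NB "success probability" parameterization
y = rng.negative_binomial(r_i, p_i)

# Pooled check: E[(y - mu)^2 / mu] should be close to phi
phi_check = np.mean((y - mu_i) ** 2 / mu_i)
print(phi_check)
```

This confirms that an observation-specific α turns the NB generator into a QP data-generating mechanism, which is how the Table 1 comparison can be run from a single simulator.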

A numerical study for various random effect distributions
This numerical study checks the performance of the Poisson, QP, and NB models under various Poisson random effect models. Since some of the literature has used the MME for α because of its robustness [16], we include the NB model with the MME for α, denoted by NB*. y_i is generated from Poi(μ_i) with log(μ_i) = β_0 + β_1 x_1i + v_i, where v_i follows a normal, normal-mixture, or skewed distribution: the first is a symmetric distribution, the second is a bimodal distribution, and the third is a skewed case. For √τ, we consider 0, 0.1, 0.3, 0.5, 1. The x_1i's are generated from U(0, 1), and three sample sizes, n = 50, 100, 500, are considered. We calculate the type I error rates when β_1 = 0 and the power when β_1 = 1 based on 5000 replications. We summarize the results in Table S1 (supplementary information). The ordinary Poisson model suffers from the highest type I errors when random effects exist. For symmetric random effects such as the normal or normal mixture, the QP model controls the type I error appropriately. The NB model shows some inflated type I errors when the variance component is large; for example, see the results when τ = 1. For a skewed random effect, both the QP and the NB model suffer from inflated type I errors, but the problem is more serious for the NB model. In summary, when we can assume that the distribution of the random effects is symmetric, the QP model can be used as the first option. Neither the QP nor the NB model is recommended for skewed random effects. Using the MME for α in the NB model does not improve its performance substantially.
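As an illustration of such random effect families (the specific mixture and gamma choices below are stand-ins, not necessarily the paper's exact specifications), one can draw symmetric, bimodal, and skewed effects standardized to a common variance τ:

```python
import numpy as np

def draw_random_effects(kind, tau, n, rng):
    """Draw v_i with E(v) = 0 and Var(v) = tau from three illustrative families:
    'normal' (symmetric), 'mixture' (bimodal two-component normal mixture),
    'skewed' (centered, rescaled gamma)."""
    if kind == "normal":
        return rng.normal(0.0, np.sqrt(tau), n)
    if kind == "mixture":
        # Equal mixture of N(-mu0, s^2) and N(+mu0, s^2); total variance mu0^2 + s^2 = tau
        mu0 = np.sqrt(0.75 * tau)
        s = np.sqrt(0.25 * tau)
        signs = rng.choice([-1.0, 1.0], size=n)
        return signs * mu0 + rng.normal(0.0, s, n)
    if kind == "skewed":
        # Gamma(shape=1) has mean 1 and variance 1; center and scale to variance tau
        g = rng.gamma(1.0, 1.0, n)
        return (g - 1.0) * np.sqrt(tau)
    raise ValueError(kind)

rng = np.random.default_rng(2)
for kind in ["normal", "mixture", "skewed"]:
    v = draw_random_effects(kind, 0.5, 200000, rng)
    print(kind, v.mean(), v.var())
```

Standardizing all three families to the same mean and variance isolates the effect of the distributional shape, which is what the Table S1 comparison relies on.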

A graphical method discriminating QP vs. NB
Cameron and Trivedi [7] proposed a method for testing whether NB regression is needed. Consider the following regression model:

{(y_i − μ̂_i)² − y_i}/μ̂_i = δ μ̂_i + e_i,

where e_i is the error term. If δ > 0, Cameron and Trivedi [7] argued that the NB model should be used. They also proposed the following regression model:

{(y_i − μ̂_i)² − y_i}/μ̂_i = η + e_i.

If η > 0, they claimed that the QP model should be used. However, these tests have serious weaknesses when discriminating between the QP and the NB model. We applied Cameron and Trivedi [7]'s tests to the models used in Table 1, and the results are summarized in Table 2. The numerical studies show that the first test rejects H_0: δ = 0 with high probability when the QP model is true and that the second test rejects H_0: η = 0 with high probability when the NB model is true. Therefore, these tests have no ability to discriminate the QP model from the NB model and vice versa. However, they can be used for checking whether an a priori determined overdispersion type exists.
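The auxiliary-regression idea can be sketched as follows: regress z_i = {(y_i − μ̂_i)² − y_i}/μ̂_i either on μ̂_i (NB-type variance) or on a constant (QP-type variance) and examine the t-statistic. The exact form of Cameron and Trivedi [7]'s tests may differ in detail; this is an illustrative implementation:

```python
import numpy as np

def ct_auxiliary_tests(y, mu_hat):
    """Auxiliary regressions on z_i = ((y_i - mu_i)^2 - y_i)/mu_i:
    slope-only regression on mu_i (NB-type) and intercept-only regression
    (QP-type); returns the two t-statistics."""
    z = ((y - mu_hat) ** 2 - y) / mu_hat
    # Slope-only regression z = delta * mu + e
    delta = np.sum(mu_hat * z) / np.sum(mu_hat ** 2)
    resid = z - delta * mu_hat
    se_delta = np.sqrt(np.sum(resid ** 2) / (len(z) - 1) / np.sum(mu_hat ** 2))
    # Intercept-only regression z = eta + e
    eta = z.mean()
    se_eta = z.std(ddof=1) / np.sqrt(len(z))
    return delta / se_delta, eta / se_eta

rng = np.random.default_rng(4)
n = 5000
mu = np.exp(0.5 + rng.uniform(0, 1, n))
alpha = 0.8
y_nb = rng.negative_binomial(1.0 / alpha, 1.0 / (1.0 + alpha * mu))  # NB data
t_delta, t_eta = ct_auxiliary_tests(y_nb, mu)
print(t_delta, t_eta)
```

In this NB-generated example both t-statistics come out large, echoing the point above that the two tests detect overdispersion of either type rather than discriminating between the QP and NB models.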
Alternatively, Ver Hoef and Boveng [32] recommended plotting (y_i − μ̂_i)² against μ̂_i. Since E(y_i − μ_i)² = φμ_i for the QP model and E(y_i − μ_i)² = μ_i + αμ_i² for the NB model, this plot shows a linear or quadratic relationship between the mean and variance for the QP and NB models, respectively. When this plot is messy, they considered binning μ̂_i into categories and averaging (y_i − μ̂_i)² within categories.
Instead of E(y_i − μ_i)² = φμ_i, we note that E{(y_i − μ_i)/√μ_i}² = φ for the QP model and E{(y_i − μ_i)/√μ_i}² = 1 + αμ_i for the NB model. Therefore, we suggest plotting the squared Pearson residuals along the y-axis and the mean along the x-axis. This plot shows a constant or linear relationship for the QP and NB models, respectively. Since discriminating a line from a constant is easier than discriminating a quadratic curve from a line, our suggestion may have some benefits in practice. Our experience also favors binning μ̂_i into categories when the plot is messy. Figure 2 shows some examples based on simulated data. The dashed lines are nonparametric regression fits from the lowess function in R, and the size of the bubbles is proportional to the number of observations falling into each bin. The upper panel refers to the QP case: Ver Hoef and Boveng [32]'s plot (left) shows a strong linear trend, whereas our proposed plot (right) shows no special pattern and is close to a constant. The lower panel refers to the NB case: Ver Hoef and Boveng [32]'s plot (left) shows an increasing trend, but it is not clear whether this trend is linear or quadratic. In contrast, our proposed plot shows a linearly increasing trend and is far from constant, which shows the benefit of plotting the squared Pearson residuals.
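The proposed diagnostic can be implemented in a few lines: bin μ̂_i, average the squared Pearson residuals within bins, and inspect the trend. A Python sketch with simulated NB data (illustrative parameters), where the binned averages should rise linearly with slope close to α:

```python
import numpy as np

def binned_pearson_sq(y, mu, n_bins=10):
    """Bin mu into quantile bins and average the squared Pearson residuals
    (y - mu)^2 / mu within each bin (Poisson-type sketch)."""
    r2 = (y - mu) ** 2 / mu
    edges = np.quantile(mu, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, mu, side="right") - 1, 0, n_bins - 1)
    bin_mu = np.array([mu[idx == b].mean() for b in range(n_bins)])
    bin_r2 = np.array([r2[idx == b].mean() for b in range(n_bins)])
    return bin_mu, bin_r2

rng = np.random.default_rng(8)
n = 50000
mu = np.exp(rng.uniform(0, 2, n))

# NB data: E[squared Pearson residual] = 1 + alpha*mu rises linearly with mu
alpha = 0.6
y_nb = rng.negative_binomial(1.0 / alpha, 1.0 / (1.0 + alpha * mu))
bin_mu, bin_r2 = binned_pearson_sq(y_nb, mu)
slope = np.polyfit(bin_mu, bin_r2, 1)[0]
print(slope)
```

For QP-type data the binned averages would instead stay flat near φ, so eyeballing (or fitting) the slope of this plot distinguishes the two variance functions, as argued above.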

Two approaches dealing with overdispersed binomial data
Two approaches are widely used in practice for taking overdispersion in binomial data into account. The first is to introduce an additional dispersion parameter φ into the binomial variance; we call this the quasi-binomial (QB) approach. For binomial data y_i ~ Bin(m_i, p_i), we have Var(y_i) = m_i p_i(1 − p_i), so the variance is completely determined by the mean function. By introducing φ into the variance as Var(y_i) = φ m_i p_i(1 − p_i), we can deal with overdispersion in binomial data. If φ = 1, there is no overdispersion; if φ > 1, overdispersion occurs. As in the case of Poisson GLMs, since φ is not included in the binomial likelihood function, the MME is widely used in practice for φ [14]; here, φ̂ = Dev/(n − p), where Dev denotes the binomial deviance, or the analogous Pearson-based estimator is used. As in the QP model, corrected standard errors are obtained by multiplying the original standard errors by √φ̂. The second approach is to use a true distribution allowing overdispersion. A popular choice is beta-binomial (BB) regression, which has Var(y_i) = m_i p_i(1 − p_i){1 + (m_i − 1)ρ_i}. This regression is implemented by the R-function betabin in the aod package [18].
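The BB variance inflation can be checked by simulating from the beta mixing representation: with p ~ Beta(a, b) and y | p ~ Bin(m, p), the marginal variance is m p̄(1 − p̄){1 + (m − 1)ρ} with ρ = 1/(a + b + 1). A Python sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200000
m, a, b = 8, 2.0, 2.0            # illustrative beta-binomial parameters

p = rng.beta(a, b, size=n)       # mixing distribution for the success probability
y = rng.binomial(m, p)

pbar = a / (a + b)
rho = 1.0 / (a + b + 1.0)        # intraclass correlation
var_theory = m * pbar * (1 - pbar) * (1 + (m - 1) * rho)
print(y.var(), var_theory)
```

Here the binomial variance alone would be m p̄(1 − p̄) = 2, while the simulated and theoretical BB variances agree at 4.8, i.e. an effective dispersion factor of 1 + (m − 1)ρ = 2.4.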

A numerical study for quasi-binomial and beta-binomial approaches
We compare the type I errors and powers of the two approaches when the QB model is true and when the BB model is true. The R-function rbetabinom provided in the rmutil package [29] can generate y_i from the BB distribution with mean m_i p_i and variance m_i p_i(1 − p_i){1 + (m_i − 1)ρ_i}. To generate QB data, we use log{p_i/(1 − p_i)} = β_0 + x_1i β_1 + x_2i β_2, and m_i is generated from Poi(1) + φ + 1. For φ, we use two or four. Doing so leads to Var(y_i) = φ m_i p_i(1 − p_i). To generate BB data, the same model is used, m_i is generated from Poi(4) + 1, and s_i = 1 or 4. When s_i = 1 and 4, ρ_i = 0.5 and 0.2, respectively. The x_1i and x_2i are generated independently from U(0, 1), β_0 is zero, and three sample sizes (n = 50, 100, 500) are considered. We look at the type I error when β_1 = 0 and the power when β_1 = 1 based on 5000 replications. The results are summarized in Table 3. Regardless of whether the data are generated from a QB or BB model, the QB and BB models show similar performances in most cases, but the BB model provides slightly better control for keeping the type I error at its nominal level and also shows a slightly higher power. Table 3 also shows the MSE of β̂_1 when β_1 = 1. The MSE of the BB model is slightly smaller than that of the QB model in most cases, and their differences disappear as the sample size increases. In conclusion, the BB model is recommended as the first option if there is no subject matter knowledge favoring the QB model.
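The beta-binomial generator used above can be sketched directly. The helper below is a stand-in for rmutil's rbetabinom, not its actual parameterization: it draws success probabilities from a Beta(p s, (1 − p)s) distribution, which yields mean m p and intraclass correlation ρ = 1/(1 + s), matching the s_i = 1 ↦ ρ_i = 0.5 and s_i = 4 ↦ ρ_i = 0.2 correspondence in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def rbetabinom_like(rng, n, m, p, s):
    """Beta-binomial draws with mean m*p and intraclass correlation
    rho = 1/(1+s). A stand-in for rmutil's rbetabinom; the actual
    rmutil parameterization may differ."""
    probs = rng.beta(p * s, (1.0 - p) * s, size=n)
    return rng.binomial(m, probs)

n, m, p, s = 200000, 5, 0.4, 1.0                  # s = 1 gives rho = 0.5
y = rbetabinom_like(rng, n, m, p, s)

rho = 1.0 / (1.0 + s)
var_bb = m * p * (1 - p) * (1 + (m - 1) * rho)    # BB variance formula
```

With these values the empirical mean and variance of y should match m p = 2 and the BB variance 3.6.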

A numerical study for various random effect distributions
This numerical study checks the performances of the binomial, QB and BB models under various binomial random effect models. y_i is generated from Bin(m_i, p_i) with log{p_i/(1 − p_i)} = β_0 + x_1i β_1 + v_i, where E(v_i) = 0 and Var(v_i) = τ. The random effect distributions, the range of τ, and the sample sizes are the same as in the Poisson data case. The x_1i are generated from U(0, 1). We calculate the type I error rates when β_1 = 0 and the power when β_1 = 1 based on 5000 replications. We summarize the results in Table S2. The ordinary binomial model suffers from the highest type I error when random effects exist. In most cases, the QB and BB models control the type I error well and are close to each other. In terms of power, the BB and QB models are quite similar. In summary, both the QB and BB models can be used for a wide range of random effect distributions, and it is very difficult to choose one model over the other because of their similar performances.

A graphical method discriminating QB vs. BB
Although we recommend the BB model as the first option based on our numerical study, this choice can be changed if the data show strong evidence favoring the QB model. As in the Poisson case, there have been attempts to find numerical or graphical evidence to discriminate between the QB and the BB model. Liang and McCullagh [20] considered overdispersion for binomial data and suggested plotting the ordinary Pearson residuals against the binomial denominator for model diagnostics as well as for discriminating between the QB and the BB model. Residuals showing constant variability support the QB model, whereas variability increasing with the binomial denominator supports the BB model. In contrast, we note that E{(y_i − m_i p_i)²/(m_i p_i(1 − p_i))} = φ for the QB model and 1 + (m_i − 1)ρ for the BB model. Therefore, plotting the squared Pearson residuals along the y-axis and the binomial denominator along the x-axis gives more information for discriminating between the QB and BB models. This plot shows constant and linear relationships for the QB and BB models, respectively. Our experience also favors binning m_i into categories when the plot is messy. Figure 3 shows some examples based on the simulated data and emphasizes the benefit of the proposed method. The upper panel refers to the QB case, where Liang and McCullagh [20]'s plot (left) shows a constant variability and our proposed plot (right) shows a pattern close to a constant. Therefore, both plots support QB. The lower panel refers to the BB case, where our proposed plot (right) shows a linear increasing trend, whereas an increasing variability of the residuals is not obvious in Liang and McCullagh [20]'s plot (left). Therefore, the BB example shows the benefit of plotting the squared Pearson residuals.
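The binomial analogue of the earlier diagnostic can be sketched in the same way: average the squared Pearson residuals within each value of the denominator m_i and look at the trend. A minimal Python sketch with our own parameter values, where the slope of the binned averages should be near ρ for BB data and near zero for QB data:

```python
import numpy as np

rng = np.random.default_rng(11)

# Beta-binomial data with varying denominators m_i (parameter values are ours).
n, p, rho = 30000, 0.3, 0.25
s = (1.0 - rho) / rho                    # rho = 1/(1+s)
m = rng.integers(2, 21, size=n)          # binomial denominators 2..20
probs = rng.beta(p * s, (1.0 - p) * s, size=n)
y = rng.binomial(m, probs)

# Squared Pearson residuals: E[r2] = phi under QB, 1 + (m-1)*rho under BB.
r2 = (y - m * p) ** 2 / (m * p * (1.0 - p))

# Average r2 for each denominator value (binning by m, as suggested above).
ms = np.arange(2, 21)
avg = np.array([r2[m == k].mean() for k in ms])
slope, intercept = np.polyfit(ms, avg, 1)   # slope near rho for BB data
```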

A caution on misusing the analysis pipeline for overdispersed data
In deriving the score tests, independence between the observed covariates and the random intercept was assumed, but this assumption is often violated in practice. Therefore, it is interesting to see how the score statistics behave in detecting overdispersion under the violation of this assumption. A numerical study is performed to understand this behavior.
For the Poisson case, we generate y_i from a Poisson random intercept model in which v_i and x_i are correlated. We investigate the behavior of the overdispersion testing methods by varying the correlation parameter ρ between v_i and x_i. We set τ = 0.1. Based on 5000 replications with β_0 = β_1 = 1, Figure 4(a) shows the power curves of the two score tests, T^P_D2 and T^P_BC. The power tends to be smaller when the correlation between v_i and x_i is higher. But, importantly, the score tests detect overdispersion even when v_i and x_i are correlated. For the binomial case, we generate data from model (7). We set β_0 = −3, β_1 = 0.1, τ = 1, and m_i = 5. Figure 4(b) shows the power curves of the two score tests, T^B_D2 and T^B_BC, by varying the correlation coefficient ρ. The power of T^B_D2 is larger than that of T^B_BC at the cost of an inflated type I error. As in the Poisson case, the power tends to be smaller when the correlation between v_i and x_i is higher, and the score tests still detect overdispersion.
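One way to see why the power decreases with ρ yet does not vanish: if we assume a bivariate normal (x_i, v_i) (our assumption for illustration, not necessarily the simulation design above), then v_i | x_i has variance τ(1 − ρ²), so the overdispersion remaining after conditioning on x_i shrinks with |ρ| but stays positive for ρ < 1. A quick Monte Carlo check of the conditional variance-to-mean ratio at a fixed covariate value:

```python
import numpy as np

rng = np.random.default_rng(5)

def cond_dispersion(rho, n=400000, tau=0.5, b0=1.0):
    # If (x_i, v_i) is bivariate normal, then v_i | x_i has variance
    # tau * (1 - rho^2): residual heterogeneity shrinks as |rho| grows.
    v = rng.normal(0.0, np.sqrt(tau * (1.0 - rho ** 2)), size=n)
    y = rng.poisson(np.exp(b0 + v))
    return y.var() / y.mean()  # conditional variance/mean; 1 = no overdispersion

# Ratios decrease toward 1 as rho increases, but remain above 1.
ratios = [cond_dispersion(r) for r in (0.0, 0.5, 0.9)]
```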
These two numerical studies show that the score tests can have nonnegligible power as long as the correlation between v_i and x_i is not too high. Once overdispersion is detected by the score tests, we may employ the analysis pipelines for overdispersed data studied in Sections 4 and 5. However, we first need to ask whether these procedures are valid when the independence assumption is violated. We show that they may not provide correct statistical inferences because the fitted model considered in the analysis pipeline can be grossly misspecified, as the following examples show concretely.

Example 6.1: Consider a Poisson regression model in which the random intercept v_i depends on the covariate x_i.
Then, the marginal mean E(y_i | x_i) no longer has the log-linear form exp(β_0 + x_i β_1). Therefore, fitting QP or NB models with log μ_i = β_0 + x_i β_1 is not helpful for accurate inference on the marginal mean because the mean model is grossly misspecified.

Example 6.2: Consider the Poisson regression model in Example 6.1, where the dependence of v_i on x_i is indexed by a parameter γ_1.
The resulting marginal mean E(y_i | x_i) again departs from the log-linear form, and it does not make sense to fit log μ_i = β_0 + x_i β_1 as long as γ_1 is not too small. Note that even when Cov(v_i, x_i) = 0, the usual analysis pipeline does not give accurate inference for the marginal mean.

Example 6.3: Suppose that there are two observed covariates, x_1 and x_2, and the parameter of interest is the regression coefficient β*_1 associated with x_1. Consider a Poisson regression model in which v_i is independent of the covariate of interest x_1i but is correlated with x_2i. The marginal mean E(y_i | x_1i, x_2i) then deviates from the fitted form: considering that log μ_i = β_0 + x_1i β_1 + x_2i β_2 is fitted in the usual analysis pipeline, the model can be grossly misspecified. Therefore, although the unobserved variable is correlated only with the subsidiary covariate, we cannot expect correct inference for the marginal mean from the misspecified QP or NB models.

Example 6.4:
Consider a logistic regression model (2) where the random intercept v_i follows a bridge distribution [33], which preserves the logit link function after integrating out v_i. In this case, an explicit marginal form is obtained so that we can directly investigate model misspecification. The marginal form is a logistic model with coefficients β_m = β/√(1 + 3τ/π²), where τ denotes the variance of v_i. Depending on the functional relationship between τ and x_i, the marginal form can be arbitrarily complicated. Therefore, the conclusion is the same as in the previous examples.
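Concretely, the bridge-distribution property gives the marginal success probability in closed form (writing expit(t) = e^t/(1 + e^t); this is the standard bridge result stated here for completeness):

```latex
\Pr(y_i = 1 \mid x_i)
  = \mathrm{E}_{v_i}\!\left[\operatorname{expit}\!\left(x_i^{\top}\beta + v_i\right)\right]
  = \operatorname{expit}\!\left(x_i^{\top}\beta_m\right),
\qquad
\beta_m = \frac{\beta}{\sqrt{1 + 3\tau/\pi^2}},
```

so the marginal coefficients are the conditional ones attenuated by the factor (1 + 3τ/π²)^{−1/2}, and any dependence of τ on x_i destroys the simple logistic marginal form.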
The above examples show that the existing analysis pipeline based on the QP, QB, NB and BB models can give incorrect inferences for regression parameters when the random effect depends on observed covariates. Recall that in the analysis pipeline, these four models are mainly used for correcting standard errors because they have the same covariate form as that of the corresponding random intercept model. However, if the mean model is grossly misspecified, the estimated regression coefficients are heavily biased so that correcting only standard errors is not sufficient for valid statistical inference.
To avoid this fallacy, we propose checking whether the mean model used in the analysis pipeline is correctly specified. For example, suppose that T^P_BC rejects the null hypothesis and an NB regression model with log μ_i = β_0 + x_i β_1 is fitted. We propose considering a semiparametric NB model with log μ_i = β_0 + x_i β_1 + g(x_i), where g(·) denotes a nonparametric term beyond the linear term. To guarantee the validity of the fitted parametric NB model, testing H_0: g(·) = 0 is required. For the case with two covariates, we may consider log μ_i = β_0 + x_1i β_1 + x_2i β_2 + g_1(x_1i) + g_2(x_2i) and test H_0: g_1(·) = g_2(·) = 0. Recently, [34] developed a method for implementing this type of test based on a mixed model representation and provided the R-package mgcv. In principle, the same testing method can be applied to check the validity of a fitted BB model, but there is currently no R-package available to implement this test.

Real data examples
As an example of overdispersed count data, we consider the study of Oztig and Askin [24], who analyzed the number of COVID-19 infected people (dependent variable) while adjusting for several control variables (population density, percentage of elderly people, etc.). To analyze the variation in the number of COVID-19 infected people across 144 countries, they fitted a negative binomial regression model. With the same mean model as in Oztig and Askin [24], we calculated the score test statistics T^P_D1, T^P_D2, and T^P_BC and the corresponding p-values in Table 4. As expected, all three test statistics are very large, so we can conclude with high confidence that there is overdispersion and that the standard Poisson regression model is not appropriate for analyzing this dataset. Although Oztig and Askin [24] applied only the NB model to overcome the overdispersion, the QP model can also be used. To check which model is suitable, we applied the graphical method described in Section 4.3.
As shown in the right panel of Figure S7, the NB model seems more appropriate because the relationship between the mean and the squared Pearson residuals is close to linear rather than constant.
As an example of overdispersed binomial data, we consider the research of Young-Xu and Chan [36], who reviewed 37 studies to examine the adverse effects of oral anti-fungal agents and to model the cumulative incidence of patients who withdrew from a study due to adverse effects. With the same model considered in Young-Xu and Chan [36], the score test indicates significant overdispersion (Table 4). Before fitting the BB model to combine event rates as in Young-Xu and Chan [36], we checked which overdispersed binomial model is better using the graphical method described in Section 5.3. Although the pattern of the ordinary Pearson residuals is unclear, the proposed method using the squared Pearson residuals quite clearly indicates that the BB model is better than the QB model, as shown by the increasing smoothed line based on the squared Pearson residuals (Figure S8).

Concluding remarks
We clarified the relationship among various score tests proposed in the existing literature and provided a principled way of deriving the modified score test from a restricted likelihood. The restricted-likelihood based score statistic shows superior performance to that of the others because it takes into account the finite sample bias caused by estimating the regression coefficients in a principled way.
We also investigated widely used analysis pipelines for analyzing overdispersed count and binomial data. Generally, it is difficult to choose between the QP and NB models because their performances are quite similar. For the QB and BB models, the BB model is recommended, but care is necessary because our simulation setting is not the only mechanism that can produce overdispersion. Given these difficulties, some simple graphical methods are introduced to discriminate the QP model from the NB model and the QB model from the BB model. Although we do not believe that these graphical methods always succeed in discriminating between the two models, we observed many numerical examples where they are helpful. For cases where discrimination is too difficult, we recommend considering both models together and observing how sensitive the results are to the choice of model. Such a sensitivity analysis is important for achieving robust empirical findings.
In particular, we emphasize simultaneous checks for overdispersion and the mean model specification. Without the latter, the score tests should be regarded as simple goodness of fit tests, and the existing analysis pipeline for taking into account overdispersion should not be applied indiscriminately. The current analysis pipeline coping with overdispersion is valid only when the current mean model is correctly specified.

Disclosure statement
No potential conflict of interest was reported by the authors.