Beta-negative binomial nonlinear spatio-temporal random effects modeling of COVID-19 case counts in Japan

Coronavirus disease 2019 (COVID-19) caused by the SARS-CoV-2 virus has spread seriously throughout the world. Predicting the spread, or the number of cases, in the future can facilitate preparation for, and prevention of, a worst-case scenario. To achieve these purposes, statistical modeling using past data is one feasible approach. This paper describes spatio-temporal modeling of COVID-19 case counts in 47 prefectures of Japan using a nonlinear random effects model, where random effects are introduced to capture the heterogeneity of a number of model parameters associated with the prefectures. The negative binomial distribution is frequently used with the Paul-Held random effects model to account for overdispersion in count data; however, the negative binomial distribution is known to be incapable of accommodating extreme observations such as those found in the COVID-19 case count data. We therefore propose use of the beta-negative binomial distribution with the Paul-Held model. This distribution is a generalization of the negative binomial distribution that has attracted much attention in recent years because it can model extreme observations with analytical tractability. The proposed beta-negative binomial model was applied to multivariate count time series data of COVID-19 cases in the 47 prefectures of Japan. Evaluation by one-step-ahead prediction showed that the proposed model can accommodate extreme observations without sacrificing predictive performance.


Introduction
Coronavirus Disease-2019 (COVID- 19), which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is spreading worldwide. In December 2019, an outbreak of COVID-19 was confirmed in Wuhan, China. The virus spread among Asian countries from mid-January to February 2020 and then to the rest of the world over the next few months [6]. It was declared a pandemic in March 2020 [26].
In Japan, the first COVID-19 case was identified on 16 January 2020, in a person who had traveled to Wuhan. After that, there were few infected people in Japan for a month, but the number gradually increased from mid-February. On 7 April 2020, the Japanese government declared a state of emergency in seven major prefectures (districts): Saitama, Chiba, Tokyo, Kanagawa, Osaka, Hyogo, and Fukuoka. This declaration remained in effect for about one month. After that, the number of cases of infection repeatedly increased and decreased, with three main peaks occurring. On 7 January 2021, with the occurrence of a peak much higher than the previous one, the Japanese government again declared a state of emergency. For the number of cases in Japan, see e.g. Karako et al. [15]. The spread of infection has not abated and the number of infected people is currently increasing. See [19] for the current infection status in Japan.
The infection situation in Japan varies considerably from district to district. In fact, major prefectures have independently declared states of emergency depending on their own infection situation. By predicting the future spread of the infection and the number of people infected with COVID-19, we can prepare for and prevent the worst case scenario. For this purpose, modeling based on past data is considered. To date, there have been few studies in Japan that utilized time-series models for patient-count data, and those studies mainly modeled the number of patients in Japan as a whole, e.g. Amengual and Atsumi [1] and Iwata et al. [13]. The regional infection situation -including cumulative number of patients or number of persons tested -can differ depending on the characteristics of each district, such as population size. Predicting the number of infected people in Japan as a whole is not necessarily relevant to individual districts, such as those with a small number of infected people. Convenience of transportation also affects the infection situation; people tend to move to neighboring districts rather than more-distant areas. Therefore, it is more reasonable to include data from surrounding districts of the focal district than to use data from each district separately, especially in districts where people tend to move around. Joint modeling of all districts, borrowing information from neighboring districts, might therefore improve the accuracy of the modeling. Spatio-temporal modeling is one such approach, in which neighborhood information is incorporated into multivariate time-series data of multiple districts.
In this paper we study spatio-temporal modeling of COVID-19 count time series data in the 47 prefectures of Japan using a nonlinear spatio-temporal model developed by Paul, Held, and colleagues [10][11][12]17,18,20,21]. The Paul-Held model is a standard multivariate count time series model that assumes either a Poisson or a negative binomial distribution. For example, the Paul-Held model has been applied to COVID-19 data in Italy [5,8]. The model includes random effects to capture heterogeneity of parameters associated with effects across a large number of districts. The specified statistical distribution enables probabilistic prediction of future observations. For accurate prediction, correct specification of a statistical model is important. While the negative binomial distribution is more flexible than the Poisson distribution in that overdispersion is taken into account, and it has been used to model infectious disease cases [16], it is known to fail to adequately model data with a heavy-tailed distribution [25]. In COVID-19 case count data, extreme observations are commonplace due to patient clusters [7], which suggests the presence of heavy-tailed distributions.
To account for such extreme observations, we propose using the beta-negative binomial distribution with the Paul-Held model, rather than the negative binomial distribution. The beta-negative binomial distribution (e.g. Section 6.2.3 of [14]) is a generalization of the negative binomial that has attracted much attention in recent years for its ability to model heavy-tailed distributions [9,25]. Another similarly motivated generalization of the negative binomial distribution is the Poisson-Tweedie family; however, the Poisson-Tweedie probability mass function cannot be expressed in closed form, which complicates the estimation method [2,22]. In contrast, the beta-negative binomial distribution has a closed-form probability mass function and is analytically tractable. As a result, the first and second derivatives of the loglikelihood function can be used explicitly, which makes it easier to apply to parameter estimation even when the number of parameters is large, and also contributes to faster computation.
In this paper we compare the modeling capabilities of existing models to those of the proposed model as applied to daily case count data of COVID-19 in the 47 prefectures of Japan. The rest of the paper is organized as follows. Section 2 presents the existing and proposed prediction models. Section 3 provides a comparison of the prediction models through one-step-ahead prediction applied to COVID-19 daily case counts in the individual prefectures. Section 4 concludes the paper with some remarks.

Negative binomial distribution
This section presents a negative binomial distribution used for count time series data. A random variable Y having negative binomial distribution with dispersion parameter r > 0 and success probability p ∈ (0, 1) has probability mass function P( is the gamma function. The mean and variance are given by E(Y) = rp/(1 − p) and Var(y) = rp/(1 − p) 2 , respectively. If E(Y) is denoted by μ, Var(y) can be expressed as μ(1 + μ/r). As r → ∞, the negative binomial distribution converges to a Poisson distribution with mean and variance μ. The negative binomial distribution is suitable for count data that are overdispersed. In the following, the set of parameters (r, μ) is used instead of (r, p), and the notation Y ∼ NegBin(r, μ) is used. The probability mass function is re-expressed as

Beta-negative binomial distribution
The negative binomial distribution is best suited for overdispersed count data that are not necessarily heavy-tailed. To model count data having a heavy-tailed distribution, the betanegative binomial distribution is appealing, being a generalization of the negative binomial distribution while possessing analytical tractability. The beta-negative binomial distribution arises as a beta mixture of negative binomial distributions. Let Y conditional on P have a negative binomial distribution with dispersion parameter r and success probability P. If P follows a beta distribution, P ∼ Beta(α, β) with shape parameters α and β, then the marginal distribution of Y is beta-negative binomial with probability mass function is the beta function. The parameter α is called a tail parameter: the smaller α is, the heavier the tail. The mean and variance are E(Y) = rβ/(α − 1) and Var(Y) = r(α + r − 1)β(α + β − 1)/{(α − 2)(α − 1) 2 }, respectively. Note that the mean is only defined when α > 1 and the variance only when α > 2.
The beta-negative binomial distribution is non-identifiable as swapping r and β does not change the probability mass function. To overcome this issue, we use the parameterization of [9], i.e. the mean μ = rβ/(α − 1) is considered as a parameter instead of β, which avoids the non-identifiability issue (c.f. page 1330 of [9]). Then, Y ∼ BetaNegBin(r, α, μ) has a beta-negative binomial distribution with mean μ > 0, dispersion parameter r > 0, and tail parameter α > 1 if If α → ∞, the beta-negative binomial distribution converges to the negative binomial distribution with mean μ and dispersion parameter r. In this paper, the beta-negative binomial distribution is applied to the Paul-Held model as a straightforward extension of the negative binomial distribution.

Univariate model
The simplest case is a separate univariate count time series model for each of the 47 prefectures. Let Y it be the number of cases of a disease in district i ∈ {1, . . . , I} at time t ∈ {1, . . . , T}. In this paper, Y it corresponds to the number of daily COVID-19 cases in Japan for the ith prefecture (i = 1, . . . , I and I = 47), and t denotes a specific date. In this paper, we consider negative binomial and beta-negative binomial distributions for the distribution of Y it as described in Sections 2.1 and 2.2.
The negative binomial distribution model assumes that Y it ∼ NegBin(r i , μ it ), with conditional mean, and r i is the dispersion parameter. The conditional variance of Y it is μ it (1 + μ it /r i ). While [20] uses an alternative parameterization, ψ i = 1/r i , in this paper we use r i for consistency with subsequent sections that introduce the beta-negative binomial distribution. The parameters λ i and ν i represent the autoregressive effect and the intercept, respectively; both take positive values. All parameters are estimated by maximum likelihood independently for the univariate count time series data of COVID-19 cases in each district. Estimation is performed with the surveillance package in R [18]. The beta-negative binomial distribution model instead assumes that Y it ∼ BetaNegBin (r i , α i , μ it ), with the same conditional mean as (1), dispersion parameter r i > 0, and an additional tail parameter α i > 1.

Paul-Held model
Whereas the univariate model independently considers the time series in each district, jointly modeling multiple time-series data including neighboring districts might lead to improved performance. The Paul-Held model [20] is a nonlinear spatio-temporal random effects model.
The negative binomial distribution model assumes that the counts Y it follow a negative binomial distribution, Y it ∼ NegBin(r i , μ it ), with conditional mean, where r i is the dispersion parameter for the ith district. The parameters λ it , φ it , and ν it all take positive values. The first and second terms of μ it are called the epidemic component, and it is further divided into autoregressive and neighborhood effects that depend on the past counts Y i,t−1 . The first term λ it Y i,t−1 is the autoregressive effect proportional to the past observation of the same district Y i,t−1 . The second term φ it j =i w ji Y j,t−1 is the neighborhood effect proportional to an aggregate of past observations in the districts neighboring district i with weights w ji . The weight w ji represents a flow from district i to j and is set to 1 in the following analysis if prefectures i and j are geographically connected and 0 otherwise. The third term ν it e i is called the endemic component, where e i is the fraction of the population living in the ith district. The three parameters λ it , φ it , and ν it are modeled as follows: , and β (ν) are the fixed effects common to all districts, and b Paul and Held [20] modeled these random effects as multivariate normal, and σ 2 ν are unknown parameters, and I is the I × I identity matrix (although a more general specification of is possible). All model parameters are estimated by the surveillance package.
The beta-negative binomial distribution model assumes that Y it ∼ BetaNegBin (r i , α i , μ it ), with the same conditional mean as (2) involving random effect parameters, dispersion parameters r i s, and additional tail parameters α i s. The unknown parameters, r i s, λ , σ 2 φ , and σ 2 ν are estimated in the same line of reasoning as with the inference procedure of Paul and Held [20]. More details are given in Supplementary Text. In the parameter estimation, we constrained tail parameters, α i s, to be greater than 1 for the existence of the conditional mean (2). Stronger constraint that the tail parameter is greater than 2 is needed for the existence of the second moment. It is also required for the asymptotic normality of the maximum likelihood estimation under the beta-negative binomial autoregressive models [9]. In this paper, the weaker constraint, α i > 1, is considered because we do not consider inference in the spatio-temporal model, but it may be necessary when inference is required.

Data description
The data were obtained from the Toyo Keizai Online 'Coronavirus Disease (COVID-19) Situation Report in Japan' (see Data Availability Statement). The data are daily confirmed COVID-19 cases from May 8, 2020 to August 31, 2021 for the 47 prefectures in Japan; data before 8 May 2020 were excluded from the analysis because the data source was different.
The time-series count data are shown in the top panel of Figure 1; a heatmap of total cases in each prefecture is shown in the bottom panel of Figure 1.

Modeling the COVID-19 multivariate count time series data in Japan
The four predictive models described above were fitted to the COVID-19 multivariate count time series data. First, the univariate model with negative binomial distribution or beta-negative binomial distribution was separately fitted to the data of each of the 47 prefectures. Next, the Paul-Held model with negative binomial distribution or beta-negative binomial distribution was fitted to the data jointly for all 47 prefectures.
The Paul-Held model allows the variation to be decomposed into three components: autoregressive, neighborhood, and endemic. The estimated proportions obtained with the Paul-Held model with negative binomial distribution for each of 47 prefectures are depicted in Supplementary Figures S1, S3, and S5. The estimated components differed among the prefectures. For example, Tokyo and Hokkaido had large autoregressive components but small neighborhood components; in contrast, Fukushima had a small autoregressive component but a large neighborhood component. This decomposition revealed several interesting characteristics of the infection situation in the 47 prefectures. The autoregressive component is in theory captured by the univariate model as well. However, the neighborhood component is not captured without joint modeling of all of the time series. The existence of prefectures for which the neighborhood component dominated the other two components confirms the utility of spatio-temporal modeling.
The estimated dispersion parameters derived from the Paul-Held models with negative binomial distribution showed heterogeneity between prefectures. Comparing the dispersion estimates with the population fraction in Japan revealed a tendency for prefectures with larger population fractions to have less dispersion than prefectures with smaller population fractions. A scatterplot matrix between population fraction and estimated dispersion is given in Supplementary Figure S7.
Estimated parameters under the Paul-Held model with beta-negative binomial distribution were also compared. The estimated proportions obtained from the Paul-Held model with beta-negative binomial distribution for each of the 47 prefectures are depicted in Supplementary Figures S2, S4, and S6. According to Supplementary Figure S7, the dispersion parameters were highly correlated with those under the Paul-Held model with negative binomial distribution. On the other hand, the estimated tail parameters were positively correlated with the dispersion parameters; that is, prefectures with higher dispersion tended to have heavier distributional tails, and such prefectures tended to have smaller population fractions. The observed heterogeneity in estimated tail parameters across prefectures verifies the need to model prefecture-specific COVID-19 case-count data by using distributions that are heavy-tailed. In the following section we further explore predictive assessment of the impact of the beta-negative binomial distribution.

Assessment of prediction models
The prediction models were evaluated by repeated one-step-ahead prediction, i.e. whether the model trained on past time-series data was able to predict the number of case counts observed the next day. The one-step-ahead predictions were made by sequentially repeating the most-recent past 90 days. The models are all probabilistic models, and the evaluation was done in terms of predictive distributions rather than point prediction. Supplementary Figures S8-S11 give the results of 90 one-step-ahead predictions for the 47 prefectures. Five curves corresponding to 99.9%, 99%, 50%, 1%, and 0.1% quantiles of the prediction models are depicted together with the number of observed cases to be predicted. Perhaps due to its heavy-tailed nature, the beta-negative binomial distribution produced slightly wider prediction intervals than the negative binomial distribution in both the univariate and Paul-Held models. Days where the number of cases exceeded the 99.995% quantile are emphasized; many cases of models using the negative binomial distribution exceeded this threshold. To facilitate visualizing this in greater detail, Figure 2 shows enlarged panels for two prefectures in which extreme values were observed, Nara and Kagawa, chosen from Supplementary Figures S8-S11. Models with the beta-negative binomial distribution properly covered the extreme observations.

Heavy-tailedness assessment
The following gives a more precise evaluation of heavy-tailedness. Let P be the predictive distribution corresponding to the fitted model to be evaluated, and y the future observed count to be predicted by the model P trained from past data.
In this paper we propose an approach to evaluate heavy-tailedness on the basis of a quantile-quantile plot, which is commonly used for quality control in the field of genomewide association studies (GWASs) [24]. The quantile-quantile plot compares expected and observed P values produced from a genome-wide scan. However, instead of P values, we consider the tail probability that the observed data equals or exceeds the new observation under the prediction model, i.e. using the survival functionF(y) = P(Y ≥ y) = ∞ k=y P(Y = k) where y is the new data and P is the fitted prediction model. The tail prob-abilityF(Y), like the P-value, quantifies the rarity of the new data Y under model P. For a continuous random variable Y,F(Y) is uniformly distributed if the model is correctly specified (i.e. the probability integral transform), but this is not true for a discrete random variable. We therefore propose a modified approach.
To begin with, for a discrete random variable Y that takes values in {0, 1, 2, . . . } and has survival functionF(·), for any z ∈ [0, 1], where U ∼ Unif(0, 1) and is independent of Y. See e.g. Smith [23], Brockwell [3], and Czado et al. [4] for a cumulative distribution function version. By monotonicity (i.e.F(y − 1) ≥F(y) for any y ∈ {1, 2, . . . }), it holds that UF(Y) Thus, points in the quantile-quantile plot based on the modified tail probabilityF(Y − 1), whereF(−1) = 1, instead ofF(Y), are distributed under the area of the diagonal line, which is not guaranteed for the quantile-quantile plot based onF(Y) since P{F(Y) ≤ z} ≤ z does not necessarily hold. If there exist unexpectedly many Y with large values under the model P, the correspondingF(Y − 1) should be very small, and too many points will exceed the expected number under the model P. In this case, the quantile-quantile plot may exhibit upward departure from the diagonal line, and the plot suggests that the actual data distribution of Y has a heavier tail than that of P. Figure 3 shows the proposed quantile-quantile plot based onF(Y − 1) for the four models aggregated for all prefectures. It can be seen that both the univariate and Paul-Held models with negative binomial distribution suffer from severe upward departure, implying the existence of extreme observations. On the other hand, both the univariate and Paul-Held models with beta-negative binomial distribution showed very little or no inflation, implying a reduction in the number of extreme (poorly predicted) observations. This result demonstrates the ability of the beta-negative binomial distribution to model heavy-tailedness.

Prediction model assessment
The above evaluation using the quantile-quantile plot focused on the heavy-tailedness of the prediction model. It is also of interest to assess the overall predictive performance of the model. The following gives prediction model assessment through widely used metrics. Those metrics allow quantitative evaluation of the fitted distribution P with respect to future data y. Four metrics were considered following Paul and Held [11,20]: logs: logarithmic score, logs(P, y) = − log P(Y = y); rps: ranked probability score, rps(P, y) = ∞ k=0 {P(Y ≤ y) − 1(y ≤ k)} 2 ; dss: Dawid-Sebastiani score, dss(P, y) = {(y − μ P )/σ P } 2 + 2 log σ P ; and ses: squared error score, ses(P, y) = (y − μ P ) 2 .
Here, 1(·) denotes the indicator function, and μ P and σ 2 P respectively denote the mean and variance of the fitted distribution P. The above metrics measure the discrepancy between the predictive distribution P and future data y. A scoring rule is proper if its expected value under the probability distribution P reaches its minimum if the observed value y is a realization from the model P, and it is called as strictly proper if this minimum is unique. [20] suggest evaluating one-step-ahead predictions from predictive models by proper scoring rules for count data [4]. The logs and rps are elaborated scoring rules that can take into account the whole predictive distribution to assess calibration and sharpness simultaneously. The logs is the most popular strictly proper scoring rule. The rps is an alternative strictly proper scoring rule based on the rank, which is the sum of Brier scores for binary predictions at all possible thresholds k = 0, 1, . . . , ∞. The ses depends only on the first moment μ P of the predictive distribution; the ses scoring rule is known to be proper but not strictly proper, and is considered here merely because of its widespread use. The dss is a normal approximation of the logs. For all metrics, lower scores indicate better prediction ability of the predictive distribution P with respect to y.
The resulting scores of the four metrics from one-step-ahead predictions for the most recent past 90 days were averaged. The averaged scores for the 47 prefectures are given in Figure 4 and Supplementary Figures S12-S14.
For logs, rps, and dss, in the majority of the 47 prefectures the univariate model with beta-negative binomial distribution and the Paul-Held models with both negative binomial and beta-negative binomial distributions showed better predictive performance than the univariate model with negative binomial distribution. That the performance of the Paul-Held model was superior to that of the univariate model in most prefectures demonstrates the effectiveness of the spatio-temporal modeling that incorporates neighborhood effects.
In the univariate model, the beta-negative binomial distribution model showed lower values than the negative binomial distribution model in the logs, rps, and dss metrics, but higher values in the ses metric. On the other hand, with the Paul-Held model, the betanegative binomial distribution model and the negative binomial distribution model both showed comparable values among the four metrics. This implies that, for the Paul-Held model, the beta-negative binomial distribution model retained the predictive performance of the negative binomial distribution.

Conclusion and discussion
This paper presents several modeling strategies for time-series data of COVID-19 case counts in the 47 prefectures of Japan. Univariate time-series models and spatio-temporal models developed by Paul and Held [20], the latter having nonlinear random effects, were considered. The main finding was that outlying observations appearing in the model based on the negative binomial distribution of COVID-19 data could be taken into account in the model based on the beta-negative binomial distribution. The beta-negative binomial distribution is a generalization of the negative binomial distribution that allows varying the heavy-tailedness through a single parameter. With COVID-19, patient clusters are repeatedly reported, and these clusters result in sudden increases in the number of cases in time-series data as seen in Figure 2. This feature of the COVID-19 data may explain the better fit of the beta-negative binomial distribution vis-a-vis the negative binomial distribution.
In the real COVID-19 data application, the spatio-temporal model was demonstrated to be more effective than the univariate model, at least in the case of the negative binomial model. This can be attributed to the consideration of neighborhood effects. The spatiotemporal model is a reasonable approach to simultaneously model multivariate time-series data from geographical regions by borrowing information from other regions to supplement the modeling accuracy. Various extensions of the Paul-Held spatio-temporal model have been proposed [18]. Since the beta-negative binomial distribution is easy to handle analytically, these extensions are straightforward.
The models considered in this paper were the simplest; more sophisticated models might be worth considering in practice. Candidates for a more sophisticated model include the following. First, the definition of neighboring prefectures is based on geographical connections. The accuracy of the model can be further improved by considering time-related events, such as arrivals of planes and trains. Second, the time-series model includes case data from the previous day and is a first-order autoregressive model. For better model performance, it may be worthwhile to consider second-order or third-order models. Third, it would be useful to consider models for longer-term prediction instead of just the onestep-ahead prediction. Fourth, the data used in this paper were publicly available. If more detailed district data were used instead of aggregate data for each of the 47 prefectures, the accuracy of the modeling would be further improved. It would be interesting to evaluate whether the beta-negative binomial distribution retains its validity with these more sophisticated types of models.