Hierarchical Bayesian spatio-temporal modeling of COVID-19 in the United States

We examine the impact of economic, demographic, and mobility-related factors have had on the transmission of COVID-19 in 2020. While many models in the academic literature employ linear/generalized linear models, few contributions exist that incorporate spatial analysis, which is useful for understanding factors influencing the proliferation of the disease before the introduction of vaccines. We utilize a Poisson generalized linear model coupled with a spatial autoregressive structure to do so. Our analysis yields a number of insights including that, in some areas of the country, the counterintuitive result that staying at home can lead to increased disease proliferation. Additionally, we find some positive effects from increased gathering at grocery stores, negative effects of visiting retail stores and workplaces, and even small effects on visiting parks highlighting the complexities travel and migration have on the transmission of diseases.


Introduction
First originating in Wuhan, China in late 2019 and quickly spreading throughout the world, COVID-19 has been a virus unlike any other seen in modern times. The first known case of COVID-19 on American soil occurred in Washington State in January 2020 and subsequently spread to all 50 states in the following months [16]. On March 12, 2020, the World Health Organization (WHO) deemed COVID-19 to be a pandemic. Soon thereafter, the United States federal government and governors in all 50 states issued guidelines intended to protect public health, which included stay at home orders and prohibitions on gatherings exceeding a certain number of people [16]. COVID-19 vaccines, approved by the Food and Drug Administration (FDA) in December 2020 significantly reduced, albeit did not eliminate, the proliferation and/or severity of the virus [15,40].
Statistical analysis of COVID-19 data before the introduction of vaccines is useful for understanding the factors influencing disease proliferation during this most difficult part of the pandemic, which can be especially useful for preparing for other public health crises that may come our way. Spatial modeling provides a useful set of techniques for doing so [2,10,29,47]. These tools have been used in varieties of fields including epidemiology, applied economics, climate modeling, and crime among other fields [6,39,50,51]. Spatial models are grounded on the assumption that localities, individuals, or other units of analysis are located in a space and that units near one another in the space generate similar outcomes. A variety of techniques fall under the umbrella of spatial modeling such as geographically weighted regression, kriging, and spatial autoregression [6].
A number of researchers have utilized spatial modeling to understand the spread of COVID-19. For example, Guliyev examined the factors affecting COVID-19, using spatial modeling to analyze the relationship between confirmed cases of COVID-19, associated deaths, and recovered cases as a result of treatment [28]. Kang et al. used spatial modeling to understand the spread of COVID-19 in mainland China, finding spatial associations to be paramount in understanding the early spread of the disease [35]. Huang et al. also used spatial modeling to understand COVID-19's growth in mainland China and estimate the virus's basic reproductive number (R0) [32]. Mollalo et al. looked at the spatial variability of COVID-19 incidence utilizing spatial lag and spatial error models to investigate spatial dependence [41]. They also used geographically weighted regression (GWR) and multiscale GWR models to locally understand spatial non-stationarity. The authors note that spatial modeling does indeed improve explanatory power but still find that local models are strongest at disease prediction. Pourghasemia et al. employed spatial modeling to understand COVID-19 proliferation in the Iran, developing a model that quantifies risk and predicts trends in death. [44]. Danon et al. used a spatial model to understand transmission in England and Wales to understand early disease spread and forecast the peak timing of population infection [13].
Despite this work, however, there is a dearth of research in the academic literature employing spatial modeling of COVID-19 across American counties to understand the factors influencing disease proliferation before the introduction of vaccines. Sun et al. does some such work using data through June 28, 2020 [54]. Their work, however, is limited specifically to linear models, which have been demonstrated to be inadequate in modeling count data compared to Poisson models [19]. Dimaggio et al. employed spatial Poisson modeling to understand the spread of COVID-19 amongst African Americans but only do so amongst New York city zip codes [18]. Bhaskar et al. [5] used spatial Poisson modeling to understand disease proliferation in Tamil-Nadu, estimate reproductive rates, and identify at risk areas. A number of other studies have employed Poisson modeling to understand COVID-19 proliferation in countries throughout the world, but have not included a spatial component [1,5,30,33,37,55].
Many of these models have been used by statisticians and applied mathematicians for decades. The types of data available, however, has changed markedly in recent years. For example, there is now a significant amount of rich mobility data compiled by technological companies that has only recently become available. For example, SafeGraph, Descartes Labs, Google, and Apple for example all provide either references or datasets [3,17,27,48] regarding individual movements, which has been very informative for understanding the spread of COVID-19. This data has been particularly useful as it can shed light on how individuals' and populations' movements impact case growth. Chiang et al., for example, utilized Google mobility data to understand COVID-19 disease spread early in 2020 to estimate basic reproductive numbers and discern the impact of mobility and demographic factors on the spread of COVID-19 [8]. Wang and Yamamto also took advantage of this data, utilizing a partial differential equation model to help understand COVID-19 proliferation in Arizona [58]. Wang et al. introduced a graph based neural network incorporating mobility data to understand the dynamics involving the spread of COVID-19 [59]. Rader et al. used mobility data to understand the impact of crowding and population density on COVID-19 proliferation [45]. A few other studies have employed mobility data to analyze COVID-19 as well [12,36,49].
It is useful to understand the impact of these factors on COVID-19 disease proliferation in the United States on a county by county basis before the introduction of vaccines, incorporating spatial interrelationships between counties in the process. Although some research (such as Chiang et al. [8]) has utilized mobility data to understand its impact on COVID-19 proliferation, no research to our knowledge, has incorporated spatial interrelationships to do so. This paper contributes to the literature by ameliorating this limitation. Namely, we utilize hierarchical Bayesian modeling to incorporate both county by county spatial interrelationships through an autoregressive structure along with economic, demographic, and mobility-related trends as covariates to better understand COVID-19 disease proliferation before the introduction of vaccines. The hierarchical Bayesian nature of our model enables us to explicitly model county by county heterogeneity. We hope our work, based on data before the introduction of vaccines, will be useful for preparing for other public health crises that may come our way.

Data
As of December 14, 2020, the United States had 16.5 million cases total confirmed COVID-19 cases and 300,000 COVID-19 related deaths [34], with the first COVID-19 vaccine being administered at 9:00 am that morning [40]. To model COVID-19 case proliferation across the lower 48 states before the introduction of vaccines, we used publicly available COVID-19 case data available from USAfacts.org, which provides daily US data of COVID-19 cumulative cases and deaths on a county by county basis through December 2, 2020 [57]. Heat maps of monthly new COVID-19 cases throughout the country are presented in Figures 1-3. As the charts illustrate, COVID-19 has been much more prevalent in some areas of the country compared to others over the course of the pandemic.
We also utilized data on population density and income to assess the impact of these factors on disease spread [46,53]. Heat maps depicting the distribution of these factors throughout the country are presented in Figures 4 and 5. Once again, these maps illustrate the vast degree of heterogeneity across the United States regarding these two factors.
Thirdly, in order to understand people's behavioral responses intended to avert the spread of COVID-19, we utilized mobility data provided by Google. Google's publicly available dataset provides aggregate county by county time series data of mobility to retail/restaurants, grocery stores/pharmacies, public parks, workplaces, and within residences with respect to a baseline movement level from earlier in the year. Google defines this baseline level as the median value from January 3-February 6, 2020, which is considered to be a 'normal' value for each particular day of the week, providing a percentage change for each day with respect to the baseline [27]. 1 Plots for the Manhattan County,  Lastly, to model the spatial interrelationships amongst different counties, we garnered county by county longitudinal and latitudinal data from Github.com [26]. In the following section, we describe our model including the spatial component that incorporates spatial distances based on this data.

A spatio-temporal model for COVID-19 proliferation
Due to the high dimensionality associated with spatially modeling the entire Continental United States, we analyzed each of four Census regions depicted in Figure 8 [56].  In particular, we utilized a hierarchical Bayesian Poisson generalized linear model to analyze the proliferation of COVID-19 amongst all i = 1, . . . , I j counties in the each of four Census regions j = 1, . . . , 4. Specifically, we defined y i,t is the the number of total COVID-19 confirmed cases in county i and time t and our primary explanatory variable x i,t is the number of days since February 29, 2020 in the following model. 2 Mathematically,  this relationship can be modeled over i ∈ {1, . . . , 53} t ∈ {1, . . . , T} time periods (through December 2, 2020) as follows: We employ the following set of prior distributions for the parameters of Equation (2): and the following set of hyperpriors: As the above equations indicate, in our prior distribution for μ α i , we incorporated dependence on mobility. Since COVID-19 had a two to fourteen day incubation period in 2020, it is important to lag these mobility variables as changes in behavior will take several days at the very least to manifest themselves in terms of case growth [38]. As a result, we incorporated a 14 day lag for this variable, which enables us to understand the impact of people's movement on subsequent disease proliferation. In our prior distribution of β i , we incorporated dependence on mobility, population density, (PopDensity i ) and Income (Income i ), which allows us to quantify the effect of these factors on the growth rate of cases. Lastly, Equation (5) represents the spatial coefficient of this model. In particular, γ i,t is a conditionally autoregressive term incorporating spatial dependence between counties, having the following intrinsic Gaussian conditional autoregressive specification: where each w i,j are elements of a spatial matrix W [4]. These elements were operationalized by compiling information on longitude and latitude of each county's centroid and calculating the Euclidean distance between each pair of counties as [26]: where d 1 j denotes the latitude and d 2 j denotes the longitude of the centroid of county j and assumed that the geographic influence of neighboring counties to decay exponentially with distance as follows: This model captures a variety of important factors influencing disease proliferation during the pandemic. Firstly, Equation (6) incorporates the impact of mobility on case growth, while Equation (7) models the impact of population density and income on how quickly the cases proliferate in particular localities. Lastly, Equation (5) incorporates spatial interrelationships between counties, which is important because, as Figures 1-3 illustrate, disease proliferation has often been localized, albeit in different areas, during various times of the pandemic. Understanding these relationships before the introduction of vaccines can be very useful in helping prepare for potential future public health crises.

Model estimation and results
We estimated our model using WinBUGS over 30,000 MCMC iterations with 3000 iterations for burn in [23]. To assess the convergence of our MCMC sampler, we used the Geweke convergence test (Geweke 1991) to compare data early in the test period (first 20% of MCMC iterations after the burn-in period) with data late in the test period (final 50% of MCMC iterations), to determine whether they are significantly different [25]. Comparisons amongst all marginal posterior coefficient estimates are insignificant, suggesting that initial conditions of the MCMC sampler had dissipated within the burn-in period and thus the posterior density had been adequately navigated in each simulation. Our analysis has a few notable aspects. Firstly, the posterior means of the χ 2 goodness of fit statistics are approximately 0.5 for each Census region, suggesting the spatial models adequately fit the modeled data [24,42]. Additionally, as Figures 5-8 illustrate in the Supplemental Materials addendum, in most of the regions, the mobility coefficients exhibit some correlation between each other, with the highest degree of correlation occurring in the Southern census region. In terms of the posterior coefficients themselves, the marginal posterior densities corresponding to ResMobility and for GroceryMobility are positive for three of the four census regions (within their 95% credible intervals), with the lone exeption being the Southern region. The strongest effects pertaining to this coefficient occur in the Northeast Census region (posterior means 0.096 and 0.012 respectively). These results suggest that, across these regions, these factors have a positive impact on α i,t and thus proliferation of the disease. The marginal posterior densities for RetailMobility

Model efficacy
In order to ascertain the utility of our spatial model, we performed an external validation of the model to assess its predictive accuracy compared to a simpler more conventional counterpart. We did so by utilizing pseudo-Bayes factors (PsBF) [20,21]. Based on a model's cross-validation predictive density, the PsBF is considered a useful surrogate to standard Bayes Factors, which can be computationally difficult to compute [14]. If we let y i,t be our observed data over parameter space , at county i = 1, . . . , I across t across t = 1, . . . , T days, and y (i,t) be the data of confirmed COVID-19 cases with observation (i, t) deleted, then the cross validative predictive density π(y i, t | y (i,t) ) is: where f is the model's likelihood function and (r) is the vector of parameter values obtained during the r th MCMC iteration [42]. The relation in (23) is based on a truncated series approximation of the harmonic mean of the Poisson regression function evaluated at each posterior draw, averaged across all R post burn-in MCMC iterations [42]. As a result, the PsBF comparing the non-spatial model (M = 1) to the spatial model (M = 2) can be written as follows: We can take advantage of the additive properties of logarithms and utilize (23) to estimate the logarithms of the numerator as well as the denominator in (24). These quantities, approximations to the log-marginal data likelihoods for each model, enable us to estimate PsBFs for each conflict examined. Table 1 contains our results. 3 Our pseudo-Bayes Factor presented in Table 1 are considerably higher than one, thus indicating that our spatial model offers considerably more explanatory power than the simpler non non-spatial counterparts examined. These results thus suggest that the spatial interrelationships between counties are a fundamentally important aspect of modeling the effect of mobility on the spread of COVID-19.

Analysis and policy implications
As we noted earlier, the marginal posterior densities corresponding to ResMobility are positive (up to the 95% credible interval) for the Northeast, Western, and Midwestern Census region, with the strongest effect occurring in the Northeast (posterior mean 0.096). These results suggest that in these regions staying at home may, perhaps counterintuitively to some, results in greater disease proliferation with respect to baseline activity. Although the intention of requiring people to stay at home during the COVID-19 outbreak was to limit the spread of the disease, a number of policymakers observed that a significant number of new cases arose from people staying at home [9,11]. This result is likely due to the fact that staying at home by itself is not sufficient to curb the spread of the disease unless doing so is accompanied by isolating from other household members. Policy research has indicated that the use of voluntary isolation centers could have been useful for separating the sick from others to prevent them from continuing to spread the disease to others, including to family members. [16].
Additionally, all four Census regions note a negative effect of retail mobility via 95% credible interval estimates of RetailMobility , once again with the most pronounced effect in the Northeast (posterior mean −0.017). This result may also be counter-intuitive to some, but perhaps may be explained by the fact that, due to COVID-19 concerns, retail stores take recommended precautions intended to curb the spread of COVID-19 including limiting the number of customers entering the store and encouraging social distancing among others [52]. As a result, the diseases proliferates less frequently in the retail setting.
The Northeast, Midwest, and Southern census regions also notes a positive effect of visiting grocery stores on disease proliferation with posterior means ranging from 0.002 (South) to 0.012 (Northeast). This finding is not surprising -Purchasing groceries are an integral aspect of American's lifestyles. With grocery stores filled with people compared to typical retail stores, without proper precautions, these stores provide a ripe opportunity for disease to spread. Other regions note no effect within the 95% credible interval, suggesting that grocery store visitation does not meaningfully affect disease proliferation. Social distancing within the grocery stores as well as the increase in use of online grocery delivery likely would have helped curb any effects that may occur.
The negativity of WorkplaceMobility for all four regions (Southern, Midwest, and Western) is also notable. Although, as suggested by the Google mobility data (see for example Figures 6 and 7, many have been working from home this year, some people either must go to work or choose to out of their own volition. As part of the effort to curb disease proliferation, however, public health officials have advised people to avoid going to work if they are sick [7]. Adherence to such guidelines may very well have contributed reduce the spread of COVID-19 in workplaces.
Lastly, all four Census regions indicate a positive effect of frequenting parks on disease proliferation. The extent to which this effect occurs, however, is extremely small with the marginal posterior mean coefficient being less than 0.01 suggesting that this effect is less impactful that the other mobility factors examined in this study.
Additionally, the marginal posterior coefficients of population density and income also yield some insightful results. From a theoretical perspective, localities that have higher population densities have a greater ability to spread COVID-19, as well as other illnesses, than more sparsely populated areas [31]. All four census regions, however, find a lack of an effect of population density on disease spread. Given the fact that our spatial model has more explanatory power than it's non-spatial counterpart, this result also makes sense -People may travel to more densely populated areas, such as for work or for shopping, and then subsequently bring the disease home and spread it locally. Regarding income, our coefficient Income also has both positive and negative density up to the 95% credible regions for all four Census regions. This result suggests that temporal growth of the disease is not confined to communities of particular income levels.
Finally, and perhaps most importantly, our estimates of α signify whether there are factors outside of the variables looked at in this study affecting the spread of the disease. For example, the positive effects of α for Midwest census region and the negative effects of this coefficient for the Northeast and Southern regions suggest that there may be factors outside of the mobility factors examined in the associated prior distribution that may be contributing to disease proliferation and mitigation. One possibility, for example, is that specific sub-populations that might be contributing to disease proliferation. Young adults frequenting bars, for example, has been attributed as a reason COVID-19 continued to spread in the summer [43]. Others may be socially distancing as had been encouraged for months, which may help mitigate disease spread [16].

Conclusions and future research
COVID-19 has evolved into a pandemic unparalleled by any other disease in the last hundred years [16]. Although the introduction of vaccines in December 2020 significantly blunted the proliferation and impact of the virus throughout much of 2021 [15], the model presented in this paper offers useful insights about the spread of COVID-19 before such tools were available. Methodologically, we have developed a hierarchical Bayesian model that incorporates county-level heterogeneity and their interrelationships to understand the factors influencing the spread of COVID-19. Most notably, we find the counter-intuitive result that staying at home can lead to greater disease proliferation. Exploring ideas about how to limit disease spread at home would be particularly useful in preparing for other challenges that may lie ahead.
Even after the pandemic, it is imperative to continue to study COVID-19 to help understand the spread of other possible diseases that may come our way. We therefore hope that this work encourages future research in statistically understanding the spread and mortality of COVID-19 as well as other diseases. One potential avenue of future research could involve adding additional explanatory variables to our model (such as age, ethnicity, and health status) as well as performing an analogous bivariate examination of COVID-19 cases and deaths simultaneously. Such analysis could be useful for understanding the interrelationship between disease spread and death as well as the associated factors influencing their outcomes. Another potential avenue of future research could involve examining the county-level coefficients of this model during different time periods of the disease and examining their evolution over time. This analysis could be useful for assessing the efficacy of policy approaches attempted across the different states. A third avenue of future research could involve adjusting the prior specification in this paper and utilizing spatial Dirchlet Process priors instead to extract spatial clusters that could yield useful insight on factors contributing to the spread of the disease in particular localities [22]. Finally, it would be fascinating to apply this model to other countries to understand the spread of COVID-19 within and across other nations' borders.
Spatial modeling is a useful tool to understand how different localities contribute to changes in important phenomena. We hope that this work provides a useful contribution in understanding the proliferation of COVID-19, especially to help prepare for other public health crises that may come our way.

Notes
1. To preserve anonymity, Google sometimes does not report data on certain days in the time horizon. To address this limitation, we interpolated missing data by computing the average mobility index over a time horizon spanning three days before and three days after the missing data point in question. If after this interpolation, a particular county still contained missing data, it was dropped from our analysis. 2. Since the reporting of the Google Mobility data used in this regression model begins on February 15th, 2020, in order to introduce a two week lag in between the mobility data and COVID-19 cases, we begin our analysis on February 29, 2020. 3. Due to the high dimensionality of the Midwest and the Southern Census regions, memory limitations in WinBUGS preclude log-marginal density calculations for the entire region. Therefore for these two regions, the pseudo-Bayes factor computations are based on a subset of the data. Other subsets from these regions, available from the authors but not presented here, also demonstrate large pseudo-Bayes factors and thus significant explanatory gains.