A spatio-temporal statistical model to analyze COVID-19 spread in the USA

The Coronavirus pandemic has affected the whole world extensively, and it is of immense importance to understand how the disease spreads. In this work, we provide evidence of spatial dependence in the pandemic data and accordingly develop a new statistical technique that appropriately captures the spatio-temporal dependence pattern of the COVID-19 spread. The proposed model uses a separable Gaussian spatio-temporal process, in conjunction with an additive mean structure and a random error process. The model is implemented through a Bayesian framework, thereby providing a computational advantage over the classical approach. We use state-level data from the United States of America in this study. We show that a quadratic trend pattern is most appropriate in this context. Interestingly, the population is found not to affect the numbers significantly, whereas the number of deaths in the previous week positively affects the spread of the disease. Residual diagnostics establish that the model is adequate for understanding the spatio-temporal dependence pattern in the data. It is also shown to have better predictive power than other spatial and temporal models. In fact, we show that the proposed approach predicts well in both the short term (1 week) and the long term (up to three months).


Introduction
The recent Coronavirus pandemic has made the year 2020 immensely challenging across the world. It started in December 2019, presumably from a fish market in Wuhan, China [39]. The virus is known as the novel Coronavirus (2019-nCoV), while the pandemic is termed COVID-19. World Health Organization (WHO) declared it a Public Health Emergency of International Concern on 30th January 2020. As of 29th November 2020, more than 60 million people have been affected by the virus, while the death toll is recorded at nearly 1.5 million across the world.
With the epidemic spreading wildly, many statistical studies have emerged as well. The primary focus of many of these papers is to use an appropriate technique to model the number of cases infected by the virus. We briefly discuss a few of them here. Anastassopoulou et al. [3] developed a susceptible-infectious-recovered-dead model and estimated the reproduction number using the data from China. Another similar study was done by Wu et al. [47]. In order to capture the temporal dependence pattern of the spread of the disease, Alzahrani et al. [2] worked on an autoregressive integrated moving average (ARIMA) method and showed that it works well as a prediction model for the number of COVID-19 cases in Saudi Arabia. Deb and Majumdar [8] also developed an ARIMA framework and showed that a time-dependent quadratic trend fits the data better. Based on an ARIMA structure, a nonlinear autoregressive (NAR) neural network was developed by Khan and Gupta [22] to compare the accuracy of the predictive models. Several other machine learning models have also been developed in this regard. The reader is referred to Tuli et al. [42], Wang et al. [45] and the references therein for related literature. Kırbaş et al. [23] is also worth mentioning in this regard; they provided a thorough comparative study of ARIMA and machine learning models in terms of analyzing and predicting COVID-19 numbers in different countries.
While the aforementioned methods have their advantages, they fall short of providing a complete view of how the disease spreads across different locations within a country. COVID-19 being a highly contagious disease, inter-state or inter-district movements affect the number of confirmed cases. Further, there is often substantial heterogeneity in the ethnic composition, socio-economic status, political views, and urbanization across different states of a country, especially one like the United States of America (USA), and that should have an effect on the pandemic. Existing epidemiological or time series models can be applied separately to state-level data or to the overall country-wide data, but in both approaches, they fail to address the inter-state movements, the social, geographic, or economic disparities in different regions, and their potential effect on the pandemic. That naturally calls for an appropriate spatio-temporal model. However, while evidence of spatial dependence in the spread of COVID-19 has been established in some papers (Andersen et al. [4] and Kang et al. [20], for example), very few have discussed appropriate spatio-temporal techniques in this context.
In a related study, Zhou et al. [49] proposed an epidemiological model where they combined the extended susceptible-antibody-infected-removed model (see Wang et al. [46]) and a cellular automata (see Neumann et al. [31]). They analyzed spatio-temporal data from the USA using the proposed method. Meanwhile, a Bayesian implementation of a hierarchical space-time susceptible-exposed-infected-removed model was developed by Sartorius et al. [38]. Guliyev [16], on the other hand, moved away from the epidemiological paradigm and worked on spatial panel data models for the data from China. The study showed that the spatial lagged X model (SLX) has a better performance than others. Sun et al. [40] also worked on a similar model, focusing on the data from the USA. However, both of these papers fail to combine spatial and temporal aspects together in one framework. The works by Dickson et al. [10] and Giuliani et al. [14] are major improvements in this aspect. These papers consider a spatio-temporal generalized linear mixed effect model with a negative binomial distribution, which takes care of the possible over-dispersion in the count data analyzed. In the current article, we focus on an appropriately defined statistical spatio-temporal model to understand the spread of the disease and apply it to the data from the USA. The advantage of this technique is three-fold. First, the proposed model incorporates the spatial dependence as a function of the geodesic distance between two states and subsequently gives an idea of the extent of significant spatial dependence given other factors remaining constant. Second, the method is able to estimate the effect of spatio-temporal dependence and random noise separately, thereby providing a better idea of which one is more significant in spreading the disease. Third, from a methodological and implementation point of view, the proposed Bayesian framework is attractive and efficient. 
It is also flexible enough to be extended to other similar studies, possibly with a different and more extensive set of regressors.
The paper is organized as follows. In the following section, a brief description and summary of the data, along with the motivation of considering a spatio-temporal model, is provided. Our proposed method and related discussions are given in Section 3. Application to the data from the USA is presented in Section 4. We finish with some important remarks in Section 5. Detailed mathematical derivations are provided in the Supplemental Material.

Data and exploratory analysis
The data used for this study is collected from JHU-CSSE [18]. We convert the raw data to state-level data for all states of the contiguous USA along with the District of Columbia, recorded at a weekly level from the week of 20th January 2020 until the week of 23rd November 2020. The first case of 2019-nCoV in the USA appeared on 19th January 2020 in Washington [17]. Thus, our data include all confirmed cases in the country till the last date of the dataset. In total, there are 2205 observations from 49 locations and 45 weeks. For our spatial analysis, we calculate the center of the states by taking the median of longitude and latitude of all the counties within each state, as given by the raw data used in the study. We verified that this approach provides geographic centers close to the centers obtained by other geographic studies, such as Rogerson [35].
We compute the cumulative number of confirmed cases and the cumulative number of deaths for every week in every state. Subsequently, state-wise numbers of new cases and new deaths in each week are calculated. The population of the states is used in the analysis as well. Our primary focus in this paper is on the prevalence of the disease, which in epidemiology is defined as the proportion of affected people in a population. In this work, we calculate it as the proportion of confirmed cases. If z(s_i, t) and p(s_i, t) denote the number of affected cases and the total population for the ith state at time t, respectively, then the prevalence π(s_i, t) is defined as

π(s_i, t) = z(s_i, t) / p(s_i, t).

In the main model, we work with y(s_i, t) = log π(s_i, t). It is worth mentioning that we choose a logarithmic transformation over a logit or a probit transformation since researchers (see Leffler et al. [25] and Takagi et al. [41] for example) have already shown that the logarithmic transformation explains the trend in the spread of COVID-19 better.
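As a concrete illustration of the transformation above, here is a minimal Python sketch (the paper's own analysis is in R); the counts and population are made-up illustrative numbers, not values from the JHU-CSSE dataset.

```python
import math

# Hypothetical cumulative confirmed counts z(s_i, t) for one state over three
# weeks, and a made-up state population p(s_i, t) (illustrative only).
confirmed = [1200, 1850, 2600]
population = 1_000_000

# Prevalence pi(s_i, t) = z(s_i, t) / p(s_i, t); response y = log prevalence
prevalence = [z / population for z in confirmed]
y = [math.log(p) for p in prevalence]
```

Since cumulative counts are non-decreasing, the log-prevalence series is non-decreasing as well, and it is negative whenever prevalence is below one.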
In Table B1 in the Supplemental Material, summaries for all the states are presented. Along with the date of the first case observed in every state, we also display the number of total cases and the log-prevalence as of the last date in the data. It is interesting to note that the first case for a few states came as late as 16th March 2020. As of the week of 23rd November 2020, Vermont has the lowest number of confirmed cases with 4092, whereas Texas has the highest with 1,225,118. Vermont also has the lowest number of deaths with 67, whereas New York has the highest with 34,411. Prevalence-wise, the lowest and highest states are Vermont and North Dakota, respectively. The overall mean, median, and variance of the log-prevalence for the last week across all the states are −1.947, −1.943, and 0.015, respectively.
Next, we turn our attention to the spatial dependence of the log-prevalence across the states. Through exploratory analysis, we find significant spatial correlation in the data, thereby motivating us to consider a spatio-temporal model. To that end, we use Moran's I statistic to measure the spatial autocorrelation for every week. Recall the definition of y(s_i, t), the log-prevalence of the ith state in the tth week. Let ȳ_t be the mean of the variable at time t. Then, for a properly defined spatial weight matrix W with zeroes on the diagonal, Moran's I statistic is given by

I_t = (n / S_W) × [Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij} (y(s_i, t) − ȳ_t)(y(s_j, t) − ȳ_t)] / [Σ_{i=1}^{n} (y(s_i, t) − ȳ_t)²],   (2)

where n is the total number of spatial locations, w_{ij} is the element in the ith row and jth column of the matrix W, and S_W is the sum of all the elements of W. For the weight matrix, we choose exponentially decaying weights of the form

w_{ij} = exp(−‖s_i − s_j‖), i ≠ j.

Here, ‖s_i − s_j‖ is the haversine distance between the ith and jth states with given latitude and longitude; it is the great circle distance between two points on a sphere. Values of the Moran's I statistic for different weeks are plotted in Figure 1. It is evident that the statistic starts becoming significant from the sixth week, and after the 13th week, which corresponds to 13th April 2020, it is always highly significant. That indeed implies a strong spatial correlation.
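The statistic above can be sketched directly; the following illustrative Python version (the analysis in the paper is done in R) uses made-up coordinates and log-prevalence values, and rescales the weight decay by 1000 km purely so the toy weights are not all near zero.

```python
import math
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle (haversine) distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def morans_i(y, W):
    """Moran's I = (n / S_W) * sum_ij w_ij dev_i dev_j / sum_i dev_i^2."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    return (len(y) / W.sum()) * (dev @ W @ dev) / (dev @ dev)

# Two made-up clusters of "states": nearby locations share similar values.
coords = [(30.0, -100.0), (30.5, -100.5), (45.0, -70.0), (45.5, -70.5)]
y = [-2.1, -1.9, 1.9, 2.1]  # spatially clustered deviations (illustrative)
n = len(coords)
W = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            W[i, j] = math.exp(-haversine(*coords[i], *coords[j]) / 1000.0)
```

With these clustered values, `morans_i(y, W)` is strongly positive, the pattern the paper reports for the weekly log-prevalence maps.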
The evidence of spatial correlation is further confirmed through Figure 2 where we show the log-prevalence of the COVID-19 spread in all the states for four different time-points.  The map depicts that the states with high prevalence are closer together and the same can be said about states with low prevalence. This gives us stronger evidence for spatial correlation in the COVID-19 spread throughout the USA and hence the motivation for creating an appropriate spatio-temporal model.

Model
Let s_i ∈ R², for i = 1, 2, . . . , n, denote the latitude and longitude of the locations, and let t ∈ {1, 2, . . . , T} denote the weeks in which data were collected. For the complete data, n = 49 and T = 45.
Let y(s_i, t) be the log-prevalence for the ith state in the tth week. The proposed model is defined as

y(s_i, t) = μ(s_i, t) + v(s_i, t) + ε(s_i, t),   (4)

where μ(s_i, t) corresponds to the mean structure, v(s_i, t) is a zero-mean spatio-temporal process, and ε(s_i, t) is a white noise process. The ε(s_i, t), for all s_i and all t, are assumed to be independent and identically distributed N(0, σ_ε²) random variables. The mean space-time function μ(·, ·) is given by

μ(s_i, t) = β_0 + β_1 log p(s_i, t) + β_2 t + β_3 t² + β_4 log d(s_i, t − 1).   (5)

Here, p(s_i, t) is the population and d(s_i, t) is the number of new deaths in the ith state in the tth week. Note that the above mean structure can be written as μ = Xβ, where X is the design matrix with every row of the form (1, log p(s_i, t), t, t², log d(s_i, t − 1)). The parameter vector β = (β_0, β_1, β_2, β_3, β_4) captures the effect of the covariates in the mean structure. A discussion on the formulation of the covariates is warranted at this point. First of all, the population is commonly used in similar analyses and has proved to be significant. Next, the terms associated with β_2 and β_3 signify a linear and a quadratic time trend, respectively. A similar formulation of the trend has proved to work better for the COVID-19 spread in earlier studies, cf. Deb and Majumdar [8]. Finally, following Guliyev [16], we include the number of deaths in the mean structure, and it improved the model significantly. We also explored the possibility of using the cumulative number of deaths for every location, but the previous week's number on the logarithmic scale performed the best. Next, v(s_i, t) is treated as a spatially varying temporal trend. A Gaussian process with zero mean is assumed for this space-time process. We adopt a separable structure (see Mardia and Goodall [28] for reference) for the covariance function, defined as

cov(v(s_i, t_k), v(s_j, t_l)) = σ_v² ρ(‖s_i − s_j‖; φ_s) ρ(|t_k − t_l|; φ_t).   (6)

We find that a non-separable specification does not improve the performance, and hence the separability assumption is adopted for ease of implementation. This phenomenon has been commonly observed in many similar spatio-temporal works, cf. Sahu et al. [37], Deb and Tsay [9] and the references therein. For both the spatial and the temporal covariance functions, an exponentially decaying pattern is assumed, i.e. ρ(d; φ) = exp(−φd). Note that this is a special choice (ν = 0.5) within the Matérn class of covariance functions [30]. In Equation (6), |t_k − t_l| is the usual difference between the two time-points, while ‖s_i − s_j‖ is calculated using the haversine formula, which gives the great circle distance between two points on a sphere.
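To make the separable specification concrete: the covariance matrix of the stacked process over all locations and time-points is σ_v² (Σ_s ⊗ Σ_t), a Kronecker product of an n × n spatial factor and a T × T temporal factor. Below is a small numpy sketch with made-up distances, lags, and parameter values (the paper's implementation is in R).

```python
import numpy as np

def exp_corr(d, phi):
    """Exponential correlation rho(d; phi) = exp(-phi * d)."""
    return np.exp(-phi * np.asarray(d, dtype=float))

# Illustrative inputs: n = 3 locations with made-up pairwise haversine
# distances (km), and T = 4 weekly time-points.
space_dist = np.array([[0.0, 300.0, 800.0],
                       [300.0, 0.0, 500.0],
                       [800.0, 500.0, 0.0]])
time_lag = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))

phi_s, phi_t, sigma2_v = 0.001, 0.1, 1.5
Sigma_s = exp_corr(space_dist, phi_s)
Sigma_t = exp_corr(time_lag, phi_t)

# Separable covariance of V stacked location-by-location (each block of T
# entries is one location's time series):
Sigma_v = sigma2_v * np.kron(Sigma_s, Sigma_t)
```

The (12 × 12) matrix is positive definite because each factor is, and under separability only the n × n and T × T factors ever need to be stored or inverted, which is the key computational saving.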
Let us denote the unknown parameters as θ = (β, σ_v², σ_ε²). A Bayesian framework is used to estimate these parameters. First, we assume that, a priori, the coefficients in β are iid N(0, k), where k is large; we take k of the order 10⁴. For σ_v² and σ_ε², independent inverse gamma priors, denoted by IG(a, λ), are considered. The shape parameter a is ideally taken to be greater than 1 in order to avoid improper posterior distributions. We choose a = 2 so as to ensure a noninformative prior with infinite variance. It is worth mentioning that extensive experimentation established that the parameter estimates and model choice criteria are not sensitive to the choice of λ.
Finally, let φ = (φ_s, φ_t) denote the decay parameters for the exponential covariance functions. Ideally, φ can be estimated within the Bayesian framework, but under weak prior distributions, weak identifiability and slow mixing of the Markov chains cause the Gibbs sampler for φ and θ to behave poorly. Therefore, a computationally less expensive procedure is adopted to find the optimal value of φ: we explore a two-dimensional grid of (φ_s, φ_t) values and use a cross-validation scheme to choose φ. See Section 4 for more details.

Posterior distributions
In this section, we discuss the posterior distributions of the parameters. Detailed calculations are presented in the Supplemental Material.
Let M = nT be the total number of observations. Let Y, a vector of order M × 1, denote all the data, arranged first by location and then by time. V is defined accordingly for the v(s_i, t) process. Note that V is the concatenated form (by column) of the T × n matrix

[V_1, V_2, . . . , V_n], where V_i = (v(s_i, 1), . . . , v(s_i, T))′.   (7)

Further, let Σ_s and Σ_t denote the spatial and temporal correlation matrices for the v(s_i, t) process. That is, for i, j = 1, . . . , n and for k, l = 1, . . . , T,

(Σ_s)_{ij} = ρ(‖s_i − s_j‖; φ_s), (Σ_t)_{kl} = ρ(|t_k − t_l|; φ_t).   (8)

Now, from Equation (4), it can be inferred that

Y | V, θ ~ N_M(Xβ + V, σ_ε² I_M),   (9)

where N_M denotes an M-variate Gaussian distribution and I_M is an identity matrix of order M × M. The prior specification for the v(s, t) process can be written as

V | σ_v² ~ N_M(0, σ_v² (Σ_s ⊗ Σ_t)),   (10)

where ⊗ denotes the Kronecker product. For the other parameters, the prior distributions are given by β ~ N(0, kI), σ_v² ~ IG(a, λ), and σ_ε² ~ IG(a, λ). Then, the joint posterior can be calculated as

π(β, σ_v², σ_ε², V | Y) ∝ f(Y | V, β, σ_ε²) π(V | σ_v²) π(β) π(σ_v²) π(σ_ε²).

We follow the Gibbs sampling procedure. Straightforward calculations show that the conditional posterior for β is

β | · ~ N((X′X/σ_ε² + I/k)^{−1} X′(Y − V)/σ_ε², (X′X/σ_ε² + I/k)^{−1}).
The conditional posterior of σ_v² is given by

σ_v² | · ~ IG(a + M/2, λ + V′(Σ_s ⊗ Σ_t)^{−1}V/2).

Similarly, the conditional posterior of σ_ε² is given by

σ_ε² | · ~ IG(a + M/2, λ + (Y − Xβ − V)′(Y − Xβ − V)/2).

Finally, the usual posterior distribution of the vector V demands the inversion of an M × M matrix in each iteration of the Gibbs sampling procedure. That is computationally very challenging, and so we adopt a less expensive route. Divide V into V_1, V_2, . . . , V_n, with V_i being the ith column of the V matrix in Equation (7), and let V_{−1} = (V_2, . . . , V_n). Consider the following partitioned form of the spatial correlation matrix, with subscript 1 denoting the correlation part corresponding to the first location and subscript 2 denoting the same for the rest of the locations:

Σ_s = [Σ_{11} Σ_{12}; Σ_{21} Σ_{22}].
Multivariate normal distribution theory then indicates that

V_1 | V_{−1}, σ_v² ~ N_T(m_1, σ_v² c_1 Σ_t), where m_1 = V_{−1}Σ_{22}^{−1}Σ_{21} and c_1 = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}.

Let Y_1 be the data for the first location and all time-points t = 1, 2, . . . , T, and let X_1 be defined accordingly. Then, using the above, we get that

V_1 | · ~ N_T(Λ_1[Σ_t^{−1}m_1/(σ_v² c_1) + (Y_1 − X_1β)/σ_ε²], Λ_1), with Λ_1 = (Σ_t^{−1}/(σ_v² c_1) + I_T/σ_ε²)^{−1}.

Following a similar idea, the conditional posterior distributions of V_2, . . . , V_n can be computed as well. We use these conditional posterior distributions iteratively in the Gibbs sampler and, after a sufficiently long burn-in period, retain suitably thinned posterior samples. The samples are used to estimate posterior means and the corresponding credible intervals for the parameters.
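The full conditionals above can be assembled into a compact Gibbs sampler. The sketch below is a simplified illustration on synthetic data, not the authors' implementation: for readability it updates the whole vector V in one block (inverting an M × M matrix), which is exactly the expensive step that the paper's column-wise updates avoid, but is harmless at this toy size. All dimensions and hyperparameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Tiny synthetic dataset (all sizes and values are made up) -------------
n_loc, T = 3, 8
M = n_loc * T
# Design matrix: intercept + linear time trend (a reduced version of Eq. (5))
X = np.column_stack([np.ones(M), np.tile(np.arange(1, T + 1), n_loc)])
beta_true = np.array([-3.0, 0.05])
Sigma_s = np.exp(-0.5 * np.abs(np.subtract.outer(np.arange(n_loc), np.arange(n_loc))))
Sigma_t = np.exp(-0.1 * np.abs(np.subtract.outer(np.arange(T), np.arange(T))))
K = np.kron(Sigma_s, Sigma_t)            # correlation matrix of the stacked V
V_true = np.linalg.cholesky(0.5 * K) @ rng.standard_normal(M)
Y = X @ beta_true + V_true + 0.1 * rng.standard_normal(M)

# ---- Gibbs sampler ---------------------------------------------------------
a, lam, k_prior = 2.0, 1.0, 1e4          # IG(a, lam) and N(0, k) hyperpriors
beta, V = np.zeros(2), np.zeros(M)
s2v, s2e = 1.0, 1.0
K_inv = np.linalg.inv(K)
draws = []
for it in range(400):
    # beta | rest: Gaussian (conjugate N(0, k) prior)
    cov_b = np.linalg.inv(X.T @ X / s2e + np.eye(2) / k_prior)
    beta = rng.multivariate_normal(cov_b @ X.T @ (Y - V) / s2e, cov_b)
    # V | rest: Gaussian, prior precision (s2v * K)^{-1}
    cov_V = np.linalg.inv(K_inv / s2v + np.eye(M) / s2e)
    V = rng.multivariate_normal(cov_V @ (Y - X @ beta) / s2e, cov_V)
    # s2v | rest ~ IG(a + M/2, lam + V' K^{-1} V / 2), drawn as 1/Gamma
    s2v = 1.0 / rng.gamma(a + M / 2, 1.0 / (lam + V @ K_inv @ V / 2))
    # s2e | rest ~ IG(a + M/2, lam + ||Y - X beta - V||^2 / 2)
    resid = Y - X @ beta - V
    s2e = 1.0 / rng.gamma(a + M / 2, 1.0 / (lam + resid @ resid / 2))
    if it >= 200:                         # discard burn-in
        draws.append(beta.copy())

beta_hat = np.mean(draws, axis=0)
```

The inverse-gamma draws use the standard identity that x ~ IG(shape, rate) when 1/x ~ Gamma(shape, scale = 1/rate).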

Prediction
A main advantage of using an appropriately specified space-time process is that it can be used to predict the response variable for any spatial location, even one unobserved in the data, at any time-point. From Equation (9), we can see that for any location s′ at a future time-point t′, y(s′, t′) is conditionally independent of Y given v(s′, t′) and θ, with

y(s′, t′) = μ(s′, t′) + v(s′, t′) + ε(s′, t′).   (17)

While executing the Gibbs sampler, we draw samples from the posterior π(θ, V | Y). Using the samples for θ, we can obtain μ(s′, t′) and σ_ε². In order to do the predictive analysis for y(s′, t′), we need the conditional distribution of v(s′, t′) given V. Note that the joint distribution of V and v(s′, t′) is an (M + 1)-variate Gaussian with zero mean, Var(v(s′, t′)) = σ_v², and

Cov(V, v(s′, t′)) = σ_v² Σ_0, where the element of Σ_0 corresponding to (s_i, t_k) is ρ(‖s_i − s′‖; φ_s) ρ(|t_k − t′|; φ_t).

By conditional multivariate normal distribution theory,

v(s′, t′) | V, θ ~ N(Σ_0′(Σ_s ⊗ Σ_t)^{−1}V, σ_v²(1 − Σ_0′(Σ_s ⊗ Σ_t)^{−1}Σ_0)).

Thus, by drawing v(s′, t′) from the above conditional distribution and using θ from the Gibbs sampling procedure, we can predict y(s′, t′) using Equation (17). In this process, we generate many realizations from the posterior predictive distribution and use the mean as the desired point prediction. A prediction interval is also computed in the same way.
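The conditional-Gaussian step above is ordinary kriging algebra, which can be sketched as follows (illustrative numpy, with an arbitrary toy correlation matrix). A useful sanity check: if (s′, t′) coincides with an observed point, Σ_0 is the corresponding column of the correlation matrix, so the conditional mean reproduces the observed v exactly and the conditional variance collapses to zero.

```python
import numpy as np

def predict_v(V, Sigma, sigma0, sigma2_v):
    """Conditional mean and variance of v(s', t') given the stacked vector V.
    Sigma: M x M correlation matrix of V (the Kronecker product, pre-formed).
    sigma0: length-M vector of correlations between V and v(s', t')."""
    w = np.linalg.solve(Sigma, sigma0)          # Sigma^{-1} sigma0
    return w @ V, sigma2_v * (1.0 - sigma0 @ w)

# Toy correlation matrix (exponential in |i - j|) and a made-up V draw.
M = 6
Sigma = np.exp(-0.3 * np.abs(np.subtract.outer(np.arange(M), np.arange(M))))
V = np.array([0.4, -0.1, 0.3, 0.0, -0.2, 0.5])

# Predicting at an already-observed index reproduces it exactly:
mean, var = predict_v(V, Sigma, Sigma[:, 2], sigma2_v=1.5)
```

For a genuinely new (s′, t′), `sigma0` would instead be filled with ρ(‖s_i − s′‖; φ_s) ρ(|t_k − t′|; φ_t) products and the variance would be strictly positive.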

Model implementation and evaluation
In this section, we discuss different important steps in the implementation of the proposed model, the other competing models we use as benchmarks, and the criteria used to evaluate the effectiveness of the models. First, the exogenous regressors used in Equation (5) are checked for multicollinearity using standard procedures. Next, as mentioned in Section 3.1, while implementing the model, we estimate the optimal value of the decay parameter φ through a cross-validation approach. For that, we take all but the last two weeks of data in the training set (hereafter called the model-fitting set to avoid confusion) to fit the model and the last two weeks of observations to validate it. Let S_val denote the validation set of observations, and let y(s_i, t_k), for (s_i, t_k) ∈ S_val, be the true observations of log-prevalence in that set. For different values of φ, we train our model on the model-fitting set and use that to predict the log-prevalence for all points in S_val using the methods explained in Section 3.3. Denote these predicted values as ŷ_φ(s_i, t_k). Then, the validation mean absolute percentage error (VMAPE) is computed as

VMAPE(φ) = (100/|S_val|) Σ_{(s_i, t_k) ∈ S_val} |y(s_i, t_k) − ŷ_φ(s_i, t_k)| / |y(s_i, t_k)|,   (20)

where |S_val| is the number of observations in the validation set. Note that the mean absolute percentage error (MAPE) is a standard technique to evaluate the effectiveness of a model (see Katris [21] for example). The optimal choice of φ is obtained by minimizing the above quantity over different φ_s and φ_t values. For the spatial decay parameter, we consider the values 0.0001, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75 and 1, whereas for the temporal decay parameter, we consider 0.10, 0.50, 0.75, 1, 1.5 and 3.
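The grid search over the decay parameters can be sketched as below. Here `fit_and_predict` is a hypothetical placeholder standing in for the expensive step of refitting the model on the model-fitting set and predicting the validation weeks; the grids are the ones listed above.

```python
import numpy as np

def vmape(y_true, y_pred):
    """Validation MAPE in percent, as in Equation (20)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

phi_s_grid = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1]
phi_t_grid = [0.10, 0.50, 0.75, 1, 1.5, 3]

def select_phi(y_val, fit_and_predict):
    """Exhaustive grid search: return (best VMAPE, best phi_s, best phi_t)."""
    best = None
    for phi_s in phi_s_grid:
        for phi_t in phi_t_grid:
            err = vmape(y_val, fit_and_predict(phi_s, phi_t))
            if best is None or err < best[0]:
                best = (err, phi_s, phi_t)
    return best

# Demo with a stand-in predictor whose error vanishes at (0.001, 0.1):
y_val = np.array([-2.0, -1.9, -2.1])
best_err, best_phi_s, best_phi_t = select_phi(
    y_val, lambda ps, pt: y_val * (1 + abs(ps - 0.001) + abs(pt - 0.1)))
```

In the real analysis each `fit_and_predict` call is a full Bayesian model fit, so the 9 × 6 grid is the dominant computational cost of the tuning step.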
In line with the above, we use a similar criterion to assess the predictive accuracy of the model. Using the optimal value of φ from Equation (20), we evaluate the predictive power by calculating the MAPE on the test set as follows:

MAPE = (100/|S_test|) Σ_{(s_i, t_k) ∈ S_test} |y(s_i, t_k) − ŷ(s_i, t_k)| / |y(s_i, t_k)|,   (21)

where |S_test| is the number of observations in the test set S_test, i.e. the data of the last two weeks for all the locations. In addition, standard residual diagnostics are carried out to check the adequacy of the proposed model. In order to check the validity of the normality assumption, we use the Anderson-Darling normality test (Pettitt [33]) on the residuals obtained after fitting the model. A quantile-quantile (QQ) plot is also used in this regard. Next, to assess whether our model captures the spatial dependence in the data appropriately, we calculate Moran's I index on the residuals, following the procedure used in Section 2. Along a similar line, the Box-Pierce test [6] and the autocorrelation functions (ACF) are used to investigate whether any temporal dependence remains in the residual series corresponding to different locations.
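As an illustration of the temporal diagnostic, the Box-Pierce statistic Q = n Σ_{k=1}^{K} r_k² can be computed directly from the residuals; under the null of no autocorrelation, Q is approximately χ² with K degrees of freedom (the 1% critical value for K = 10 is about 23.2). A self-contained numpy sketch (the paper uses R's built-in test):

```python
import numpy as np

def box_pierce(resid, max_lag=10):
    """Box-Pierce Q statistic: n times the sum of squared sample
    autocorrelations at lags 1..max_lag."""
    resid = np.asarray(resid, dtype=float)
    resid = resid - resid.mean()
    n, denom = len(resid), resid @ resid
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = (resid[:-k] @ resid[k:]) / denom
        q += r_k * r_k
    return n * q

CHI2_99_DF10 = 23.21  # 99th percentile of chi-square with 10 d.o.f.

# A strongly autocorrelated (alternating) series versus white noise:
alternating = np.tile([1.0, -1.0], 100)
noise = np.random.default_rng(7).standard_normal(200)
```

`box_pierce(alternating)` lands far above the 1% critical value, while for white noise the statistic typically stays well below it, which is the pattern the residual series should show if the model has absorbed the temporal dependence.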
Subsequently, we focus on another crucial assumption in our model, i.e. the separability of the covariance function (Equation (6)). To judge the appropriateness of the assumption, we take the residuals and conduct a test for separability following the procedure laid out by Fuentes [13]. This procedure is based on coherence, which is computed using the cross-spectral density at different frequencies between the series for two locations. We carry out a procedure similar to the analysis of variance (ANOVA) method to determine if there is a significant deviation from the separability assumption; under the null hypothesis of separability, the main effects corresponding to the frequency factor are zero. Note that the author recommended using sufficiently distanced locations and sufficiently separated frequencies to conduct this test. Keeping that in mind, we use every fifth frequency in the data and randomly selected pairs of states. This testing procedure is repeated multiple times to get an overview of the validity and usefulness of the separability assumption. We also point out that another advantage of this procedure is that it simultaneously allows us to test for spatial stationarity, by testing for the significance of the main effects corresponding to the location factor in the above ANOVA-like model. The reader is referred to Fuentes [13] for detailed reading on both tests.
As the last piece of the analysis, we conduct a comparative study, where we consider seven other candidate models suitable for these data. In all of the models, we use the same covariates as mentioned in Equation (5). We omit the details of these models for conciseness. The references cited below can be used for in-depth discussions and understanding of the individual models, while the implementations can be found in the GitHub repository mentioned in Appendix C of the Supplemental Material.
First, as the baseline of our comparative analysis, an ordinary linear model is considered. Second, following Guliyev [16] and Sun et al. [40], we develop an SLX model, using the row-standardized coding scheme in this approach. Third, we implement a spatial Durbin error model (SDEM), which uses both spatially lagged covariates and a spatially autocorrelated error term. SDEM is a well-known spatial panel data analysis approach that can capture the potential spillover effects of the covariates; see Elhorst [11] for an introduction to such models. We also cite the works by Zhang et al. [48] and Alexakis et al. [1], who utilized the concept of spatial dynamic panel data models in other COVID-19 related studies. Fourth, we implement a spatio-temporal model via the integrated nested Laplace approximation (INLA). INLA is a popular approach in Bayesian inference, as it is computationally faster than classic MCMC methods because it approximates the posterior distributions. Following Rue et al. [36] and Blangiardo et al. [5], we create an appropriate stochastic partial differential equation (SPDE) model with an exchangeable covariance structure in the temporal dependence part and a suitable mesh structure in the spatial dependence part for our spatio-temporal data. The R-INLA project by Lindgren et al. [27] is utilized in this context. The fifth model in our comparative study is the conditional autoregressive (CAR) model proposed by Leroux et al. [26], where an appropriately defined spatial autocorrelation is induced in the random effect. In spatial statistics, CAR is one of the most common techniques and is extensively used; we follow Lee et al. [24] to implement the model for our data.
The last two models included in the analysis are related to our proposed model, and they help us judge the relative effectiveness of the spatial and temporal dependence terms in the system. In particular, as the sixth model, we develop a purely spatial model with an exponentially decaying covariance function, which is essentially the same as the proposed model but with a very large value of φ_t in Equation (8); this makes the correlation between two different time-points for the same location negligible. Then, following a similar idea, as the seventh model, we assume negligible spatial covariance and fit a model with just the temporal covariance.
We compare the candidate models primarily in two aspects, using different measures. As a standard measure from the inferential perspective, R² values from the fitted models are computed; this helps us understand how well each model explains the variation in the data. Then, the prediction MAPE on the test set, in the same manner as in Equation (21), is calculated for each of the models. We also compute the root mean squared error (RMSE) of the predictions as defined below:

RMSE = √[(1/|S_test|) Σ_{(s_i, t_k) ∈ S_test} (y(s_i, t_k) − ŷ(s_i, t_k))²].

Note that MAPE provides a sense of the relative error while RMSE provides an idea of the absolute error, and both are useful in understanding the overall accuracy of the models. In addition, as pointed out by Gneiting and Raftery [15], proper scoring rules should also be used in such comparative studies. So, along with the aforementioned two measures, we calculate the continuous ranked probability score (CRPS) as defined below:

CRPS = (1/|S_test|) Σ_{(s_i, t_k) ∈ S_test} ∫ (F_{s_i, t_k}(y) − 1{y ≥ y(s_i, t_k)})² dy.

Here, F_{s_i, t_k}(y) denotes the predictive distribution for a model. The above integral is easy to compute for the models in this study since we use a Gaussian structure. Refer to the aforementioned paper for detailed reading on this. We calculate the CRPS through the scoringRules package by Jordan et al. [19]. All computations in the analysis are carried out in RStudio version 1.4.1103 and R version 4.0.4.
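Since the predictive distributions in this study are Gaussian, the CRPS integral above has a well-known closed form: crps(y; μ, σ) = σ[z(2Φ(z) − 1) + 2φ(z) − 1/√π] with z = (y − μ)/σ, where Φ and φ are the standard normal CDF and density. A stdlib-only Python sketch of that formula (the paper itself relies on the scoringRules R package):

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution N(mu, sigma^2):
    crps = sigma * (z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi)), z = (y-mu)/sigma.
    Lower is better; the score is zero only for a perfect point forecast."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))
```

For a forecast N(0, 1) and observation y = 0, the score is 2φ(0) − 1/√π ≈ 0.2337, and the score grows as the observation moves away from the predictive mean.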

Results and discussions
As the first step in our analysis, we test for multicollinearity between the exogenous regressors and confirm its absence. Then, in order to assess the effectiveness and the predictive power of the proposed method, we partition the whole data from the USA into two parts. The final two weeks from all locations are kept as the 'test set', and the remaining data, i.e. the data from all states and the first 43 weeks (sample size 2107), are used to train the model. Prediction accuracy of the proposed method is judged on the 'test set'. Further, to describe how our model can be used to predict values for new locations, we repeat the analysis by setting two randomly selected states aside and excluding them from the training set. In this case, the size of the training set is 2021 (47 states and 43 weeks of data), and the same 'test set' as above is used to check the prediction accuracy. The two randomly selected states are New Hampshire and Vermont.
Next, in order to estimate φ through the cross-validation scheme, out of the 43 weeks in the training data, we take the first 41 weeks as the model-fitting set and the other two weeks of observations to validate it. Thus, the last week included in the model-fitting, validation, and test set are 26th October 2020, 9th November 2020, and 23rd November 2020, respectively.
After searching over all combinations of the φ_s and φ_t values, we find that the minimum VMAPE is obtained at φ_s = 0.001 and φ_t = 0.1. To put this into perspective, based on the effective range criterion exp(−φd) ≈ 0.05, these choices suggest that the temporal dependence remains significant over a long time period, and the spatial dependence is significant for up to 1000 kilometers. We also point out that the VMAPE is quite stable near these values of φ_s and φ_t, so we do not need to refine the grid further. Moreover, extensive experiments varying the number of weeks and the number of locations show that the optimal choices of the two decay parameters are quite robust across different training sets. Hence, for the rest of this analysis, we use φ_s = 0.001 and φ_t = 0.1.
Using the above φ, we fit the proposed model on the training data and obtain the estimates of the parameter vector θ . In the Gibbs sampler, the first 5000 iterations are taken as the burn period, and after that, we take a posterior sample of size 1000, each sufficiently apart from the other to ensure independence. Posterior means are used as point estimates for the coefficients (see Equation (5)) while a credible interval is computed using the above posterior samples. In Table 1, the estimates along with the credible intervals are presented. For comparison, in Table 2, we present the results for the case where we leave New Hampshire and Vermont out of the training data.
Both tables depict similar results. The quadratic term in the time trend turns out to be significantly positive, thereby suggesting a convex pattern; that is, the spread increases at an increasing rate over time. The spread of the disease is positively associated with the number of new deaths as well. Interestingly, the population is not a significant factor in this model. Both the population and the linear term in the time trend are, however, significant if we consider a linear model with the same covariates. We can argue that these effects are captured primarily by the introduction of the spatio-temporal term v(s_i, t).
Coming to the variance parameters, we can see that σ_v is almost twice the value of σ_ε. This implies that the spatio-temporal process explains more variation in the data than the white noise process, which is a strong argument in favor of using a spatio-temporal model to understand the spread of COVID-19. As discussed before, we run standard diagnostic procedures to check whether the model captures the spatial and temporal dependence patterns correctly. First, we compute the Box-Pierce test for the residuals of every state and find that at the 1% significance level, the test is insignificant for all locations, implying that there is no autocorrelation left in the residual series for any location. To further validate this, we compute the ACF values of the residuals for different states and find that they are mostly non-significant, thereby confirming that the residuals are not temporally correlated. For example, Figure 4 shows the ACF plots for four randomly selected states; we omit the rest for brevity. Next, we compute the Moran's I statistic (refer to Equation (2)) of the residuals for different weeks and plot it in Figure 3. A quick comparison of this plot with Figure 1 shows that the spatial correlation is no longer significant. Thus, the model specification is adequate to capture both the temporal and spatial dependence patterns of the data. Continuing along this line, the separability assumption of the covariance function in the space-time process is investigated next. Following the procedure laid out in Section 3.4, at the 1% level of significance, we find that the separability assumption is violated only in a few instances. In view of the fact that the assumption reduces the computational burden significantly, this shows that the model specification is very useful in the context of the problem. We also find no deviation from the spatial stationarity assumption.
Another crucial assumption in our model is that the spatio-temporal process is Gaussian. To validate this assumption, we perform the Anderson-Darling normality test on the residuals and get strong support for normality with a p-value of 0.55. The QQ plot for the residuals, presented in Figure 5, also shows little deviation from normality. Further, the measures of skewness and kurtosis for the residuals are close to 0 and 3, respectively, which supports the model assumption as well.
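These normality checks are readily available in scipy; the sketch below (our own naming, not the authors' code) bundles the Anderson-Darling statistic with sample skewness and Pearson kurtosis:

```python
import numpy as np
from scipy import stats

def normality_summary(resid):
    """Anderson-Darling statistic, sample skewness, and Pearson kurtosis
    (which equals 3 for a normal distribution)."""
    r = np.asarray(resid, dtype=float)
    ad = stats.anderson(r, dist='norm')
    return ad.statistic, stats.skew(r), stats.kurtosis(r, fisher=False)
```

For Gaussian residuals, one expects a small Anderson-Darling statistic, skewness near 0, and kurtosis near 3, matching what is reported above.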
We now turn our attention to the prediction performance of the model. Below, we summarize the findings of the predictive analysis and defer the detailed results to the Supplemental Material. Recall that the 'test set' consists of two weeks of data for all the locations. Table B2 of the Supplemental Material presents detailed prediction results for every state over those two weeks: the true values, the predicted values, and the prediction intervals. Overall, it is evident that the predictions are very close to the true values, and the MAPE is computed to be 2.06%.
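The two accuracy summaries used throughout this section are straightforward to compute; a minimal sketch (function names are ours):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((a - p) / a))

def interval_coverage(actual, lower, upper):
    """Fraction of true values falling inside their prediction intervals."""
    a = np.asarray(actual, dtype=float)
    return np.mean((a >= np.asarray(lower)) & (a <= np.asarray(upper)))
```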
Next, we look at the prediction accuracy for the case where we leave New Hampshire and Vermont out of the training data. Table B3 in the Supplemental Material displays those results. The overall MAPE in this case is 2.13%. Clearly, even without all locations in the training data, the model performs quite well. To understand this better, let us take a look at the predictions for New Hampshire and Vermont in Table 3. Note that in both cases, the prediction intervals include the true value, and the individual absolute percentage errors are at most around 7%. This establishes the efficacy of the proposed method for similar spatio-temporal data.
Further, to check for robustness, we vary the 'test set' from 1 week to 12 weeks of data for all locations and use the remaining weeks' data to train our model. In Figure 6, Training set 1 corresponds to all states' data, whereas Training set 2 excludes New Hampshire and Vermont; the figure presents the overall MAPE for the varying test periods. We can see that the prediction MAPE is less than 2% for one-week-ahead predictions in both cases. As we increase the size of the test set, the MAPE increases, but it stabilizes at around 3% for four weeks or more. Evidently, the prediction accuracy of our model remains robust across all scenarios. Finally, we provide the results of the comparative study of our model against the other candidate models discussed in Section 3.4. Once again, we keep two weeks of data from all the locations as the 'test set' and use everything else to train the individual models. Note that, for consistency across all models, we use all states' data in the training set. For all of the candidate models, the different measures mentioned before are given in Table 4. We can see that our spatio-temporal model outperforms the others. Three of the other spatial models (SLX, SDEM and the purely spatial model) perform poorly, both in terms of fitting the data and in predictive accuracy; in fact, they are no better than a simple linear model. Turning attention to other popular methods such as INLA-SPDE and CAR, we see that our model's prediction accuracy is much better, although their training R² is higher than our model's; for INLA-SPDE, it exceeds 0.95. This suggests that these two methods overfit the data, which potentially affects their predictive ability as well. Although INLA is a flexible approach that can fit various models, our experiments show hints of overfitting in several of them. This is not uncommon; see, for example, [32] and Wang et al. [44].
Even when considering different prior distributions, as some researchers have suggested, this issue remains for our data. Meanwhile, we find that the second best model in terms of prediction accuracy is the purely temporal model, implying that there is more temporal dependence in the data than spatial. It also establishes that the model specification has been suitable. Overall, by considering both spatial and temporal dependence patterns appropriately, we see that the proposed approach can help us understand the spread of COVID-19 in a much better way.
In Table 4, we also provide the accuracy metrics of the different models in terms of forecasting the prevalence of COVID-19. Here, we use the standard reverse transformation on the response variable used in the models. It is worth mentioning, however, that this approach provides a biased prediction, and appropriate techniques can be adopted to circumvent that problem. Miller [29] is one of the earliest papers to discuss strategies for bias correction; the primary idea is to apply a bias correction factor based on the estimated variance of the underlying distribution. Refer to the articles by Crooks and Özkaynak [7] and Vrac and Friederichs [43] for appropriate bias correction mechanisms in spatio-temporal studies. We do not adopt such strategies, as the primary focus of our research is on understanding the spatio-temporal dynamics of the spread, which can be done through the log-prevalence variable. We also point out that, if needed, an empirical distribution of the original variable can be obtained from the posterior sample of the predictive distribution. Coming to the prediction accuracy measures for the prevalence of COVID-19, we see a similar pattern as earlier. The other spatial models (SLX, SDEM, and the purely spatial model) record more than 42% prediction MAPE, which is on par with the linear model and worse than the other models in the comparative study. The two Bayesian models (INLA-SPDE and CAR) display similar results as before, with the INLA-SPDE model registering slightly better accuracy than the spatial models, while the CAR model improves considerably on the aforementioned spatial models with a prediction MAPE of 22.47%. Finally, akin to the previous results, the purely temporal model is the second best in predicting prevalence, while our proposed model outperforms all others with a prediction MAPE of approximately 15%.
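A minimal sketch of the two routes mentioned above (our own illustration, not the paper's implementation): a Miller-type lognormal bias correction applied to a log-scale point prediction, and an empirical predictive mean on the original scale obtained directly from posterior draws:

```python
import numpy as np

def back_transform(log_pred, resid_var):
    """Bias-corrected back-transform of a log-scale prediction:
    under a lognormal model, E[Y] = exp(mu + sigma^2 / 2), so the naive
    exp(log_pred) is inflated by the factor exp(resid_var / 2)."""
    return np.exp(np.asarray(log_pred) + resid_var / 2.0)

def posterior_mean_original_scale(log_samples):
    """Empirical predictive mean on the original scale, computed by
    exponentiating posterior draws of the log-scale response (axis 0
    indexes draws) and averaging."""
    return np.exp(np.asarray(log_samples)).mean(axis=0)
```

The second route sidesteps the bias-correction issue entirely, since the whole predictive distribution is transformed draw by draw before summarizing.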

Conclusions
The proposed spatio-temporal model combines the spatial and temporal impact of the COVID-19 spread throughout the USA. It is observed that for any particular state, the correlation in the prevalence between two time-points remains significant for a long time.
On the other hand, we find a significant correlation in the confirmed cases between the locations. In fact, the spatial dependence remains significant for up to 1000 kilometers. One of the reasons for this could be the movement of people across different states. Because of the unavailability of the movement data, analyzing this further is beyond the scope of this paper. It, however, can be an interesting future project to extend the current work to address a more general problem in spatio-temporal study of epidemics.
The approach in this paper has a clear advantage over other time series or panel data approaches in terms of predicting for new locations. Performance-wise, our model stands out from other spatial or temporal models, as discussed in the previous section. It also provides valuable insights into the COVID-19 spread. The significance of the quadratic term in the time trend corroborates the findings from earlier studies. The number of new deaths in the previous week is found to positively affect the prevalence. Taking a cue from the studies by Pujadas et al. [34] and Faíco-Filho et al. [12], a possible explanation behind this result is that a higher number of deaths implies more people with higher viral loads and, therefore, a higher chance of infection for people in contact. Interestingly, once we introduce the spatio-temporal structure in the model, we find that population is no longer a significant factor in the spread of the disease. It supports the aforementioned idea that the movement of people is potentially the primary factor in the increased number of cases in the USA.
In light of the above, an obvious hypothesis is that restrictions and lockdowns imposed by the government would inevitably affect the COVID-19 spread. Along the same line, other interventions, such as vaccination, which started in the USA in December 2020 (source: https://www.bbc.com/news/world-us-canada-55305720), are likely to cause different types of effects on the prevalence. Due to the lack of relevant data, we could not analyze the effects of these variables in this paper. However, our proposed model is flexible enough to incorporate any such factor easily and can be used for a more holistic study of a similar dataset. In fact, one can adopt a time-dependent structure for the covariate effect in the proposed model to identify the difference in the spread of the disease before and after such interventions; this would be an appealing extension to the current paper. One can also use the proposed approach to analyze county- or district-level data and understand how the disease is spreading within a state; state-wise comparisons can be made in this way as well. A further future direction is to use the model to analyze similar spatio-temporal data on a more granular level, perhaps using daily data from every district. Although the proposed method works in theory, it would require appropriate approximations to ensure feasible computation.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
COVID-19 data from the United States of America are used in this study. This dataset is publicly available in JHU-CSSE [18].