Nonparametric estimation and inference for spatiotemporal epidemic models

Epidemic modelling is an essential tool to understand the spread of the novel coronavirus and ultimately assist in disease prevention, policymaking, and resource allocation. In this article, we establish a state-of-the-art interface between classic mathematical and statistical models and propose a novel space-time epidemic modelling framework to study the spatial-temporal pattern in the spread of infectious diseases. We propose a quasi-likelihood approach via the penalised spline approximation and alternatively reweighted least-squares technique to estimate the model. The proposed estimators are consistent, and the asymptotic normality is established for the constant coefficients. Utilizing spatiotemporal analysis, our proposed model enhances the dynamics of the epidemiological mechanism and dissects the spatiotemporal structure of the spreading disease. We evaluate the numerical performance of the proposed method through a simulation example. Finally, we apply the proposed method in the study of the devastating COVID-19 pandemic.


Introduction
Since the beginning of the reported cases in December 2019, the outbreak of COVID-19 has spread globally within weeks. To assist in prevention efforts and ultimately stop the pandemic, it is crucial to identify the vulnerable communities and why they are more likely to be infected. One way to answer these questions is through scientific modelling (Gog 2020;Vespignani et al. 2020). Several attempts have been made to model and forecast the spread and mortality of COVID-19; for example, see Kucharski et al. (2020) and Sun et al. (2020).
The fundamental concept of infectious disease epidemiology is the investigation of how infections spread. Mathematical methods, such as the class of susceptible-infectiousrecovered (SIR) models (Brauer, Van den Driessche, and Wu 2008;Pfeiffer et al. 2008;Lawson, Banerjee, Haining, and Ugarte 2016), are widely used in epidemics to capture the dynamic process of the spread of infectious diseases. These mechanistic models characterise disease transmission through a set of differential equations, and they demonstrate the average behaviour of the epidemic. However, these mathematical models typically focus more on the descriptions of physical phenomena rather than the parameter estimation for observed data, and thus it is challenging to use them to make statistical inferences. On the other hand, statistical models are a class of data-driven methods, and they usually focus on the inference about the relationships between variables. For example, when analysing the confirmed cases and deaths of COVID-19, other factors, such as demographics, socioeconomic status, mobility, and control policies, may also be responsible for temporal or spatial patterns. In this article, we create a state of the art interface between mathematical models and statistical models to understand the dynamic pattern of the spread of contagious diseases, such as COVID-19 and many others.
We borrow the mechanistic rules from the SIR model and form a data-driven model with three compartments: infectious, susceptible, and removed states. The capacity of the health care system, and control measures, such as government-mandated social distancing, also have a significant impact on the spread of the epidemic. Since it is challenging to incorporate those factors in mathematical models, we borrow the strength from statistical models and include various explanatory variables to study not only the spatiotemporal structure but also the effects of the explanatory variables. Notice that the spread of the disease varies a lot across different geographical regions; we incorporate discrete-time spatially varying coefficient models to different compartments to reconstruct the spatiotemporal dynamics of the disease transmission. In general, the spatiotemporal models are able to bring in more information to the epidemic study (Held, Hens, D O'Neill, and Wallinga 2019;Jia et al. 2020).
With an emerging disease such as COVID-19, it is hard to measure many features of the transmission process, which may take a long time to understand fully. Thus, it is desirable to make inferences from observed data as model-free as possible. For a parametric epidemic model, the typical inference problem involves estimating the parameters associated with the parametric models from the data at hand. Such specifications are ad hoc, and if misspecified, can lead to substantial estimation bias problems. This issue might be addressed in practice by considering alternative nonparametric models or sensitivity analyses if some of the underlying model parameters are assumed to be known. By adopting a nonparametric approach, we do not impose a particular parametric structure, which significantly enhances the flexibility of the parametric epidemic models. Nonparametric approaches to fitting epidemic models to the data have received relatively little attention in the literature, possibly due to the lack of data. With the rich COVID-19 epidemic data released every day, we can consider the nonparametric method to model the covariates and coefficient functions.
By allowing the response (such as infected and death counts) to depend on time and location, we consider a generalised additive varying coefficient model to estimate the unobserved process of the disease transmission. We propose a quasi-likelihood approach via the penalised spline approximation and an iteratively reweighted least-squares technique for our model estimation. Our proposed algorithm is sufficiently fast and efficient for the user to analyse large datasets within seconds. We derive the asymptotical normality of the constant coefficients in the linear components under some regularity conditions. For the varying components, we show the consistency of the estimators and obtain their convergence rates.
Finally, as an empirical illustration, we apply the proposed model and estimation method to a study of COVID-19 at the county level in the US. We illustrate how the proposed method can be used to analyse the spatiotemporal dynamics of the disease spread and guide evidence-based decision-making. Modeling COVID-19 at the county level and combining local characteristics are beneficial for the community in understanding the dynamics of the disease spread and support decision-making when urgently needed.
The rest of the paper is organised as follows. Section 2 outlines the nonparametric spatiotemporal modelling framework and describes how to incorporate additional covariates. Section 3 introduces the estimation method, presents the asymptotic properties of the estimators, the computational algorithm, and details of the implementation. Section 4 evaluates the finite sample performance of the proposed method using a simulation study. Section 5 describes the epidemic and endemic data, the results and findings of the case study. Section 6 concludes the paper with a discussion. Supplementary Material A contains some animation videos of the dynamic estimation results in the COVID-19 study, and Supplementary Material B provides the technical assumptions and detailed proofs of the asymptotic results.

Space-time epidemic modelling
To study the spatiotemporal pattern of COVID-19, we develop a novel spatiotemporal epidemic model (STEM) to estimate and predict the infection and death cases at the area level based on the idea of the compartment models. For simplicity, we introduce the STEM based on the parsimonious SIR models, but it can be extended to the SEIR models with an extra 'exposed' compartment for infected but not infectious individuals.
For area i and day t, let Y it be the number of new cases, and let I it , D it , R it , and S it be the number of accumulated active infectious cases, accumulated death cases, accumulated recovered cases, and susceptible population, respectively. Let N i be the total population for the ith area, and denote Z it = log(S it /N i ). Let U i = (U i1 , U i2 ) be the GPS coordinates of the geographic centre of area i, which ranges over a bounded domain ⊆ R 2 of the region under study. Let X i = (X i1 , . . . , X iq ) be a q-dimensional vector of explanatory variables collected from the U.S. Census Bureau. For example, the socioeconomic factors, health resources, and demographic conditions. Let A ijt be the jth dummy variable of actions or continuous measures taken for area i at time t, and let A it = (A i1t , . . . , A ipt ) , which varies with the time.
In this paper, we consider the exponential families of distributions. The conditional density of Y it given ( z, a, x, u) can be represented as for some known functions B and C, dispersion parameter σ 2 and the canonical parameter ζ . As mentioned earlier, the idea of the proposed model is based on conventional compartmental models and time series regression models. The conventional compartmental models aim to predict the number of susceptible individuals, infected cases, and recoveries. Specifically, in a SIR model, the progression is that a susceptible individual becomes infected through contact with one or more infected individuals. Then after a period, the infected individual advances to the recovery state, i.e. a noncontagious state. Therefore, at any given time, the SIR model captures the dynamical mechanism of the disease spread, assuming that the rate at which susceptible individuals become infected depends on the number of susceptible and infected individuals. In general, the dynamics of the SIR system inform us of the deterministic skeleton on which the behaviour of the corresponding stochastic system is built. In contrast, time series regression predicts the future of the response variable based on its history. It produces a combination of the variables with weights that indicate the variable importance. It also provides a neat result of how the predictors affect the response. Furthermore, we assume that the determinants of the daily new cases of a particular area can be explained not only by the features of that area but also by the spread of the virus in the surrounding areas. Therefore, by combining the advantages of compartmental and statistical models, we develop a novel discrete-time spatial epidemic model comprising the susceptible state, infected state, removed state, and area-level characteristics. Below we use superscripts I, D, and R to denote infected, death, and recovered states. We assume that the conditional mean value of daily new positive cases (μ I it ), fatal cases (μ D it ) and recovery (μ R it ) in area i and day t can be modelled via a link function g as follows: where α I jt 's, α D jt 's, β D 1t and ν R t are unknown constant coefficients, β I 0t (·), β I 1t (·), and β D 0t (·) are unknown bivariate coefficient functions, γ I kt (·), γ D kt (·), k = 1, . . . , q, are univariate functions to be estimated, δ and δ are the time delay between illness and death or recovery, and the parameter m in A ij,t−m 's denotes a small delay time allowing for the control measure to be effective (here we take m = 7, m = 7). For the new infection, following the assumptions in SIR, we assume that the number of the newly infected cases at time t depends on the situation of the pandemic at time t−1. For the death, according to CDC (2021a), the median number of days from symptom onset to death is around an incubation period. For the recovery, according to CDC (2021b), people with mild to moderate COVID-19 remain infectious no longer than ten days after their symptoms began, and those with more severe illness or those who are severely immunocompromised remain infectious no longer than 20 days after their symptoms began. As a result, we choose δ to be 14 and δ = 10 in our real data analysis. For model identifiability, similar to the conventional generalised additive model literature (Hastie and Tibshirani 1990;Wood 2017), we assume E{γ I kt (X k )} = 0, E{γ D kt (X k )} = 0, k = 1, . . . , q. The STEM encompasses many existing models as special cases. For example, the traditional generalised linear regression models, generalised additive models (Liu, Yang, and Härdle 2013), generalised partially linear additive models (Wang, Liu, Liang, and Carroll 2011) and generalised additive coefficient model (Xue and Liang 2010).
Note that, for the log link, exp{β I 0t (u)} illustrates the transmission rate at location u, exp{β D 0t (u)} represents the fatality rate at location u, ν R t is the recovery rate, β I 1t (·), α I 0t and β D 1t are the mixing parameters of the contact process. The rationale for including β I 1t (·) and β D 1t (β I 1t (·) > 0, β D 1t > 0) is to allow for deviations from mass action and to account for the discrete-time approximation to the continuous time model (Finkenstädt and Grenfell 2000;Wakefield, Dong, and Minin 2019). In many cases, the standard bilinear form may not necessarily hold. The above proposed epidemic model incorporates the nonlinear incidence rates, which represents a much wider range of dynamical behaviour than those with bilinear incidence rates (Liu, Hethcote, and Levin 1987). These dynamical behaviours are determined mainly by β I 0t (·), β I 1t (·), β D 0t (·), and β D 1t . For example, when β I 1t (·) and α I 0t are both 1, it corresponds to the standard assumption of homogeneous mixing in Jong, Diekmann, and Heesterbeek (1995). In our study, since we model the number of new cases at time t for area i, Poisson or Negative Binomial (NB) might be an appropriate option for random component (Kim and Wang in press;Yu, Wang, Wang, Liu, and Yang 2020). For example, for the infection model, we can assume that where μ I it can be modelled via the same log link as follows: We can consider similar models for the death count. At the beginning of the outbreak, infected and death cases could be rare, so 'Poisson' might be a reasonable choice of the random component to describe the distribution of rare events in a large population. As the disease progresses, the variation of infected/death count increases across counties and states. So, at the acceleration phase of the disease, the negative binomial random component might be an appropriate option for the presence of over-dispersion.
The above spatiotemporal epidemic model (STEM) is developed based on the foundation of epidemic modelling. It can provide a rich characterisation of different types of errors for modelling uncertainty. Moreover, it accounts for both spatiotemporal nonstationarity and area-level local features simultaneously. It also offers flexibility in assessing the dynamics of the spread at different times and locations than various parametric models in the literature.

Estimation of the STEM
In this section, we describe the estimation method of the parameters and nonparametric components in the proposed STEM model. To capture the temporal dynamics, we consider the moving window approach. From Equations (1) and (2), we can see that the infectious model and the death model share many common features, so we adopt a similar approach when estimating them. Thus, in the following, we only use the infectious model (1) to illustrate the estimation method of the entire STEM system. Moreover, for notation simplicity, we drop the superscripts, such as I and D.

Penalized quasi-likelihood method
Suppose that Var(Y | I = w, Z = z, A = a, X = x, U = u) = σ 2 V(μ(w, z, a, x, u)) for some positive function V(·) and dispersion parameter σ 2 , and L(μ, y) is a quasi-likelihood function satisfying ∇ μ L(μ, y) = y−μ σ 2 V(μ) . We first describe the estimation of model (1). For the current time t, and rounghness parameters λ 0 and λ 1 , we consider the penalised quasi-likelihood problem defined as follows: where t 0 + 1 is the window width for the model fitting, and it can can be selected by minimising the prediction errors or maximising the correlation between the predicted and observed values. The energy functional is defined as follows: where ∇ q u j β(u) is the qth order derivative in the direction u j , j = 1, 2, at any location u = (u 1 , u 2 ) .
Note that, except for parameters {α jt } p j=0 , other functions are related to curse of dimensionality due to the nature of functions. To handle this difficulty, we employ the basis expansion approach to approximate the univariate and bivariate functions discussed below. The univariate additive components {γ kt (·)} q k=1 and the spatially varying coefficient com- (4) are approximated using univariate polynomial spline and bivariate penalised splines over triangulation (BPST), respectively. The BPST method is well known to be computationally efficient to deal with data distributed on complex domains with irregular shape or with holes inside; see the details in Lai and Wang (2013) and Sangalli, Ramsay, and Ramsay (2013).
Assume that X k takes value on an interval [a k , b k ], k = 1, . . . , q. To satisfy the model identification constraint E{γ kt (X k )} = 0, in this paper, we consider the space of the centred spline functions U 0 be the space of the polynomial splines of order + 1; see Xue and Liang (2010), Liu et al. (2013) and Wang, Xue, and Yang (2020). Let J be the index set of the basis functions, and then denote by { kJ (x k ), J ∈ J } the B-spline basis functions of U 0 k for the kth covariate. For J ∈ J and k = 1, . . . , p, kJ (x k ) satisfies E kJ (X k ) = 0 and E 2 kJ (X k ) = 1; see Yu et al. (2020) for the details of basis construction. For all x For the bivariate coefficient functions β 0t (·) and β 1t (·) in the STEM model (4), we consider the bivariate spline over triangulation (Lai and Wang 2013). The spatial domain with either an arbitrary shape or holes inside can be partitioned into a set of triangles. Denote by a triangulation of the domain (Lai and Schumaker 2007); see, for example, Figure 1. In practice, the triangulation can be obtained through varieties of software; see for example, the 'Delaunay' algorithm (delaunay.m in MATLAB or DelaunayTriangulation in MATHEMATICA), the R package 'Triangulation' , and the 'DistMesh' Matlab code.
Let C r ( ) be the space of rth continuously differentiable functions over the domain . For 0 ≤ r < d and , we construct the spline space of degree d and smoothness r over in the following: These bivariate spline basis can be generated via the R package 'BPST' ). More discussions of the bivariate spline over triangulations can be found in Mu, Wang, and Wang (2018) and Yu et al. (2020). Then, we can approximate the bivariate functions β t (·) ∈ S r d ( ) in the STEM model (4) Considering the basis expansion, the energy functional E(β t ) in (6) can be approximated by E(B(·) θ ) = θ Pθ , for = 0, 1, where P is a block diagonal penalty matrix. By introducing the constraint matrix H which satisfies Hθ = 0, = 0, 1, we can reflect global smoothness in S r d ( ) in (7). For the current time t, the maximisation problem (5) Directly solving the optimisation problem in (8) is not straightforward due to the smoothness constraints involved. Instead, suppose that the rank r H matrix H is decomposed into where Q 1 is the first r H columns of an orthogonal matrix Q, and R 2 is a matrix of zeros, which is a submatrix of an upper triangle matrix R. Then, reparametrization of θ t = Q 2 θ * t for some θ * t , = 0, 1, enforces Hθ t = 0. Thus, the constraint problem in (8) can be changed to an unconstrained optimisation problem as follows:

Penalized iteratively reweighted least squares algorithm
In this subsection, we are going to describe the estimating algorithm in detail.
with the derivative of link function as element, and the weight matrix Iteratively reweighted least squares algorithm (IRLS) is commonly used to find the maximum likelihood estimates of a generalised linear model. Therefore, in this work, we design a penalised iteratively reweighted least squares (PIRLS) algorithm as described below. Suppose at the jth iteration, we have μ (j) t ) and V (j) . Then at (j + 1)th iteration, we consider the following objective function: Algorithm 1 The Penalized Iteratively Reweighted Least Squares (PIRLS) Algorithm.
Take the first order Taylor expansion of μ(ϑ) around (ϑ (j) ), then where is for s = 1, . . . , t. The detailed procedure for the PIRLS is illustrated in Algorithm 1. In the numerical studies, we consider the following initial values Compared with the traditional nonparametric techniques, such as kernel smoothing, the proposed algorithm is much more computationally efficient. Therefore, we can easily apply our method to analyse massive spatiotemporal data sets.

Asymptotic results
where μ 0 it is the conditional mean based on the true parameter α 0 jt 's and functions β 0 t 's and γ 0 kt 's. Let ε it = Y it − g −1 (η 0 it ) be the error term. For the quasi-likelihood function, The following theorem provides the L 2 convergence rate of the spline estimators, β t (u), for = 0, 1. The detailed proof is illustrated in the Supplemental Material B.
(11) The next theorem shows that the maximum quasi-likelihood estimator of α 0 is root-n consistent and asymptotically normal.

Modeling the number of fatal and recovered cases
To fit the proposed STEM and make predictions for cumulative positive cases, one obstacle is the lack of direct observations for the number of active cases, I it . Instead, the most commonly reported number is the count of total confirmed cases, C it . Some departments of public health also release information about fatal cases D it and recovered cases R it . Based on the fact that I it = C it − R it − D it , we attempt to model D it and R it to facilitate the estimation and prediction of newly confirmed cases Y it based on the proposed STEM model. For the death model (2), we can use the same maximum quasi-likelihood approach to fit the model.
Ideally, if sufficient data for recovered cases can be collected from each area, a similar model can be fitted to explain the growth of the recovered cases. However, most people who become sick with COVID-19 only experience mild illness and can recover at home without medical care within a week. Meanwhile, according to the U.S. Centers for Disease Control and Prevention, severe cases are often hospitalised to receive supportive care, which may take several weeks. Although there have been regional, national, and global data on confirmed cases and deaths, not much has been reported on recovery. Currently, there is a lack of a uniform method for reporting recoveries across the U.S. (Howard and Yu 2020).
Only a few states regularly update the number of recovered patients, but the counts can seldom be mapped to counties. Due to a lack of data, we are no longer able to use all the explanatory variables discussed above to model daily new recovered cases. Instead, we mimic the relationship between the number of recovered and active cases from some compartmental models in epidemiology (Siettos and Russo 2013; Anastassopoulou, Russo, Tsakris, and Siettos 2020). At current time point t, we assume that in which δ represents the time delay from infection to recovery (δ = 10 in our analysis), and ε is is the random noise. The recovery rate ν R t enables us to make reasonable predictions for future recovered patients counts and provide researchers with the foresight of when the epidemic will end. The rate ν R t can be either estimated from available state-level data, or obtained from prior medical studies to alleviate the under-reporting issue in actual data.

Zero-inflated models at the early stage of the outbreak
It is well known that in the early stage of an epidemic, the quality of any model output can be affected by the restricted quality of data that pertain to under-detection or inconsistent detection of cases, reporting delays, and poor documentation, regarding infections, deaths, tests, and other factors. There are many counties with zero daily counts at the early stage of disease spread. Therefore, we consider zero-inflated models based on a zero-inflated probability distribution, i.e. a distribution that allows for frequent zero-valued observations. Following the previous works (Arab, Holan, Wikle, and Wildhaber 2012; Wood, Pya, and Säfken 2016), we assume the observed counts Y it contributes to a zero-inflated Poisson (ZIP) distribution, ZIP(μ I it , p I it ), specifically, we assume that (4), and a 1 , a 2 are unknown parameters estimated along with the roughness parameters. See Wood et al. (2016) for the estimation of a 1 and a 2 .
Let D it = D it − D i,t−1 be the number of new fatal cases on day t. Similarly, we can consider zero-inflated models for fatal cases, in which we assume the observed count D it contributes to a ZIP distribution ZIP(μ D it , p D it ): (2), and v 1 , v 2 are unknown parameters that can be similarly estimated as in the above.

A simulation study
In this section, we conduct a simulation study to evaluate the finite sample performance of the proposed method. In the simulation, we use a subset of covariates of the county-level characteristics analysed in Section 5. The response variable Y it and D it are generated from a ZIP distribution with the logarithm of Poisson parameters generated as following: where δ = 14, p = 2, q = 5, m = 7, and A ijt , X ik , j = 1, 2, k = 1, . . . , 5, come from the covariates in the COVID-19 dataset described in Section 5. The true univariate functions γ 1t (x), γ 2t (x), . . . , γ 5t (x), together with their estimate and confidence band in one typical iteration, are displayed in Figure 3(a-e). Figure 4(a-c) depict the bivariate coefficient functions β I 0t (·), β I 1t (·) and β D 0t (·), which are generated to mimic the spatial pattern of infection/mortality rate in the pandemic. We also include the corresponding estimated functions from one experiment in Figure 4(d-f) to show that the spatial pattern can be very well captured using the proposed method. For recovery data, the daily recovered cases are simulated by R it = ν R I i,s−1 , where ν R = 0.07. We simulate data by assuming that a pandemic emerged on March 15 with 1 case showed up in each of the 420 selected counties. These counties are selected if COVID-19 cases had been found by March 15 in real data. Then, daily confirmed, fatal and recovered cases are generated based on model (12) and (13) from a ZIP distribution with the complimentary log of the zero probability being linearly dependent on the log of the Poisson parameter μ I it and μ D it . To evaluate the performance numerically, we conduct 100 Monte Carlo experiments with 9 or 14 days as the training window sizes. For the univariate spline smoothing, we use cubic splines with two interior knots; and for the bivariate spline smoothing, we consider degree d = 2, smoothness r = 1, and two different triangulations in Figure 1: 1 (119 triangles with 87 vertices) and 2 (522 triangles with 306 vertices). The root mean squared errors (RMSEs) for some of the parametric and nonparametric components in models (12) and (13) are illustrated in Figure 2, and the boxplots of the RMSEs for all the parametric and nonparametric components are shown in Supplemental Material A. Moreover, the average RMSEs over 100 experiments are reported in Table 1. According to the numeric results, the proposed model is not sensitive to the choice of triangulation. Based on Figure 2 and Table 1, one can see that increasing the window size of training data can help improve the accuracy in estimating most of the coefficient functions while increasing the computational burden at the same time. Thus, in practice, users can balance the choice of the window size with the power of the computation resource.

COVID-19 case study
The goals of the following study are two-fold. First, we develop a new dynamic epidemic modelling framework for public health surveillance data to study the spatiotemporal pattern in the spread of COVID-19. We aim to investigate whether the proposed model  could guide the modelling of the dynamics of the spread at the county level by moving beyond the typical theoretical conceptualisation of context where a county's infection is only associated with its own features. Second, to understand the factors that contribute to the spread of COVID-19, we model the daily infected cases at the county level, considering the demographic, environmental, behavioural, and socioeconomic factors in the U.S.

Data description
The data for the COVID-19 outbreak in the U.S. is collected and cleaned from a combination of public data repositories, including official state Health Department Websites, the New York Times (NYT 2020), the COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University (JHU CSSE 2020), the COVID Tracking Project (Atlantic 2020) and the USAFacts (USAFacts 2020).  provides a thorough comparison of the COVID-19 data collected from the above four sources, details on anomaly detection and repair, as well as how to integrate the data with other local characteristics. The USA Counties Database compiled by the U.S. Census Bureau and the Homeland Infrastructure Foundation-level Data prepared by the U.S. Department of Homeland Security (see Table 2) were used as the source of local information. The county-level features can be categorised into the following groups.
Control measures. We mainly consider two control measures in our work, emergency declarations and 'stay-at-home' or 'shelter-in-place' orders, among a variety of social distancing policies (e.g. school closures, closures of non-essential services focussed on bars and restaurants, bans on large gatherings and the deployment of severe travel restrictions). Dates of interventions were compiled by checking national and state government websites, executive orders, and newly-initiated COVID-19 laws. Starting in Washington on February 29, 2020, the declarations of state emergency soon swept the nation. By March 16, 2020, every state had made an emergency declaration, with most taking the form of a State of Emergency or a Public Health Emergency. The executive orders of 'stay-athome' or 'shelter-in-place' started in California in the middle of March, and within one to two weeks, the majority of the states had taken similar actions. Due to the immense pressures of the crippled economy and anxious public, states in the U.S. started to reopen successively in late April. A state is treated as 'reopening' once its stay-at-home order lifts, or once reopening is permitted in at least one primary sector (restaurants, retail stores, personal care businesses), or once reopening is permitted in a combination of smaller sectors.
Socioeconomic status contains (a) social affluence (b) concentrated disadvantage and (c) Gini coefficient. Social affluence is a measure of more economically privileged areas, including factors: (i) percent of households with income over $75,000; (ii) percent of adults obtaining bachelor's degree or higher; (iii) percent of employed persons in management, professional and related occupations; (iv) median value of owner-occupied housing units. The concentrated disadvantage is a measure for conditions of economic disadvantage, including factors: (i) percent of households with public assistance income; (ii) percent of households with a female householder and no husband present; (iii) civilian labour force unemployment rate. Gini coefficient, known as the Gini index, is a measure of economic inequality and wealth distribution among a population.
Healthcare infrastructure contains (d) local government expenditures for health per capita, (e) percent of persons under 65 years without health insurance, and (f) logarithm of total bed counts per 1000 population.
Demographic characteristics contain (g) percent of African American population, (h) percent of Hispanic or Latino population, (i) logarithm of population density per square mile of land area, (j) aged people (age ≥ 65 years) rate per capita, (k) ratio of male over female, and (l) urban rate.
Mobility data are collected and cleaned from the U.S. Department of Transportation, Bureau of Transportation Statistics, and Descartes Labs. It describes the daily number of trips within each county produced from an anonymized national panel of mobile device data from multiple sources. Trips are defined as movements that include a stay of longer than 10 min at an anonymized location away from home.

Analysis and findings in COVID-19
For the model estimation, we consider the following model for the infected count: where i = 1, . . . , 3104. For the death count, we consider the following semiparametric model: We consider the data collected from March 16 to September 3, 2020; see the data description in Section 5.1. Note that in Models (14)- (15), the covariate Control it is a dummy variable for the executive order 'shelter-in-place' or 'stay-at-home', namely Control it = 1 suggesting 'shelter-in-place' taken place for county i at time t, while Control it = 0 suggesting no restriction or restriction lifted. See Table 2 for details of other county-level predictors. We use 28 days, two incubation periods, as an estimation window to examine how the covariates affect the newly infected and fatal cases, and we choose δ = 14. The roughness parameters are selected by generalised cross-validation (GCV). The performance of the univariate and bivariate splines depends on the choice of the knots and triangulations, respectively. We use cubic splines with two interior knots for the univariate spline smoothing. We generate the triangulations according to the 'max-min' criterion, which maximises the minimum angle of all the angles of the triangles in the triangulation. We consider two triangulation choices, 1 and 2 , as shown in Figure 1. By the 'max-min' criterion, 2 is better than 1 , but it also significantly increases the number of parameters to estimate. As a trade-off, for the estimation of β I 0t (·) and β I 1t (·), we adopt the finer triangulation 2 , and use the rough triangulation 1 to estimate β D 0t (·) due to the sparsity problem in the death count and many zeros observed.
First of all, we describe our findings from modelling the COVID-19 related infection counts in 3104 counties from the 48 mainland US states and the District of Columbia. To examine the effect of the control measures ('shelter-in-place' or 'stay-at-home' orders) and mobility level after 7 days, we test the hypothesis: H 0 : α I jt = 0, j = 1, 2 in model (14). Figure 5(a) shows that the control measure is significant for the infected count most of the time. Figure 5(b) shows that the p-value of the mobility is always very close to zero, and thus the mobility is significant for the entire study period. Next, we examine the effect of the other predictors in Model 14. To test hypotheses H 0 : γ I kt (·) = 0, k = 1, . . . , 12, we construct the 95% simultaneous confidence band (SCB) for γ I kt (·)'s. In function estimation problems, SCBs are an important tool to quantify and visualise the variability of the functional components; see Yang (2009), Cao, Yang, andTodem (2012), and Zheng, Liu, Yang, and Härdle (2016) for some related theory and applications. Figure 6 illustrates the estimated curves for different explanatory variables together with the corresponding SCBs based on the data period 03/22/2020-04/18/2020. Based on Figure 6, we can observe that at the beginning of the pandemic, the infected cases increase with the population density (PD), which is consistent with our intuition. We also find that the infections increase with African American Ratio and Hispanic Latino Ratio at the beginning of the outbreak.
We also study the effect of the covariates over time. Figure 7 shows the effect of the aged people rate at different time points over the outbreak. In the early stage of the COVID-19 pandemic, from March to April, COVID-19 struck the elderly more severely than the younger people. By mid-April and May, we saw that those communities with fewer aged people suffered more from COVID-19. Counties with a very high rate of aged people still experience high infection rates. However, when people understood the virus more and took action to protect the senior people, from mid-June to September, those counties with a higher rate of aged people became those least infected. However, from mid-June to September, older people tended to stay home and were more cautious about the virus. Also, as many states reopened bars, restaurants, and offices, people in their 20s and 30s were more likely to go out socialising, and the coronavirus spread more widely to young people (Bosman and Mervosh 2020).
Movies 1-12 in the Supplementary Material A show the estimates and SCBs of the nonparametric functions γ I kt (·), k = 1, . . . , 12, over the entire study period in the STEM model (14).
After the discussion of our finding in the infection model, let us focus on the death model. For Model (15), we focus on the following hypothesis tests: H 0 : α D jt = 0, j = 1, 2, H 0 : β D 1t = 0 and H 0 : γ D kt = 0, k = 1, . . . , 12. Figure 8(b) shows that 'Mobility' is significant over the entire study period. "OLD" is significant in the beginning of the pandemic. For other county-level covariates, ' Affluence', 'Disadvantage', 'EHPC', 'AA' and 'SEX' are significant with p-values smaller than 0.05 most of time, while the rest of the predictors are significant on some days, but insignificant on other days. In addition, movies 13 and 14 in the Supplementary Material A illustrate the estimated coefficient functions of β I 0t (·) and β I 1t (·) in model (14). From Movie 13, we can see that the transmission rate, β I 0t (·), varies at different locations and in different phases of the outbreak, especially the high rate in late March and April. Movie 14 shows that β I 1t (·) also varies from one location to another location, which indicates that the homogeneous mixing assumption of the simple SIR models does not hold. The transmission rate is high in most states at the end of April; however, it has become much lower since June. Movie 15 on the Supplementary Material A shows the pattern of β D 0t in model (15). From this animation, we observe a severe fatality condition in the southern states in July and a pattern of a general decrease in the entire U.S. since August 2020.

Conclusion and discussion
This work has aimed to bridge the gap between mathematical models and statistical analysis in infectious disease studies. We created a state-of-art interface between mathematical and statistical models to understand the dynamic pattern of the spread of contagious diseases. Our proposed model enhances the dynamics of the SIR mechanism through spatiotemporal analysis. For analysing the confirmed and death cases of COVID-19, other factors may also be responsible for temporal or spatial patterns. We investigated the spatial associations between the infected count, death count, and factors or characteristics of the counties across the U.S. by modelling the daily infected/fatal cases at the county level considering the county-level factors. Modeling COVID-19 at the county-level and combining local characteristics are very beneficial for the community to understand the dynamics of the disease spread and support decision-making when urgently needed. To examine spatial nonstationarity in the transmission rate of the disease, we proposed a nonparametric spatially varying coefficient model, which allows the transmission to vary from one area to another area. The proposed method can be used as an essential tool for understanding the dynamics of the disease spread, as well as for assessing how this outbreak may unfold through time and space.
Based on our results, disease mapping can easily be implemented to illustrate highrisk areas and thus help policymaking and resource allocation. Our method can also be extended to other situations, including epidemic models in which there are several types of individuals with potentially different area characteristics or more complex models that include features such as latent periods or a more realistic population structure.
Our paper did not take the under-reported issue (for example, the asymptomatic coronavirus infectious cases) into account. Although our model may have partially corrected the problem with the spatiotemporal information, some better ways are proposed in several recent pieces of research, such as Pullano et al. (2021), Shaman (2021), Giordano et al. (2021), and Moore, Hill, Dyson, Tildesley, and Keeling (2021). Furthermore, some of the newly developed methods also investigate the effect of vaccinations and different COVID-19 variants. For example, Giordano et al. (2021) and Moore et al. (2021) proposed an extended SEIR-type framework. In these models, individuals start from the susceptibleunvaccinated or susceptible-vaccinated states. Then those in the asymptomatic state will recover, and those in the symptomatic state may become either recover or die. Moreover, the infected individuals could also be divided into several groups based on COVID-19 variants. As discussed in Section 2, although we introduce our discrete-time spatial epidemic model based on the SIR model, we can extend it to such a kind of SEIR-type framework as well.