Bayesian models for spatial count data with informative finite populations with application to the American community survey

The American Community Survey (ACS) is an ongoing program conducted by the US Census Bureau that publishes estimates of important demographic statistics over pre-specified administrative areas. ACS provides spatially referenced count-valued outcomes that are paired with finite populations. For example, the number of people below the poverty line and the total population for each county are estimated by ACS. One common assumption is that the spatially referenced count-valued outcome given the finite population is binomial distributed. This conditionally specified (CS) model does not define the joint relationship between the count-valued outcome and the finite population. Thus, we consider a joint model for the count-valued outcome and the finite population. When cross-dependence in our joint model can be leveraged to ‘improve spatial prediction’ we say that the finite population is ‘informative.’ We model the count given the finite population as binomial and the finite population as negative binomial and use multivariate logit-beta prior distributions. This leads to closed-form expressions of the full-conditional distributions for an efficient Gibbs sampler. We illustrate our model through simulations and our motivating application of ACS poverty estimates. These empirical analyses show the benefits of using our proposed model over the more traditional CS binomial model.


Introduction
The American Community Survey (ACS) is an ongoing program conducted by the US Census Bureau that collects demographic statistics from pre-specified areas.The ACS often contains tabulations across categories by region (e.g. the number of males and females in a county).Data such as these can be modeled with a binomial distribution provided that the finite population is observed at each region.It is fairly common to use a conditionally specified (CS) model and ignore the distribution of the finite population to perform inference on count-valued outcome.It often happens that the finite population is not a fixed deterministic quantity, but rather a realization of a random variable that may have a complicated joint relationship with the paired count-valued outcome.In this article, we model outcomes given the finite population as binomial with probability π and the finite population as negative binomial with probability p. Several studies indicate that the population of a region is directly related to the number of individuals below the poverty line (Refs.[1,26,31,33,34], among others).It is unclear whether or not this direct relationship is simply due to the fact that a larger population naturally implies a larger mean number of count-valued observation or if more intricate cross-dependence exists between the count-valued observation and the finite population.By 'intricate' dependencies (and joint relationships), we mean that π and p may be correlated.If these correlations are present, then one can leverage this dependence to improve spatial prediction.When cross-dependence between π and p can be leveraged to 'improve spatial prediction' (i.e.obtain smaller mean squared prediction error), we say that the finite population is 'informative.' As an example, consider Figure 1, where we plot five-year period estimates of the number of families below the poverty threshold along with the number of households (population).Here, we see that large and small values of poverty and population appear together.An immediate explanation is that the increase of poverty is due solely to a larger trial number (i.e. the ACS five-year period estimate of the population).However, we consider two reasons for this pattern: (1) the increase of poverty is due solely to a larger population and (2) the increase of poverty is because both the population have increased and π and p are positively correlated.This perspective allows for two types of cross-variable dependence, where item 1 refers to dependence in the data models and item 2 refers to dependence between π and p.Thus, in what follows, we develop a Bayesian statistical model that can be used to leverage both sources of dependence to improve the accuracy of predictions.
There are several methods available to model correlated count data.For example, a bivariate Poisson model has been used to define a joint model and account for the dependence between outcomes and finite populations (see Refs. [6,11,35], among others).However, the Poisson distribution is not robust to highly heterogeneous populations (see Refs. [9,11] for further comments), especially when there is a significant overdispersion in the outcome.The inflated bivariate Poisson model [23] can accommodate moderate overdispersion by mixing an inflation component to define the joint probability density function.However, maximum likelihood estimation through the expectation maximization algorithm can be computationally burdensome.To alleviate the overdispersion issue, the negative binomial distribution is more appropriate because it has one more parameter to mitigate issues with heterogeneity among populations (e.g.see Refs.[13,21,25], among others).Furthermore, quite often inference on π is of primary interest, and bivariate Poisson models are not explicitly parameterized to model π .However, bivariate Poisson models are the predominant choice used to model two dependent count-valued spatial variables.Thus, our proposed model offers an exciting new tool for this literature.
We use a model similar to the latent Gaussian process (LGP) model, which has been widely used to model non-Gaussian data and can be used to do statistical inference and account for uncertainty (see Refs. [7,15,17], among others).Specifically, the count-valued outcomes given the finite populations are modeled with a binomial distribution and finite populations are modeled with a negative binomial distribution at each region.One could assume that latent random effects are Gaussian, however, the LGP is not always realistic [9].Furthermore, many non-Gaussian data lead to full-conditional distributions that do not have a closed-form analytical solution.Therefore, the LGP can lead to computational inefficiencies when it requires subjective tuning of Markov chain Monte Carlo (MCMC) algorithms and can subsequently have convergence challenges especially in high-dimensional correlated settings [14].One option available to avoid these issues is Polya-Gamma augmentation, which has been used for multinomial data.This model has a conjugate full-conditional distribution where the logit of the proportions can be expressed as a multivariate normal distribution after marginalizing across variances that follow a Polya-Gamma distribution [29].This method can also suffer from some computation difficulties because an iterative accept/reject sampling algorithm is needed to generate Polya-Gamma random variables within the MCMC algorithm.
Another alternative available in the literature is to drop the Gaussian assumption and replace it with a multivariate logit-beta (MLB) distribution [4,5,12].The MLB distribution is derived through the linear transformation of univariate independent logit-beta random variables.This MLB distribution is a reasonable alternative because it can be specified arbitrarily close to the default choice of a Gaussian distribution.The full-conditional distribution associated with this distribution are conditional MLB distributions and there are straightforward MCMC sampling schemes that do not require subjective Metroplis-Hastings tuning steps.These tuning steps can be avoided by other algorithms such as Hamiltonian MCMC [28] and integrated nested Laplace approximations [32].However, these models are limited to small parameter space settings, which are not the case for our motivating ACS dataset.The MLB has been used to define a prior on the adjacency matrix in a conditional autoregressive model [12] to model spatial-temporal multivariate data [4] and to model multivariate binomial and negative binomial data [5].However, a latent MLB model that jointly models a binomial spatial dataset and a negative binomial spatial dataset does not exist.
We propose a Bayesian hierarchical joint model for spatially referenced binomial counts with informative finite populations.We incorporate dependence directly into the data hierarchically by assuming the finite population in the binomial outcome is distributed as negative binomial.We also incorporate cross-dependence between the (logit) of the probability of outcome and the (logit) of the probability of a trial through a shared spatial basis function expansion (e.g.see Ref. [8], for a standard reference).We also model within data type (i.e.binomial and negative binomial data) spatial dependence with another spatial basis function expansion.All random and fixed effects are assumed to follow a MLB distribution.
We now outline the remainder of the paper.In Section 2, we introduce our proposed model.A simulation study is given in Section 3, and an analysis of ACS county-level poverty estimates in the Southeast USA is given in Section 4. In each of our empirical analyses, we compare to a CS model, which is the primary competitor for modeling spatially referenced binomial data.We end this paper with a discussion in Section 5.All technical details and reviews are provided in the appendices.

Methodology
We assume that spatial binomial data are observed at n regions.Let y π ≡ (y π ,i : i = 1, . . ., n) and y p ≡ (y p,i : i = 1, . . ., n) be n-dimensional vectors consisting of spatial binomial and negative binomial random counts, respectively.These vectors are paired, that is, the ith binomial count y π,i is paired with the ith total y p,i .For example, y π ,i might consist of the number of families below the poverty threshold out of a total of y p,i families in county i.The probabilities associated with the binomial and negative binomial distributions are denoted with π i ∈ (0, 1) and p i ∈ (0, 1), respectively.Recall that one of the developments in this article is that we allow π i and p i to be correlated.The primary goal is to accurately predict the proportions associated with the binomial distributed observations (i.e.π i ).
We use the MLB distribution for our prior distribution specification.A review of the MLB distribution can be found in Appendix A in the Supplemental Materials.Additional details on the Gibbs sampler are provided in Appendices B and C in the Supplemental Materials.

Binomial and negative binomial data models
Count-valued observations y π,i , and y p,i are assumed to be binomial and negative binomial distribution as follows: where 'BN ' and 'N B' are shorthands for a binomial and negative binomial distribution and d i is the dispersion parameter.Here, logit( Equation (1) can be rewritten as [4] where f denotes a probability density function/mass function.

Process models
The unobserved processes ν π and ν p are jointly modeled as where X j is a known n × m matrix of covariates, and β j ∈ R m is the associated unknown m-dimensional parameter vector with j = π, p.The m-dimensional within data-type fixed effect β j , the r-dimensional within data type random effects η j , the n-dimensional within data-type random effects ξ j and the r-dimensional between data-type random effect of η are modeled using MLB distribution.We provide a review of the MLB distribution in Appendix A in the Supplemental Materials.Recall from Section 1 that we use the MLB as a prior because it leads to a Gibbs sampler with full-conditional distributions that one can directly sample from (i.e.no subjective Metroplis-Hastings tuning steps are required).As such, MCMC is considerably easier to implement with the MLB prior than with a more traditional Gaussian prior.Additionally, the MLB prior has the more traditional Gaussian prior as a special limiting case [4] and, consequently, can more flexibly model the tails of the distribution.
Here, the term k η j represents 'small-scale' variability and ξ j represents 'fine-scale' variability from a white Gaussian noise process with k = π 1, π 2, p1, p2 and j = π , p.The term k η includes cross-data-type dependence between the logit of probabilities, since the covariance between ν π and ν p equals π2 cov(η) p2 , which is not necessarily equal to a zero matrix.Recall that allowing for such cross-covariances is a key motivating feature of our model, and when non-zero π2 cov(η) p2 improves prediction, we say that the finite population is informative.
For areal data, the Moran's I basis function is a useful specification of k (see Refs. [3,[18][19][20]22,30,36] for more discussion).The Moran's I operator is defined as where I n is an n × n identity matrix, and A is the n × n adjacency matrix associated with the edges formed by all the areal units in the study.Let M M M be the spectral decomposition of the n × n matrix MI(X j , A).The n × r matrix k contains the first r columns of M .Moran's I basis functions represent a basis for a spatially weighted orthogonal column space of X j and consequently this specification of basis functions makes the fixed and random effects not confounded.Additionally, Moran's I basis functions can facilitate low-rank modeling (i.e.choosing r n) for high-dimensional spatial data [22,30].If the regions are equally spaced and 'small,' then one could treat areal data as pointreferenced, and use for example, a radial basis function [38].In Section 3, we use a thinplate spline [37] as follows: where s i is the ith data points and c b is a pre-defined knot points.Then, ) and the n × r matrix k = (ψ 1 , . . ., ψ r ) are the associated basis vector and basis matrix, respectively.The ability to specify any class of basis functions makes our proposed model adaptable to several given correlation specifications.Specifically, suppose you are interested in a covariance matrix of a given form = VV , where V is a n × n Cholesky decomposition of .Then, for example, k = VV −1 η,j produces cov( k η j ) = VV −1 η,j cov(η j )V −1 η,j V = , where V η,j is the Cholesky decomposition of cov(η j ).Thus, one can develop versions of our model that imply a desired given correlation structure.

A simulation study
In this simulation, we generate binomial and negative binomial data, where the true proportions are specified to be a known function.We fit our model (which is different from the model generating the data) and we verify that our proposed model can recover the true unknown proportion.The following model is used to generate pairs of binomial and negative binomial counts over a fine-one dimensional grid from zero to 2π (i.e. the spatial domain D={0, 2π/100,. . ., 2π}): ν p,i = 0.01 + 4.00 sin(s i ), i = 1, . . ., 100, where 'G(α, κ)' stands for 'Gamma' distribution with shape α > 0 and scale κ > 0. We interpret D = {0, 2π/100, . . ., 2π} as a fine-one dimensional grid over 0 to 2π (i.e. in steps of 1/100), however ultimately, the length of the grid is an arbitrary choice that serves as illustration.The values of the slope and intercept in ν π ,i and the values 0.01 and 4.00 in ν p,i are chosen to define the range of values of ν π ,i and ν p,i .The logit(p) ≡ {log(p/(1 − p))} ≡ ν p is a continuous, symmetric and smooth function with the approximate range of To add complexity to the true function, we include a change-point to ν π ,i |ν p,i at π, where we switch from a linear function of ν p,i to a linear function of s i .Also, the presence of ν p,i in the definition of ν π,i suggests a functional relationship between ν π ,i and ν p,i .See simulated dataset and the true proportions in Figure 2. To see if our method was robust to the relative magnitude of the negative binomial counts, we specified the prior on d i to produce negative binomial counts that could cover a wide range of magnitudes.The simulated negative binomial count quantiles (25%, 50% and 75%) based on our choice of prior on d i are 15, 215, 3600.We interpret the quantiles 5, 215 and 3600 as small to large count-values.
We use our proposed model to predict the unknown function in the right panel of Figure 2 (using the data displayed in the left panel).To use our model, one has to specify the basis matrices, covariates and several hyperparameters.We assume an intercept only model for ν π,i and ν p,i , respectively.We consider two choices for basis matrices π j and pj : j = 1, 2. The first is setting the elements equal to evaluations of thin-plate splines with r = 25 knots.This use of thin-plate splines results in a 100×25 matrix that we denote with * .The second choice is to set kj equal to a 100×25 matrix of zeros; k = π , p, j = 1, 2.
We consider six special cases of the model in ( 3): • Joint model with between-type and within-type (JBW) random effects: Here we set π1 = p1 = π2 = p2 = * .Notes: Maximum coverage by rank and type are bolded.The coverage of π i over all i and averaged over the 30 replicates.The 95% confidence intervals for the proportion of credible interval completely covering the true proportion over the 30 replicate are in parentheses.
allowed for a shorter MCMC chain than more traditional approaches, which is one of the advantages of our algorithm.The hyperparameters of the MLB distribution, in practice, are chosen by minimizing a criterion.In this illustration, we set them as = 0.05 and ρ = 0.965 for binomial model; = 0.1 and ρ = 0.99 for negative binomial model; and σ = 0.000001 for both binomial and negative binomial models (see these hyperparameters definition detailed in Appendix B in the Supplemental Materials).The comparison of these six models is repeated with 30 replicates of datasets {y π , y p }. Tables 1 and 2 and the left panel of Figure 3 show that all six special cases yield reasonable predictions (i.e.close to the true and small MSPE) and credible interval coverage for ν π .In this simulation, JB performs very poorly and JBWNEG tends to perform the best in terms of MSPE.In terms of coverage (i.e. the proportion out of 30 replicates that the truth is contained within the 95% credible interval), all methods appear to have similar (high) coverage for binomial data, with JBWBIN, JBW and Uni tending to perform (marginally) the best.

Application: analysis of ACS period estimates of poverty rates in southeast USA
The number of families that fall below the poverty threshold can be modeled using a CS binomial distribution, however the population in a county may express spatial dependence  and is observed with uncertainty.This suggests an opportunity to leverage possible crossdependence between the probability of being below the poverty line and the mean of the population to produce more accurate predictions of poverty rates.The primary goal of this analysis is to predict the family poverty rate for the Southeast region of USA at county level using the proposed Bayesian hierarchical model.The fiveyear period estimates (US Census Bureau, 2013-2017 ACS 5-Year Cumulative Estimates) of incidence of poverty and the population are available at 948 counties in the Southeast US region, which we focus our analysis on.The left and right panels in Figure 1 show the incidence of poverty and the population by county in the Southeast USA.We subjectively chose roughly 95% of the observations to be training.That is, we randomly select 898 out of 948 counties to be the training dataset and the 50 counties are treated as a holdout to the model fit.
The Moran's I (MI) basis functions are used, and a second-order adjacency matrix is chosen using cross validation.The MSPE (between the sample proportion and the model estimated proportion) using the first and second-order adjacency matrix are 0.01351 and 0.01308, respectively.Consequently, we use the second-order adjacency matrix.For covariates, we use the logit transformation of the sample proportion of each race, age, education and single households at the county level, which are all five-year ACS period estimates that are common factors in socioeconomic research (see Refs. [33,34], among others).These same covariates are used in the model for ν π and ν p .The majority of predictors are significant (i.e.zero is not in the credible interval) across both binomial and negative binomial models.The model is evaluated based on the coverage of the credible intervals (i.e.whether or not the credible interval for y p,i and y π,i contains the holdout observations) and MSPEs of the holdout data.The hyperparameters are chosen to minimize (training) MSPE and are = 1.5 and ρ = 0.995 for the binomial model; and = 0.1 and ρ = 0.99 for the negative binomial model (see Table 3 and Figure 4 for MSPE versus prior specification).In general, the estimate of these hyperparameters change for each dataset.The Gibbs sampler is implemented with runs of 5000 iterations and has a burn-in of 3000 iterations.No convergence issues were found.Specifically, trace plots (see Figure 5) were informally checked and the associated Gelman-Rubin's diagnostics [16] were found to be between 1.0 and 1.1, respectively.
In Figure 6 and Table 4, we see that all versions of our model (we dropped JB due to its poor performance in the simulation) produce very small out-of-sample (i.e. on the order

Discussion
In this article, we develop a Bayesian approach to jointly model a spatially referenced countvalued outcome and an associated finite population.We are motivated by estimates made available by ACS.We use the MLB latent process [5,12], which allows us to draw directly from the full-conditional distributions and makes Gibbs sampling more computationally efficient than other choices.The primary motivation for the proposed methodology is that complex spatial dependencies can exist within and between the two probabilities associated with binomial and negative binomial data.In our proposed joint model, these complex spatial dependencies are explicitly defined with basis functions.Furthermore, the specification of low-rank basis functions provides the flexibility of incorporating dimension reduction.
We tested versions of the proposed model via simulations.The simulated studies show that the joint model based on the MLB latent processes achieves reasonable credible interval coverage and accurate predicted proportions.The motivating analysis of ACS poverty data in the Southeast USA illustrates that the proposed Bayesian hierarchical joint model 6.Testing (50) models for binomial data: the first row displays the predicted logit of poverty proportions and their associated MSPE (on the logit-scale for visualization purposes); the second row displays the estimated proportions versus the observed proportions; and the third row displays the 95% lower and upper credible limits for estimated proportions.yields precise and accurate predictions of the probability that a family falls below the poverty threshold.Furthermore, there is evidence that the finite population is informative for predicting US poverty rates.
There are exciting avenues to extend the proposed joint model.First, it is natural to extend this model to the spatio-temporal case, where the time indexed random effects are used to incorporate temporal dependencies.Additionally, the multiscale nature of time in ACS may be modeled using the techniques in the spatio-temporal change of support literature (e.g.see Ref. [2]).There are limitations to our method that could also lead to future developments.In particular, the specification of MLB hyperparameters require small sensitivity studies.One could instead develop a hyperprior distribution for these quantities.Similarly, the choice of the rank r requires a sensitivity student, where one could instead place a prior distribution for certain classes of basis functions.

Figure 1 .
Figure 1.ACS poverty count data: the observed families counts in poverty (left) and the total families number (right).

Figure 2 .
Figure 2. Simulated binomial and negative binomial counts and true proportions.

Figure 3 .
Figure 3.The first panel presents the true values and estimated values of {π i } by method with i on the x-axis.Similarly, the right panel presents the credible bands for {π i } by method.

Figure 4 .
Figure 4. MSPE versus prior specification for ACS binomial data mixed effects model.

Figure 5 .
Figure 5. Trace plots of the JBWBIN model for ACS binomial training data (898).
[10,24,27,39]of y p,i in y π,i |π i , y p,i induces cross-data-type dependence, which again is an important component of our model, but this particular implementation has been used by several others (e.g.see, for example, Refs.[10,24,27,39]).When ν π and ν p are assumed independent, then the cross-dependence between y π and y p is determined by the conditional distribution f (y π,i |ν π,i , y p,i ).Hence, when ν π and ν p are assumed independent, we call this model the aforementioned CS model.In what follows, we consider the case where ν π and ν p are dependent.

Table 1 .
MSPEs by rank and type for simulated data, average over 100 locations and 30 replicates.The rank is the basis matrix column number.Minimum MSPE by rank and type are bolded.The 95% confidence intervals for the MSPE over the 30 replicate (i.e.mean MSPE plus or minus 1.96 times the standard deviation) are in parentheses.

Table 2 .
Estimated credible interval coverage by rank for the simulated data.

Table 3 .
MSPE by different priors specifications for ACS binomial data mixed effects model.

Table 4 .
Posterior summary by model and rank for ACS binomial testing data (50).
coverage.The marginally best performing model is JBW at rank 948.The CS model is fairly competitive in terms of MSPE.However, in terms of coverage, each joint model is preferable than the CS model, with JBWBIN and JBW performing comparably.Hence in terms of coverage and (marginally) MSPE, it appears that the finite population is informative.