Bayesian Generalized Additive Models for Location, Scale, and Shape for Zero-Inflated and Overdispersed Count Data

Frequent problems in applied research preventing the application of the classical Poisson log-linear model for analyzing count data include overdispersion, an excess of zeros compared to the Poisson distribution, correlated responses, as well as complex predictor structures comprising nonlinear effects of continuous covariates, interactions or spatial effects. We propose a general class of Bayesian generalized additive models for zero-inflated and overdispersed count data within the framework of generalized additive models for location, scale, and shape where semiparametric predictors can be specified for several parameters of a count data distribution. As standard options for applied work we consider the zero-inflated Poisson, the negative binomial and the zero-inflated negative binomial distribution. The additive predictor specifications rely on basis function approximations for the different types of effects in combination with Gaussian smoothness priors. We develop Bayesian inference based on Markov chain Monte Carlo simulation techniques where suitable proposal densities are constructed based on iteratively weighted least squares approximations to the full conditionals. To ensure practicability of the inference, we consider theoretical properties such as the involved question of whether the joint posterior is proper. The proposed approach is evaluated in simulation studies and applied to count data arising from patent citations and claim frequencies in car insurance. For the comparison of models with respect to the distribution, we consider quantile residuals as an effective graphical device and scoring rules that allow us to quantify the predictive ability of the models. The deviance information criterion is used to select appropriate predictor specifications once a response distribution has been chosen. Supplementary materials for this article are available online.


INTRODUCTION
For analyzing count data responses with regression models, the log-linear Poisson model embedded in the exponential family regression framework provided by generalized linear or generalized additive models is still the standard approach. In many applied examples, however, we face one or several of the following problems:
• An excess of zeros as compared to the number of zeros expected from the corresponding Poisson fit. For example, in an application on citations of patents considered later in this article, a large fraction of patents is never cited and this fraction seems to be considerably larger than expected under a Poisson distribution.
• Overdispersion, in which case the assumption of equal expectation and variance inherent in the Poisson distribution has to be replaced by variances exceeding the expectation. While it is common practice to introduce a single, scalar overdispersion parameter to inflate the variance (Fahrmeir and Tutz, 2001), more complex forms of overdispersion where the amount of overdispersion depends on covariates and varies over the observations are often more adequate.
• A simple linear predictor is not sufficient to capture all covariate effects. For example, the number of claims arising in car insurance for a policyholder requires both spatial effects to capture the strong underlying spatial correlation and flexible nonlinear effects to model the effects of age of the car and age of the policyholder. Further extensions may be required to include complex interaction effects or random effects in case of grouped or multilevel data.
To overcome these problems, a number of extended count data regression variants have been developed. To deal with an excess of zeros, zero-inflated count data regression models assume that the data are generated by a two-stage process in which a binary process decides between observations that are always zero and observations that will be realized from a usual count data distribution such as the Poisson distribution. As a consequence, zeros can either arise from the binary process or from the Poisson distribution. In the application on citations of patents, the binary process distinguishes those patents that are of very little interest and will, therefore, never be cited from those that are relevant and for which the number of citations follows, for example, a Poisson distribution. Both the probability for the binary decision and the Poisson rate may then be characterized in terms of covariates.
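The two-stage generative mechanism just described can be sketched in a few lines. The sketch below uses hypothetical values for the zero-inflation probability π and the Poisson rate λ; it simply multiplies a Bernoulli selection indicator with a Poisson draw, so that zeros arise either from the binary process or from the count distribution:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, pi, lam = 100_000, 0.3, 2.0          # hypothetical sample size, pi and lambda

kappa = rng.binomial(1, 1.0 - pi, size=n)   # binary selection process kappa ~ Be(1 - pi)
y_tilde = rng.poisson(lam, size=n)          # standard count process y~ ~ Po(lam)
y = kappa * y_tilde                         # observed zero-inflated counts

# Under the ZIP model, P(y = 0) = pi + (1 - pi) * exp(-lam)
frac_zero = np.mean(y == 0)
```

With these values, the empirical fraction of zeros is close to π + (1 − π)exp(−λ) ≈ 0.39, noticeably larger than the plain Poisson probability exp(−2) ≈ 0.14.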
To deal with overdispersion, the negative binomial distribution provides a convenient framework extending the Poisson distribution by a second parameter determining the scale of the distribution; see, for example, Hilbe (2007). The negative binomial distribution can also be combined with zero-inflation as described in the previous paragraph; see, among others, Winkelmann (2008).
For Poisson regression and negative binomial regression with fixed scale parameter and no overdispersion, generalized additive models as developed in Hastie and Tibshirani (1990) and popularized by Wood (2006) provide a convenient framework to overcome the linearity assumptions of generalized linear models when smooth effects of continuous covariates shall be combined in an additive predictor. Inference can then be based on optimizing a generalized cross-validation criterion (Wood 2004), a mixed model representation (Ruppert, Wand, and Carroll 2003; Fahrmeir, Kneib, and Lang 2004; Wood 2008) or Markov chain Monte Carlo (MCMC) simulations (Brezger and Lang 2006; Jullion and Lambert 2007). The framework of generalized additive models for location, scale, and shape (GAMLSS) introduced by Rigby and Stasinopoulos (2005) extends generalized additive models to more complex response distributions where not only the expectation but multiple parameters are related to additive predictors via suitable link functions. In particular, zero-inflated Poisson and zero-inflated negative binomial responses can be embedded in this framework, where for the former both the probability of excess zeros and the Poisson rate, and for the latter the probability of excess zeros, the expectation of the count process, and the scale parameter, are related to regression predictors.
Predictor specifications that go beyond the generalized additive models of Hastie and Tibshirani (1990) comprising only nonlinear effects of continuous covariates have been developed within the framework of structured additive regression and allow arbitrary combinations of parametric linear effects, smooth nonlinear effects of continuous covariates, interaction effects based on varying coefficient terms or interaction surfaces, random effects, and spatial effects using either coordinate information or regional data (Fahrmeir, Kneib, and Lang 2004; Brezger and Lang 2006). Structured additive regression relies on a unifying representation of all these model terms based on nonstandard basis function specifications in combination with quadratic penalties (in a frequentist formulation) or Gaussian priors (in a Bayesian approach).
In this article, we develop Bayesian structured additive regression models for zero-inflated and overdispersed count data covering the following unique features:
• The approach supports the full flexibility of structured additive regression for specifying additive predictors for all parameters of the response distribution including the success probability of the binary process and the scale parameter of the negative binomial distribution.
• The model formulation and inference are embedded in the general framework of GAMLSS, which allows us to develop a generic approach for constructing proposal densities in an MCMC simulation algorithm based on iteratively weighted least squares approximations to the full conditionals as suggested by Gamerman (1997) or Brezger and Lang (2006) for exponential family regression models.
• We provide a numerically efficient implementation including also an extension to multilevel structures that is particularly useful in spatial regression specifications or for models including random effects; see Lang et al. (2014). This implementation is part of the free software package BayesX (Belitz et al., 2012).
• Theoretical results on the propriety of the posterior and the positive definiteness of the working weights required in the proposal densities are included.
• Especially compared to frequentist GAMLSS formulations, our approach has the advantage of including the choice of smoothing parameters directly in the estimation run and of providing valid credible intervals, which are difficult to obtain based on asymptotic maximum likelihood theory.
Model choice between different types of zero-inflated and overdispersed count data models will be approached based on quantile residuals (Dunn and Smyth 1996) to evaluate the fit and proper scoring rules (Gneiting and Raftery 2007) to determine the predictive ability. The deviance information criterion (Spiegelhalter et al. 2002) will be considered to select important covariates for the different predictors once a response distribution has been fixed.
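Randomized quantile residuals in the sense of Dunn and Smyth (1996) are straightforward to compute: for discrete data, one draws u uniformly between F(y − 1) and F(y) under the fitted distribution and maps it through the standard normal quantile function, so that residuals from a correctly specified model are approximately standard normal. A minimal sketch, with function names of our own choosing and a correctly specified Poisson fit assumed for illustration:

```python
import numpy as np
from scipy import stats

def quantile_residuals(y, cdf, rng):
    """Randomized quantile residuals for a discrete fitted distribution.

    y   : observed counts
    cdf : vectorized CDF of the fitted distribution, cdf(k) = P(Y <= k)
    Draw u ~ Uniform(F(y - 1), F(y)) and map through the standard normal
    quantile function (Dunn and Smyth, 1996).
    """
    lower = cdf(y - 1)          # F(y - 1); equals 0 for y = 0
    upper = cdf(y)              # F(y)
    u = rng.uniform(lower, upper)
    return stats.norm.ppf(u)

# hypothetical example: residuals of a correctly specified Poisson fit
rng = np.random.default_rng(7)
lam = 3.0
y = rng.poisson(lam, size=5_000)
r = quantile_residuals(y, lambda k: stats.poisson.cdf(k, lam), rng)
```

A normal quantile-quantile plot of `r` then serves as the graphical device mentioned above; systematic deviations from the diagonal indicate a misspecified response distribution.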
A few approaches developing similar types of models and inference are already available. For example, Fahrmeir and Osuna Echavarría (2006) developed a Bayesian approach for zero-inflated count data regression with Poisson or negative binomial responses but only allow for covariate effects on the expectation of the count data part of the response distribution and not on the probability of excess zeros or the scale parameter of the negative binomial distribution. Czado et al. (2007) also developed zero-inflated generalized Poisson regression models for count data where the overdispersion and zero-inflation parameters can be fitted by maximum likelihood methods but only considered linear predictors. He, Xue, and Shi (2010) studied the length of sick leaves in Indonesia in 1997 in a zero-inflated Poisson model that allows for the inclusion of one piecewise linear effect in both the expectation of the Poisson distribution and the probability of excess zeros. Their model, relying on piecewise linear functions with four and three knots, respectively, can be reproduced with our more flexible model. Doing so, we obtain similar results, but the DIC would be in favor of a model including both age and income nonlinearly, which would not have been possible in the model by He, Xue, and Shi (2010).
There are two packages in R that provide regression for zero-inflated models. In gamlss (Rigby and Stasinopoulos 2005), maximum (penalized) likelihood inference is used to fit models within the GAMLSS framework including the zero-inflated Poisson and (zero-inflated) negative binomial distribution. A description of the implementation of GAMLSS in R and data examples was given in Stasinopoulos and Rigby (2007). We will compare the proposed Bayesian approach for zero-inflated and overdispersed count data with the penalized likelihood approach of gamlss in extensive simulations in Section 4. Linear predictors for the count and zero-inflation part can be specified in the package pscl (Zeileis, Kleiber, and Jackman 2008), while the scale parameter of the negative binomial distribution is always assumed constant. Parameters are estimated with the function optim to maximize the likelihood. We will not directly compare our results with those from pscl but will consider model specifications with the same restrictions as competitors in our applications.
The rest of this article is organized as follows: Section 2 describes the model specification for Bayesian zero-inflated and overdispersed count data regression in detail including prior specifications. Section 3 develops the corresponding MCMC simulation algorithm based on iteratively weighted least squares proposals and discusses theoretical results. Section 4 evaluates the performance of the Bayesian approach compared to the penalized likelihood approach of GAMLSS within a restricted class of purely additive models and for more complex geoadditive models. Sections 5 and 6 provide analyses of the applications on citations of patents and claim frequencies in car insurance. The final Section 7 summarizes our findings and comments on directions of future research.

ZERO-INFLATED AND OVERDISPERSED COUNT DATA REGRESSION

Observation Models
We assume that zero-inflated count data y_i as well as covariate information ν_i have been collected for individuals i = 1, ..., n. For given covariates ν_i, the responses y_i are conditionally independent with (conditional) density p(y_i | ν_i) = π_i 1_{0}(y_i) + (1 − π_i) p̃(y_i | ν_i). This distribution arises from the hierarchical definition of the responses as the product of two independent variables y_i = κ_i ỹ_i, where κ_i is a binary selection process κ_i ∼ Be(1 − π_i) and ỹ_i follows one of the standard count data models, ỹ_i ∼ p̃, such as a Poisson distribution or a negative binomial distribution. The underlying reasoning is as follows: To model the excess of zeros observed in zero-inflated count data, the response is zero if ỹ_i equals zero, but additional zeros arise whenever the indicator variable κ_i is zero. The amount of extra zeros introduced compared to the standard count data distribution of ỹ_i is determined by the probability π_i. Our focus is on two special cases for the count data part of the distribution, namely the Poisson distribution ỹ_i ∼ Po(λ_i) with rate λ_i and the negative binomial distribution ỹ_i ∼ NB(μ_i, δ_i) with expectation μ_i and scale parameter δ_i. The latter choice is particularly suited when the count data part of the response distribution is overdispersed.
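Both mixture densities can be evaluated directly. The sketch below assumes the common NB2 parameterization with Var(ỹ) = μ + μ²/δ, which matches the role of δ as a scale parameter here but is an assumption on our part; function names are our own:

```python
import numpy as np
from scipy import stats

def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson density: pi * 1{0}(y) + (1 - pi) * Po(y; lam)."""
    return pi * (y == 0) + (1.0 - pi) * stats.poisson.pmf(y, lam)

def zinb_pmf(y, pi, mu, delta):
    """Zero-inflated negative binomial with count-part mean mu and scale delta,
    assuming the NB2 parameterization Var(y~) = mu + mu**2 / delta."""
    p = delta / (delta + mu)                       # success probability of NB
    return pi * (y == 0) + (1.0 - pi) * stats.nbinom.pmf(y, delta, p)
```

Note how, for y = 0, both densities collect mass from the binary process and from the count distribution, e.g. zip_pmf(0, π, λ) = π + (1 − π)exp(−λ).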
To allow maximum flexibility in the zero-inflated count data regression specifications, both the parameter for the excess of zeros and the parameters of the count data part of the distribution are related to regression predictors constructed from covariates via suitable link functions. For zero-inflated Poisson (ZIP) regression, we choose η_i^π = logit(π_i) and η_i^λ = log(λ_i), whereas for zero-inflated negative binomial (ZINB) regression we assume η_i^π = logit(π_i), η_i^μ = log(μ_i), and η_i^δ = log(δ_i). Both specifications can be embedded in the general class of generalized additive models for location, scale, and shape proposed by Rigby and Stasinopoulos (2005). Poisson and negative binomial (NB) regression are obtained as special instances (π = 0) with one and two parameters λ and μ, δ, respectively. Note that in applications we may often observe that modeling either zero-inflation or overdispersion is sufficient to adequately represent the data generating mechanism. In particular, a large fraction of observed zeros can also be related to overdispersion and it is, therefore, not generally useful to consider the most complex model type for routine applications. In Sections 5 and 6 we will further comment on this issue and will also provide ways of comparing different models for zero-inflated and overdispersed count data.
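A minimal sketch of the response functions implied by these links; the predictor values below are hypothetical and only illustrate how each parameter is mapped back to its admissible range:

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

# ZINB parameters obtained from their predictors via the response functions
eta_pi, eta_mu, eta_delta = -1.2, 0.8, 0.3   # hypothetical predictor values
pi    = inv_logit(eta_pi)    # probability of excess zeros, in (0, 1)
mu    = np.exp(eta_mu)       # expectation of the count part, > 0
delta = np.exp(eta_delta)    # scale parameter, > 0
```

The logit and log links guarantee valid parameter values for arbitrary real-valued predictors, which is what makes unconstrained additive predictors possible for all distribution parameters.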

Semiparametric Predictors
For each of the predictors, we assume a structured additive specification
η_i = β_0 + f_1(ν_i) + ... + f_J(ν_i),
where, for notational simplicity, we drop the parameter index from the predictor and the included effects. While β_0 is an intercept term representing the overall level of the predictor, the generic functions f_j(ν_i), j = 1, ..., J, relate to different types of regression effects combined in an additive fashion. In structured additive regression, each function is approximated in terms of D_j basis functions such that
f_j(ν_i) = Σ_{d_j=1}^{D_j} β_{j,d_j} B_{j,d_j}(ν_i).   (1)
For example, for nonlinear effects of continuous covariates, the basis functions may be B-spline bases, while for spatial effects based on coordinates, the basis functions may be radial basis functions or kernels. We will give some more details on special cases later on in this section. The basis function approximation (1) implies that each vector of function evaluations f_j = (f_j(ν_1), ..., f_j(ν_n))' can be written as Z_j β_j, where Z_j is the design matrix arising from the evaluations of the basis functions, that is, Z_j[i, d_j] = B_{j,d_j}(ν_i), and β_j = (β_{j,1}, ..., β_{j,D_j})' is the vector of all regression coefficients. Then the predictor vector η = (η_1, ..., η_n)' can compactly be represented as
η = β_0 1 + Z_1 β_1 + ... + Z_J β_J,
where 1 is an n-dimensional vector of ones.
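The relation f_j = Z_j β_j and the additive predictor can be illustrated with any basis. The sketch below uses Gaussian bumps as a stand-in for, say, a B-spline basis; the basis choice, knot grid, and coefficient values are illustrative assumptions:

```python
import numpy as np

def basis_design(x, knots, scale=0.5):
    """Design matrix Z[i, d] = B_d(x_i): each basis function evaluated at
    each observation (Gaussian bumps as a stand-in for B-splines)."""
    return np.exp(-0.5 * ((x[:, None] - knots[None, :]) / scale) ** 2)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
knots = np.linspace(0, 10, 12)           # D_j = 12 basis functions
Z = basis_design(x, knots)               # n x D_j design matrix
beta0 = 0.5
beta = rng.normal(0, 0.3, size=knots.size)

f = Z @ beta                             # vector of function evaluations f_j
eta = beta0 * np.ones(n) + f             # eta = beta0 * 1 + Z_j beta_j
```

With several model terms, one simply concatenates the contributions Z_j β_j; smoothness of f_j is then controlled through the prior on β_j discussed next.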

Prior Specifications
To enforce specific smoothness properties of the function estimates arising from the basis function approximation (1), we consider multivariate Gaussian priors for the regression coefficients,
p(β_j | τ_j^2) ∝ (τ_j^2)^{−rk(K_j)/2} exp(−(1/(2τ_j^2)) β_j' K_j β_j),
where τ_j^2 is the smoothing variance determining our prior confidence and K_j is the prior precision matrix implementing prior assumptions about the smoothness of the function. Note that K_j may not have full rank and, therefore, the Gaussian prior will usually be partially improper, yielding a singular normal prior. A completely improper prior is obtained as a special case for either τ_j^2 → ∞ or K_j = 0.
Several prior choices are available for the variance parameters τ_j^2. A standard choice is inverse gamma priors τ_j^2 ∼ IG(a_j, b_j), usually with small values for the hyperparameters a_j and b_j (our default option is τ_j^2 ∼ IG(0.001, 0.001)). Other proposals are uniform priors for τ_j^2 or the standard deviation τ_j, obtained as special cases of inverse gamma priors with a_j = −1, b_j = 0 and a_j = −0.5, b_j = 0, respectively. Half-Cauchy priors have recently been advocated by Polson and Scott (2012); see also Gelman (2006) for a general discussion of prior specifications in Bayesian hierarchical mixed models. We will discuss sensitivity with respect to these different prior choices later in our additive and geoadditive simulations (Section 4) and the applications (Sections 5 and 6).

Special Cases
To make the generic model specification introduced in the previous section more concrete, we compactly summarize some special cases:
• Linear effects f_j(ν_i) = x_i'β_j, where x_i is a subvector of original covariates: The design matrix is obtained by stacking the rows x_i', and usually a noninformative prior with K_j = 0 is chosen for the regression coefficients β_j. A ridge-type prior with K_j = I is an alternative, especially if the dimension of the vector β_j is large.
• P-splines for nonlinear effects f_j(ν_i) = f_j(x_i) of a single continuous covariate x_i: The design matrix comprises evaluations of B-spline basis functions defined upon an equidistant grid of knots and a given degree. The precision matrix is given by K_j = D'D, where D is a difference matrix of appropriate order. Usual default choices are 20 inner knots, cubic B-splines, and second-order differences; see Lang and Brezger (2004) for details.
• Markov random fields f_j(ν_i) = f_j(s_i) for a discrete spatial variable s_i ∈ {1, ..., S}: The design matrix is an indicator matrix connecting individual observations with the corresponding regions, that is, Z[i, s] is one if observation i belongs to region s and zero otherwise. To implement spatial smoothness, K_j is derived from the adjacency structure of the regions indicating which regions are neighbors; see Rue and Held (2005) for details.
• Random effects f_j(ν_i) = β_{g_i} based on a grouping variable g_i ∈ {1, ..., G}: The design matrix is an indicator matrix connecting individual observations with the corresponding groups, that is, Z[i, g] is one if observation i belongs to group g and zero otherwise. The precision matrix is chosen as K_j = I.
A more detailed exposition for structured additive regression specifications comprising also bivariate surfaces or varying coefficient terms is provided in Fahrmeir, Kneib, and Lang (2004).

INFERENCE
Our Bayesian approach to zero-inflated and overdispersed count data regression relies on MCMC simulation techniques. For both the ZIP and the ZINB model, the full conditionals for the regression coefficients arising from the basis function expansion are not analytically accessible due to the complex structure of the likelihoods. The same remains true for the Poisson and NB model. One possibility is to develop suitable proposal densities based on iteratively weighted least squares (IWLS) approximations to the full conditionals, as detailed below. Note that, in contrast, the full conditionals for the smoothing variances τ_j^2 can be derived in closed form as inverse gamma distributions with updated parameters a_j' = a_j + rk(K_j)/2 and b_j' = b_j + β_j'K_jβ_j/2.
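The closed-form Gibbs update for a smoothing variance can be sketched as follows; the inverse gamma full conditional IG(a_j + rk(K_j)/2, b_j + β_j'K_jβ_j/2) shown in the comment is the standard result for Gaussian smoothness priors and should be read as an assumption matching the paper's Equation (4), with function and variable names of our own:

```python
import numpy as np

def sample_tau2(beta_j, K_j, a_j=0.001, b_j=0.001, rng=None):
    """Gibbs draw from the assumed closed-form full conditional
    tau2 | . ~ IG(a_j + rk(K_j)/2, b_j + beta_j' K_j beta_j / 2)."""
    rng = np.random.default_rng() if rng is None else rng
    a_post = a_j + np.linalg.matrix_rank(K_j) / 2.0
    b_post = b_j + 0.5 * beta_j @ K_j @ beta_j
    # an IG(a, b) draw is the reciprocal of a Gamma(shape=a, scale=1/b) draw
    return 1.0 / rng.gamma(shape=a_post, scale=1.0 / b_post)

# hypothetical example with a second-order difference penalty
rng = np.random.default_rng(3)
D = np.diff(np.eye(6), n=2, axis=0)
K = D.T @ D
beta_j = rng.normal(size=6)
tau2 = sample_tau2(beta_j, K, rng=rng)
```

Because this update is a Gibbs step, it is always accepted, in line with the 100% acceptance rates for the smoothing variances reported below.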

IWLS Proposals
The basic idea of IWLS proposals is to determine an approximation of the full conditional that leads to a Gaussian proposal density with expectation and covariance matrix corresponding to the mode and the curvature at the mode at the current state of the Markov chain. The resulting Metropolis-Hastings sampler described in Section 3.2 has the advantage that the IWLS proposal automatically adapts to the form of the full conditional and thereby avoids manual tuning. To simplify the description, we assume for the moment a model with only one predictor η, but the idea immediately carries over to our multipredictor framework since in the MCMC algorithm we always work only with subblocks of coefficients corresponding to one predictor component. Let l(η) be the log-likelihood depending on the predictor η. It is then easy to verify that the full conditional for a parameter block β_j is proportional to exp(l(η) − (1/(2τ_j^2)) β_j'K_jβ_j). A Taylor expansion of l(η) around the current state, with t an index of the current iteration, leads to a Newton-method-type approximation. From this, we can deduce the working model in which z = η + W^{−1}v is a vector of working observations with the predictor as expectation, v = ∂l/∂η is the score vector, and W is the working weight matrix based on a Fisher-scoring approximation, with w_i = E(−∂²l/∂η_i²) on the diagonal and zero otherwise. With the priors (3) on β_j, we finally obtain the Gaussian proposal densities N(μ_j, P_j^{−1}) for β_j with expectation and precision matrix
μ_j = P_j^{−1} Z_j'W(z − η_{−j}),   P_j = Z_j'WZ_j + K_j/τ_j^2,   (5)
where η_{−j} = η − Z_jβ_j is the predictor without the jth component.
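The construction of the proposal moments can be sketched directly from these quantities; the expressions for μ_j and P_j below are the standard IWLS quantities and should be read as an assumption matching the paper's Equation (5), with names of our own:

```python
import numpy as np

def iwls_proposal(Z_j, eta, eta_minus_j, v, w, K_j, tau2):
    """Assumed IWLS proposal N(mu_j, P_j^{-1}) for the block beta_j:
    working observations z = eta + v / w, precision P_j = Z'WZ + K/tau2,
    mean mu_j = P_j^{-1} Z'W (z - eta_{-j})."""
    z = eta + v / w                          # working observations
    WZ = Z_j * w[:, None]                    # W Z_j without forming W explicitly
    P_j = Z_j.T @ WZ + K_j / tau2            # proposal precision matrix
    mu_j = np.linalg.solve(P_j, WZ.T @ (z - eta_minus_j))
    return mu_j, P_j

# toy check under a Gaussian likelihood (v = y - eta, w = 1): the proposal
# mean reduces to a (weakly ridge-regularized) least squares fit
rng = np.random.default_rng(2)
n = 50
Z = rng.normal(size=(n, 3))
beta_true = np.array([1.0, -0.5, 2.0])
y = Z @ beta_true
eta = np.zeros(n)                            # current state beta_j = 0
v, w = y - eta, np.ones(n)
mu_j, P_j = iwls_proposal(Z, eta, eta, v, w, np.eye(3), tau2=1e8)
```

For non-Gaussian likelihoods, only v and w change; the linear algebra of the proposal stays identical, which is what makes the sampler generic across ZIP, ZINB, Poisson, and NB models.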
To be able to apply the IWLS proposals in the context of zero-inflated count data regression, we now have to derive the required quantities, namely the score vectors v and the working weights W. For the ZIP model, the elements of the score vectors for the zero-inflation and the Poisson parts of the model, as well as the corresponding working weights, can be derived in closed form. For the ZINB model, the score vectors and working weights involve the trigamma function ψ_1(x) = d²/dx² log(Γ(x)) for x > 0. The expectations of the digamma and trigamma functions contained in W_δ are approximated by truncated sums over the probabilities p(k), k = 0, ..., m, of the count distribution. We choose m such that it is lower than or equal to the largest observed count and the cumulative sum Σ_k p(k) of probabilities is above a certain threshold (our default is 0.999). The computing time is considerably dominated by the evaluation of these approximations, and it proved to work quite well in practice to calculate them only within the initialization period for computing the starting values (see Section 3.2 below). After that period, we keep expression (11) fixed during the MCMC iterations, which reduces computing time by at least two thirds. Explicit derivations of all score vectors and working weight matrices are given in the supplement, Section C.1. Further discussion and proofs of the positive definiteness of W are given in Section 3.3. The required quantities in the Poisson and NB model can directly be obtained from those of the ZIP and ZINB model.
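The truncation rule can be illustrated for the trigamma case. The NB parameterization below (success probability δ/(δ + μ)) is an assumption consistent with δ acting as a scale parameter, and the function name is our own:

```python
import numpy as np
from scipy import stats
from scipy.special import polygamma   # polygamma(1, x) is the trigamma function

def expected_trigamma(mu, delta, largest_count, threshold=0.999):
    """Approximate E[psi_1(y + delta)] under an assumed NB(mu, delta) by a
    truncated sum, stopping at m <= largest observed count once the
    cumulative probability sum_k p(k) exceeds the threshold (default 0.999)."""
    p = delta / (delta + mu)
    k = np.arange(0, largest_count + 1)
    probs = stats.nbinom.pmf(k, delta, p)
    cum = np.cumsum(probs)
    m = int(np.argmax(cum >= threshold)) if cum[-1] >= threshold else largest_count
    kk = np.arange(0, m + 1)
    return np.sum(polygamma(1, kk + delta) * stats.nbinom.pmf(kk, delta, p))

approx = expected_trigamma(mu=2.0, delta=1.5, largest_count=200)
```

Since ψ_1 is decreasing and the sum is truncated, the approximation is bounded above by ψ_1(δ); evaluating it once during initialization and reusing it afterwards mirrors the strategy described above.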

Metropolis-Hastings Algorithm
The resulting MCMC algorithm can compactly be summarized as follows:
1. Initialization: Let T be the number of iterations. Determine suitable starting values for all unknown parameters (e.g., using the algorithm described in Section A).
2. Loop over the iterations t = 1, ..., T, the predictors of a given model, and the components of each predictor:
(a) Compute the working observations and the resulting Gaussian proposal density with expectation μ_j and precision matrix P_j given in (5).
(b) Update of β_j: Propose a new state from the Gaussian proposal density and accept or reject it in a Metropolis-Hastings step. To solve the identifiability problem inherent to additive models, the sampled effect is corrected according to Rue and Held (2005, Algorithm 2.6), such that Aβ_j = 0 holds, with an appropriate matrix A such as A = 1'Z_j.
(c) Update of τ_j^2: Generate the new state from the inverse gamma distribution IG(a_j', (b_j')^(t)) with a_j' and b_j' given in (4).
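Because the IWLS proposal is asymmetric, the Metropolis-Hastings acceptance probability involves both the forward and the reverse proposal densities. A generic sketch (function names are our own; in a full IWLS sampler the reverse proposal moments would be recomputed at the proposed state):

```python
import numpy as np
from scipy import stats

def mh_accept(log_post, beta_cur, beta_prop, mu_fwd, P_fwd, mu_rev, P_rev, rng):
    """Metropolis-Hastings acceptance for an asymmetric Gaussian proposal:
    beta_prop was drawn from N(mu_fwd, P_fwd^{-1}) built at the current state,
    while the reverse move would use N(mu_rev, P_rev^{-1}) built at the
    proposed state."""
    lq_fwd = stats.multivariate_normal.logpdf(beta_prop, mu_fwd, np.linalg.inv(P_fwd))
    lq_rev = stats.multivariate_normal.logpdf(beta_cur, mu_rev, np.linalg.inv(P_rev))
    log_alpha = log_post(beta_prop) - log_post(beta_cur) + lq_rev - lq_fwd
    return np.log(rng.uniform()) < min(0.0, log_alpha)

# sanity check: if the proposal coincides with the target, every move is accepted
rng = np.random.default_rng(4)
logp = lambda b: stats.multivariate_normal.logpdf(b, np.zeros(2), np.eye(2))
mu, P = np.zeros(2), np.eye(2)
accepted = mh_accept(logp, np.array([0.3, -0.1]), np.array([1.2, 0.7]),
                     mu, P, mu, P, rng)
```

This limiting case explains the high acceptance rates reported below: the closer the IWLS approximation is to the true full conditional, the closer log_alpha is to zero and the more often proposals are accepted.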
By construction, the acceptance rates for the smoothing variances are 100% since their updates are realized by a Gibbs sampler. In several simulations and in the applications, we observed acceptance rates between 70% and 90% for linear and nonlinear effects. Note that, in contrast to random walk proposal Metropolis-Hastings samplers, high acceptance rates are desired here since they reflect situations in which the proposal density is very close to the true full conditional. In cases with high-dimensional parameter vectors, such as for spatial effects, the acceptance rates might be lower than 30%. An extension to multilevel structures can address this problem and is explained in Section B of the online supplementary material.

Theoretical Results and Numerical Details
Propriety of the Posterior. Since our model specification includes several (partially) improper normal priors, a natural question is whether the resulting posterior is actually proper. For exponential family regression with similar predictor types, this question has been investigated, for example, in Fahrmeir and Kneib (2009) or Sun, Tsutakawa, and Zhuoqiong (2001), and we will now generalize these results to the framework of GAMLSS. Assume that conditionally independent observations y_i, i = 1, ..., n, with densities p(y_i) corresponding to an m-parametric distribution family with parameters ϑ_{i1}, ..., ϑ_{im} are given. Let η_i^{ϑ_1}, ..., η_i^{ϑ_m} be the predictors linked to the m parameters. For a generic predictor vector, Equation (1) allows us to write η^{ϑ_l} = Σ_{j=1}^{J} Z_j^{ϑ_l} β_j^{ϑ_l}, l = 1, ..., m, with appropriate design matrices Z_j and vectors of regression coefficients β_j. The basic idea to obtain sufficient conditions for the propriety of the posterior is to rewrite this model in a mixed model representation with iid individual-specific random effects where we explicitly differentiate between effects with proper and (partially) improper priors. We now discuss the required assumptions in more detail: A. Conditions on the densities. Assume that the set of observations can (after reordering) be partitioned such that for some n* ≥ 1 the densities p(y_i), i = 1, ..., n*, are integrable with respect to the predictors (A.1), while the densities are bounded for i = n* + 1, ..., n. This implies that for at least one observation the density is integrable (with respect to the predictors) and that all remaining densities are bounded. Since we are dealing with discrete distributions, all densities are automatically bounded by 1, so that only Condition (A.1) can be an issue in practice. In our class of models, Condition (A.1) is usually fulfilled if certain restrictions apply to specific parameters that exclude extreme values on the boundary of the parameter space.
This is similar to the case of logistic regression with binary responses where finite integrals can be achieved by restricting the parameter space for the success probability to either π i > 0 for at least one observation y i = 0 or π i < 1 for one y i = 1 (excluding one of the boundary cases).
For Poisson distributed responses, the density is integrable for all y_i > 0, such that n* corresponds to the number of nonzero observations. For zero-inflated Poisson, we require at least one observation with y_i > 0 for which additionally the boundary case π_i = 0 is excluded. Note that excluding the boundary case simply implies that we exclude cases in which no information is available on the zero-inflation part of the distribution. For the negative binomial case, we similarly have to restrict the parameter space for the overdispersion parameter to exclude the boundary cases δ_i = 0 and δ_i = ∞ together with y_i > 0 for at least one observation y_i. Excluding these boundary cases simply requires that one observation y_i exists for which the second moment exists and for which information on δ_i is available. For zero-inflated negative binomial responses, the condition on the probability of excess zeros and the condition on the overdispersion parameter have to be combined. Summarizing, we only run into trouble if for all y_i > 0 we are in one of the boundary cases described before. Exploratory analyses based on simpler parametric models will usually provide helpful guidance on excluding such situations.
B. Reparameterization and conditions on ranks of design matrices. To explicitly differentiate between model components with proper and improper priors, we apply a mixed model representation to all vectors of regression coefficients with partially improper priors such that β_j^{ϑ_l} = X_j^{ϑ_l} ξ_j^{ϑ_l} + V_j^{ϑ_l} b_j^{ϑ_l}, where ξ_j^{ϑ_l} is a vector of fixed effects with flat prior p(ξ_j^{ϑ_l}) ∝ const and b_j^{ϑ_l} are iid Gaussian random effects with proper prior and dimension k_j^{ϑ_l} = rk(K_j^{ϑ_l}) (see Fahrmeir and Kneib 2009, for details). The complete predictor can then be written as η^{ϑ_l} = X^{ϑ_l} ξ^{ϑ_l} + V_0^{ϑ_l} b_0^{ϑ_l} + V^{ϑ_l} b^{ϑ_l}, where X^{ϑ_l} ξ^{ϑ_l} comprises all predictor components with a flat prior, V_0^{ϑ_l} b_0^{ϑ_l} corresponds to the random effect with the largest dimension (denoted as k_0^{ϑ_l}), and V^{ϑ_l} b^{ϑ_l} is a composition of all remaining random effects.
Let k̄_0 = min{k_0^{ϑ_1}, ..., k_0^{ϑ_m}} and assume that we can choose k̄_0 observations, including at least one observation fulfilling (A.1), to define the submodel corresponding to these observations. Then the following rank conditions have to be fulfilled: (B.1) The design matrix X_s^{ϑ_l} has full rank r^{ϑ_l}. (B.2) The design matrices of the submodel have the same ranks as in the complete model. (B.3) The analogous restriction holds for the design matrix of the largest random effect arising from the mixed model representation.
To ensure (B.1), superfluous columns arising from the reparameterization have to be deleted. In particular, duplicated constant columns representing the levels of the functions are deleted, which corresponds to the centering of functions that we included in the MCMC algorithm. Condition (B.2) indicates that the rank of the design matrices in the submodel is the same as in the complete model, whereas (B.3) defines a similar restriction for the design matrix of the largest random effect arising from the mixed model representation. C. Assumptions on the hyperparameters. Define the normalized submodel η̃_s^{ϑ_l} corresponding to the k̄_0 selected observations, where ε_{0s}^{ϑ_l} ∼ N(0, (τ_0^2)^{ϑ_l} I_{k̄_0}) represents an iid random effect since V_{0s}^{ϑ_l} b_{0s}^{ϑ_l} ∼ N(0, (τ_0^2)^{ϑ_l} V_{0s}^{ϑ_l} (V_{0s}^{ϑ_l})'). Define then the residual sum of squares SSE_s^{ϑ_l} for the normalized submodel. We then impose Conditions (C.1)-(C.4) on the hyperparameters (a_j^{ϑ_l}, b_j^{ϑ_l}) of the inverse gamma prior of the jth model term, j = 1, ..., J^{ϑ_l}, in the lth predictor, with (C.4) given by SSE_s^{ϑ_l} + 2b_0^{ϑ_l} > 0. Condition (C.1) relates the rank of the random effects part of one individual effect to the sum of all rank deficiencies in the corresponding predictor and requires that the dimensionality is not too small. The condition can be ensured by increasing the value of a_j^{ϑ_l} if necessary. Condition (C.2) restricts the number of effects with flat prior to be at most equal to the dimension of the largest random effects part in the model. Again, the condition can be ensured by increasing the hyperparameter values. Condition (C.3) requires that if b_j^{ϑ_l} is set to zero, the parameter a_j^{ϑ_l} has to be negative. This includes situations corresponding to flat priors for the random effects variance (a_j^{ϑ_l} = −1, b_j^{ϑ_l} = 0) or standard deviation (a_j^{ϑ_l} = −0.5, b_j^{ϑ_l} = 0) but excludes Jeffreys' prior (a_j^{ϑ_l} = b_j^{ϑ_l} = 0).
Finally, Condition (C.4) requires that either there is variation in the residual sum of squares of the normalized submodel (implying that not all effects are zero) or b_0^{ϑ_l} > 0 holds. The latter requirement can always be ensured in practice but excludes flat priors for the random effects variances or standard deviations.
This allows us to adapt the sufficient conditions for the propriety of the posterior derived in Fahrmeir and Kneib (2009), yielding Theorem 1, which states that the joint posterior is proper whenever Conditions (A.1), (B.1)-(B.3), and (C.1)-(C.4) hold. A proof of the theorem is contained in the supplement, Section D. Note that Theorem 1 gives sufficient but not necessary conditions, so that the posterior can in fact still be proper even if one or more conditions are violated. Compared to the usual exponential family case, the conditions on rank deficiencies have to apply separately for each predictor in the model, so that the total requirements are in general stronger than in the generalized additive model case.
Regularity of the posterior precision matrix. Concerning the IWLS proposals, one requirement is that the covariance matrix of the approximating Gaussian proposal density is positive definite and, therefore, invertible. Given full column rank of the design matrix, this is ensured when the diagonal elements of the working weight matrices are all positive. Positivity of the weights is always given for zero-inflated Poisson models, as shown in Section C.2. For zero-inflated negative binomial models, the weights involved in the updates for π and μ are always positive (see again Section C.2), while this is not necessarily the case for the weights related to δ. Note, however, that this is not too problematic since positive weights are a sufficient but not a necessary condition for the precision matrix to be invertible. Moreover, we have empirically observed that negative weights occur only rarely and only in extreme parameter constellations. In the exceptional case that a computed weight is negative, our implementation sets it to a small positive value to avoid rank-deficient precision matrices.
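The weight clamping described above can be sketched as follows. This is a minimal Python illustration with made-up design, weights, and a ridge-type penalty, not the BayesX implementation; the point is that after clamping, the IWLS precision matrix X′WX + K/τ² admits a Cholesky factorization.

```python
import numpy as np

def stabilized_precision(X, w, K, tau2, eps=1e-6):
    """Build the IWLS posterior precision X'WX + K/tau2, clamping any
    negative working weights to a small positive value so that the
    matrix stays positive definite."""
    w = np.where(w > 0, w, eps)            # clamp rare negative weights
    P = X.T @ (w[:, None] * X) + K / tau2  # sum_i w_i x_i x_i' + penalty
    return P, w

# Illustration with made-up quantities:
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
w = rng.normal(loc=1.0, scale=0.5, size=50)  # a few draws may be negative
K = np.eye(4)
P, w_pos = stabilized_precision(X, w, K, tau2=0.5)
# P is symmetric positive definite, hence a Cholesky factor exists:
L = np.linalg.cholesky(P)
```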
Implementation. The Bayesian zero-inflated and overdispersed count data approach developed in this article is implemented in the open source software BayesX (Belitz et al. 2012). We use methods for efficient storing of large datasets and sparse matrix algorithms for sampling from multivariate Gaussian distributions; see Lang et al. (2014) for details. This also has the advantage that the multilevel framework, briefly outlined in Section B of the supplementary materials, becomes accessible for zero-inflated and overdispersed count data regression.
To compute starting values that ensure rapid convergence toward the stationary distribution, we make use of a backfitting algorithm (Hastie and Tibshirani 1990) with fixed smoothing parameters; see Section A of the online supplementary material for further details.
A challenge in working with count data models is the numerical stability of the software. Suppose, for instance, that we estimate a (possibly complex) ZIP regression while the true model is a simple Poisson regression without zero-inflation. Then π is actually zero, and the estimated predictor η^π corresponding to π will tend toward very small values, so that a software crash (e.g., due to overflow errors) is very likely. The problems get even worse for the ZINB model. A similar problem occurs for π → 1, but this case is easier to detect in applications since then all observations would be equal to zero. To avoid numerical problems, we included an optional "save estimate" in our software that prevents a software crash due to numerical instability. This is achieved by updating a vector of regression parameters β_j only if the proposed new state β_j^p of the Markov chain keeps the predictor vector within a certain prespecified range (e.g., −10 ≤ η^π ≤ 10). Otherwise, the current state of the chain is kept. In the majority of applications, a predictor outside these limits will not occur at all, or only in very few iterations. If it occurs frequently, the estimated results will not be fully valid, but this should rather be taken as an indicator that the specified model is too complex for the data at hand.
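The "save estimate" guard can be sketched as follows. This is an illustrative Python fragment with hypothetical proposal and acceptance quantities (the IWLS proposal and its acceptance ratio are abstracted into `beta_prop` and `log_alpha`); the actual mechanism is part of BayesX.

```python
import numpy as np

def guarded_update(beta_cur, eta_rest, X_j, beta_prop, log_alpha, rng,
                   lo=-10.0, hi=10.0):
    """One Metropolis-Hastings decision for a coefficient block beta_j with
    the optional 'save estimate': the proposal is only considered if the
    implied predictor eta = eta_rest + X_j beta stays inside [lo, hi];
    otherwise the current state of the chain is kept."""
    eta_prop = eta_rest + X_j @ beta_prop
    if eta_prop.min() < lo or eta_prop.max() > hi:
        return beta_cur                    # predictor out of range: keep state
    if np.log(rng.uniform()) < log_alpha:  # usual MH accept/reject step
        return beta_prop
    return beta_cur

rng = np.random.default_rng(2)
X_j = np.ones((5, 1))
eta_rest = np.zeros(5)
# A proposal that would push the predictor to 50 is always rejected:
kept = guarded_update(np.array([0.5]), eta_rest, X_j, np.array([50.0]),
                      log_alpha=0.0, rng=rng)
# A proposal within the limits passes through the usual MH step:
acc = guarded_update(np.array([0.5]), eta_rest, X_j, np.array([2.0]),
                     log_alpha=0.0, rng=rng)
```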

SIMULATIONS
With the simulations conducted in this section, we aim at studying various properties of our proposed Bayesian approach for zero-inflated and overdispersed count data regression (denoted as MCMC in the following) and at comparing it with frequentist penalized likelihood estimation (denoted as ML in the following). The computations for the Bayesian approach have been carried out in BayesX. For penalized likelihood estimation, we considered the gamlss package in R (Stasinopoulos and Rigby 2007) but observed a number of limitations in its performance, so that comparisons could not be made in all cases. We, therefore, structured the rest of this section as follows: Section 4.1 considers only additive models that consist of a combination of nonlinear effects of continuous covariates. We mainly present results for ZIP regression since we observed convergence problems of the Newton-Raphson/Fisher-scoring algorithm built into the gamlss package for about 10% of the simulation replications for the ZINB model. These problems could not be overcome even with different smoothing parameter choices (based on find.hyper) for the function pb. We also tried the ga function within the gamlss.add package for our simulated data, which caused even more convergence problems than the gamlss package. After the detailed comparison of frequentist and Bayesian inference for ZIP regression, we summarize results from a number of additional simulations addressing, for example, the performance of Bayesian inference in NB and ZINB regression, the impact of relatively small sample sizes on estimation quality, effect separation between the zero-inflation and the count data part of the model, and the prior sensitivity of the results. In Section 4.2, we then extend the additive models by a spatial effect comprising a structured part (modeled as a Markov random field) and an unstructured part (represented by an iid random effect).
Although the gamlss.add package provides a possibility to fit models with spatial effects based on Markov random fields, it does not support the hierarchical model specification we employed in the simulations. We also briefly comment on the performance of the DIC for model selection in Sections 5 and 6. Considerably more details on many of the simulation results including geoadditive models are provided in Sections E and F of the online supplement to this article.

Additive Models
As building blocks for additive model specifications, we consider nonlinear functions whose exact form depends on which of the count data models is used; that is, each predictor is composed of the sum of two nonlinear functions f_1 and f_2. The covariates x_1 and x_2 are obtained as iid samples from equidistant grids of step size 0.01 with x_i1 ∈ [1, 6] and x_i2 ∈ [−3, 3]. We use the sample size n = 1000 as a starting point and simulate 250 replications. On average, about 50% and 46% of zeros are observed in the generated samples for ZIP and ZINB, respectively. For both the Bayesian and the frequentist estimates, we considered cubic P-splines based on a grid of 20 equidistant knots and a second-order random walk prior/difference penalty. In the frequentist setting, the smoothing parameters were estimated using the function find.hyper with starting value 3 for all parameters and default settings for the remaining arguments. In the Bayesian approach, we considered inverse gamma priors with parameters a = b = 0.001 as the default choice. The number of iterations T for each simulation run is set to 12,000 with a burn-in phase of 2000 iterations, and every 10th iterate is stored. In total, 90 unknown parameters are included in the ZIP model, whereas 135 unknown parameters have to be estimated in case of ZINB.
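The second-order random walk prior/difference penalty used here corresponds to the penalty (prior precision) matrix K = D′D, where D is the matrix of second-order differences of adjacent B-spline coefficients. A minimal sketch (the basis dimension m = 22 is an assumption, since the exact number of coefficients depends on the knot convention for cubic splines with 20 equidistant knots):

```python
import numpy as np

def rw2_penalty(m):
    """Penalty matrix K = D'D implied by a second-order random walk prior
    on m B-spline coefficients; D is the (m - 2) x m matrix of second-order
    differences, so K has rank m - 2 and its null space contains constant
    and linear coefficient sequences, which remain unpenalized."""
    D = np.diff(np.eye(m), n=2, axis=0)
    return D.T @ D

K = rw2_penalty(22)  # assumed basis dimension for illustration
```

The rank deficiency of K is exactly what makes the propriety conditions of Section 3.3 nontrivial: the flat directions of the prior must be compensated elsewhere in the model.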
To compare the results of ML and MCMC, we consider logarithmic mean squared errors (MSEs) evaluated at the observed covariate values (Figure 1), as well as empirical coverage rates of pointwise 95% credible intervals in Figure 2. Coverage rates of 80% credible intervals have also been computed but showed a similar qualitative behavior and are, therefore, omitted. Note that BayesX also provides simultaneous credible bands (following the suggestion in Krivobokova, Kneib, and Claeskens, 2010) which we will use later in our application to assist the selection of an appropriate predictor structure. Since no frequentist counterpart is available, we did not include simultaneous confidence bands in the simulation results. Figure E1 of the supplement shows posterior mean estimates obtained with MCMC and ML point estimates averaged over all 250 replications. Estimates corresponding to the 10% and 90% quantile of the MSEs (i.e., to 10% best and 10% worst estimates) are given in Figures E2 and E3 together with 95% credible intervals.
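Both comparison criteria are straightforward to compute. A small sketch, with a toy function standing in for the estimated and true effects:

```python
import numpy as np

def log_mse(f_hat, f_true):
    """Logarithmic mean squared error of a function estimate, evaluated
    at the observed covariate values."""
    return np.log(np.mean((f_hat - f_true) ** 2))

def coverage(lower, upper, f_true):
    """Empirical coverage rate of pointwise credible intervals: fraction
    of covariate values at which the true function lies inside the
    interval."""
    return np.mean((lower <= f_true) & (f_true <= upper))

# Toy check with a band that covers the (here: known) true function everywhere:
x = np.linspace(0, 1, 100)
f = np.sin(2 * np.pi * x)
cov = coverage(f - 0.1, f + 0.1, f)
```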
In summary, we obtain the following results:
1. Bias. Averaging over all replications leads to satisfactory results for ML and MCMC, with only slightly too smooth mean estimates in extreme areas of the effects. At the boundaries of the covariate domains, MCMC tends to fit the true functions better. Figure E3 indicates potential problems with the smoothing parameter choice of ML, while the choice of hyperparameters in MCMC seems to be more appropriate.
2. MSE. Figure 1 confirms that both methods deliver similar mean results. In general, the nonlinear functions with effects on λ seem to be easier to estimate than the ones impacting the probability of the additional zeros π. This can be seen in the smaller MSEs of f^λ_1 and f^λ_2 compared to those of f^π_1 and f^π_2.
3. Coverage rates. Figure 2 provides evidence that the Bayesian approach provides valid uncertainty intervals of effects, which cannot be obtained based on the asymptotic theory of ML. Note that a corresponding warning is already given in the manual of Stasinopoulos, Rigby, and Akantziliotou (2008, p. 51). For MCMC, the 95% level of the posterior intervals is mostly maintained.
4. Computing time. On a PC with Intel(R) 3.40GHz processor, the computation time for one MCMC replication is roughly 23 sec and is fairly stable over the simulation replications. For ML, the runtime varies much more over the replications and depends mainly on the choice of smoothing parameters. To avoid warnings, we applied the find.hyper function, which can take several minutes. Given the hyperparameters, ML requires between less than 2 and more than 15 sec.
In conclusion, the bias and MSE results indicate that estimates obtained with MCMC are at least as reliable as those obtained with ML. In addition, the better coverage properties of the posterior intervals obtained with MCMC render our Bayesian approach a strong competitor to existing ML estimation.
The performance of our approach in ZIP, NB, and ZINB regression was further investigated through simulation studies documented in detail in Section E.2 of the online supplement. These additional simulations yielded the following results:
2. Sample size. The study design chosen in Section 4.1 still yields reliable estimates in the ZIP model even if the sample size is reduced to n = 250 (compare also Kaas et al., 2012). However, with two nonlinear effects in each parameter, we find that a sample size of n = 500 in the NB model and n = 750 in the ZINB model is desirable to obtain reliable results; compare the section on sample size in the supplement.
3. Separation of effects. Since the zero-inflation process and the count data process both produce observations with y_i = 0, and the expectation of the response depends on both the probability of excess zeros and the expectation of the count data process, one might be worried about the separability of the effects of these processes. In Section E.2.4 of the online supplement, we find that this separation is indeed possible based on appropriate rules for model selection discussed in detail in Section 5.
4. Prior sensitivity. To investigate how sensitive additive models for zero-inflated count data are with respect to the prior settings for the smoothing variances, we ran the additive ZIP model with different hyperparameters of the inverse gamma distribution. In addition to our standard choice a_j = b_j = 0.001, we tried a_j = b_j ∈ {0.01, 0.0001, 0.00001, 0.000001}, as well as a_j = −0.5, b_j = 0 and a_j = −1, b_j = 0, corresponding to uniform priors on the standard deviations τ_j and on the variances τ²_j, respectively. As another alternative, we considered half-Cauchy priors for the standard deviations τ_j. None of these choices caused serious problems with mixing or convergence of the Markov chains.
Moreover, posterior mean estimates of the splines on λ are hardly influenced by these choices while the smoothness of splines impacting π varies slightly (compare Section E.2.5 of the online supplement).

Geoadditive Models
In this section, the simulations for the ZIP, NB, and ZINB models have been extended by an additional spatial effect. The structured part of the spatial effect f_spat is assigned a Markov random field prior and simulated on the basis of centroids with standardized coordinates (x^c_s, y^c_s), s ∈ {1, …, S}, of the S = 327 regions in Western Germany. The unstructured part is described by a random effect ε_s ∼ N(0, 1/16) for each of the regions. In Figure E20 of the supplement, the simulated complete spatial effects for the rate λ of the count process and the probability of the additional zeros π in case of a ZIP model are visualized. A generic predictor η is extended accordingly by the spatial effect. Estimates are obtained from a two-level model in which the total spatial effect is decomposed into a structured part and an unstructured effect ε; see Section B of the online supplement and Lang et al. (2014).
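The precision (penalty) matrix implied by a Markov random field prior is determined by the neighborhood structure of the map. A minimal sketch on a tiny hypothetical map, not the map of Western Germany:

```python
import numpy as np

def mrf_precision(neighbors):
    """Precision matrix of a Markov random field prior over S regions:
    K[s, s] = number of neighbors of region s, and K[s, t] = -1 if regions
    s and t share a boundary. For a connected map, K has rank S - 1, with
    the constant vector spanning the null space (the overall level is
    unidentified and handled by centering)."""
    S = len(neighbors)
    K = np.zeros((S, S))
    for s, nbs in enumerate(neighbors):
        K[s, s] = len(nbs)
        for t in nbs:
            K[s, t] = -1.0
    return K

# Tiny hypothetical map: a chain of four regions 0-1-2-3.
K = mrf_precision([[1], [0, 2], [1, 3], [2]])
```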
Since the mixing of the Markov chains in a geoadditive model is in general less satisfactory than in additive models, the number of iterations is increased to 55,000 with a burn-in phase of 5000. We store every 50th iterate so that the final sample size of 1000 is retained. To determine a sample size for which satisfactory estimation results can be achieved, we ran the simulations for n = 1000, 2000, and 4000 observations. Note that in the following we restrict ourselves to the presentation of results for the ZIP model. Outcomes for the NB and ZINB models are summarized at the end of this section and illustrated in Sections E.3.2 and E.3.3, respectively.
As shown in the supplement of Fahrmeir and Lang (2001), the unstructured and the structured spatial effect can generally not be separated and are often estimated with bias; only the sum of both effects is estimated satisfactorily. In practice this means that only the complete spatial effect should be interpreted, and nothing (or not much) can be said about the relative importance of the two components. Exceptions are cases where one of the two effects (either the unstructured or the structured one) is estimated as practically zero and the other effect clearly dominates. Hence, we present the estimated complete spatial effect compared to the true simulated effect for two selected sample sizes, n = 1000 and 4000, in Figure 4. Besides that, the log(MSE) in Figure 3 and the kernel densities of the complete posterior mean spatial effects in Figure E21 give further information about the quality of the inference.
Important results obtained from these simulations are:
1. MSE. The spatial effect has higher MSEs compared to the nonlinear effects, but we observe that for larger sample sizes the MSEs can be reduced for all effects. Comparing Figure 1 with Figure 3, it is reassuring to note that for sample size n = 1000 an additional spatial effect does not impair the MSEs of the nonlinear effects.
2. Bias. Figure 4 shows that extreme values of the spatial effect are the most difficult to estimate, since both large negative and large positive effects tend to be underestimated. Together with Figure E21, it can be concluded that the spatial effect for π is estimated too smoothly at the boundaries in comparison with the true effect.
3. Computing time. On a PC with Intel(R) 3.40GHz processor, the computing time for one replication is around 4 min.
We also assessed prior sensitivity as in the case of the additive models presented in the previous section. Similar to additive models, nonlinear effects are quite robust even for very small values of a_j, b_j, without any convergence problems. Furthermore, under uniform priors or inverse gamma priors with a_j = b_j > 0.00001, we also did not observe serious mixing problems in the samples of the Markov random field. However, for smaller hyperparameters a_j = b_j = 0.00001, the number of iterations and the thinning have to be increased considerably, so that we do not recommend these specifications given the theoretical results on propriety discussed in Section 3.3. On average, posterior mean spatial effects on π are worse under a half-Cauchy and an IG(a_j = −1, b_j = 0) prior; compare Section E.4.
The simulations show that with a sample size of n = 4000 the estimated complete spatial effect is similar to the simulated, true one. The quality of the mean estimates of the nonlinear effects remains as in the previous section even when adding an additional spatial effect. Hence, it can be said that both nonlinear and spatial effects are well identified in the estimates, especially when taking the complexity of the models into account. Similar basic outcomes are obtained for the NB and ZINB models (documented in the supplement). For the performance of the deviance information criterion (DIC, Spiegelhalter et al. 2002) in geoadditive models for selecting relevant effects, we refer the reader to a study documented in Section F of the online supplement, where we also provide further details on how estimates suffer due to model misspecification.

[Figure 4. Geoadditive ZIP model. Estimated complete posterior mean spatial effects (y-axes) compared to true effects (x-axes). The first row corresponds to λ and the second row to π. Sample sizes n = 1000 and n = 4000 are shown in the first and second column, respectively.]

PATENT CITATIONS
In our first application, we analyze the number of citations of patents granted by the European Patent Office (EPO). An inventor who applies for a patent has to cite all related, already existing patents that the new patent is based on. The data have originally been collected to study the occurrence of objections against patents for 4866 patents; see Jerak and Wagner (2006). Details about the dataset, including summary statistics and a discussion of outlier removal, can be found in Fahrmeir et al. (2013).
A raw descriptive analysis of the response variable (number of citations, ncit) gives an average of 1.64 and a variance of 7.53. Roughly 46% of the observations are zeros; the smallest and largest observed values are 0 and 40. While these summary statistics do not take potential covariate effects into account, they already provide evidence that overdispersion and zero-inflation may be relevant for obtaining a realistic model for the conditional distribution of the number of citations.
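These descriptive indications can be quantified directly from the reported summary statistics. A quick sketch comparing the variance-to-mean ratio with 1 and the observed zero fraction with the one implied by a Poisson distribution with the same mean:

```python
import math

# Summary statistics reported in the text:
mean_ncit, var_ncit, frac_zero = 1.64, 7.53, 0.46

dispersion = var_ncit / mean_ncit    # ~4.6, far above the Poisson value of 1
poisson_zero = math.exp(-mean_ncit)  # P(Y = 0) under Poisson(1.64), ~0.19
excess = frac_zero - poisson_zero    # ~0.27 more zeros than Poisson implies
```

Both quantities point in the same direction: the data are overdispersed and contain far more zeros than a Poisson model with the same mean would predict, although a formal conclusion requires conditioning on the covariates.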
We consider the four candidates Poisson, ZIP, NB, and ZINB as possible distributions for the response. Available explanatory variables are the grant year (year), the number of designated states (ncountry), the number of claims against the patents (nclaims), and further binary and categorical covariates collected in x (compare Table G4 of the online supplement). Our model class allows the inclusion of all covariates in each distribution parameter, but model selection shows that this usually does not lead to an optimal model. To determine such models, we proceeded as follows: Given one of the four distribution candidates, we used the DIC to select appropriate predictor structures for all distribution parameters to obtain a good compromise between fit and model complexity. Starting from the full model with predictor structure η = f_1(ncountry) + f_2(year) + f_3(nclaims) + x′β for all relevant model parameters, we performed a stepwise model selection and selected the model with the lowest DIC based on the rule of thumb discussed in Section F. When comparing two models, this rule indicates a clear preference for one of the two models under consideration if the DIC difference is larger than 10 points, whereas it is indecisive otherwise. In the latter case, we assist the model selection by looking at the significance of the estimated effects, that is, we exclude a parametric effect from the optimal model if the 95% credible interval contains zero. For nonlinear effects, neither separate tests for each knot nor pointwise credible intervals are desirable approaches to assess the uncertainty of the complete estimated nonlinear effect. We, therefore, base our decision on simultaneous credible bands (Krivobokova, Kneib, and Claeskens 2010) and exclude nonlinear effects if such a 95% credible band contains zero for all observed covariate values.
Overall, this procedure aims at favoring sparse and parsimonious models since the considerable flexibility offered by GAMLSS may otherwise tempt researchers to specify overly complex models. Of course, the rule of thumb may (and should) be adapted to the specific requirements of an analysis. To study how estimates suffer when fitting misspecified models, we conducted additional simulations described in Sections E.2.4 and F.3 of the online supplement. The basic outcome of these studies is that point estimates of influential effects are rather insensitive to the inclusion of noise variables that do not have an effect on the response distribution.
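The rule of thumb for DIC comparisons used in the stepwise search can be sketched as a small helper function; the DIC values in the example are made up.

```python
def dic_preference(dic_a, dic_b, threshold=10.0):
    """Rule of thumb for comparing two models A and B: prefer the model
    with the lower DIC only if the difference exceeds `threshold`;
    otherwise the comparison is indecisive, and effect significance
    (simultaneous credible bands) assists the decision."""
    if dic_a - dic_b > threshold:
        return "B"
    if dic_b - dic_a > threshold:
        return "A"
    return "indecisive"

# Hypothetical values: a 13.5-point difference is a clear preference,
# a 5-point difference is not.
clear = dic_preference(15230.0, 15216.5)  # -> "B"
close = dic_preference(100.0, 95.0)       # -> "indecisive"
```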
In the following, optimal models obtained from our stepwise search are denoted by (M1). The predictors of ZINB M1, for example, are given by

η^μ = … + f^μ_2(year) + f^μ_3(nclaims),
η^π = β^π_0 + β^π_1 biopharm + f^π_1(ncountry) + β^π_2 (year − 1991),
η^δ = β^δ_0 + β^δ_1 patus + β^δ_2 patgsgr.

For the other distributions, the predictors for the location parameter λ or μ are identical to that of ZINB. For the parameters π and δ, the selected covariate effects in the ZIP and NB models differ compared to ZINB; see Table G5 of the online supplement for a full documentation. To obtain some evidence that our flexible model class offers an improved fit for the patent data, we also considered models in which
(M2) the predictors of λ (or μ) and π are linear versions of M1 (i.e., all spline-based estimates are replaced by linear effects) and δ is estimated as a constant not depending on covariates. These models provide a Bayesian analogue to the results obtained with the R package pscl (Zeileis, Kleiber, and Jackman, 2008) for ZIP or ZINB, as well as to a usual generalized linear model for Poisson or NB.
(M3) the predictors for the mean parameter of the count process are identical to those of M1, but π and δ are considered as constants not depending on any covariates, as proposed in Fahrmeir and Osuna Echavarría (2006).
While estimating a constant π in the ZIP model and a constant δ in the NB model did not cause any problems, a ZINB model with a flexible predictor for μ and constant π and δ was somewhat problematic in our implementation, but also in the original implementation of Fahrmeir and Osuna Echavarría (2006). Also, when estimating the model with a linear predictor for μ in pscl, we obtained exceptionally high standard errors. Consequently, we consider ZINB M3 as the model with η^μ and η^π from ZINB M1 and δ restricted to be constant.
The results of all 11 models were compared in terms of normalized (randomized) quantile residuals, a graphical device suggested by Stasinopoulos, Rigby, and Akantziliotou (2008): For an observation y_i, the residual is given by r̂_i = Φ^{−1}(u_i), where Φ^{−1} is the inverse cumulative distribution function of a standard normal distribution, u_i is a random value from the uniform distribution on the interval [F(y_i − 1 | θ̂), F(y_i | θ̂)], θ̂ comprises all estimated model parameters, and F(· | θ̂) is the cumulative distribution function obtained by plugging in these estimated parameters. If the residuals are evaluated for the true model, they follow a standard normal distribution (Dunn and Smyth 1996) and, therefore, models can be checked by quantile-quantile plots. Since the residuals are random, several randomized sets of residuals have to be studied before a decision about the adequacy of the model can be made. Figure 5 shows one realization for the flexible models Poisson M1, ZIP M1, NB M1, and ZINB M1. The residuals of the remaining, simpler models are given in Figure G38 of the supplement. Figure 5 clearly indicates a preference for the NB or ZINB model, which provide a considerably better fit for estimating the distribution of patent citations in general. Although the residuals of the Poisson model can be improved by applying the ZIP model, the sample quantiles greater than 2 are too high compared to the theoretical quantiles. Both the NB and ZINB models seem to overcome this problem. Furthermore, comparing Figures 5 and G38 indicates an improvement in both cases when allowing for nonlinear effects and estimating π and δ as functions of relevant covariates.
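The construction of the randomized quantile residuals can be sketched as follows. This illustration uses a plain Poisson distribution with a made-up parameter rather than the fitted models of this section; under the correctly specified model the residuals are approximately iid standard normal.

```python
import numpy as np
from scipy.stats import norm, poisson

def randomized_quantile_residuals(y, cdf, rng):
    """Normalized (randomized) quantile residuals for a discrete response:
    draw u_i uniformly on [F(y_i - 1), F(y_i)] and map it through the
    standard normal quantile function. Under the true model, the residuals
    are iid N(0, 1) and can be checked via quantile-quantile plots."""
    u = rng.uniform(cdf(y - 1), cdf(y))  # F(-1) = 0 handles y = 0
    return norm.ppf(u)

rng = np.random.default_rng(4)
mu = 1.64  # made-up rate; in practice mu comes from the fitted model
y = rng.poisson(mu, size=2000)
r = randomized_quantile_residuals(y, lambda k: poisson.cdf(k, mu), rng)
```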
In a second step, we applied proper scoring rules as proposed by Gneiting and Raftery (2007) to confirm the findings assessed by the residuals: Let y_1, …, y_R be data in a holdout sample and p̂_r = (p̂_r0, p̂_r1, …) the estimated probabilities of a predictive distribution with p̂_rk = P(y_r = k). A score is obtained by summing up individual score contributions, that is, S = Σ_{r=1}^R S(p̂_r, y_r). Let p_0 be the true distribution; Gneiting and Raftery (2007) then take the expected value of the score under p_0 to compare different scoring rules. A scoring rule is called proper if S(p_0, p_0) ≥ S(p̂, p_0) for any predictive distribution p̂, and it is strictly proper if equality holds if and only if p̂ = p_0. We consider three common scores: the Brier (or quadratic) score, S(p̂_r, y_r) = −Σ_k (1(y_r = k) − p̂_rk)², the logarithmic score, S(p̂_r, y_r) = log(p̂_{r,y_r}), and the spherical score, S(p̂_r, y_r) = p̂_{r,y_r} / (Σ_k p̂²_rk)^{1/2}. All these scoring rules are strictly proper, but the logarithmic score has the drawback that it only takes into account a single probability of the predictive distribution and is, therefore, susceptible to extreme observations. This property is shared with the DIC since both are based on the log-likelihood. In our application, the predictive distribution is assessed by 10-fold cross-validation. Table 1 summarizes the scores and the DIC for all models. Since the nonlinear functions f_1 to f_3 are modeled by cubic P-splines with 20 inner knots and second-order random walk prior, the ZINB M1 model contains 99 unknown parameters (compare Table G5). The effective number of parameters is estimated to be approximately 36. For all three distribution parameters, the effective MCMC sample sizes are on average greater than 410 and show only rather small autocorrelations. Similar to the simulation studies, the posterior distributions of the effects are very robust with respect to the prior choice and hyperparameter settings.
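The three score contributions can be computed directly from their definitions. A sketch using a hypothetical, truncated predictive pmf for illustration (in practice, the unbounded count support must be truncated at some large value):

```python
import numpy as np

def brier(p, y):
    """Quadratic (Brier) score contribution: -sum_k (1(y = k) - p_k)^2."""
    ind = np.zeros_like(p)
    ind[y] = 1.0
    return -np.sum((ind - p) ** 2)

def logarithmic(p, y):
    """Logarithmic score contribution: log p_y."""
    return np.log(p[y])

def spherical(p, y):
    """Spherical score contribution: p_y / sqrt(sum_k p_k^2)."""
    return p[y] / np.sqrt(np.sum(p ** 2))

# Hypothetical predictive pmf on {0, 1, 2, 3} and an observed count y = 0;
# higher values indicate better predictions for all three scores:
p = np.array([0.5, 0.3, 0.15, 0.05])
s_brier, s_log, s_sph = brier(p, 0), logarithmic(p, 0), spherical(p, 0)
```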
For all considered choices, we did not face mixing problems and obtained very similar effects. The effects of year and nclaims on μ are slightly rougher for an IG(a_j, b_j) prior with a_j = b_j = 0.01. Further evidence and detailed results are given in Section G.1 of the supplement. Figure 6 displays posterior mean estimates of the effects of ncountry, year, and nclaims on μ (first row) and of ncountry on π (second row), together with pointwise 80% and 95% credible intervals. Vertical stripes indicate the relative amount of observations relating to the different covariate values (the darker the stripes, the more data). For the variable year, we found a linear impact on π. A summary of the posterior distributions of all included linear effects can be found in Table G6 of the supplement.
The following observations and interpretations on selected effects can be made: Looking at patents with grant years later than 1985, we estimate that the newer a patent is, the less it is cited. The effect of ncountry on π is highest at around 10 designated states, while the effect of year steadily raises the probability of never being cited. For an adequate interpretation of the expectation (1 − π)μ of the response, it is important to see that an increase of the effects on μ and a decline of the function estimates on π result in a growing estimated expectation, and vice versa. For the patent data, we find that the influence of year and ncountry on the total expectation behaves similarly as when considering only the expectation μ of the count data part.

[Figure 6. Patent citations. The first row shows posterior mean estimates of ncountry, year, and nclaims on μ (centered around zero) together with 80% and 95% pointwise credible intervals. The second row shows the posterior mean estimate of ncountry on π (centered around zero) together with 80% and 95% pointwise credible intervals. Estimates belong to the best performing model ZINB M1. Vertical stripes indicate the relative amount of data.]
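The interplay between the two predictors and the expectation (1 − π)μ can be illustrated numerically, assuming the usual log link for μ and logit link for π; the predictor values below are made up.

```python
import math

def zip_expectation(eta_mu, eta_pi):
    """Expectation (1 - pi) * mu of a zero-inflated count response,
    assuming a log link for mu and a logit link for pi."""
    mu = math.exp(eta_mu)
    pi = 1.0 / (1.0 + math.exp(-eta_pi))
    return (1.0 - pi) * mu

# Raising eta_mu and lowering eta_pi both increase the expectation:
e0 = zip_expectation(0.5, -1.0)
e1 = zip_expectation(0.8, -1.0)   # larger effect on mu
e2 = zip_expectation(0.5, -1.5)   # smaller effect on pi
```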

MODELING CLAIM FREQUENCIES
We also apply the developed methods to a dataset with n = 162,548 policyholders of a Belgian car insurance from the year 1997. The insurance premium in car insurance is based on detailed statistical analyses of the policyholder's risk structure. One important step is to model the loss frequency, which usually depends on characteristics of the policyholder and the vehicle. Typical covariates are the age of the policyholder (ageph), the age of the car (agec), the engine power (power), and the previous claim experience. In Belgium, the claim experience is measured by a 23-level bonus-malus score (bm); the higher the score, the worse the claim history of the policyholder. Furthermore, geographical information is provided on which of the 589 districts (distr) in Belgium the policyholder's car is registered.
The dataset has already been analyzed by Denuit and Lang (2004), who applied geoadditive Poisson models. A detailed analysis based on both count data regression for claim frequencies and zero-adjusted models (as introduced in Heller, Stasinopoulos, and Rigby 2006) for claim sizes in the framework of GAMLSS is provided in Klein et al. (2014). Here we build upon these more detailed treatments to illustrate the application of zero-inflated and overdispersed models for claim frequencies. We, therefore, consider the predictor η = f_1(ageph) + f_2(agec) + f_spat(distr) + x′β for the mean parameter in the count process, that is, λ in case of ZIP and μ in case of NB or ZINB. The spatial effect has been modeled by a Markov random field, and the term x′β contains additional linear effects of dummy variables that will not be discussed here. Model comparisons are provided in Section H.1 of the online supplement. In summary, there is no clear evidence in favor of ZIP or NB, and both models seem to provide a more reasonable fit compared to the Poisson model. In the following, we present results for the ZIP model to illustrate the interpretation of estimated effects. The selected model for π has the predictor η^π = f^π_1(ageph) + f^π_2(agec) + f^π_spat(distr) + (x^π)′β^π. Based on our selection rule discussed in the previous section, the spatial effect could also be excluded from the predictor, since the DIC is indecisive and the simultaneous credible intervals fully cover the zero line. However, the estimated spatial effect displays an interesting cluster structure with districts of consistently low effect in the north around Antwerp. Replacing the smooth spatial effect with a dummy for this low-risk region also yielded a significant effect, so we decided to include the spatial effect for illustration purposes.
The computing time was less than 450 min on a PC with Intel(R) 3.40 GHz processor and a total number of 55,000 MCMC iterations. In Figure 7, the estimated nonlinear effects on π are plotted together with 80% and 95% pointwise credible intervals. Again, vertical stripes indicate the relative amount of data at the corresponding covariate values. Figure 8 depicts the estimated spatial effects on λ and π. We also reestimated the models with different hyperparameters for the inverse gamma priors, including flat priors for the variances or standard deviations, as well as half-Cauchy priors for the standard deviations. Compared to the simulation studies in hierarchical geoadditive models (where an additional iid random effect was simulated; such an effect was neither significant nor selected by the DIC in this study), the mixing of the spatial effect on λ was less problematic, even for hyperparameters of τ²_j as small as a_j = b_j = 0.000001. Nonlinear and linear effects, as well as the spatial effect on λ, were estimated similarly in all cases; compare Section H.2 for further illustrations. Some smaller differences occurred for the spatial effect on the probability of excess zeros, but these are likely caused by the small values that are estimated for this effect anyway.
The spatial effect in Figure 8 clearly indicates a large number of expected claims in urban areas like Brussels, Antwerp, or Liège. For π, the monotonically increasing effect of agec can be seen as an indication of an excess of zero claims for older cars. The estimated effects for λ (shown in Figure H43 of the online supplement) are generally close to those found in Denuit and Lang (2004).

SUMMARY AND CONCLUSIONS
In this article, we developed numerically efficient Bayesian zero-inflated and overdispersed count data regression with semiparametric predictors as special cases of GAMLSS, relying on iteratively weighted least squares proposals. A particular focus has been laid on the ZIP, NB, and ZINB distributions as standard choices for applied work. Our framework goes far beyond the model flexibility in the R package gamlss (Stasinopoulos and Rigby 2007) since our predictors may include complex, hierarchical spatial effects and can in general cope with hierarchical data situations as described in Lang et al. (2014). Moreover, simulation studies revealed that the Bayesian approach yields reliable credible intervals in situations where asymptotic likelihood theory fails, while at the same time giving point estimates of at least similar quality. For model choice, we considered quantile residuals as a means to evaluate the general potential of a given model to fit the data. The deviance information criterion takes the complexity of an estimated model into account and can, therefore, be seen as a valuable tool both for comparing response distributions and predictor specifications. Proper scoring rules evaluated on hold-out samples allow us to assess the predictive ability of estimated models. Nevertheless, model choice and variable selection remain relatively tedious, in particular due to the multiple predictors involved. In the future it would, therefore, be desirable to develop automatic model choice and variable selection strategies in the spirit of Belitz and Lang (2008) in a frequentist setting or Scheipl, Fahrmeir, and Kneib (2012) in a Bayesian approach via spike-and-slab priors.
The Bayesian formulation of GAMLSS also provides the possibility to include modified or extended prior structures without major changes to the basic algorithm. For example, truncated normal priors may be considered to further improve numerical efficiency, or Dirichlet process mixture priors could be included to facilitate nonnormal random effects distributions. It will also be of interest to extend the Bayesian treatment of GAMLSS to further classes of discrete and continuous distributions, or even combinations of both. A first attempt in the latter direction has been made in Klein et al. (2014) in the context of zero-adjusted models as introduced in a frequentist setting by Heller, Stasinopoulos, and Rigby (2006).

SUPPLEMENTARY MATERIALS
The online supplement for this article includes further details on the backfitting algorithm for finding the starting values of the MCMC algorithm and additional information about the multilevel framework; the explicit derivation of all working weights for the proposal densities and proofs for their positivity; the proof of Theorem 1; additional simulation studies dealing with additive and geoadditive NB and ZINB models; the impact of small sample sizes; identifiability and separability of effects; prior sensitivity; model choice based on the DIC; and additional results for the applications on patent citations and claim frequencies in car insurances. [Received June 2013. Revised March 2014.]