Semiparametric zero-inflated Bernoulli regression with applications

When the observed proportion of zeros in a data set consisting of binary outcome data is larger than expected under a regular logistic regression model, it is frequently suggested to use a zero-inflated Bernoulli (ZIB) regression model. A spline-based ZIB regression model is proposed to describe the potentially nonlinear effect of a continuous covariate. A spline is used to approximate the unknown smooth function. Under the smoothness condition, the spline estimator of the unknown smooth function is uniformly consistent, and the regression parameter estimators are asymptotically normally distributed. We propose an easily implemented and consistent estimation method for the variances of the regression parameter estimators. Extensive simulations are conducted to investigate the finite-sample performance of the proposed method. A real-life data set is used to illustrate the practical use of the proposed methodology. The real-life data analysis indicates that the prediction performance of the proposed semiparametric ZIB regression model is better compared to the parametric ZIB regression model.


Introduction
Logistic regression has been extensively used to study the relationship between a binary outcome variable and a set of covariates in many disciplines, including, e.g. biomedical studies, epidemiological studies, and social studies. See, e.g. [9,10] for a detailed survey on logistic regression. However, it is often seen that when analyzing a data set consisting of binary outcome data, the observed proportion of zeros is larger than expected under regular logistic regression. It is then often suggested to use a zero-inflated (ZI) Bernoulli distribution as an alternative to analyzing this type of binary outcome data. A ZIB distribution is a mixture of two Bernoulli distributions. The ZIB distribution can be thought of as a population that is composed of two latent groups: the susceptible group consisting of individuals who are at risk of an event of interest, and may have the event, and the non-susceptible (cured) group consisting of individuals who are not at risk of the event of interest.
Diop et al. [5] proposed a parametric ZIB regression model for this type of binary outcome data with linear predictors via logit links and applied it to analyze an example of data from a study of dengue fever to investigate the effects of covariates on dengue infection. Diop et al. [4] established the asymptotic properties of the maximum likelihood (ML) estimators of the parameters of a parametric ZIB regression model. In addition, for example, Hall [7] added the flexibility of a fixed-effects ZI Poisson (ZIP) model ( [13]) and a fixed-effects ZI binomial model by incorporating random effects to accommodate the within-subject correlation and between-subject heterogeneity typical of repeated measures data, and applied these mixed-effects models to analyze an example from horticulture where data with excess zeros were collected in a repeated measures designed experiment. Also, see [16] for random effect models for repeated measures of ZI count data. Young et al. [20] utilized a ZIP regression model ( [13]) and a zero-inflated negative binomial (ZINB) regression model to determine the predicted distributions of added and deleted housing units in a block, which are used to characterize the undercoverage and overcoverage, respectively. The used data were from the 2010 address canvassing operation. All the aforementioned authors assumed that the effects of covariates in a ZIB, ZIP, ZI binomial or ZINB regression model are usually modeled via a linear predictor function. However, sometimes it may not be completely appropriate to assume that a continuous covariate effect is linear or has some other parametric form. One can then use nonparametric regression approaches to relax the restrictive parametric assumption by using an unspecified smooth function to describe the possibly nonlinear effect of the continuous covariate. Thus, motivated to address this issue, we propose a semiparametric ZIB regression model, which is described in Section 2.1 and can enhance fitting flexibility, to fit ZI binary outcome data. As an application, the proposed semiparametric ZIB regression model can be applied to assess the adequacy of a postulated parametric ZIP regression model by comparing the prediction performance of these two models in terms of prediction error, normal quantile-quantile (Q-Q) plot of Pearson residuals, and likelihood ratio (LR) test, which are illustrated in Section 5. In this work, for the proposed semiparametric ZIB regression model, the spline-based sieve M-estimation method (e.g. [14,15]) is used to estimate the unspecified smooth function for the effect of the continuous covariate. The spline coefficients for the chosen spline basis functions can completely describe the approximated unknown smooth function. The spline likelihood function can then be maximized to estimate simultaneously the regression parameters and the spline coefficients. The asymptotic variances of the estimators of the regression parameters can be estimated directly by using the spline estimation method. Additionally, for instance, Lam et al. [12] considered a semiparametric ZIP regression model for a possibly nonlinear relationship between the natural logarithm of the mean of the count variable and a particular continuous covariate when modeling ZI count data. They proposed a sieve ML estimation method for fitting the proposed semiparametric ZIP model and investigated asymptotic properties of the sieve ML estimators. He et al. [8] fit a doubly semiparametric ZIP model to ZI count data in which two partially linear link functions were assumed for both the mean of the Poisson component and the probability of zero. They used a sieve ML method to estimate both the regression parameters and the nonparametric functions, and studied asymptotic properties of the sieve ML estimators. To analyze longitudinal ZI count data, Feng and Zhu [6] considered a semiparametric ZIP mixed model that postulates a possible nonlinear relationship between the natural logarithm of the mean of the count variable and a particular continuous covariate. They proposed a penalized log-likelihood function, and used the Monte Carlo expectationmaximization algorithm to estimate the parameters of the proposed semiparametric ZIP mixed model and established asymptotic properties of the resulting estimators.
The estimation method we use for the nonparametric part of the semiparametric ZIB regression model is different from those for the nonparametric part(s) of the aforementioned semiparametric ZIP regression models. We are the first to establish the large-sample properties of the estimators of the parameters of this proposed model based on the splinebased sieve M-estimation method. It can be shown that in our estimation method, the spline estimator of the unknown smooth function achieves the optimal convergence rate by taking into account the number of knots to increase at an appropriate rate. The splinebased sieve semiparametric ZIB regression model can be shown to achieve the asymptotic efficiency of the estimators of the regression parameters.
In Section 2, we introduce the semiparametric ZIB regression model and the estimation method for the parameters of this regression model. Section 3 describes the asymptotic results of the spline estimators and likelihood ratio inference, and proposes an easily implemented approach to consistently estimate the variances of the estimators of the regression parameters. In Section 4, extensive Monte Carlo simulation studies are conducted to investigate the finite-sample performance of the proposed methodology. Section 5 illustrates the practical use of the proposed methodology with a real-life data example. Some concluding remarks are given in Section 6.

Semiparametric ZIB regression model
Let Y be a binary response variable that indicates occurrence of the event of interest or not, e.g. whether or not catching one or more fish in a lake from a survey (i.e. Y = 1 if catching one or more fish, and Y = 0 otherwise). Assume U is a latent binary variable indicating an individual's risk state: U = 1 if the individual is at risk of the event, e.g. susceptible to catching one or more fish (in other words, going fishing); U = 0 if the individual is not at risk of the event. Note that if Y = 0, then U is unknown (unobserved), and if Y = 1, then U = 1. Let x = (x 1 , . . . , x p ) be a p × 1 covariate vector, where x can be discrete, continuous, or a combination of both, and z a continuous covariate. Denote w = (x , z) .
In a parametric ZIB regression model, both the susceptible probability η(w) ≡ Pr(U = 1; w) and the probability of occurrence of the event for a susceptible individual ζ(w) ≡ Pr(Y = 1; w | U = 1) are linked to w via logit-linear predictors. The parametric ZIB regression model can then be written as follows: Here, for a given w, [1 − η(w)] ∈ [0, 1] is a mixing weight for the accommodation of extra zeros, and ζ(w) is the mean of the Bernoulli distribution. I {·} is the indicator function. Note that the ZIB distribution is reduced to a regular Bernoulli distribution when η(w) = 1.
Although the use of a linear predictor may be completely appropriate in some applications, it may not be valid for other data situations. More specifically, when it is suspected that the effect of the continuous covariate z on the logarithm of the odds of the event for susceptible individuals is nonlinear, smoothing techniques can be used to approximate an unspecified smooth function that depicts the effect of z. Motivated by this, we relax the restrictive parametric assumption in this parametric ZIB regression model by assuming that the functional form for the effect of z on the logarithm of the odds of the event for susceptible individuals is smooth but unknown, and the effects of x remain linear. Let ψ be an unknown smooth function that is used to describe the effect of z on the logarithm of the odds of the event for susceptible individuals. We can then propose the semiparametric ZIB regression model that consists of the following logistic regression model for the susceptible probability: , and β = (β 1 , β 2 ) for β 1 = (β 10 , β 11 , . . . , β 1p ) , and the following semiparametric logistic regression model for the probability of occurrence of the event for a susceptible individual: where α = (α 1 , . . . , α p ) . To the best of our knowledge, the semiparametric ZIB regression model has not been proposed for modeling ZI binary outcome data. In this proposed model, a spline function is used to approximate the unknown smooth function that is used to describe the potentially nonlinear effect of z on the logarithm of the odds of the event of interest for a susceptible individual. With some algebra, one can express the probability of occurrence of the event of interest as follows: π(w; α, β, ψ) = Pr(Y = 1; α, β, ψ, w) = ζ(w; α, ψ)η(w; β).
Assuming that {(y i , w i ) : i = 1, . . . , n} is a data set, we can express the log-likelihood function for the semiparametric ZIB regression model as follows: Define S n (T n , l) as the class of splines of order l ≥ 1 with the sequence of knots T n . The smooth function ψ is approximated by a spline function ψ n ∈ S n (T n , l) that can be expressed as a linear combination of the B-spline basis functions as follows: see [18] for more details. Let γ = (γ 1 , . . . , γ q n ) . We use ψ n (z) to replace ψ(z) in the log-likelihood function (4) to obtain the following spline log-likelihood function for τ = (α , β , γ ) : where π(w; τ ) = ζ(w; α, γ )η(w; β) for ζ(w; α, γ ) that can be expressed as follows: The spline ML estimate τ = ( α , β , γ ) of τ can be obtained by using the optim function with the method of L-BFGS-B ( [3]) in R to solve the system of score equations ∂ n (τ )/∂τ = 0 because this L-BFGS-B algorithm can handle box (or bounds) constraints, i.e. each variable can be given a lower and/or upper bound. The obtained spline estimate of ψ(z) is then denoted by ψ n (z) = q n j=1 γ j B j (z).

Identifiability
To be able to estimate the parameters of the spline ZIB regression model, we need to verify the identifiability of this model. Let W be the design space.

Selection of knots
The Schwarz information criterion (SIC) ( [19]) is employed to determine the optimal number of interior knots, m * n , which minimizes the following SIC value: Here, m n + l is the number of B-spline basis functions, and m n + l + 2p + 2 is the number of free parameters. The SIC rather than the Akaike information criterion (AIC) ( [1]) is used because if n ≥ 8, then the SIC will tend to favor models of lower dimension than those chosen by the AIC ( [11]). After m * n is determined, the locations of the m * n knots are chosen at equally spaced quantiles of the continuous covariate z.

Asymptotic results
. The largesample properties of spline estimators are established with the following L 2 -norm: where · and · 2 denote the Euclidean norm and usual L 2 -norm, respectively. The log-likelihood function of τ for a single observation v = (y, w ) is given by Let˙ π (τ ; v) = ∂ (τ ; v)/∂π(w; τ ) denote the first derivative of (τ ; v) with respect to π(w; τ ). The score functions for θ are then as follows: Differentiating (τ ; v) along with the parametric submodel ψ t = ψ + th at t = 0 for h with bounded variation yields the following score operator for ψ: . It can be shown that there exists a unique h * that minimizes the distance 2 2 for h ∈ H 2p+2 ; see [2]. The efficient score functions for θ at τ = τ 0 are then as follows: * Specifically, in view of the property of h * , one can then express h * as Here, μ 0 =˙ π (τ 0 ; v)ξ(w; τ 0 ) and ν 0 =˙ π (τ 0 ; v)λ(w; τ 0 ). Thus, we can have the following efficient score functions for θ at τ = τ 0 : * Here, {E[ * θ (τ 0 ; v)]} ⊗2 is the efficient information I(θ 0 ), where a ⊗2 = aa for any vector a. To derive the large-sample properties of the spline estimator vector τ = ( θ , ψ) , we require the following regularity conditions that are similar to those in [14]: (C1) The true regression parameter vector θ 0 is an interior point of a compact subset of R 2p+2 . The rth derivative of true function ψ 0 , for r ≥ 3, satisfies the Lipschitz condition on a compact set . (C2) The support of z is an interval within , and the second moment of z is finite.

Remark 3.1:
Condition C1 is the standard assumption in semiparametric estimation. To calculate entropy in the proofs of the following theorems, conditions C2 and C3 are required. To establish the identifiability of the model, condition C4 is needed.

Remark 3.2:
In Theorem 3.1, ψ is shown to be uniformly convergent. The overall convergence rate of τ is n r/(1+2r) , but as shown in Theorem 3.2, the convergence rate of θ is still n 1/2 . θ is shown to achieve the information bound and, hence, be efficient in semiparametric regression; see [2].
Assume that ψ θ is the spline estimator of ψ for any θ in the neighborhood of θ. Let be the profile log-likelihood for θ and lrt(θ 0 ) = 2pl( θ ) − 2pl(θ 0 ) the likelihood ratio test statistic for testing θ = θ 0 . The asymptotic result of lrt(θ 0 ) is stated in the following theorem.

Theorem 3.3 (Likelihood Ratio Inference):
The proofs of Theorems 3.1-3.3 are sketched in the Web Appendix by using the proof skills of [14].

Estimation of variances
Motivated by Zhang et al. [21], to estimate I(θ 0 ), we propose to use the empirical version of I(θ 0 ), expressed as follows: where P n is the empirical measure. Again we apply the B-spline basis functions to appro- By the definition of h * , one can estimate the B-spline coefficient vector γ s = (γ 1,s , . . . , γ q n ,s ) by minimizing . Applying standard least-squares calculation, one can express the empirical version of I(θ 0 ) in (12) as follows: . It can then be shown that in Theorem 3.4 I(θ 0 ) is estimated consistently by the conditional expected information More specifically, with the notations defined in Section 2, the consistent estimator of I(θ 0 ) is expressed as follows: Here, ξ , λ, and V represent ξ , λ, and V evaluated at τ = τ , respectively. ξ = diag

Theorem 3.4 (Estimation of Variances):
Under conditions given in Theorem 3.2, the conditional expected information E n is asymptotically consistent to I(θ 0 ).
The approach can be employed to do statistical inference for the regression parameters in real applications. The proof of Theorem 3.4 is sketched in the Web Appendix by using the proof skills of [14].

Simulation study
Monte Carlo experiments were conducted to study the finite-sample performance of the proposed method. Two thousand replications were conducted for each experimental configuration. All the numerical experiments were performed with R-2.15.3 on a Windows 7 (64-bits) computer with 2.5 GHz Intel(R) i5-2520M CPU.
The covariate vector is assumed to take the form . The data of binary outcome variable Y were generated from the ZIB regression model that consists of the following two sub-regression models: the logistic regression model for the susceptible probability: where w = ( x , z) for x = (1, x ) and β = (β 1 , β 2 ) = (β 10 , β 11 , β 12 , β 13 , β 2 ) , and the logistic regression model for the probability of occurrence of the event for susceptible individuals: Here, α = (α 1 , α 2 , α 3 ) . ψ(z) is an unknown smooth function. Two functional forms for ψ(z) were considered. The x i1 s and x i2 s, i = 1, . . . , n = 500, 1000, or 1500, are values of design variables for the covariate with three nominal levels, where level 3 is used as a referent, generated from the multinomial distribution with probabilities 0.6, 0.2, and 0.2. The x i3 s were generated from the normal distribution with mean 1 and standard deviation 1. The z i s were generated from the uniform [−1, 1] distribution. Five scenarios were considered. We in the first and second scenarios, and α = (2.5, −2, −2.8) and β = (−2, 1, 0.5, 1.3, −1.5) in the third and fourth scenarios. ψ(z) = 2 + 3 sin(3z) and ψ(z) = 2.5 + 3z 2 − log(z/2 + 1) + 2 cos(2π z) were set in the first and third scenarios and the second and fourth scenarios, respectively. Under the parameter settings in the first and second scenarios, the average susceptible proportion was approximately 75% (i.e. the average non-susceptible proportion ≈ 25%), and approximately 25% and 37% of the susceptible individuals had the event of interest, respectively. Under the parameter settings in the third and fourth scenarios, the average susceptible proportion was approximately 50% (i.e. the average non-susceptible proportion ≈ 50%), and approximately 38% and 58% of the susceptible individuals had the event of interest, respectively. Under the parameter settings in the first four scenarios, we estimated the susceptible probabilities for an individual with w * were ζ * * 1 = ζ(w * * 1 ; α, ψ) ≈ 0.6471, ζ * * 2 = ζ(w * * 2 ; α, ψ) ≈ 0.7960, ζ * * 3 = ζ(w * * 3 ; α, ψ) ≈ 0.7417, and ζ * * 4 = ζ(w * * 4 ; α, ψ) ≈ 0.4953, respectively, which were estimated. In the fifth scenario, another simulation study was conducted under the setting similar to that of the real example of fishing data given in Section 5. To this end, the data of x and z were generated from the empirical distribution of the data of number of persons and that of the data of length of stay (LOS), respectively. It is noted that the data of LOS were skewed to the left, 83.6% (209/250) of the data were less than 10 h, and 248 data points were between 0.004 and 35.592 h, but two data points were outliers and 65.34 and 71.036 h. The data of Y were then generated from the ZIB regression model in (14) and (15). Here, w = (x, z) , w = (1, x, z) and β = (β 1 , β 2 ) = (β 10 , β 11 , β 2 ) . ψ(z) is an unknown smooth function, wherez isz  (1.2πz). Under the parameter settings, the average susceptible proportion was approximately 69% (i.e. the average non-susceptible proportion 31%), and approximately 58% of the susceptible subjects had the event of interest. Estimated were the susceptible probability for a subject with w * 5 = (x * , z * ) = (2, 4.1) , which was η(w * 5 ; β) ≈ 0.7310 and the event probability for a susceptible subject with w * * 5 = (x * * , z * * ) = (2, 42.62) , which was ζ(w * * 5 ; α, ψ) ≈ 0.6175, where z * * = 42.62 corresponds toz * * = 0.2.
The smooth function ψ(·) was approximated by a cubic spline. The number of interior knots was determined based on the SIC value (11). After the number of interior knots was determined, the quantile method was used to choose the knot locations.
Tables 1-5 present the summary statistics for the estimates of parameters given in scenarios 1-5, respectively, from the ZIB regression model by a cubic spline, including sample mean (mean), bias, standard deviation (SD), and mean squared error (MSE). The simulation results indicated that the biases of spline estimates were very small except estimation of ζ(w * * 5 ; α, ψ) ≈ 0.6175 in scenario 5 because z * * = 42.62 (i.e.z * * = 0.2) in w * * 5 = (x * * , z * * ) = (2, 42.62) is between 35.592 and 65.34, but there are no data of LOS between 35.592 and 65.34 in the original data set of LOS. The standard deviations of the estimates were decreased at a rate of n −1/2 as n was increased.
We also evaluated the performance of the proposed standard error method for the spline estimates of (α , β ) or (α, β ) based on the conditional expected information (13). Tables 1-5 show the standard errors of the spline estimates of (α , β ) or (α, β ) given in scenarios 1-5, respectively. It was found for scenarios 1-4 that overall the proposed standard error estimation method worked reasonably well. The averages of estimated standard errors were all close to the corresponding Monte Carlo standard deviations of the estimators. The empirical coverage probabilities were not statistically significantly different from the nominal coverage probability 95% except for α 1 and β 13 (n = 500 in scenario 1), α 1 and α 3 (n = 500 in scenario 2), β 12 (n = 1500 in scenario 2), α 2 , α 3 , and β 13 (n = 500 in scenario 3), α 1 , α 2 , β 10 , β 11 , β 13 , and β 2 (n = 1000 in scenario 3), α 3 , β 12 , and β 13 (n = 1500 in scenario 3), α 3 and β 13 (n = 500 in scenario 4), α 3 (n = 1000 in scenario 4), and β 10 and β 11 (n = 1500 in scenario 4). These empirical coverage probabilities tended to be closer to this nominal coverage probability when the sample size was increased. The performances for scenario 5 were overall little worse compared to scenarios 1-4 because the data of z were generated from the empirical distribution of the data of LOS in which the data of LOS were skewed to the left, 83.6% (209/250) of the data are less than 10 h, and 248 data points were between 0.004 and 35.592 h, but two data points were outliers and 65.34 and 71.036 h. Figures 1-5 depict the pointwise mean estimates of ψ and the corresponding 2.5th and 97.5th percentiles obtained from 2000 Monte Carlo samples for scenarios 1-5, respectively. The fitted curves followed closely ψ, which indicates that there was little bias. The lower and upper limits of confidence intervals were also reasonably close to ψ; this reveals that the variation in the estimates was small. The variability was decreased when the sample size was increased. In summary, the Monte Carlo study shows that the proposed semiparametric ZIB regression model is a practical model for analyzing ZI binary outcome data when the continuous covariate has a possibly nonlinear effect on the logarithm of the odds of the event for susceptible individuals.       Sample mean (Mean), bias, standard deviation (SD), and mean squared error (MSE) of the semiparametric zero-inflated Bernoulli regression model by a cubic spline. Results of variance study. Sample mean of estimated standard errors (mean.se), standard deviation of estimated standard errors (SD.se), and coverage probability of 95% confidence interval (CP) of (α , β ) obtained from 2000 Monte Carlo samples with sample size 500, 1000 or 1500. Table 4. [Scenario 4: the average non-susceptible proportion ≈ 50% when β = (β 1 , β 2 ) = (β 10 , β 11 , β 12 , β 13 , β 2 ) = (−2, 1, 0.5, 1.3, −1.5) ; ≈ 58% of the susceptible individuals having the event when α = (α 1 , α 2 , α 3 ) = (2.5, −2, −2.8) and ψ(z) = 2.5 + 3z 2 − log(z/2 + 1) + 2 cos(2πz)] Summary of parameter estimation for simulation study. Sample mean (Mean), bias, standard deviation (SD), and mean squared error (MSE) of the semiparametric zero-inflated Bernoulli regression model by a cubic spline. Results of variance study. Sample mean of estimated standard errors (mean.se), standard deviation of estimated standard errors (SD.se), and coverage probability of 95% confidence interval (CP) of (α , β ) obtained from 2000 Monte Carlo samples with sample size 500, 1000 or 1500.

Real example: fishing data
We illustrate the practical use of the proposed methodology with the fishing data set available at https://github.com/rmcelreath/rethinking/blob/master/data/Fish.csv. This data set consists of 250 groups that went to a state park. Each group was asked how many fish they caught, whether or not they used live bait, how many children were in the group, how many people were in the group (persons), how long they stayed (LOS), and whether or not they brought a camper to the park. To illustrate the practicality of the proposed methodology, we are mainly interested in whether or not they caught one or more fish, denoted by Y, which is the binary outcome variable, and investigating the effects of persons, denoted by x, and LOS, denoted by z. In this study, some groups did not even fish, so these groups had systematically zero fish. Among the 250 groups, 142 groups (56.8%) did not catch fish. To describe the potentially nonlinear effect of LOS (z) on the logarithm of the odds of catching one or more fish for a group going fishing, we fit the proposed semiparametric ZIB regression model to the fishing data, which consists of the logistic regression model for the probability that a group went fishing logit[η(w; β)] = w β, where w = (x, z) , w = (1, x, z) and β = (β 10 , β 11 , β 2 ) , and the logistic regression model for the probability of catching one or more fish for the group going fishing logit[ζ(w; α, ψ)] = αx + ψ(z). The unknown smooth function ψ(z) was approximated by a cubic spline with 1 interior knot that was selected based on the SIC value (11). The average of the estimated probabilities, η(w i ; β)'s, of going finishing was 0.722. The average of the estimated probabilities, ζ(w i ; α, ψ i )'s, of catching one or more fish for groups going fishing was 0.601, where ψ i = ψ(z i ). Table 6  gives the results of fishing data analysis by fitting the semiparametric ZIB regression model and a parametric ZIB regression model. To compare the prediction performance of these two models in terms of prediction error, we used the cross-validation technique to estimate the prediction errors of these two models. The 25-fold cross-validation estimates of prediction errors of the semiparametric ZIB regression model and the parametric ZIB regression model were 0.151 and 0.154, respectively. This indicates that the prediction performance of the proposed semiparametric ZIB regression model used to fit the fishing data is slightly better than that of the parametric ZIB regression model. For model diagnostics and goodness-of-fit, Figure 6 provides normal Q-Q plots of Pearson residuals for semiparametric and parametric ZIB regression models. It is seen from the two plots that semiparametric ZIB regression model gives a slightly better fit. To statistically and quantitatively assess which one of the parametric and semiparametric ZIB regression models provides a better fit to the fishing data, i.e. test the null hypothesis of the parametric ZIB regression model versus the alternative hypothesis of the semiparametric ZIB regression model, we compared their log-likelihoods by using the LR test. The log-likelihoods of the parametric and semiparametric ZIB regression models were −103.4975 and −99.8085, respectively, and the observed value of the LR test statistic was 7.378 with the corresponding p-value of 0.061 based on the asymptotic chi-square distribution with 3 degrees of freedom. The p-value indicates that the null hypothesis of the parametric ZIB regression model for this data set is not rejected at the significance level of 0.05, but is rejected at the significance level of 0.1. We also used a parametric bootstrap approach to approximate the distribution of the LR test statistic to test the null hypothesis of the parametric ZIB regression model for the fishing data set, denoted by {(y i , w i ), i = 1, . . . , 250}. The steps of the parametric bootstrap approach are described as follows: Step 1. Generate a bootstrap sample y * i from the fitted parametric ZIB regression model.    at the significance level of 0.1. Thus, based on the two p-values, 0.061 and 0.067, the semiparametric ZIB regression model is marginally better than the parametric ZIB regression model for fitting the fishing data at the significance level of 0.05.
It can be seen from Table 6 that the probability that a group went fishing was increased when the number of people in the group and the length of the group stay was increased. The probability of catching one or more fish for a group going fishing was increased when the number of people in the group was increased. Figure 7 depicts the estimate of the unknown smooth function ψ of LOS with 95% pointwise confidence intervals obtained from 2000 bootstrap samples when fitting the semiparametric ZIB regression model with the cubic spline of 1 interior knot. It can be seen from Figure 7 that the estimate of ψ indicates that the probability of catching one or more fish for a group going fishing was not necessarily increased when the length of the group stay was increased. A plausible explanation is that the group stayed longer in the park does not necessarily imply that the group spent more time fishing if the group went fishing. Some groups that had a long stay in the park only spent a short amount of time fishing if they went fishing. Additionally, based on the analyses of the prediction error, normal Q-Q plot, and LR test, the proposed semiparametric ZIB regression is (slightly) better than the parametric model, but the nonlinear relationship with LOS is obvious. A plausible explanation of this phenomenon is again that the data of LOS were skewed to the left, 83.6% (209/250) of the data were less than 10 h, and 248 data points were between 0.004 and 35.592 h, but two data points were outliers and 65.34 and 71.036 h.

Conclusion
A semiparametric ZIB regression model has been proposed to fit binary outcome data with excess zeros and to depict the potentially nonlinear effect of a continuous covariate on the logarithm of the odds of an event of interest for a susceptible individual. An unknown smooth function, which is used to describe this potentially nonlinear effect, is approximated by a spline function that can be expressed as a linear combination of B-spline basis functions. The SIC ( [19]) has been used to determine the optimal number of interior knots, and the locations of the knots have been chosen by the quantile method. The asymptotic properties of the spline estimators have been established. A direct and consistent variance estimation method based on the least-squares method has also been proposed. The proposed methodology can be easily extended to describe the effects of several continuous covariates of interest. The functional form for each covariate effect is assumed to be smooth but unknown. We can incorporate the effects of these covariates in an additive fashion, and splines are used to model each covariate effect.
An extensive simulation study has been conducted to investigate the finite-sample performance of the proposed methodology. The standard deviations of the estimates are decreased at a rate of n −1/2 . For all sample sizes considered in the simulation study, the empirical coverage probabilities tend to be closer to the nominal probability 95%, and the behaviors of the empirical coverage probabilities are similar. Overall, the proposed standard error method for the spline estimates of the regression parameters works reasonably well. The averages of estimated standard errors are all close to the corresponding Monte Carlo standard deviations of the estimators. The fitted curves follow closely the true curve, and the lower and upper limits of confidence intervals are also close to the true function. The simulation study shows that the proposed semiparametric ZIB regression model is a practical model for analyzing ZI binary outcome data when a continuous covariate has a possibly nonlinear effect on the logarithm of the odds of an event of interest for susceptible individuals. Finally, the fishing data set has been used to illustrate the practicality of the proposed methodology. The fishing data analysis indicates that the prediction performance of the proposed semiparametric ZIB regression model is better than that of the parametric ZIB regression model.