Bayesian negative binomial regression model with unobserved covariates for predicting the frequency of North Atlantic tropical storms

Predicting the annual frequency of tropical storms is of interest because it provides basic information for improved preparation against these storms. Sea surface temperatures (SSTs) averaged over the hurricane season are good predictors of annual tropical cyclone activity. But predictions need to be made before the hurricane season, when these predictors are not yet observed. Several climate models issue forecasts of the SSTs, which can be used as surrogates for the true SSTs. We develop a Bayesian negative binomial regression model that distinguishes between the true SSTs and their forecasts, both of which are included in the model. For prediction, the true SSTs may be regarded as unobserved predictors and sampled from their posterior predictive distribution. A small fraction of the SST forecasts from the climate models is also missing. Thus, we propose a model that can simultaneously handle missing predictors and variable selection uncertainty. If the main goal is prediction, an interesting question is whether we should include predictors in the model that are missing at the time of prediction. We attempt to answer this question and demonstrate that our model can provide gains in prediction.


Introduction
Prediction of tropical cyclone (TC) activity for the North Atlantic region started in the early 1980s [5,6], while the first attempt at predicting TC activity anywhere in the world was made by Neville Nicholls in the late 1970s [13]. Since then, prediction of North Atlantic tropical storms has received increasing attention, and previous studies have built forecast systems that give retrospective forecasts for the hurricane season, which peaks during August to October [2,6,24]. Although it is difficult to give accurate forecasts of TC activity 9-10 months before a particular season [18], considerable progress has been made for shorter lead times [2,17,19,22]. To accommodate missing data, we propose a two-level Bayesian regression model. The top level models the frequency of tropical storms as the response variable, with the forecast and observed SSTs as covariates. The second level models the covariates via a sequence of regression models. Of the two levels, the top level, which relates the response variable to the covariates, is of more interest because our focus is on prediction of the frequency of tropical storms. Previous studies (e.g. Villarini et al. [21]) have shown that not all climate models have equally good predictive power. This motivates us to use a variable selection prior in the top level of the model, which can automatically discriminate among covariates and retain those that are more strongly supported by the data. Mitra and Dunson [11] developed such a model for linear and binary regression. In this work, we extend their approach to a negative binomial regression model for count data. It is possible to use variable selection priors in both the top and lower level regression models, but based on the results in Mitra and Dunson [11], the gain from such an approach appears small compared to the increased computational burden. Thus, we specify variable selection priors only for the top level.
Based on simulations and the North Atlantic TC data, we illustrate that this model can provide improved predictive performance compared to some existing models in the literature.
The organization of this paper is as follows. In Section 2, we provide a brief review of variable selection in a Bayesian framework. In Section 3, we provide a detailed description of the model development, prior specification, and posterior computation. In Section 4, we conduct two simulation studies. The simulation studies examine the predictive performance of our Bayesian model in comparison to a hierarchical model [21] under different scenarios. Through these simulation studies, we also try to answer the following important question: if a predictor is unavailable/missing for prediction, should we use a fully Bayesian procedure that includes that predictor or should we use a reduced Bayesian model that discards this predictor? In Section 5, we present results from applying the different methods to data from North Atlantic tropical storms. In Section 6, we discuss some directions for future work.

Review of Bayesian approach to variable selection
In this section, we provide a brief introduction to Bayesian model selection and Bayesian model averaging. For a more detailed review, see Clyde and George [1]. In a Bayesian framework, models are treated as additional unknown parameters and they are assigned a prior probability distribution.
In the context of variable selection in regression models, different models represent different subsets of variables (covariates). Suppose there are p variables. Let γ = (γ_1, γ_2, . . . , γ_p) represent a particular subset of these p variables, where γ_j = 1 if the jth variable is included in the model and 0 otherwise. For example, γ = (1, 0, . . . , 0) would represent a model with the first covariate only. Each model is encoded by γ, and the probability distribution for the observables W under model γ is p(W | κ_γ, γ), where κ_γ is a model-specific parameter.
In the Bayesian approach, we assign a prior distribution p(κ_γ | γ) to the parameters of each model and a prior probability p(γ) to each model. This prior formulation leads to the following joint distribution:

p(W, κ_γ, γ) = p(W | κ_γ, γ) p(κ_γ | γ) p(γ).    (1)

Suppose our goal is to predict a new observation W_f, which is assumed to be generated from the same process that generated the observed data W. Then the posterior predictive distribution of W_f under model γ is obtained as

p(W_f | W, γ) = ∫ p(W_f | W, κ_γ, γ) p(κ_γ | W, γ) dκ_γ = ∫ p(W_f | κ_γ, γ) p(κ_γ | W, γ) dκ_γ.    (2)

The last equality in (2) holds because of the assumed independence between W and W_f, given the model and the model-specific parameters. After marginalizing out the parameters κ_γ, conditional on the observed data W, the posterior probability of model γ is

p(γ | W) = p(W | γ) p(γ) / Σ_{γ'} p(W | γ') p(γ'),    (3)

where

p(W | γ) = ∫ p(W | κ_γ, γ) p(κ_γ | γ) dκ_γ    (4)

is called the marginal likelihood of model γ. To measure the importance of a covariate across all models, the posterior marginal inclusion probability for the jth covariate is defined as

p(γ_j = 1 | W) = Σ_{γ: γ_j = 1} p(γ | W).    (5)

Under Bayesian model averaging, the posterior predictive distribution of W_f is given by

p(W_f | W) = Σ_γ p(W_f | W, γ) p(γ | W),    (6)

which is a mixture of the conditional predictive distributions under each model, with mixture weights given by the posterior model probabilities. The marginal likelihood in (4) is typically not available in closed form, except for a small class of models with certain prior structures, such as linear regression models with conjugate priors. When it is available in closed form, posterior computation is greatly simplified. In most cases, including our negative binomial regression model, the posterior distribution is not available in closed form and these quantities cannot be computed analytically. However, a Markov chain Monte Carlo (MCMC) algorithm can be used to sample approximately from the posterior predictive distribution and to estimate quantities of interest such as the posterior inclusion probability of a covariate.
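As a concrete (hypothetical) illustration of model averaging with a closed-form marginal likelihood, the following Python sketch enumerates all 2^p models for a toy Gaussian linear regression with known error variance and a Gaussian prior on the coefficients, where the marginal likelihood in (4) is available analytically. The sample size, prior variance tau, and true coefficients are illustrative assumptions, not quantities from the paper.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, p, tau, sigma2 = 60, 3, 1.0, 1.0
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -0.8]) + rng.standard_normal(n)

# marginal likelihood of each model: y | gamma ~ N(0, sigma2*I + tau*X_g X_g^T)
log_ml, models = [], list(itertools.product([0, 1], repeat=p))
for gamma in models:
    idx = [j for j in range(p) if gamma[j]]
    cov = sigma2 * np.eye(n)
    if idx:
        Xg = X[:, idx]
        cov = cov + tau * Xg @ Xg.T
    log_ml.append(multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y))

# uniform prior over models => posterior weights proportional to marginal likelihoods
log_post = np.array(log_ml) - max(log_ml)
post = np.exp(log_post) / np.exp(log_post).sum()

# posterior marginal inclusion probability of each covariate
incl = [sum(w for g, w in zip(models, post) if g[j]) for j in range(p)]
print(np.round(incl, 3))
```

With the strong true coefficients on covariates 1 and 3, their inclusion probabilities are driven close to 1, while the null covariate receives much less support.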

Bayesian negative binomial regression model with missing covariates
We first introduce some notation. Let Y denote the n-dimensional vector containing the counts of tropical storms from the National Oceanic and Atmospheric Administration's (NOAA) National Hurricane Center best-track database (HURDAT2) [8]. Let X = (x_0, x_1, . . . , x_p) denote the n × (p + 1) design matrix, whose first column is a column of ones, corresponding to the intercept term. The remaining columns correspond to SST_Atl and SST_Trop from the climate models, as well as the true SST_Atl and SST_Trop from the Met Office Hadley Centre [15].
Let M = (m_1, . . . , m_p) be the n × p matrix of missingness indicators for the p covariates [9]. Here m_ij = 1 denotes that x_ij is observed and m_ij = 0 denotes that x_ij is missing. Let the observed predictor values be denoted by X_obs = {x_ij : m_ij = 1} and the missing predictor values by X_mis = {x_ij : m_ij = 0}, for i = 1, . . . , n and j = 1, . . . , p. We assume that the covariates have been standardized so that the observed part of each covariate has mean 0 and standard deviation 1. Such standardization puts the covariates on the same scale and eases prior specification for the regression coefficients.
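A minimal sketch of this observed-part standardization, using NaN to mark the entries with m_ij = 0 (the toy matrix is invented for illustration):

```python
import numpy as np

# hypothetical covariate matrix with NaN marking missing entries (m_ij = 0)
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 14.0],
              [4.0, 12.0]])

mean = np.nanmean(X, axis=0)          # mean of the observed part of each column
sd = np.nanstd(X, axis=0, ddof=1)     # sample SD of the observed part
Z = (X - mean) / sd                   # missing entries remain NaN

print(np.round(np.nanmean(Z, axis=0), 10))  # observed parts now have mean ~0
```

The missing entries stay missing; they are later sampled within the MCMC rather than filled in ahead of time.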

Negative binomial regression model
The top-level regression model relates the frequency of tropical storms to SSTs as predictors. Commonly used distributions for modeling count data are the Poisson and negative binomial distributions, both of which have support on the non-negative integers {0, 1, 2, . . . }. While a Poisson model could be a reasonable approach for our data, we choose the negative binomial distribution because Bayesian posterior computation is much more amenable for a negative binomial regression model with variable selection priors, using the Pólya-Gamma data augmentation approach [14]. Zhou et al. [25] have developed algorithms for overdispersed count data based on Pólya-Gamma data augmentation. Neelon [12] has extended the model of Pillow and Scott [14] to the zero-inflated case for spatial and time series data. However, instead of overdispersion, our focus in this work is mainly on Bayesian model averaging via variable selection priors in the presence of missing covariates. Note that even though we use a negative binomial model for our data, we put a prior on the dispersion parameter, so if the data do not exhibit overdispersion, as in our case, the results are very similar to those under a Poisson regression model. As suggested by one reviewer, we have included results for an approximate Poisson regression model for the tropical storm data set in the Supplemental Materials.
Let X_i denote the (p + 1) × 1 vector of predictors for the ith observation, and let the corresponding response variable be Y_i. Let β be a (p + 1)-dimensional vector of regression coefficients. Then the top-level negative binomial regression model is given as follows:

Y_i | η, β ~ NB(η, π_i),  where π_i = e^{μ_i} / (1 + e^{μ_i}),  μ_i = X_i^T β,

and η > 0 is the dispersion parameter. We assume a Gaussian prior for the intercept β_0 and a spike and slab prior [3] for all other regression coefficients:

β_0 ~ N(0, λ_0),  β_j ~ ρ N(0, λ_j) + (1 − ρ) δ_0,  j = 1, . . . , p,

where δ_0 is a degenerate distribution at zero. This implies that the prior for β_j is a mixed distribution: with probability ρ, β_j comes from a Gaussian distribution, and with probability (1 − ρ), β_j is exactly 0. Setting β_j = 0 allows the corresponding covariate x_j to be dropped from the model and leads to variable selection. If we let γ_j be an indicator variable for including the jth predictor x_j, such that γ_j = 1 if x_j is included in the model and 0 otherwise, then {γ_j = 0} ≡ {β_j = 0}. We set λ_0 = 100 to have a reasonably diffuse prior for the intercept term. We standardize the covariates and set λ_j = 1, as large values of λ_j in a variable selection prior can favor the null model, without any covariates. We choose ρ = 0.5, which leads to a discrete uniform prior on the space of models γ, giving each model a prior probability of 1/2^p. The dispersion parameter η controls the deviation of the negative binomial distribution from the Poisson distribution, with smaller values of η leading to larger variance than the Poisson. If the mean remains fixed and η → ∞, the negative binomial distribution converges to the Poisson distribution. We assume that η has a uniform prior on the interval (0, 1000).
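A generative sketch of this top level: draw coefficients from a spike-and-slab prior, form the linear predictor, and simulate negative binomial counts. The sample size, η, and seed are illustrative choices; note that NumPy parameterizes NB(n, p) by the 'success' probability, so we pass 1 − π_i to obtain mean η π_i/(1 − π_i).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, eta = 200, 4, 0.5, 50.0

# spike-and-slab draw: with prob rho, beta_j ~ N(0, 1); otherwise exactly 0
gamma = rng.random(p) < rho
beta = np.where(gamma, rng.standard_normal(p), 0.0)
beta0 = 1.5

X = rng.standard_normal((n, p))
mu = beta0 + X @ beta
pi = np.exp(mu) / (1.0 + np.exp(mu))

# Y_i ~ NB(eta, pi_i) with mean eta * pi_i / (1 - pi_i)
Y = rng.negative_binomial(eta, 1.0 - pi)
print(Y[:5], gamma)
```

Coefficients with γ_j = 0 are exactly zero, so the corresponding covariates drop out of μ_i, which is what drives variable selection in the posterior.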

Sequence of linear regression models for covariates
Since the covariates may not be fully observed, we specify a joint distribution for the predictors in the second level of the model. We adopt the method developed by Mitra and Dunson [11], where a joint distribution is specified through a sequence of univariate models. Specifically, we assume, for i = 1, 2, . . . , n,

x_i1 ~ N(θ_10, 1/ψ_1),
x_ij | x_i1, . . . , x_i,j−1 ~ N(θ_j0 + Σ_{k=1}^{j−1} θ_jk x_ik, 1/ψ_j),  j = 2, . . . , p.
We put conjugate prior distributions on all intercepts and regression coefficients, i.e.

θ_jk ~ N(0, λ_jk),  j = 1, . . . , p,  k = 0, 1, . . . , j − 1.

We set λ_j0 = 100 for the intercept terms and λ_jk = 1 for the other regression coefficients, so that they have the same prior variance on the standardized scale. For the residual precision parameters we specify a gamma prior,

ψ_j ~ Gamma(c, d),  j = 1, . . . , p,

where we set the shape and rate parameters as c = 1 and d = 1/5, respectively. This choice of hyperparameters gives reasonably diffuse priors when the covariates are standardized.
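The sequential factorization above implies a joint Gaussian distribution for the covariates, which can be sampled one column at a time. A small sketch with p = 3 (the intercepts θ_j0, coefficients θ_jk, and precisions ψ_j below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3

# hypothetical second-level parameters
theta0 = [0.0, 0.5, -0.3]                 # intercepts theta_j0
theta = {2: [0.8], 3: [0.4, -0.6]}        # theta[j][k-1] multiplies x_k in the model for x_j
psi = [1.0, 2.0, 4.0]                     # residual precisions psi_j

X = np.empty((n, p))
X[:, 0] = rng.normal(theta0[0], psi[0] ** -0.5, n)      # x_1 ~ N(theta_10, 1/psi_1)
for j in range(1, p):                                    # x_j | x_1, ..., x_{j-1}
    mean = theta0[j] + X[:, :j] @ np.array(theta[j + 1])
    X[:, j] = rng.normal(mean, psi[j] ** -0.5)

print(np.round(np.cov(X.T), 2))   # the implied joint covariance of (x_1, x_2, x_3)
```

In the MCMC, the same conditionals are used in the other direction: a missing x_ij is sampled given the observed entries, the current parameters, and the top-level likelihood.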

Posterior computation
For the aforementioned model and prior specification, the posterior distribution is not available in closed form. In such cases, a natural option is MCMC sampling, where a Markov chain is constructed so that its stationary distribution is the posterior distribution. Gibbs sampling, when possible, makes computation relatively straightforward, as no tuning is needed. However, Gibbs sampling is not immediately applicable here, because the full conditional distributions are not available in closed form for a negative binomial regression model.
The Pólya Gamma data augmentation approach of Pillow and Scott [14] greatly simplifies posterior computation for the negative binomial model with a Gaussian prior on the regression coefficients. We extend their approach to spike and slab priors for regression coefficients, where spike refers to the degenerate distribution at 0, and slab refers to a normal distribution. We give a brief description of this data augmentation method to show why it simplifies posterior computation.
Following Pillow and Scott [14], the conditional posterior distribution of β can be obtained by writing each negative binomial likelihood term via the integral identity

(e^{μ_i})^{Y_i} / (1 + e^{μ_i})^{η + Y_i} = 2^{−(η + Y_i)} e^{κ_i μ_i} ∫_0^∞ e^{−w_i μ_i^2 / 2} p(w_i) dw_i,  with κ_i = (Y_i − η)/2,

where p(w_i) is the density of a Pólya-Gamma random variable w_i ~ PG(η + Y_i, 0). Thus, conditional on w_i, the contribution from the likelihood terms resembles a likelihood for linear regression with normal errors. With some more algebra, Pillow and Scott [14] have shown that, under a normal prior, the conditional posterior distribution of β simplifies to a normal distribution. Exploiting this data augmentation framework, we derive closed-form full conditional distributions under spike and slab priors. For our two-level Bayesian model, the posterior distribution of interest is p(η, w, β, X_mis, θ, ψ | Y, X_obs). Full conditionals are available for all components except η. Thus, samples can be drawn approximately from the posterior distribution using a Gibbs sampler with a Metropolis-Hastings (MH) step for drawing η [16]. For the MH step we use a normal proposal for η, centered at the current value, with support restricted to (0, 1000). We outline the full conditionals for the remaining components in Appendix 1.
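To make the augmentation concrete, the sketch below implements one Gibbs update of β under a plain Gaussian prior (the simpler case before the spike-and-slab extension). It uses an approximate Pólya-Gamma draw via the truncated infinite-sum representation of PG(b, c), not the exact sampler a production implementation would use; the function names, truncation level K, and data are all our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def rpg_approx(b, c, K=200):
    """Approximate PG(b, c) draw via a truncated sum of Gamma(b, 1) variables."""
    k = np.arange(1, K + 1)
    g = rng.gamma(b, 1.0, size=K)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

def update_beta(X, Y, eta, mu, B):
    """One Gibbs step for beta ~ N(0, B) given current eta and linear predictor mu."""
    # full conditional of the auxiliary variables: w_i ~ PG(eta + Y_i, mu_i)
    w = np.array([rpg_approx(eta + y, m) for y, m in zip(Y, mu)])
    kappa = (Y - eta) / 2.0                       # kappa_i from the PG identity
    V = np.linalg.inv(X.T @ (w[:, None] * X) + np.linalg.inv(B))
    m = V @ (X.T @ kappa)                         # Gaussian full conditional N(m, V)
    return rng.multivariate_normal(m, V)

n, p = 40, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Y = rng.poisson(8.0, n).astype(float)
beta = update_beta(X, Y, eta=50.0, mu=X @ np.zeros(p), B=np.eye(p) * 10)
print(beta.shape)
```

Conditional on w, the update for β is an ordinary Bayesian linear regression draw, which is exactly why the augmentation makes closed-form full conditionals possible.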

Simulation study
We perform a simulation study to investigate the performance of the Bayesian model in comparison to the existing hierarchical model of Villarini et al. [21]. We first review the different methods that were considered in that paper.

Review of Villarini et al. [21]
Villarini et al. [21] noted that not all climate models (GCMs) have similar predictive power. They considered various weighting schemes to combine the forecasts of SSTs from different climate models. They first used a Poisson regression to model the count of tropical storms based on the true/observed Atlantic sea surface temperatures (SST_Atl) and the tropical mean sea surface temperatures (SST_Trop). The parameters were estimated by maximum likelihood; let the MLE of the 3 × 1 vector β be denoted by β_O, where the subscript O indicates that this regression coefficient was estimated from the 'observed' SSTs.
For forecasting, the true SSTs during the upcoming tropical storm season are not available, because they lie in the future. But forecasts of these two covariates are available from six climate models. The simplest method is to average the forecasts from the six climate models and plug the averaged predictors into a Poisson regression model with regression parameter β_O. A weighted-average version of this method weights the forecasts from the different climate models according to how well they predict the real SSTs.
The final model considered by Villarini et al. [21] is a mixture of six Poisson regression models, with mixture components corresponding to the six climate models. The weight for each climate model is taken as 1/RMSE, where RMSE is the root mean squared error in predicting the response variable for that climate model; the weights are normalized over the six climate models to sum to 1. For prediction, the Poisson mean parameter of each climate model is taken as exp(X_GCM^T β_O), where X_GCM is the GCM-specific forecast of SSTs. Among the different weighting schemes no method was always the best, but this mixture model, called the hierarchical model, seemed to have the best performance overall.
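A minimal sketch of sampling from this mixture predictive distribution: normalize the 1/RMSE weights, pick a component, then draw from that component's Poisson. The RMSE values, β_O, and forecasts below are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical per-GCM RMSEs and normalized 1/RMSE mixture weights
rmse = np.array([2.1, 1.8, 2.5, 1.9, 2.2, 3.0])
w = (1.0 / rmse) / (1.0 / rmse).sum()

beta_O = np.array([1.8, 0.4, -0.2])              # stand-in MLE from the observed-SST fit
X_gcm = np.column_stack([np.ones(6), rng.standard_normal((6, 2))])
lam = np.exp(X_gcm @ beta_O)                     # Poisson mean for each GCM's forecast

# independent Monte Carlo draws from the mixture predictive distribution
comp = rng.choice(6, size=100_000, p=w)
draws = rng.poisson(lam[comp])
print(draws.mean(), np.percentile(draws, [5, 95]))
```

Because the weights and component means are treated as known constants, plain Monte Carlo (rather than MCMC) suffices here, which is why this model is so fast to evaluate.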
The hierarchical model has a similar flavor to Bayesian model averaging, since predictions are made using a mixture of models and the model probabilities are related to predictive accuracy. However, it has two main drawbacks. First, it uses the observed SSTs for estimating β_O, but uses the forecasts of these SSTs for prediction. The Poisson mean parameter exp(X_GCM^T β_O) used for each mixture component can therefore lead to a mismatch and affect predictions if the forecasts of SSTs are not good predictors of the real SSTs. Second, the predictive distribution of the hierarchical model assumes that the true value of β is known and equal to the MLE β_O, which can underestimate uncertainty. Because the hierarchical model ignores the uncertainty in parameter estimation, its predictive distribution reduces to a mixture of Poisson distributions with completely known mixture weights and known component means. As a result, independent Monte Carlo sampling can be used to draw samples from the predictive distribution, and prediction sets can be formed from those samples.

Data generation
For the North Atlantic tropical storms, we have data on the frequency of storms (response variable) and the observed SSTs for 1958-2018 from the HURDAT2 database [8]. However, the forecasts of SSTs from six GCMs in the North American Multi-Model Ensemble (NMME) [7] are only available for 1982-2018. This means the hierarchical model, which uses the observed SSTs, can use roughly twice the number of observations compared to the Bayesian models developed in this work. Thus, we design a simulation study that maintains a similar structure, to enable a fair comparison of the different procedures.
This structure mimics the real data to a certain extent, but in a somewhat simplified setting. The real data in Section 5 also have p = 12 covariates, comprising the SST forecasts from 5 GCMs and the true SSTs. Since one of the six GCMs is known to have consistently poor predictive performance [21], we focus on the 5 remaining GCMs in this work. Here the first two covariates play the role of the real/observed SSTs (SST_Atl and SST_Trop), and the next two covariates resemble the forecasts of the same quantities from the strongest climate model in the real data.
In our application there is no overdispersion, so a Poisson regression model seems most reasonable for generating the counts in the simulation study. We developed a negative binomial regression model mainly for computational convenience, but its added flexibility to account for overdispersion could be useful in other applications.
We generate an outcome Y_i using a Poisson regression model with mean exp(X_i^T β), and consider the following two scenarios: (1) Scenario 1: β = (1.8, 0.4, −0.2, 0.35, −0.25, 0, . . . , 0); (2) Scenario 2: β = (1.8, 0.5, −0.2, 0, . . . , 0). Scenario 1 assumes that the response variable is generated using the pair of true SSTs and a pair of strong SST forecasts. Scenario 2 assumes that the response variable is generated solely by the true SSTs. While Scenario 2 seems more plausible, Scenario 1 is closer to what we find in the real data. One possible explanation is that while a difference in SSTs between the tropical Atlantic and the global tropics is known to be a driving force for the formation of storms, there are many other factors not accounted for by the model in Scenario 2. The GCM forecasts themselves come from climate models that try to capture the physical processes affecting SSTs, so it is possible that the forecasts systematically capture other factors that are also related to storm formation.
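A sketch of the Scenario 1 data generation. The correlation of 0.9 between each 'forecast' covariate and its 'true SST' counterpart is our illustrative assumption (the paper does not state the simulated correlation structure); the coefficient vector is Scenario 1's.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 150, 12

# covariates: x3, x4 act as noisy 'forecasts' of the true SSTs x1, x2
# (assumed correlation 0.9 -- an illustrative choice)
X = rng.standard_normal((n, p))
X[:, 2] = 0.9 * X[:, 0] + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)
X[:, 3] = 0.9 * X[:, 1] + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)

# Scenario 1: response depends on the true SSTs and the strongest forecasts
beta = np.array([1.8, 0.4, -0.2, 0.35, -0.25] + [0.0] * 8)
Y = rng.poisson(np.exp(np.column_stack([np.ones(n), X]) @ beta))
print(Y[:10])
```

The remaining eight covariates are pure noise in this scenario, so a well-calibrated variable selection prior should assign them low inclusion probabilities.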
Each simulated data set has 150 observations. The first two predictors, x_1 and x_2, which represent the real SSTs, are accessible only to the hierarchical model for the first 50 observations. The other predictors, x_3 to x_12, are not accessible to any method for these first 50 observations, reflecting the fact that the climate models did not issue forecasts for the initial period of our real data set. We split the next 100 observations into two equal halves: the middle 50 observations are accessible to all methods (Bayesian and hierarchical) and are used for estimation, while the final 50 observations are used for prediction to compare methods. For prediction, we treat x_1 and x_2 as missing, to represent the fact that the true SSTs are not available when making forecasts. We use the same prior hyperparameters described in the preceding section. We conduct the Bayesian analysis using the package R2jags in R.

Missing covariates for prediction
If our main goal is prediction and some of the covariates are missing/unobserved at the time of prediction, one can envision using two approaches for dealing with the missing covariates in a Bayesian framework: (1) Method 1: Retain all covariates in the model; sample the unobserved covariates from their predictive distribution; and marginalize over them using Monte Carlo integration. (2) Method 2: Drop all covariates that are unavailable for prediction and consider posterior computation with a reduced model.
We generate 100 data sets under each of Scenarios 1 and 2. Each data set is analyzed using (i) the hierarchical model, (ii) a Bayesian model without missing covariates (best-case scenario), (iii) Method 1, and (iv) Method 2. For the Bayesian methods, we run the MCMC algorithm for five million iterations and discard the first 20,000 samples as burn-in. Samples are drawn from the hierarchical model using independent Monte Carlo sampling, with the same sample size of five million.

Results and analysis
The results under Scenarios 1 and 2, averaged over 100 data sets, are presented in Tables 2 and 3. All results in these tables relate to the predictive distribution of the response variable. Because the posterior distribution can be skewed, the median of the predictive distribution is used as a robust point estimate. Its accuracy in predicting the response variable is evaluated using the correlation coefficient (between the true response variable and its point estimate), RMSE, and mean absolute error (MAE). For assessing uncertainty, we construct two kinds of prediction sets. Prediction sets with approximately 90% posterior probability are constructed with endpoints at the 5th and 95th percentiles of the predictive distribution; these are referred to as equal-tailed sets. We also construct prediction sets using HPD (highest posterior density) regions, which can give smaller sets. Since smaller sets with good coverage are desirable, we report the cardinality of the 90% prediction sets for the different methods, denoted as size in Tables 2 and 3. For this particular application, frequentist coverage is of interest, so we assess that as well.
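For a discrete predictive distribution, the two kinds of 90% sets can be built directly from posterior samples. A sketch (using a Poisson(12) sample as a stand-in for the posterior predictive draws): the equal-tailed set takes the integers between the 5th and 95th percentiles, while the HPD set greedily collects the highest-probability counts until the mass reaches 0.9.

```python
import numpy as np

rng = np.random.default_rng(6)
draws = rng.poisson(12, size=100_000)     # stand-in posterior predictive samples

# equal-tailed 90% set: integers between the 5th and 95th percentiles
lo, hi = np.percentile(draws, [5, 95]).astype(int)
equal_tailed = set(range(lo, hi + 1))

# 90% HPD set: add the highest-probability counts until mass >= 0.9
vals, counts = np.unique(draws, return_counts=True)
order = np.argsort(counts)[::-1]
mass, hpd = 0.0, set()
for i in order:
    hpd.add(int(vals[i]))
    mass += counts[i] / draws.size
    if mass >= 0.9:
        break

print(len(equal_tailed), len(hpd))
```

By construction, the greedy HPD set has the smallest cardinality among sets with at least 90% estimated mass, so its size never exceeds that of the equal-tailed set.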
Based on the results reported in Tables 2 and 3, the RMSE and MAE of the Bayesian methods tend to be lower than those of the hierarchical method. The Bayesian methods also have reasonably good frequentist coverage, with smaller set sizes than the hierarchical method. The difference between Methods 1 and 2 is negligible in terms of point estimates, with very similar RMSE and MAE. Method 1, which retains all variables, tends to produce larger sets than Method 2, which discards predictors with missing values during prediction. All methods have frequentist coverage close to 90%.
To get an idea of the variability of the results across data sets, notched box plots [10] are shown in Figures 1 and 2. These are similar to traditional box plots, with additional information provided by the notches around the median of each box. The notches provide an informal test of whether the true (population) medians are equal: if the notches of two box plots do not overlap, the true medians are likely different. The notches for the hierarchical model do not overlap with those of any of the Bayesian methods, suggesting that the Bayesian methods have substantially improved RMSE and MAE under both scenarios. As expected, the Bayesian method without missing data has the smallest RMSE and MAE.
To better understand Methods 1 and 2, we examine their estimates of the posterior inclusion probabilities and regression coefficients under Scenarios 1 and 2, reported in Tables 4-5 and Tables 6-7, respectively. The posterior inclusion probabilities are the same under Method 1 and the case with no missing observations, because they have identical posteriors. In Scenario 1, covariates 1-4 are included in the true model. Method 2 discards covariates 1-2, and covariates 3-4, especially covariate 3, seem to absorb their effect: the regression coefficient for covariate 3 appears to compensate for the effects of both covariates 1 and 3. A similar phenomenon can be seen in Tables 6-7 under Scenario 2, where covariates 1-2 are included in the true model and covariate 3 (strongly positively correlated with covariate 1) seems to compensate for the missing covariate 1. This partly explains why discarding covariates with missing values at prediction time tends to do as well as retaining them. A natural question is how sensitive the results are to the choice of prior hyperparameters. Thus, in addition to the priors specified in Sections 3.1 and 3.2, we consider λ_j = 100, λ_jk = 100, and a gamma prior with shape parameter 2 and rate parameter 1/10 on ψ_j. The results are given in Appendix 2. As summarized in Tables A1 and A2, there is a negligible difference in the point estimates; however, the prediction sets tend to be somewhat larger, because these priors are more diffuse. Overall, the posterior does not seem very sensitive to these choices.

Illustration of the methods with the North Atlantic tropical storms data set
We have data on the frequency of tropical storms and on the tropical Atlantic and tropical mean SSTs for the period 1958-2018. TC activity occurs during August to October, and SSTs are averaged over this period to serve as predictors in our model. We consider forecasts of SSTs for 1982-2018 from five climate prediction systems that are part of the NMME. The Bayesian models use data from 1982-2018 for all variables, as the climate model forecasts do not exist prior to 1982. The non-Bayesian hierarchical model is set up so that it uses data from 1958-2018 for estimating regression coefficients. Data from 1982-2010 were used as a testbed when the climate prediction systems were developed; thus, we focus on the later period 2011-2018 for prediction. Forecasts of SSTs are issued every month, from 9 to 2 months before the hurricane season. Here we focus on June, July, and August as initialization months, as the forecasts of SSTs before June can be rather inaccurate. Given that the data set is not large, we use leave-one-out predictions for each of the years 2011-2018. During 2011-2018, there are two years with missing forecasts for two (NASA and CMC2) of the five climate models used in the analyses. We choose the ordering of the covariates in our second-level sequence of regression models so that the normality and independence assumptions on the errors are reasonable. The names of the response variable and covariates used in the analyses are provided in Appendix 3.
For the Bayesian methods, the MCMC sampling algorithm is run for five million iterations, with the first 20,000 samples discarded as burn-in. The hierarchical model is run for five million iterations. The same prior hyperparameters used in the simulation studies are applied to this data set. The analysis is done for each month separately. Note that the covariates (GCM SSTs) change across months, because forecasts of the true SSTs are issued every month by the climate prediction systems. However, the true SSTs and the count of tropical storms during a hurricane season, for a given year, do not change across months. In other words, in each monthly analysis the same response variable (annual count of tropical storms during a hurricane season) is being predicted, but a new set of covariates (GCM SSTs) is used for each month. Thus, if a covariate is deemed important in a given month, say July, it is retained in the Bayesian model for the next month, August, because important covariates from earlier months may improve predictions. Table 8 shows the marginal posterior inclusion probability of each covariate, by month, which can be used to determine whether a covariate is important. Under a discrete uniform prior on the model space, the prior inclusion probability of each covariate is 0.5. For testing γ_j = 1 (jth covariate included) versus γ_j = 0 (jth covariate excluded), the Bayes factor can be used; a Bayes factor greater than 3 provides positive evidence against γ_j = 0. The Bayes factor exceeds 3 exactly when the corresponding marginal posterior inclusion probability exceeds 0.75, so we use 0.75 as the threshold for retaining a covariate in the model for the next month.
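The correspondence between the 0.75 threshold and a Bayes factor of 3 is just the posterior-to-prior odds ratio. A minimal sketch (the function name is ours):

```python
def inclusion_bf(post_prob, prior_prob=0.5):
    """Bayes factor for gamma_j = 1 vs gamma_j = 0 from inclusion probabilities."""
    post_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return post_odds / prior_odds

print(inclusion_bf(0.75))   # = 3.0, the threshold used in the text
```

With a prior inclusion probability of 0.5 the prior odds are 1, so the Bayes factor equals the posterior odds, and a posterior inclusion probability of 0.75 corresponds exactly to a Bayes factor of 3.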
It is well known that under high collinearity among covariates, the marginal inclusion probabilities of correlated covariates can sometimes be misleadingly low, even though the covariates are strongly associated with the response variable [4]. There is thus a danger of concluding that the covariates are not needed when in fact they are all associated with the response variable. Ghosh and Ghattas [4] noted that such an erroneous conclusion can usually be avoided by also considering the Bayes factor BF(H_A : H_0), where H_0 corresponds to the model containing only the intercept and H_A is the complement of H_0. All covariates are correlated in our data, and the marginal posterior inclusion probabilities for most of the variables are lower than 0.75, so we also calculate BF(H_A : H_0) for each month. The Bayes factors are 30.403 for June, 60.806 for July, and 40.537 for August. Since the Bayes factor for July is substantially larger than for June and August, it seems reasonable not to carry variables from June to July, but to carry important variables from July to August. We add GFDLA06 Atl from July to August, as it has a marginal posterior inclusion probability of 0.85. With this added variable, the Bayes factor BF(H_A : H_0) for August increases to 60.798. The results with this additional covariate for August are denoted by '(added)' in Table 11.
For comparison, both approaches for handling missing covariates are presented: Method 1 retains all covariates, and Method 2 discards covariates with missing/unobserved values in the year of prediction. The results summarized in Tables 9-11 show that, overall, the Bayesian methods (1 and 2) give improved predictive performance compared to the hierarchical model [21]. Adding the GFDLA06 Atl covariate from July to the model for August ('(added)' in Table 11) leads to some improvement in RMSE and MAE. The (average) size of the Bayesian prediction sets tends to be larger than for the hierarchical method.
A visual representation of these results is provided in Figure 3, which shows that all methods have their largest prediction errors in 2012 and 2018. In these two years, the point estimates for the Bayesian methods with the 'added' variable improve from June to July and retain the improvement in August. The hierarchical model tends to underestimate the uncertainty compared to the Bayesian methods; for example, in 2012 the upper limit of its 90% prediction set is below the true count of tropical storms in June and only coincides with the true count in July and August.
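The 90% prediction sets discussed above are obtained from the posterior predictive distribution of the count response. As a generic illustration (with hypothetical posterior draws, not the fitted model's output), a negative binomial posterior predictive set can be formed by simulating a new count for each posterior draw of the regression coefficients and the dispersion parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for a one-covariate NB regression:
beta_draws = rng.normal([2.0, 0.5], 0.1, size=(4000, 2))  # intercept, slope
r_draws = rng.gamma(20.0, 0.5, size=4000)                 # NB size parameter
x_new = 1.2                                               # covariate (e.g. an SST)

mu = np.exp(beta_draws[:, 0] + beta_draws[:, 1] * x_new)  # NB mean per draw
p = r_draws / (r_draws + mu)                              # NB success probability
y_rep = rng.negative_binomial(r_draws, p)                 # posterior predictive draws

# Central 90% posterior predictive set for the count:
lower, upper = np.percentile(y_rep, [5, 95])
print(int(lower), int(upper))
```

The point prediction can then be taken as, say, the posterior predictive median, and the interval endpoints give the 90% prediction set whose average size is compared across methods in Tables 9-11.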

Discussion and future work
In this paper, we have proposed a Bayesian negative binomial regression model that can incorporate missing covariates and variable selection uncertainty. The model was developed primarily as a fully Bayesian alternative to the hierarchical model of Villarini et al. [21]. Based on simulations and the North Atlantic tropical storm data set, we have shown that the Bayesian model can lead to better predictive performance.
We also attempted to answer an interesting question: should predictors that are missing at the time of prediction be retained or discarded? In the setup of our model, we found that both methods (retaining and discarding) perform similarly in terms of point predictions, although the method that retained all predictors produced larger prediction sets with somewhat higher frequentist coverage. Based on our results, prediction with a reduced model can work fairly well and is somewhat less computationally demanding.
For example, the approximate running times needed to fit the model and predict for a single year of the real data are 2.6 hours, 1.6 hours, and 1 second for Methods 1 and 2 and the hierarchical model, respectively, on the ARGON cluster at The University of Iowa.
In this work, we have implicitly assumed that the missing data are missing at random (MAR), that is, conditional on the observed data, the distribution of the missing data mechanism does not depend on the values of the missing data. We believe this is a reasonable assumption for the missingness pattern in our data. We have also assumed that the parameters governing the missing data mechanism are a priori independent of the parameters in the observed data likelihood. Together, these assumptions make the missing data mechanism ignorable for Bayesian inference, so inference can be based on the observed data likelihood and the prior distributions alone.
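The ignorability argument can be made explicit. Writing m for the missingness indicators, θ for the parameters of the data model, and φ for the parameters of the missing data mechanism, MAR means p(m | y_obs, y_mis, φ) = p(m | y_obs, φ), so the mechanism factors out of the observed data likelihood:

```latex
p(y_{\mathrm{obs}}, m \mid \theta, \phi)
  = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\,
         p(m \mid y_{\mathrm{obs}}, \phi)\, \mathrm{d}y_{\mathrm{mis}}
  = p(m \mid y_{\mathrm{obs}}, \phi)\, p(y_{\mathrm{obs}} \mid \theta).
```

With a priori independent θ and φ, the factor p(m | y_obs, φ) drops out of the posterior for θ, which is then proportional to p(y_obs | θ) p(θ); this is the sense in which the mechanism is ignorable.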
Although the Bayesian methods are more time consuming, running an algorithm for 2-3 hours once per month (when the SST forecasts are issued) does not seem daunting. Nevertheless, in future work we will explore faster alternatives that strike a balance between the hierarchical model and the Bayesian methods in terms of speed and accuracy. In this paper, our focus has been on predicting the count of tropical storms, but several other response variables are also available, such as the number of hurricanes and measures of storm duration and intensity [21]. One direction for future research is to model these responses jointly in a multivariate regression framework that also incorporates variable selection uncertainty and missing covariates.

Appendix 3. The Models Used in the Tropical Storm Example
In this section, we provide the list of models used in the analysis of the tropical storm data set. The covariates for the second-level linear regression models were chosen after careful examination of the residual plots, to check the assumptions of normality and independence of the residual errors. The conditional regression models for the response variable and each missing predictor, for Method 1, are summarized in Tables A3-A5 for June, July, and August. The same models were also used for Method 2, except in June, when the model for CMC_Atl had the predictors GFDLA_Trop and GFDL_Trop.