Bayesian inference for random coefficient dynamic panel data models

ABSTRACT We develop a hierarchical Bayesian approach for inference in random coefficient dynamic panel data models. Our approach allows for the initial values of each unit's process to be correlated with the unit-specific coefficients. We impose a stationarity assumption for each unit's process by assuming that the unit-specific autoregressive coefficient is drawn from a logitnormal distribution. Our method is shown to have favorable properties compared to the mean group estimator in a Monte Carlo study. We apply our approach to analyze energy and protein intakes among individuals from the Philippines.


Introduction
Dynamic panel data models with autoregressive coefficients are widely used in the analysis of economic, political science, and other data [2,18]. Traditionally, the literature on these models has focused on models that allow intercepts to vary across units but assume the same autoregressive coefficients for all units. However, in many settings, it is more realistic to allow the autoregressive coefficients to vary across units, allowing, for example, for good-specific speeds of reversion to purchasing power parity [14], individual-specific speeds of adjustment to income shocks [13], and country-specific dynamics in savings behavior [10].
For dynamic panel data models with heterogeneous autoregressive coefficients, Pesaran and Smith [16] showed that failing to account for the heterogeneity produces inconsistent estimates of the mean autoregressive coefficient, even for a large N (where N is the number of units), large T (where T is the number of time points observed per unit) panel. To address this problem, Pesaran and Smith [16] proposed the mean group estimator, which averages the coefficients estimated in separate regressions for each unit (or each group of units). Hsiao et al. [12] showed that the mean group estimator is a consistent and asymptotically normal estimator of the average coefficient as long as N → ∞, T → ∞ and √N/T → 0. However, the mean group estimator is not consistent in the large N, small T setting that has been the traditional focus of panel data analysis. As an alternative to the mean group estimator, Hsiao, Pesaran and Tahmiscioglu (hereafter HPT) [12] proposed a hierarchical Bayesian approach to estimate the mean autoregressive coefficient for random effect AR(1) models. HPT showed that their Bayesian estimator is asymptotically equivalent to the mean group estimator in the large N, large T setting, and showed in a Monte Carlo study that the Bayesian estimator has better sampling properties than the mean group estimator for small and moderate T. However, the hierarchical Bayesian approach proposed by HPT has some limitations. First, the model assumes that the initial values y_i0 of the dependent variable are fixed and uncorrelated with the unit-specific coefficients. This means that the unit-specific coefficients do not affect the unit's process at time 0 but affect the unit's process at time 1 and later.
Such a model is not realistic when the decision about when to start sampling the panel is arbitrary; if the process has been going on for some time, there is no particular reason to believe that y_i0 should be viewed differently from y_it [11]. A second limitation of HPT's Bayesian approach is that, although the model they considered assumes that each unit's process is stationary, i.e. the AR(1) coefficient for each unit is assumed to have absolute value less than 1, the distribution of the autoregressive coefficients is taken to be normal, and hence not constrained to have absolute value less than 1, in order to facilitate Gibbs sampling.
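As a concrete illustration of the idea behind the mean group estimator (a sketch of ours, not code from [16]; the function name is ours), the estimate of the average AR(1) coefficient averages per-unit OLS slopes:

```python
import numpy as np

def mean_group_ar1(y):
    """Mean group estimate of the average AR(1) coefficient:
    fit a separate OLS regression of y_it on (1, y_{i,t-1}) for each
    unit i, then average the unit-level slope estimates."""
    slopes = []
    for yi in np.asarray(y, dtype=float):   # yi holds y_i0, ..., y_iT
        X = np.column_stack([np.ones(len(yi) - 1), yi[:-1]])
        beta, *_ = np.linalg.lstsq(X, yi[1:], rcond=None)
        slopes.append(beta[1])
    return float(np.mean(slopes))
```

For small T each unit-level OLS estimate carries the usual small-sample autoregressive bias, which is why the average is only reliable as T grows, consistent with the asymptotics discussed above.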
In this paper, we build on HPT's hierarchical Bayes approach for the random coefficient AR(1) model, improving two of its features: we allow the initial values y_i0 to be correlated with the unit-specific coefficients, and we impose stationarity on the unit-specific AR(1) coefficients. We consider two assumptions about the initial value y_i0 for unit i: (1) the unit's process has been going on for a long time before time 0 and the initial value is generated from the stationary distribution, and (2) the unit's process started a finite time before the 0th period. To impose stationarity on the AR(1) coefficients γ_i, we assume that the γ_i are generated from a distribution whose support is (−1, 1), in particular the logitnormal distribution. To sample from the posterior distribution of our model, we use a Metropolis algorithm. We conduct a Monte Carlo study to examine the frequentist properties of our Bayesian approach. The results show that our approach provides good estimates even when T is small. Besides its good frequentist properties for estimating mean coefficients and variance components, our Bayesian approach has the attractive feature that inferences on the unit-specific coefficients can easily be made.
Our paper is organized as follows. Section 2 describes our formulation of the random coefficient dynamic panel data model. Section 3 describes our prior distribution for our model parameters and our Markov chain Monte Carlo approach for drawing from the posterior distribution. Section 4 provides a Monte Carlo study of our approach's frequentist properties. Section 5 illustrates our approach by applying it to a panel of intakes of dietary energy and protein among individuals in the Philippines. Section 6 provides conclusions.

Formulation of the model
The dynamic panel data model we consider is

y_it = α_i + γ_i y_{i,t−1} + u_it,  t = 1, 2, . . . , T,  (1)

where |γ_i| < 1, i = 1, 2, . . . , N, y_i = (y_i1, y_i2, . . . , y_iT)′ is the T × 1 vector of observations on the dependent variable for unit i, and y_{i,−1} = (y_i0, y_i1, . . . , y_{i,T−1})′. The unit-specific coefficients γ_i and α_i are time invariant but vary over the cross section. The u_i = (u_i1, u_i2, . . . , u_iT)′ are assumed to be iid disturbances with a N(0, σ_u^2 I) distribution. The model (1) can be equivalently written in state-space form [11] as

y_it = η_i + w_it,  w_it = γ_i w_{i,t−1} + u_it,

where w_it is a hidden state and η_i = α_i/(1 − γ_i) is the long-run mean for unit i.
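To make the equivalence of the two forms concrete, the following sketch (with illustrative parameter values of our own choosing) simulates one unit's process both ways from the same disturbances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unit-specific parameters (illustrative values only).
alpha_i, gamma_i, sigma_u, T = 0.4, 0.6, 0.1, 50
eta_i = alpha_i / (1.0 - gamma_i)   # long-run mean

u = sigma_u * rng.standard_normal(T)

# AR(1) form: y_t = alpha_i + gamma_i * y_{t-1} + u_t
y = np.empty(T)
y_prev = eta_i                      # start the process at its long-run mean
for t in range(T):
    y[t] = alpha_i + gamma_i * y_prev + u[t]
    y_prev = y[t]

# State-space form: y_t = eta_i + w_t, with w_t = gamma_i * w_{t-1} + u_t
w = np.empty(T)
w_prev = 0.0                        # w_0 = y_0 - eta_i = 0 for this start
for t in range(T):
    w[t] = gamma_i * w_prev + u[t]
    w_prev = w[t]

assert np.allclose(y, eta_i + w)    # the two forms coincide
```

The hidden state w_it is simply the deviation of y_it from the unit's long-run mean, which is why the two recursions produce identical paths given the same disturbances.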
In a random coefficient model, both the AR coefficient γ_i and the long-run mean η_i are random. Parameters of interest include the mean coefficients μ_γ = E(γ_i) and μ_η = E(η_i), as well as the corresponding variance components σ_γ^2 = Var(γ_i) and σ_η^2 = Var(η_i) and the covariance σ_γη = Cov(γ_i, η_i). Because we assume that each unit's process is dynamically stable (|γ_i| < 1), we want the distribution of γ_i to have support on (−1, 1). We assume γ_i is drawn from a logitnormal distribution scaled to have support on (−1, 1),

γ_i = 2/(1 + exp(−γ*_i)) − 1,

where γ*_i is assumed to have a normal distribution. The logitnormal is a flexible family of distributions for a random variable constrained to an interval. Frederic and Lad [6] provided a review of the logitnormal distribution's properties. We assume that θ_i = (γ*_i, η_i)′ has a bivariate normal distribution with mean μ_θ = (μ_γ*, μ_η)′ and covariance matrix Σ (with variances σ_γ*^2 and σ_η^2, and correlation ρ). Additionally, we assume that each unit's coefficients are independent of other units' random coefficients. We consider two scenarios for how the initial value y_i0 is generated:

Scenario 1: Each unit's process starts in the infinite past, meaning that y_i0 is generated from the stationary distribution, i.e. a normal distribution conditional on (α_i, γ_i) with mean α_i/(1 − γ_i) and variance σ_u^2/(1 − γ_i^2).

Scenario 2: The initial value is generated from the finite past. We consider the following flexible formulation:

y_i0 = a η_i + b γ_i η_i + v_i0,  (2)

where v_i0 ∼ N(0, σ_v^2). For example, if the process starts at time 0, i.e. y_i0 = α_i + v_i0, then a = 1, b = −1. The formulation in Scenario 2 allows for different nonstationary models with varied values of a and b. When the process is non-stationary, as long as y_i0 follows a normal distribution, the posterior distributions of γ*_i and η_i derived in Section 3 are still valid. Scenario 2 also allows for the possibility that the process is stationary and started in the infinite past.
If the process started in the infinite past, E(y_i0 | α_i, γ_i) = α_i/(1 − γ_i), so that a = 1, b = 0. Even though Scenario 2 allows for the possibility that the process started in the infinite past, i.e. a = 1, b = 0, when the process actually started in the infinite past, Scenario 2 does not use as much information as Scenario 1, because Scenario 2 does not use the information that Var(y_i0) = σ_u^2/(1 − γ_i^2). Scenario 1 is useful in applications where the individual processes have been going on for a long time and we start observing them at a random time. Scenario 2 is useful in applications where the subjects may be growing over time (e.g. the subjects are children) and where we start observing them at a set time (e.g. when the children enter kindergarten).
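A minimal sketch of the two initial-value schemes (the hyperparameter values are illustrative and not from the paper's design; the linear form used for Scenario 2 is our reading of Equation (2), chosen so that a = 1, b = −1 gives y_i0 = α_i + v_i0 and a = 1, b = 0 gives the infinite-past mean):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative hyperparameters (assumptions for this sketch).
mu_gamma_star, sigma_gamma_star = 0.5, 0.4
sigma_u, sigma_v = 0.1, 0.1

# Scaled logitnormal: a normal gamma*_i maps into (-1, 1).
gamma_star = rng.normal(mu_gamma_star, sigma_gamma_star)
gamma_i = 2.0 / (1.0 + np.exp(-gamma_star)) - 1.0
assert -1.0 < gamma_i < 1.0          # stationarity holds by construction

eta_i = 1.0                          # long-run mean; alpha_i = eta_i*(1-gamma_i)

# Scenario 1: y_i0 drawn from the stationary distribution.
y0_s1 = rng.normal(eta_i, sigma_u / np.sqrt(1.0 - gamma_i**2))

# Scenario 2: linear form y_i0 = a*eta_i + b*gamma_i*eta_i + v_i0;
# a = 1, b = -1 recovers y_i0 = alpha_i + v_i0 (process starts at time 0),
# a = 1, b = 0 recovers the infinite-past mean eta_i.
a, b = 1.0, -1.0
y0_s2 = a * eta_i + b * gamma_i * eta_i + rng.normal(0.0, sigma_v)
```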

Bayesian approach
In this section, we develop a hierarchical Bayesian approach for estimating the models of Scenarios 1 and 2 above. For Scenario 1, we need to put a prior distribution on the parameters μ_θ, Σ, and σ_u^2. We choose the normal-inverse-Wishart distribution as the prior distribution for μ_θ and Σ, with parameters (μ_0, Λ_0/κ_0; ν_0, Λ_0):

Σ ∼ IW(ν_0, Λ_0),  μ_θ | Σ ∼ N(μ_0, Σ/κ_0),

where IW represents the inverse-Wishart distribution; ν_0 and Λ_0 are the degrees of freedom and the scale matrix for the inverse-Wishart distribution; μ_0 is the prior mean; and κ_0 is the number of prior measurements on the Σ scale. We put a noninformative prior on σ_u^2: p(σ_u^2) ∝ 1/σ_u^2. The joint posterior density is proportional to the product of the likelihood and these prior densities, and from it we derive the posterior conditional densities of the parameters; below, N(a, b) denotes the normal density with mean a and variance b.
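A sketch of drawing from this normal-inverse-Wishart prior (the function name is ours; for integer ν_0 the inverse-Wishart draw can be built from an ordinary Wishart draw with NumPy alone):

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_normal_inv_wishart(mu0, kappa0, nu0, Lambda0, rng):
    """One draw (mu_theta, Sigma) from the normal-inverse-Wishart prior:
    Sigma ~ IW(nu0, Lambda0) and mu_theta | Sigma ~ N(mu0, Sigma/kappa0).
    For integer nu0 >= d, W ~ Wishart(nu0, Lambda0^{-1}) can be sampled by
    summing outer products of normal draws, and then Sigma = W^{-1}."""
    d = len(mu0)
    L = np.linalg.cholesky(np.linalg.inv(Lambda0))
    Z = rng.standard_normal((nu0, d)) @ L.T      # rows ~ N(0, Lambda0^{-1})
    W = Z.T @ Z                                  # W ~ Wishart(nu0, Lambda0^{-1})
    Sigma = np.linalg.inv(W)                     # Sigma ~ IW(nu0, Lambda0)
    mu = rng.multivariate_normal(mu0, Sigma / kappa0)
    return mu, Sigma

mu, Sigma = draw_normal_inv_wishart(np.zeros(2), 1.0, nu0=10,
                                    Lambda0=np.eye(2), rng=rng)
```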
HPT applied Gibbs sampling in their Bayesian estimation of dynamic panel data models. Gibbs sampling is a Markov chain Monte Carlo algorithm that successively draws components of the parameter vector from the posterior distribution conditional on the other components of the parameter vector [7]. To use Gibbs sampling, one needs to be able to draw from the posterior conditional distributions. However, in our model, the posterior conditional distribution of γ*_i given the other parameters is not easy to draw from. Instead of pure Gibbs sampling, we use a 'Metropolis-Hastings within Gibbs' algorithm, which is a particular type of Metropolis-Hastings algorithm [9]. To obtain a new sample from the posterior distribution, we successively draw Σ, μ_θ, σ_u^2 and (η_1, . . . , η_N) from the above posterior conditional distributions, conditioning on the most recently drawn values of the parameters. We then use the following Metropolis-Hastings step to draw a new γ*_i for i = 1, . . . , N. We draw γ*_{i,trial} from an easy-to-draw proposal distribution g(x) described below. Letting γ*_{i,old} denote the γ*_i from the previous sample, we compute the acceptance ratio

r = [p(γ*_{i,trial})/g(γ*_{i,trial})] / [p(γ*_{i,old})/g(γ*_{i,old})],

where p(·) is the posterior density of γ*_i conditional on the current samples of all the other parameters. If r is larger than 1, we set γ*_{i,new} = γ*_{i,trial} as our new sample of γ*_i; if r is less than 1, we draw a uniform random number u on [0, 1] and set γ*_{i,new} = γ*_{i,trial} if u < r and γ*_{i,new} = γ*_{i,old} otherwise. See Gilks [9] for further discussion of the Metropolis-Hastings within Gibbs sampling approach. Our proposal density g is the marginal distribution of γ*_i given η_i, μ_θ, Σ, and σ_u^2 (but not the data y). As long as the data are not highly informative about γ*_i, we have found that this proposal density is not too far from the true posterior conditional density p(γ*_i | y_i0, y_i, η_i, μ_θ, Σ, σ_u^2). For Scenario 2, we have the linear form of Equation (2) for the initial conditions.
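The Metropolis step above can be sketched as a generic independence Metropolis-Hastings update (function and variable names are ours; the toy target in the usage example stands in for the conditional posterior of γ*_i, and the toy proposal stands in for the conditional prior g):

```python
import numpy as np

rng = np.random.default_rng(2)

def mh_within_gibbs_step(gs_old, log_target, proposal_draw, proposal_logpdf):
    """One independence Metropolis-Hastings update for a single gamma*_i.

    log_target     : unnormalized log conditional posterior of gamma*_i
    proposal_draw  : draws from the proposal g (e.g. the conditional
                     prior of gamma*_i, which ignores the data)
    proposal_logpdf: log density of g, needed in the acceptance ratio
    """
    gs_trial = proposal_draw()
    log_r = (log_target(gs_trial) - log_target(gs_old)
             + proposal_logpdf(gs_old) - proposal_logpdf(gs_trial))
    # Accept with probability min(1, r); otherwise keep the old value.
    if np.log(rng.uniform()) < min(0.0, log_r):
        return gs_trial
    return gs_old

# Toy check: with a N(0,1) "posterior" and a wider N(0,2) proposal,
# the chain should recover the target's moments.
log_target = lambda x: -0.5 * x**2
draw = lambda: rng.normal(0.0, 2.0)
logpdf = lambda x: -0.5 * (x / 2.0) ** 2
chain = [0.0]
for _ in range(20000):
    chain.append(mh_within_gibbs_step(chain[-1], log_target, draw, logpdf))
sample = np.array(chain[2000:])
```

Because the proposal does not depend on the current state, the correction terms in the log acceptance ratio use the proposal density evaluated at the old and trial values, exactly as in the ratio r above.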
We choose independent priors for a, b and σ_v^2, with a and b having normal priors N(0, σ_a^2) and N(0, σ_b^2), and σ_v^2 having an inverse-gamma (IG) prior. We keep the other prior distributions the same as in Scenario 1. The conditional posterior densities for Scenario 2 are derived in the same way as for Scenario 1. All of the conditional densities have standard forms except for the conditional density of γ*_i. We apply the same Metropolis-Hastings within Gibbs sampling method as for Scenario 1 to draw γ*_i.

Design of study
We construct a Monte Carlo study to examine the performance of our hierarchical Bayes approach. We generate data from model (1), where (γ*_i, η_i) have a bivariate normal distribution with mean μ_θ and covariance Σ. The different cases of true parameter values we consider are given in Table 1. The disturbances u_it are generated from a normal distribution N(0, σ_u^2). Because σ_u is not a parameter of interest in our model, we take the value of σ_u to be 0.1 in all cases. To reflect the effect of coefficient heterogeneity, we use a design similar to HPT's, where σ_γ and σ_η are chosen to be equal to either the mean coefficients or half of the mean coefficients. In our design, the mean coefficients have the values μ_γ = 0.3 or 0.6 and μ_η = 0.1, 0.2, 1 or 2. The number of cross-sectional units is N = 50 or 1000 and the number of time periods is T = 5 or 20. For Scenario 1, where the process starts in the infinite past, y_i0 is generated from a normal distribution with mean α_i/(1 − γ_i) and variance σ_u^2/(1 − γ_i^2). For Scenario 2, we choose the values 0 and 0.8 for a and b separately, and the v_i0 are mean-zero normal with standard deviation σ_v = 0.1.
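The Scenario 1 data-generating process of this design can be sketched as follows (the scaled-logitnormal map and stationary start follow Section 2; the hyperparameter values here are illustrative stand-ins for a row of Table 1):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_panel(N, T, mu_theta, Sigma_theta, sigma_u=0.1):
    """Generate a panel from the random coefficient AR(1) model under
    Scenario 1 (y_i0 from the stationary distribution).
    mu_theta / Sigma_theta parameterize the normal (gamma*_i, eta_i)."""
    theta = rng.multivariate_normal(mu_theta, Sigma_theta, size=N)
    gamma = 2.0 / (1.0 + np.exp(-theta[:, 0])) - 1.0   # scaled logitnormal
    eta = theta[:, 1]
    y = np.empty((N, T + 1))
    # Stationary start: mean eta_i, variance sigma_u^2 / (1 - gamma_i^2).
    y[:, 0] = rng.normal(eta, sigma_u / np.sqrt(1.0 - gamma**2))
    alpha = eta * (1.0 - gamma)
    for t in range(1, T + 1):
        y[:, t] = alpha + gamma * y[:, t - 1] + sigma_u * rng.standard_normal(N)
    return y, gamma, eta

# Illustrative hyperparameters (assumed values, not a specific table row).
mu_theta = np.array([0.5, 1.0])
Sigma_theta = np.array([[0.25, 0.0], [0.0, 0.25]])
y, gamma, eta = simulate_panel(N=50, T=5, mu_theta=mu_theta,
                               Sigma_theta=Sigma_theta)
```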
To monitor the convergence of our MCMC algorithm, we use the method suggested in [8]. We run several chains with scattered starting values and calculate R̂, which summarizes the ratio of between- and within-sequence variances, to monitor convergence; R̂ near 1 suggests convergence. We also examine trace plots. In our study, we use 20 chains with scattered starting values, and R̂ generally becomes close to 1 after 500 iterations. We set the number of iterations to 3000 and, to be conservative, discard the first 1000 iterations. We evaluate the frequentist properties of our Bayesian procedure by repeating the simulation 500 times.
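A minimal sketch of the between/within-variance diagnostic for one scalar parameter (later refinements such as chain splitting and rank normalization are omitted):

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Potential scale reduction factor R-hat for one scalar parameter.

    chains : array of shape (m, n) -- m parallel chains of length n,
             started from dispersed initial values.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))

# Chains exploring the same distribution should give R-hat near 1.
rng = np.random.default_rng(4)
mixed = rng.standard_normal((20, 1000))
rhat = gelman_rubin_rhat(mixed)
```

Chains stuck near different starting values inflate the between-chain variance B, pushing R̂ well above 1, which is what the scattered-starting-values check above is designed to detect.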

Analysis of results
The Monte Carlo study results are presented in Tables 2-7. The mean estimates of each parameter from the posterior distribution, averaged over 500 simulations, are compared to the true values used to generate the data. Tables 2-6 report results for initial values y_i0 generated from the stationary distribution (Scenario 1). Table 2 shows the bias of the estimates of μ_γ and the corresponding root-mean-squared error (RMSE). In all cases, the hierarchical Bayes estimator performs much better than the mean group estimator. The mean group estimator is heavily biased downwards in most cases. The hierarchical Bayes estimator performs very well even when both N and T are small (N = 50, T = 5), with bias at most 10%. When T = 20, the bias in all cases drops below 3%. The RMSE of the mean group estimator is much larger than the RMSE of the hierarchical Bayes estimator in all cases. For example, for N = 50, T = 5, the RMSE of the hierarchical Bayes estimator is at most 42% of the RMSE of the mean group estimator. The acceptance rates of the Metropolis-Hastings within Gibbs algorithm range from 26% to 95%, with more than half of the acceptance rates below 70%. The acceptance rate decreases with increasing T and N. When N = 1000 and T = 20, the acceptance rates range from 26% to 71%. Table 3 shows the results for μ_η. The mean group estimator's bias is acceptable in most cases but heavy in some, while the hierarchical Bayes estimator has consistently low bias of less than 2%. The RMSE of the mean group estimator is much larger than that of the hierarchical Bayes estimator in all cases, more than 10 times as large in some. Table 4 shows the results for σ_γ. We show the results of the Swamy estimator [17] along with our hierarchical Bayes estimators. Following HPT, we drop the second term on the right-hand side of Equation (3), which is O_p(T^{-1}), in order to ensure that the estimated covariance matrix is non-negative definite.
When N = 50 and T = 5, the Swamy estimator overestimates and the hierarchical Bayes estimator underestimates the true values in all cases. When N increases to 1000, the bias of the Swamy estimator does not improve significantly, but the performance of the hierarchical Bayes estimator improves substantially, with bias at most 25%. When T increases to 20, both estimators improve substantially, but the hierarchical Bayes estimators have much less bias, less than 1% in all cases. The RMSE of the hierarchical Bayes estimator is lower than that of the Swamy estimator in all cases, especially when N and T are large. Table 5 shows the hierarchical Bayes estimator of σ_η. The bias seldom exceeds 10%, except for cases 3 and 7 when N = 50.
In Table 6 we report the empirical coverage rates of the 90% and 95% credibility intervals for μ_γ and μ_η. When both N and T are small, the coverage rate for μ_γ is significantly lower than the nominal level. In the other cases, the empirical coverage rates are generally around the nominal levels, indicating that the credibility intervals can be used as approximate confidence intervals. Table 7 gives the results when we assume a linear form for the initial values (Scenario 2). For most settings, the hierarchical Bayes estimates perform well and have little bias. For cases 3, 5 and 7, the estimates of μ_η, σ_γ, a and b exhibit some bias for N = 50, but the bias disappears for N = 1000.
Additional results for the cases in Tables 2-7 are provided in the supplemental material.

Empirical application
To illustrate our methods, we consider a model of intakes of dietary energy and protein following Bhargava [3]. We consider a panel consisting of four rounds of 24-h recall data on dietary energy and protein used by Bhargava [3]. This panel is based on a survey of 448 households living within a radius of 20 miles of the Bukidnon region of the Philippines during the years 1984 and 1985 [4]. The surveys were conducted at 4-month intervals, and intakes of dietary energy and protein in all four rounds are available for 2256 individuals who were older than 1 year (note that our sample is slightly different from that used in [3]).
We fit the random coefficient model (1), where y_it is the intake of energy or protein of the ith individual in the tth time period, α_i is the mean intake of energy or protein on a logarithmic scale in the first survey month of 1984, and γ_i is the ratio of the increase in intake in one survey month (compared to the previous month) to the increase in intake in the previous survey month (compared to the month before that). We also consider mean group estimators and generalized method of moments (GMM) estimators [1] for a homogeneous (γ_i = μ_γ) version of Equation (1). For the mean group method, Swamy estimators of σ_γ and σ_α are provided, and confidence intervals for both mean and variance estimates are derived as the 2.5% and 97.5% quantiles from 1000 bootstrap replications. The estimation results for these three methods are reported in Table 17. The mean group estimates are quite different from the Bayesian and GMM estimates, while the Bayesian and GMM estimates are close. In addition, the variances of the mean group estimates are much larger. The estimates of μ_γ from our Bayesian method are about 10% lower than the GMM estimates for both panels, while the estimates of μ_η are very close between the two methods for both panels; the homogeneous model with GMM estimation thus appears to overestimate μ_γ by about 10%. In addition, the 95% credibility intervals for μ_γ and μ_η from our hierarchical Bayesian method are narrower than the 95% confidence intervals from the homogeneous model with GMM estimates.
We conducted posterior predictive checks using as test statistics the mean of Y_t at each time period and the pairwise correlations among (Y_1, Y_2, Y_3, Y_4), where Y_t = (y_1t, y_2t, . . . , y_it, . . . , y_nt), t = 1, 2, 3, 4. The posterior predictive p-values for the means and for the correlations involving Y_1 were not extreme. However, the posterior predictive p-values for the correlations of (Y_2, Y_3), (Y_3, Y_4) and (Y_2, Y_4) were below .05. Thus, the model fit is reasonable, but there is some indication of nonstationarity. It would be valuable to explore this further in future work.
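The check can be sketched generically (function name and toy inputs are ours; in the application the statistics were per-round means and pairwise correlations, and the replications would come from the fitted model's posterior predictive distribution):

```python
import numpy as np

def posterior_predictive_pvalue(y_obs, y_reps, stat):
    """Posterior predictive p-value: the fraction of replicated data
    sets, drawn from the posterior predictive distribution, whose test
    statistic is at least as large as the observed statistic."""
    t_obs = stat(y_obs)
    t_rep = np.array([stat(y) for y in y_reps])
    return float(np.mean(t_rep >= t_obs))

# Toy usage: replications drawn from the same distribution as the
# observed data should give a non-extreme p-value.
rng = np.random.default_rng(6)
y_obs = rng.standard_normal(100)
y_reps = rng.standard_normal((500, 100))
p = posterior_predictive_pvalue(y_obs, y_reps, np.mean)
```

P-values near 0 or 1 for a statistic indicate that the model fails to reproduce that feature of the data, which is how the nonstationarity signal above was detected.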

Conclusion
Our paper further develops Hsiao, Pesaran and Tahmiscioglu's hierarchical Bayesian method for random coefficient dynamic panel data models. Instead of treating the initial observations as fixed constants, we allow the initial values to come either from a stationary process or from a flexible finite-past process. We also use the logitnormal distribution to enforce a stationarity constraint on the coefficients γ_i. We use a Markov chain Monte Carlo algorithm to generate the hierarchical Bayes estimates. Our Monte Carlo study provides evidence that these estimates have good frequentist properties. The hierarchical Bayes estimators perform well even when both N and T are small and perform substantially better than the mean group estimator. Incorporating covariates into our Bayesian framework is an area for future research.
In the model (1), the unit-specific effect α_i and the covariates y_{i,t−1} are assumed to be independent. Considering a correlated random effects approach [5,15] would be a good topic for further research.