Empirical Bayes estimation of parameters in Markov transition probability matrix with computational methods

Empirical Bayes estimators for the transition probability matrix are worked out for cases where we have prior beliefs about the parameters, for example, whether or not the states appear to be equally important. In both cases, the priors are chosen in accordance with these beliefs. Using the EM algorithm, computational methods for the different hyperparameters of the empirical Bayes procedure are described. The robustness of the empirical Bayes procedure is also investigated.


Introduction
The Markov model is important in studying the internal migration [10] behaviour of a population. Classical estimation of the parameters of the transition probability matrix is available in the literature. When the behaviour of the migration pattern is known, that knowledge should be employed through a Bayes procedure [6][7][8]. For example, if we believe that all states are almost equally important, then we should incorporate such beliefs into the estimation procedure. In this paper, we consider empirical Bayes estimates in the light of such beliefs.
It is also important to suggest a prior distribution for a problem. In fact, the system itself suggests the general form of the prior: the form of the prior can be understood from the form of the problem. What we do is refine the prior, finer and finer, from the data set alone. Some practitioners put emphasis on the data and not on the prior, but that is not the spirit of a Bayes procedure. So once we have a feeling about the prior, the rational choice is to take its general form and then, from the data, make it as concrete as possible. In our case, that is, in the internal migration problem, or more generally for transition probabilities, the natural general form of the prior is the multivariate Beta (Dirichlet) prior, and from the data we make it as concrete as possible by the Bayes procedure. That has been done earlier. For more general forms, in terms of variations of the hyperparameters of the prior, and using the data, we obtain an empirical Bayes procedure.

In Section 2, empirical Bayes estimation from transition data is considered under two different beliefs: (i) the hyperparameters are the same for the different states, which encodes knowledge that the states perform equally; (ii) we do not hold such a belief. The next problem is to estimate the hyperparameters when they are not known. This is addressed in Sections 3 and 4 under the two beliefs (i) and (ii) above. The computational methods are worked out using the EM algorithm; we suggest a way to obtain an initial value and give steps to compute accurate values of the hyperparameters. Finally, in Section 5, the robustness of the model of the prior under possible variation of the hyperparameters is examined. It is not treated in full generality: for some prior hyperparameters the behaviour is seen to be good, and if the same holds for others, then we can say that the empirical Bayes estimate is more or less close to the Bayes estimate. This numerical work is included for clarity.
The relevant R-codes used for the computations are given in Appendix 2. In spite of the limited numerical scope, our achievement is to obtain a general form or model in this context. This is considered in Section 6, where hyperparameters are estimated from a data set for the scientific significance of the problem; the relevant R-codes are given in Appendix 3.

Empirical Bayes estimation from transition data
In this section, we want to find estimators of the unknown parameters involved in the prior distribution. It can be seen that the Bayes estimator δ(n_ij) of p_ij depends on the prior distribution π(p | α). When the α_ij's are unknown, δ(·) cannot be implemented directly. The empirical Bayes approach is therefore employed to combine information from the observations n_i, i = 1, 2, ..., k.
If the prior distribution of p is not completely known, the posterior distribution cannot be used to make probability statements about the unknown parameters. In some situations, however, it may be possible to compute the posterior distribution of p given n without complete knowledge of the distribution function of p, although the resulting estimate may involve unknown quantities which are characteristics of the marginal distribution of n alone.
Case 1: Priors have the same hyperparameters. Let

n_i = (n_i1, ..., n_ik) | p_i ~ Multinomial(n_i·; p_i1, ..., p_ik), i = 1, 2, ..., k,

and

p_i = (p_i1, ..., p_ik) ~ Dirichlet(α_1, α_2, ..., α_k) independently for i = 1, 2, ..., k.

That is, the k groups are tied together by the common prior distribution. Therefore, the Bayes estimator of p_ij under the sum of squared error loss is

δ(n_ij) = (n_ij + α_j) / (n_i· + α_·),

where n_i· = ∑_{j=1}^k n_ij and α_· = ∑_{j=1}^k α_j, for all i = 1, 2, ..., k.
The marginal likelihood of α, obtained by integrating out p, is

L(α) = ∏_{i=1}^k [ (n_i·! / ∏_j n_ij!) (Γ(α_·) / Γ(n_i· + α_·)) ∏_{j=1}^k Γ(n_ij + α_j) / Γ(α_j) ],   (1)

and the likelihood equation ∂ log L(α)/∂α_j = 0 becomes

∑_{i=1}^k [ψ(n_ij + α_j) − ψ(α_j)] − ∑_{i=1}^k [ψ(n_i· + α_·) − ψ(α_·)] = 0, j = 1, 2, ..., k.   (2)

The likelihood equation clearly does not yield an explicit solution for α̂_j. Thus, Equation (2) must be solved by numerical methods.

Case 2: Priors have different hyperparameters. Now let n_i be the random vector corresponding to the observed data, having a Multinomial distribution with unknown parameter vector p_i, where

p_i = (p_i1, ..., p_ik) ~ Dirichlet(α_i1, ..., α_ik) independently for i = 1, 2, ..., k.

Then the Bayes estimator of p_ij [6][7][8] under squared error loss is

δ(n_ij) = (n_ij + α_ij) / (n_i· + α_i·).

It may be noticed that Equation (1) can also be used here, taking k = 1. However, in both cases extensive computation is needed.
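For concreteness, the posterior-mean estimator above can be sketched in code. This is a minimal Python illustration, not the R-code of the appendices; the function name and example counts are invented.

```python
# Sketch of the Bayes (posterior-mean) estimator of a transition matrix
# under a Dirichlet prior: p_hat[i][j] = (n[i][j] + alpha[j]) / (n_i. + alpha_.).
# A single hyperparameter vector `alpha` is shared by all rows (Case 1).

def bayes_transition_estimate(counts, alpha):
    """counts: k x k list of transition counts; alpha: length-k Dirichlet prior."""
    alpha_dot = sum(alpha)
    estimate = []
    for row in counts:
        n_dot = sum(row)
        estimate.append([(n_ij + a_j) / (n_dot + alpha_dot)
                         for n_ij, a_j in zip(row, alpha)])
    return estimate

# Example: two states, uniform Dirichlet(1, 1) prior on each row.
est = bayes_transition_estimate([[2, 1], [0, 3]], [1.0, 1.0])
```

Each estimated row sums to one by construction, since ∑_j (n_ij + α_j) = n_i· + α_·.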

Computational method when hyperparameters are the same for k-places
If we use a suitable approximation in the likelihood equation, approximate values of (α_j − 1/2)/(α_· − 1/2), for all j = 1, 2, ..., k, are obtained as in Equation (3); summing over all j then gives an approximation for α_·. Thus, from Equation (3) we have estimators of the α_j, with approximations α̂_j given by Equation (4). Starting from these values of α̂_j from Equation (4), solutions of Equation (2) can be obtained by an iterative process.
Let us denote the value of α at the rth iteration of the Newton-Raphson method by α^(r). Then the value of α at the next iteration is given by

α^(r+1) = α^(r) + I(α^(r))^{-1} S(α^(r)),

where I(α) = −(∂²/∂α∂α′) log L(α) is the observed information matrix and S(α) = (∂/∂α) log L(α) is the score vector.
From the likelihood function (1), the score has entries

S_j(α) = ∑_{i=1}^k [ψ(n_ij + α_j) − ψ(α_j) − ψ(n_i· + α_·) + ψ(α_·)], j = 1, 2, ..., k,

where ψ(t) = (d/dt) log Γ(t) is the digamma function. The observed information has entries

I_{jj′}(α) = 1_{j=j′} ∑_{i=1}^k [ψ′(α_j) − ψ′(n_ij + α_j)] − ∑_{i=1}^k [ψ′(α_·) − ψ′(n_i· + α_·)],

where 1_{j=j′} is the indicator function of the event {j = j′} and ψ′(t) is the trigamma function (d²/dt²) log Γ(t).
The observed information can be summarised in matrix form as

I(α) = D − C 11′,

where D is a diagonal matrix with jth diagonal entry d_j = ∑_{i=1}^k [ψ′(α_j) − ψ′(n_ij + α_j)], C = ∑_{i=1}^k [ψ′(α_·) − ψ′(n_i· + α_·)], and 1 is a column vector of all 1's. The limiting value α* = (α*_1, α*_2, ..., α*_k) of α = (α_1, α_2, ..., α_k) then gives the empirical Bayes estimate when all prior distributions have the same set of parameters.
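A single Newton-Raphson step for the common-hyperparameter case can be sketched as follows. This is an illustrative Python sketch (the paper's own computations are in R, given in the appendices); the function names are invented, digamma and trigamma are implemented by the standard recurrence-plus-asymptotic-series approach, and the linear system with matrix D − C11′ is solved by the Sherman-Morrison identity rather than a full matrix inverse.

```python
import math

def digamma(x):
    # psi(x): shift x above 6 via the recurrence psi(x) = psi(x+1) - 1/x,
    # then apply the asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def trigamma(x):
    # psi'(x): recurrence psi'(x) = psi'(x+1) + 1/x^2, then asymptotic series.
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    t = 1.0 / x
    t2 = t * t
    return r + t + 0.5 * t2 + t * t2 * (1/6.0 - t2 * (1/30.0 - t2 / 42.0))

def score_and_newton_step(counts, alpha):
    """Score S(alpha) and the Newton increment I(alpha)^{-1} S(alpha) for the
    common-hyperparameter likelihood; I(alpha) = D - C*11' is inverted via
    the Sherman-Morrison identity."""
    k = len(alpha)
    a_dot = sum(alpha)
    score = [0.0] * k
    d = [0.0] * k          # diagonal entries of D
    c = 0.0                # rank-one coefficient C
    for row in counts:
        n_dot = sum(row)
        common = digamma(a_dot) - digamma(n_dot + a_dot)
        c += trigamma(a_dot) - trigamma(n_dot + a_dot)
        for j in range(k):
            score[j] += digamma(row[j] + alpha[j]) - digamma(alpha[j]) + common
            d[j] += trigamma(alpha[j]) - trigamma(row[j] + alpha[j])
    # Sherman-Morrison:
    # (D - C 11')^{-1} s = D^{-1}s + C (1'D^{-1}s) D^{-1}1 / (1 - C 1'D^{-1}1)
    dinv_s = [score[j] / d[j] for j in range(k)]
    dinv_1 = [1.0 / d[j] for j in range(k)]
    factor = c * sum(dinv_s) / (1.0 - c * sum(dinv_1))
    step = [dinv_s[j] + factor * dinv_1[j] for j in range(k)]
    return score, step
```

In practice one iterates α ← α + step until the score vanishes, guarding against non-positive components (e.g. by step-halving).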

Computational method when hyperparameters are different
Consider the case where the α_ij's are unknown parameters, for i, j = 1, 2, ..., k. We take

p_i = (p_i1, ..., p_ik) ~ Dirichlet(α_i1, ..., α_ik) independently for i = 1, 2, ..., k.

We are interested in finding an estimate of the Dirichlet parameter α from the observed data n.
It is not possible to estimate α from the observed data analytically: if we try separately for each component using Equation (4), then for k = 1, that is, for an individual state, it is seen that α̂_ij = 1/2, which is absurd. However, in this case it is possible to compute the MLE of α from the observed data iteratively, using the quasi-Newton accelerated EM algorithm. We estimate each α_i individually, since the estimate of α_i = (α_i1, ..., α_ik) depends only on n_i = (n_i1, ..., n_ik) and is independent of n_j = (n_j1, ..., n_jk), for j ≠ i.
Thus, we iterate each α i individually and take the limiting value of α i as α * i , for all i = 1,2, . . . ,k.
Finally, we have α* = (α*_1, ..., α*_k). Before that, by x we denote the augmented (complete-data) vector and by p the vector containing the additional data, referred to as the unobservable data. The complete-data vector is taken to be x_i = (n_i, p_i) for all i = 1, 2, ..., k, where

n_i | p_i ~ Multinomial(n_i·; p_i1, ..., p_ik),   (5)

p_i ~ Dirichlet(α_i1, ..., α_ik).   (6)

After integrating with respect to (p_i1, p_i2, ..., p_ik) the joint density of (n_i1, ..., n_ik) and (p_i1, ..., p_ik) formed from Equations (5) and (6), where the integration is carried out over the region {p_ij ≥ 0, ∑_{j=1}^k p_ij = 1}, we obtain the marginal density of (n_i1, ..., n_ik), for all i = 1, 2, ..., k, as

f(n_i | α_i) = (n_i·! / ∏_j n_ij!) (Γ(α_i·) / Γ(n_i· + α_i·)) ∏_{j=1}^k Γ(n_ij + α_ij) / Γ(α_ij),   (7)

which is the Multinomial-Dirichlet distribution. By (n_i1, ..., n_ik) we denote an observed random sample from the Multinomial-Dirichlet distribution with parameters (α_i1, ..., α_ik), for all i = 1, 2, ..., k. Then the incomplete-data log likelihood is of the form

log L(α_i) = constant + log Γ(α_i·) − log Γ(n_i· + α_i·) + ∑_{j=1}^k [log Γ(n_ij + α_ij) − log Γ(α_ij)],   (8)

where n_i· = ∑_{j=1}^k n_ij and α_i· = ∑_{j=1}^k α_ij, for all i = 1, 2, ..., k.

Because of the conditional structure of the complete-data model specified by Equations (5) and (6), the complete-data likelihood can be factored into the product of the conditional density of (n_i1, ..., n_ik) given (p_i1, ..., p_ik) and the marginal density of (p_i1, ..., p_ik), for all i = 1, 2, ..., k. Accordingly, the complete-data log likelihood can be written as

log L_c(α_i) = constant + log Γ(α_i·) − ∑_{j=1}^k log Γ(α_ij) + ∑_{j=1}^k (n_ij + α_ij − 1) log p_ij.

The EM algorithm approaches the problem of solving the incomplete-data likelihood equation (8) indirectly, proceeding iteratively in terms of the complete-data log likelihood function log L_c(α). The obstacle due to unobservability is overcome by averaging the complete-data log likelihood over its conditional distribution given the observed data n. But in order to calculate this conditional expectation, we have to specify a value for α.
Let α_i^(0) denote the starting value of α_i and α_i^(r) the value of α_i at the rth subsequent iteration of the EM algorithm.

E-step: On the first iteration of the EM algorithm, the E-step requires the computation of the conditional expectation of log L_c(α) given n, using α_i^(0) for α_i, which can be written as Q(α_i, α_i^(0)); the Q-function denotes the conditional expectation of the complete-data log likelihood function log L_c(α) given the observed data n, using the current fit for α. On the (r + 1)th iteration of the E-step we have

Q(α_i, α_i^(r)) = E_{α_i^(r)} [log L_c(α_i) | n],   (9)

where the expectation operator E has the subscript α_i^(r) to convey explicitly that this (conditional) expectation is taken using α_i^(r) for α_i.

It is seen from Equation (9) that, in order to carry out the M-step, we need to calculate the term E_{α_i^(r)}[log p_ij | n]. The calculation of this term can be avoided by making use of the identity [4]

∂Q(α_i, α_i^(r))/∂α_i |_{α_i = α_i^(r)} = S_i(n_i; α_i^(r)),

where S_i(n_i; α_i^(r)) is the score statistic. On evaluating S_ij(n_i; α_i^(r)), the derivative of Equation (8) with respect to α_ij at the point α_i = α_i^(r), we have

S_ij(n_i; α_i^(r)) = ψ(n_ij + α_ij^(r)) − ψ(α_ij^(r)) − ψ(n_i· + α_i·^(r)) + ψ(α_i·^(r)),   (10)

where ψ(s) = (d/ds) log Γ(s) is the digamma function. On equating S_ij(n_i; α_i^(r)) to the derivative of Equation (9) with respect to α_ij at the point α_i = α_i^(r), we obtain E_{α_i^(r)}[log p_ij | n] = ψ(n_ij + α_ij^(r)) − ψ(n_i· + α_i·^(r)). Therefore, from Equation (9) we have

Q(α_i, α_i^(r)) = log Γ(α_i·) − ∑_{j=1}^k log Γ(α_ij) + ∑_{j=1}^k (n_ij + α_ij − 1)[ψ(n_ij + α_ij^(r)) − ψ(n_i· + α_i·^(r))] + constant.   (11)

M-step: The M-step at the (r + 1)th iteration of the EM algorithm maximises Q(α_i, α_i^(r)); the values α_ij^(r+1) are obtained after solving ∂Q(α_i, α_i^(r))/∂α_ij = 0. It follows that α_ij^(r+1) is a solution of the equation

ψ(α_ij) − ψ(α_i·) = ψ(n_ij + α_ij^(r)) − ψ(n_i· + α_i·^(r)).   (12)

Finally, we get α_i → α*_i, the limiting value of α_i, for all i = 1, 2, ..., k. Thus, we have the empirical Bayes estimator of p_ij as

p̂_ij = (n_ij + α*_ij) / (n_i· + α*_i·),

where α*_ij is the limiting value of α_ij for all i, j = 1, ..., k, and α*_i· = ∑_{j=1}^k α*_ij. Details of such are included in Appendix 1, and its inspiration is in [1].
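The identity used above, that the gradient of the Q-function at the current fit equals the incomplete-data score, can be checked numerically. The sketch below (illustrative names; digamma via a standard series; the Q-function evaluated up to terms free of α_i, which do not affect the gradient) compares a finite-difference gradient of Q at α_i = α_i^(r) with the score formula:

```python
import math
from math import lgamma

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def q_function(alpha, alpha_r, row):
    """Q(alpha_i, alpha_i^(r)) for one row, dropping additive terms that
    do not involve alpha_i (including n_ij * E[log p_ij | n])."""
    n_dot, ar_dot = sum(row), sum(alpha_r)
    q = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    for a, ar, n in zip(alpha, alpha_r, row):
        e_log_p = digamma(n + ar) - digamma(n_dot + ar_dot)   # E[log p_ij | n]
        q += (a - 1.0) * e_log_p
    return q

def score(alpha_r, row):
    """Incomplete-data score S_ij(n_i; alpha_i^(r))."""
    n_dot, ar_dot = sum(row), sum(alpha_r)
    return [digamma(n + a) - digamma(a) - digamma(n_dot + ar_dot) + digamma(ar_dot)
            for n, a in zip(row, alpha_r)]
```

At α_i = α_i^(r) the partial derivatives of `q_function` coincide with `score`, which is exactly the identity that lets the EM iteration avoid computing E[log p_ij | n] directly.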
Concerning the M-step, it can be seen that the presence of terms like log Γ(α_ij) in Equation (11) prevents a closed-form solution for α_i^(r+1). Now, using [4], the quasi-Newton acceleration procedure in our case becomes

α_i^(r+1) = α_i^(r) + I_c(α_i^(r); n_i)^{-1} S_i(n_i; α_i^(r)).

The components of the score statistic S_i(n_i; α_i^(r)) are available from Equation (10), and the information matrix I_c(α_i^(r); n_i) is given by

I_c(α_i^(r); n_i) = D_i − C_i 11′,   (13)

where D_i is the diagonal matrix with jth diagonal entry d_ij = ψ′(α_ij^(r)) − ψ′(n_ij + α_ij^(r)), C_i = ψ′(α_i·^(r)) − ψ′(n_i· + α_i·^(r)), 1 is a column vector of all 1's, and ψ′(s) is the trigamma function (d²/ds²) log Γ(s). As the trigamma function is decreasing, we have d_ij > 0 when n_ij > 0; for the same reason, C_i > 0. Since the representation (13) is preserved under finite sums, it holds, in fact, for the entire sample.
The observed information matrix (13) is the sum of a diagonal matrix, which is trivial to invert, and a symmetric rank-one perturbation, so its inverse is available in closed form.
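Concretely, for the diagonal-minus-rank-one form D − C11′, the Sherman-Morrison identity yields the solution of (D − C11′)x = s without forming or inverting a full matrix. A small sketch (illustrative name):

```python
def solve_diag_minus_rank_one(d, c, s):
    """Solve (D - c * 11') x = s with D = diag(d), via the Sherman-Morrison
    identity:
    x = D^{-1} s + c * (1' D^{-1} s) * D^{-1} 1 / (1 - c * 1' D^{-1} 1)."""
    dinv_s = [sj / dj for sj, dj in zip(s, d)]
    dinv_1 = [1.0 / dj for dj in d]
    factor = c * sum(dinv_s) / (1.0 - c * sum(dinv_1))
    return [xs + factor * x1 for xs, x1 in zip(dinv_s, dinv_1)]
```

Each solve costs O(k) operations rather than the O(k³) of a general matrix inversion, which matters when the update is repeated for every row and every iteration.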

Robustness of empirical Bayes estimate
Empirical Bayes estimation falls outside the formal Bayesian paradigm. However, it has proved to be an effective technique for constructing estimators that perform well under both Bayesian and frequentist criteria. One reason for this, as we will see, is that empirical Bayes estimators tend to be more robust against misspecification of the prior distribution.
The Bayes risk performance of the empirical Bayes estimator is often robust; i.e. its Bayes risk is reasonably close to that of the Bayes estimator, whatever values the hyperparameters attain.
The Bayes risk of the empirical Bayes estimator is only slightly higher than that of the Bayes estimator; the values are given in Table 1, obtained by Monte Carlo simulation [5]. For comparison, we also include the Bayes risk of the unbiased estimator n_ij/n_i·, where n_i· = ∑_{j=1}^k n_ij for a particular i. The risk of the empirical Bayes estimator lies between that of the Bayes estimator and that of the unbiased estimator.
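A minimal Monte Carlo sketch of such a risk comparison is given below, using only the Python standard library; the prior, sample size, and replication count are illustrative and are not those of Table 1, and the function names are invented.

```python
import random

def sample_dirichlet(alpha, rng):
    # Normalised Gamma variates give a Dirichlet draw.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_multinomial(n, p, rng):
    # n draws from the categorical distribution p, tallied into counts.
    counts = [0] * len(p)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(p):
            acc += pj
            if u <= acc:
                counts[j] += 1
                break
        else:
            counts[-1] += 1   # guard against floating-point rounding
    return counts

def monte_carlo_risks(alpha, n, reps, seed=1):
    """Estimate the Bayes risks (summed squared error) of the Bayes estimator
    (n_j + alpha_j)/(n + alpha_.) and the unbiased estimator n_j / n,
    with p drawn from the true Dirichlet prior on each replication."""
    rng = random.Random(seed)
    a_dot = sum(alpha)
    risk_bayes = risk_unbiased = 0.0
    for _ in range(reps):
        p = sample_dirichlet(alpha, rng)
        counts = sample_multinomial(n, p, rng)
        for j, pj in enumerate(p):
            risk_bayes += ((counts[j] + alpha[j]) / (n + a_dot) - pj) ** 2
            risk_unbiased += (counts[j] / n - pj) ** 2
    return risk_bayes / reps, risk_unbiased / reps
```

Under the true prior, the Bayes estimator attains the smallest Bayes risk; an empirical Bayes estimator, with its hyperparameters estimated from the data, typically falls between the two, which is the pattern reported in Table 1.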
The R-codes to perform these calculations are given in Appendix 2. The estimates of the Dirichlet parameters and of the transition probability matrix are given in Tables 3 and 4; the R-codes for those calculations are given in Appendix 3.

Concluding remarks
In this paper, the robustness of the empirical Bayes procedure has been demonstrated (in Section 5), though this should be done more elaborately, since the risk computations cover only some hyperparameter values. More importantly, in Sections 2, 3 and 5 we have worked with a class of Dirichlet priors that is reasonable in the two cases where the hyperparameters are the same for all states and where they differ across states. Under other beliefs we would still keep a Dirichlet prior, but some other choices, unlike our two cases, may be important; there, the computations would have to be modified in the same spirit.