Autoregressive Moving Average Infinite Hidden Markov-Switching Models

Markov-switching models are usually specified under the assumption that all the parameters change when a regime switch occurs. Relaxing this hypothesis and being able to detect which parameters evolve over time is relevant for interpreting the changes in the dynamics of the series, for specifying models parsimoniously, and may be helpful in forecasting. We propose the class of sticky infinite hidden Markov-switching autoregressive moving average models, in which we disentangle the break dynamics of the mean and the variance parameters. In this class, the number of regimes is possibly infinite and is determined when estimating the model, thus avoiding the need to set this number by a model choice criterion. We develop a new Markov chain Monte Carlo estimation method that solves the path dependence issue due to the moving average component. Empirical results on macroeconomic series illustrate that the proposed class of models dominates the model with fixed parameters in terms of point and density forecasts.


INTRODUCTION
Econometricians have developed models with changing parameters at least since Goldfeld and Quandt (1973) introduced the idea of Markov-switching (MS) to model changes in the parameters of a regression equation. This idea consists in enriching the regression with a discrete latent variable process indexing the parameters so that they can switch from one value to another. Hamilton (1989) updated the idea and in particular introduced a filtering algorithm that enables a direct evaluation of the likelihood function. A few years later, Chib (1998) proposed change-point (CP) models, where the transitions from one value to another are not reversible, as a convenient way to model structural breaks at unknown break dates. The estimation of all these models relies on algorithms that are not applicable to models exhibiting path dependence, such as the autoregressive moving average (ARMA) and the generalized autoregressive conditional heteroscedastic (GARCH) models. The difficulty occurs because an unobservable variable at date t (the lagged error term in ARMA models, the lagged conditional variance in GARCH) depends on the entire path of states followed until that date. The computational time thus grows exponentially with the number of time-series observations (for the MS version), and the computation is practically infeasible even for relatively short series.(1)

(1) Several articles propose estimation methods for the MS- and CP-GARCH models, either circumventing the path dependence issue (e.g., Gray 1996; Klaassen 2002; Haas, Mittnik, and Paolella 2004) or tackling the issue upfront (e.g., Francq ...).

Selecting the number of states (or regimes) in these models is an important issue. This is typically done by using a model choice criterion after estimating the model with different numbers of regimes. Bayesian inference by Markov chain Monte Carlo (MCMC) is practical even though the evaluation of the likelihood function is infeasible due to path dependence.
Although several numerical tools are available for computing the marginal likelihood (Ardia, Hoogerheide, and van Dijk 2009), it remains a tedious calculation for complex models (see, e.g., Bauwens, Dufays, and Rombouts 2013). The sticky infinite hidden Markov-switching (sticky IHMS) modeling framework (Fox et al. 2011) allows us to bypass this demanding computation by assuming a Markov chain with a potentially infinite number of regimes, thus encompassing any finite number of them. This setting has been successfully applied in genetics (Beal and Krishnamurthy 2006), visual recognition (Kivinen, Sudderth, and Jordan 2007), and economics, where it has in particular been applied to autoregressive (AR) models (Song 2014; Jochmann 2015) and volatility models (e.g., Jensen and Maheu 2010).
With respect to this background, our contribution is threefold. To begin with, we propose a simple solution to relax the classical assumption of MS models that all the parameters must change whenever a break occurs. To do so, we separate the break dynamics of the mean and variance parameters and use a hierarchical Dirichlet process to drive each of them. Although not based on Dirichlet processes, similar approaches relying on finite-state Markov chains are proposed by Doornik (2013), Goutte (2014), and Eo (2012). While the first two assume a fixed number of regimes, Eo (2012) uses the marginal log-likelihood to select the optimal specification, a method that is impractical for a large number of regimes or of parameters. In comparison, we only need one estimation to determine the optimal number of regimes. Moreover, our forecasts take the uncertainty about the number of regimes into account without resorting to Bayesian model averaging. Our empirical forecasting results indicate that this feature contributes positively to the predictive performance of the proposed models. Our modeling approach, therefore, generalizes the conventional MS approach in two related directions: an unbounded number of states (as in existing IHMS-AR models) and flexibility in the dynamics of the parameters, obtained by allowing different break dates in the parameters of the mean and of the variance.
Second, as our baseline model is an ARMA one, we need an estimation method that works for models subject to path dependence. We develop a new MCMC algorithm that solves this issue. In addition, the sampling of the ARMA parameters is performed with the manifold Metropolis adjusted Langevin algorithm (MALA) introduced by Girolami and Calderhead (2011).
As a third contribution, we introduce in the econometric literature the steppingstone algorithm (see Xie et al. 2011), which provides a new way to estimate the marginal log-likelihood (MLL) from the MCMC output.
The rest of the article is organized in six sections. The model is presented in Section 2 and the estimation procedure in Section 3. The steppingstone algorithm is presented in Section 4. The prior elicitation and the label switching problem are addressed in Section 5. Applications, including forecasting evaluations, are presented in Section 6. The last section contains our conclusions. An online supplementary appendix (SA) provides additional empirical results.

MS-ARMA MODELS
We start by defining the model with a finite number of states in order to discuss its limitations. In Section 2.2, we present the Dirichlet process mixture model and the related Dirichlet process. These processes are the building blocks of the infinite hidden Markov-switching (IHMS) framework on which the IHMS-ARMA model class is built. This model class is defined in Section 2.3. For simplicity, the exposition is limited to the ARMA(1,1) model; the term ARMA is used as shorthand for ARMA(1,1) in the rest of the article. Throughout, y_{1:T} = {y_1, ..., y_T} denotes a generic time series.

The Model With a Finite Number of Regimes
The MS-ARMA model is defined by

y_t = μ_{s_t} + β_{s_t} y_{t-1} + φ_{s_t} ε_{t-1} + ε_t,   ε_t ~ N(0, σ²_{s_t}),   (1)

where μ_{s_t} ∈ ℝ, σ²_{s_t} > 0, |β_{s_t}| < 1 (for stationarity), |φ_{s_t}| < 1 (for invertibility), and β_{s_t} + φ_{s_t} ≠ 0 (no root cancellation). The elements of the vector s_{1:T} = {s_1, ..., s_T} take integer values from 1 to K and denote which regime (also called state hereafter) is active at each period of time. They are assumed to follow a first-order Markov chain with a homogeneous transition probability matrix

P = (p_{ij})_{i,j = 1, ..., K},   (2)

in which p_{ij} denotes the probability of moving from state i to state j, with the constraint that Σ_{j=1}^K p_{ij} = 1 for all i ∈ [1, K]. This setting is similar to that of Hamilton (1989).
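As a concrete illustration, the process in Equation (1) can be simulated directly. The sketch below follows the definition term by term; all parameter values are made up for the example and do not come from the article.

```python
# Illustrative simulation of the MS-ARMA(1,1) process of Equation (1).
import numpy as np

def simulate_ms_arma(T, mu, beta, phi, sigma2, P, rng):
    """Simulate y_{1:T} from an MS-ARMA(1,1) with K regimes.

    mu, beta, phi, sigma2 : length-K arrays of regime parameters
    P : K x K transition matrix, rows summing to one
    """
    K = len(mu)
    s = np.zeros(T, dtype=int)
    y = np.zeros(T)
    eps = np.zeros(T)
    s[0] = rng.integers(K)
    eps[0] = rng.normal(0.0, np.sqrt(sigma2[s[0]]))
    y[0] = mu[s[0]] + eps[0]
    for t in range(1, T):
        s[t] = rng.choice(K, p=P[s[t - 1]])            # first-order Markov chain
        eps[t] = rng.normal(0.0, np.sqrt(sigma2[s[t]]))
        y[t] = mu[s[t]] + beta[s[t]] * y[t - 1] + phi[s[t]] * eps[t - 1] + eps[t]
    return y, s

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0]); beta = np.array([0.3, 0.8])
phi = np.array([0.2, -0.4]); sigma2 = np.array([0.5, 2.0])
P = np.array([[0.95, 0.05], [0.10, 0.90]])
y, s = simulate_ms_arma(500, mu, beta, phi, sigma2, P, rng)
```

Note how the error ε_t depends on the current state only through its variance, while y_t depends on the lagged error: this is the seed of the path dependence issue discussed below.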
Although flexible, the MS-ARMA model has two limitations that motivate our contributions in this article. First, the number of regimes K must be fixed before the estimation. Indeed, the standard inference method consists in estimating the MS-ARMA model for several a priori plausible values of K, and then choosing the optimal number of regimes by using a model choice criterion. This approach requires several estimations that are tedious when K is large. Moreover, it does not take the uncertainty on the number of regimes into account. Second, at each regime switch, all the parameters change simultaneously. However, a break may affect only a subset of the parameters. As a consequence, the standard model may be overparameterized, resulting in imprecise estimates, in particular of the parameters of short regimes, and deteriorated forecast performance.

The Dirichlet Process Mixture Model
To address the two issues, we adopt a setting that potentially allows for an infinite number of regimes and we disentangle the dynamics of the model parameters by allowing for different break dates for the mean function parameters and for the variance. To do so, we rely on the sticky IHMS framework of Fox et al. (2011) (see also Teh et al. 2006) which is based on the Dirichlet process (DP) introduced by Ferguson (1973). We introduce first the DP and highlight its interest, before defining the complete model in the next section.
The Dirichlet process can be used as a nonparametric prior on model parameters. Its most popular use is the DP mixture model. An example of this for an ARMA model is the following:

y_t = μ + β y_{t-1} + φ ε_{t-1} + ε_t,   (3)
ε_t ~ N(0, σ²_t),   (4)
σ²_t ~ G_0,   (5)
G_0 ~ DP(η, H).   (6)

The Dirichlet process is denoted by DP(η, H), where η is a positive "concentration" parameter and H a continuous "base" distribution (e.g., an inverse-gamma in this context). The DP expectation is given by H, which indicates that any draw G_0 of the process can be seen as a distribution over the same support as H. The concentration parameter η controls the dispersion of the process with respect to the base distribution: the larger η, the more similar the distributions G_0 and H.
To simplify the presentation, in the model (3)-(6) we only introduce the possibility of breaks in the variance. A more general mixture model can be straightforwardly designed by adding a DP layer to the mean function parameters (as we do in the next section). Two useful properties of the Dirichlet process, as well as two related representations, are worth mentioning.
First, the DP nonparametric prior is parsimonious. Indeed, from its Pólya urn representation, which allows one to integrate out the DP (see Blackwell and MacQueen 1973), the time-varying variance σ²_t of (5) is distributed, conditionally on the previous realizations, as

σ²_t | σ²_{1:t-1} ~ (η/(η + t - 1)) H + Σ_{i=1}^{t-1} (1/(η + t - 1)) δ_{σ²_i},   (7)

where δ_{σ²_i} is the probability measure concentrated at σ²_i. This result shows that the probability of drawing a new value from H decreases as the time index grows. If we denote the n_i identical values σ²_i by σ̄²_i and assume that at time t, only K different variances have been drawn, then the DP property saying that the "rich regime gets richer" becomes transparent, since Equation (7) is equivalent to

σ²_t | σ²_{1:t-1} ~ (η/(η + t - 1)) H + Σ_{i=1}^{K} (n_i/(η + t - 1)) δ_{σ̄²_i}.   (8)

The probability that σ²_t = σ̄²_i is given by n_i/(η + t - 1) and increases with the number of realizations that have already been assigned to regime i. This feature highlights the time-varying nature of the model variance.
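The "rich regime gets richer" mechanism involves only the cluster counts n_i, so it can be sketched without ever evaluating the base distribution H. The simulation below is purely illustrative; η and T are arbitrary.

```python
# Sketch of the Polya urn scheme behind Equations (7)-(8): each new draw
# either repeats a previous value (probability proportional to its count n_i)
# or is a fresh draw from the base H (probability eta / (eta + t - 1)).
import numpy as np

def polya_urn(T, eta, rng):
    """Return the regime labels and regime counts of T urn draws."""
    labels = [0]
    counts = [1]
    for t in range(1, T):
        probs = np.array(counts + [eta], dtype=float)
        probs /= eta + t                       # t existing draws so far
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                   # birth of a new regime from H
            counts.append(1)
        else:
            counts[k] += 1                     # "rich regime gets richer"
        labels.append(k)
    return labels, counts

rng = np.random.default_rng(1)
labels, counts = polya_urn(1000, eta=2.0, rng=rng)
# The number of distinct regimes grows roughly like eta * log(T), far below T.
```

This makes the parsimony of the DP prior concrete: out of 1000 draws, only a handful of distinct regimes typically appear.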
Second, the Dirichlet process is discrete, which is essential to build a Markov chain with an infinite number of regimes. Sethuraman (1994) shows an alternative to the Pólya urn to construct a Dirichlet process. From two independent sequences of iid random variables {π_i}_{i=1}^∞ and {σ²_i}_{i=1}^∞ built as follows:

v_i ~ Beta(1, η),   π_i = v_i Π_{j=1}^{i-1} (1 - v_j),   (9)
σ²_i ~ H,   G_0 = Σ_{i=1}^∞ π_i δ_{σ²_i},   (10)

it turns out that G_0 is distributed as a Dirichlet process with concentration parameter η and base distribution H. The sequence {π_i}_{i=1}^∞, conveniently written π ~ Stick(η), satisfies Σ_{i=1}^∞ π_i = 1 with probability one and, therefore, defines a distribution over the positive integers. The explicit form of G_0 highlights that the DP support is discrete. From the stick-breaking representation (9)-(10), the conditional predictive density of the DP mixture model (3)-(6) is given by

f(y_t | y_{1:t-1}, μ, β, φ, {σ²_i}_{i=1}^∞, π) = Σ_{i=1}^∞ π_i f_N(y_t | μ + β y_{t-1} + φ ε_{t-1}, σ²_i),   (11)

where f_N(x|a, b) stands for the Normal density function with expectation a and variance b evaluated at x. The predictive density (11) shows that the Dirichlet process makes it possible to move to an infinite number of regimes, but also highlights that the transition probabilities from one state to another are independent of time. Teh et al. (2006) were the first to restore the Markovian property in the state transitions by introducing the infinite hidden Markov-switching framework. Afterwards, Fox et al. (2011) developed the sticky-IHMS setting that copes with the high regime persistence typical in a time-series context.
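The stick-breaking construction truncates naturally for computation: beyond a moderate index, the remaining stick is negligible. The sketch below (illustrative values, inverse-gamma base as in the text) draws a truncated weight sequence and the matching atoms.

```python
# Truncated stick-breaking construction of Equations (9)-(10):
# v_i ~ Beta(1, eta), pi_i = v_i * prod_{j<i} (1 - v_j), atoms drawn from H.
import numpy as np

def stick_breaking(eta, L, rng):
    v = rng.beta(1.0, eta, size=L)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                       # weights pi_1, ..., pi_L

rng = np.random.default_rng(2)
pi = stick_breaking(eta=5.0, L=500, rng=rng)
atoms = 1.0 / rng.gamma(3.0, 1.0 / 2.0, size=500)   # e.g. inverse-gamma base H
# The discrete G_0 puts mass pi[i] on atoms[i]; sum(pi) -> 1 as L grows.
```

With η = 5 and L = 500, the leftover stick mass is numerically zero, which is why the degree-L weak limit approximation used later for estimation works well.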

The IHMS-ARMA Model
The model parameters are now allowed to change at every period:

y_t = μ_t + β_t y_{t-1} + φ_t ε_{t-1} + ε_t,   (12)
ε_t ~ N(0, σ²_t).   (13)

Equations (12) and (13) correspond to an ARMA model with time-varying parameters. To ensure tractability and to mimic the abrupt switches of the parameters in a Markov-switching model, we use different Dirichlet processes as priors on the mean parameters ψ_t = {μ_t, β_t, φ_t} and on the variance parameters:

ψ_t | ψ_{t-1} = ψ̄_i ~ G^ψ_i,   (14)
G^ψ_i ~ DP(α^ψ + κ^ψ, (α^ψ G_0 + κ^ψ δ_{ψ̄_i}) / (α^ψ + κ^ψ)),   (15)
G_0 ~ DP(η^ψ, H^ψ),   (16)

σ²_t | σ²_{t-1} = σ̄²_i ~ G^σ_i,   (17)
G^σ_i ~ DP(α^σ + κ^σ, (α^σ G_1 + κ^σ δ_{σ̄²_i}) / (α^σ + κ^σ)),   (18)
G_1 ~ DP(η^σ, H^σ).   (19)

Referring to the Pólya urn Equation (8), the model parameters at time t, conditionally on the previous ones, can stay in or move to an existing regime, or switch to a new one whose values are generated from the base distribution. The two structures, (14)-(16) and (17)-(19), constitute two hierarchical Dirichlet processes. In contrast with the Pólya urn Equation (8), the model parameters are now driven by several distributions (instead of one in the DPM model) coming from multiple Dirichlet processes, so that the probabilities of moving from one state to another become time-varying and directly depend on the previous parameter. Additionally, as the discrete distribution G_0 is shared among the different Dirichlet processes of Equation (15), the distributions of the mean function parameters share the same possible states. The same comment holds for the variance parameters, as the base distribution G_1 is also common to the Dirichlet processes of Equation (18). The sticky hierarchical Dirichlet framework suggested by Fox et al. (2011) accounts for the persistence in the regimes, unlike the hierarchical Dirichlet structure proposed by Teh et al. (2006). This is done by introducing the parameters κ^ψ and κ^σ, which generate persistence by increasing the probability of picking the parameter of the previous state (hence the qualifier "sticky").
Using the stick-breaking formulation of the sticky IHMS framework (see Fox et al. 2011) leads to the formulation of the sticky IHMS-ARMA process given in Equations (20)-(29), where s^ψ_t and s^σ_t are discrete random variables that can take any positive integer value. Let us define Θ as the set gathering the mean function and variance parameters of all the regimes. From this representation of the IHMS-ARMA model, we can obtain the predictive densities (30) and (31). Equation (30) highlights that the conditional distribution given the current state of the mean function parameters is an infinite mixture of Normal distributions with time-varying probabilities. When we integrate over the current mean state value (s^ψ_t), Equation (31) emphasizes that the model is equivalent to an MS-ARMA model with an infinite number of regimes for the mean function parameters and for the variance. This avoids the two drawbacks of the MS-ARMA model with a finite number of states mentioned at the end of Section 2.1.
Due to the DP assumptions (23) and (27), the expected values of the ith rows (p^ψ_i and p^σ_i) of the infinite-dimensional transition probability matrices are given by

E[p^ψ_{ij}] = (α^ψ π^ψ_j + κ^ψ 1{i = j}) / (α^ψ + κ^ψ),   E[p^σ_{ij}] = (α^σ π^σ_j + κ^σ 1{i = j}) / (α^σ + κ^σ).

These formulas show that the expected self-transition probability is inflated compared to the probability of moving to another state, thanks to the positive sticky parameters κ^ψ and κ^σ. We, therefore, have an infinite-dimensional Markovian structure encouraging regime persistence. Regarding the base distributions H^ψ and H^σ (see (25) and (29)), a third hierarchical layer is introduced in Section 5 to update them with information stemming from the active regimes. As advocated by Song (2014), this layer improves the birth of new regimes by drawing realistic parameters from the common distributions. The types and hyperparameters of these and all other prior distributions are defined in Section 5.
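A quick numerical check makes the self-transition inflation concrete. Under the sticky DP prior of Fox et al. (2011), the expected row is a mixture of the global weights and a point mass on the diagonal; the weights and hyperparameter values below are illustrative.

```python
# Expected row i of the sticky transition matrix:
# E[p_ij] = (alpha * pi_j + kappa * 1{i == j}) / (alpha + kappa).
import numpy as np

def expected_transition_row(pi, i, alpha, kappa):
    row = alpha * pi / (alpha + kappa)
    row[i] += kappa / (alpha + kappa)          # self-transition inflated
    return row

pi = np.array([0.5, 0.3, 0.2])                 # truncated global weights
row0 = expected_transition_row(pi, 0, alpha=1.0, kappa=9.0)
# row0[0] is far above pi[0]: the chain is encouraged to stay in state 0.
```

With α = 1 and κ = 9, the expected self-transition probability of state 0 is 0.95 instead of the 0.5 implied by the global weights, which is the persistence mechanism described in the text.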
The IHMS framework has been used in several empirical applications. Jochmann (2015) and Song (2014) used it to model macroeconomic series with an autoregressive model (thus without path dependence). In volatility modeling, Jensen and Maheu (2010), Jensen and Maheu (2013), Jensen and Maheu (2014), Dufays (2015), and Jin and Maheu (2014) also apply this kind of structure to GARCH, stochastic volatility, and realized volatility processes. All these papers provide empirical evidence in favor of IHMS models compared to the existing alternatives. The proposed sticky IHMS-ARMA model is no exception, as shown by the forecasting results reported in Section 6. Furthermore, we innovate in two directions with respect to those papers. First, the model relies on two sticky IHMS structures, improving its flexibility. Second, we extend the model to include an MA component, and thus face the complications of the path dependence issue. Large sample properties of time-varying ARMA parameters have been the focus of another strand of the literature (Basawa and Lund 2001; Francq and Gautier 2004, among others). However, contrary to Basawa and Lund (2001), our approach does not assume periodic changes in the time-varying ARMA parameters. Our approach also departs from Francq and Gautier (2004) by relaxing the assumption that changes in the mean and variance parameters are simultaneous. As a final remark, the exposition has used an ARMA(1,1) specification with Normal errors, but a higher-order ARMA model or a different innovation distribution can be handled without complications.

ESTIMATION BY MCMC
Two issues need to be addressed to estimate the IHMS-ARMA model: the path dependence issue, and the infinite number of regimes revealed by the predictive distribution shown in Equation (31). We deal with the former in the same way as Dufays (2015) proposed in the GARCH context; for details, see Section 3.1. To tackle the second issue, one can rely on the beam sampler (Van Gael et al. 2008), which augments the posterior distribution with a set of auxiliary variables that truncate the infinite number of states to a finite one. As the posterior distribution marginalized with respect to the auxiliary variables corresponds to the targeted posterior, the MCMC algorithm is correct. A simpler alternative consists in truncating the infinite sum to a large number of states L without embedding auxiliary random variables, a technique known as the degree L weak limit approximation (Ishwaran and Zarepour 2002). Despite the truncation, if the chosen number L is large enough, the error is negligible; see Kurihara, Welling, and Teh (2007) and Fox et al. (2011). In this article, we rely on this approximation as it eases the algorithm implementation and its exposition. However, with slight modifications, the MCMC scheme would also operate with the auxiliary variable approach. The estimation is based on the stick-breaking representation in Equations (20)-(29). Under the degree L weak limit approximation, every row of each transition matrix is truncated to a finite Dirichlet distribution (denoted by Dir), instead of being driven by a Dirichlet process as in Equations (23) and (27). Thus, (23) and (27) are replaced by

p^ψ_i ~ Dir(α^ψ π^ψ_1, ..., α^ψ π^ψ_i + κ^ψ, ..., α^ψ π^ψ_L),
p^σ_i ~ Dir(α^σ π^σ_1, ..., α^σ π^σ_i + κ^σ, ..., α^σ π^σ_L),

respectively.
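Under the weak limit approximation, simulating the truncated transition matrix is a matter of finite Dirichlet draws with the sticky mass added to the diagonal concentration. The sketch below uses illustrative hyperparameter values.

```python
# Degree-L weak limit approximation: each row of the truncated transition
# matrix is a finite Dirichlet draw whose i-th concentration is boosted by
# the sticky parameter kappa.
import numpy as np

rng = np.random.default_rng(3)
L_trunc, alpha, kappa, eta = 20, 1.0, 50.0, 3.0
pi = rng.dirichlet(np.full(L_trunc, eta / L_trunc))   # truncated global weights
P = np.empty((L_trunc, L_trunc))
for i in range(L_trunc):
    conc = alpha * pi + 1e-10                         # guard against exact zeros
    conc[i] += kappa                                  # sticky diagonal boost
    P[i] = rng.dirichlet(conc)
# Rows sum to one; diagonal entries concentrate near kappa / (alpha + kappa).
```

Each row is a proper distribution over the L truncated states, and the large κ makes the sampled chains highly persistent, as intended.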
In the same spirit, the probability distributions π^ψ and π^σ of the stick-breaking representation are truncated to L elements following symmetric Dirichlet prior distributions given the parameters η^ψ and η^σ:

π^ψ ~ Dir(η^ψ/L, ..., η^ψ/L),   π^σ ~ Dir(η^σ/L, ..., η^σ/L).

For notational ease, the sticky IHMS parameters α = {α^ψ, α^σ}, κ = {κ^ψ, κ^σ}, and η = {η^ψ, η^σ} are gathered in the set H_Dir = {α, η, κ}, and the truncated transition matrices are denoted by P^ψ and P^σ. Bayesian estimation is feasible by explicitly treating s^ψ_{1:T} and s^σ_{1:T} as parameters. To simulate the posterior distribution, we use a Gibbs sampler that cycles between eight full conditional distributions, as summarized in Table 1, where μ̄ and the other barred quantities are the hyperparameters of the base distributions H^ψ and H^σ. Details about the prior distributions are provided in Section 5. Except for Steps 1 and 5 in Table 1, the full conditional distributions can be directly simulated, as detailed in Appendix 1. In the rest of this section, we concentrate on the most challenging item of the sampler (Step 1), and we explain how we sample the ARMA parameters in Section 3.2. For model comparisons based on forecasts, we detail how we compute and evaluate the predictive density at any horizon in Section 3.3.

Sampling the State Vector of the Mean Function Parameters
Sampling the state vector is usually done by a forward-backward algorithm (Rabiner 1989; Chib 1998). The algorithm is applicable if there is no path dependence, as in the case of an AR model, but not of an ARMA model. In an AR model, the likelihood of observation t only depends on the current state, while with an MA component, it depends on the whole path of past states. The forward-backward method is then infeasible since the computations grow exponentially with the time index t. For the ARMA model, we adapt the method of Dufays (2015) and follow a two-step procedure. We first sample an entire state vector from an approximate model, which is a modified MS-ARMA model adapted from Klaassen (2002). Such a model is free from the path dependence problem, so that the forward-backward algorithm can be used to sample a state vector. We then apply a Metropolis-Hastings (MH) step to accept or reject the draw. Although an approximate model is used to sample the state vector, the MH step ensures that the targeted distribution remains the posterior of the IHMS-ARMA model.
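For reference, the forward-backward step used on the approximate (path-dependence-free) model can be sketched generically. The code below is a standard forward filtering-backward sampling routine for a hidden Markov chain given per-period likelihood values; the likelihood values are placeholders, not the paper's exact implementation.

```python
# Generic forward filtering-backward sampling (FFBS) of a state vector,
# given lik[t, k] = f(y_t | s_t = k, ...) for a model without path dependence.
import numpy as np

def ffbs(lik, P, p0, rng):
    T, K = lik.shape
    filt = np.zeros((T, K))
    pred = p0
    for t in range(T):                         # forward filtering pass
        filt[t] = pred * lik[t]
        filt[t] /= filt[t].sum()
        pred = filt[t] @ P
    s = np.zeros(T, dtype=int)
    s[-1] = rng.choice(K, p=filt[-1])
    for t in range(T - 2, -1, -1):             # backward sampling pass
        probs = filt[t] * P[:, s[t + 1]]
        probs /= probs.sum()
        s[t] = rng.choice(K, p=probs)
    return s

rng = np.random.default_rng(4)
K, T = 2, 50
P = np.array([[0.9, 0.1], [0.2, 0.8]])
lik = rng.uniform(0.1, 1.0, size=(T, K))       # placeholder likelihood values
s = ffbs(lik, P, np.array([0.5, 0.5]), rng)
```

In the exact ARMA model, lik[t, k] would depend on the full past state path, which is precisely why this routine can only be run on the Klaassen-type approximation and must be followed by an MH correction.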
GARCH and ARMA models are closely related when tackling the path dependence problem. Reliable approximations of the MS-GARCH process have been proposed by Gray (1996), Klaassen (2002), and Haas, Mittnik, and Paolella (2004). We adapt the approach of Klaassen (2002) to the ARMA case. This replaces the unobserved error ε_{t-1} in the ARMA equation by its conditional expectation E_{s^ψ_{t-1}}[ε_{t-1} | y_{1:t-1}, s^ψ_t, Θ, P^ψ, s^σ_{1:t-1}], denoted by ε̃_{t-1,s^ψ_t} below. The approximate model is

y_t = μ_{s^ψ_t} + β_{s^ψ_t} y_{t-1} + φ_{s^ψ_t} ε̃_{t-1,s^ψ_t} + ε_t(s^ψ_t)

(see Appendix 2 for the computation of ε̃_{t-1,s^ψ_t}). The approximation eliminates the path dependence problem since the error term ε_t(s^ψ_t) only depends on the current state and not on the past sequence of states.
We therefore sample a new state vector s̃^ψ_{1:T} from the MS-ARMA approximation using the forward-backward algorithm. The proposed state vector is accepted with the MH probability

min{1, [f(s̃^ψ_{1:T} | y_{1:T}, Θ, s^σ_{1:T}, P^ψ) q(s^ψ_{1:T} | y_{1:T}, Θ, s^σ_{1:T}, P^ψ)] / [f(s^ψ_{1:T} | y_{1:T}, Θ, s^σ_{1:T}, P^ψ) q(s̃^ψ_{1:T} | y_{1:T}, Θ, s^σ_{1:T}, P^ψ)]},

where q(s^ψ_{1:T} | y_{1:T}, Θ, s^σ_{1:T}, P^ψ) is the proposal distribution of s^ψ_{1:T} derived from the forward-backward algorithm. A proposed s^ψ_{1:T} is very likely to be rejected if it is drawn as one block from the MS-ARMA approximation. To ensure good MCMC mixing properties, we sample the state vector in blocks of random sizes drawn from a uniform distribution with lower bound equal to 40 and upper bound equal to 150 (see, e.g., Jensen and Maheu 2010 in a stochastic volatility context). This avoids situations where the MCMC algorithm always rejects the proposed state vector and also enhances the acceptance rate. In our empirical applications, the average of the acceptance rate over the random block sizes is always above 40%.

Sampling the ARMA Parameters
Due to the unobserved lagged error term, the full conditional distribution of the ARMA parameters cannot be simulated directly. Nevertheless, given the mean function parameters and the state vectors, the full conditional distribution of the variances is a product of (conditionally) independent inverse-gamma distributions if the base distribution H σ is itself an inverse-gamma (see Section 5 for prior densities and Appendix 1 for details). We therefore split the block into two pieces and sample first the variances conditionally on all the other parameters. Then the mean function parameters are drawn from their full conditional distributions using an MH algorithm.
Focusing on the mean function parameters, we adapt the Riemannian manifold Metropolis adjusted Langevin algorithm (RMMALA) to define a proposal distribution (see Girolami and Calderhead 2011 and its corrected version, Xifara et al. 2014). The RMMALA is a discretized version of an Itô stochastic differential equation of the Langevin diffusion, which has the full conditional distribution as its unique stationary distribution. Focusing on the mean function parameters ψ_i = {μ_i, β_i, φ_i} of the ith regime and denoting by log f(ψ_i | D) the logarithm of the full conditional density, where D = {s^ψ_{1:T}, s^σ_{1:T}, y_{1:T}} gathers the conditioning quantities, the RMMALA proposal distribution, given the current MCMC realization ψ_i, is defined by

ψ̃_i ~ N(ψ_i + (γ²/2) G(ψ_i)^{-1} ∇ log f(ψ_i | D) + ξ(ψ_i, γ), γ² G(ψ_i)^{-1}),

where G(ψ_i) denotes the Hessian of minus the logarithm of f(ψ_i | D), γ is a discretization tuning constant, and ξ(·, ·) stands for a function of the gradient, the Hessian, and the third derivative of minus the logarithm of f(ψ_i | D). If we assume that the curvature is locally constant, the proposal distribution simplifies to

ψ̃_i ~ N(ψ_i + (γ²/2) G(ψ_i)^{-1} ∇ log f(ψ_i | D), γ² G(ψ_i)^{-1}),   (33)

where ∇ denotes the gradient operator. The proposal distribution (33) is called the simplified manifold MALA (smMALA). Intuitively, the proposal expectation lies in a high-density area of the posterior distribution thanks to the gradient. Moreover, the translation takes the local curvature into account through the Hessian matrix. Girolami and Calderhead (2011) provided several examples in which the proposal distribution (33) makes it possible to update strongly correlated parameters in one block. Although very appealing, the proposal distribution requires the computation of the Hessian, which can fail to be positive definite. To circumvent this issue, as well as to speed up the Hessian computation, we use its Gauss-Newton approximation, as suggested by Vakilzadeh et al. (2014). It is given by

G(ψ_i) ≈ J(ψ_i)' J(ψ_i) + C_prior^{-1},

where J(ψ_i) is the Jacobian of the standardized error terms and C_prior^{-1} is the inverse of the covariance matrix of the prior distribution.

Table 2. Sampling from f(y_{T+1:T+h-1}, s^σ_{T+1:T+h}, s^ψ_{T+1:T+h} | Θ, P^ψ, P^σ, s^ψ_{1:T}, s^σ_{1:T}, y_{1:T})
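An smMALA update of the form (33) can be sketched on a toy target. The example below uses a correlated bivariate Gaussian, for which the exact Hessian is available (so no Gauss-Newton approximation is needed); target, values, and step size are all illustrative.

```python
# Minimal smMALA step on a toy Gaussian target: proposal mean shifted along
# the preconditioned gradient, covariance scaled by the inverse Hessian,
# followed by the usual (asymmetric-proposal) MH correction.
import numpy as np

def log_f(psi, Sinv):
    return -0.5 * psi @ Sinv @ psi             # toy log-density, up to a constant

def smmala_step(psi, Sinv, gamma, rng):
    G = Sinv                                   # Hessian of -log f (exact here)
    Ginv = np.linalg.inv(G)
    cov = gamma**2 * Ginv

    def prop_mean(frm):
        grad = -Sinv @ frm                     # gradient of log f at frm
        return frm + 0.5 * gamma**2 * Ginv @ grad

    prop = rng.multivariate_normal(prop_mean(psi), cov)

    def logq(to, frm):                         # proposal log-density (unnormalized)
        d = to - prop_mean(frm)
        return -0.5 * d @ np.linalg.inv(cov) @ d

    log_alpha = (log_f(prop, Sinv) - log_f(psi, Sinv)
                 + logq(psi, prop) - logq(prop, psi))
    return prop if np.log(rng.uniform()) < log_alpha else psi

rng = np.random.default_rng(5)
Sinv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
psi = np.zeros(2)
for _ in range(200):
    psi = smmala_step(psi, Sinv, gamma=1.0, rng=rng)
```

Because the curvature G matches the target exactly here, the preconditioned proposal moves both correlated coordinates jointly, which is the one-block updating behavior described in the text.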
The tuning constant γ also has an impact on the proposal density and therefore on the MCMC mixing properties. If its value is too large, the MCMC sampler can be stuck for long periods, while if it is too small, the proposed update is likely to be accepted but the exploration of the posterior support is very slow. We solve this issue by adapting the rule of Atchadé and Rosenthal (2005): at the rth MCMC iteration, the constant γ_r is updated as

γ_r = max{ζ, γ_{r-1} + (α - α_opt)/(0.6 r)},

where ζ is a very small positive constant to avoid a negative value of γ_r, α is the current acceptance rate of the MH algorithm, and α_opt is the user-defined target. In the empirical applications, α_opt is set to 40% and ζ to 10^{-8}.
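The adaptation rule is a one-liner in code. The sketch below transcribes the update as written above (with the 0.6 r factor in the denominator); the starting value and the simulated acceptance rates are illustrative.

```python
# Adaptive tuning of gamma following the rule of Atchade and Rosenthal (2005)
# as used in the text: gamma_r = max(zeta, gamma_{r-1} + (acc - target)/(0.6*r)).
def update_gamma(gamma_prev, r, acc_rate, target=0.40, zeta=1e-8):
    return max(zeta, gamma_prev + (acc_rate - target) / (0.6 * r))

gamma = 0.5
for r in range(1, 101):
    acc = 0.2                                  # persistently low acceptance...
    gamma = update_gamma(gamma, r, acc)        # ...shrinks the step size
```

The diminishing 1/r factor makes the adaptation vanish as the chain progresses, so the sampler's stationary distribution is preserved asymptotically.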

Predictive Densities and the Continuously Ranked Probability Score

The usefulness of a model can be assessed by its forecasting ability. We explain how to obtain draws from the predictive density f(y_{T+h} | y_{1:T}), where h is the forecast horizon. The predictive density can be estimated by

f̂(y_{T+h} | y_{1:T}) = (1/R) Σ_{r=1}^R f(y_{T+h} | Θ^(r), P^{ψ(r)}, P^{σ(r)}, s^{ψ(r)}_{1:T+h}, s^{σ(r)}_{1:T+h}, y^{(r)}_{T+1:T+h-1}, y_{1:T}),

where the R draws (indexed by the superscript r) come from the posterior distribution f(Θ, P^ψ, P^σ, s^ψ_{1:T+h}, s^σ_{1:T+h}, y_{T+1:T+h-1} | y_{1:T}). Consequently, the computation of the predictive density requires sampling future observations and states in the Gibbs sampler sketched in Table 1. To do that, we add a step in the sampler that draws the future states and observations. Table 2 documents how this is done.
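The Monte Carlo average above is straightforward once posterior draws are available. The sketch below uses stylized stand-ins for the Gibbs output (simple parametric draws instead of the actual sampler of Table 1) to show the mechanics.

```python
# Monte Carlo estimate of a predictive density: average the conditional
# density of the future observation over posterior draws. The "posterior
# draws" below are illustrative placeholders, not real Gibbs output.
import numpy as np

rng = np.random.default_rng(9)
R = 3000
mu_draws = rng.normal(0.0, 0.1, size=R)        # pretend posterior draws of the mean
sig2_draws = 1.0 / rng.gamma(5.0, 1.0 / 4.0, size=R)   # and of the variance

def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

y_future = 0.3                                  # point at which to evaluate
pred_dens = np.mean(norm_pdf(y_future, mu_draws, sig2_draws))
```

Averaging the conditional density over draws (rather than plugging in point estimates) is what lets the predictive density reflect parameter and regime uncertainty.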
The predictive density evaluated at the realized data is a prominent metric to assess the predictive performance of a model, but other loss functions exist and can provide additional information about the performance of a model. For this purpose, we also use the mean squared forecast error (MSFE) and the continuously ranked probability score (CRPS) popularized by Gneiting and Raftery (2007) in our empirical applications.
The CRPS is based on the Brier probability score and relies on the idea that any density forecast f of a random variable Y induces a probability forecast for the binary event {Y ≤ z} through its cumulative distribution function F, that is,

CRPS(F, y) = ∫_{-∞}^{+∞} (F(z) - 1{y ≤ z})² dz.   (34)

The score function is strictly proper if the random variable Y has a finite first moment, and takes an infinite value otherwise. The value of the function (34) can be computed by simulation since

CRPS(F, y) = E_F |Y - y| - (1/2) E_F |Y - Y'|,

where Y and Y' are independent random variables from the same distribution F (see, e.g., Gneiting and Raftery 2007). The loss function assesses the forecast performance of a model through the distance between the predictive cumulative distribution function of the chosen model and the empirical one.
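The simulation identity makes the CRPS easy to compute from predictive draws. The sketch below applies it to draws from a standard Normal forecast evaluated at y = 0 (an illustrative choice, for which the exact CRPS is about 0.234).

```python
# CRPS via the simulation identity CRPS(F, y) = E|Y - y| - 0.5 * E|Y - Y'|,
# with Y, Y' independent draws from the predictive distribution F.
import numpy as np

def crps_sample(draws, y):
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - y))
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - term2

rng = np.random.default_rng(6)
draws = rng.normal(0.0, 1.0, size=2000)         # predictive draws, e.g. from Table 2
score = crps_sample(draws, 0.0)
```

Lower values are better under this loss convention, and the estimator only needs draws from the predictive density, which the Gibbs sampler already produces.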
In forecast evaluations, we are interested in the mean of the score function for the k-ahead predictive density over a time window, that is,

S̄^τ_k(f̂_{k+1:k+τ}, y_{k+1:k+τ}) = (1/τ) Σ_{t=1}^τ S(f̂_{t+k}, y_{t+k}),

where f̂_{t+k} denotes the value of the predictive probability density function and y_{t+k} the observed value of the time series. Two complementary uses of the CRPS are worth mentioning. First, as in Gneiting and Ranjan (2011), the term inside the integral in (34) can be plotted with respect to z to see where the cumulative distribution function deviates from the empirical one. Second, as in Amisano and Giacomini (2007), one can derive a statistical test to assess whether a model M1 produces a better score function than an alternative M2. The test compares the means of the score functions through the statistic

t = √τ Δ̄_k / σ̂_k,   Δ̄_k = (1/τ) Σ_{t=1}^τ Δ_{t,k},

where σ̂²_k is a heteroscedasticity and autocorrelation consistent estimator of the variance of Δ_{t,k}, and Δ_{t,k} = S(f̂^{M1}_{t+k}, y_{t+k}) - S(f̂^{M2}_{t+k}, y_{t+k}). In the predictive exercises, we apply this test to our three score functions.
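A score-comparison statistic of this kind can be sketched generically. The code below scales the mean score difference by a Bartlett-kernel HAC standard error; the score-difference series is a simulated placeholder, and the lag truncation choice is an assumption for the example.

```python
# Generic score-comparison t-statistic in the spirit of Amisano and
# Giacomini (2007): mean score difference over a HAC standard error.
import numpy as np

def hac_variance(d, lags):
    """Bartlett-kernel HAC estimate of the long-run variance of d."""
    d = d - d.mean()
    v = np.mean(d * d)
    for j in range(1, lags + 1):
        gamma_j = np.mean(d[j:] * d[:-j])
        v += 2.0 * (1.0 - j / (lags + 1)) * gamma_j
    return v

rng = np.random.default_rng(10)
tau = 400
delta = rng.normal(0.05, 1.0, size=tau)        # placeholder S(M1) - S(M2) series
tstat = np.sqrt(tau) * delta.mean() / np.sqrt(hac_variance(delta, lags=4))
# |tstat| beyond the Normal critical value signals a significant difference.
```

The HAC correction matters because k-step-ahead score differences are serially correlated by construction when k > 1.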

MODEL SELECTION
In Bayesian inference, model comparison is often carried out through Bayes factors, which require the computation of the marginal likelihood (ML). In this section, we adapt the steppingstone sampling method (see Xie et al. 2011) used in phylogenetics to estimate the marginal likelihood from an MCMC output. The approach relies on multiple importance sampling steps and is actually a generalization of the bridge sampling algorithm (e.g., Fruhwirth-Schnatter 2004).
The ML is the normalizing constant of the posterior distribution and, gathering all the random parameters of the model in Θ, is defined as f(y_{1:T}) = ∫ f(y_{1:T} | Θ) f(Θ) dΘ. As this integration is intractable for most models, the importance sampling (IS) approach makes use of a proposal distribution defined on the support of Θ to obtain a simulation estimator of the marginal likelihood:

f̂(y_{1:T}) = (1/N) Σ_{r=1}^N [f(y_{1:T} | Θ^r) f(Θ^r)] / q(Θ^r | y_{1:T}),

where the realizations {Θ^r}_{r=1}^N constitute an ergodic sample from the proposal density q(Θ | y_{1:T}). The IS estimator defined above is almost surely consistent under conditions stated by Geweke (1989). The precision of the IS estimator depends on the quality of the proposal distribution, which should approximate the posterior well enough that the variance of the ratio of the posterior to the proposal is finite and as small as possible. This is typically very hard to achieve when Θ is of high dimension.
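The IS estimator can be checked on a toy conjugate model where the marginal likelihood is known in closed form. The example below (Normal data, Normal prior on the mean, posterior used as an overdispersed proposal) is entirely illustrative.

```python
# Importance-sampling estimate of the marginal likelihood on a toy model:
# y_i ~ N(theta, 1), theta ~ N(0, 1), so the exact ML is Gaussian and known.
import numpy as np

rng = np.random.default_rng(7)
n = 20
y = rng.normal(0.5, 1.0, size=n)

def log_lik(theta):
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.sum((y[None, :] - theta[:, None])**2, axis=1))

def log_norm(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m)**2 / v

# Conjugate posterior theta | y ~ N(m, v); used (overdispersed) as IS proposal.
m, v = y.sum() / (n + 1), 1.0 / (n + 1)
theta = rng.normal(m, np.sqrt(2 * v), size=5000)
logw = log_lik(theta) + log_norm(theta, 0.0, 1.0) - log_norm(theta, m, 2 * v)
log_ml_is = np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()

# Exact marginal log-likelihood for comparison:
log_ml = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
          - 0.5 * (y @ y - y.sum()**2 / (n + 1)))
```

With a proposal this close to the posterior the estimator is accurate; in high dimension, finding such a proposal is exactly the difficulty the steppingstone method is designed to avoid.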
Instead of applying importance sampling with a single proposal distribution, the steppingstone algorithm considers a sequence of IS steps using tempered posterior distributions as proposals. Let

f_x(Θ | y_{1:T}) = f(y_{1:T} | Θ)^{φ(x)} f(Θ) / Z_x,   x = 0, 1, ..., p,

where φ(x) is an increasing tempering function with φ(0) = 0 and φ(p) = 1, and Z_x = ∫ f(y_{1:T} | Θ)^{φ(x)} f(Θ) dΘ stands for the normalizing constant (or marginal likelihood) of the tempered posterior distribution f_x. When x = 0, the distribution coincides with the prior, and when x = p, it coincides with the targeted posterior distribution. The steppingstone method thus builds a sequence of bridging distributions from the prior to the posterior. Note that the ML is given by Π_{k=1}^p (Z_k / Z_{k-1}) when the prior distribution is proper (i.e., integrates to one, so that Z_0 = 1). Then, using the IS approach, we have

Z_k / Z_{k-1} ≈ (1/N) Σ_{r=1}^N f(y_{1:T} | Θ^r)^{φ(k) - φ(k-1)},   (39)

where the realizations {Θ^r}_{r=1}^N form an ergodic sample drawn from f_{k-1}(Θ | y_{1:T}). For a given function φ(x), Equation (39) provides a simple tool for sequentially computing the ML of any model by MCMC. Indeed, one just needs to adapt the MCMC scheme to obtain draws from the distribution f_x(Θ | y_{1:T}).
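The telescoping product of ratios in (39) can also be checked on the toy conjugate model, because its tempered posteriors are Gaussian in closed form and can be sampled exactly. The linear tempering schedule below is an illustrative choice, not the paper's.

```python
# Steppingstone estimate of the marginal likelihood on a toy model:
# y_i ~ N(theta, 1), theta ~ N(0, 1); tempered posteriors are Gaussian.
import numpy as np

rng = np.random.default_rng(11)
n = 20
y = rng.normal(0.5, 1.0, size=n)
S, Q = y.sum(), y @ y

def log_lik(theta):
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * (Q - 2 * theta * S + n * theta**2)

phis = np.linspace(0.0, 1.0, 21)               # simple linear tempering schedule
N = 4000
log_ml_ss = 0.0
for k in range(1, len(phis)):
    prec = 1.0 + phis[k - 1] * n               # tempered posterior precision
    mean = phis[k - 1] * S / prec              # tempered posterior mean
    theta = rng.normal(mean, 1.0 / np.sqrt(prec), size=N)
    logw = (phis[k] - phis[k - 1]) * log_lik(theta)   # weights of Equation-(39) type
    log_ml_ss += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()

exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * (Q - S**2 / (n + 1)))
```

Because each ratio only bridges a small tempering increment, every IS step has well-behaved weights, which is the key advantage over a single prior-to-posterior jump.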
The steppingstone method is unsatisfactory in the sense that the accuracy of the estimator depends on a tempering function φ(x) that is model-dependent. In the literature, different functions have been proposed (Xie et al. 2011; Herbst and Schorfheide 2012), and a consensus has emerged that more IS steps should be devoted to very small values of φ. However, this does not help much in selecting the function. Instead of fixing it, we use sequential Monte Carlo (SMC) theory to make the steppingstone algorithm free of this choice. Indeed, although not recognized by its authors, the steppingstone algorithm is actually an adaptation to MCMC methods of the ML computation proposed in the SMC sampler (see Del Moral, Doucet, and Jasra 2006). We suggest, therefore, building the tempering function from one IS step to the next by using the effective sample size (ESS) criterion (see Doucet, de Freitas, and Gordon 2001). The ESS is defined as N / Σ_{r=1}^N W²_r, where W_r is the normalized weight, given in the present context by W_r ∝ f(y_{1:T} | Θ^r)^{φ(k) - φ(k-1)}. Hence, the ESS is a function of φ(k), and we set the value of φ(k) by solving the following optimization program:

φ(k) = arg min_{φ ∈ (φ(k-1), 1]} | ESS(φ) - c N |,

where c ∈ (0, 1) is a fixed target fraction. This optimization is standard in the SMC literature (see Jasra et al. 2011 or Dufays 2016) and avoids the difficult choice of the tempering function.
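In practice the ESS criterion is solved by a one-dimensional search over the tempering increment. The sketch below uses bisection and placeholder log-likelihood values; the 90% target fraction and the tolerance are illustrative assumptions.

```python
# ESS-based choice of the next tempering value: increase phi until the
# effective sample size of the incremental weights hits a target fraction.
import numpy as np

def ess(loglik, delta):
    """ESS of weights W_r proportional to exp(delta * loglik_r)."""
    logw = delta * loglik
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w**2)

def next_phi(loglik, phi_prev, target_frac=0.9, tol=1e-8):
    n = len(loglik)
    if ess(loglik, 1.0 - phi_prev) >= target_frac * n:
        return 1.0                             # can jump straight to the posterior
    lo, hi = phi_prev, 1.0
    while hi - lo > tol:                       # bisection on the ESS curve
        mid = 0.5 * (lo + hi)
        if ess(loglik, mid - phi_prev) >= target_frac * n:
            lo = mid
        else:
            hi = mid
    return lo

rng = np.random.default_rng(8)
loglik = rng.normal(-100.0, 5.0, size=1000)    # placeholder log-likelihood values
phi = next_phi(loglik, phi_prev=0.0)
```

Each increment is thus chosen so that the weights never degenerate, which is exactly the adaptivity that removes the need to fix φ(x) a priori.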
As a final note, marginal likelihoods of Markov-switching models can be biased if computed by the MCMC technique of Chib (1995). This issue, which stems from the label switching problem, is addressed by Frühwirth-Schnatter (2004), who introduces the bridge sampling method. Nevertheless, the accuracy of bridge sampling depends heavily on a user-defined proposal distribution and can, therefore, be difficult to control in practice, especially in high dimensions. Steppingstone sampling avoids the problem since no extra distribution is required and since, unlike Chib (1995), it does not evaluate the posterior density at a specific value of the parameters.

PRIOR ELICITATION AND LABEL SWITCHING PROBLEM
Table 3 reports the prior distributions and their hyperparameters. Regarding the sticky IHMS parameter set H Dir, the priors are conjugate, as suggested by Fox et al. (2011). Table 3 suggests two different values for the hyperparameters of the persistence variables ρ ψ and ρ σ. The first one (ω MS/CP = 10) implies weak state persistence: the break dynamics is likely to switch rapidly from one state to another, hence the name Markov-switching (MS) type. The second value (ω MS/CP = 1000) induces high state persistence and thus a kind of change-point behavior, hence the name change-point (CP) type; the posterior results are then likely to be easier to interpret, as only a few changes will occur in the break dynamics. The two cases are considered in the empirical applications, and we can discriminate between the two prior types by the marginal likelihood criterion.
The prior on the parameters of the ARMA mean function is a hierarchical Normal-Wishart distribution that provides an additional layer on the base distributions of the Dirichlet processes. The marginal prior density of the AR and MA parameters, taking into account their truncation to the interval (−1, +1), is almost uniform on that interval (with mean 0 and standard deviation 0.57). By gathering information from existing regimes, this structure facilitates the birth of new regimes without complicating the MCMC simulation (since the prior is conjugate). A similar idea is applied to the variances.
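The persistence implications of the two hyperparameter values ω MS/CP = 10 and ω MS/CP = 1000 discussed above can be illustrated by simulation. The sketch below draws self-transition probabilities from a sticky Dirichlet transition row in the spirit of the weak limit approximation of Fox et al. (2011); the concentration value `alpha = 1` and the exact parameterization are illustrative assumptions, not the article's specification.

```python
import numpy as np

def sticky_self_transition(omega, n_states=10, alpha=1.0, n_draws=20000, seed=0):
    """Monte Carlo mean of the self-transition probability pi_jj when row j
    of the transition matrix is Dirichlet(alpha/L, ..., alpha/L + omega, ...,
    alpha/L): a symmetric Dirichlet base with stickiness omega added to the
    self-transition entry (weak limit approximation with L = n_states)."""
    rng = np.random.default_rng(seed)
    conc = np.full(n_states, alpha / n_states)
    conc[0] += omega  # stickiness on the self-transition
    rows = rng.dirichlet(conc, size=n_draws)
    return rows[:, 0].mean()
```

Under this parameterization, ω = 10 gives a prior mean self-transition probability of about 0.92 (regimes lasting roughly a dozen periods on average, an MS-type behavior), while ω = 1000 pushes it above 0.999, i.e., change-point-like persistence.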
The posterior distribution is invariant to the labeling of the state vector. If a label switch occurs in the state vectors during the MCMC simulation, the usual summary statistics, such as the posterior means and standard deviations, are misleading. Indeed, these statistics depend on the labels of the states. Different solutions exist to solve this issue. The prior distributions can be chosen to rule out the label switching problem by constraining the support of the parameters given the regimes. However, finding appropriate constraints that preclude all possible switches without heavily truncating the posterior distribution can be difficult. Otherwise, as advocated by Geweke (2007), the label switching issue can be completely ignored in the MCMC simulations. In this case, either a loss function is used to sort the posterior draws into one specific label ordering (see, e.g., Marin, Mengersen, and Robert 2005; Bauwens, Dufays, and Rombouts 2013) or the reported summary statistics must be label invariant (see Song 2014; Dufays 2015). We apply the latter approach. For instance, the posterior draws deliver, among other things, the probabilities of having a given number of regimes for the mean function parameters and the variance. From the MCMC sample, we can also compute the posterior means and the credible intervals of the model parameters as they evolve through time. These summary statistics do not depend on a specific labeling and can, therefore, be reported without bothering about the label switching issue.
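The label-invariant summaries can be computed directly from the MCMC output. The sketch below illustrates the idea on generic arrays of sampled state paths and regime-specific parameters; the array layout is an assumption for illustration, not the article's storage format.

```python
import numpy as np

def label_invariant_summaries(state_draws, param_draws):
    """state_draws: (R, T) integer regime labels, one row per MCMC draw;
    param_draws: (R, L) regime-specific parameter values per draw.

    Returns (i) the posterior probabilities of the number of active regimes
    and (ii) the posterior mean of the time path theta_{s_t}; both quantities
    are unchanged by any relabeling of the states within a draw."""
    R, T = state_draws.shape
    # Number of distinct regimes actually visited in each draw.
    n_regimes = np.array([len(np.unique(s)) for s in state_draws])
    counts = np.bincount(n_regimes)
    probs = {k: counts[k] / R for k in range(len(counts)) if counts[k] > 0}
    # Map each draw's state path to that draw's own parameters.
    paths = np.take_along_axis(param_draws, state_draws, axis=1)
    return probs, paths.mean(axis=0)
```

Two draws that differ only by a permutation of the labels (e.g., states [0, 0, 1] with parameters [1, 2] versus states [1, 1, 0] with parameters [2, 1]) produce identical summaries, which is exactly the invariance exploited in the text.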

APPLICATIONS
In this section, several estimation and forecasting results for the sticky IHMS-ARMA model are provided for the quarterly U.S. GDP growth rate and a monthly U.S. inflation series. Afterwards, a forecasting exercise on 18 macroeconomic series is reported to compare the performances of the sticky IHMS-ARMA model and of the ARMA model with fixed parameters. Regarding the MCMC implementation, the starting values are the maximum likelihood estimates of the model without any break. The burn-in period uses 7500 iterations, and the next 22,500 draws are stored to compute the posterior results. The truncation level L of the degree-L weak limit approximation is fixed at ten. Additional results are provided in the SA.

U.S. GDP Growth Rate
Hamilton (1989) applied the MS model using an AR mean function to the U.S. quarterly GDP growth rate series. We revisit this example by using the sticky IHMS-ARMA model and its AR version on the series from 1947Q2 to 2014Q1 (268 observations). The graph of the series (visible in Figure 1) suggests that breaks should be taken into account, in particular due to the well-known great moderation phenomenon.
Table 4 (see also Table 2 of the SA) provides overwhelming evidence (with MLL increases of at least 23 points) in favor of the IHMS models compared to the ARMA model with fixed parameters. Moreover, the IHMS models with prior hyperparameters implying weak regime persistence (MS-prior) slightly improve the fit over the models implying high regime persistence (CP-prior), whatever the dynamic specification (AR or ARMA). The differences of the MLL are equal to 1.53 and 1.23, respectively. Assuming prior odds equal to unity, these values imply posterior probabilities of the models with short regime persistence amounting to 0.82 and 0.77 compared to their CP alternatives. There is similar evidence that, in the IHMS class, the ARMA specification dominates the corresponding AR one (with average improvements of 1.94 and 1.64, leading to posterior probabilities of 0.87 and 0.84 in favor of the ARMA function). Whatever the model, the MLL values of the ten replications are very similar. Focusing on the IHMS-ARMA model for the two kinds of prior, Table 5 reports the posterior probabilities of the number of regimes for the mean function parameters and the variance (see also Table 3 in the SA). As expected, the uncertainty on the number of regimes is much greater for the MS-prior than for the CP-prior. With both types of prior, there is no evidence of breaks in the mean function parameters. The variance mainly evolves over time through two different states for the CP-prior, while more regimes are found for the MS one. Figure 1 displays the series together with the probabilities of a break in the previous or in the next year, computed from the posterior samples. The probability of having a break in the mean function parameters is null for the CP-prior. For the MS-prior, there are small positive break probabilities at the beginning of the series, in 1971, at the beginning of the great moderation era, and during the financial crisis.
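The posterior model probabilities quoted above follow from the MLL differences under equal prior odds; a one-line helper makes the arithmetic explicit.

```python
import math

def posterior_prob_from_mll(delta_mll):
    """Posterior probability of model A versus model B under equal prior
    odds, given delta_mll = log ML_A - log ML_B, so that the Bayes factor
    is exp(delta_mll): p = BF / (1 + BF) = 1 / (1 + exp(-delta_mll))."""
    return 1.0 / (1.0 + math.exp(-delta_mll))
```

For instance, `posterior_prob_from_mll(1.53)` is about 0.82 and `posterior_prob_from_mll(1.23)` is about 0.77, matching the probabilities reported above for the MS-prior versus CP-prior comparisons.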
Regarding the breaks in the variance, the MS-prior leads to much more instability than the CP one. However, both priors agree on a quiet period starting with the great moderation and ending at the financial crisis.
The corresponding estimated time-varying parameters are displayed in Figure 2, where instead of showing the graph for the constant term (μ t ) of the ARMA equation, we show the implied "unconditional" expectation μ t /(1 − β t ). Overall, both types of prior deliver similar results. The mean function parameters are relatively constant over time, especially with the CP-prior. The variance evolves with changes that are smooth or sharp depending on when they occur. As expected given the differences between the priors, the variance dynamics obtained with the CP-prior is close to that of a change-point model, as the level is more or less constant before and after the great moderation, with a blip during the financial crisis. In each panel, the corresponding posterior median of the ARMA model with fixed parameters is also shown. Obviously, the variance of this model cannot accommodate both the high and low volatility levels of the error terms and, therefore, its posterior estimate lies in between. For the mean function parameters, there are small differences between the posterior estimates of the simple ARMA and IHMS-ARMA models, even though there is no break in the mean function parameters of the latter. The occurrence of breaks in the variance, combined with the fact that the variance and the mean function parameters are not independent a posteriori, explains such differences.
We compare the out-of-sample forecast performances of the five models appearing in Table 4, to which we add three models: an AR(16) model with fixed parameters (this lag order resulting from the AIC criterion) and the restricted IHMS-ARMA models in which the mean parameters jointly move with the variance parameter. The forecasts start in the first quarter of 1987 (at 60% of the sample) and are computed until the end of the sample, adding one new observation at a time. At each step, all models are reestimated, and predictive densities as well as the CRPS are computed for different horizons. The mean squared forecast errors (MSFE) using the predictive means, the average values (over the forecast period) of the predictive densities (APD), and the CRPS values evaluated at the observed data are reported in Table 6. The main conclusions from these results follow.
• For each criterion and forecast horizon, all the IHMS models substantially dominate the ARMA model with fixed parameters. The AR(16) model performs better than this ARMA model but still performs less well than the IHMS models (with very few minor exceptions).
• Regarding the statistical test (36), for the APD and the CRPS, the flexible IHMS models statistically outperform the ARMA model at all horizons (except for the APD in the case of the four-year horizon of the IHMS-ARMA model with MS-prior), while for the MSFE, most tests are significant only for forecast horizons up to two years. Table 4 in the SA contains the test results comparing the flexible IHMS models with the AR(16) model and shows, not surprisingly in view of the previous comment, that there are fewer cases of significant differences than in the comparisons with the ARMA model.
• Concentrating on the unrestricted IHMS models, the ARMA versions slightly dominate their AR counterparts, and similarly the models with the CP-prior dominate those with the MS-prior.
• Incorporating an MA term is more relevant than including a flexible dynamic structure for the breaks, since the restricted IHMS-ARMA model with the CP-prior setting is often the best competitor.
Going one step further, the integrand of Equation (35) is helpful to assess when a model produces better forecasts than another. For example, a model could forecast the growth rate very well when the U.S. economy is in expansion and be a bad predictor in recessions. To evaluate this, Figure 3 displays the integrand of Equation (35) with respect to z for one-step-ahead predictions. We observe that the integrand of the ARMA model envelops the integrands of all the IHMS models, meaning that whatever the state of the U.S. economy, the distance between the cumulative distribution functions of the IHMS models and the empirical one is always smaller than for the ARMA model. Finally, Figure 4 shows the differences between the one-quarter-ahead predictive densities (without taking logarithms) of the IHMS-ARMA (CP-prior) model and the fixed-parameter ARMA one, both evaluated at the realized outcome. The gains are spread over the entire period and cannot, therefore, be associated with a specific subperiod.
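For completeness, the CRPS and the integrand just discussed can be estimated directly from draws of a predictive distribution. The sketch below uses the standard sample-based CRPS identity E|X − y| − ½ E|X − X′| and is a generic illustration, not the article's exact implementation of Equation (35).

```python
import numpy as np

def crps_from_draws(draws, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|, equivalent to
    integrating (F(z) - 1{y <= z})^2 over z for the empirical CDF F."""
    draws = np.sort(np.asarray(draws, dtype=float))
    n = draws.size
    term1 = np.mean(np.abs(draws - y))
    # E|X - X'| via the sorted-sample identity for the Gini mean difference.
    coeffs = 2 * np.arange(1, n + 1) - n - 1
    term2 = 2.0 * np.sum(coeffs * draws) / (n * n)
    return term1 - 0.5 * term2

def crps_integrand(draws, y, z):
    """Integrand (F(z) - 1{y <= z})^2, useful to see at which outcomes z a
    predictive distribution loses accuracy (cf. the discussion of Figure 3)."""
    F = np.mean(np.asarray(draws, dtype=float)[:, None] <= z, axis=0)
    return (F - (y <= np.asarray(z, dtype=float))) ** 2
```

A degenerate forecast that puts all its mass at a point c has CRPS equal to |c − y|, which provides a quick sanity check of the estimator.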

U.S. Inflation
The inflation measure is computed from the personal consumption expenditure deflator and spans the period from February 1959 to November 2012 (646 observations). Table 7 reports the MLL values of several models, including the ARMA model with fixed parameters. In addition to these results, Table 5 of the SA includes the results of the random walk process and of the AR(12) chosen by the AIC, but these models are dominated by the ARMA process with fixed parameters. As for the U.S. GDP growth rate, there is strong evidence in favor of the IHMS models with respect to the simple ARMA one (differences of MLL larger than 62). Moreover, the models with the CP-type prior, implying high regime persistence, have the highest MLL (differences equal to 2.96 and 2.24, leading to posterior probabilities of 0.95 and 0.90). The inclusion of MA terms in the IHMS models increases slightly the average MLL (by 1.73 and 1.01), but this conclusion should be tempered (especially for the model with MS-type prior) given the overlap of the min-max ranges. Table 8 reports the posterior probabilities of the number of regimes of the restricted IHMS-ARMA models (see also Table 6 of the SA, which includes the IHMS-ARMA models based on a unique HDP). The mean function parameters and the variance clearly experience structural breaks. Furthermore, the uncertainty on the number of regimes is visible, especially if the MS-prior is used. This highlights the limitation of the standard method of selecting the model with a fixed number of regimes exhibiting the highest marginal likelihood. With the CP-prior, two regimes seem sufficient for each set of parameters, while three to five seem useful with the MS-prior.
The estimated break probabilities over the past and the next year are displayed in Figure 5. First of all, the IHMS model with a priori long-lasting regimes (CP-prior) detects only one break for the mean function parameters, which happens in the early 2000s. On the contrary, many instabilities in the mean function parameters are visible for the IHMS model with a priori weakly persistent regimes. Some of these changes correspond to known historical episodes. For example, the breaks detected around 1973 and 1979 capture the oil crisis era and the change of the monetary policy of the Fed, both marked by a rise of U.S. inflation (see the top right graph). The break dynamics of the variance is more volatile in both configurations. Even if fewer switches are detected by the IHMS model with a priori high persistence, the two graphs do not show very different results. Two quiet periods, spanning from 1967 to 1973 and from 1991 to 1998, do not exhibit breaks. Figure 6 documents the time-varying posterior medians of the model parameters along with the 70% credible intervals. Interestingly, the mean function parameters and the variance do not exhibit the same dynamics, emphasizing the relevance of disentangling their break dynamics. With the CP-prior, until the break of the early 2000s the mean dynamics corresponds to a highly persistent ARMA process (with the AR parameter slightly below 1); then until the end of the sample, it corresponds to a weakly persistent ARMA process (with the AR parameter close to 0.2). The ARMA model with fixed parameters obviously cannot capture the break. Its estimated AR coefficient is also close to 1 over the entire sample (but slightly less than in the IHMS model during the first regime of the latter). The MA coefficient of the IHMS model is clearly different from zero in both regimes, with a strong negative value (around −0.75) during the first regime, and a positive value (about 0.3) in the second regime. The estimated MA parameter of the simple ARMA model (at 0.6) is dominated by the first regime.
In the IHMS-ARMA model with a priori weak persistence (MS-prior), the AR and the MA coefficients broadly have a similar time-series evolution as with the CP-prior. 2

2 The results for the first regime of the IHMS-ARMA model with CP-prior are close to those of Stock and Watson (2007). These authors found that the first difference of (quarterly) inflation is well captured by an MA(1) model (with heteroscedasticity) on the period 1960-1983, with an estimated MA coefficient of −0.25.

Finally, as for the U.S. GDP growth rate, we report the results of forecast comparisons over the last 60% of the sample, starting the forecasts in April 1991. In addition to the models considered in Table 7, the set of models includes a random walk (RW) process that is sometimes considered relevant for this series, the AR(12) model (this lag order resulting from the AIC), and the restricted IHMS-ARMA models. Table 9 provides the APD, MSFE, and CRPS values for six forecast horizons and inspires the following comments.
• For each criterion and forecast horizon, the APD values of all the IHMS models are higher than those of the ARMA and AR(12) models with fixed parameters. For the MSFE, the IHMS models also perform better, although by smaller margins, which weakens the outcomes of the statistical test (36) and renders them less significant.

Comparison of Predictive Performance for Other Series
We extend our study of the IHMS-ARMA model by applying to other U.S. series the same type of comparison of forecast performance as in the previous subsections. Table 10 lists 18 quarterly macroeconomic series (from 1959Q1 to 2011Q3) also used in Bauwens et al. (2015). All series are transformed in logarithms and differenced once, except series 9. The first series is the same as in Section 6.1 but over a shorter period. The fourth one is the quarterly version of the inflation series analyzed in Section 6.2. For each of these series, the forecast implementation is the same as for the quarterly GDP growth series of Section 6.1, except that the forecast period starts in 1990Q3.
For each series, Table 11 reports the model that delivers the best APD, MSFE, and CRPS values. It also gives the percentage of improvement (or deterioration) with respect to the fixed parameter ARMA model. The following conclusions emerge.
• Overall, the IHMS-ARMA model appears much more often than the three other ones.
• For the APD criterion, the IHMS-ARMA process is the best one in more than 56% of the cases for each horizon. 3
• For the MSFE criterion, the performance of the simple ARMA model and the restricted IHMS-ARMA one improves, but these competitors are still dominated by the flexible IHMS processes in more than 60% of the cases at horizon one. For farther horizons, the restricted IHMS-ARMA model becomes as good as the most flexible ones. This result can be explained by the new regimes created during the forecast of the series: the parameters of the new regimes are sampled from the hierarchical distributions and can, therefore, generate bad forecasts.
• Focusing on the CRPS, the IHMS-ARMA models dominate the other models in more than 55% of the cases at all horizons.
• Considering the degree of improvement, the IHMS model drastically enhances the APD and the MSFE for several series. For instance, the relative APD improvements for GDP (series 1), Real Imports and Exports of Goods and Services (series 7 and 8), and Private Residential Fixed Investment (series 17) are between 32% and 62% for all horizons. Similar conclusions hold for series 17 when we look at the MSFE. Finally, considering the CRPS, the improvements are also high for the same series as for the APD. By contrast, when the ARMA model dominates the time-varying ones, the improvements are rather modest: they stay below 6% for all series, whatever the criterion.

CONCLUSIONS
The Markov-switching modeling framework is a powerful tool to capture occasional changes in the parameter values of dynamic econometric models at a priori unknown dates. Such models may suffer from a potential overparameterization issue due to the assumption that all the parameters must change when a break occurs. We propose a solution to this problem by relying on the hierarchical Dirichlet process. The break dynamics of the mean function parameters is separated from that of the variance, and thanks to the a priori infinite number of states, a single estimation is sufficient to determine the number of regimes in the two parameter structures. Consequently, the proposed IHMS framework extends the MS class in two related directions, as it allows both an unbounded number of regimes and a flexible dynamics for the model parameters. In addition, this modeling approach is operational for complex models, as we solve the path dependence problem due to the moving average component of ARMA models. This is achieved by using a Metropolis-Hastings step with a proposal density based on an approximate model inspired by the solution that Klaassen (2002) proposed for the MS-GARCH model.
Empirical applications on the quarterly U.S. GDP growth rate and on the monthly U.S. inflation rate highlight the relevance of allowing for the possibility of different structural breaks in the mean and in the variance. In particular, for the U.S. GDP growth rate, we find a structural break in the variance at the beginning of the great moderation era, but no simultaneous break in the mean function parameters. The latter parameters are, therefore, estimated from the entire sample. The inference on the U.S. inflation delivers similar results, as the breaks of the mean function parameters differ from those of the variance. This example additionally highlights that assuming only two regimes for a time series is not always satisfactory, as the numbers of breaks in the mean and the variance are larger than two when the prior information favors weakly persistent regimes. A forecasting comparison on eighteen quarterly series illustrates that the IHMS-ARMA models perform better than the ARMA one with fixed parameters for a majority of series, in some cases by a wide margin.
Future research could be devoted to relaxing the geometric duration of the regimes implied by the Markov chains. We could also investigate whether allowing a different break structure for each model parameter further improves the predictions. Another avenue of research could be to extend the approach to VARMA and factor models.
From the last line of the above formula, the forward step of the forward-backward algorithm provides all the quantities needed to compute, at each time index, the conditional distribution f(s^ψ_{t−1} | y_{1:t−1}, s^ψ_t, θ, P^ψ, s^σ_{1:t−1}).
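The mechanics of these forward quantities can be illustrated with a generic discrete-state filter. The sketch below is only illustrative: it omits the conditioning on s^σ_{1:t−1} and the model-specific emission densities, keeping just the forward recursion and the backward conditional it delivers.

```python
import numpy as np

def forward_filter(log_obs, P, init):
    """Forward pass of the forward-backward algorithm for a discrete-state
    chain: returns alpha[t, j] = f(s_t = j | y_{1:t}), given generic emission
    log-densities log_obs[t, j] = log f(y_t | s_t = j), transition matrix P,
    and the initial state distribution init."""
    T, K = log_obs.shape
    alpha = np.zeros((T, K))
    pred = np.asarray(init, dtype=float)
    for t in range(T):
        # Multiply the one-step-ahead prediction by the (rescaled) likelihood.
        post = pred * np.exp(log_obs[t] - log_obs[t].max())
        alpha[t] = post / post.sum()
        pred = alpha[t] @ P  # prediction f(s_{t+1} | y_{1:t})
    return alpha

def backward_step(alpha_tm1, P, s_t):
    """Conditional f(s_{t-1} = i | y_{1:t-1}, s_t): proportional to
    alpha[t-1, i] * P[i, s_t], the quantity used in the backward
    (smoothing or sampling) recursion."""
    w = alpha_tm1 * P[:, s_t]
    return w / w.sum()
```

With a sticky transition matrix, `backward_step` assigns higher conditional probability to s_{t−1} = s_t, which is the persistence effect that the sticky IHMS prior is designed to induce.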

SUPPLEMENTARY MATERIALS
The supplementary materials contain additional empirical results. The first section contains the posterior results for the HDP parameters of the estimated IHMS-ARMA models for the U.S. GDP growth rate and inflation series. The second section reports additional in-sample and forecasting results for the same series. The last section provides some results for a different truncation choice of the number of regimes in the approximate model.