Particle learning for Bayesian semi-parametric stochastic volatility model

Abstract This article designs a Sequential Monte Carlo (SMC) algorithm for the estimation of a Bayesian semi-parametric stochastic volatility model for financial data. In particular, it makes use of one of the most recent particle filters, called Particle Learning (PL). SMC methods are especially well suited for state-space models and can be seen as a cost-efficient alternative to Markov Chain Monte Carlo (MCMC), since they allow for online-type inference: the posterior distributions are updated as new data are observed, which is exceedingly costly using MCMC. Also, PL allows for consistent online model comparison using sequential predictive log Bayes factors. Simulated data are used to compare the posterior outputs of the PL and MCMC schemes, which are shown to be almost identical. Finally, a short real data application is included.


Introduction
Understanding, modeling, and predicting stylized features of financial returns has been extensively researched for more than 30 years, and interest in the subject is far from decreasing. While the mean-variance framework has been of major interest, it is justifiable only for Normally distributed returns. There is overwhelming evidence in the literature that the distribution of financial returns is far from Normal, in the sense that it exhibits fat tails and occasional asymmetry, see Bollerslev (1987), He and Teräsvirta (1999), Jensen and Maheu (2010), among many others. Therefore, apart from modeling the mean and variance, one also has to consider departures from Normality by allowing for skewness and excess kurtosis via more flexible distributional assumptions for the innovations of the returns.
Modeling the conditional mean of the returns is a very challenging task, since they are always very close to zero and exhibit very low levels of autocorrelation. The volatility of the returns, on the other hand, usually exhibits a slowly decaying autocorrelation function, i.e. high persistence, which can be modeled via an auto-regressive process. The two most popular approaches to modeling volatility are based on the Autoregressive Conditional Heteroscedasticity (ARCH) type models, first introduced by Engle (1982), and the stochastic volatility (SV) type models, first introduced by Taylor (1982). There is evidence in the literature that SV models provide more flexibility than Generalized ARCH (GARCH, Bollerslev, 1986) specifications, see e.g. Broto and Ruiz (2004). Therefore, in this work we consider the SV model for the volatilities.
As for the distribution of the error term of the returns, the Normal distribution was considered by Taylor (1986, 1994), Jacquier et al. (1994), Kim et al. (1998), among many others. However, as mentioned above, financial returns depart from Normality since they exhibit fat tails and occasional asymmetry. There has been a multitude of papers considering all kinds of parametric non-Normal distributions. For example, the Student-t distribution was employed by Harvey et al. (1994), Gallant et al. (1997), Sandmann and Koopman (1998), Chib et al. (2002), Jacquier et al. (2004) and Nakajima and Omori (2009), the Normal-Inverse Gaussian by Barndorff-Nielsen (1997), the mixture of Normals by Mahieu and Schotman (1998), and the Generalized Error Distribution by Liesenfeld and Richard (2005), among many others.
Another alternative is to abandon parametric assumptions for the distribution of the error term of the returns altogether and consider a semi-parametric SV model, where the distribution of the returns is modeled non-parametrically while, at the same time, the parametric discrete representation of the SV model is preserved. The Bayesian non-parametric approach for SV models is quite a new field of research, with growing popularity due to its flexibility and superior performance, see Jensen (2004), Jensen and Maheu (2010, 2014) and Delatola and Griffin (2011, 2013). In these works it is assumed that the distribution of the returns follows an infinite mixture of Normals via Dirichlet Process Mixture (DPM) models (see Ferguson, 1983 and Lo, 1984, among others). Because of its universal approximation property (Titterington et al., 1985), the infinite mixture of Normals can model other distributions frequently used in the financial time series context, see e.g. Tokdar (2006) and Mencía and Sentana (2009).
The Markov Chain Monte Carlo (MCMC) estimation approach for SV models has been the usual methodology since the seminal work of Jacquier et al. (1994), where Bayesian inference for standard SV models was first developed. For a survey on Bayesian estimation of time-varying volatility models see Virbickaitė et al. (2015b). However, MCMC methods in general are computationally demanding for high-frequency data and "inherently non-sequential" (Lopes and Polson, 2010). Alternatively, one can rely on Sequential Monte Carlo (SMC) methods, also known as particle filters, which allow for online-type inference by updating the posterior distribution as new data are observed. SV (parametric or semi-parametric) models are state-space models, naturally suggesting an SMC scheme. Moreover, the models considered in this article belong to a class for which sufficient statistics for the parameters are available. This naturally suggests using a filter that, instead of tracking a high-dimensional vector of parameters, tracks a low-dimensional set of sufficient statistics that can be recursively updated. The use of sufficient statistics has been shown to increase the efficiency of the algorithm by reducing the variance of the sampling weights, see Carvalho et al. (2010a).
In general, particle filters provide a simulation-based approach in which a set of particles represents the posterior density. For instance, consider the following state-space model, where x_t are latent states and Θ are static parameters:

r_t ~ p(r_t | x_t, Θ),
x_t ~ p(x_t | x_{t−1}, Θ),

for t = 1, ..., T, with initial probability density p(x_0 | Θ) and prior p(Θ). Each particle has an associated weight that is proportional to the predictive p(r_t | x_t, Θ). The sequential state filtering and parameter learning problem is solved by a sequence of joint posterior distributions p(x_t, Θ | r^t), where r^t = (r_1, ..., r_t). Assume for the time being that Θ is known, which leaves us with a pure filtering problem. Gordon et al. (1993) and Pitt and Shephard (1999) propose the bootstrap and auxiliary particle filters, respectively, which are among the most popular ones. However, when Θ is unknown and also needs to be sequentially estimated, the problem becomes more difficult. The approach of directly introducing and resampling Θ breaks down in a few steps, since all the particles collapse into a single point. In order to delay particle degeneracy, Gordon et al. (1993), and later Liu and West (2001), consider an artificial evolution for the parameters. On the other hand, Storvik (2002) and Carvalho et al. (2010a) rely on a low-dimensional set of sufficient statistics, instead of the parameters, to be tracked in time. For discussions and illustrations of some of the particle methods, or reviews of particle methods in general, see Johansen and Doucet (2008), Kantas et al. (2009), Douc et al. (2009), Lopes and Tsay (2011), Lopes et al. (2011) together with Chopin et al. (2011) for a lively discussion, Lopes and Carvalho (2013) and Rios and Lopes (2013), among many others.
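To make the filtering recursion concrete, the following is a minimal sketch of a bootstrap filter in the spirit of Gordon et al. (1993), applied to a toy linear-Gaussian state-space model with known parameters. The toy model and all variable names are illustrative and are not the SV model of this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(y, n_particles=1000, beta=0.95, tau=0.2):
    """Bootstrap particle filter for a toy model with known parameters:
    x_t = beta * x_{t-1} + tau * eta_t,   y_t = x_t + e_t,   e_t ~ N(0, 1)."""
    x = rng.normal(0.0, 1.0, n_particles)          # particles for x_0
    filtered_means = np.empty(len(y))
    for t, y_t in enumerate(y):
        # Propagate: draw from the state transition p(x_t | x_{t-1})
        x = beta * x + tau * rng.normal(size=n_particles)
        # Weight: observation likelihood p(y_t | x_t) under N(x_t, 1) noise
        logw = -0.5 * (y_t - x) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resample with replacement proportionally to the weights
        x = x[rng.choice(n_particles, size=n_particles, p=w)]
        filtered_means[t] = x.mean()
    return filtered_means

# Simulate a short series from the same toy model, then filter it
T = 200
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.95 * x_true[t - 1] + 0.2 * rng.normal()
y = x_true + rng.normal(size=T)
filtered = bootstrap_filter(y)
```

With Θ known, this propagate-weight-resample loop is all that is needed; the difficulties discussed next arise when static parameters must be learned as well.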
Even if particle filters are known to suffer from a fundamental problem called particle degeneracy, i.e., an ever-decreasing set of atoms in the particle approximation of the density of interest (see Section 2.6), the online property of particle filters is definitively an advantage over MCMC. Among all available Sequential Monte Carlo methods, in this paper we make use of the particle learning (PL) approach, a particle-based method first introduced by Carvalho et al. (2010a). Certainly, alternative particle filters could be considered; nevertheless, a comparison of SMC methods in this setting is out of the scope of this paper. One can find extensive empirical comparisons of a variety of competing filters, in more general settings, in Carvalho et al. (2010a), Lopes and Tsay (2011), and Rios and Lopes (2013). PL incorporates sequential parameter learning, state filtering and smoothing, thus providing an online estimation alternative to MCMC/Forward Filtering, Backward Sampling (FFBS) methods. For comparisons of PL with MCMC see Carvalho et al. (2010a) and Lopes and Polson (2010), among others. An essential feature of PL is the presence of conditional sufficient statistics for the parameters, which are tracked in time. It also makes model comparison easy, since at each step the predictive likelihood is available as a by-product.
The main contribution of the paper is the design of a PL algorithm for an SV model with DPM innovations, referred to as the semi-parametric model (SPM), which is the same as in Delatola and Griffin (2011). We estimate the model on simulated data via PL and MCMC in order to illustrate that the produced posteriors are almost identical at any given data point. The PL method provides the advantage of easily incorporating the information from each new observation, while MCMC requires re-running the whole algorithm. Additionally, PL produces predictive likelihoods for each data point at no additional cost, which allows for sequential model comparison via log predictive Bayes factors. Finally, we estimate real data via PL using the SPM and a fully parametric model with Normal innovations, referred to as the PM (following the nomenclature of Delatola and Griffin (2011)), and perform sequential model comparison in order to illustrate the attractiveness of the SMC approach.
It is important to notice that the proposed efficient SMC scheme for this type of models does not come without a cost. Apart from the limitations of particle filters in general, which are outlined in Section 2.6, there is an important shortcoming of the PL algorithm for the specific class of models considered in this article. In particular, in order to design a fully-adapted PL algorithm, the returns have to be transformed by applying a log-square transformation. This transformation masks possible skewness of the distribution of the returns.¹ As acknowledged in Delatola and Griffin (2011), this is a strong assumption; however, they refer to the work of Jensen and Maheu (2010), who found little evidence of skewed returns and showed that a scale mixture exhibits better out-of-sample performance than the location-scale mixture.
The paper is structured as follows. Section 2 presents the linearized SV model with non-parametric errors and designs a PL algorithm for this model. It also includes a discussion of the limitations of particle methods in general. Then, Section 3 presents a simulated data exercise and a comparison with the MCMC estimation output. Section 4 compares the performance of the parametric and semi-parametric models using real data. Finally, Section 5 concludes.

SV-DPM model
In this section we briefly review a commonly used version of the standard SV model with Normal errors. We then drop the Normality hypothesis and introduce a novel particle learning scheme to perform sequential Bayesian learning in the class of semi-parametric SV models. The innovation distribution is assumed to follow an infinite mixture of Gaussians via Dirichlet Process Mixture models, giving rise to the SPM. We show the differences in the computational aspects between PL and MCMC. While MCMC is the gold standard for this type of models, PL has the advantage of producing online inference and, as a by-product, online model comparison/selection statistics.

¹ We would like to thank the Referee for pointing this out.

Normal errors
The standard SV model is given by

r_t = exp(h_t / 2) v_t,   (1)
h_t = α + β h_{t−1} + τ η_t,   (2)

where |β| < 1 for the stationarity of the volatilities, and v_t and η_t are uncorrelated error terms, such that η_t ~ N(0, 1). The distribution of v_t, with zero mean and unit variance, takes many different forms in the existing literature: from a standard Normal to the heavy-tailed Student-t and others (see Kim et al., 1998; Chib et al., 2002; Mahieu and Schotman, 1998; Liesenfeld and Richard, 2005, for example). Kim et al. (1998) proposed a linearization of the standard SV model by defining y_t = log(r_t² + c_O) and ε_t* = log v_t², resulting in the following dynamic linear model:

y_t = h_t + ε_t*,   ε_t* ~ F,   (3)
h_t = α + β h_{t−1} + τ η_t.   (4)

Observe that the distribution F is a log χ²₁ if v_t is Normally distributed. Kim et al. (1998) and Omori et al. (2007) use carefully tuned finite mixtures of Normals to approximate the log χ²₁ distribution and use a data augmentation argument to propose fast MCMC schemes that jointly sample {h_1, ..., h_T} based on the well-known FFBS algorithm of Carter and Kohn (1994) and Frühwirth-Schnatter (1994). Moreover, c_O is an offset parameter that is needed in order to avoid the logarithm being undefined in the case of zero returns. Delatola and Griffin (2011) tried several different values for c_O and presented their real data application with c_O = 10⁻⁴, while Jensen (2004) used the value c_O = 0.0005. In this paper we fix c_O = 0.0003 for all simulated and real data applications.
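The linearization can be illustrated with a short simulation. The parameter values below are arbitrary choices for the sketch, except for the offset c_O = 0.0003 used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

T, alpha, beta, tau = 5000, 0.0, 0.97, 0.15
c_O = 0.0003                                   # offset, as fixed in the paper

h = np.zeros(T)
for t in range(1, T):
    h[t] = alpha + beta * h[t - 1] + tau * rng.normal()   # AR(1) log-volatility
r = np.exp(h / 2) * rng.normal(size=T)                    # returns, v_t ~ N(0, 1)

# Log-square transformation: y_t = log(r_t^2 + c_O), so y_t - h_t is (up to the
# small offset) eps_t = log(v_t^2), with mean about -1.27 and variance near
# pi^2 / 2 ~ 4.93 for Normal v_t
y = np.log(r ** 2 + c_O)
eps = y - h
print(eps.mean(), eps.var())
```

The sample moments of eps illustrate why the log χ²₁ distribution, rather than a zero-mean unit-variance law, is the relevant benchmark for the transformed errors.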
However, the recent literature abounds with evidence that the distribution of v_t has heavier tails than the Normal distribution, rendering the above approximations limited. Below we introduce a simple linearized SV model with non-parametric errors to model the unknown return distribution.
Another important issue concerns the moments of the distribution of ε_t*. Even though the original errors v_t are generated by a process with zero mean and unit variance, the resulting moments of ε_t* = log v_t² are different: under Normal v_t, ε_t* follows the log χ²₁ distribution, with mean approximately −1.27 and variance π²/2 ≈ 4.93.

Non-Normal errors
We do not specify a parametric model for the error density; instead, we assume a Dirichlet Process Mixture prior, first introduced by Lo (1984). DPM models have been widely used for modeling time-varying volatilities, see Jensen (2004), Jensen and Maheu (2010, 2014), Delatola and Griffin (2011, 2013), Kalli et al. (2013), Ausín et al. (2014) and Virbickaitė et al. (2015a). This type of approach is known as a time-invariant (independent) DPM. Delatola and Griffin (2011, 2013), for example, propose to approximate the log-square of the unknown return distribution F by an infinite mixture of Normals by relying on DPM models. The SPM presented in this section is in the same spirit as the model in Delatola and Griffin (2011). As noted by the authors, since the mean of the disturbance ε_t* is not fixed and is not known, some identification issues might arise. Therefore, the mean of the volatility process in (4) can be subsumed into ε_t*, leading to the following reparametrized model:

y_t = h_t + ε_t*,   ε_t* ~ F,   (5)
h_t = β h_{t−1} + τ η_t.   (6)

Here the log volatility process has unconditional mean equal to zero. As seen in Escobar and West (1995), the DPM model has the following density function:

f(ε_t) = ∫ k(ε_t | θ) dG(θ),   G ~ DP(c, G_0(θ; ψ)),

where k is some density kernel with parameters θ_t and the mixing distribution G has a DP prior, denoted here by G ~ DP(c, G_0(θ; ψ)). Each observation ε_t comes from a kernel density with some parameters θ_t, following the mixing distribution G. The parameter c is called the concentration parameter and G_0(θ; ψ) is called the base distribution, which depends on certain hyperparameters ψ. The concentration parameter c can be interpreted as the prior belief about the number of clusters in the mixture. Small values of c assume a priori an infinite mixture model with a small number of components with large weights. On the contrary, large values of c assume a priori an infinite mixture model with all the weights being very small.
c is also called a precision parameter and indicates how close G is to the base distribution G_0: larger c indicates that G is closer to G_0.
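The role of c can be visualized with the stick-breaking construction of the DP (a standard representation; the truncation at a finite number of atoms below is for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking_weights(c, n_atoms=2000):
    """Draw (truncated) mixture weights of a DP(c, G0) via stick-breaking:
    w_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, c)."""
    v = rng.beta(1.0, c, size=n_atoms)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

def ess(w):
    """Effective number of components implied by the weights."""
    return 1.0 / np.sum(w ** 2)

# Small c: a few dominant components; large c: mass spread over many tiny ones
w_small = stick_breaking_weights(c=1.0)
w_large = stick_breaking_weights(c=50.0)
print(ess(w_small), ess(w_large))
```

The effective number of components grows with c, matching the interpretation of c as a prior belief about the number of clusters.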

Gaussian kernel and conjugate base prior
A rather standard approach is to consider a Gaussian kernel density, ε_t | μ_t, σ_t² ~ N(μ_t, σ_t²), and follow the procedure outlined in Escobar and West (1995), putting a prior on the mixing mean and variance. Alternatively, we rely on an approach proposed by Griffin (2010) and Delatola and Griffin (2011):

ε_t | μ_t ~ N(μ_t, a σ²),   μ_t | G ~ G,   G ~ DP(c, G_0),   G_0 = N(μ_0, (1 − a) σ²).

Here μ_0 is the overall location parameter and mixing is done over μ_t, where μ_t is the location of the t-th component. Also, σ² is the overall scale and is constant. Moreover, the uncertainty associated with μ_t can be integrated out, and the prior predictive for ε_t is just a single Normal N(μ_0, σ²). In real data applications the observations cluster; therefore, some of the ε_t come from a component with the same μ_t, and the total number of components is smaller than the number of observations. In the rest of the manuscript, instead of t we will use the subscript j to identify a component. The parameter a is a smoothness parameter and is fixed to 0.05 throughout the paper. Delatola and Griffin (2011) also considered a different value, a = 0.01; alternatively, a can be estimated with the rest of the model parameters, see Griffin (2010) for details. The concentration parameter c is set equal to one, as seen in Carvalho et al. (2010b); however, it can also be estimated together with the rest of the model parameters. One could specify informative priors for μ_0 and σ²; however, following Delatola and Griffin (2011), we allow for completely uninformative priors.
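Under this specification, the kernel is Normal with variance aσ² and the base distribution is Normal with mean μ_0 and variance (1 − a)σ², so integrating out μ_t yields the N(μ_0, σ²) prior predictive. A quick Monte Carlo check of this integration step, with illustrative numerical values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative values (mu0 and sigma2 chosen near the log chi^2_1 moments)
mu0, sigma2, a = -1.27, 4.93, 0.05
n = 200_000

# Draw component locations from the base, then errors from the kernel
mu = rng.normal(mu0, np.sqrt((1.0 - a) * sigma2), size=n)
eps = rng.normal(mu, np.sqrt(a * sigma2))

print(eps.mean(), eps.var())    # close to mu0 and sigma2, respectively
```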
Define Φ = (β, τ²) as the set of parameters associated with the parametric part of the model, Ω = {μ_0, μ_1, ..., σ²} as the set of parameters associated with the distribution of the error term, and Θ = (Φ, Ω) as the complete set of all model parameters. Then, using a Polya urn representation of the DPM, see Escobar and West (1995), the model in (5) and (6) can be rewritten as follows:

ε_{t+1} | μ_1, ..., μ_{L_t*} ~ (c / (c + t)) N(μ_0, σ²) + Σ_{j=1}^{L_t*} (n_{t,j} / (c + t)) N(μ_j, a σ²),

where n_{t,j} is the number of observations assigned to the j-th component at time t and L_t* is the number of non-empty components in the mixture at time t, i.e. L_t* is not fixed a priori and grows as new components are observed. Given this missing information, the mixture becomes finite, where the maximum number of components is theoretically limited by the number of observations. In practice, data tend to cluster, meaning that some observations come from the same component; therefore L_t* ≪ t.
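The Polya urn mechanism for the component allocations can be sketched as follows (a generic simulation of the urn itself, not of the full model):

```python
import numpy as np

rng = np.random.default_rng(4)

def polya_urn_allocations(T, c=1.0):
    """Sequentially allocate T observations to mixture components: having seen
    t observations, the next one joins existing component j with probability
    n_j / (c + t) and opens a new component with probability c / (c + t)."""
    counts = []                                   # n_j for non-empty components
    for t in range(T):
        probs = np.array(counts + [c], dtype=float) / (c + t)
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):
            counts.append(1)                      # open a new component
        else:
            counts[j] += 1
    return counts

counts = polya_urn_allocations(T=500, c=1.0)
print(len(counts), sum(counts))    # number of non-empty components, L*_T << T
```

For c = 1 and T = 500 the number of occupied components typically stays in the single digits, illustrating the clustering behavior described above.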

MCMC for SPM
The standard Bayesian estimation of SV models, parametric or semi-parametric, relies on MCMC methods, which, however, can be costly because, in addition to parameter estimation, they have to include a sampler for the latent volatilities. One notable exception is the work by Jensen (2004), who proposes a highly efficient MCMC sampler for a long-memory semi-parametric SV model by making use of the SV model's wavelet representation and the near-independence of the wavelet coefficients. Jensen and Maheu (2010) construct an MCMC scheme for their proposed SV-DPM model, where latent volatilities are sampled via a random length block sampler, which helps to reduce correlation between draws. The authors found that the semi-parametric SV model is more robust to non-Normal data and provides better forecasts. In another paper, Jensen and Maheu (2014) consider an asymmetric SV-DPM model. The authors extend their previous semi-parametric sampler to a bivariate setting, where the innovations of the returns and volatilities are modeled jointly via an infinite scale mixture of bivariate Normals.
Meanwhile, Delatola and Griffin (2011) use a linearized version of the SV model. Conditional on knowing which mixture component the data belong to, the linearized SV model is just a Normal Dynamic Linear Model (NDLM), and the latent volatilities are updated by FFBS (see the discussion at the end of Section 2.1). The remaining model parameters are sampled via an extension of the Gibbs sampler, called the hybrid Gibbs sampler. In their subsequent paper, Delatola and Griffin (2013) consider an asymmetric SV model. As before, they make use of the linearization and update the latent log volatilities via FFBS and the other parameters via Metropolis-Hastings. All the above MCMC schemes are costly in the context of SV models for high-frequency data for at least three reasons: (1) the MCMC sampler has to include a filter for the latent volatilities, (2) the sampler has to be re-run each time a new observation arrives, and (3) sequential consistent model comparison is nearly impossible due to the computational burden.

PL for the SPM
In this section we present the algorithm to perform PL estimation for an SV model with non-parametric errors. PL, as mentioned before, is one of several particle filters that consider sequential state filtering and parameter learning. PL, first introduced by Carvalho et al. (2010a), allows for sequential filtering, smoothing and parameter learning by including state-sufficient statistics in the set of particles. The Online Appendix includes a brief description of the main idea behind PL. For a more detailed explanation of PL with illustrations refer to Carvalho et al. (2010a), among others. The priors for the model parameters are chosen to be conditionally conjugate: h_0 ~ N(c_0, C_0), τ² ~ IG(b_0/2, b_0 s_0²/2) and β ~ TN_(−1,1)(m_b, V_b). Here TN_(a,b) represents the Normal distribution truncated at a and b, while c_0, C_0, b_0, b_0 s_0², m_b and V_b are hyper-parameters. Then, a set of sufficient statistics S_t contains all updated hyper-parameters necessary for the parameter simulation, as well as the filtered state variables, which are of two kinds: the latent log volatility h_t and the indicator variable k_t, which tells us to which mixture component the error data point belongs. The object we call a particle at time t thus contains S_t and the corresponding parameters, simulated from the hyper-parameters in S_t. At each time t we have a collection of N particles. When this set of N particles passes from t to t + 1, some of the particles disappear and some are repeated (sampling with replacement, which corresponds to the Resampling step defined below), and they are then modified (Sampling and Propagating steps). In order to initiate the algorithm, we need the initial set of sufficient statistics S_0 and initial parameter values. The set S_0 consists of: the initial {h_0^(i)}_{i=1}^N, simulated from the prior; the initial overall location {μ_0^(i)}_{i=1}^N, which is set to −1.272 for all particles; and {σ²^(i)}_{i=1}^N, which is set to 4.946.
These specific values correspond to the first two moments of the log χ²₁ distribution, which would correspond to Normally distributed returns. We have performed a simulation study, included in the Online Appendix, and found that for reasonable sample sizes the sampler is robust to the choice of the initial values of μ_0. The rest of the initial hyper-parameters {b_0^(i)}_{i=1}^N, {b_0 s_0²^(i)}_{i=1}^N, ... are all the same across all particles at t = 0. For t = 1, ..., T and for each particle (i), the algorithm iterates through three steps (the derivations of the posterior distributions are rather straightforward and very similar to the ones available in Griffin (2010) and Delatola and Griffin (2011)):

1. Resampling. Resample the particles from the previous period t − 1 with weights

w ∝ (c / (c + t − 1)) f_N(y_t; β h_{t−1} + μ_0, τ² + σ²) + (1 / (c + t − 1)) Σ_{j=1}^{L*_{t−1}} n_j f_N(y_t; β h_{t−1} + μ_j, τ² + a σ²),

which are proportional to the predictive density of the transformed returns. The components of Θ = (β, τ², μ_1, ..., μ_{L*_{t−1}}, μ_0, σ²) have been simulated at the end of the previous period. Resampled quantities are denoted by a tilde, as in Θ̃.
2. Sampling states. For each resampled particle, sample the component indicator k_t and propagate the new log volatility h_t from their conditional posterior distributions, given Θ̃, h̃_{t−1} and y_t.

3. Propagating sufficient statistics and learning Θ. Update the conditional sufficient statistics with the new state draws and sample the parameters from their conditional posteriors. For instance, the location statistic of the selected component is updated such that s_{k_t} = s̃_{k_t} + (y_t − h_t); likewise, σ² is sampled from IG(a, b), where the scale parameter accumulates the squared residuals via l_{k_t} = l̃_{k_t} + (y_t − h_t − μ_{k_t})².
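The resample-propagate structure above can be sketched as follows. This is a simplified illustration of the weight computation and resampling (step 1) only, with made-up particle contents; it is not a full implementation of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

def norm_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def resample_step(y_t, particles, c=1.0, a=0.05):
    """Weight every particle by the DPM predictive density of the transformed
    return y_t, then resample with replacement. Each particle is a dict holding
    h_{t-1}, the parameters (beta, tau2, mu0, sigma2) and the cluster
    locations/counts (mus, ns)."""
    w = np.empty(len(particles))
    for i, p in enumerate(particles):
        m = p["beta"] * p["h"]                 # conditional mean of h_t
        t_prev = sum(p["ns"])                  # number of past observations
        # new-component term: location integrated over the base distribution
        dens = (c / (c + t_prev)) * norm_pdf(
            y_t, m + p["mu0"], np.sqrt(p["tau2"] + p["sigma2"]))
        # existing components, weighted by their cluster sizes n_j
        for mu_j, n_j in zip(p["mus"], p["ns"]):
            dens += (n_j / (c + t_prev)) * norm_pdf(
                y_t, m + mu_j, np.sqrt(p["tau2"] + a * p["sigma2"]))
        w[i] = dens
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [dict(particles[i]) for i in idx], w

# A toy particle set: 100 particles, each with two occupied clusters
particles = [{"h": rng.normal(0.0, 0.5), "beta": 0.97, "tau2": 0.02,
              "mu0": -1.27, "sigma2": 4.93, "mus": [-2.0, 0.5],
              "ns": [6, 4]} for _ in range(100)]
resampled, w = resample_step(y_t=-1.0, particles=particles)
```

In a full implementation, steps 2 and 3 would then draw (k_t, h_t) for each resampled particle and update its sufficient statistics before moving to t + 1.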

Limitations of particle filters
Particle filters, PL included, are known to suffer from a problem called particle degeneracy: an ever-decreasing set of atoms in the particle approximation of the density of interest. As noted by Chopin et al. (2011), increasing the number of observations will lead to degenerating paths unless the number of particles is increased simultaneously. This has to be monitored carefully for the chosen filter and can be seen as a tradeoff between the sequential nature of the algorithm and the stability of MCMC for very large samples. Therefore, the a priori consideration of the sample size of interest directly influences the choice of the number of particles, so as not to reach the stage where particles start to degenerate. Although the development of particle filters is not that new, it is a very active field of research. The ongoing quest to avoid, or at least postpone, particle degeneracy has led Gordon et al. (1993) and Liu and West (2001) to introduce artificial evolution in the parameters. Another strategy is to use a resample-propagate rather than a propagate-resample scheme, as seen in Carvalho et al. (2010a) and Lopes and Tsay (2011). Finally, the use of sufficient statistics produces a lower MC error than other filters (given the same number of particles), which in turn implies that filters making use of sufficient statistics, such as PL or Storvik (2002), can reach the same accuracy with a smaller number of particles than other filters. This leaves more room for increasing the number of particles to accommodate the desired time horizon before the particles start vanishing.
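Degeneracy is commonly monitored via the effective sample size (ESS) of the normalized weights, a generic particle-filter diagnostic:

```python
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum(w_i^2) for normalized weights: equals N for uniform
    weights and approaches 1 as a single particle absorbs all the mass."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))   # 4.0: healthy
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))   # ~1.06: degenerate
```

Tracking the ESS over t makes the tradeoff discussed above measurable: when the ESS collapses toward 1, the number of particles was too small for the intended sample size.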
Finally, if the interest is not online-type inference, MCMC is still the gold standard in the area. Recently, other approaches that combine MCMC and particle filters, such as Particle MCMC, have been emerging, see Andrieu et al. (2010) and Pitt et al. (2012), among others.

Simulation exercise and comparison with MCMC
We perform a simulation exercise based on synthetic data to illustrate the computational aspects of the MCMC and PL approaches. A data set of length T = 500 is simulated from the model in (1)-(2) with α = 0, β = 0.97, τ² = 0.0225, where v_t is distributed as a standard Normal. We estimate the SPM using the simulated data with the PL and MCMC schemes. The priors for the unknown parameters are the same for MCMC and PL and are given by β ~ N(0.95, 0.1) and τ² ~ IG(10/2, 0.1/2). Also, the initial values for μ_0 and σ² are set the same for both algorithms, to match the first two moments of the log χ²₁ distribution. PL is run with 100,000 particles, while MCMC is run for 100,000 iterations, keeping every 10th. The MCMC results are obtained via the Matlab code of Delatola and Griffin (2011), which is available on Jim Griffin's website². We have modified the code accordingly, to exactly match our model specification. In particular, the concentration parameter is set to c = 1, the probability of zero returns is always set equal to zero, and we do not switch between the two alternative reparametrizations described in Delatola and Griffin (2011). Also, the draws for the parameter β are obtained via a Gibbs rather than an MH step, as in the original code.
For illustrative purposes we also estimate a fully parametric model, where the error term is assumed to be Normally distributed. The log χ²₁ distribution is approximated via a carefully tuned mixture of Normals, as seen in Kim et al. (1998). Such an approximation allows us to implement the fully adapted filter and to illustrate one of the advantages of the PL algorithm: sequential predictive model comparison. In this case we know the underlying DGP; therefore, the sequential predictive Bayes factors should prefer the fully parametric model, purely due to its much smaller parameter space. We report estimation results at five points of the sample, in particular at observations t = 100, 200, 300, 400, 500. For PL, the algorithm has to be run only once, while for MCMC it had to be run five times. We present the PL results for four independent runs in order to get some idea about the Monte Carlo error (the codes were run on a standard desktop computer with four cores, so all four runs could be carried out in parallel). The smaller the number of particles, the more variability is observed across runs, see Carvalho et al. (2010a) for example. Figure 1 plots the posterior distributions of the model parameters associated with the non-parametric part, μ_0 and σ², and of the parameters governing the volatility process, β and τ², at time T = 100. The four gray lines correspond to the four independent PL runs, while the dotted black line draws the MCMC-produced posterior distributions. As seen, at time T = 100 all posterior distributions are nearly identical. Similar plots can be drawn for each time point t. In order to save space, for the remaining time points, instead of drawing all posterior distributions, we plot the PL median, 2.5 and 97.5 percentile paths and the corresponding MCMC medians and 95% credible intervals, see Fig. 2. As seen from the plots, the posterior distributions are very similar at all data points.
Instead of the medians and credible bounds for the MCMC at specific time points only, one could also draw the exact paths for all t; however, this would mean that the MCMC algorithm would have to be re-run 500 times.
Next, Fig. 3 draws the posterior median, 2.5 and 97.5 percentile paths for PL and the corresponding MCMC medians with 95% credible intervals at the selected time cuts for the filtered log volatilities. Although for the MCMC we have the entire path of volatilities available, it is important to note that these are smoothed paths and are therefore not comparable with the filtered-only PL paths. If one wishes to obtain smoothed paths in the PL setting, it is possible to perform backwards smoothing after the algorithm has been run, see Carvalho et al. (2010a) for details on smoothing. As seen, the filtered median log volatilities and 95% credible intervals are almost identical for both algorithms. As mentioned in the Introduction, the predictive distribution of the returns (or their log-square transformation) is of major interest. Figure 4 draws the posterior predictive distributions at each of the time cuts for MCMC and PL. As seen from the plot, there is very little MC variability among the PL runs, and the posterior predictives are identical to those produced by the MCMC. The figure presents such posteriors only for five selected time cuts; however, for PL there are 500 such posterior predictive distributions readily available. On the other hand, as mentioned before, the MCMC has to be re-run each time a new observation arrives, resulting in a prohibitively large computational burden if one wants to produce online-type inference.

Model comparison
To compare the performance of the models, we use the sequential predictive log Bayes factor (BF). As pointed out in Koop (2003), Bayes factors permit consistent model comparison even for non-nested models. They reward model fit, account for the coherency between the prior and the information arising from the data, and reward parsimony. As seen in Kass and Raftery (1995), the Bayes factor between two competing models is defined as

BF = p(D | M_1) / p(D | M_2),

where p(D | M_r) is the marginal likelihood of the data D given model M_r. Then the log predictive Bayes factor at time t − 1 for the data point r_t is defined as

log BF_t = Σ_{k=1}^{t} [ log p(r_k | r^{k−1}, M_1) − log p(r_k | r^{k−1}, M_2) ].

The posterior predictive p(r_t | r^{t−1}, M_r) for model M_r is obtained as follows:

p(r_t | r^{t−1}, M_r) = ∫ p(r_t | Θ_r, r^{t−1}) p(Θ_r | r^{t−1}) dΘ_r,

where Θ_r is the set of parameters associated with model M_r. The integral above is not always analytically tractable; it can be approximated using the MCMC output, or it is readily available as a by-product of the PL scheme, where for each t = 1, ..., T the log predictive densities are calculated as

log p(r_t | r^{t−1}) ≈ log [ (1/N) Σ_{i=1}^{N} p(r_t | (S_{t−1}, Θ)^(i)) ].

Finally, Fig. 5 illustrates the attractiveness of PL: the availability of sequential log predictive likelihoods and Bayes factors, which allow for fast and consistent model comparison. The top panel draws the simulated zero-mean return process with Normal errors, while the bottom panel draws the sequential predictive log Bayes factors for four independent runs of the SPM and PM; as a result of Monte Carlo error, multiple lines are visible in the bottom plot. The larger the number of particles, the less variability would be observed across the runs, see Carvalho et al. (2010a). Since the true data generating process is Normal, as expected, the Bayes factors are negative, showing strong support for the PM. Even though the SPM includes the PM as a special case, it has many more parameters to estimate; the Bayes factors are negative because they reward parsimony.
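Since PL delivers the per-observation predictive densities as a by-product, the sequential log Bayes factor is just a cumulative sum of log predictive ratios. A sketch, with simulated densities standing in for the PL output:

```python
import numpy as np

rng = np.random.default_rng(6)

def sequential_log_bf(pred_m1, pred_m2):
    """Cumulative log Bayes factor: at each t it equals the sum over k <= t of
    log p(r_k | r^{k-1}, M1) - log p(r_k | r^{k-1}, M2), given the
    per-observation predictive densities under each model."""
    return np.cumsum(np.log(pred_m1) - np.log(pred_m2))

# Stand-in predictive densities (in PL these are by-products of the filter);
# model M1 is slightly better on average here, so the path drifts upwards
pred_m1 = np.exp(rng.normal(-1.4, 0.1, size=500))
pred_m2 = np.exp(rng.normal(-1.5, 0.1, size=500))
log_bf = sequential_log_bf(pred_m1, pred_m2)
```

Plotting log_bf against t gives exactly the kind of sequential comparison path shown in the bottom panel of Fig. 5.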
This simulation study demonstrates that the posterior distributions for the parameters, the filtered volatilities, and the posterior predictive distribution for the one-step-ahead squared log returns are virtually identical under both estimation schemes. Moreover, PL allows for consistent sequential model comparison, which is prohibitively costly in the MCMC setting.
We have also performed a similar simulation study, but with non-Normally distributed data. In this case, the Bayes factors provide strong support for the SPM. Detailed results of this simulation study are included in the Online Appendix.

Real data application
In this section we present a real data application using return time series for two financial assets, the same as in Delatola and Griffin (2011). In particular, we consider the Microsoft company and the SP500 index. The daily prices from January 01, 2007 to October 31, 2016 for both assets are obtained from Datastream. The median, standard deviation, skewness, and kurtosis for the de-meaned log returns (in %) for Microsoft are -0.0271, 1.7581, 0.1926, and 12.7410, respectively, and 0.0129, 1.3051, -0.3273, and 13.2423 for the SP500 index.
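For reference, summary statistics of this kind can be computed as sketched below. The simulated price path is purely illustrative and does not reproduce the Datastream series; the function name is our own.

```python
import numpy as np
from scipy import stats

def return_summary(prices):
    """Median, standard deviation, skewness and (Pearson, non-excess)
    kurtosis of the de-meaned log returns, in %."""
    log_ret = 100 * np.diff(np.log(prices))
    r = log_ret - log_ret.mean()            # de-mean the returns
    return (np.median(r), r.std(ddof=1),
            stats.skew(r), stats.kurtosis(r, fisher=False))

# Toy usage with a simulated Gaussian random-walk price path.
rng = np.random.default_rng(1)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.015, size=2500)))
med, sd, sk, ku = return_summary(prices)
```

Note that `fisher=False` yields the raw (non-excess) kurtosis, which is 3 for a Normal; the values around 12-13 reported above are of this kind and signal heavy tails.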
In order to better illustrate the ability of the SPM to capture different distributions of the squared log returns, we split the data into two disjoint periods: a volatile one that includes the financial crisis (January 01, 2007-November 01, 2010) and a calm one (January 01, 2013-October 31, 2016), both containing 1,000 observations. Figure 6 draws the daily prices (panels (a) and (b)), the log returns (in %) for the entire period, where the two sub-periods of interest are in black (panels (c) and (d)), and the densities for the squared log returns for the two sub-periods (panels (e) and (f)). The SPM can capture such different shapes via the infinite mixture of Normals, whereas the purely parametric model fits the exact same distribution in all four cases.
Next, we estimate the SPM and PM specifications on these data. The hyper-parameters for the priors are the same as in the simulation study, and the offset parameter is set to $c_O = 0.0003$. The codes were run with 500k particles each. Figures 7 and 8 present the estimation results for the Microsoft data set. The figures draw the sequential predictive Bayes factors, as compared to the PM specification, and the estimated predictive densities at time $T+1$ for the two sub-periods. The PM density corresponds to the mixture of seven Normals used as an approximation of $\log \chi^2_1$. A visual inspection of the plots already makes it clear that the SPM estimates different densities than the one provided by the fully parametric model. The sequential predictive log Bayes factors confirm that the returns are non-Normally distributed, i.e. the SPM is strongly preferred to the PM for both sub-periods.
Figure 9. Sequential log predictive Bayes factors and estimated densities for the log-squared error term for the SPM, as compared to the PM, for the SP500 data for the first period. Figure 10. Sequential log predictive Bayes factors and estimated densities for the log-squared error term for the SPM, as compared to the PM, for the SP500 data for the second period.
Figures 9 and 10 present the estimation results for the two sub-periods of the SP500 data set. As for the Microsoft data, the SPM is strongly preferred to the PM for both sub-periods. Also, the shapes of the predictive distributions for the log squared returns differ dramatically from the ones produced by Normally distributed errors.
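Under Normal errors, the error term of the log squared returns follows a $\log \chi^2_1$ distribution, whose exact mean is $\psi(1/2) + \log 2 \approx -1.2704$ and whose variance is $\pi^2/2 \approx 4.9348$; this is the density that the seven-component Normal mixture approximates. As a quick illustrative check (not part of the original study), these moments can be verified by simulation:

```python
import numpy as np

rng = np.random.default_rng(2)

# If z ~ N(0, 1), then log(z^2) follows a log chi-squared(1) distribution.
# Exact moments: mean = psi(1/2) + log 2 ~ -1.2704, variance = pi^2/2 ~ 4.9348.
z = rng.normal(size=1_000_000)
x = np.log(z**2)
print(x.mean(), x.var())  # close to -1.2704 and 4.9348
```

The heavy left tail of this density (driven by returns near zero) is precisely what makes a single Normal a poor substitute and motivates the mixture approximation in the PM.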
To conclude, there is strong evidence that the SPM outperforms the PM for the selected data sets, confirming the findings of previous empirical studies. Consistent sequential model comparison is possible via the use of the proposed PL algorithm for semi-parametric SV models.

Discussion
This paper designs a sequential estimation procedure, based on PL, for a semi-parametric SV model. PL produces output comparable to MCMC while allowing for sequential inference, which is important in a high-frequency data context. SMC also produces a picture of the evolution of parameter learning and provides the predictive likelihoods at each data point as a by-product, which enables fast online model comparison using sequential predictive log Bayes factors. Finally, we present a real data application using the return time series of two financial assets: one index (SP500) and one company (Microsoft). As already confirmed in prior empirical semi-parametric SV studies, non-parametric errors provide a better model fit in both volatile and calm periods.
As noted in the introduction, we use PL to perform sequential Monte Carlo inference for semi-parametric SV models. Nevertheless, other particle filter alternatives are available. A comparison of these methodologies for the particular models considered in this article is of interest, and we believe it deserves its own space.