Product partition latent variable model for multiple change-point detection in multivariate data

The product partition model (PPM) is a well-established and efficient statistical method for detecting multiple change points in time-evolving univariate data. In this article, we refine the PPM for the purpose of detecting multiple change points in correlated multivariate time-evolving data. Our model detects distributional changes in both the mean and covariance structures of multivariate Gaussian data by exploiting a smaller dimensional representation of correlated multiple time series. The utility of the proposed method is demonstrated through experiments on simulated and real datasets.


Introduction
Multiple change-point analysis is a statistical methodology for identifying and locating positions or times at which an underlying process generating sequenced data undergoes abrupt local changes. These models have been deployed in a variety of scientific and industrial contexts. In economics and finance, change-point models have been used to detect changes in the volatilities of returns [6,23]. In bioinformatics, Erdman and Emerson [7] used Bayesian change-point analysis to detect regions of genetic alteration identifying genes involved in cancer progression. These models are particularly useful in industrial control settings where the underlying process generating the observed data changes, or parameters measuring certain quality aspects shift over time [22]. Other recent areas of application include climate change detection [28], solar flare imagery [14], and network traffic patterns [18].
Multiple change-point analysis presents many estimation challenges: simultaneously determining both the number of change points and their locations, as well as selecting an appropriate model for the data between change points, is not straightforward. Another level of complexity is added when the change points occur across multiple correlated time series, for example, in an industrial setting where it is required to determine the endpoints of different stages of a time-evolving pharmaceutical manufacturing process [4]. This is particularly difficult when the number of variables or time series is considerably larger than the number of time points measured, as in a microarray setting or in social network interactions [27]. In these situations, the application of standard parametric statistical methods is confounded by insufficient data for parameter estimation. Hence, there is a genuine need for dimension reduction techniques which account for the covariance between several time series in order to efficiently identify and estimate change points in multivariate time series data.
Although considerable attention has been devoted to change-point models in the recent literature [7,16,17,22,29,33], there is no standard method for modeling multivariate time series data with multiple change points. Markov chain Monte Carlo (MCMC) techniques were used by Erdman and Emerson [7] for Bayesian multiple change-point analysis of univariate time series, applied componentwise to multivariate data. This approach ignores dependencies between the multiple time series, so that changes in the full joint distribution of the multivariate time series will not be identified. In a similar vein, the R packages changepoint [15,16,17] and cpm [29] provide suites of parametric and non-parametric methods for multiple change-point analysis of univariate series. The Bayesian change-point detection techniques of Fearnhead [8,9] were used by Xuan and Murphy [33] to perform multiple change-point analysis on multivariate time series data. Although this approach appropriately models the covariance structure of the multivariate time series, it breaks down, due to singularity problems, as the dimension of the series increases whilst the length of the measured series stays constant. Recently, Maboudou-Tchao and Hawkins [22] employed a maximum-likelihood estimation approach coupled with a dynamic programming algorithm to detect multiple change points in multivariate time series. However, this approach cannot be easily extended to detect multiple change points in a high-dimensional setting.
In this article, we present a full Bayesian approach to the problem of multiple change-point detection in multivariate data, known as the product partition latent variable model (PPLVM). The PPLVM can be used to detect distributional changes in the mean and covariance, even in high-dimensional settings, whilst exposing underlying relationships in the series. The PPLVM draws on two different types of models: the product partition model (PPM) [2], which identifies change points by imposing a partition structure on parameters of interest, and the probabilistic principal component analysis (PPCA) model [24,25,31], which reduces the data dimension whilst modeling the covariance structure. Perhaps the key benefit of the PPM over other change-point models is that it assumes that both the number of change points and their locations are random variables, rather than considering them as fixed and known in advance. This model was initially used by Barry and Hartigan [3] to detect changes in the means of a Gaussian model. However, their approach does not fully explore the posterior distribution as the algorithm only collects posterior means. Loschi and Cruz [20] modified the Barry and Hartigan algorithm, applying a Gibbs sampling scheme to estimate the posterior distribution of the change points, the number of change points, and the posterior probability that each position is a change point. The PPLVM described here exploits these results of Loschi and Cruz and extends the PPM to detect changes in the full joint distribution of a multivariate time series whilst providing a smaller dimensional representation of the series.
The remainder of the article is structured as follows. In Section 2, the PPM, first introduced by Barry and Hartigan [2], is briefly reviewed. This is followed by a presentation of the PPLVM and how it can be used to identify distributional changes in the mean and covariance. The PPLVM is estimated within the Bayesian framework, so Section 2 also discusses the necessary prior distributions and briefly describes the use of MCMC techniques [10] to fit the PPLVM. In Section 3, we assess the performance of the PPLVM, compared to other approaches to change-point problems for multivariate data, on several simulated and real-world examples. Finally, Section 4 is a discussion of the proposed model and further avenues of research.

Change-point models
Let x_1, . . . , x_n be a sequence of observations measured over time which are assumed to be independent conditional on the parameters θ_i, with conditional marginal density f(x_i | θ_i), where i ∈ l = {1, . . . , n} is a set of ordered time points. The PPM, due to [2], assumes that there exists a random partition ρ = {c_1, . . . , c_b} which divides the set l into temporal blocks c_k = {i_{k-1} + 1, . . . , i_k}, where k = 1, . . . , b and i_k ∈ l. Instances or positions at which changes in the distribution of x = (x_1, . . . , x_n)^T occur identify change points and are constrained to be the block end points 0 = i_0 < i_1 < · · · < i_b = n. The prior probability of the partition ρ, which divides the data x into b blocks, can be expressed as

P(ρ = {c_1, . . . , c_b}) = K ∏_{k=1}^{b} E_{c_k},

where E_{c_k} is the prior cohesion associated with block k, representing the degree of similarity amongst observations within block c_k, and the constant K is chosen so that the sum of the probabilities over all partitions is one. The PPM treats the number of partitions b as a random variable B taking values 1 to n. Given the partition ρ = {c_1, . . . , c_b}, the sequence of parameters θ_1, . . . , θ_n can be partitioned into independent common parameters θ_{c_1}, . . . , θ_{c_b}, where θ_{c_k} = θ_{i_{k-1}+1} = · · · = θ_{i_k}.
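To make the product partition prior concrete, the following sketch enumerates all contiguous partitions of a short series and normalizes a geometric (Yao-type) prior cohesion E_{c_k} = p(1 − p)^{|c_k|−1}; both the cohesion form and the value p = 0.2 are illustrative assumptions, not prescriptions of the model.

```python
from itertools import combinations
from math import prod

def cohesion(block_len, p):
    # Geometric (Yao-type) cohesion: a block of length L continues for
    # L - 1 steps with probability (1 - p) each, then a change occurs
    return p * (1.0 - p) ** (block_len - 1)

def partition_prior(n, p=0.2):
    # Enumerate every contiguous partition of {1, ..., n} by its set of
    # interior change points, score it by the product of block cohesions,
    # then normalize by the constant K so the probabilities sum to one
    probs = {}
    for cps in (c for r in range(n) for c in combinations(range(1, n), r)):
        bounds = (0,) + cps + (n,)
        lens = [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]
        probs[bounds] = prod(cohesion(L, p) for L in lens)
    K = 1.0 / sum(probs.values())
    return {k: K * v for k, v in probs.items()}

prior = partition_prior(4, p=0.2)
```

With this cohesion, the normalized prior of the single-block partition of n = 4 points equals (1 − p)^3, matching the probability of observing no change under independent Bernoulli(p) change indicators.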

Product partition latent variable model
Suppose x_i = (x_{i1}, . . . , x_{im})^T is a set of measurements taken on m correlated variables at time point i. A generative statistical model, the PPLVM, closely related to the PPCA model [24,25,31], can be used to model a high-dimensional observed data point x_i as a linear function of a corresponding low-dimensional latent variable u_i = (u_{i1}, . . . , u_{iq})^T, q ≪ m. Under the PPLVM, the observed data point x_i can be generated through the generative model

x_i = W u_i + μ_i + ε_i,

where W is an m × q transformation matrix, which relates the latent variables to the observed variables, and μ_i is the mean of the data at time point i. The noise ε_i and the latent variables u_i are assumed to be independently Gaussian-distributed random vectors, MVN_m[0, σ²_i I] and MVN_q[0, I], respectively, where I is an identity matrix. Unlike the PPCA model, the PPLVM presented here allows the mean μ_i and the noise covariance parameter σ²_i to vary slowly with time. The PPM introduced in Section 2 is incorporated into the PPLVM to model time dependencies within the data by imposing a random partition structure ρ on the means and the noise covariance parameters, θ_i = (μ_i, σ²_i). This partition structure ρ identifies blocks of data x_{c_k} which share the same parameter value θ_k within a block but are heterogeneous between blocks. Hence, the PPLVM is a clustering approach which assumes that the process generating the data can be split into temporal blocks in which the process is constant. Under this model, conditional on the partition ρ = {c_1, . . . , c_b} and the latent variables u = (u_1, . . . , u_n), the data x = (x_1, . . . , x_n) have joint distribution

p(x | u, θ, ρ) = ∏_{k=1}^{b} ∏_{i ∈ c_k} MVN_m(x_i | W u_i + μ_k, σ²_k I),

where θ = (θ_1, . . . , θ_b) and b is the number of partitions.
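The generative mechanism above can be sketched as follows; the dimensions, the single change point, and the block means and noise variances are hypothetical values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 120, 6, 2                  # time points, observed and latent dimensions
W = rng.normal(size=(m, q))          # transformation matrix, shared across time

# hypothetical partition: one change point after t = 60
blocks = [(0, 60), (60, n)]
mus = [np.zeros(m), 2.0 * np.ones(m)]    # block means mu_k
sig2 = [1.0, 0.25]                       # block noise variances sigma_k^2

X = np.empty((n, m))
for (lo, hi), mu, s2 in zip(blocks, mus, sig2):
    for i in range(lo, hi):
        u_i = rng.normal(size=q)                        # u_i ~ MVN_q[0, I]
        eps_i = rng.normal(scale=np.sqrt(s2), size=m)   # eps_i ~ MVN_m[0, s2 I]
        X[i] = W @ u_i + mu + eps_i                     # x_i = W u_i + mu_i + eps_i
```

Within each block the marginal covariance of x_i is W W^T + σ²_k I, so a change in σ²_k changes the joint distribution even when the mean is held fixed.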

Estimation of the PPLVM
Parameters of the PPLVM can be estimated by conditioning on the partition and then averaging over all possible partitions. However, as the length of the series n increases, the number of possible partitions grows exponentially, so that it becomes infeasible to compute over all partitions. Hence, Bayesian approximation methods are required which can search the space of all possible partitions for values of ρ with high posterior probability. This can be carried out by replacing the random partition ρ with a binary latent variable z = (z_1, . . . , z_{n−1}), as suggested by Barry and Hartigan [3], where z_i = 1 if i is a change point (θ_i ≠ θ_{i+1}) and z_i = 0 otherwise. Estimating the random vector z is equivalent to estimating the random partition ρ. MCMC techniques such as Gibbs sampling [12] can be used to sample each z_i from its full conditional distribution. A Bayesian approach requires the specification of prior distributions for all the model parameters.
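The equivalence between the binary vector z and the partition ρ can be illustrated with a small helper; the function name and the 0-based indexing convention are ours.

```python
def z_to_blocks(z):
    """Map change indicators z_1, ..., z_{n-1} to contiguous blocks.

    z[i-1] == 1 means a change point between positions i and i+1
    (positions here are 0-based indices 0, ..., n-1)."""
    n = len(z) + 1
    blocks, begin = [], 0
    for i, zi in enumerate(z, start=1):
        if zi == 1:
            blocks.append(list(range(begin, i)))  # close the current block
            begin = i
    blocks.append(list(range(begin, n)))          # final block runs to the end
    return blocks

# two change points yield three blocks; no change points yield one block
assert z_to_blocks([0, 1, 0, 0, 1]) == [[0, 1], [2, 3, 4], [5]]
```

Sampling each z_i in a Gibbs sweep therefore amounts to proposing to split or merge the two blocks adjacent to position i.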
Detailed derivations of the full conditional distributions for z and all other parameters are given in the additional file. Under the PPLVM, the parameters of interest include W, μ_1, . . . , μ_b, σ²_1, . . . , σ²_b, ρ, and b. Assuming independent priors on each parameter, the posterior of interest can be expressed as

P(W, θ, ν, ρ, b, p | x, u) ∝ P(x | u, W, θ, ρ) P(W | ν) P(ν) P(θ | ρ) P(ρ | B = b, p) P(B = b | p) P(p),

where ν is a hyperparameter associated with the prior covariance of W, p is the probability of a change point, and P(ρ | B = b, p) and P(B = b | p) are the partition priors specified in Section 2.

Prior distributions
A fully Bayesian approach is taken by specifying independent priors on each of the parameters. A prior distribution on the probability of a change point p is necessary in order to control the number of change points. A prior distribution concentrated on very small values of p will result in a model that ignores small changes. By contrast, large values of p will result in oversensitive models that respond to spurious changes through time [32]. A beta distribution, P(p) = Beta[α_p, β_p], is the most popular prior distribution over p used in most approaches to multiple change-point problems [19,20,21]. If the number of change points is known a priori to be small, small values of α_p and large values of β_p can be specified. This restricts the distribution of p to be concentrated around small values.
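A quick numerical check of this intuition; the values α_p = 1 and β_p = 19 are illustrative choices, not values used in the article.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_p, beta_p = 1.0, 19.0   # hypothetical choice favouring few change points

# draws from the Beta prior on the change-point probability p
draws = rng.beta(alpha_p, beta_p, size=100_000)

# prior mean alpha_p / (alpha_p + beta_p) = 0.05:
# roughly one change expected every 20 positions
```

Small α_p with large β_p pushes nearly all of the prior mass of p below 0.2, encoding the belief that changes are rare.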
Assuming the binary random variables z_i are independent Bernoulli distributed with parameter p, the prior distributions on the partition ρ and on the number of partitions B, given p, are specified respectively as

P(ρ = {c_1, . . . , c_b} | p) = p^{b−1} (1 − p)^{n−b}  and  P(B = b | p) = C^{n−1}_{b−1} p^{b−1} (1 − p)^{n−b},

where C^{n−1}_{b−1} is the number of distinct partitions of l into b contiguous blocks. A q-dimensional multivariate Gaussian prior distribution centered at the origin, with a hyperparameter ν controlling the prior covariance of the loadings, is placed on the rows w_j of the loadings matrix, that is, w_j | ν ~ MVN_q[0, diag(ν)^{−1}], j = 1, . . . , m. The elements ν_r, r = 1, . . . , q, of the hyperparameter ν have a Gamma hyperprior distribution with parameters a_o and b_o, that is, ν_r ~ Gamma[a_o, b_o]. We impose the partition structure ρ on the space of the mean and noise covariance parameters by assigning prior distributions that identify the clustering structure in the data. Hence, given the partition ρ, conjugate block priors are specified; this choice of block prior distribution also allows the mean to be related to the noise variance σ²_k as

μ_k | σ²_k ~ MVN_m[μ_o, (σ²_k / β_μ) I]  and  σ²_k ~ Inverse-Gamma[α_σ², β_σ²],

where μ_o is the prior mean, β_μ is a parameter controlling the prior covariance of the mean, and α_σ² and β_σ² control the prior distribution of the noise covariance.

Sampling from the PPLVM: the Gibbs sampler
Given the specified prior distributions, the resulting posterior distribution P(W, θ, ν, ρ, b, p | x, u) is intricate and MCMC methods are required to produce realizations of the model parameters from their full conditional distributions.

The PPLVM identification
If the transformation matrix and the latent variables are subjected to a common orthogonal rotation, the distribution of the observed data is unchanged. Hence, the PPLVM suffers from identification issues: its parameters cannot be uniquely identified unless restrictions are imposed.
There are several approaches in the literature available for dealing with non-identifiability of the related factor analytic models which can also be adopted for the PPLVM. A unique PPLVM is defined by constraining the transformation matrix such that the first q rows are lower triangular with positive diagonal elements [11]. Although this approach is known to impose structure on the ordering of the variables [1], this does not affect how the PPLVM identifies the change points.
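A sketch of this identification step, applied as a post-processing rotation of a sampled loadings matrix; the LQ-based construction below is one standard way to realize the constraint, not necessarily the paper's exact procedure.

```python
import numpy as np

def identify_loadings(W):
    """Rotate W so its first q rows form a lower-triangular matrix with a
    positive diagonal; W W^T (and hence the likelihood) is unchanged."""
    q = W.shape[1]
    # LQ decomposition of the top q x q block via QR of its transpose:
    # W[:q].T = Q R  implies  W[:q] @ Q = R.T, which is lower triangular
    Q, _ = np.linalg.qr(W[:q].T)
    W_rot = W @ Q                        # orthogonal rotation: W W^T invariant
    sign = np.sign(np.diag(W_rot[:q]))   # flip columns to force positive diagonal
    sign[sign == 0] = 1.0
    return W_rot * sign

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 3))
W_id = identify_loadings(W)
```

Because only an orthogonal rotation and column sign flips are applied, the implied data covariance W W^T + σ² I is exactly preserved, so the constraint resolves rotational non-identifiability without changing the model fit.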

Simulations and case studies
In this section, we present results of applying the proposed approach and existing approaches to the problem of multiple change points in a multivariate data setting, on simulated and real datasets. For all analyses, a Gibbs sampling scheme of 20,000 iterations was run, retaining every 20th iteration; the first 5000 iterations were discarded as burn-in.
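The number of retained MCMC draws implied by these settings can be counted directly:

```python
# 20,000 Gibbs iterations, burn-in of 5000, keep every 20th draw thereafter
n_iter, burn, thin = 20_000, 5_000, 20
kept = [t for t in range(n_iter) if t >= burn and t % thin == 0]
# (20,000 - 5000) / 20 = 750 retained posterior draws per run
```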

Simulation study
Simulation experiments were carried out to assess the performance of the PPLVM in comparison to other existing methods for multiple change-point analysis of multivariate time series, namely bcp [7], ecp [13], and the dynamic programming (DP) approach of Maboudou-Tchao and Hawkins [22]. The different simulation scenarios presented in this section illustrate different aspects of the proposed approach in detecting distributional changes in the mean and covariance structure. Three scenarios were considered, and data were generated from a Gaussian distribution, MVN_m(μ, Σ), where the mean μ = (μ_1, . . . , μ_m) and covariance Σ were allowed to change at certain time points. In the first scenario, the distributional changes were imposed on the mean structure; in the second, on the covariance structure; and in the third, on both the mean and the covariance structure.
Scenario 1: Changes in the mean structure. In order to compare our approach to the other three methods, data in this scenario were simulated following the simulation strategy in DP [22]. Three time series (m = 3) of length n = 125 with change points in the mean at time points τ_1 = 30, τ_2 = 65, and τ_3 = 100 were generated according to the DP scheme. The covariance is fixed at Σ = I_m (identity covariance) and the mean parameters were set at μ_o = μ + kσ, where μ = 0, k = 3, and σ = 1. The PPLVM, bcp, ecp, and DP were all applied to the generated data to identify the three change points, and the results are presented in Figure 1 and Table 1.
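A sketch of this generating scheme; applying the shift of kσ cumulatively at each change point is our assumption, since the recovered text specifies only μ_o = μ + kσ.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 125, 3
k, sigma = 3.0, 1.0
taus = [30, 65, 100]                    # true change points
bounds = [0] + taus + [n]

X = np.empty((n, m))
mu = np.zeros(m)
for j in range(len(bounds) - 1):
    lo, hi = bounds[j], bounds[j + 1]
    # identity covariance: independent N(mu, sigma^2) draws per variable
    X[lo:hi] = rng.normal(loc=mu, scale=sigma, size=(hi - lo, m))
    mu = mu + k * sigma                 # mean jumps by k*sigma at each change
```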
Small posterior probabilities (pp) convey the absence of a change point, pp close to 1 identifies a change point with high resolution, whilst pp close to 0.5 gives a less clear-cut signal. Table 1 shows that the PPLVM, as well as bcp and DP, performed very well in predicting the three change points, whilst ecp failed to correctly locate the last change point. Although the bcp estimates for the change points match those of the PPLVM, the posterior probabilities corresponding to the identification of these change points were consistently higher for the PPLVM method (0.85, 0.81, 0.97) than for the bcp method (0.44, 0.51, 0.81), as shown in Figure 1(d), implying that the specificity provided by the former method is greater. The posterior means estimated by the PPLVM are presented in Figure 1.

Scenario 2: Changes in the covariance structure. In this scenario, we examine the performance of the PPLVM when the joint distribution of the multivariate time series changes whilst the marginal distributions are constant. The simulation approach taken in ecp [13] was adopted, such that m = 5 time series of length n = 125 were generated with change points in the covariance structure at the same positions as in the first scenario, τ_1 = 30, τ_2 = 65, and τ_3 = 100. The covariance was Σ_1 = I_m for observations 1:30 and Σ_2, a 5 × 5 matrix with unit variances and all off-diagonal entries equal to 0.9, for observations 31:65, with the mean fixed at μ = (0, 0, 0, 0, 0) throughout. Only the PPLVM, bcp, and ecp were applied to investigate change points in this scenario and the results are shown in Table 2.
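The two covariance regimes that are explicit in the text can be simulated as follows; only observations 1:65 are covered here, since the covariances of the remaining segments are not recoverable from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 5
Sigma1 = np.eye(m)               # identity covariance, observations 1:30
Sigma2 = np.full((m, m), 0.9)    # unit variances, 0.9 cross-correlation
np.fill_diagonal(Sigma2, 1.0)

mu = np.zeros(m)                 # marginal mean held fixed throughout
X = np.vstack([rng.multivariate_normal(mu, Sigma1, size=30),   # obs 1:30
               rng.multivariate_normal(mu, Sigma2, size=35)])  # obs 31:65
```

The marginal distribution of every individual series is N(0, 1) in both segments, so only a method that models the joint covariance can see the change at τ_1 = 30.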
The performance of the methods depends on the type of distributional change. Since the change in this second scenario is in the full joint distribution of the multivariate time series data, Table 2 shows that the bcp method failed (pp < 0.5) to identify any change points. The bcp approach cannot detect changes in the covariance structure as it assumes that the multiple time series are independent. Figure 2 shows that the posterior probabilities for bcp are flat throughout, whilst the proposed model accurately estimated change points at τ_1 = 30, τ_2 = 65, and τ_3 = 100 with probabilities 0.91, 0.77, and 0.75, respectively. Although the simulation approach in this second scenario was adopted from the ecp method, ecp estimated change-point locations slightly different from the true change points. Table 2 shows that the PPLVM outperforms the two methods, as any change in the correlation structure is exposed in its noise model. This highlights that, in many cases where the parametric distribution of the data is known, non-parametric approaches are less powerful than parametric approaches in detecting changes.
Scenario 3: Changes in both the mean and covariance structure. Unlike the first two scenarios, the third scenario has only two change points, at τ_1 = 40 and τ_2 = 70, affecting both the mean and the covariance structure. As in the first scenario, data were generated according to the simulation strategy of Maboudou-Tchao and Hawkins [22], modified to allow the covariances between the three observed variables to change over time. Three correlated time series of length n = 100 with two change points in both μ and Σ were generated. The results of applying the PPLVM, bcp, and ecp are presented in Table 3. The table shows that the proposed approach correctly inferred that the generated data have two change points, at τ_1 = 40 and τ_2 = 70, with probabilities 0.80 and 0.82, respectively. As in the second scenario, the bcp approach found it difficult to identify any change points since the changes were in the joint distribution of the data. As shown in Table 3, the ecp method provided correct estimates of the change points in this scenario.

Case study 1: Climate change data
In this section, we assess the potential utility of the PPLVM in climate change analysis by applying it to annual temperature data from five Swiss stations, publicly available through the European Climate Assessment and Database (ECAD) project web page www.eca.knmi.nl. The data consist of 104 average annual temperatures for the period 1900-2004 for the five Swiss stations Basel, Geneva, Lugano, Saentis, and Zurich. The aim is to establish the timing of potential changes in Swiss temperatures over this period. These data were previously analyzed at www.chartsgraphs.wordpress.com using cumulative sum (CUSUM) control charts, first developed by Page [26] for detecting shifts in trends. That analysis identified four change points for Geneva (1920, 1943, 1963, and 1989), and three each for Lugano (1919, 1940, 1988) and Zurich (1943, 1952, 1988). The posterior probabilities estimated by the PPLVM, together with the posterior estimates of the annual temperatures for the five stations, are presented in Figure 3. The model predicts that the change point in annual temperature in 1963 was less pronounced in Basel. Since these change points were on the joint distribution of the five cities, the bcp approach had problems identifying them, and its estimated posterior probabilities were less than 0.5.

Case study 2: Freeze drying data
In an industrial controlled setting, this model can be used to detect changes in the underlying process generating data. In this example, the PPLVM is applied to a freeze drying dataset from the pharmaceutical industry to detect changes in the freeze drying process. Freeze drying, also known as lyophilization, is a process used in the pharmaceutical industry to remove water from pharmaceutical products by sublimation at very low temperatures [4]. Freeze drying a product improves its stability for increased shelf life, preserves its biological activity, and maintains its basic structure. The freeze drying process is performed through three successive stages: (i) the freezing stage, in which water is converted to ice and solutes are crystallized; (ii) the primary drying stage, in which the product temperature is slowly increased whilst pressure is reduced, removing ice crystals by sublimation; and (iii) the secondary drying stage, in which temperature is gradually increased under high vacuum to remove the remaining water. In order to reduce the time and costs of the process, the main task is to monitor and determine the endpoints of the freeze drying process.
Spectroscopic process analyzers, in combination with chemometric tools, are used to monitor the different stages of freeze drying [4,5,30]. The main limitation of chemometric methods is that they are not based on any probability density model. They assume observations are independent, ignoring the correlation due to repeated time measurements over the freeze drying cycle, and they cannot be used to efficiently predict the endpoints of the freeze drying cycle with a stated level of assurance. The problem of determining the endpoints of the different stages of freeze drying can be treated as a multiple change-point problem. In this example, the PPLVM is employed to identify the end of primary drying and the start of secondary drying using information measured from the product temperature, condenser temperature, and chamber pressure variables only. The freeze drying data used here consist of 200 time points spanning the primary and secondary drying stages, measured on 17 variables (2 shelf temperature, 9 product temperature, 5 condenser temperature, and 1 chamber pressure). Figure 4 shows that the PPLVM correctly identified the end of primary drying (after 99.6 h, shown by the first green arrow) with probability 1, whilst bcp identifies this change point at 98.6 h with probability 0.49 and a notable spike at 99.6 h. The PPLVM identified the end of secondary drying to be around 108 h (1 h later than the correct change point shown by the second green arrow) with probability 1, whilst bcp identified it at around 109 h (2 h later than the correct change point) with probability 0.81. Figure 4. Results of applying the PPLVM to the freeze drying data: the upper plot shows the product and condenser temperature profiles, denoted by black and grey dashed and dotted lines, respectively. The pressure profile is denoted by a blue dash-dotted line.
The shelf temperature profiles, in red, show the true endpoints for primary and secondary drying (set by the manufacturer), denoted by the two green arrows. The lower plot compares the posterior probabilities P(z_i = 1 | x, u, W) estimated from the data (without the shelf temperature profiles) by the PPLVM, as a solid red line, and by bcp, as a dashed black line. The PPLVM identifies the ends of primary and secondary drying at τ_1 = 99.6 and τ_2 = 108 h, respectively, each with probability 1.

Discussion
The simulated and real examples presented in this article demonstrate that the PPLVM is an important tool for detecting multiple change points in multivariate time series data. Perhaps the key contribution of this article is the extension of the Barry and Hartigan [3] PPM, providing a full Bayesian approach to the problem of detecting distributional changes in the mean and covariance of multivariate Gaussian time series data. Compared to existing methods for multiple change-point analysis of multivariate data, the dimensionality reduction approach of the PPLVM was shown to perform well. One important advantage of this approach is that it can also be used to model data from a high-dimensional setting, providing a smaller dimensional representation of the observed series. The PPLVM is very flexible and can be applied to a wide variety of settings, including finance, bioinformatics, genetics, and engineering.
Currently, compared to other approaches, the PPLVM has computational complexity that is quadratic in the length of the time series. Further work is required to improve the computational efficiency of the PPLVM.