Bayesian Change Point Detection with Spike-and-Slab Priors

Abstract We study the use of spike-and-slab priors for consistent estimation of the number of change points and their locations. Leveraging recent results in the variable selection literature, we show that an estimator based on spike-and-slab priors achieves optimal localization rate in the multiple offline change point detection problem. Based on this estimator, we propose a Bayesian change point detection method, which is one of the fastest Bayesian methodologies. We demonstrate through empirical work the good performance of our approach vis-a-vis some state-of-the-art benchmarks. Interestingly, despite having a Gaussian noise assumption, our approach is more robust to misspecification of the error terms than the competing methods in numerical experiments. Supplementary materials for this article are available online.


Introduction
Change point detection has received considerable attention in the statistical literature for several decades.Assume we observe a vector of independent random variables Y = (Y 1 , . . ., Y T ) according to the linear model where f t is a right continuous function with an unknown number K of change points, and t 's are independent random variables with E( t ) = 0 for all t.The main goal of offline changepoint detection is to simultaneously estimate K and the locations of the change points of f t .A natural assumption is that f t (i.e., the conditional mean) is sparse, in the sense that there are only a small number of change-points.We further assume that f t is piecewise-constant (i.e., a right continuous step function).These modeling choices lead to a low-dimensional parametric model that is interpretable, can fit nonstationary time-series, and is suitable for prediction.The literature on change point detection includes a large number of frequentist methods.Most of these methodologies rely on a test statistic to detect parametric changes in the distribution of the observables and model selection techniques to determine the number of parameters defining the signal f t .Some examples of test statistics include the likelihood ratio and the CUSUM statistic (Page 1954).The model selection step typically relies on either an 0 or an 1 penalty.The penalty is included directly through a penalized likelihood or via an information criterion such as AIC and BIC.Consequently, several variable selection methodologies have a conceptual analogue in change point detection: for example, the Dantzig selector of Candes et al. (2007) has the same rationale as the multiscale SMUCE estimator of Frick, Munk, and Sieling (2014); the total CONTACT Lorenzo Cappello lorenzo.cappello@upf.eduDepartment of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain.Supplementary materials for this article are available online.Please go to www.tandfonline.com/r/JCGS.variation denoising (Rudin, Osher, and Fatemi 1992) and the fused LASSO procedure (Tibshirani et al. 2005) share the same penalty with the LASSO (Tibshirani 1996).Other frequentist methods for univariate change point detection include the wild binary segmentation of Fryzlewicz (2014) based on the CUSUM statistics, and various algorithms for 0 penalized change point detection (Friedrich et al. 2008;Rigaill 2010;Killick, Fearnhead, and Eckley 2012;Maidstone et al. 2017).
From the Bayesian perspective, popular change point detection methods rely on product partition models (Barry andHartigan 1992, 1993).However, the use of MCMC to approximate the posterior distributions of these models is challenging (Chib 1996(Chib , 1998)), and much research has focused on alternatives to MCMC: Fearnhead (2006) proposed two algorithms to perform direct simulation from the posterior distribution (one to do exact simulation from the posterior and one using an approximate version); Rigaill, Lebarbier, and Robin (2012) derived exact formulas for the posterior distribution.Recent works take an empirical Bayes approach to set the prior distributions (Du, Kao, and Kou 2016;Liu, Martin, and Shen 2017).Liu, Martin, and Shen (2017) is more general, allowing to recover piecewise polynomial signals.To the best of our knowledge, Bayesian variable selection procedures such as the horseshoe prior (Carvalho, Polson, and Scott 2010) and the spike-and-slab prior (Mitchell and Beauchamp 1988), have not been studied in this setting.Recent works used the horseshoe prior for trend filtering (Faulkner and Minin 2018;Kowal, Matteson, and Ruppert 2019) but not to explicitly infer K and the change points locations.
In this article, we study spike-and-slab priors for offline multiple change point detection.Starting from a baseline f 0 , we model each increment through the operator f i = f i − f i−1 for 1 ≤ i ≤ T and introduce latent binary variables (Z 1 , . . ., Z T ) to indicate whether f i corresponds to a change point or not.
The prior distribution on the increment f i under Z i = 0 is a distribution "concentrated" around 0 (or a point mass) called spike.The prior distribution for f i under Z i = 1 is a diffused distribution called slab.The choice of which distributions to use for the spike, the slab, and the model space, has been a subject of extensive research; see Bhadra et al. (2019) for a recent review.We use the shrinking and diffusing prior of Narisetty and He (2014), which consists of Gaussian spike-and-slab priors with sample size dependent prior variances.The reasons for this choice are (i) Narisetty and He (2014) proved one of the strongest selection consistency results in the Bayesian variable selection literature, and (ii) Chen and Walker (2019) recently proposed a methodology for fast Bayesian variable selection employing this prior and not requiring MCMC.
Here, we make the following contributions to the change point detection literature.(i) We show how to employ spikeand-slab priors for change point detection and propose a fast algorithm capable of detecting change points from thousands of observations in under a second.(ii) We establish that a modified estimator based on the shrinking and diffusing prior is consistent and achieves optimal localization rates of multiple change points.(iii) Through simulations, we show that our procedure is competitive with state-of-the-art methodologies.A salient feature of our proposed method is that it does not rely on MCMC, avoiding the risk of MCMC chains failing to converge.Additionally, in numerical experiments, it is highly robust to misspecification of the noise term, a situation where many stateof-the-art benchmarks fail by substantially overestimating the number of change points.This is somewhat surprising given that the method is developed under the Gaussian noise assumption.
While in this article we study the univariate change point detection problem, methods for change point detection have also been studied for other types of data beyond univariate mean change point detection and settings more general than (1).Pein, Sieling, and Munk (2017) considered change point detection with heterogeneous noise.Carlstein (1988), Rizzo and Székely (2010), Zou et al. (2014), Matteson and James (2014), and Padilla et al. (2019a, 2019b, 2019c) developed nonparametric change point methods that can detect arbitrary changes in distribution.Cho and Fryzlewicz (2015), Cho (2016), and Wang and Samworth (2018) focused on high-dimensional change point estimators.Aue et al. (2009), Avanesov and Buzun (2018), and Wang, Yu, and Rinaldo (2021) studied covariance change point detection.Fearnhead and Rigaill (2018) considered methods for change point detection combining a robust loss with the 0 penalty.Vanegas, Behr, and Munk (2022) proposed a multiscale method for quantile change point detection.Here, we focused on a simpler setting because spike-and-slab priors were not studied in the change point detection context, and, as stated by Wang, Yu, and Rinaldo (2020), the estimators built for (1) are often the building blocks for more complex settings.
The rest of the article is organized as follows.Section 2 describes the model, conditions on the parameters, and introduces the algorithm.Section 3 presents our main results on consistency.We present simulation studies in Section 4 to illustrate how our proposal fares with alternatives.Section 5 includes applications to microarray and ion channel data.Section 6 concludes with a discussion on the use of spike-and-slab priors for multiple change point detection.

Method
We assume that ( t ) 1:T are independent Gaussian random variables with mean zero and known variance σ 2 .Let C * := {η 1 , . . ., η K } ⊂ {1, . . ., T} denote the set of change points of the piecewise-constant right continuous signal f t , we can write this as (2) and assume that f 0 = μ 0 = 0. Further technical conditions on f t will be given in Section 3. We include a discussion on how to handle unknown σ and f 0 = 0 later in the section.
Basad.cp change point detection.A working model employing spike-and-slab priors is as follows: for t from 1 to T, f t = f t − f t−1 , and τ 2 0,T , τ 2 1,T , and q T are hyperparameters that depend on T, with τ 2 1,T τ 2 0,T .The rationale behind this set-up is that the posterior probability of Z t = 1 should be high for t ∈ C * ; vice versa, the posterior probability of Z t = 0 should be high for t / ∈ C * .A natural change point detection procedure is to employ the posterior probabilities of Z t to determine if t is a change point or not, for example classifying t as a change point if the posterior exceeds a certain threshold.We elaborate on this selection rule later.The assignment of the same prior for f 1 follows from the assumption that the mean process starts at zero.
A relevant difference with variable selection comes from the fact that while covariates are not ordered, in change point detection we generally want to avoid classifying consecutive time instances as change points.We expect this behavior because it is a common feature in the change point detection literature.Most of the procedures employ minimum spacing conditions, that is, the distance between consecutive change points is lower bounded by a quantity > 0, such that |η j − η j+1 | > .Minimum spacing conditions are used both in the finite sample implementations of the estimators and in the proofs of consistency.We will introduce a procedure to avoid consecutive change points.
The model (3), which we will refer to as basad.cp, is the analogue of the basad variable selection procedure of Narisetty and He (2014) to change point detection.The sample size dependent hyperparameters τ 2 0,T , τ 2 1,T and q T , are the salient feature of (3).We require that as T → ∞, τ 2 0,T → 0 and τ 2 1,T → ∞.In variable selection, a shrinking τ 2 0,T ensures that the marginal posterior probability of including (excluding) an active (inactive) covariate converges to one as sample size increases.Increasing τ 2 1,T and q T allows for the consistent estimation of the number of active covariates and consistent model selection.Narisetty and He (2014) proved that the penalization achieved through τ 2 1,T and q T is equivalent to an explicit 0 penalty.We will show in Section 3 that these parameters play a similar role for achieving consistent estimation of K and η 1 , . . ., η K .A similar asymptotic result holds despite the settings being very different: in high-dimensional variable selection, the number of covariates grows at a rate faster than the number of samples; in change point detection, the number of piecewise increments is equal or smaller than the sample size.
Another key feature of the methodology of Narisetty and He (2014) is that they employ the marginal posterior probabilities pr(Z t = 1|Y, σ 2 ) to select the active parameters in the finite sample implementation (to prove consistency they employ pr(Z|Y, σ 2 ), with Z = (Z 1 , . . ., Z T ) ).The idea is motivated by computational reasons, given that one can sample from the marginals with a Gibbs sampler, which is not available for pr(Z|Y, σ 2 ).Furthermore, the MCMC targeting pr(Z|Y, σ 2 ) has a much larger space of models and struggles to explore it, for example, Fearnhead (2006) and Rigaill, Lebarbier, and Robin (2012).For this reason, we will employ marginals for change point detection.
We classify a time instance t as a change point if pr(Z t = 1|Y, σ 2 ) is larger than a certain threshold.In this case, the estimated number of change points K is the number of marginal posterior probabilities larger than the chosen threshold.The model selected using 0.5 as a threshold corresponds to the median probability model of Barbieri and Berger (2004), who also proved that it is the optimal predictive model.An alternative strategy to select the change points would be to first rank the f t based on pr(Z t = 1|Y, σ 2 ), and then select the top K (the model size) increments according to a given information criteria.We do not investigate this strategy and leave it for future work.We further stress that, under our selection rule, it is likely that there will be consecutive time instances such that pr(Z t = 1|Y, σ 2 ) > 0.5, that is, consecutive points could be "classified" as change points.We deal with this issue after introducing an alternative methodology to compute pr(Z t = 1|Y, σ 2 ).
Solo.cp change point detection.Chen and Walker (2019) introduced a sequential procedure based on a misspecification of basad that admits marginal posterior probabilities in closed form.Their method, called solo spike-and-slab, has asymptotic properties and empirical accuracy similar to basad, while being substantially faster.While our setting can be seen as a particular instance of the linear regression framework of Chen and Walker (2019), there is an advantage in deriving the closed form marginal pr(Z t = 1|Y, σ 2 ) for the setting considered in this article because we will be able to simplify certain calculations further.We do this following the same steps of Chen and Walker (2019).Suppose we are interested in testing whether a time instance j is a change point and consider the following model: ) where τ 2 T is an additional sample size dependent hyperparameter.Thus, we place a spike-and-slab prior on a single change point at a time (in this case f j ), and place conjugate Gaussian priors on the remaining terms.The tuning parameter τ 2 T controls the shrinkage across time instances.The advantage of model ( 4) is that it allows us to write the marginal posterior probabilities in closed-form.First, we can marginalize out (5) where we have integrated out the random vector f −j using the conjugate prior distribution given in (4).The parameters n j+1 , . . ., n T , γ j,j and y j,j are function of the hyperparameters of the prior on f −j and data Y 1:T , and are computed recursively as follows.Initialize n T = τ 2 T /(τ 2 T + σ 2 ), ȳ T = y T , and compute recursively for i = T − 1 to j + 1 , and Then, set γ 1,j , and for i = 1 to j compute Despite the involved notation, simple calculations lead to the definitions of parameters in ( 6) and ( 7).The basic idea is first to marginalize f T , then f T−1 , then continue sequentially to f j+1 .This first step leads to the definition of the parameters in (6).In the second step, we first marginalize f 1 and then recursively integrate out the remaining parameters until f j−1 .This second step leads to the definition of the parameters in (7).Equations ( 6)-( 7), together with the priors on f −j , define a vector of weights: observations recorded at time points closer to j have a higher weight than those further away from j. Essentially, what we have done is marginalizing two Gaussian Markov random fields (GMRF) of order 1 with parameter τ 2 T , one to the left and one to the right of j; see Rue and Held (2005) for an intro to GMRFs.
Notice that the quantities {n i } can be computed in O(T) as follows.First, we compute and store the quantities which can be done in O(T).Then a single pass using (6) allows us to obtain {n i } with O(T) cost.As for (7), we start by computing Next, for a fix j, we calculate n i,j for i = 1, . . ., i. Then we store and compute Hence, the calculations in ( 7) can be done in O(T 2 ).
Given the marginal likelihood L(Y| f j , σ 2 ), we can compute the marginal posterior distribution of f j through Bayes rule: where the parameters are defined as follows, for k ∈ {0, 1}: The parameters above are all we need in order to compute pr(Z j = 1|Y, σ 2 ), which corresponds to Here, q T is the sparsity inducing parameter.Given (10), we follow the same procedure described for basad.cp:a time instance is declared a change point if pr(Z j = 1|Y, σ 2 ) exceeds the prespecified threshold.In practice, we would not be interested only in a single time instance j, so one needs to compute (10) for j in 1 to T. That is, we need to consider T models.Equations ( 5), (8), and (10) are the analogues to (7), ( 11) and ( 19) in Chen and Walker (2019).Similarly, the definitions of parameters in ( 9) are the analogues of ( 12)-( 14) from Chen and Walker (2019).The differences arise because their definitions rely on a matrix of covariates and require several matrix multiplications and inversions.Importantly here, we can write analytically all the formulas and bypass the need for these matrix operations.
Solo.cp notably differs from approaches based on generalized likelihood ratio, such as the CUSUM statistics.While CUSUMbased methods fit two constant means model (to the left and to the right of j), solo.cpseeks for evidence for a jump in mean at time j after having accounted for all possible variations from 1 to j -1 and j + 1 to T. This is done through the random vector f −j , which we integrate out when computing the marginal likelihood.This difference is an essential feature of solo.cp and enables the detection of multiple change points.The idea is that a change in mean at time η i will be marginalized out before getting to time η i+1 , as long as η i+1 and η i are not "too close".Such minimum spacing condition is standard in the literature.
In Section 4, we will show empirically that solo.cp is more robust to misspecification of the noise model and outliers than competitors.We conjecture that this property follows from the feature described above and the shrinkage implicit in Bayesian estimators."Outliers" are shrunk toward multiple local means: An observation at time j contributes to estimating (or marginalizing) all increments from f 1 to f j .A potential drawback of this construction is that changes of very small magnitude are integrated out.Both hypotheses will be tested numerically.
Implementation.Regardless of whether we compute pr(Z i = 1|Y, σ 2 ) through basad.cpor solo.cp,we propose the use of a post-processing step to avoid the detection of consecutive change points.This involves a rule that defines when two or more estimates will be considered as "consecutive", and a selection rule to determine which estimates to keep.
In detail, let C 0 := { η 1 , . . ., η K 0 } be the set of points such that pr(Z i = 1|Y, σ 2 ) > 0.5.Now, fix ∈ N, and partition C 0 into Hence, the partition defines the notion of "consecutive change points".Finally, within each subset C 0 i , choose the point The estimated set of change points is A few remarks.First, the length of the partition determines the number of estimated change points K. Second, we need an extra parameter to define the partition of C 0 .The sensitivity of the two methods to is studied in the supplementary material (SM).Lastly, we pick the time instance having the maximum marginal posterior probability within each subset and classify it as the change point.Whereas this is an arbitrary criterion, choosing the point that maximizes a given test statistics is stan-Algorithm 1 Spike-and-slab change point detection with the Gibbs sampler defined in Narisetty and He (2014).
Both basad.cpand solo.cpcan be easily extended to unknown σ .This instance is considered both by Narisetty and He (2014) and Chen and Walker (2019): one simply includes an inverse gamma prior in (3) and (4).solo.cpremains analytically tractable: the difference is that ( 8) is a mixture of Student's tdistributions, whose parameters can be written in closed form.In this article, we do not study solo.cpand basad.cpwith inversegamma priors, but with the fused LASSO residuals standard deviation computed through the "one standard error" rule.
In this section, we considered a mean-process starting at zero.In the case f 0 = 0, we can treat it as an unknown and assigning to it a Gaussian prior with mean μ 0 and variance τ 2 0 .These hyperparameters will generally be different to the prior on the increments ( f i ) i≥1 because f 0 is not an increment but a baseline mean.The choice of μ 0 and τ 2 0 can be based on prior information and how informative we want the prior to be.To prove the theory, we will assume that f 0 = 0 holds and use the same prior in numerical experiments.If the assumption is violated, then pr(Z 1 = 1|Y, σ 2 ) → 1 with high probability.
So far, we considered a setting where the number of observations collected at a given point t (n t ) is equal to one (case n t = 1), which is the standard in the literature.In applications, this may not be the case (case n t > 1).This situation could arise if multiple observations are collected at once, or if observations are collected at distinct time points, but the reported data are binned into time intervals.To our knowledge, there are few methods in the literature dealing with this situation (Padilla et al. 2019b).The extension of basad.cpand solo.cp to the case n t > 1 is straightforward.Parameters ( 6), (7), and ( 9) can be written in closed form, including an adjustment done through n t for all t.The explicit formulas are provided in the SM, where we also show numerically that the case n t > 1 is particularly beneficial in terms of empirical performance for the two methodologies discussed in this section.

Theory
In this section, we provide some theoretical support for the method that we study in this article.We show that for the task of multiple change point detection, a modified version of the estimator described in (3) (basad.cp)based on the spike-andslab variable selection framework of Narisetty and He (2014) leads to optimal localization rates of the change points.Specifically, for the case of a bounded number of change points, under a slightly weaker signal-to-noise condition than the wild binary segmentation and 0 based methods, we attain optimal localization rates.We also show in the SM that this optimality is preserved in the single change point detection framework if we consider a version of the solo.cpestimator.All the proofs can be found in the SM.
The goal is to define an estimator C ⊂ {1, . . ., T} of C * .We do this by first defining an estimator C relying on the indexes of the partition 1, . . ., m, then we use C to construct our actual estimator.First, let Z ∈ {0, 1} m be such that and C = {j : Z j = 1}.The set C is then used to construct C ⊂ {1, . . ., T} as follows: • If i ∈ C then there exists a j ∈ C with i ∈ j .
• If j ∈ C then for a unique i ∈ C we have that i ∈ j .
The construction of C is used to map the estimates conditioned on the transformed data to the actual time indices we are trying to infer.Our results show that C defined by the modified estimator based on Y and C attains optimal localization rates for estimating C * .This result exploits Theorem 4.1 in Narisetty and He (2014) which provides a consistency result for linear model estimation with the shrinking and diffusing prior.Before arriving at this result, we now make the following modeling assumption.
Assumption 1.Let κ be the minimum jump size, thus, Then we require that κ 2 T/(σ 2 log T) → ∞, as T → ∞.Furthermore, we impose the following minimum spacing condition for a large enough c 1 > 0, and require that .
Assumption 1 can be thought as a signal-to-noise-ratio condition.In fact, Assumption 1 is a weaker condition than Assumption 2 from Wang, Yu, and Rinaldo (2020) which states that for positive constants c and ξ .However, the framework in Wang, Yu, and Rinaldo (2020) allows the possibility that K diverges whereas here we require that K = O(1).We are now ready to state the main result of this section.
Notably, Theorem 1 shows that the maximum a posteriori estimator constructed based on the model ( 11) attains a localization rate of order log T/κ 2 .As Wang, Yu, and Rinaldo (2020) showed, this localization rate is minimax optimal up to a logarithm factor.Importantly, our guarantee on the localization rate holds under the minimum signal-to-noise ratio condition possible; see Lemmas 1-2 in Wang, Yu, and Rinaldo (2020).Also, notice that provided that log T/(κ 2 ) → 0, then in the language of change point detection, the estimator C is consistent (Wang, Yu, and Rinaldo 2020).
We highlight that Theorem 1 requires σ 2 to be known.However, this is not restrictive.In fact, if σ 2 is misspecified by some σ 2 = σ 2 , as in Remark 2 in Narisetty and He (2014), the conclusion of Theorem 1 will still hold provided that c 1 σ 2 ≤ σ 2 ≤ c 2 σ 2 for some positive constants c 1 and c 2 .The latter can easily be attained in practice provided that σ 2 is a consistent estimator of σ 2 .A simple consistent estimator can be found by splitting the data in two sets, one consisting of odd t and the other of even t.We mentioned in the previous section a few possible estimators.Finally, we can use the estimate as input to our method along with the other half of the data.
We stress that in practice we use the fast method described in the previous section based on a misspecification of (11).In the SM, we show that such surrogate procedure still enjoys a localization guarantee in the case of single change point detection.
One another note, we remark that the estimator studied in this section required specifying the parameter m that describes the number of elements in the partition used to construct Y.The intuitive reason for this construction is that in change point detection, unlike variable selection, no estimator can detect the exact locations of the change points with high probability, see Wang, Yu, and Rinaldo (2020).Hence, by using averages over small intervals and treating the resulting signal as the input to the basad.cp,we can expect to perform variable selection exactly with high probability and then translate this into optimal change point localization.
Finally, if the number of observations collected at a given point t is n t and n t n for all t, then it is possible to construct a variant of C that attains a localization rate of order log T/(κ 2 n).This agrees with the intuition that having multiple observations per time t can help with the estimation of the change points.

Simulations
We rely on simulations to explore the ability of the solo.cpand basad.cpestimators to accurately estimate K and change point locations η 1 , . . ., η K .We consider realistic scenarios designed to capture the variability encountered in applications, varying the conditional mean f t and the distribution of the error terms ( t ) 1:T .We compare basad.cpand solo.cp with several state-ofthe-art methods: wbs (Fryzlewicz 2014), ebpiece (Liu, Martin, and Shen 2017), smuce (Frick, Munk, and Sieling 2014), and pelt (Killick, Fearnhead, and Eckley 2012).Details of each implementation are given in the SM.All code to reproduce the results in this section is available at https://github.com/lorenzocapp/solocp_experiments.The methodology is available as an R package available for download at https://github.com/lorenzocapp/solocp.
Our empirical comparisons assess the accuracy of the different estimators with the following criteria.We consider the statistic K − K to measure how well each estimator recovers the true number of change points.We consider an order- |x − η|, and report the proportion of points that are at distance zero, one, two, and equal or greater than three.We refer to this criterion as the normalized empirical distribution and denote it by | η − η|/K.It is a finer measure than the Hausdorff distance of the change point location estimation accuracy.Since this criterion is also insensitive to overestimation, we include the reciprocal |η − η|/ K.The unnormalized version of this criterion is also considered by Fryzlewicz (2014).
Here, we focus on the setting where one observation is available per time point (n t = 1).Experiments for n t > 1 can be found in the SM.We consider three test signals and four error distributions.The first signal f t is called BLOCKS (K = 11, T = 2048), a standard benchmark for change point detection procedures (e.g., used by Fryzlewicz 2014), the second test signal is called TEETH (K = 4, T = 140), the third signal is called FMS (K = 6, T = 497), used Fryzlewicz (2014) and Fearnhead and Rigaill (2018) is characterized by some very narrow spacing and varying intensity.We consider four distributions for the error terms: Gaussian, Laplace, Student's t, and a mixture of Gaussians (to mimic the presence of outliers one of the two components has a larger variance).Change point locations of the test signals and the parameters of the error terms are fully specified in the SM.For each combination (f t , t ), we sample 100 datasets and report the average value for each criterion considered.Figure 1 plots examples of datasets sampled for each scenario, along with the true test signals in red.We consider the solo.cpalgorithm with τ 2 0,T = T −1 , τ 2 1,T = T.For the basad.cpalgorithm, we employ the default choices of the parameters given by Narisetty and He (2014): τ 2 0,T = σ 2 (10T) −1 , τ 2 1,T = σ 2 log T. For both solo.cp and basad.cp,we use q = 0.1, and = 2, that is, we are preventing the algorithm from detecting consecutive change points.Pelt also has the minimum spacing parameter , which we set equal to two as well.The parameters τ 2 0,T , τ 2 1,T are set following the results of Section 3. The theory in this article does not provide guidance on the choice of and q.We study the robustness of the solo.cpalgorithm to these parameters' choices in the SM.
The procedures basad.cp,solo.cp, and ebpiece have σ as an input.Here, we computed it from the residuals of the fused LASSO (Tibshirani et al. 2005; implemented in the genlasso R package available on CRAN).We input the same σ in pelt and smuce for a fair comparison.In the BLOCKS signal datasets, we initialize the ebpiece MCMC from the estimates of the fused LASSO ("one standard deviation rule"), otherwise it is not possible to achieve convergence in a reasonable time.This can be seen by the very poor performance of the method which can be due to the fact that the chains "get stucked" into local modes.
Tables 1 summarizes | 2 for the TEETH test signals, and Table 3 for the FMS test signals.The basad.cp method is not included in Tables 1 and 3 because it was not computationally feasible to approximate the posterior distributions with MCMC in these datasets.
The procedures wbs, smuce, and pelt achieve the best overall performance according to |η− η|/ K, suggesting that they recover very well C * ; their performance under Gaussian noise scenarios is excellent.The criteria |η − η|/ K, d( C, C), and K − K suggest that wbs and smuce tend to overestimate the number of change points.The problem is extremely severe in the presence of outliers and with Student's t-distributed errors.pelt is the fastest method and the best performing one in the FMS scenario.
ebpiece recovers well K in the BLOCKS and TEETH scenarios but it does not seem to recover well the exact locations of the change points (| η − η|/K and |η − η|/ K).It underestimates the number of change points in the FMS scenario.In the BLOCKS signals, the MCMC chain converges only when the fused LASSO is used as initialization.Essentially, we are starting the chain where we would want the posterior samples to concentrate.If we use the default initialization of the chain used in (Liu, Martin, and Shen 2017), the MCMC chain fails to converge within 30 min (computing time between brackets in Table 1).(Liu, Martin, and Shen 2017), smuce (Frick, Munk, and Sieling 2014), wbs (Fryzlewicz 2014), and pelt (Killick, Fearnhead, and Eckley 2012).For | η − η|/K and |η − η|/ K the higher the number in the zero column the better.Conversely, for d( C, C * ) the lower the better.For K − K, the closer to the zero the better.We report in bold the methods with best empirical performance and those within 10% of the best.The method basad.cp is not included since it required a computing time longer than 2 hr.We include between brackets the computing time for ebpiece if initialized with the in-built procedure (see text for a discussion).Average time is reported in seconds.
Table 2. Hausdorff distance, empirical distributions, and estimation bias in K of the procedures considered for the TEETH test signals.Average statistics computed over 100 simulations for solo.cp,basad.cp,ebpiece (Liu, Martin, and Shen 2017), smuce (Frick, Munk, and Sieling 2014), wbs (Fryzlewicz 2014), and pelt (Killick, Fearnhead, and Eckley 2012).For | η −η|/K and |η − η|/ K the higher the number in the zero column the better.Conversely, for d( C, C * ) the lower the better.For K − K, the closer to the zero the better.We report in bold the methods with best empirical performance and those within 10% of the best.Average time is reported in seconds.
solo.cp performs well in all scenarios, being consistently the best Bayesian method.It is not as accurate as the frequentist methods in recovering the exact location of the change points (| η − η|/K).In particular in the BLOCKS and FMS scenarios, the reason seems to be that solo.cpunderestimates K. On the other hand, it is the fastest Bayesian method and the algorithm is extremely robust to the misspecification of the error terms, being consistently among the best in terms of |η − η|/ K and d( C, C * ).
In these TEETH scenarios, computation times are comparable to the state-of-the-art frequentist methods.The results of solo.cp are very robust to the choices of q and ; see SM.
Like ebpiece, solo.cpunderestimates the number of change points in the FMS scenario across error distributions.In the FMS scenario, the high values of d( C, C * ) and K − K are due to the fact that both ebpiece and solo.cpdo not detect the first change point, the one with the smallest magnitude.This effect may be linked to  (Liu, Martin, and Shen 2017), smuce (Frick, Munk, and Sieling 2014), wbs (Fryzlewicz 2014), and pelt (Killick, Fearnhead, and Eckley 2012).For | η − η|/K and |η − η|/ K the higher the number in the zero column the better.Conversely, for d( C, C * ) the lower the better.For K − K, the closer to the zero the better.We report in bold the methods with best empirical performance and those within 10% of the best.Average time is reported in seconds.
the better robustness to outliers of the two methodologies.Very small effects are shared among multiple random means, which are integrated out, hence, even more challenging to detect.In this scenario, solo.cpoutperforms ebpiece and its performance is comparable to smuce in terms of | η − η|/K.basad.cpachieves a performance comparable to solo.cp in the TEETH scenarios.This is expected given that both methods are based on the shrinking and diffusing priors of Narisetty and He (2014).A similar performance is achieved at a much higher computational cost.
In Section 2, we attributed the robustness of solo.cp and basad.cp to their peculiar construction that involves multiple random increments and shrinkage estimators.A competing hypothesis is that the spike prior on a candidate change point is more flexible than a sharp l 0 penalty, allowing for small fluctuations in the signal.This would not be desirable because, under the default spike prior hyperparameter τ 2 0,T = T −1 , the spike prior would degenerate into a point mass at 0, that is, allowing for no fluctuations.We investigate this hypothesis repeating the analysis for the BLOCKS signals for a range of prior hyperparameters τ 2 0,T (T −1 , T −2 , T −3 , T −4 ) (Table S4, supplementary material).Results are robust to a varying τ 2 0,T , essentially ruling out this hypothesis.Another hypothesis is that the robustness is due to the postprocessing (steps 3 and 4 in Algorithm 1).We similarly repeat the BLOCKS signals simulation study for = 1 (Table S2 SM).We observe a small increase in the number of false positives, leaving solo.cp as the most accurate method in the misspecified scenarios.We stress that the effect of the post processing is minimal: the results presented in the section were obtained with = 2, which simply prevents consecutive change points from being detected.Most of the methodologies included here have such a minimum spacing constraint.

Applications
We analyze two real datasets (aCGH and Ion channel data) to compare solo.cp to alternative methodologies built under the same assumptions (Gaussian iid noise, piecewise constant mean).In particular, we compare solo.cp to wbs.We will acknowledge the limitations of our analyses and possible solutions at the end of each subsection.

Array Comparative Genomic Hybridization (aCGH) Data
Genomic alternations happen in the development of tumors.Studying these alternations, for example determining the copynumber variations, is important for understanding cancer and also used for its diagnosis.Array Comparative Genomic Hybridization (aCGH) is a popular method that generates this type of data (Schena et al. 1995).We analyze an aCGH dataset of individuals with a bladder tumor collected by Stransky et al. (2006).The dataset is publicly available in the R package ecp (James and Matteson 2014), and includes 43 individuals and 2215 locations.The goal of the analysis is to detect changes in the copy-number.The underlying assumption is that alternations are constant within a segment.Segments involved in the tumor should be equally affected across patients.While we could repeat the analysis for all the patients, we include only two in this manuscript for parsimony.The number of samples is approximately identical to the BLOCKS test signal, hence we use the same parameters for solo.cp(τ 2 0,T = T −1 , τ 2 1,T = T, τ 2 T = T −1/2 , q = 0.1, = 5) and σ 2 equal to the variance of the residuals of the fused LASSO (tuning parameter chosen by one-standard-error rule).We compare the results of solo.cp with wbs (default implementation).The left column of  Both methods seem to recover more change points than the number of blocks identified through a visual inspection of the data.There are a few points where the change point corresponds to a single observation, not an entire segment along the genome.We would need further research to determine if these points can be classified as outliers.However, solo.cp appears more parsimonious: K = 19 for Patient 3 and K = 13 for Patient 7, while wbs estimates K = 49 for Patient 3 and K = 36 for Patient 7. The results are consistent with what we observed in the simulation section.We also note that both methods ignore autocorrelation.Sequences may exhibit some serial dependence because nearby genomic segments are known to be associated.The autocorrelation in the noise process could lead to overestimation.Recently, methods robust to this serial dependence has been proposed (Cho and Fryzlewicz 2020;Romano et al. 2021).The method of Cho and Fryzlewicz (2020) returns K = 14 for Patient 3 and K = 2 for Patient 7.

Ion Channels Data
Ion channels are a class of proteins expressed by all cells that create pathways for ions (charged particles) to pass through the otherwise impermeable cell membrane.The opening of these pathways is essential for cell operations in the nervous system, in the muscles, and in the pancreas.The study of ion channels plays a fundamental role in the development of new drugs (Alexander, Mathie, and Peters 2008).The patch clamp technique is an electrophysiological tool for understanding ion channel behavior.It is used to measure ionic currents from single living cells or tissues (Neher and Sakmann 1995).Electrophysiologists use glass microelectrodes to gain access to cells expressing ion channels.Through the microelectrode, a voltage is applied, forming a voltage clamp, and the current passing across the cell membrane through the ion channels is measured.
We consider a dataset produced by the Steinem Lab (Institute of Organic and Biomolecular Chemistry, University of Göttingen), recently analyzed by Vanegas, Behr, and Munk (2022), measuring a single ion channel of the bacterial porin PorB, a bacterium that plays a role in the pathogenicity of Neisseria gonorrhoeae.The experiment design includes a technique that induces local dependencies of the error terms (Pein, Sieling, and Munk 2017).To remove these dependencies, we follow the same approach of Vanegas, Behr, and Munk (2022), subsampling every 11th observation.The original dataset includes 600,000 time instances.We analyze a portion of the dataset of length 32,511.After subsampling, the dataset is composed of 2956 time points.Figure 3 depicts the dataset.
Figure 3 suggests that the noise variance when the channels are open is much higher than when they are closed.This feature of ion channel data is known as open channel noise (Neher and Sakmann 1995).The methods studied in this article, and considered in Section 4, do not assume error heterogeneity.The first column of Figure 3 depicts a naive application of solo.cp ) and wbs where we estimate a single global variance ignoring variance heterogeneity.This leads to overestimation on the change points.
The second column depicts an attempt to account for the heteroscedasticity due to the open channel noise phenomena.We estimate the sample variance when the ion channels are open (we approximate it considering the observations above 0.2).We input to both methods this sample variance and reestimate the change points (for solo.cpsame parameters as before; for wbs, we set θ 0 set to 3).Now, K is 12 for both methods and the locations of the change points seem reasonable by visual inspection.A few isolated points are not detected as change points (approximately around 1100 and 2900).We note though that the result of wbs largely depends on other tuning parameters (e.g θ 0 ), while the estimates of solo.cp are very robust to the choices of all the parameters that are not σ 2 .
There are methods in the literature that are robust to heterogeneous variance (Killick and Eckley 2014; Pein, Sieling, and Munk 2017).These can be directly deployed and do not require the extra step of understanding how to estimate the input variance.An application of HSMUCE (Pein, Sieling, and Munk 2017) to this dataset gives K equal to 25.Further study and domain knowledge is required to determine whether extra change points are justified.

Discussion
We studied spike-and-slab priors for change point detection leveraging recent results in the variable selection literature.We chose to work with a prior having both the spike-and-slab component defined by Gaussian distributions and sample sizedependent hyperparameters.We established that an estimator based on this prior distribution is consistent and achieves optimal localization rates of multiple change points.Furthermore, the use of this prior allowed us to propose a fast Bayesian change point estimator based on a slightly misspecified model.A version of the fast estimator achieves the optimal rate in the single change point problem.In simulations, its empirical accuracy is comparable to state-of-the-art benchmarks.Its salient features are being one of the fastest Bayesian methods available (no MCMC required) and being robust to misspecification of the error model in numerical experiments.The latter property is somewhat surprising, given that we only considered Gaussian noise in the method development and in the theory.We showed these features in simulation studies, displaying situations where our estimator performs well while many competing methods severely overestimate the number of change points.
There is a rich literature on change point detection for settings more general than the one considered in this article.Nevertheless, our results are promising and suggest that it is worth investigating the use of spike-and-slab priors in change point detection for more general settings, such as settings with unknown variance, heterogeneous errors, and different types of dependence.
The first area of future work is to further improve the computational performance of the solo.cpalgorithm.The main bottleneck of the algorithm is the computation of the parameters in (9).The computing time of solo.cp is comparable to those of frequentist estimators for small sample sizes up to thousands of observations.A possible strategy to tackle the large T scenario is to design a more efficient (possibly approximate) version of solo.cp that provides a preliminary data partition, then apply the regular algorithm to the different chunks.For larger sample sizes, we expect numerical and memory issues to arise, such as floating point collapse and lack of memory to compute all quantities necessary for the procedure.
The second area of research is the detection of higher-order changes, such as in piecewise-linear signals.A version of the solo.cpalgorithm for piecewise-linear change point detection is readily available (as well as higher-order changes).However, our preliminary results suggest that a vanilla version of this estimator does not work well in this setting.Liu, Martin, and Shen (2017) suggest that a possible explanation is that one cannot fix the prior means at zero in this setting.
invariant Hausdorff metric d( C, C * ) = d( C|C * )+d(C * | C), where d( C|C * ) = max η∈C * min x∈ C |x − η| and d(C * | C) = max η∈ C min x∈C * |x − η| are respectively the one-sided Hausdorff distances.We use d( C, C * ) to assess the overall accuracy of the estimators in recovering the true change points locations η 1 , . . ., η K .We employ d( C, C * ) in lieu of d( C|C * ), being the latter insensitive to overestimation.Lastly, for all η ∈ C * we calculate min x∈ C

Figure 1 .
Figure 1.Examples of datasets for the 12 scenarios considered and true test signals.First row panels depict sample datasets along with the BLOCKS test signal (red), second row panels sample datasets along with the TEETH test signal (red), third row panels sample datasets along with the FMS test signal (red).Columns depict different noise models.
η−η|/K, |η− η|/ K, K− K, K, d( C, C * ), and the mean computing time for the four scenarios considered for the BLOCKS test signals, Table

Figure 2 .
Figure 2. aCGH data.Copy number variations of patients having a bladder tumor recorded at n = 2215 sites by Stransky et al. (2006).First row depicts the observations for Patient 3, second row depicts the observations for Patient 7. First column includes the change points estimated by solo.cp(red), second column the ones estimated by wbs (blue).

Figure 2
Figure 2 depicts the estimates of the solo.cpchange points, the right column depicts the one obtained with wbs.The two rows refer to the two different patients.Both methods seem to recover more change points than the number of blocks identified through a visual inspection of the data.There are a few points where the change point corresponds to a single observation, not an entire segment along the genome.We would need further research to determine if these points can be classified as outliers.However, solo.cp appears more parsimonious: K = 19 for Patient 3 and K = 13 for Patient 7, while wbs estimates K = 49 for Patient 3 and K = 36 for Patient 7. The results are consistent with what we observed in the simulation section.We also note that both methods ignore autocorrelation.Sequences may exhibit some serial dependence because nearby genomic segments are known to be associated.The autocorrelation in the noise process could lead to overestimation.Recently, methods robust to this serial dependence has been proposed(Cho and Fryzlewicz 2020;Romano et al. 2021).The method ofCho and Fryzlewicz (2020) returns K = 14 for Patient 3 and K = 2 for Patient 7.

Figure 3 .
Figure 3. Ion data.Ion channel data recorded at the Steinem lab (Insitute of Organic and Biomolecular Chemistry, University of Göttingen) at n = 2956 time instances.First row depicts the change points estimated through solo.cp,second row depicts the change points estimated through wbs.Estimates in the first column are obtained using "default" estimates of the sample standard deviation, which means the standard deviation of the residuals obtained from the fused LASSO (one standard error rule) for solo.cp, and the median absolute deviation estimates for wbs.Estimates in the second column are obtained using the sample standard deviation of observations taking values larger than 0.2 (to approximate when ion channels are open).

Table 1 .
Hausdorff distance, empirical distributions, and estimation bias in K of the procedures considered for the BLOCKS test signals.

Table 3 .
Hausdorff distance, empirical distributions, and estimation bias in K of the procedures considered for the FMS test signals.