On optimal segmentation and parameter tuning for multiple change-point detection and inference

Change-point analysis is the task of finding abrupt (and significant) changes in the underlying model of a signal or time series. Change-point detection methods typically involve specifying the maximum number of segments to search for and the minimum segment length. However, there is no objective way to pre-specify these two parameters, and their choice mostly depends upon the particular application. Within this framework, a recursive optimization algorithm is developed that is capable of exploring and fine-tuning these two input parameters and of optimally segmenting a time series. This multiple change-point detection technique therefore addresses a wide class of real-life contexts and problems where the identification of optimal level shifts in a time series is the main goal. Extensive simulation results are presented, and a real-life example is given to illustrate the implementation of the developed scheme in practice and to unfold its capabilities. Concluding remarks and suggestions for future research are also provided.


Introduction
Change-point analysis has played an important role in modern time series study. One type of change one may seek is a change-point: a moment in time at which there is an abrupt change in one or more of the properties or distributional characteristics of the time series, such as the mean, the median, or the standard deviation (see, for example, Pettitt [1], Aue and Horváth [2]). Change-points can also occur in variances/volatilities (see, for example, Inclan and Tiao [3], Gao et al. [4]), in the series' correlation structure (see, for example, Davis et al. [5]), or even in the marginal distribution of the series (see, for example, Gallagher et al. [6]). Detection of change-points aids the modeling and prediction of time series and is found in applications across many scientific fields. Change-point detection was first studied in the context of product quality in manufacturing operations (see, for example, Dudding and Jennett [7]); however, over the last two decades it has spread to nonmanufacturing industries and found many applications in areas such as healthcare and medical condition monitoring, infectious disease surveillance, environmental monitoring, meteorology, climatology, hydrology, speech recognition, image analysis, human activity analysis, bioinformatics, neuroscience, signal processing, social sciences and econometrics, among others.
It is also worthwhile to note that, as in other areas of statistics, there are parametric and nonparametric methods for change-point detection. A number of classical parametric methods have been developed; however, in practice there is often not enough information on the type of distribution to make an informed choice for the distribution family and subsequently perform a parametric change-point detection analysis (see, for example, Gurevich and Vexler [37]). Recently, the change-point model framework has been developed for change-point detection in situations where the available information about the distributions is very limited. Note that previous studies have analysed Pettitt's nonparametric method in terms of its ability to detect the correct time of change for different distributions (see, for example, Xie et al. [38]). Ross [39], among others, studied several implementations of the change-point model framework (incorporating different test statistics, using both parametric and nonparametric techniques) for detecting changes in the distribution of discrete-time sequences of univariate random variables. In particular, statistics were implemented for detecting changes either in a batch of data containing at most one change-point or in a sequential context with data which may contain multiple change-points. Moreover, note here that most existing change-point methods are relevant to Phase I Statistical Process Control problems, since the offline change-point detection problem is essentially similar to the Phase I control problem: in both cases the sample size is fixed, and one wishes to detect any changes in the underlying process distribution.
In this section, our goal is not to present the vast, fast-growing change-point detection literature in an exhaustive fashion, but rather to present the main ideas that shaped the field and to relate them to the developed methodology (focusing on individual time series data), since recent advances have led to more and more instances where individual measurements are collected over time. Toward this end, this paper addresses the problem of 'a posteriori' optimal multiple change-point detection and inference, performing a global segmentation. Change-point detection methods typically involve specifying the maximum number of segments to search for and the minimum segment length; however, there is no objective way to pre-specify these two parameters. In this context, a recursive optimization algorithm is developed for exploring and fine-tuning these two input parameters and for optimally segmenting a time series. Within this framework, an empirically reasonable compromise is provided for pre-specifying the aforementioned input parameters. This issue is of great importance since, in practice, up to now, pre-specifying these two parameters (the initialization step of a typical change-point detection algorithm) has been 'somewhat arbitrary', lacking statistical justification, and mostly depends upon the particular application. In this sense, this paper, through optimal exploration and fine-tuning of the input parameters, fills a gap in the relevant literature, whereas the developed scheme also provides optimal time series segmentation. The empirical and simulation-based findings indicate that the proposed scheme is a viable alternative to widely used multiple change-point detection algorithms already existing in the literature. The proposed scheme therefore addresses a wide class of real-life contexts and problems where the identification of optimal level shifts in a time series is the main goal.
The rest of the paper is organized as follows. In Section 2, the statistical framework is introduced and technical details are provided. In Section 3, an extensive simulation study is performed for specifying the tuning parameters (i.e. maximum number of segments and minimum segment length). Two empirical rules are also provided. In Section 4, a comparative study is performed for benchmarking purposes. A simulation-based performance comparison between the proposed scheme and widely used multiple change-point detection algorithms is conducted. Further, an application to real epidemiological data is presented. Within the latter context, the developed technique is compared to its competitors, as regards its ability to detect successfully the signalled start and end weeks of epidemics. Finally, in Section 5, some concluding remarks and suggestions for future research are given.

Research methodology
The a posteriori (or off-line) change-point problem arises when the series of observations is complete at the time of processing [40]. The interested reader may refer to Tartakovsky et al. [41] for related work on online change-point detection. We consider only the a posteriori problem in this paper; hence, our goal consists in recovering the configuration of change-points using the whole observed series. In this paper, we deal with piecewise stationarity, arguably the simplest type of deviation from stationarity. This implies a time-varying process whose parameters evolve through time but remain constant over specific periods of time. The complex problem of detecting multiple change-points in a time series is described explicitly below.

The Segmentation model under a Gaussian framework
In a segmentation framework, two types of changes can be considered: changes in both the mean and the variance of the signal, or changes in the mean only. Let us define a model where the mean is segment-specific and the variance is common across segments. Suppose x_t are realizations of independent random variables {X_t}, t = 1, ..., m, with Gaussian distributions N(μ_t, σ²). Let us assume that this stream of process data fluctuates around some underlying signal. Suppose now that the underlying signal is piecewise constant (that is, properties of the underlying phenomenon do not change within each segment), that there exist K − 1 changes (discontinuity instants) affecting the mean parameter of the distribution of the Xs at unknown coordinates (t_1, t_2, ..., t_{K−1}), with the convention t_0 = 1 and t_K = m, and that the parameters of the Xs' distributions are constant between two changes, μ_k being the mean of the k-th segment. The change-points (t_1, t_2, ..., t_{K−1}) define a partition of the individual observations into K segments, denoted as

I_1 = {t_0, ..., t_1} and I_k = {t_{k−1} + 1, ..., t_k} for k = 2, ..., K.

Due to the independence assumption, the log-likelihood can be decomposed into a sum of 'local' likelihoods, calculated on each segment, as

L(t_1, ..., t_{K−1}; μ_1, ..., μ_K, σ²) = Σ_{k=1}^{K} Σ_{t∈I_k} log φ(x_t; μ_k, σ²),

where φ(·; μ, σ²) denotes the N(μ, σ²) density. Note that, because the data are assumed to be independent and Gaussian, the log-likelihood is a sum over all the segments (additive contrast), and maximum likelihood estimation is equivalent to least-squares fitting. Given the number of segments K and the segments' coordinates (t_0, t_1, t_2, ..., t_{K−1}, t_K), the mean of each segment is estimated by maximum likelihood as

μ̂_k = (1/n_k) Σ_{t∈I_k} x_t, where n_k = |I_k| is the length of the k-th segment,

and, since the variance of the segments is homogeneous, its estimator is given by

σ̂² = (1/m) Σ_{k=1}^{K} Σ_{t∈I_k} (x_t − μ̂_k)².

For a given sequence of change-points, the segment coordinates are known and the estimation of the mean of each segment is straightforward. The key problem, then, is to estimate K and the positions of the change-points.
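To make the estimators above concrete, here is a minimal Python sketch (the paper's own programs are written in R; the function name `segment_estimates` and the 0-based indexing convention are our own assumptions) that computes the per-segment mean estimates μ̂_k and the pooled variance estimate σ̂² for a given set of change-points:

```python
import numpy as np

def segment_estimates(x, cps):
    """Given data x and interior change-points cps = [t_1, ..., t_{K-1}]
    (each t_k is the 0-based index of the last observation of segment k),
    return the ML per-segment means and the common variance estimate."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    bounds = [0] + [t + 1 for t in cps] + [m]   # segment start indices
    means, rss = [], 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        mu = x[a:b].mean()                       # ML estimate of the segment mean
        means.append(mu)
        rss += ((x[a:b] - mu) ** 2).sum()        # segment contribution to the RSS
    sigma2 = rss / m                             # pooled ML variance estimate
    return means, sigma2
```

For instance, with x = (1, 3, 5, 7) and a single change-point after the second observation, the two segment means are 2 and 6 and the pooled variance estimate is 1.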
We will further proceed in two steps, as follows. Step 1: estimate the t_k's, that is, find the best partition of a set of m individuals into K segments, assuming that the number of segments is known. Step 2: estimate the number of segments by adopting the elbow curve method.
When the number of segments K is known, the problem is to find the best partition of {1, . . . , m} into K segments, according to the likelihood, where m is the sample size.
However, in this context, an exhaustive search becomes unfeasible for large m and K, since there are (m−1 choose K−1) possible choices for the positions of the change-points (t_1, t_2, ..., t_{K−1}). An efficient dynamic programming approach to compute an exact solution for the multiple change-point problem was presented by Auger and Lawrence [42], with O(Qm²) time complexity for a given upper bound Q on the number of change-points. Improvements to this algorithm were made by requiring the cost function to be additive, i.e. C(x_{a:b}) = C(x_{a:τ−1}) + C(x_{τ:b}), where C(·) is a cost function for a segment. Using this assumption, Jackson et al. [43] presented a dynamic programming algorithm with O(m²) complexity. Killick et al. [16] subsequently extended this method with a pruning step on potential change-point locations, which reduces the time complexity to approximately linear under certain conditions on the data-generating process and the distribution of change-points. A different pruning approach was proposed by Rigaill [44] using further assumptions on the cost function; this was extended by Maidstone et al. [45], achieving a significant reduction in computation time (O(m log m)), and made robust against outliers by Fearnhead and Rigaill [46].
In this work, we also use a dynamic programming approach to reduce the computational load. Programs are coded in the R statistical environment and are available in Supplementary File 1. Under the assumption of additivity for the cost function, we adopt a recursive optimization (RO) scheme allowing us to solve the multiple change-point detection problem with a time complexity of O(m²). The recursions involved in the RO algorithm are described below.

Step 1
Let us assume that K is known; we then have to minimize

J_K(1, m) = Σ_{k=1}^{K} Σ_{t∈I_k} (x_t − μ̂_k)².

The quantity Σ_{t∈I_k} (x_t − μ̂_k)² can be viewed as the 'cost' of segment I_k, in other words the cost of putting data x_{t_{k−1}+1} to x_{t_k} in a single segment. In these settings, J_K denotes the cost matrix and the global optimum is given by J_K(1, m). This optimization problem is actually a shortest-path problem that can be solved by dynamic programming based on Bellman's [47] optimality principle, 'any optimal path is composed of sub-paths which are themselves optimal', following the least-squares approach for a change in the mean of a series of Lavielle and Moulines [48].
Let L̂_{k+1}(i, j) be the maximum log-likelihood obtained by the best partition of the data {X(i), X(i + 1), ..., X(j)} into k + 1 segments, with k change-points, and let Ĵ_{k+1}(i, j) = −2 L̂_{k+1}(i, j). The RO algorithm is as follows:

Step k: Ĵ_{k+1}(1, j) = min_{k+1 ≤ t ≤ j} [ Ĵ_k(1, t − 1) + Ĵ_1(t, j) ], for k = 1, ..., K_max − 1,

where K_max denotes a given maximum number of segments. Details for (pre)specifying K_max are provided later. Note that this problem of partitioning is analogous to the search for the shortest path from one point to another, where Ĵ_{k+1}(1, m) represents the total length of a (k + 1)-step path connecting the point with coordinate 1 to the point with coordinate m. At this point, it is important to highlight some of the inherent statistical properties of the RO scheme described above. Because of the normality assumption, the maximum likelihood estimator of the change-points coincides with the least-squares estimator, and the cost function values coincide with the residual sum of squares (RSS). In this segmentation model, developed under a Gaussian likelihood-based framework, we assumed that the residual errors are i.i.d. Gaussian variables. Lavielle [40] pointed out that a Gaussian likelihood can be used effectively as a contrast function for detecting changes in the mean (and/or in the variance) of any sequence of random (not necessarily Gaussian) variables. Further, the adopted dynamic programming approach takes advantage of the additivity of the log-likelihood described above, considering that 'a partition of the data into k + 1 segments is a union of a partition into k segments and a set containing 1 segment'. This approach is advantageous for two main reasons: first, it yields an exact solution for the global optimum of the likelihood [42]; second, for a given K, the computational load is reduced from O(m^K) to O(m²), since the algorithm only requires the storage of an upper m × m triangular matrix.
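The recursion above can be sketched in a few lines of Python. This is an illustrative re-implementation under stated assumptions (the paper's code is in R; the function name `ro_segment`, the 0-based indexing, and the optional `l_min` argument anticipating the minimum-segment-length constraint of Section 3 are ours): segment costs are evaluated in O(1) via prefix sums, and a back-pointer table recovers the change-point positions.

```python
import numpy as np

def ro_segment(x, k_max, l_min=1):
    """Dynamic program for the RO recursion: J[k][j] is the minimal RSS
    for fitting x[0:j] with k segments, each of length >= l_min."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums of x
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))   # prefix sums of x^2

    def cost(a, b):
        # RSS of x[a:b] around its own mean, computed in O(1)
        n = b - a
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / n

    INF = float("inf")
    J = np.full((k_max + 1, m + 1), INF)
    back = np.zeros((k_max + 1, m + 1), dtype=int)
    J[0, 0] = 0.0
    for k in range(1, k_max + 1):
        for j in range(k * l_min, m + 1):
            # last segment is x[t:j]; the first t points use k - 1 segments
            cands = [(J[k - 1, t] + cost(t, j), t)
                     for t in range((k - 1) * l_min, j - l_min + 1)]
            J[k, j], back[k, j] = min(cands)

    def changepoints(k):
        # recover the k - 1 interior change-points (0-based last indices)
        cps, j = [], m
        for kk in range(k, 1, -1):
            j = int(back[kk, j])
            cps.append(j - 1)
        return sorted(cps)

    return J[1:, m], changepoints
```

For a toy series with one level shift, the two-segment fit recovers the shift exactly and drives the cost to zero, while the one-segment cost equals the total RSS around the grand mean.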
Note that the recursions presented in the RO algorithm are more efficient than the Segment Neighbourhood method [42], as the location and number of change-points are decided in one pass over the data. This results in an O(m²) method, compared to O(Qm²) if the number of change-points Q were known. At the end of this procedure, the quantities Ĵ_1(1, m), ..., Ĵ_{K_max}(1, m) are stored and will be used in the following step.

Step 2
Let us assume that the likelihood L̂_K measures the adjustment of the segmentation model with K segments to the process data. We therefore aim at determining the optimal dimension K given a stream of process data. A method for selecting the optimal K consists in looking at the L̂_K values obtained with different numbers of segments. A classical and natural graphical method for selecting the dimension K can be summarized as follows. First, we examine how the L̂_K value decreases as K increases; then we select the dimension K for which the L̂_K value ceases to decrease significantly. In other words, this heuristic approach (similar in rationale to the elbow method/elbow curve) looks for the point of maximum curvature in the plot of (K, L̂_K), for all K ∈ [1, K_max]. Thus, to look for where L̂_K ceases to decrease means to locate the break in the slope of this curve. Conclusively, the value of K at which the slope abruptly changes provides an estimate K̂ of the optimal number of segments, and we can then derive the K̂ − 1 change-points, i.e. (t̂_{K̂,1}, ..., t̂_{K̂,K̂−1}).
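One simple numerical proxy for 'the break in the slope' is the discrete second difference of the criterion curve. The Python sketch below (an illustrative heuristic of our own, not the paper's procedure; the paper relies on visual inspection of the plot) picks the K at which the decrease slows down most sharply:

```python
def elbow_k(costs):
    """Heuristic elbow rule: costs[i] is the fitted criterion (e.g. RSS)
    for K = i + 1 segments; return the K where the decrease slows down
    most sharply (largest discrete second difference of the curve)."""
    best_k, best_curv = None, float("-inf")
    for i in range(1, len(costs) - 1):
        # (drop before K) - (drop after K): large when the slope breaks here
        curv = (costs[i - 1] - costs[i]) - (costs[i] - costs[i + 1])
        if curv > best_curv:
            best_k, best_curv = i + 1, curv
    return best_k
```

For a curve that drops steeply up to K = 3 and then flattens, the rule returns 3; automatic rules like this are convenient, but the visual check described above remains the safer default.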
Note here that selecting the number of segments based only on the likelihood criterion would probably lead to over-fitting [49]. Furthermore, the number of parameters to estimate is proportional to the number of segments, and too large a number of segments would lead to a large estimation error. Tuning the input parameters, that is, setting a reasonable upper limit on the size of the segmentation space subject to the constraint of a minimum segment length, is used here in order to achieve a trade-off between a good adjustment and a reasonable number of parameters to estimate. However, despite this trade-off, one still needs to implement the elbow method and visually inspect the optimal number of segments. The interested reader may refer to Picard et al. [49] for an alternative method that automatically locates the break in the slope of the likelihood and estimates the number of segments via penalized likelihood. This procedure estimates a penalty constant and depends on a threshold which is chosen arbitrarily. Nevertheless, despite this thresholding, the procedure remains adaptive, since the penalty constant is estimated from the data, and it is also easy to implement. Further theoretical considerations regarding this adaptive method can be found in Lavielle [50].

Tuning parameters
Change-point detection methods typically involve setting an upper limit on the size of the segmentation space, subject to the constraint that the distance between change-points should be above a minimum length. Since there is no objective way to pre-specify these two parameters, one can fine-tune them by 'trial and error' before arriving at the chosen values. Besides, the choice of the gap between two successive change-points mostly depends upon the particular application. The presented RO scheme likewise depends upon the pre-specification of a maximum number of segments to search for, subject to the constraint of a minimum segment length. In this section, we summarize the results of an extensive simulation study conducted for specifying the minimum segment length and the maximum number of segments, and we provide two practical rules as a reasonable compromise for both tuning parameters.

Specification of the minimum segment length
The minimum segment length (denoted L_min) should be chosen by balancing the importance of detecting step (and even transient) shifts against the cost of investigating a larger number of partitions. It is therefore necessary to constrain the minimum length between successive change-points when facing a real data sequence. For this purpose, a minimum length between two successive change-points is imposed: instead of minimizing over all possible partitions, only the partitions such that t_r − t_{r−1} ≥ L_min, r = 1, ..., K, are considered. L_min is a positive integer, and the (default) minimum segment length allowed by theory is a single observation for a change in mean, whereas for a change in variance (with unknown mean) the minimum segment length is two observations [16]. However, allowing the minimum segment length to be a single observation, t_r − t_{r−1} ≥ 1 (as in our case of detecting changes in the mean), is likely to yield an overfitted model with redundant change-points. In order to prevent overfitting, some sufficiently large minimum segment length can be chosen [51]. Thus, a simulation study was conducted here in order to choose a minimum segment length that is satisfactory for practical prescriptions.
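The constraint t_r − t_{r−1} ≥ L_min can be stated as a small admissibility check on a candidate partition. The Python sketch below (our own illustrative helper, using 0-based indices for change-point positions) verifies that every segment implied by a candidate set of change-points contains at least L_min observations:

```python
def admissible(cps, m, l_min):
    """Check that interior change-points cps (0-based last indices of
    segments) in a series of length m respect the minimum segment
    length: every resulting segment has at least l_min observations."""
    bounds = [-1] + sorted(cps) + [m - 1]   # segment k spans (bounds[k], bounds[k+1]]
    return all(b - a >= l_min for a, b in zip(bounds[:-1], bounds[1:]))
```

With m = 15 and L_min = 5, the partition with change-points at positions 4 and 9 (three segments of five points each) is admissible, whereas moving the first change-point to position 2 leaves a three-point segment and violates the constraint.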

Remark 3.1:
Note that, instead of imposing a minimum length between two successive change-points, it is possible, for example, to constrain the estimates of the mean to lie within a compact set, or to impose stronger conditions on the process [48].

Remark 3.2:
Lavielle [40] obtained some asymptotic results when the length of each segment tends to infinity (at the same rate as the total number of observations); these are, however, beyond the scope of this paper. Indeed, in this paper the choice of the minimum segment length value is justified by practical considerations.
Following simulation settings similar to those of Lavielle [50] for the detection of changes in the mean, the simulated series are sequences of m independent Gaussian random variables of variance σ² = 1. The means in the five segments considered are (0, α, 0, 2α, 0), respectively. We simulated 10,000 series with α = 0.5 (representing the smallest jump size) and 10,000 series with α = 1 (representing the largest jump size). For each of these series and several combinations of L_min and K values, we computed the values of the cost function. Based on these 10,000 simulations, we then computed the average cost function values for each combination of L_min and 1 ≤ K ≤ K_max. Results are presented here only for m = 50, for space reasons; several other sample sizes (m = 100, 200, 500, 1000) were also examined, and the additional simulation results can be found in Supplementary File 2. Note here that in practical applications we do not know in advance when the transition between a stable and an unstable period occurs. Here, the stable period stands for the case in which there is no change and the mean remains constant, whereas the unstable period stands for the case in which the mean changes abruptly at some unknown instant(s). Thus, the positions of the change times were assumed to be random and were chosen according to three patterns (Patterns I-III), which differ in the location of the change-points and in the overall duration of the stable and unstable periods (e.g. in Pattern I the unstable period is longer than the stable period overall). Tables 1-3 summarize the derived results for m = 50 for Patterns I-III, respectively, presenting the average cost function values obtained for several combinations of L_min and K values (both ranging from 1 to 10), for α = 0.5 and α = 1 (representing the smallest and largest jump size, respectively). In each table, the up (↑) and down (↓) arrow symbols in the column cells indicate an increase or a decrease of the RSS values, respectively.
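The simulation design above (five Gaussian segments with means (0, α, 0, 2α, 0) and unit variance) can be reproduced with a few lines of Python. This is a hedged sketch: the paper's simulations were run in R with randomly placed change times, whereas the helper below (function name `simulate_series` is ours) takes the five segment lengths as an explicit argument for simplicity.

```python
import numpy as np

def simulate_series(seg_lengths, alpha, rng):
    """Simulate one series from the study's design: independent N(mu, 1)
    observations over five segments with means (0, alpha, 0, 2*alpha, 0).
    seg_lengths gives the five segment lengths (their sum is m)."""
    means = [0.0, alpha, 0.0, 2 * alpha, 0.0]
    parts = [rng.normal(mu, 1.0, size=n) for mu, n in zip(means, seg_lengths)]
    return np.concatenate(parts)

# one m = 50 series with equal segment lengths and the largest jump size
rng = np.random.default_rng(0)
x = simulate_series([10, 10, 10, 10, 10], alpha=1.0, rng=rng)
```

Repeating this 10,000 times per configuration and averaging the fitted cost-function values reproduces, in spirit, the quantities reported in Tables 1-3.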
Note here that 'Inf' corresponds to cases for which minimization of the RSS terminated due to proximity of the estimation to a value at which the cost function is infinite. These correspond to cases for which the RO scheme could not run successfully (abnormal termination) due to the infinite cost function, since the number of observations required to run the algorithm in accordance with the specified L_min and K exceeds the number of available observations (e.g. for L_min = 6 and K = 9 at least 54 observations are required, so for m = 50 the scheme returns the cost function value as infinite). Tables 1-3 show that the choice of L_min = 1 leads to overfitting, as expected. This is also the case for L_min = 2, 3, 4; in particular, the sequence (Ĵ_K, 1 ≤ K ≤ K_max) keeps decreasing significantly, and a significant improvement of the fit continues to be obtained as change-points are added. An improvement of the fit (a decrease in the RSS values) is generally obtained when we start adding change-points, for any L_min value considered. Nevertheless, for L_min = 5, these improvements become less obvious after detecting the main 5 jumps. Indeed, adding more change-points only allows the detection of very small changes in the empirical mean and reduces the RSS very slightly. In particular, when the number of segments is too close to the maximum bound (m/L_min), the algorithm tends to oversegment. For identifying changes in mean, the minimum segment length in our trial-and-error experiments, for the various sample sizes examined, ranged between 5 and 10. This particular choice was found to avoid a spurious and excessive number of change-points while, at the same time, not unduly lowering the number of identified changes. We then implemented the elbow method to help select the optimal number of segments by fitting the model over a range of values for L_min and K.
Figures 1-6 display the RSS values obtained for m = 50, for Patterns I-III, for both the smallest (α = 0.5) and the largest (α = 1) jump size. Note here that Figures 1-6 show the RSS values (Ĵ_K, 1 ≤ K ≤ K_max) on the y-axis, obtained only for L_min values varying from 1 to 5, versus the number of segments K (1 ≤ K ≤ 10) on the x-axis, since for all cases examined the best-case scenario is first obtained for L_min = 5; that is, the RSS value shows a steep decrease up to the main 5 jumps (which in our case we know to be the optimal number), which are clearly visible, and a slow decrease after the optimum. If the line chart resembles an arm, then the elbow (the point of inflection on the curve) is a good indication that the underlying model fits best at that point. In the visualizer, the elbow (K = 3, 5) is annotated with two dashed vertical lines. The RSS plots show a constantly decreasing curve with clear breakpoints for the optimal number of segments, notably when the largest jump size is considered. The value(s) of K at which the slope abruptly changes provides an estimate K̂ of the number of segments. The favourite numbers of segments for the cases examined appear to be K̂ = 3 (as expected, the smallest jump, with α = 0.5, is not always detected) and K̂ = 5. When the model is fit with L_min = 5, we can see the elbow in the graph for the first time. An 'artifactual' decrease of the RSS can occur when K is too high (close to m/L_min) and generally corresponds to an oversegmentation. Moreover, it is observed that in some cases, for example when the model is fit with L_min = 5, then for K = 9 and K = 10 (for which K is too close to m/L_min = 50/5 = 10) the RSS values start to increase instead of decreasing.
Therefore, the maximum number of segments to search for (K_max) could by default be fixed to the maximum bound (m/L_min), but, as can be seen, setting a smaller value is needed to achieve a more reasonable performance. In fact, the contrast Ĵ_K should necessarily decrease as the model becomes more complex. Taking into account the information provided by Tables 1-3 and Figures 1-6, one can conclude that: (a) The performance of the RO scheme is similar regardless of the Pattern (I-III) considered; in other words, the location of the change-points and the overall duration of the stable/unstable period do affect the RSS values obtained, but not the overall performance observed. (b) Statistical modeling of the simulated data allows us to obtain the optimal segmentation for any number of segments K and proposes some possible numbers of segments. The value(s) of K at which the slope abruptly changes provides an estimate K̂ of the number of segments. Breakpoints for the optimal number of segments are clear when the largest-jump scenario is considered, whereas the smallest jump is not always detected. (c) The sequence (Ĵ_K, 1 ≤ K ≤ K_max) keeps decreasing significantly, and a significant improvement of the fit continues to be obtained when we start adding change-points, for any L_min value considered. Nevertheless, for L_min = 5, these improvements (decreases in the RSS values) suddenly become less obvious after detecting the main 5 jumps that are visible. Indeed, we observe a steep decrease up to the main 5 jumps (which in our case we know to be the optimal number) and a slow decrease after this optimum. Thus, adding more change-points only allows the detection of very small changes in the empirical mean and reduces the RSS very slightly. (d) When the number of segments is too close to the maximum bound (m/L_min), the algorithm tends to oversegment.
The maximum number of segments to search for (K_max) could therefore by default be fixed to the maximum bound (m/L_min), but, as was seen, setting a smaller value is needed to achieve a more reasonable performance. (e) For identifying changes in mean, the minimum segment length in our trial-and-error experiments, for the various sample sizes examined, ranged between 5 and 10. This particular choice was found to avoid a spurious and excessive number of change-points while, at the same time, not unduly lowering the number of identified changes.
In general, our results showed that L_min = max{5, min{10, [m/10]}} (here [x] denotes the integer closest to x) provides a reasonable compromise. Conclusively, in order to obtain a reasonable set of change-points, we summarize here some restrictions that should be taken into account when proposing a new set of change-points. For any data set, it is not reasonable to propose a change-point at the first or the last position in the series. When inserting a change-point between two other change-points, make sure that the difference between their positions is at least five, and check that, after inserting a change-point, the difference between all consecutive change-points is at least five. This restriction on the minimal segment length (L_min = 5) was found to help prevent overfitting without hindering the power to detect short, transient unstable phases. This practical rule accords with the minimum segment length chosen by Capizzi and Masarotto [52]. However, for practical purposes it can be useful to decide the minimum length of a step shift automatically, in accordance with the number of available observations (m); to that end, the above empirical rule can be adopted.
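The empirical rule above is a one-liner in code. A minimal Python sketch (function name is ours; Python's built-in round is used for '[x], the integer closest to x', noting that it rounds exact halves to the nearest even integer):

```python
def l_min_rule(m):
    """Empirical rule from the text: L_min = max(5, min(10, [m/10])),
    where [x] denotes the integer closest to x."""
    return max(5, min(10, round(m / 10)))
```

For example, the rule gives L_min = 5 for m = 50, L_min = 7 for m = 73, and L_min = 10 for any m ≥ 100, matching the 5-to-10 range observed in the trial-and-error experiments.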

Setting an upper limit on the size of the segmentation space
As aforementioned, when the number of segments to search for is set too close to the maximum bound (m/L min ) the algorithm may oversegment, and it is thus advisable to look carefully at the RSS curve in such a case. Moreover, this maximum bound can considerably slow the calculations for larger sample sizes and needs to be reduced to a more reasonable value. Details are given below.
Since the main goal here is to specify the maximum number of segments to search for in accordance with the sample size, we propose a procedure to 'fine-tune' this parameter by choosing the penalty constant adaptively to the data sample size, implementing the sample-size-adjusted BIC (SABIC) [53], which replaces the sample size m with m* = (m + 2)/24. The SABIC has the appealing properties of usually being stricter than the AIC (depending on m) but not as strict as the BIC, and of 'adjusting' for the sample size (except for very small sample sizes, i.e. m ≤ 21, where it has a negative penalty encouraging needless complexity). This adjustment (m*) of the penalty component originates from Rissanen's work on model selection for autoregressive time series models from a minimum description length perspective [54], and in practice it is commonly used in latent class modelling, where it works fairly well [55,56].
In this paper, we implement the SABIC in terms of the RSS values, defined as

SABIC = m log(RSS/m) + k log(m*), with m* = (m + 2)/24,

where k is the number of estimated parameters.
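As an illustration, the RSS-based criterion above translates directly into code. A hedged Python sketch (the exact display equation was garbled in the extracted text; the form below assumes the standard Gaussian RSS-based BIC with log m replaced by log m*, consistent with the definition of m* given earlier):

```python
import math

def sabic_rss(rss, m, k):
    """Sample-size-adjusted BIC in terms of the residual sum of squares:
    SABIC = m * ln(RSS / m) + k * ln(m*), with m* = (m + 2) / 24."""
    return m * math.log(rss / m) + k * math.log((m + 2) / 24)
```

As expected of a penalized criterion, a better fit (smaller RSS) lowers the score at fixed k, while each extra parameter adds a penalty of log((m + 2)/24) rather than log m.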
We followed simulation settings similar to those previously described in Section 3.1. In all cases, we adopted the previously described empirical rule for (pre)specifying L_min. The greatest value of K for which the SABIC (in terms of the RSS) is minimized, for at least 5% of the examined cases, was recorded; Tables 4 and 5 summarize the derived results. Based on the results derived from the different sample sizes examined, we suggest an empirical rule for (pre)specifying K_max as a function of m and L_min (here [x] denotes the largest integer closest to x). This upper limit on the size of the segmentation space provides a reasonable compromise (smaller than the maximum bound m/L_min) to avoid (i) oversegmenting and (ii) slowing the calculations for larger sample sizes. Only if the optimal number of segments identified (K̂) is too close to K_max should one consider increasing K_max.

Comparative study
In this section a comparative study is performed. A simulation-based performance comparison between the proposed scheme and widely used multiple change-point detection algorithms is conducted. Further, an application to real epidemiological data is presented; in particular, the developed technique is compared to its competitors, as regards its ability to detect successfully the signalled start and end weeks of epidemics.

Simulation-based benchmarking
In this subsection, the proposed RO scheme is compared to widely used multiple change-point detection algorithms, under a Gaussian framework. In particular, multiple changes in mean are found using an approximate (BinSeg [57]) or an exact (RO, SegNeigh [42] and PELT [16]) method. Following settings similar to those of Killick and Eckley [58] for identifying changes in mean, the simulated series are sequences of m = 200 independent Gaussian random variables of variance σ² = 1. The means in the four segments considered are (0, 5, 10, 3), respectively; multiple changes in mean are thus placed at locations 50, 100, 150 in the simulated normal data. We conducted 1000 simulations using the R statistical environment, calling the 'cpt.mean' function of the R package changepoint [58]. This function calculates the optimal positioning and number of change-points for the data using the user-specified method. For the proposed RO scheme, the previously described empirical rules for (pre)specifying L_min and K_max were adopted. The (hyper)parameters selected for the rest of the employed methods (BinSeg, SegNeigh and PELT) are as follows. BinSeg: for the BinSeg algorithm ('method = BinSeg'), the default choices were selected, that is, 'penalty = Manual' and 'pen.value = 2 * log(n)'; the minimum segment length was set equal to 2 ('minseglen = 2') and the maximum number of change-points to search for was set equal to 5 ('Q = 5'). SegNeigh: for the SegNeigh algorithm ('method = SegNeigh'), the default choices were selected, that is, 'penalty = Asymptotic' and 'pen.value = 0.01'; the minimum segment length was set equal to 1 ('minseglen = 1') and the maximum number of segments to search for was set equal to 5 ('Q = 5'). PELT: for the PELT algorithm ('method = PELT'), the default choices were selected, that is, 'penalty = Asymptotic' and 'pen.value = 0.01'; here, the minimum segment length equals 1 and the maximum number of change-points to search for is infinite.
Table 6 presents all combinations of change-points detected after executing all the employed methods on the same simulated data. The findings reveal that the exact (RO, SegNeigh and PELT) methods outperform the approximate BinSeg method, in the sense that they return '3' as the optimal number of change-points and '50, 100, 150' as their exact locations for 97% of the cases, whereas BinSeg does so for only 86% of the cases. It can therefore be seen that the proposed RO scheme is a viable alternative to widely used exact methods for multiple change-point detection and inference under a Gaussian framework.

A real-case example
A real-life example is given here to illustrate the implementation of the RO scheme in practice and to unfold its capabilities. In particular, the performance of the developed scheme is studied in the context of detecting outbreaks of an infectious disease. This paper focuses on weekly influenza-like illness (ILI) rate data for Greece (provided by the Hellenic Centre for Disease Control and Prevention), between September 29, 2014 (week40/2014) and October 2, 2016 (week39/2016). Within this framework, it is of interest to compare the developed change-point detection technique with some competitors existing in the literature which are especially designed for outbreak detection purposes. The efficacy of the RO scheme is examined in terms of successful identification of the signalled start and end weeks of the detected epidemics.
Here, we perform the RO scheme for the two periods under study (1st period: week40/2014-week39/2015, 2nd period: week40/2015-week39/2016), and then compare it with its competitors. Note that, from a statistical perspective, our problem consists in recovering the configuration of change-points (epidemic outbreaks) using the whole observed series rather than treating the problem as two separate datasets. From an epidemiological perspective, however, it is of interest to process the observations in one batch for each period under study (Phase I change-point detection) and to return information on whether each sequence contains change-points. In this way, epidemiologically speaking, the diverse disease patterns changing from year to year are revealed, since it is well known that the epidemiology and the burden of a disease also vary from year to year. Further, as previous authors working on change-point analysis of infectious disease surveillance counts (see, for example, Texier et al. [59], among others), we assume that the endemic or epidemic component of the process is driven by a parameter (such as the mean incidence) which is considered piecewise constant. Identifying an epidemic regime then requires identifying the endemic states that come before and after the epidemic segment. Figures 7 and 8 implement the elbow method to help select the optimal number of segments by fitting the model for the first and second period under study, respectively. The RSS plot (see Figure 7) shows a decreasing curve with a clear breakpoint at the optimal number of segments, that is, K̂ = 3 for the first period under study, whereas the favoured numbers of segments appear to be K̂ = 4 and K̂ = 3 for the second period (see Figure 8).
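One simple way to formalize the elbow rule applied in Figures 7 and 8 is sketched below; the RSS values are hypothetical illustrative numbers (not the paper's), and the "negligible next drop" criterion is one possible formalization, not the authors' exact selection rule.

```python
# Illustrative RSS values for K = 1..6 segments (hypothetical numbers):
# the curve drops sharply until K = 3 and flattens afterwards.
rss = [420.0, 150.0, 40.0, 36.0, 33.5, 32.0]

def elbow_k(rss, frac=0.05):
    """Pick the smallest K after which the next RSS drop is negligible
    relative to the total decrease over the whole curve."""
    total = rss[0] - rss[-1]
    for k in range(1, len(rss)):
        drop = rss[k - 1] - rss[k]   # gain from using k segments instead of k-1... shifted: drop from K=k to K=k+1
        if drop < frac * total:
            return k                 # the K-th drop is negligible: elbow at K = k
    return len(rss)

print(elbow_k(rss))  # the elbow of this illustrative curve is at K = 3
```

For the illustrative sequence above the rule returns K = 3, mirroring the first-period choice; in a real application one would inspect the plotted curve as the authors do, since close calls (such as K̂ = 4 versus K̂ = 3 for the second period) require judgment.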
The behaviour of the vertical lines suggests an increase in ILI rate after the first 13 weeks and a decrease after the 25th week for the first period under study (a case of 'two-step' level change); it also suggests an increase in ILI rate after the first 14 weeks, a decrease after the 22nd week, as well as a further decrease after the 29th week, for the second period under study (a case of 'three-step' level change). Note that in the latter case, from an epidemiological perspective, the RO scheme identified, in addition to the sustained step shifts, short transient shifts (τ̂_23: week09/2016 to τ̂_29: week15/2016), since the number of successive observations between these change-points is very close to the minimum number of successive observations allowed between two change-points (L_min = 5). It is important to clarify that in such a case only a practitioner may decide which change-points are 'physically' significant. No statistical criterion can decide, for instance, whether the change-point located at a particular time-point should be associated with a significant change of the inherent properties of the characteristic(s) under study (e.g. a change in the epidemiology of a disease). For this reason, we also fitted the model with K̂ = 3 segments (see Figure 11). Figure 11 suggests an increase in ILI rate after the first 14 weeks and a decrease after the 22nd week for the second period.
We therefore concluded that the extracted signalled start (sw) and end weeks (ew) of the detected epidemics are sw01-ew12/2015 and sw01-ew08/2016, for the first and second period under study, respectively. We proceed below in examining the ability of the RO scheme for 'successful' change-point detection through benchmarking. For this purpose, the derived change-points (sw01-ew12/2015 and sw01-ew08/2016) are compared with those derived after executing some competing modelling techniques existing in the literature which are especially designed for outbreak detection purposes. In the models below, X(t) denotes the observed time series values (weekly ILI rate), ε(t) are zero-mean random variables with variance σ^2, and m denotes the number of observations within one year:
• M11: the current approach to ILI rate surveillance, that is, the standard flu detection algorithm based on Serfling's cyclic regression model (Serfling [60]), a routine method adopted in epidemiological surveillance systems in European countries (European Centre for Disease Prevention and Control-ECDC) and in the U.S.A. (Centers for Disease Control and Prevention-CDC);
• M23: X(t) = α_0 + α_1 t + α_2 t^2 + γ_1 cos(2πt/m) + δ_1 sin(2πt/m) + γ_2 cos(4πt/m) + δ_2 sin(4πt/m) + γ_3 cos(8πt/m) + δ_3 sin(8πt/m) + ε(t); an extended Serfling-type periodic regression model selected as the best fitting one by Parpoula et al. [61];
• MXM11: X(t) = α_0 + α_1 t + γ_1 cos(2πt/m) + δ_1 sin(2πt/m) + φ_1 X(t−1) + φ_2 X(t−2) + ε(t) + λ_1 ε(t−1) + ω_1 mintemp; a model incorporating Auto-Regressive Moving Average (ARMA) terms and random meteorological covariates in the model structure, as described by Kalligeris et al. [62];
• SegNeigh-ARMA: the segment neighbourhood algorithm [42] in conjunction with periodic-type ARMA time series modelling; an alternative methodology capable not only of identifying the beginning and end of the extreme periods that occur, but also of modelling the typical (non-extreme) parts of the time series, as described by Kalligeris et al. [63]. As for the (hyper)parameters of the SegNeigh algorithm, the default choices in the 'cpt.mean' function [58] were selected, that is, the minimum segment length was set equal to 1 ('minseglen = 1') and the maximum number of segments to search for was set equal to 5 ('Q = 5').
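To illustrate how a Serfling-type periodic regression of the M23 form is fitted in practice, the following is a minimal least-squares sketch on simulated data; the signal, the seed, and the crude "fitted plus two residual standard deviations" alert threshold are all illustrative assumptions, not the authors' fit or the ECDC/CDC threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 52                        # observations per year (weekly data)
t = np.arange(1, 3 * m + 1)   # three years of weekly observations

# Simulated ILI-like series: trend + annual harmonic + noise (illustrative)
y = (2.0 + 0.01 * t
     + 3.0 * np.cos(2 * np.pi * t / m) + 1.0 * np.sin(2 * np.pi * t / m)
     + rng.normal(0.0, 0.5, t.size))

# Design matrix of the extended Serfling-type model M23: intercept,
# linear and quadratic trend, harmonics at periods m, m/2 and m/4
X = np.column_stack([
    np.ones_like(t, dtype=float), t, t ** 2,
    np.cos(2 * np.pi * t / m), np.sin(2 * np.pi * t / m),
    np.cos(4 * np.pi * t / m), np.sin(4 * np.pi * t / m),
    np.cos(8 * np.pi * t / m), np.sin(8 * np.pi * t / m),
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Crude epidemic flag: observation exceeds fitted + 2 residual sd
resid_sd = np.std(y - fitted)
alerts = y > fitted + 2 * resid_sd
```

Ordinary least squares recovers the harmonic coefficients (here γ_1 ≈ 3 and δ_1 ≈ 1), and weeks flagged by `alerts` play the role of the signalled epidemic weeks in this simplified setting.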
Summarizing the extracted signalled start and end weeks of the epidemics, all methods considered gave rise to similar results, producing almost identical signalled start and end weeks (except for MXM11 for the second period under study). For this reason the start and end weeks of the epidemics are summarized here rather than shown in a figure: when all methods are placed in one figure for comparison purposes, the vertical lines (breakpoints) of all methods almost coincide, so the differences are not visible. We further examined the ability of the RO, SegNeigh-ARMA and MXM11 approaches to detect epidemics using Receiver Operating Characteristic (ROC) curve analysis and its related metrics (Accuracy-ACC, Sensitivity-SENS, Specificity-SPEC, Area Under the ROC curve-AUC). Here, the change-points derived from the RO, SegNeigh-ARMA and MXM11 approaches are compared with those derived after executing the 'gold standard' approach to influenza surveillance (the standard and extended ECDC/CDC flu detection algorithms, models M11 and M23, respectively). Recall that the extracted signalled start and end weeks of epidemics were found to be identical considering either M11 or M23 (i.e. sw01-ew13/2015 and sw01-ew08/2016). The same holds for SegNeigh and the RO scheme, both based on change-point analysis, i.e. both yielded sw01-ew12/2015 and sw01-ew08/2016. ROC curves have been a constructive measure for determining the trade-off between hit rates and false alarm rates of classifiers, allowing the determination and comparison of the performances of several classification models, as discussed in Parpoula [64]. For a more detailed discussion on the theory and practice of ROC curves, the interested reader may refer to Pepe [65], among others. In our study, we performed pairwise comparisons of ROC curves for the approaches examined (RO, SegNeigh-ARMA and MXM11).
More specifically, we estimated the aforementioned metrics along with their 95% Confidence Intervals (CIs) (exact Clopper-Pearson CIs for ACC, SENS and SPEC, and an exact binomial CI for each derived AUC), following a similar rationale as in Parpoula [64]. Table 7 presents the estimated metrics along with their 95% CIs. Table 8 presents the pairwise comparisons of ROC curves. A two-sided statistical test was performed in a paired design, that is, comparing paired-data AUCs based on empirical (nonparametric) ROC curve estimation, implementing the empirical (nonparametric) methods of DeLong et al. [66] and of Hanley and McNeil [67]. In Table 8, SE denotes the standard error of the difference (diff.) between the AUCs (AUC_1 − AUC_2), z-value corresponds to the calculated z-statistic (Zhou et al. [68]) for testing H_0: AUC_1 = AUC_2, and significance level represents the p-value (P), that is, the probability of observing a difference between the AUCs at least as extreme as the one in the sample data, under the null hypothesis of equal AUCs. Note that the aforementioned calculations were obtained for each employed method as well as for both periods under study. Table 7 indicates that the RO and SegNeigh-ARMA approaches outperform MXM11 in terms of higher AUC, ACC and SENS values, and successfully detect the change-points (epidemic outbreaks) when compared to the 'gold standard' approach to influenza surveillance. Further, the RO scheme does indeed compare favourably with its competitors, since it may suffice for epidemiological surveillance purposes without the need for recourse to an unfamiliar and more complicated model. Moreover, Table 8 indicates that statistically significant differences between AUCs were identified for the RO versus MXM11 and the SegNeigh-ARMA versus MXM11 comparisons, but no statistically significant difference was identified between the RO and SegNeigh-ARMA approaches (both based on change-point analysis).
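The empirical AUC and the z-test for comparing two AUCs can be sketched as follows; this uses the Mann-Whitney form of the empirical AUC and the Hanley-McNeil standard error, and the z-test shown is the simpler independent-samples version, not the paired DeLong et al. [66] method the paper applies.

```python
import numpy as np
from math import sqrt, erf

def auc_mann_whitney(scores_pos, scores_neg):
    """Empirical (nonparametric) AUC via the Mann-Whitney statistic:
    the proportion of (positive, negative) pairs ranked correctly."""
    pos = np.asarray(scores_pos, float)[:, None]
    neg = np.asarray(scores_neg, float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil standard error of an empirical AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc) + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return sqrt(var)

def z_test_aucs(auc1, se1, auc2, se2):
    """Two-sided z-test of H0: AUC_1 = AUC_2 (independent-samples
    approximation; a paired design uses the DeLong covariance instead)."""
    z = (auc1 - auc2) / sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p
```

In the paper's setting, the "positive" and "negative" scores would come from weeks labelled epidemic and non-epidemic by the gold-standard M11/M23 algorithms, and a paired covariance term would be subtracted inside the z-statistic's denominator.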
If the prevalent concern is that the process may incur a sustained jump in the mean, the results indicate that the RO scheme is a viable alternative to existing outbreak detection methods.

Concluding remarks
This paper addressed the problem of 'a posteriori' optimal multiple change-point detection and inference, performing a global segmentation. Our main goal was to recover the configuration of change-points using the whole observed series, aiming at detecting any change in the mean of a process in historical individual time series data. Within this framework, a recursive optimization algorithm was developed for optimally segmenting a time series and fine tuning the input parameters (that is, the maximum number of segments to search for and the minimum segment length), which are usually unknown and for which there is no objective way of pre-specification. The proposed scheme therefore addresses a wide class of real-life contexts and problems where the identification of optimal level shifts in a time series is the main goal.
Extensive simulation results showed that L_min = max(5, min(10, [m/10])) and K_max = min(50, [√m] + 1) provide a reasonable compromise for practical prescriptions, useful for pre-specifying these two tuning parameters automatically in accordance with the number of available observations. The restriction on the minimum segment length was found to help prevent overfitting without hindering the power to detect short, transient out-of-control (OC) phases, whereas the upper limit on the size of the segmentation space was found to help avoid oversegmenting and slowing down the calculations for larger sample sizes. Further, the simulation-based findings reveal that the exact (RO, SegNeigh and PELT) methods outperform the approximate BinSeg method, in the sense that they correctly return the optimal number of change-points and their exact locations for 97% of the examined cases, whereas BinSeg does so for only 86% of the cases. It can therefore be seen that the proposed RO scheme is a viable alternative to widely used exact methods for multiple change-point detection and inference under a Gaussian framework.
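These two prescriptions translate directly into code; the sketch below reads [·] as the floor function (an assumption consistent with the rules' role as integer parameters).

```python
import math

def tuning_defaults(m):
    """Empirical rules from the simulations: minimum segment length L_min
    and maximum number of segments K_max, as functions of sample size m."""
    l_min = max(5, min(10, m // 10))          # L_min = max(5, min(10, [m/10]))
    k_max = min(50, math.isqrt(m) + 1)        # K_max = min(50, [sqrt(m)] + 1)
    return l_min, k_max

print(tuning_defaults(200))   # the simulation study's m = 200
```

For instance, the simulated series of the benchmarking study (m = 200) yield L_min = 10 and K_max = 15, comfortably covering the true configuration of 4 segments with a minimum segment length of 50.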
A real-life example was also given to illustrate the implementation of the proposed scheme in practice and to unfold its capabilities in the context of detecting outbreaks of an infectious disease. The developed scheme was found to compare favourably with its competitors, since it successfully detects the change-points (epidemic outbreaks) compared to some typical modelling approaches existing in the literature (designed for outbreak detection purposes) and suffices for epidemiological surveillance purposes without the need for recourse to an unfamiliar and more complicated model. Note that in the presented Gaussian likelihood-based framework, we defined the probability distribution of the observed data assuming that the sequence of residual errors is i.i.d. Gaussian, and in these settings the model was parametric. Lavielle [40] pointed out that a Gaussian likelihood can be effectively used as a contrast function for detecting changes in the mean (and/or in the variance) for any sequence of random (not necessarily Gaussian) variables, and showed through an example that significant changes are well recovered without any assumptions on the dependence structure of the data (even for a discrete distribution). This is also the case for the real-life example presented here, where the change-points are well recovered without the need to account for autocorrelation in the epidemiological historical process data.
It is worth noting that it was beyond the scope of this paper to develop a distribution-free change-point detection scheme as an alternative to those already existing in the literature. The twofold goal of this paper was to develop a change-point detection scheme capable not only of optimal time series segmentation under a Gaussian framework, but also of input parameter fine-tuning in the context of multiple change-point analysis. Nevertheless, the derived practical rules (for pre-specifying the maximum number of hypothetical change-points to search for, subject to the constraint of a minimum segment length, in accordance with a given sample size) could also be adopted for change-point detection methods monitoring process locations that make no distributional assumptions. As future work, it would be interesting to adapt such an optimization scheme to a multivariate data setting, exploring and fine-tuning the input parameters in order to provide reasonable compromises to be implemented in either a fully parametric or a distribution-free change-point analysis framework.