Meta-learning how to forecast time series

Features of time series are useful in identifying suitable models for forecasting. We present a general framework, labelled FFORMS (Feature-based FORecast Model Selection), which selects forecast models based on features calculated from each time series. The FFORMS framework builds a mapping that relates the features of a time series to the “best” forecast model using a classification algorithm such as a random forest. The framework is evaluated using time series from the M-forecasting competitions and is shown to yield forecasts that are almost as accurate as state-of-the-art methods, but are much faster to compute. We use model-agnostic machine learning interpretability methods to explore the results and to study what types of time series are best suited to each forecasting model.


Introduction
Forecasting is a key activity for many businesses in order to operate efficiently. The rapid advances in computing technologies have enabled businesses to keep track of large numbers of time series. Hence, it is becoming increasingly common to have to regularly forecast many millions of time series. For example, large-scale businesses may be interested in forecasting sales, costs, and demand for thousands of products across various locations, warehouses, etc.
Technology companies such as Google collect many millions of daily time series, such as web-click logs, web search counts, queries, revenues, numbers of users for different services, etc. Such large collections of time series require fast automated procedures for generating accurate forecasts. The scale of these tasks raises computational challenges that we seek to address by proposing a new fast algorithm for model selection and time series forecasting. A key component of our work is in studying how this algorithm works, by identifying what types of time series should be forecast with the various available models.
Selecting the most appropriate model for forecasting a given time series can be challenging. Two of the most commonly used automated algorithms are the ETS algorithm of Hyndman et al. (2002) and the automatic ARIMA algorithm of Hyndman & Khandakar (2008). Both algorithms are implemented in the forecast package in R (R Core Team 2022; Hyndman et al. 2021). In this paradigm, a class of models is selected in advance, and many models within that class are estimated for each time series. The model with the smallest AICc value is chosen and used for forecasting. This approach relies on the expert judgement of the forecaster in first selecting the most appropriate class of models to use, as it is not usually possible to compare AICc values between model classes due to differences in the way the likelihood is computed and the way initial conditions are handled.
An alternative approach, which avoids selecting a class of models a priori, is to use a simple "hold-out" test set; but then there is often insufficient data to draw a reliable conclusion. To overcome this drawback, time series cross-validation can be used (Racine 2000; Hyndman & Athanasopoulos 2021); then models from many different classes may be applied, and the model with the lowest cross-validated MSE selected. However, this increases the computation time involved considerably (at least to order n^2, where n is the number of series to be forecast).
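To make the cost concrete, the sketch below implements rolling-origin (time series) cross-validation in Python, with two hypothetical "models" (naive and mean forecasts) standing in for real candidates; note that every candidate must be re-fitted at each forecast origin, which is what drives the computation time up.

```python
import numpy as np

def ts_cross_val_mse(y, fit_forecast, min_train=20, h=1):
    """Rolling-origin evaluation: repeatedly fit on y[:t] and score
    the h-step-ahead forecast against the held-out observation."""
    errors = []
    for t in range(min_train, len(y) - h + 1):
        train = y[:t]
        fc = fit_forecast(train, h)          # returns h forecasts
        errors.append((y[t + h - 1] - fc[h - 1]) ** 2)
    return float(np.mean(errors))

# Hypothetical candidate "models": naive (last value) and mean forecasts.
naive = lambda train, h: np.repeat(train[-1], h)
mean_fc = lambda train, h: np.repeat(train.mean(), h)

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=100))          # a random-walk toy series
scores = {"naive": ts_cross_val_mse(y, naive),
          "mean": ts_cross_val_mse(y, mean_fc)}
best = min(scores, key=scores.get)           # model with lowest CV MSE
```

For a random walk, the naive method should win, as its one-step errors are just the innovations.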
Clearly, there is a need for a fast and scalable algorithm to automate the process of selecting models with the aim of forecasting. We refer to this process as forecast-model selection. We propose a general meta-learning framework using features of the time series to select the class of models, or even the specific model, to be used for forecasting. The forecast-model selection process is carried out using a classification algorithm: we use the time series features as inputs, and the "best" forecasting model as the output. The classifier is built using a large historical collection of time series, in advance of the forecasting task at hand. Hence, this is an "offline" procedure.
The "online" process of generating forecasts only involves calculating the features of a time series and using the pre-trained classifier to identify the best forecasting model. Hence, generating forecasts only involves the estimation of a single forecasting model, with no need to estimate large numbers of models within a class, or to carry out a computationally-intensive cross-validation procedure. We refer to this general framework as FFORMS (Feature-based FORecast Model Selection).
Currently, there are two methods motivated by and using the FFORMS framework as their backbone. Montero-Manso et al. (2020) proposed FFORMA (Feature-based FORecast Model Averaging), which uses a gradient boosting algorithm with a custom objective function to train the meta-learner to obtain weights for forecast combinations. FFORMA was ranked the second most accurate method in the M4-competition. Talagala, Li & Kang (2022) proposed FFORMPP (Feature-based FORecast Model Performance Prediction), which uses efficient Bayesian multivariate surface regression to model forecast error as a function of features. FFORMPP allows models to be ranked according to their relative forecasting performance. Both of these approaches are considerably slower (and arguably prohibitively slow in an operational sense) than FFORMS. However, they show that if we are willing to sacrifice some speed, greater accuracy can be achieved using the FFORMS framework. As we will see in what follows, the FFORMS algorithm achieves reasonably accurate forecasts within a limited time budget.
There have been several recent studies on the use of meta-learning approaches to automate forecast model selection based on features computed from the time series (Shah 1997; Prudêncio & Ludermir 2004; Lemke & Gabrys 2010; Kück, Crone & Freitag 2016). However, to the best of our knowledge, very limited effort has been made to understand how the meta-learners make their decisions and what is really happening inside these complex model structures. To fill these gaps, this paper takes a first step towards providing a comprehensive analysis of the relationship between time series features and forecast model selection using machine learning interpretability techniques. We try to answer the following questions: i) What are the similarities identified by the meta-learner between model classes? ii) Which features contribute most to the classification process? iii) How are features related to the property being modelled? iv) How do features interact with each other to identify a suitable forecasting model? We believe addressing these questions can enhance the transparency of the model's predictions and build trust in them.
The remainder of the paper is structured as follows. We review the related work in Section 2.
In Section 3 we explain the detailed components and procedures of our proposed framework for forecast model selection. Section 4 presents the forecasting results in application to the M4-competition data. Section 5 presents the results of what is happening under the hood of FFORMS, followed by some conclusions and suggestions for future work in Section 6.

Talagala, Hyndman, Athanasopoulos: 14 December 2022

Reid (1972) points out that the performance of forecasting methods changes according to the nature of the data. Exploring the reasons for these variations may be useful in selecting the most appropriate model. In response to the results of the M3 competition (Makridakis & Hibon 2000), similar ideas have been put forward by others. Hyndman (2001), Lawrence (2001) and Armstrong (2001) argue that the characteristics of a time series may provide useful insights into which methods are most appropriate for forecasting. Rather than work with the time series directly at the level of individual observations, we propose analysing time series via an associated "instance space" formed by features of the time series. A time series feature is any measurable characteristic of the time series. For example, Figure 1 shows the strength of trend plotted against the strength of seasonality for a collection of time series (Hyndman 2009). Time series in the lower right quadrant of the scatter plot are non-seasonal but trended, while there is only one series with both high trend and high seasonality. We also see how the degree of seasonality and trend varies between series. Other examples of time series features include autocorrelation, spectral entropy, and measures of self-similarity and nonlinearity. Fulcher & Jones (2014) identified 9000 operations to extract features from time series.

Time series features
The choice of the most appropriate set of features depends on both the nature of the time series being analysed, and the purpose of the analysis. In Section 4, we study the time series from the M1, M3 and M4 competitions (Makridakis et al. 1982; Makridakis & Hibon 2000; Makridakis, Spiliotis & Assimakopoulos 2020), and we select features for the purpose of forecast model selection. The M1, M3 and M4 competitions involve time series of differing length, scale and other properties. We include length as one of our features, but the remaining features are asymptotically independent of the length of the time series (i.e., they are ergodic), and they are independent of scale. As our main focus is forecasting, we select features that have been shown in other studies (Kang, Hyndman & Smith-Miles 2017; Montero-Manso et al. 2020) to have discriminatory power in selecting a good model for forecasting.

Meta-learning for algorithm selection
John Rice was an early and strong proponent of the idea of meta-learning, which he called the algorithm selection problem (ASP) (Rice 1976). The term meta-learning started to appear with the emergence of the machine-learning literature.
Rice's framework for algorithm selection is shown in Figure 2. The feature space, F, is the range of measures that characterize the problem space, P. The algorithm space, A, is a list of suitable candidate algorithms which can be used to find solutions to the problems in P. The performance metric, Y, is a measure of algorithm performance such as accuracy, speed, etc. The main challenge in ASP is to identify the selection mapping S from the feature space F to the algorithm space A. A formal definition of the algorithm selection problem is given by Smith-Miles (2009), and is repeated below.

Algorithm selection problem.
For a given problem instance x ∈ P, with features f(x) ∈ F, find the selection mapping S(f(x)) into the algorithm space A, such that the selected algorithm α ∈ A maximizes the performance mapping y(α(x)) ∈ Y.
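As a toy illustration of Rice's framework (not the paper's implementation), the Python sketch below hand-codes a selection mapping S over a two-model algorithm space A, using two illustrative features; in FFORMS, S is instead learned by a classifier.

```python
# A toy instantiation of Rice's framework: the selection mapping S
# is a lookup built offline; here it is hand-coded for illustration.
def features(x):
    # f(x) in F: two illustrative features of a numeric sequence
    n = len(x)
    trend = abs(sum((i - n / 2) * v for i, v in enumerate(x))) / n
    return {"length": n, "trend": trend}

def S(f):
    # Selection mapping from feature space F into algorithm space A.
    return "drift" if f["trend"] > 1.0 else "naive"

algorithms = {  # algorithm space A
    "naive": lambda x, h: [x[-1]] * h,
    "drift": lambda x, h: [x[-1] + (x[-1] - x[0]) / (len(x) - 1) * i
                           for i in range(1, h + 1)],
}

x = list(range(30))        # a strongly trended problem instance in P
alpha = S(features(x))     # the selected algorithm
forecast = algorithms[alpha](x, 3)
```

Here the trended instance is routed to the drift method, whose forecasts continue the linear trend.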

Forecast-model selection using meta-learning
Selecting models for forecasting can be framed according to Rice's ASP framework. Existing methods differ with respect to the way they define the problem space (P), the features (F), the forecasting accuracy measure (Y) and the selection mapping (S).
Collopy & Armstrong (1992) introduced 99 rules based on 18 features of time series, in order to make forecasts for economic and demographic time series. This work was extended by Armstrong (2001) to reduce human intervention. Shah (1997) used the following features to classify time series: the number of observations, the ratio of the number of turning points to the length of the series, the ratio of the number of step changes, skewness, kurtosis, the coefficient of variation, autocorrelations at lags 1-4, and partial autocorrelations at lags 2-4. Casting Shah's work in Rice's framework, we can specify: P = 203 quarterly series from the M1 competition (Makridakis et al. 1982); A = 3 forecasting methods, namely simple exponential smoothing, Holt-Winters exponential smoothing with multiplicative seasonality, and a basic structural time series model; Y = mean squared error for a hold-out sample. Shah (1997) learned the mapping S using discriminant analysis. Prudêncio & Ludermir (2004) was the first paper to use the term "meta-learning" in the context of time series model selection. They studied the applicability of meta-learning approaches for forecast model selection based on two case studies. Again using the notation above, we can describe their first case study with: A contained only two forecasting methods, simple exponential smoothing and a time-delay neural network; Y = mean absolute error; F consisted of 14 features, namely length, autocorrelation coefficients, coefficient of variation, skewness, kurtosis, and a test of turning points to measure the randomness of the time series; S was learned using the C4.5 decision tree algorithm. For their second study, the algorithm space included a random walk, Holt's linear exponential smoothing and AR models; the problem space P contained the yearly series from the M3 competition (Makridakis & Hibon 2000); F included a subset of features from the first study; and Y was a ranking based on error. Beyond the task of forecast-model selection, they used the NOEMON approach to rank the algorithms (Kalousis & Theoharis 1999).

Reference set

In order to train our classification algorithm, we need a large collection of time series which are similar to those we will be forecasting. We assume that we have an essentially infinite population of time series, and we take a sample of them in order to train the classification algorithm, denoted the "observed sample". The new time series we wish to forecast can be thought of as additional draws from the same population. Hence, any conclusions made from the classification framework refer only to the population from which the sample has been selected. We may call this the "target population" of time series. It is important to have a well-defined target population to avoid misapplying the classification rules. In practice, we may wish to augment the set of observed time series by simulating new time series similar to those in the assumed population (we provide details and discussion in Section 3.2). We denote the total collection of time series used for training the classifier as the "reference set".
Each time series within the reference set is split into a training period and a test period. From each training period we compute a range of time series features and fit a selection of candidate models. The calculated features form the input vector to the classification algorithm. Using the fitted models, we generate forecasts and identify the "best" model for each time series based on a forecast error measure (e.g., MASE) calculated over the test period. The models deemed "best" form the output labels for the classification algorithm. In our research, since we used the M-competition data, we used the training and test sets specified by the M-competition organizers. In the M-competitions the training set length varies between series; however, the test length is fixed for each frequency level. The numbers of observations in the test periods are 6 for yearly, 8 for quarterly, 18 for monthly, 13 for weekly, 14 for daily, and 48 for hourly series. Use of this standard test set specified by the M-competitions allows us to compare our results with the results of other studies. Figure 4 shows the distributions of the length of the training period for the M-competition data and the simulated data used in the study.
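The labelling step described above can be sketched in Python. The candidate "models" here are hypothetical stand-ins for the real candidates (ETS, ARIMA, etc.), and MASE is computed with a non-seasonal (m = 1) naive scaling for simplicity.

```python
import numpy as np

def mase(train, test, forecast, m=1):
    """Mean absolute scaled error: MAE of the forecast, scaled by the
    in-sample MAE of the seasonal naive method (m = seasonal period)."""
    scale = np.mean(np.abs(train[m:] - train[:-m]))
    return np.mean(np.abs(test - forecast)) / scale

# Hypothetical candidate "models" standing in for ETS, ARIMA, etc.
candidates = {
    "naive": lambda train, h: np.repeat(train[-1], h),
    "mean":  lambda train, h: np.repeat(train.mean(), h),
}

y = np.arange(40, dtype=float)        # a trended toy series
train, test = y[:34], y[34:]          # 6-step test period (yearly-style)

errors = {name: mase(train, test, f(train, len(test)))
          for name, f in candidates.items()}
label = min(errors, key=errors.get)   # the "best" model = output label
```

The feature vector of the training period and this label together form one training instance for the classifier.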
The pseudocode for our proposed framework is presented in Algorithm 1 below. In the following sections, we briefly discuss aspects of the offline phase.

Augmenting the observed sample with simulated time series
Data augmentation is used to expand the amount of training data by adding significantly changed versions of existing data, or brand-new synthetic data derived from existing data. This helps to increase the generalizability of the algorithm. In practice, we may wish to augment the set of observed time series by simulating new time series similar to those in the assumed population. This process may be useful when the number of observed time series is too small to build a reliable classifier. Alternatively, we may wish to add more of some particular types of time series to the reference set in order to get a more balanced sample for the classification.
There are several options that could be used to produce simulated series that are similar to those in the population. One approach is to fit models to the time series in the observed sample, and then use those models as data generating processes to obtain similar time series. This could be done, for example, using exponential smoothing models, ARIMA models or other modelling functions in the forecast package in R (Hyndman et al. 2021). For each fitted model, we can simulate multiple time series with similar characteristics to the observed data. An alternative approach is to generate new time series with features similar to those found in the reference set (Kang, Hyndman & Li 2020), using the gratis package in R (Kang et al. 2020). Either way, these simulated series are produced in the offline phase of the algorithm, so the computational time in producing them is of no real consequence.
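A minimal Python sketch of the first approach, assuming a simple AR(1) data generating process fitted by least squares; the paper uses much richer ETS and ARIMA models via the forecast package in R, so this only illustrates the fit-then-simulate idea.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_ar1(y):
    """Least-squares AR(1) fit: y_t = c + phi * y_{t-1} + e_t."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    sigma = np.std(y[1:] - X @ np.array([c, phi]))
    return c, phi, sigma

def simulate_ar1(c, phi, sigma, n, y0):
    """Use the fitted model as a data generating process."""
    sim = [y0]
    for _ in range(n - 1):
        sim.append(c + phi * sim[-1] + rng.normal(scale=sigma))
    return np.array(sim)

observed = rng.normal(size=200).cumsum() * 0.1 + 5   # one observed series
c, phi, sigma = fit_ar1(observed)
augmented = [simulate_ar1(c, phi, sigma, len(observed), observed[0])
             for _ in range(5)]                      # 5 synthetic siblings
```

Each synthetic series shares the fitted dynamics of the observed one, and all of this runs in the offline phase.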

Input: features
Although the FFORMS framework could use any time series features, our current implementation uses the features shown in Table 1; these are defined in the Appendix. These features were chosen based on their demonstrated utility in Kang, Hyndman & Smith-Miles (2017) and Montero-Manso et al. (2020).
An attribute of the proposed FFORMS framework is that its online phase is fast compared to the time required for implementing a typical model selection or cross-validation procedure.
Therefore, we consider only features that can be computed rapidly, as this computation must be done during the online phase. Furthermore, all our features are easily interpretable.
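For illustration, a couple of such quickly computable features can be sketched in Python. The trend_strength function here is a crude linear-regression stand-in for the decomposition-based measure actually used in the paper, so treat these as illustrative approximations rather than the definitions in Table 1.

```python
import numpy as np

def acf1(y):
    """Lag-1 autocorrelation."""
    d = y - y.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

def trend_strength(y):
    """Share of variance explained by a linear trend, in [0, 1];
    a crude stand-in for the decomposition-based measure."""
    t = np.arange(len(y))
    resid = y - np.poly1d(np.polyfit(t, y, 1))(t)
    return float(max(0.0, 1.0 - resid.var() / y.var()))

def feature_vector(y):
    # length, ACF1, ACF1 of the differenced series, trend strength
    return np.array([len(y), acf1(y), acf1(np.diff(y)), trend_strength(y)])

rng = np.random.default_rng(0)
trended = np.arange(100) + rng.normal(size=100)   # strong trend
noise = rng.normal(size=100)                      # no trend
```

All of these are O(n) in the series length, which is what keeps the online phase fast.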

Output: labels
The task of the classification algorithm is to identify the "best" forecasting method for a given time series. The candidate models considered as labels will depend on the observed time series.
For example, if we have only non-seasonal time series, and no chaotic features, we may wish to restrict the models to white noise, random walk, ARIMA and ETS processes. We fit the corresponding models outlined in Table 2 to each series in the reference set.
Each candidate model considered is estimated strictly on the training period of each series in the reference set. Forecasts are then generated for the test period, over which the chosen forecast accuracy measure is calculated. The model with the lowest forecast error measure over the test period is deemed "best". This step is the most computationally intensive and time-consuming, as each candidate model has to be applied to each series in the reference set. However, as this task is done during the offline phase of the FFORMS framework, the time involved and the associated computational cost are of no real significance and are completely controlled by the user.
For each series in the reference set and each forecasting model, we compute the accuracy measures that were used in the M4-competition (Makridakis, Spiliotis & Assimakopoulos 2020), namely the MASE and the symmetric mean absolute percentage error (sMAPE).Each of these is standardised by the median MASE and median sMAPE, calculated across the forecast models.
The model with the lowest average value of the scaled MASE and scaled sMAPE is selected as the output class label.
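The combination rule can be sketched with a small numpy example; the error matrices here are hypothetical, with rows for series and columns for candidate models.

```python
import numpy as np

# Hypothetical error matrices: rows = series, columns = candidate models.
mase = np.array([[1.2, 0.8, 1.0],
                 [2.0, 2.2, 1.8],
                 [0.5, 0.9, 0.7]])
smape = np.array([[14.0, 9.0, 11.0],
                  [25.0, 30.0, 22.0],
                  [6.0, 10.0, 8.0]])

# Standardise each measure by its median across the candidate models,
# then pick the model with the lowest average of the two scaled errors.
scaled_mase = mase / np.median(mase, axis=1, keepdims=True)
scaled_smape = smape / np.median(smape, axis=1, keepdims=True)
labels = np.argmin((scaled_mase + scaled_smape) / 2, axis=1)
```

Scaling by the per-series median puts the two error measures on comparable footings before averaging them.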

Train a FFORMS meta-learner
In principle, any classification algorithm could be used to train the meta-learner. Our current implementation uses a random forest (RF) algorithm because: i) it can model complex interactions between features; ii) the modelling framework captures linear and non-linear effects of features through the averaging of a large number of decision trees; iii) it is robust against over-fitting the training data; iv) it handles the problem of imbalanced classes easily; v) it is fast compared to boosting algorithms; and vi) it is straightforward to implement with available software. In this work, we have used the randomForest package (Liaw & Wiener 2002; Breiman et al. 2018), which implements the Fortran code for random forest classification by Breiman & Cutler (2004).
The minimum number of observations in a leaf node is set to 1, and the number of trees is set to 1000 to limit the time taken in both the online and offline phases of the algorithm.
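Training such a meta-learner can be sketched with scikit-learn rather than the randomForest R package used in the paper; the features, labels, and class_weight setting below are illustrative assumptions, not the paper's meta-data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)

# Hypothetical meta-data: 300 series described by 4 features each,
# labelled with the index of the best of 3 candidate models.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1.2).astype(int)

# Mirror the settings described above: 1000 trees, leaf size 1;
# class_weight helps counter imbalanced class labels.
rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=1,
                            class_weight="balanced", oob_score=True,
                            random_state=0)
rf.fit(X, y)                          # offline phase

x_new = rng.normal(size=(1, 4))       # features of a new series (online phase)
chosen_model = int(rf.predict(x_new)[0])
```

The online phase is then just one feature computation plus one forest lookup, with no model estimation beyond the single chosen forecasting model.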
Our aim is different from most classification problems, in that we are not interested in accurately predicting the class, but in finding the best possible forecast model. It is possible that two models produce almost equally accurate forecasts, and therefore it is not important whether the classifier picks one model over the other. Therefore, we report the forecast accuracy obtained from the FFORMS framework, rather than the classification accuracy.

Application to M4-competition data

Evaluation choices
To test how well our proposed framework can identify suitable models for forecasting, we use the time series from the M1 (Makridakis et al. 1982), M3 (Makridakis & Hibon 2000) and M4 (Makridakis, Spiliotis & Assimakopoulos 2020) competitions. The data from the M-competitions are a sample of time series collected from several domains such as demography, finance, business and economics. In our experiment, we treat the time series from the M1 and M3 competitions as the observed sample. We augment these by adding to the reference set multiple time series simulated using the training period of each series in the M4-competition. For this purpose, we fit models to each series, and simulate new time series from those models. For yearly, quarterly and monthly data, we use ETS and ARIMA models via the ets() and auto.arima() functions in the forecast package. For weekly, daily and hourly series, we use a multiple seasonal decomposition (MSTL) (Bandara, Hyndman & Bergmeir 2021), and forecast the seasonally adjusted series using ETS models. This is implemented in the stlf() function from the forecast package.
We build separate FFORMS meta-learners for yearly, quarterly, monthly, daily, and hourly series, due to their differences in features and class labels. Then the pre-trained FFORMS meta-learners are used to forecast the test period of the M4-competition time series.

Summary of the main results
Table 3 shows the performance of the FFORMS approach on the M4-competition data. The point forecasts and prediction intervals are evaluated based on the test period of each series.
The point forecasts are evaluated based on the MASE computed for each forecast horizon, and then by averaging the MASE errors across all series corresponding to each frequency category.
Similarly, the mean scaled interval score (MSIS) (Makridakis, Spiliotis & Assimakopoulos 2020) is used to evaluate the prediction intervals. The results are compared against several benchmarks and the top three ranked methods of the M4-competition. The top-ranking methods of the M4-competition were based on some type of combination approach. The first ranked method was a hybrid approach that produced forecasts based on both ETS and machine learning approaches.
The second and third places were based on combinations of nine and seven different forecast models, respectively. Recall that the second ranked approach was FFORMA, which is based on FFORMS. All these approaches are time-consuming, as forecasts from all candidate models must be computed.
The FFORMS results are based on forecasts from a single selected model.
According to Table 3, FFORMS yields relatively accurate forecasts, comparable to the benchmarks and the top three methods of the M4-competition. The purpose of the FFORMS framework is not to introduce an algorithm that beats all benchmark approaches, but to introduce an algorithm that performs reasonably well across a wide range of time series, and which is relatively fast to implement. There is usually a trade-off when forecasting between accuracy and computation time, and FFORMS is intended to trade off some accuracy for large gains in computational efficiency. It gives results that are relatively similar to the best methods in the M4-competition, while being orders of magnitude faster to apply in the online phase.

Computation times
The running times for fitting models are reported in Table 4. The reported values are based on 100 randomly selected time series from each frequency category of the M4-competition data.
The reported results are median times along with IQRs, based on 100 replicates using 24-core Xeon-E5-2680-v3@2.50GHz servers. The microbenchmark R package (Mersmann 2021) was used for the calculations.
The times to produce the forecasts that were in the top few places of the M4-competition are not shown, as these forecasts were provided to us in the M4 results. However, the times are reported in Makridakis, Spiliotis & Assimakopoulos (2020), albeit for a different computing architecture, and FFORMS is much faster than any of the competitor methods shown in Table 3.

Peeking inside FFORMS
Apart from providing a fast forecasting method that has relatively good performance, we also wish to better understand what methods work best for different types of time series.
To address this problem, we present the results for yearly series (representing non-seasonal time series), monthly series (representing time series with a single seasonal component), and hourly series (representing series with multiple seasonal components). The results for quarterly and weekly data were very similar to the corresponding results for monthly data, while the results for daily data were very similar to the results for hourly data. Hence, these are omitted.

Visualization of the similarity between forecast models
In FFORMS, the forecast-model selection problem is posed as a supervised learning task. Each time series in the meta-data set is represented as a vector of features, and we compute the probability of each model being the "best" forecast model. Sometimes it is clear that one model is better than the others, and it takes a probability close to one. But for other series, there might be two almost equally good models, and these would then take similar probabilities. Because we are interested only in obtaining good forecast accuracy, and not in identifying the "best" model, this is of no concern. These probabilities are obtained from the vote matrix, which contains one row for each time series and one column for each model (or class label). Each cell contains the fraction of trees in the forest that classified each time series to each class. We use a vote matrix calculated based on out-of-bag observations. We can study these vote-matrix probabilities for all time series in order to explore FFORMS's understanding of the similarities between forecast models.
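A vote matrix of this kind can be sketched with scikit-learn on toy data. For brevity this uses in-bag rather than out-of-bag predictions (scikit-learn exposes OOB estimates via oob_decision_function_ when oob_score=True); the two-class labels are hypothetical stand-ins for the forecast-model classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))         # features of 200 toy series
y = (X[:, 0] > 0).astype(int)         # two toy "model" labels

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Vote matrix: fraction of trees voting each class, for each series.
tree_votes = np.stack([t.predict(X) for t in rf.estimators_])   # trees x series
vote_matrix = np.stack([(tree_votes == c).mean(axis=0)
                        for c in rf.classes_], axis=1)          # series x classes
```

Each row sums to one, and rows with two sizeable entries correspond to series for which two models are almost equally good.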
Figure 5 shows heatmaps (Schep & Kummerfeld 2017) of the vote matrices, with rows and columns reordered and clustered using a hierarchical clustering approach. Each cell in the plots displays the number of votes received by the time series in that row for the candidate forecast model in that column. We can see that the columns are clustered according to the similarities of the forecasting models. The results are summarized in Table 5, with an interpretation of each cluster given in bold text. For the yearly results, the models are clustered according to their ability to handle trends. For monthly and hourly series, the models are clustered according to the types of seasonal and trend components they can handle. This suggests that our meta-learner has correctly identified some similarities between model classes.
Some more subtle insights also emerge: 1. Regardless of the true class label (corresponding to the model that gives the best forecasts on the test set), the random walk with drift has a non-zero vote probability for all yearly data. That is, for each yearly series, at least one tree in the random forest voted for random walk with drift.
2. For monthly data, SARIMA, tbats, theta, nn, rwd and stlar have all been assigned a very high vote probability for at least some series.
3. Those monthly series with a high vote probability for SARIMA also have a high vote probability for ETS models with seasonal components. Similarly, those series with a high vote probability for (non-seasonal) ARIMA models also have a high probability for non-seasonal ETS models.
4. Compared to yearly and monthly series, for hourly series the best forecasting model has been selected with very high probability.
5. For hourly series, low vote probabilities have been assigned to random walk, random walk with drift, theta and white noise, while neural networks have been assigned a non-zero vote probability for all hourly series.
We also created an interactive dashboard allowing additional visualizations, such as: i) the distribution of true class labels by row clusters of the vote matrix; ii) the feature distribution by clusters; and iii) the distribution of clusters in the feature space (a principal component analysis is used to map each time series to a point in a two-dimensional instance space given by features).

Which features are the most important and where are they important?
The importance of a feature can be measured using individual conditional expectation (ICE) scores (Friedman, Popescu, et al. 2008; Goldstein et al. 2015), which measure the effect of a change in the value of a single feature of a single time series on the output of the algorithm. When averaged across series, these give "partial dependencies". The partial dependencies are often plotted against the feature values to give partial dependency curves (Greenwell, Boehmke & McCarthy 2018).
The partial-dependency score measures the "flatness" of the partial dependence curve for each feature. The flatness of the curve is measured using the standard deviation of the partial dependencies for different values of the feature. A feature with a low partial-dependency score either has little influence on the response variable or interacts with other features.
An ICE score is similar, but is based on measuring the "flatness" of the individual ICE curves rather than the average ICE curve. The ICE score is the average of the standard deviations of the individual ICE curves. A higher value indicates a higher degree of interactivity with other features.
For each feature, an overall rank is calculated based on the mean ranking of three measures of variable importance: i) permutation-based variable importance; ii) the partial-dependency score; and iii) the individual conditional expectation scores.
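Both flatness-based scores can be sketched in Python with a toy model in which only the first feature matters; the predict function is a hypothetical stand-in for the random forest's class probability, not the paper's meta-learner.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """ICE: for each instance, vary one feature over a grid and record
    the model output; the partial dependence is the column-wise mean."""
    curves = np.empty((len(X), len(grid)))
    for j, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, feature] = v
        curves[:, j] = predict(Xv)
    return curves

# Toy "classifier score": feature 0 matters, feature 1 does not.
predict = lambda X: 1 / (1 + np.exp(-2 * X[:, 0]))

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
grid = np.linspace(-2, 2, 21)

scores = {}
for f in (0, 1):
    curves = ice_curves(predict, X, f, grid)
    scores[f] = {"pd": curves.mean(axis=0).std(),    # flatness of PD curve
                 "ice": curves.std(axis=1).mean()}   # mean flatness of ICE curves
```

The influential feature produces visibly non-flat curves under both scores, while the irrelevant one yields scores of (essentially) zero.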
Figure 6 shows the five most important features (represented by a coloured cell) for each model class across yearly, monthly and hourly series. Note that some columns contain more than five coloured cells for each frequency category, as multiple features were ranked equally. Overall, the most important features are: strength of trend, sum of the first five partial autocorrelation coefficients, stability, the first ACF value of the differenced series, and linearity.
For yearly data, the Phillips-Perron unit root test statistic is also important. Features related to different types of trend (beta and curvature) are also important in selecting between ETS models.
For monthly data, the length of the time series (T) is assigned a high importance. For all seasonal time series (both monthly and hourly), the strength of seasonality has been selected among the top features across all classes.
For hourly data, the strength of daily seasonality (measured by seasonal_d) appears to be more important than the strength of weekly seasonality (measured by seasonal_w). Entropy and the ACF and PACF features related to seasonal lags and seasonal differencing also rank among the top five for some model classes.

When and how are features linked to the forecast outcome?
Partial-dependence curves can be used to visualise the relationship between each feature and the choice of forecasting model (Section 5). These show the probability of selecting each model as a function of the given feature, while accounting for the average effect of the other features in the model. We can compute 95% confidence intervals for these partial dependency estimates using the ICE curves.
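A sketch of such a partial dependence curve with ICE-based 95% intervals, using a hypothetical selection-probability model; the interval construction here (a normal approximation on the mean of the ICE curves) is an illustrative assumption, not necessarily the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
# Toy probability of selecting a seasonal model, driven by feature 0.
predict = lambda X: 1 / (1 + np.exp(-(3 * X[:, 0] + 0.3 * X[:, 1])))

grid = np.linspace(-2, 2, 25)
ice = np.empty((len(X), len(grid)))
for j, v in enumerate(grid):
    Xv = X.copy()
    Xv[:, 0] = v                      # vary feature 0 over the grid
    ice[:, j] = predict(Xv)

pd_curve = ice.mean(axis=0)                       # partial dependence
se = ice.std(axis=0, ddof=1) / np.sqrt(len(X))    # standard error from ICE
lower, upper = pd_curve - 1.96 * se, pd_curve + 1.96 * se
```

Plotting pd_curve with the (lower, upper) band against the grid reproduces the kind of display shown in Figures 7 to 12.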
For yearly series, the Phillips-Perron test statistic is the most important feature across many classes. Figure 7 shows that non-stationary ARIMA models and ETS models with trend are selected for high values of the statistic.
The first autocorrelation coefficient of the differenced series (diff1y_acf1) is useful in determining the difference stationarity of a time series. This feature is ranked among the top five within several classes of yearly series. The associated partial-dependence plots are in Figure 8, showing that small absolute values of the feature increase the probability of selecting a random walk with drift model, while larger absolute values increase the probability of selecting an ARIMA model.
For monthly series, the strength of seasonality is the most important feature across all classes, and Figure 9 shows that the probability of selecting a seasonal forecast model increases as the strength of seasonality increases, while the reverse occurs for non-seasonal models.
The length of the time series is also assigned a high importance within many classes of monthly series. Figure 10 reveals that longer time series lead to the selection of highly parameterized models or models involving seasonal components, while shorter series lead to simple models being selected.
The partial dependency plots of the strength of seasonality for hourly series, in Figure 11, show that the probability of selecting simple non-seasonal models, such as a random walk, random walk with drift, theta or white noise process, decreases as the strength of daily seasonality increases. Weekly seasonality is less important in selecting a model than daily seasonality, probably because daily seasonality tends to be stronger for these series.
For monthly series, stability is ranked among the top five features within many classes, whereas it does not appear in the top five features within any of the classes for yearly series. Further, it is ranked as important only within random walk and random walk with drift for hourly series.
Figure 12 shows the partial dependency plots for stability, where we can see that the probability of selecting white noise models increases as stability increases for monthly series, while for hourly series the probability of selecting random walk with drift and tbats models shows a non-monotonic structure.
The sum of the first five partial autocorrelation coefficients, strength of trend and linearity are the most commonly selected features within many classes across all three frequency categories.
The partial dependency plots for the sum of the first five partial autocorrelation coefficients are shown in Figure 13, showing that the relationship varies across classes as well as frequency categories.
Figure 14 shows the partial dependency curves for strength of trend. The probability of selecting a white noise, stationary ARMA or ETS model without trend components decreases as the strength of trend increases, while the opposite relationship can be observed for the other classes.
The partial dependency plot for linearity in Figure 15 shows that for yearly series, the random walk with drift and ARMA classes are highly sensitive to the value of linearity around 0. However, the partial dependency curves of linearity for the other classes are relatively flat throughout the range. This could be due to the interaction of linearity with other features. Figure 16 shows a pattern of interaction between linearity and both lmres_acf1 and diff1y_acf1 within the wn, rw, rwd, theta and nn model classes. For monthly series (Figure 17), linearity shows interaction with the ACF value at the first seasonal lag of the seasonally-differenced series, stability, the first ACF value of the remainder series, and a parameter estimate of the ETS(A,A,A) model. Figure 18 shows the two-way partial dependency plot for hourly series between linearity and the ACF value at the first seasonal lag of the seasonally-differenced series. Within rw and mstlarima we can see a separation between the lower and upper halves due to the main effect of sediff_seacf1.
Figure 19 shows how stability interacts with trend, length, and the first autocorrelation coefficient of the differenced series. Although the partial dependency curve of stability for the theta class is relatively flat (Figure 12), Figure 19 shows patterns of interaction between stability and other features.

Discussion and conclusions
We have proposed a novel framework for forecast-model selection using meta-learning based on time series features. Our proposed FFORMS algorithm uses the knowledge of the past performance of candidate forecast models on a collection of time series in order to identify suitable forecast models for new series. A key advantage of FFORMS is the idea of outsourcing most of the heavy computational work to the offline phase. Therefore, unlike traditional model selection strategies or cross-validation processes, our proposed framework can be used to produce relatively accurate and fast forecasts for very large collections of time series. For real-time forecasting, our framework involves only the calculation of features, the selection of a forecast method based on the FFORMS random forest classifier, and the calculation of the forecasts from the chosen model. None of these steps involves substantial computation, and they can be easily parallelized when forecasting a large number of new time series. In doing so, the FFORMS framework fills an important gap in contemporary forecasting practice, with many available models to choose from and with forecasts being required in a short time. We are well aware that the FFORMS forecasts will not usually be the best possible forecasts given unlimited computation time; rather, they are almost as good but available in a much shorter time.
Trading off computational time for accuracy, Montero-Manso et al. (2020) and Talagala, Li & Kang (2022) build on FFORMS to introduce FFORMA and FFORMPP. We have explored the role of features in forecast model selection in the FFORMS framework. First, we explored the similarities between series learned by the random forest.
Next, we explored which features are the most important for model selection in the FFORMS framework, and where they are most important. Features such as strength of trend, strength of seasonality, linearity, information related to the partial-autocorrelation structure, features related to difference stationarity, length and entropy are the most important. Finally, partial dependency plots were used to visualise when and how these features link with the prediction outcome of the FFORMS framework. This improves our understanding of the data and the relationship learned by the random forest algorithm, and increases trust in the results obtained from the proposed framework. Our method is sensitive to the lengths of the time series because we employed length as an input feature. We found that the algorithm chooses simple models (such as random walks or white noise) for shorter time series, while heavily parameterized models are more likely to be chosen for long time series. PDP curves of length revealed that there is a threshold beyond which the probability of choosing various models in response to training set length starts to plateau.
Hyperparameter tuning, feature selection, and feature engineering are essential tasks in machine learning that can have a significant effect on the performance of the machine learning algorithm.
In this study, we have selected features that are known to have been helpful in other studies. The features and models we have used in our analyses may be varied for other applications. Features can be added or removed as required, and new models can be introduced that are appropriate to the application and types of time series being forecast.

The features linearity and curvature depend on the scale of the time series. Therefore, the time series are scaled to mean zero and variance one before these two features are computed.
The spikiness feature is useful when a time series is affected by occasional outliers. Hyndman, Wang & Laptev (2015) introduced an index to measure spikiness, computed as the variance of the leave-one-out variances of the remainder series R_t. We also compute the first autocorrelation coefficient of the remainder series R_t.
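A minimal sketch of the leave-one-out computation (our own illustrative implementation, not the feature code used in the paper):

```python
import numpy as np

def spikiness(remainder):
    """Variance of the leave-one-out variances of the remainder series R_t."""
    r = np.asarray(remainder, dtype=float)
    n = len(r)
    # Variance of the series with observation i removed, for each i.
    loo_vars = np.array([np.var(np.delete(r, i), ddof=1) for i in range(n)])
    return np.var(loo_vars, ddof=1)
```

A remainder series with a single large outlier yields a much larger spikiness value than a constant or smooth remainder, because deleting the outlier changes the variance drastically while deleting any other point barely moves it.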

Stability and lumpiness
The features "stability" and "lumpiness" are calculated based on tiled windows (i.e., windows that do not overlap). For each window, the sample mean and variance are calculated. The stability feature is the variance of the means, while lumpiness is the variance of the variances. These features were first used by Hyndman, Wang & Laptev (2015).
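These two features can be sketched as follows; the window width of 10 is an arbitrary illustrative choice:

```python
import numpy as np

def stability_lumpiness(x, width=10):
    """Stability and lumpiness from non-overlapping (tiled) windows,
    following the definitions of Hyndman, Wang & Laptev (2015)."""
    x = np.asarray(x, dtype=float)
    nwin = len(x) // width
    tiles = x[: nwin * width].reshape(nwin, width)   # tiled windows
    means = tiles.mean(axis=1)
    variances = tiles.var(axis=1, ddof=1)
    stability = np.var(means, ddof=1)      # variance of the window means
    lumpiness = np.var(variances, ddof=1)  # variance of the window variances
    return stability, lumpiness
```

A series whose level shifts but whose within-window noise is constant has high stability but zero lumpiness, and vice versa.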

Spectral entropy of a time series
Spectral entropy is based on information theory, and can be used as a measure of the forecastability of a time series. Series that are easy to forecast should have a small spectral entropy value, while very noisy series will have a large spectral entropy. We use the measure introduced by Goerg (2013) to estimate the spectral entropy: the Shannon entropy of the normalized spectral density, given by H_s(y_t) = −∫_{−π}^{π} f̂_y(λ) log f̂_y(λ) dλ, where f̂_y denotes the estimate of the spectral density introduced by Nuttall & Carter (1982). The R package ForeCA (Goerg 2020) was used to compute this measure.
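An illustrative approximation in Python, using a plain periodogram in place of the Nuttall & Carter spectral estimator that ForeCA uses:

```python
import numpy as np

def spectral_entropy(x):
    """Shannon entropy of the normalized spectral density, scaled to [0, 1].
    A simple periodogram estimator is used here for illustration."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    spec = np.abs(np.fft.rfft(x))[1:] ** 2   # periodogram, zero frequency dropped
    p = spec / spec.sum()                    # normalized spectral density
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(len(spec))
```

A pure sinusoid concentrates its spectrum at one frequency and so has entropy near 0 (easy to forecast), while white noise spreads power across all frequencies and has entropy near 1.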

Hurst exponent
The Hurst exponent measures the long-term memory of a time series. The Hurst exponent is given by H = d + 0.5, where d is the fractional differencing parameter obtained by estimating an ARFIMA(0, d, 0) model. We compute this using the maximum likelihood method (Haslett & Raftery 1989) as implemented in the fracdiff package (Fraley 2020). This measure was also used in Wang, Smith-Miles & Hyndman (2009).

Nonlinearity
To measure the degree of nonlinearity of the time series, we use the statistic computed in Teräsvirta's neural network test for nonlinearity (Teräsvirta, Lin & Granger 1993), also used in Wang, Smith-Miles & Hyndman (2009). This takes large values when the series is nonlinear, and values around 0 when the series is linear.

Parameter estimates of an ETS model
The ETS(A,A,N) model (Hyndman et al. 2008) produces forecasts equivalent to Holt's linear trend method, and can be expressed as

y_t = l_{t−1} + b_{t−1} + ε_t,
l_t = l_{t−1} + b_{t−1} + αε_t,
b_t = b_{t−1} + βε_t,

where α is the smoothing parameter for the level, and β is the smoothing parameter for the trend. We include the parameter estimates of both α and β in our feature set for yearly time series. These indicate the variability in the level and slope of the time series.
The ETS(A,A,A) model (Hyndman et al. 2008) produces forecasts equivalent to Holt-Winters' additive method, and can be expressed as

y_t = l_{t−1} + b_{t−1} + s_{t−m} + ε_t,
l_t = l_{t−1} + b_{t−1} + αε_t,
b_t = b_{t−1} + βε_t,
s_t = s_{t−m} + γε_t,

where γ is the smoothing parameter for the seasonal component, m is the seasonal period, and the other parameters are as above. We include the parameter estimates of α, β and γ in our feature set for monthly and quarterly time series. The value of γ provides a measure of the variability of the seasonality of a time series.

Unit root test statistics
The Phillips-Perron test is based on the regression y_t = c + α y_{t−1} + ε_t. The feature is the usual "Z-alpha" statistic, with the Bartlett window parameter set to the integer part of 4(T/100)^{1/4} (Pfaff 2008). This is the default value returned by the ur.pp() function in the urca package (Pfaff, Zivot & Stigler 2016).
The KPSS test is based on the regression y_t = c + δt + α y_{t−1} + ε_t. The feature is the usual KPSS statistic, with the Bartlett window parameter set to the integer part of 4(T/100)^{1/4} (Pfaff 2008).
Figure 1 (left panel) shows the time-domain representation of six time series taken from the M3 competition (Makridakis & Hibon 2000), while Figure 1 (right panel) shows a feature-based representation of the same time series. Here only two features are considered: the strength of seasonality and the strength of trend, calculated based on the measures introduced by Wang, Smith-Miles & Hyndman (2009).

Figure 1: Time-domain representation of time series (left) and feature-based representation of time series (right).

Figure 2: Rice's framework for the Algorithm Selection Problem.
Our proposed FFORMS framework, presented in Figure 3, builds on this preceding research. The offline and online phases are shown in blue and red respectively. A classification algorithm (the meta-learner) is trained during the offline phase and is then used to select an appropriate forecast model for a new time series in the online phase. We use machine learning interpretability tools to gain insights into what is happening under the hood of the FFORMS framework.

Figure 3: FFORMS (Feature-based FORecast-Model Selection) framework. The offline phase is shown in blue, the online phase in red.

Figure 4: Distribution of lengths of the series in the reference set and new series collections.
To determine the class label for a new instance, features are calculated and passed down the trees. Each tree then gives a prediction, and the majority vote over all individual trees leads to the final decision. The RF algorithm is highly sensitive to class imbalance (Breiman 2001), and our reference set is unbalanced: some classes contain significantly more cases than other classes. The degree of class imbalance is reduced to some extent by augmenting the observed sample with the simulated time series. We consider three approaches to address the class imbalance in the data: (1) incorporating class priors into the RF classifier; (2) using the balanced RF algorithm introduced by Chen, Liaw & Breiman (2004); and (3) re-balancing the reference set with down-sampling. In down-sampling, the size of the reference set is reduced by down-sampling the larger classes so that they match the smallest class in size; this potentially discards some useful information. Comparing the results, the balanced RF algorithm and RF with down-sampling did not yield satisfactory results. We therefore only report the results obtained by the RF built on unbalanced data (RF-unbalanced) and the RF with class priors (RF-class priors). The RF algorithms are implemented in the randomForest R package (Liaw & Wiener 2002; Breiman et al. 2018). The class priors, which represent the likelihood that a given class will appear in the bootstrap samples, are introduced through the option classwt. We use the reciprocal of class size as the class priors. The number of trees ntree is set to 1000, and the number of randomly selected features k is set to one third of the total number of features available.
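A rough analogue of the RF-class-priors configuration can be sketched with scikit-learn, whose class_weight option reweights classes in a similar spirit to randomForest's classwt priors; the data here are synthetic placeholders, not our reference set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical unbalanced reference set: 25 features per series, 5 class labels.
X = rng.normal(size=(500, 25))
y = rng.choice(5, size=500, p=[0.5, 0.25, 0.13, 0.07, 0.05])

# Reciprocal-of-class-size weights play the role of the classwt priors.
counts = np.bincount(y, minlength=5)
priors = {c: 1.0 / n for c, n in enumerate(counts)}

rf = RandomForestClassifier(
    n_estimators=1000,      # ntree = 1000, as in the paper
    max_features=25 // 3,   # k = one third of the available features
    class_weight=priors,
    random_state=0,
).fit(X, y)

# Majority vote over trees: predict_proba returns the vote proportions.
votes = rf.predict_proba(X[:1])
```

Note that scikit-learn applies class weights when computing splits rather than as sampling priors, so this mirrors the intent rather than the exact randomForest mechanism.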

Figure 5: Heatmaps of the vote matrices for yearly, monthly and hourly series. Each cell shows the number of votes received by the time series in that row for the forecast model in that column.

Figure 6: The top five features for each frequency category and model. The cell colours denote the frequency group (yearly, monthly, hourly), while features are in rows and forecast methods in columns.

Figure 7: Partial dependence plots for the Phillips-Perron unit root test statistic, along with 95% confidence intervals. All classes show a turning point in the relationship around zero.

Figure 8: Partial dependence plots for diff1y_acf1 for yearly series, along with 95% confidence intervals.

Figure 9: Partial dependence plots for strength of seasonality for monthly series, along with 95% confidence intervals. The probability of selecting forecast models with seasonal components increases as seasonality increases.

Figure 10: Partial dependence plots for length of time series (T), along with 95% confidence intervals. The probability of selecting rw and rwd decreases for series with T > 150.

Figure 11: Partial dependence plots for strength of seasonality for hourly series, along with 95% confidence intervals.
Figure 12: Partial dependence plots for stability for monthly and hourly series.
Figure 13: Partial dependence plots for the sum of the first five partial autocorrelation coefficients.

Figure 14: Partial dependence plots for trend for yearly and monthly series, along with 95% confidence intervals. The probability of selecting ETS models without a trend component and stationary models (WN and ARMA) decreases for extremely high values of trend.

Figure 15: Partial dependence plots for linearity for yearly, monthly and hourly series, along with 95% confidence intervals.

Figure 16: Two-way partial dependence plots for linearity for yearly series.
Figure 17: Two-way partial dependence plots for linearity for monthly series.

Figure 18: Two-way partial dependence plots for linearity for hourly series.

Figure 19: Two-way partial dependence plots for stability for monthly series within the theta class.
This paper has made a first step towards providing a comprehensive analysis of the relationship between time series features and forecast model selection using machine learning interpretability techniques, and we have used default settings for the various hyperparameters. It is expected that further improvements in the algorithm could be achieved by carefully engineering new features, by removing some of the features we have included, and by tuning the hyperparameters of the random forest classifier. Several machine learning-based forecasting models, such as NNs, have a stochastic estimation procedure, which leads to different results for the same input data in different training sessions, and this can affect the choice of best model. In such cases, one could fit many such models with different random seeds and use the average of the forecast error measure to select the best forecasting model when labelling time series. Future research in this area could consider how local interpretable model-agnostic explanations can be used to identify regions of data where features contribute most to the classification of a specific instance.

Forecast-model selection problem: for a given time series x ∈ P, with features f(x) ∈ F, find the selection mapping S(f(x)) into the algorithm space A, such that the selected algorithm α ∈ A minimizes the forecast accuracy error metric y(α(x)) ∈ Y on the test set of the time series.

Table 1: Description of time series features used in FFORMS.

Table 2: Description of class labels used in FFORMS.

Table 3: The performance of FFORMS on the M4-competition data, based on point forecasts (MASE) and prediction intervals (MSIS). Note: bold signifies the best performing method.

Table 4: Computational time for producing forecasts based on 100 randomly selected series from each frequency category of the M4-competition data set.

Table 5: Summary of column clusters displayed in the vote matrices of Figure 5.