Coherent mortality forecasting by the weighted multilevel functional principal component approach

ABSTRACT In human mortality modelling, if a population consists of several subpopulations it can be desirable to model their mortality rates simultaneously while taking into account the heterogeneity among them. The mortality forecasting methods tend to result in divergent forecasts for subpopulations when independence is assumed. However, under closely related social, economic and biological backgrounds, mortality patterns of these subpopulations are expected to be non-divergent in the future. In this article, we propose a new method for coherent modelling and forecasting of mortality rates for multiple subpopulations, in the sense of nondivergent life expectancy among subpopulations. The mortality rates of subpopulations are treated as multilevel functional data and a weighted multilevel functional principal component (wMFPCA) approach is proposed to model and forecast them. The proposed model is applied to sex-specific data for nine developed countries, and the results show that, in terms of overall forecasting accuracy, the model outperforms the independent model and the Product-Ratio model as well as the unweighted multilevel functional principal component approach.


Introduction
As governments, insurance companies and pension funds are obliged to pay billions of pensions and annuities, they are heavily exposed to longevity risk due to the increasing life expectancy of pensioners and policy holders. Hence it is of great importance for them to estimate people's future life expectancy as accurately as possible. In the past decades, there have been many statistical models developed for this purpose. In 1992, the pioneering Lee-Carter model was introduced as an extrapolative method for forecasting mortality [21], which has been followed by a series of extensions and modifications, including Lee and Miller [22], Booth et al [3], Renshaw and Haberman [32], among others. Hyndman and Ullah [17] further extends the methods to a functional data framework which incorporates nonparametric smoothing, functional principal component decomposition and times series analysis into their model. However, all of these works deal with forecasting mortality for a single population. When applied to multiple populations independently, they tend to result in divergent forecasts of life expectancy in the long run, even though those populations may share extensive similarities in socioeconomic, environmental and biological conditions. As a result, it is often desirable that the forecasts do not diverge when modelling the mortality rates simultaneously for multiple correlated subpopulations, such as different sexes or states in a country, or several countries within a larger region (for instance Europe). As such, one of the challenges in human mortality modelling is to model multiple subpopulations simultaneously (while taking into account the heterogeneity among them) which results in nondivergent forecasts when the subpopulations are highly correlated.
Such nondivergent forecasts are termed as coherent forecasts in the literature (see for example, Li and Lee [27]). Li and Lee [27] extends the Lee-Carter model to a group of populations for coherent forecasting by proposing an additional common factor between the multiple populations and projecting the population period effects as stationary time series. Some variants of Li and Lee [27] model have also been developed, such as Li [23]. Oeppen [30] applies the methods of compositional data analysis to transform the Lee-Carter model and then extends it to a multiple-decrement life table. Cairns et al [5] proposes an Age-Period-Cohort model for two related populations which incorporates a meanreverting stochastic spread that allows for different trends in mortality improvement rates in the short-run but parallel improvements in the long run. Hyndman et al [14] develops the product-ratio method which achieves coherent mortality forecasting by the convergence of the ratios of the forecast age-specific death rates from any two subpopulations to appropriate constants. Their numerical examples for Swedish sex-specific mortality data and Australian state-specific mortality data show that the forecasting accuracy is homogenized across subpopulations while the life expectancy in the long run presents convergence. Enchev et al [8] reviews and compares some multi-population mortality models, including some variations of the Li and Lee [27] model and the common-age-effect (CAE) model of Kleinow [20]. Other developments in this area include Jarner and Kryger [18], Li and Hardy [25], Hatzopoulos and Haberman [12], Antonio et al [1], Li et al [26], Wan and Bertschi [34], Shang [33], and Li et al [24].
In this paper, we propose a new method, namely weighted multilevel functional principal component analysis (wMFPCA), for coherent mortality forecasting, in the sense that the life expectancy in different populations does not diverge in the long run. We treat the mortality rates of subpopulations within a large population as a set of multilevel functional data (e.g. male and female mortality rates within a country), and use wMFPCA to extract core information from the functional data and analyse them at multilevel scale, so that the model incorporates both overall information from the population as a whole and specific information from the subpopulations. Comparing to the Product-Ratio model [14], the proposed model possesses several major differences in nature which will be discussed later in the paper. The usefulness of the proposed model is demonstrated by the application to the sex-specific mortality data for nine developed countries.
The rest of the paper is organized as follows. In Section 2, we introduce the model and the framework of wMFPCA in details. In Section 3, we demonstrate the performance of the model in terms of long-run forecasting convergence in the context of sex-specific mortality data for the UK. The model is also applied to the sex-specific mortality data of nine developed countries and the performance is compared with the independent model [17] as well as the Product-Ratio model [14]. The paper is concluded in Section 4 with some discussions.

A review of the functional principal component analysis (FPCA)
Functional principal component analysis (FPCA) is established as a statistical methodology to analyse functional data. It is a natural extension of the multivariate PCA, allowing the data dimension to increase from finite space to infinite space. The core of this technique is based on the Karhunen-Loève (KL) expansion of a stochastic process [19,28]. Since then, there have been many theoretical developments of FPCA mainly in two streams: the linear operator point of view and the kernel operator point of view [2,4,10,31,35]. The ultimate goal of FPCA is to reduce the infinite dimension of functional data into finite dimensions in principal directions of variation.
Let Y(x) be a squared integrable stochastic process defined on a set T (time interval for instance) with finite variance T E[Y(x) 2 ]dx < ∞. Let μ(x) be its mean function and K(x, x ) be its covariance function. By Karhunen-Loève (KL) expansion, under certain continuity conditions Y(x) can be expressed as: where, {φ k } k≥1 is a set of orthonormal eigenfunctions such that with λ 1 ≥ λ 2 ≥ · · · ≥ 0 being the corresponding eigenvalues, and β k = T (Y(x) − μ(x)) φ k (x)dx are uncorrelated random variables with mean zero and variance λ k . These random variables are the projection of the centred stochastic process Y(x) − μ(x) in the direction of the k-th eigenfunction φ k and are called principal component scores. In practice, only the first few eigenfunctions are needed to represent the important features of a sample of random functions. Therefore we can truncate the expansion at the first N terms to obtain an approximation of Y(x):

Multilevel FPCA
In practice it is often the case that a set of functional data comprise a number of subsets with strong correlation so that the functional data have a multilevel structure, such as the mortality rates for males and females in a country. The conventional FPCA is not suitable for this type of functional data since it ignores the heterogeneity among subgroups. To address the challenges, Di et al [7] proposes a multilevel FPCA (MFPCA) which combines FPCA and multilevel mixed effect models, Crainiceani et al [6] introduces generalized multilevel functional linear regression models, Zipunnikov et al [36] considers multilevel functional principal component analysis for high-dimensional data, Goldsmith et al [9] studies generalized multilevel functional-on-scalar regression, and Meyer et al [29] develops Bayesian function-on-function regression for multilevel functional data.
Let Y i,j (x) denote a random function on [0, 1] for subject i and group j, i = 1, 2, . . . , I, j = 1, 2, . . . , J. Consider a two-way functional ANOVA model where μ(x) is the overall mean function, η j (x) is the group-specific mean shift from the overall mean, Z i (x) is the subject-specific deviation from the group-specific mean, and W i,j (x) is the residual subject and group specific deviation from the subject-specific mean. In such a model, μ(x) and η j (x) are fixed functions while Z i (x) and W i,j (x) are zero-mean stochastic processes. Z i (x), termed as the level-one functions, and W i,j (x), the level-two functions, are assumed to be uncorrelated. Both the level-one and level-two functions can then be decomposed using the Karhunen-Loève (KL) expansion as follows: where φ (1) k (x) and β i,k are the level-one eigenfunctions and the corresponding principal component scores with mean 0 and variance λ (1) k , and φ (2) l (x) and γ i,j,l are the level-two eigenfunctions and the scores with mean 0 and variance λ (2) l . It follows that {β i,k : k = 1, 2, . . .} are uncorrelated with {γ i,j,l : l = 1, 2, . . .}. By these expansions, the model can then be expressed as To obtain the eigenfunctions (or the principal components), we first estimate the covariance functions.
} the covariance function between the level-two units within the same level-one unit. And define (2) l φ (2) l (x s ) φ (2) l (x r ).
In practice, each function Y i,j (x) is observed at a set of grid points. We assume that a common grid {x 1 , . . . , x p } is used for every subject and group. Then, μ( and K B (x s , x r ) can be estimated as follows: , it may not always be positive definite. This issue can be solved by trimming the pairs of eigenvalue and eigenfunction where the eigenvalue is negative [11]. Therefore the eigenfunctions can be estimated by the following steps: Step 1. Estimate the mean and covariance functionsμ( Step l (x); then trim those pairs of eigenvalues and eigenfunctions where the eigenvalue is negative.
Step 4. Estimate the principal component scores for both levels.
The technical details for estimating the principal component scores for both levels are provided in Appendix A. In practice determining the number of eigenfunctions in the approximation is always important in the context of principal component analysis. One method is to use a cross validation procedure. But a relatively simple solution is to estimate the number of components based on the proportion of variance explained. In the MFPCA model, the cumulative percentage of the total variation explained by the first N 1 components at level-one is given by the ratio N 1 k=1 λ (1) k / ∞ k=1 λ (1) k . Similarly, the cumulative percentage of the total variation explained by the first N 2 components at level-two is given by the ratio N 2 l=1 λ (2) l / ∞ l=1 λ (2) l . Therefore the number of eigenfunctions can be determined given a threshold for the cumulative percentage of the total variation explained by the first N 1 (N 2 ) components at level-one (level-two). And the proportion of the variation explained by the level-one and level-two against the overall variation of the entire data set can be calculated as ∞ k=1 respectively. In reality, functional data are usually observed with error. Smoothing is therefore needed before the analysis is carried out. In the literature on FPCA, there exist three methods commonly used for smoothing: smoothing the data before applying FPCA [31], introducing a penalty term [31], and smoothing the covariance function [35]. In our later application in mortality modelling, we will adopt the first method to smooth the raw data first before the MFPCA is conducted. This will be discussed again in next section.

Weighted MFPCA for coherent mortality forecasting
In this section we propose a weighted MFPCA (wMFPCA) approach for coherent forecast of the future mortality for several subpopulations within a large population. Let y t,j (x) denote the log of the death rate in the jth subpopulation in year t for age x (t = 1, . . . , n, j = 1, . . . , m). And we assume that, for each t and j, there exists an underlying smooth function f t,j (x) that we are observing with error at discrete points of x. Suppose we have observed where e t,j,i is i.i.d standard normal random variables and σ t,j (x i ) allows the amount of noise to vary with age x.
In demographic modelling, it is often the case that more recent data tend to have more impact on the results than those in the distant past. Hence, we propose to incorporate the concept of weights into the MFPCA model to allow recent mortality data to have more significant effect on the forecasting results, as discussed below.
Letf t,j (x) be the smoothed function from y t,j (x), and w t = κ(1 − κ) n−t be a geometrically decaying weight with 0 < κ < 1. Then we estimate the overall mean function μ(x) using a weighted average:μ and estimate the subpopulation-specific shifts from the overall mean, η j (x), bŷ The mean-adjusted functional data are denoted asf * t,j (x) =f t,j (x) −μ(x) −η j (x). Then the weights are incorporated into the mean-adjusted dataf * t,j (x) in order to calculate the weighted functional principal components. Similar to Hyndman and Ullah [17], the weighted multilevel functional principal componentsφ (1) k (x) andφ Based on the weighted mean-adjusted matrix F * , the covariance functionsK T (x s , x r ) andK B (x s , x r ) can be estimated using the methods discussed in the previous subsection x r ) is calculated accordingly. The weighted multilevel functional principal componentsφ (1) k (x) andφ (2) l (x) can then be obtained by decompos-ingK B (x s , x r ) andK W (x s , x r ). Finally, the principal component scores at the two levels can be estimated using the methods described in Appendix A. Therefore, the entire weighted MFPCA model for coherent mortality forecasting of the subpopulations is given as: is the mean of the subpopulation j. Forecasts can then be achieved by extrapolating the level-one scores {β t,1 , . . . , β t,N 1 } and the level-two scores {γ t,j,1 , . . . , γ t,j,N 2 }, using time series models. Since the level-one scores are independent we assume independent and possibly non-stationary autoregressive integrated moving average (ARIMA) models for β t,1 , . . . , β t,N 1 . As for the level-two scores γ t,j,l , it is noted that, for the same order l (l = 1, . . . , N 2 ), the set of scores for different subpopulations {γ t,1,l , . . . , γ t,m,l } share the same basis φ (2) l (x), which implies that the scores {γ t,1,l , . . . , γ t,m,l } are not independent and hence multivariate autoregressive moving average (ARMA) models with stationary restriction are used for the level-two scores. The stationary constraint is to ensure the coherence in the mortality forecasting.
Letβ (n+h),k denote the h-step ahead forecast of β (n+h),k andγ (n+h),j,l denote the h-step ahead forecast of γ (n+h),j,l . Then the h-step ahead forecast of y (n+h),j (x) is obtained as: l (x).
Due to the fact that each component in the model is uncorrelated with each other, the forecasting variance can be obtained by adding up the variance of each component: var where,σ 2 μ j (x) denotes the variance of the smoothed estimates of the means and can be estimated from the smoothing method used; u (n+h),k = var{β (n+h),k } and v (n+h),j,l = var{γ (n+h),j,l } can be obtained from the time series models; and the estimated variance of the observational errors { σ (n+h),j (x)} 2 can be obtained from the historical data [17]. Thus a 100(1 − α)% approximate prediction interval can be constructed asŷ (n+h),j (x) ± z α var{y (n+h),j (x)}, where z α is the (1 − α/2) quantile of the standard normal distribution.
Note that the weights {w t } are controlled by the tuning parameter κ. The larger κ is, the faster the weights for the past years are decaying over time. Therefore, κ represents people's perception on how past data should be weighted. The value of the parameter κ can be specified a priori or determined empirically by minimizing the average root mean square forecast error (RMSFE):

Application
In this section we evaluate our weighted MFPCA model by applying it to some sex-specific mortality data. We first consider the coherent forecasting of the sex-specific mortality rates in the UK to test if the forecasting result appears to be non-divergent for the male and female life expectancy. We then compare the forecasting accuracy of our model with those of the unweighted MFPCA and the Product-Ratio model [14] as well as the independent model [17] using the sex-specific mortality rates of nine different countries.

Coherent forecasting for the male and female mortality in the UK
The sex-specific mortality data in the UK are obtained from the Human Mortality Database [13]. The data provide the observed mortality rates for every one year at each age. We select the years from 1950 to 2010 and aggregate the data for the ages 100 and above to avoid the anomalous mortality rates during the first and second world wars and the erratic rates above 100. The mortality curves are first smoothed using the weighted penalized regression splines with a monotonicity constraint [17]. The smoothed curves are shown in Figure 1.
The weighted MFPCA model is then fitted to the data. The weight parameter κ is determined by minimizing the average root mean square forecast error (RMSFE) based on a rolling window approach; see Section 3.2 for the details. The mean functions and the level-one and level-two functional principal components are estimated as discussed in the previous section. The level-one scores are modelled by univariate ARIMA time series using the R package 'forecast' [15,16]. The level-two scores are modelled by multivariate stationary ARMA time series using the R package 'MTS'. The orders of the ARIMA and ARMA models are selected based on the Akaike information criterion (AIC).
The results show that the level-one principal components explain 94% of the total variation whilst the level-two principal components do less than 6%. The first three level-one principal components account for around 97.7%, 1.6% and 0.3% of the total level-one variation respectively, and the first three level-two principal components account for around 61.8%, 17.2% and 7.0% of the total level-two variation respectively. Together, the first three principal components in both levels explain about 98.8% of the total variation. Hence three principal components in both levels are used in the approximation (N 1 = N 2 = 3); more principal components are tested in the experiments but no significant differences are found   Figure 2 displays the level-one components which are shared by both male and female subpopulations. As can be seen in the figure, the first level-one principal component models the degree of variation in mortality among different age groups, which is very large for children age groups and relatively small for twenties and old age groups. The second and third principal components present more complicated patterns and are difficult to interpret. Figure 3 displays the sex-specific deviations from the overall mean function, the leveltwo principal component scores which are specific for male and female subpopulations,  and the common level-two principal components. It can be observed that the male and female deviations from the overall mean are complementary to each other, because of the method used for estimating the mean and deviations.
As demonstration, Figure 4 shows the forecasted log mortality rates for males and females in the UK by the weighted MFPCA model for the years 2025 (15-year forecast horizon) and 2040 (30-year forecast horizon). Figure 5 shows the 30-year forecasts of the male and female cohort life expectancy at birth by our proposed model as well as the independent model [17]. It can be observed that the independent model produces a more divergent forecasting result than our coherent model does.

Comparison with other models
We now compare the forecasting accuracy of our proposed weighted MFPCA model (denoted by wMFPCA) with the unweighted (or equally weighted) MFPCA model (uMF-PCA), the Product-Ratio model (Product-Ratio) and the independent model (Independent). We first consider the UK male and female mortality data from 1950 to 2010. We adopt a rolling window approach, that is, to use the data from 1950 to 1981+t as the observations and forecast the mortality rates for years 1981+t + 1, . . . , 1981+t + 20, for t = 0, . . . , 9. Thus, ten sets of forecasts for each of 1 to 20 year horizons are obtained and compared with the actual values. For a given forecast horizon h (h = 1, . . . , 20), the outof-sample root mean square forecast error (RMSFE) for the jth subpopulation is defined as: where p is the number of age groups (p = 101 in this example). For the UK male and female mortality data, the values of the out-of-sample RMSFE are obtained using the weighted MFPCA model, the unweighted MFPCA model, the Product-Ratio model and the independent model, and are illustrated in Figure 6.  It can be observed that the wMFPCA model performs similarly to the independent model for the UK male mortality and both outperform the unweighted MFPCA model and the Product-Ratio model. On the other hand, for the UK females, the performance of wMFPCA is very close to those of the unweighted MFPCA model and the Product-Ratio model, and they all present significantly higher forecasting accuracy than the independent model. Overall, the average RMSFE (over the forecast horizons and the sex) by the weighted MFPCA model is 0.1491, compared with 0.1766 by the unweighted MFPCA, 0.1705 by the Product-Ratio model and 0.1770 by the independent model. These results show that, in overall terms, the proposed weighted MFPCA model performs considerably better than the unweighted MFPCA, the Product-Ratio model and the independent model for the UK mortality forecast.
The same experiment is further conducted to the male and female mortality data of eight other developed countries, including Australia, USA, France, Belgium, Spain, Canada, Netherlands and Italy. The values of the average RMSFE by the four models for all nine countries are presented in Table 1. It can be seen that the weighted MFPCA model provides the smallest forecasting errors for all the countries except Spain, and it is confirmed that in overall terms the weighted MFPCA model outperforms the other three models.

Conclusion and discussion
We have proposed a new model for coherent forecasting of mortality rates for several correlated subpopulations in a large population. The model is based on the multilevel functional principal component analysis (MFPCA) framework and is incorporated with weights which allow more recent data to have more significant impact on forecasting results. The mortality curves of different subpopulations are treated as a set of multilevel functional data, and the principal components and their corresponding scores are obtained at two levels and forecasts are made by extrapolating the scores using time series models. The numerical example shows that the proposed model can achieve nondivergent forecasts for the subpopulations. Using the sex-specific mortality data of nine developed countries, we have demonstrated the usefulness of the proposed model and compared the results with the unweighted MFPCA, the Product-Ratio model and the independent model.
The wMFPCA model consists of a group mean, a decomposition of the level-one functions and a decomposition of the level-two functions. The level-one functions govern the common properties of the entire population while the level-two functions involve the properties specific to the subpopulations. Therefore the model possesses a simple and explicit form which makes the modelling procedure easy and interpretable. Since the level-one functions are largely dominated by their first principal component, the age pattern of change at this level is relatively fixed. However, the governance of the level-two functions are separated into several principal components, which gives a flexible age pattern of change at level-two. In such way, the wMFPCA model provides more flexibility in the age pattern of change compared with the independent model which contains only a single level of principal components.
It is noted that, although the wMFPCA model has a similar structure to the Product-Ratio model the former has several advantages over the latter in its nature. The core of the wMFPCA method is to find the variations of multilevel functional data at different levels and then decompose them using the Karhunen-Loève expansion. There is no need to transform the data as done in the Product-Ratio method (calculating the product functions and the ratio functions) so the functional data are allowed to speak more for themselves within the wMFPCA framework. The Product-Ratio model assumes the mortality rates in the subpopulations having equal variances, which is not needed in the wMFPCA model. Moreover, within the wMFPCA framework, the percentage of variance explained by each principal component at each level can be calculated explicitly and easily. Knowing these percentages of variance explained, we are able to explicitly evaluate the importance of the principal components at different levels.
Coherence is another important issue to be discussed. In our model, coherence is ensured by applying stationary multivariate ARMA models to the level-two principal component scores. The convergence of the level-two principal component scores to certain constants under stationary ARMA models guarantees that the long-term forecasts of the level-two functions also converge to their age-specific constants. As the level-two functions converge to constants, they gradually lose ability to affect the change of mortality. The change of mortality is then entirely dominated by level-one functions. Since the level-one functions are commonly shared by different subpopulations, their impact on the changes of mortality in those subpopulations is equal. The forecasted mortality differences among subpopulations are therefore constrained, leading to a similar constraint on the forecasted life expectancy as well.
for multilevel functional data involves extra complication because the two levels of eigenfunctions, namely φ (1) k (x) and φ (2) l (x), are not necessarily mutually orthogonal. Di et al [7] assumes that β i,k and γ i,j,l both follow Gaussian distributions and proposes two methods for estimating the scores. In the first method, the estimated μ(x), η j (x), λ (1) k , λ (2) l , φ (1) k (x) and φ (2) l (x) are treated as fixed while the PC scores β i,k and γ i,j,l are random effects to be estimated. The model can then be written as: where ε i,j (x) is i.i.d and appears only when the functional data are observed with error (if the functional data are already smoothed or perfectly observed, simply remove this term from the model). This is in fact a linear mixed effect model in nature, therefore the random effects can be estimated by either best linear unbiased prediction (BLUP) or Markov Chain Monte Carlo (MCMC). This model is appropriate for both dense and sparse functional data. However, it faces computational challenges, especially when dealing with large volume of data. Therefore the second method is adopted in our proposed model.