Probabilistic load forecasting using post-processed weather ensemble predictions

Abstract Probabilistic forecasting of electricity demand (load) facilitates the efficient management and operations of energy systems. Weather is a key determinant of load. However, modelling load using weather is challenging because the relationship cannot be assumed to be linear. Although numerous studies have focussed on load forecasting, the literature on using the uncertainty in weather while estimating the load probability distribution is scarce. In this study, we model load for Great Britain using weather ensemble predictions, for lead times from one to six days ahead. A weather ensemble comprises a range of plausible future scenarios for a weather variable. It has been shown that the ensembles from weather models tend to be biased and underdispersed, which requires that the ensembles are post-processed. Surprisingly, the post-processing of weather ensembles has not yet been employed for probabilistic load forecasting. We post-process ensembles based on: (1) ensemble model output statistics: to correct for bias and dispersion errors by calibrating the ensembles, and (2) ensemble copula coupling: to ensure that ensembles remain physically consistent scenarios after calibration. The proposed approach compares favourably to the case when no weather information, raw weather ensembles or post-processed ensembles without ensemble copula coupling are used during the load modelling.


Introduction
Grid operators of electric utilities rely on accurate load forecasts to make informed decisions regarding electricity transmission and distribution. Over the past few years, modelling load has become increasingly challenging with the advancements in low carbon technologies, integration of renewable energy, and growth in unmetered and distributed smallscale renewable generation sources, all of which introduce more volatility into the energy system. Moreover, load and renewable energy supplies vary with weather seasons (Taylor, 2003), and load also depends on human behaviour, which is often rather stochastic. To cope with these uncertainties, probabilistic forecasting of load at different hierarchies of the energy system has garnered attention (Arora & Taylor, 2016;Guo et al., 2018;Haben et al., 2019;Taieb et al., 2021;Taylor & Buizza, 2002;van der Meer et al., 2018). Probabilistic forecasts aim to quantify the uncertainty in the form of probability distributions over possible future events, which allows for informed decision-making compared to the case when only a point forecast is communicated. In the context of this study, probabilistic load forecasts are of particular interest for a range of energy applications, including reliability planning (Billinton & Huang, 2008), probabilistic load flow (Chen et al., 2008), stochastic unit commitment (Wang et al., 2011), and probabilistic energy price forecasting (Nowotarski & Weron, 2018).
Load at the national level exhibits prominent variability, due largely to periodic cyclicality and variations in weather patterns. While some studies do not use explicit weather information at all (Hu, 2017;Hu & Jiang, 2017), it is imperative, when modelling load in terms of weather as the basis for probabilistic load forecasting, to propagate the uncertainty from the weather variables through the load forecasting model (see, for example, Haupt et al. (2019)). This can be achieved through the use of weather ensemble predictions generated from Numerical Weather Prediction (NWP) models.
In recent years, NWP models have become stateof-the-art in meteorology, with modern computing power allowing complex physical models to be run at a high resolution. These weather models describe the atmospheric processes using first principles, and, due to their nonlinearity and complexity, are solved with numerical approximations (Al-Yahyai et al., 2010). The outcome of a NWP model highly depends on the initial state of the atmosphere and the model's physical processes. To quantify these sources of uncertainty, the NWP model is run several times, with different initial conditions and/or a differently parameterised physical representation of the atmosphere. Each run of the NWP model provides a different scenario for the future of the weather variable, which is referred to as an ensemble member. Overall, the weather ensemble prediction encapsulates the degree of uncertainty in weather variables.
Tremendous advancements have been made in the area of NWP over the past few years. It has been shown that the forecast skill for lead times from three to 10-days ahead has been increasing by around one day per decade (Bauer et al., 2015). The improvements in weather predictions have primarily been attributed to progress in: (1) modelling the physical processa detailed representation of the atmosphere, (2) ensemble forecastingwhich encapsulates the uncertainty in initial conditions and model processes for a nonlinear complex system, and (3) model initialisationderiving the current state of the atmosphere and Earth's surface based on four-dimensional variational (4D-Var) data assimilation techniques that have been described as a major milestone in the field. For details, see Bauer et al. (2015) and Alley et al. (2019).
Unfortunately, raw ensemble predictions obtained from the NWP models are subject to underdispersion and bias. To deal with these shortcomings, statistical post-processing methods have been proposed for calibrating the weather ensembles (e. g. Baran & Lerch, 2018;Ben Bouall egue et al., 2016;Feldmann et al., 2015;M€ oller et al., 2015;Scheuerer & Buermann, 2014). Of these, two of the most commonly used methods for post-processing include Bayesian Model Averaging (see Raftery et al., 2005) and Ensemble Model Output Statistics (EMOS), originally also known as non-homogeneous Gaussian regression (NGR) (see Gneiting et al., 2007). Both of these post-processing methods have been shown to substantially improve the accuracy of NWP ensemble predictions (Hagedorn et al., 2012;Wilks & Hamill, 2007).
It is worth noting that the raw ensemble outputs from the NWP model represent a multivariate dependence structure, as specified by the model equations. If the weather ensembles are treated as being independent during the post-processing, we end up with several univariate and independent distributions, which may be practically unrealistic. It is thus important to take into account the temporal dependencies, as well as dependencies between the weather variables during ensemble post-processing (Hu et al., 2016;Schefzik et al., 2013). This problem is well known in the meteorological literature, and studies have proposed methods to create dependent distributions via empirical copula approaches, most notably the Schaake shuffle (see Clark et al., 2004) and different forms of Ensemble Copula Coupling (ECC) (Schefzik et al., 2013).
Advancements in the NWP models over the years have, unfortunately, not adequately translated into more accurate modelling of load. In the energy forecasting literature, studies using ensemble weather information as model input are rare, although e. g. Al-Yahyai et al. (2010) show that NWPs are superior to station-based weather information. Even when weather ensemble predictions are used for forecasting in a range of diverse applications such as: e. g. wind power (Gensler, 2019;Heppelmann et al., 2015;Nielsen et al., 2004), wind ramp events (Bossavy et al., 2013), solar power plant output (Thorey et al., 2018;Zamo et al., 2014) or load (Taylor & Buizza, 2002;, the need for calibration and maintaining the dependency structures among the weather ensemble predictions is neglected, which is a major limitation in the modelling. Only Heppelmann et al. (2017) use post-processing and an approach to capture the dependencies, however, they investigate probabilistic wind power forecasts.
This study aims to bridge the gap between the field of meteorology (focusing on weather ensemble predictions from NWP models) and energy modelling by proposing and implementing the following set of best practices: (1) calibrating the raw weather ensemble predictions to correct for biases and dispersion errors; (2) maintaining the temporal and multivariate dependencies between the calibrated ensemble members; and (3) using the post-processed weather ensemble predictions to estimate the forecast distribution of load (as opposed to just producing a point estimate or a set of pre-specified discrete scenarios of load). Our post-processing comprises a two-stage approach, in the first stage we use EMOS for calibration, and in the second stage, we use ECC to ensure that the calibrated weather ensemble predictions remain physically consistent scenarios. Moreover, we are essentially revising the approach taken by Taylor and Buizza (2002) and Taylor and Buizza (2003), who employed raw weather ensemble predictions to generate multiple scenarios of load, which were then post-processed. We compare the out-of-sample load forecast accuracy obtained using the raw weather ensemble predictions (current practice) versus using postprocessed ensembles (as proposed in this study). To the best of our knowledge, this is the first study that employs ensemble post-processing for probabilistic load forecasting.
The remainder of the paper is structured as follows. We introduce our data in Section 2 and describe the post-processing in Section 3. We present our forecasting methodology in Section 4. We evaluate the point and probabilistic forecast accuracy in Section 5 and finally, we conclude in Section 6.

National load and weather in Great Britain
In this section, we first describe the data for load followed by weather ensemble predictions.

Load data
We employ load for Great Britain (GB), from 2006 to 2017, inclusive. The data is sampled every hour and obtained from National Grid (NG), the company responsible for the transmission of electricity in GB. As evident from Figure 1, load exhibits a prominent recurring within-year pattern (intrayear seasonality), whereby the demand is higher in winter than in summer. The national load is a summary of all flows on the transmission grid in GB, and therefore, it does not capture all distributed energy sources. With the recent growth in unmetered and small-scale renewable generation sources, this form of measuring national demand has resulted in an overall downward trend and increased variability in the load time series. Figure 2 presents the average daily load profiles. It can be seen that load is higher during typical working hours of the day and late evenings, and load on weekends is usually lower than on working days. Moreover, load exhibits a recurring withinweek and within-day pattern. Load is overall lower on special days (such as public holidays) and proximity days (days adjacent or close to a public holiday that are not public holidays) compared to normal working days. Previous studies have typically focussed on modelling the load for normal days while ignoring load on special days (Taylor & Buizza, 2002;. Accommodating the special and proximity day effects during the modelling has been shown to result in improved load forecast accuracy across all days in the out-of-sample period (Arora & Taylor, 2018). We thus model load observed during both normal and anomalous periods.
In this study, we focus on modelling the load observed at midday. This is particularly relevant as the peak demand during summer months occurs around midday in GB. We use the first 10 years of data to train the model before testing it on the last available year 2017. 1 The last year of the training data is used as the cross-validation hold-out sample. We generate forecasts by rolling the forecast origin through each midday in the one-year out-of-sample period. We identified 18 special and proximity days in the out-of-sample data. Although we focus on generating multi-step ahead load predictions for midday, the proposed methodology could easily be adapted for forecasting load across all periods of the day.

Weather ensemble predictions
We employ the actual weather data and corresponding ensemble predictions from the NWP model of the European Centre for Medium-Range Weather Forecasts (ECMWF). The actual weather is represented by the reanalysis data, while the uncertainty in the weather forecasts is represented by the ensemble forecasts. Weather predictions from the ECMWF comprise 51 ensemble members, whereby 50 ensemble members are constructed by perturbing the initial conditions and/or model processes, and one ensemble member is generated using the best estimate of the initial condition/process (CNT: ensemble control member). We additionally use the high resolution deterministic forecast (HRES) of the ECMWF in our ensemble. We employ ensemble predictions for the following weather variables: temperature (at 2 m above ground), wind speed (perpendicular u-and v-components at 10 m above ground) and total cloud cover. The wind speed is calculated from the u-and v-components with windspeed ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi u 2 þ v 2 p : While high-resolution spatial weather data is available in a grid format from the ECMWF, we use only weather data for the seven GB locations considered by NG. These are chosen to reflect the regions with highest load. The locations are shown in Figure 3. Using the weights shown in Figure 3, a weighted average of the weather data for these locations is used by NG as input to their load forecasting models. We follow the same approach, with a key difference being that, while NG consider only point forecasts for the weather variables, we use weather ensemble predictions. We obtain the ensemble predictions from ECMWF at each of the seven locations and calculate a weighted average over each ensemble member at all locations to give one set of ensemble predictions for each time point. This weighted average set of ensemble predictions is then used in the further post-processing and forecasting steps.
To capture the influence of weather on load adequately, we use the original weather variables from the NWP and, following NG, we additionally derive two new variables, namely effective temperature and cooling power of wind. We calculate the effective temperature as done in Taylor and Buizza (2003) using where for a given period t, TO t denotes the average spot temperature of the previous four time steps, resulting in TE t being an exponential smoothed form of TO t : The rationale of using a smoothed form of temperature (TE t ) is to try and accommodate the slow and gradual change in human behaviour (and resulting demand response) to changes in the outside temperature. Additionally, the cooling power of wind variable, is based on the idea that wind speed changes electricity consumption behaviour only if the outside temperature is below a certain threshold. We again use the definition provided by NG, and used by Taylor and Buizza (2003), with this threshold at 18.3 C where for a given period t, CP t denotes the cooling power of wind, W t denotes the wind speed, and TO t represents the average temperature. These two new variables (TE t and ) have a strong correlation with load, as evident from Figure 4. It can also be seen in Figure 4 that load and temperature have a nonlinear relationship, which could potentially be approximated using an asymmetric quadratic function. The correlation between the effective temperature and load shows that the rise in load is sharper during the winter months than in the summer months. This can be attributed to the higher use of electrical equipment for heating during winter (compared to the use of cooling equipment during summer) in GB. A rise in the cooling power of wind is associated with an overall increase in electricity demand. Additionally, although the load is generally higher during the weekdays as compared to the weekend, the correlation is strong in both cases.

Weather ensemble post-processing
The raw weather ensemble predictions from the ECMWF are biased (Atger, 2003;Mass, 2003) and underdispersed (Eckel & Walters, 1998;Hamill & Colucci, 1997). The weather ensemble predictions thus need to be calibrated using post-processing methods. As pointed out by Gneiting and Katzfuss (2014), ensemble calibration aims to correct for the dispersion errors and biases in raw weather ensemble predictions, with the overall goal of maximising the sharpness of post-processed ensemble prediction distributions subject to calibration. Calibration refers to the statistical consistency between the forecast distribution and actual observations, while sharpness refers to the concentration (or spread) of the forecast distribution. Rank histograms (also known as Talagrand diagrams) can be used to assess the calibration of a probabilistic forecasting system. In an ideal forecasting system the verifying observations are equally likely to fall within any bin constructed from two ordered neighbouring ensemble members. The rank histogram distribution is thus ideally symmetric and flat with equal numbers of observations in each bin. However, while the uniform rank histogram is a necessary condition for calibration, it is not sufficient. The rank histogram of one-step-ahead raw temperature ensembles, as shown in Figure 5, is asymmetric, which indicates the presence of bias. The U-shape in Figure 5 indicates that the ensemble predictions are underdispersed. The presence of bias and dispersion errors in the weather ensemble predictions could result in a poor estimation of the load forecast distribution. To deal with this issue, we adopt a two-stage post-processing scheme. In the first stage, we calibrate the weather ensemble predictions using Ensemble Model Output Statistics (EMOS). In the second stage, we retain the multivariate dependency structures in calibrated weather ensemble predictions using Ensemble Copula Coupling (ECC). For ensemble post-processing, we use weather data for the period 1 January 2016 up to 31 December 2017, inclusive. We now describe the post-processing method.

Ensemble model output statistics
EMOS addresses the issue of both bias and underdispersion in raw weather ensemble predictions . Specifically, EMOS calibrates past ensembles using the corresponding actual historical weather data, whereby the estimated parameters (from the training data) are used for calibrating the future ensemble members. This calibration is performed by estimating a distribution for the raw weather ensemble predictions, x 1 , :::x M with M members, using a parametric distributional regression approach. In our post-processing, we use the distributions for temperature and wind speed as proposed by Gneiting (2014). The temperature ensembles are calibrated using a normal distribution N ðl, r 2 Þ and the wind speed is modelled using a truncated normal distribution N 0 ðl, r 2 Þ, where l and r 2 are calculated as x m and Figure 4. Scatter plot of load with effective temperature and cooling power of wind. Note: data for weekdays and weekends are denoted by grey and black dots, respectively. Figure 5. Rank histogram of the one-step-ahead raw temperature ensembles before any post-processing at midday.
The EMOS location parameters a> ¼ 0 correct for the bias in the raw weather ensemble predictions, while the scale parameters b> ¼ 0 adjust the spread and potentially tackle the issue of underdispersion. Cloud cover is post-processed with a multinomial logistic regression following the approach by Baran et al. (2021); Hemri et al. (2016) using the same intervals for quantisation of the forecast values in order to correspond to oktas (see Table A4). We implement the same MLR as Baran et al. (2021) and refer to their paper for more detail.
To evaluate if EMOS helps improve the calibration of our weather variables, we investigate the Probability Integral Transform (PIT) of calibrated ensembles. If F denotes a fixed, non-random predictive CDF for an observation Y, the PIT is the random variable Z F ¼ FðYÞ: It is known that if F is continuous and Y$F then Z F is standard uniform. The PIT for the temperature variable is shown in Figure 6. For a perfectly calibrated ensemble, all bins would have the same height (as denoted by the dashed red line). Compared to the relative frequency of raw ensemble as shown earlier (Figure 5), the ensemble distribution after calibration is more uniform ( Figure 6).
We train the EMOS parameters on the past 30 days and re-estimate every day. We tested different length of days for training (ranging from 10 to 100) and found that 30 days performed best on the test set in 2016. Figure 7 shows an example of a post-processed ensemble trajectory over a randomly selected six day period.

Ensemble copula coupling
After calibrating the weather ensemble predictions based on univariate distributions, as specified using EMOS, we could potentially draw independent samples from the calibrated ensemble distributions, and use them as inputs in the load forecasting model. However, this could result in combinations of weather variables that are unlikely to happen in reality, because treating calibrated ensembles as being independent would result in a loss of the multivariate dependency structures of the raw ensembles. Thus, to ensure that the calibrated weather ensemble predictions maintain the original dependency structures, we use a reordering based ECC scheme similar to the Shaake shuffle (Schefzik et al., 2013) and the stratified sampling ECC (ECC-SS) approach by Hu et al. (2016).
ECC is based on the appropriate copula being defined in the form of a reordering process. The idea is that given a dependence structure "template" (Schefzik, 2017), the samples that are drawn from multiple univariate EMOS distributions can be reordered in such a way that their rank ordering resembles the rank ordering of the raw ensemble members for the same variables. The templates are based on the raw weather ensemble predictions, where we assume that the raw ensembles capture the correlations sufficiently. While several variants exist, we use a slightly adapted version of the stratified sampling ECC (ECC-SS) proposed by Hu et al. (2016).
The ECC-SS procedure essentially includes three steps. Figure 8 illustrates these steps in a scenario with six ensemble members, which are used to describe three distributions of weather variables at two different time steps. In the first step, we rank the ensemble member values. Thus, the raw ensemble members x 1 , :::, x M with their order statistics x ð1Þ ::: x ðMÞ are used to generate a rank dependence structure at each time horizon via a permutation p, with pðmÞ :¼ rankðx m Þ for m 2 f1, :::, Mg (see table (A) in Figure 8). In the second step, we impose this rank structure on the calibrated weather ensemble predictions. As we are left with a conditional distribution function following EMOS, we impose the rank order by first splitting the calibrated ensemble distribution into M equally spaced quantiles. Each quantile then represents a rank in the raw ensemble and thus M has the same size as the raw ensembles (see (B) and (C) in Figure 8). We chose the 0 and 100% quantiles to be equal to one standard deviation below the minimum and one standard deviation above the maximum, respectively. In the third step, we draw realisationsx from bins, i. e. the intervals between the quantiles to obtain more than M total samples and to efficiently sample from the tails of the distribution. In contrast to Hu et al. (2016), we do not fix the interval size to 1 n , but use the quantiles such that the width of the bins adapt with the density. We then draw the realisations dependent on the rank structure from the raw ensembles, such that the calibrated and reordered ensemblex 1 , :::,x M is given bŷ Thus, we draw from the multivariate weather distributions at different time steps, while maintaining their dependency structures across both the weather variables and time.

Probabilistic modelling of load
To generate probabilistic load forecasts using postprocessed weather ensemble predictions, we adopt a two-stage linear regression model. We adopt this approach as it is based on the model used in practice at the NG, and was also employed by Taylor and Buizza (2003). The first stage of the linear regression model can be described as where y t is the dependent variable which in our case is the national electricity demand, x t, i are other variables describing the load and weather, D t, j are dummy variables. In our setting, the dummy variables include Friday, Saturday, Sunday, dummies for special days (e. g. public and bank holidays) and proximity days and dummy variables for the summer months (June, July, August) and winter months (December, January, February). The interaction terms C t, k are either between two variables or between a variable and a dummy variable, such as the interaction between temperature and wind, and the interaction between temperature and the weekend dummy. The load variables x t, i can be modelled as a function of a time-specific component and a weatherspecific component. For example, the time-specific component includes a counter of the day in a year, and an overall day counter for the whole data set, as well as quadratic and cubic terms of these counters. These time counters help accommodate the seasonal patterns in load time series. The weather-specific component comprises weather variables to capture variations in load due to changing weather patterns. The model is trained on actual historical weather data for the three weather variables wind speed, temperature and cloud cover to describe the real relationship among the variables without introducing any bias through forecast values.
In the second stage, the error from the first stage is modelled with the help of autoregressive terms (using lags up to 4 weeks) in the form of e t ¼ a 0 þ a 1 e tÀ1 þ ::: þ a 28 e tÀ28 þ g t , g t $N ð0, r 2 Þ: While a typical load forecasting strategy relies on using a single point forecast for each weather variable, we want to accommodate the uncertainty in  . , x 6 for three weather variables (v, y, z), across two horizons (t ¼ 1, 2). The rank ordering in the table is derived from the raw ensembles (first step of ECC-SS). (B) and (C) show the EMOS output distributions at t ¼ 1 and t ¼ 2 respectively. To maintain the rank ordering of the six raw ensembles (as shown in A), we split the post-processed distributions into six equally spaced quantiles, whereby each quantile denotes a rank order (second step of ECC-SS). While drawing realisations from the multivariate weather distributions across different horizons (t ¼ 1, 2), we impose the original rank ordering (from A) to ensure that both the variable and temporal dependencies are maintained (third step of ECC-SS).
weather predictions using the post-processed ensembles. Thus, the multivariate weather distributions are converted into load forecast distributions using Monte Carlo. We draw 1000 samples at each time step for each forecast horizon from the post-processed weather distributions while maintaining rank ordering (see Section 3.2) and use these as input into the linear regression model, resulting in our load forecast distribution.
Although we adopt a linear model, the transformation used for weather variables (such as, cooling power of wind) helps accommodate the nonlinear relationship between the weather variables and load. The objective of this study is not to compare different modelling approaches, instead, for a given wellestablished model (linear regression model in this case), we aim to assess the efficacy of post-processing weather ensemble predictions for probabilistic modelling of load. Thus, we compare the forecast accuracy of the linear regression model using a different set of input variables during the modelling, based on the following five alternative criteria: Approach 1: No Weathernot incorporating any weather-related information in the modelling. A model with weather information would be expected to outperform this baseline model. Approach 2: Actualsusing actual weather data as predictor variables. Although this information is not accessible at the time of forecasting, we use this model to provide an estimate of the upper limit on the load forecast accuracy that could theoretically be attained if perfect future weather information was available. Approach 3: Raw Ensemblesusing the raw weather ensemble predictions from the NWP models as predictor variables. Approach 4: EMOS Ensemblesemploying the post-processed (only EMOS) weather ensemble predictions. Approach 5: ECC Ensemblesemploying the postprocessed (EMOS and ECC) weather ensemble predictions. This is our proposed approach.
The linear regression model (for the above five approaches) is trained using only the data from 2006 to 2016, and validated using the data from 2017. For the approaches with weather (Approach 2-5) and without weather (Approach 1) the parameters (b 0 , a i , b j , c k ) are estimated independently while using the same set of none weather-related variables, i. e., using the same set of dummy and load variables. In total, we consider 53 predictor variables including weather (Approach 2-5) and 47 predictor variables without weather (Approach 1) for modelling load. In ordinary least squares regression, this variety of features can lead to low predictive power and reduce model inference due to problems such as over-fitting, presence of noisy (or irrelevant) predictors, and multicollinearity. A common choice to overcome these problems is to use a regularisation technique, such as the LASSO (least absolute shrinkage and selection operator) (Hastie et al., 2013;Tibshirani, 1996). The LASSO regression forces the model coefficients of less salient features to go to zero. Although this shrinkage increases the bias, it improves the forecasting accuracy (Ludwig et al., 2015).
In contrast to other regularisation techniques, the LASSO technique uses an L 1 penalty term, which sets some coefficients to exactly zero (Hastie et al., 2013). The LASSO can, therefore, be used as a feature selection method. However, to efficiently use the LASSO method, the choice of the shrinkage parameter (k) is essential. In our case, we use k-fold cross-validation on the training data set and choose k as the largest value of k such that the error is within one standard error of the minimum. Using LASSO, we select a total of 39 variables with nonzero coefficients in our model with weather variables (Approaches 2-5) and 34 in the model without weather variables (Approach 1). The full list of variables with non-zero coefficients (Table A2), as well as those with a coefficient of zero (Table A3), can be found in the supplementary materials.
We generate probabilistic load forecasts by rolling the forecast origin through each day in the out-of-sample period (2017). After each week, we re-estimate the regression coefficients for the linear regression model. As we stated earlier, the parameters for EMOS are estimated once using the cross-validation hold-out sample (2016).

Forecast evaluation
In this section, we evaluate the out-of-sample point and probabilistic load forecast accuracy using the following three performance scores: the mean absolute percentage error (MAPE), the root mean squared error (RMSE), and the continuous ranked probability score (CRPS). While the first two error measures quantify the point forecast accuracy, with the MAPE being scale independent and the RMSE putting a heavier penalty on large deviations, the CRPS quantifies probabilistic forecast accuracy summarising calibration and sharpness. The MAPE and RMSE are defined as with y t being the actual load at time point t andŷ t being the forecast load value at this time point, while N denotes the number of observations. It has been shown by Gneiting (2011) that for model evaluation based on a quadratic loss function, the mean of the density forecast is the optimal forecast. Similarly, if the evaluation is based on a symmetric piecewise linear loss function, then the optimal forecast is the median of the density forecast. Thus, for evaluation using the RMSE and MAPE, we use the mean and median of the distributional forecast, respectively. The CRPS is then defined as with F being the predictive cumulative distribution function of load, y the verifying observation and 1 denoting an indicator function. A lower value for the CRPS indicates greater probabilistic forecast accuracy. We report the average CRPS computed over all observations in the out-of-sample period. Additionally, as we want to assess whether we can improve the forecasting accuracy through post-processed ensembles, we calculate a CRPS skill score. A skill score is the percentage by which a model is more accurate than a baseline model. Using the approach with post-processed weather ensemble predictions, we summarise the one-stepahead probabilistic load forecasts for the out-ofsample data in Figure 9, where we plot the 95% prediction interval (grey area) and the median forecast (in orange) along with corresponding actual load observations (in blue). It is encouraging to see that the prediction intervals encapsulate the majority of actual observations. The reliability diagram across all horizons for the best performing model is shown in Figure 11. For each probability level, the reliability diagram presents the proportion of out-of-sample observations that fell below the corresponding quantile forecasts. In our case, we can see that less than 95% of the observations fall into the 95% prediction interval.
In Table 1, we present the MAPE, RMSE, CRPS and CRPS skill score for lead times ranging from one day to six days ahead. For the skill scores, we use the model with no weather information as the baseline (Approach 1: No Weather). Crucially, the incorporation of weather information resulted in a substantial improvement in both the point and probabilistic load forecast accuracy across all lead times considered in this study. This result highlights the importance of using weather information in load forecasting models. Encouragingly, the model Figure 9. Probabilistic load forecasts for one-day-ahead with 95% prediction interval (grey area) and the median forecast (in orange) along with corresponding true load (in blue). Note: the forecasts were generated using the linear model with postprocessed weather ensemble predictions. with ECC post-processed weather ensemble predictions outperformed the model with raw ensembles and EMOS post-processed ensembles overall, which points towards the need to post-process the weather ensemble predictions accounting for dependency structures for load forecasting applications.
To summarise the performance of different models across multiple lead times, we present the MAPE, RMSE and CRPS skill scores in Figure 10.
We expect all methods that include weather information to have a positive skill score. It can be seen from Figure 10 that the ECC (Approach 5) and EMOS (Approach 4) post-processed weather ensembles, as well as the raw weather ensemble predictions (Approach 3) perform significantly better than the baseline that uses no weather data. The ECC post-processed weather ensembles outperform the raw ensembles, as well as EMOS post-processed ensembles based on all three skill scores. Only for a forecasting horizon of 6 days ahead do the raw ensembles achieve a slightly higher MAPE accuracy. However this is not a significant difference to the other post-processed ensemble MAPE scores. Overall, compared to the baseline model, using post-processed weather ensemble predictions can improve the point and probabilistic load forecast accuracy of the model by up to 40% with the ECC post-processed ensembles performing best. To ensure that the differences between the forecast models are statistically significant, we compare the models pairwise using the Diebold-Mariano test (Diebold et al., 1995). For our approach, the ECC ensembles, we can reject the null hypothesis that this model performs worse or equal to another model, for all models except the one using actual weather data. The test statistics and their corresponding p-values for horizon one can be found in Table A1. Finally, we also take a look at the quantile decomposition of the CRPS score for horizon one across all models using CRPS f n ¼ QS a ðF t ðaÞ À1 , y t Þ, and QS a ðF À1 ðaÞ, yÞ ¼ 2ð1fy F À1 ðaÞgÀaÞðF À1 ðaÞÀyÞ: Following (Gneiting & Ranjan, 2011) we plot the mean quantile score against a showing the quantile decompositions of the mean CRPS score (see Figure 11).

Conclusion
In this study, we proposed and implemented a best practice to generate probabilistic forecasts of electricity demand using weather ensemble predictions. We used data for three weather variables (temperature, wind speed, and cloud cover), obtained from a  Figure 11. Quantile decomposition of the mean continuous probability score for the different models (left) and coverage rate for the best performing ensemble forecast model (ECC Ensembles) over all horizons (right). 51-member ensemble system and a high resolution deterministic forecast. For load forecasting, we investigated the efficacy of using ensemble postprocessing, as opposed to using raw weather ensemble predictions from NWP systems. This paper has shown how to post-process the weather ensemble predictions by accounting for temporal correlations and correlations between the weather variables. We showed that calibrating the weather ensemble predictions while accounting for their multivariate dependencies using a copula-based coupling approach improves the probabilistic load forecast accuracy, resulting in a CRPS that is noticeably better than a model that does not include any weather information. The post-processed ensembles outperform the raw ensembles, which highlights the advantage of careful post-processing for improved load forecast accuracy. The proposed modelling framework could potentially be adapted to other energy applications, such as wind and solar power generation. A useful line of future work would be to investigate this post-processing approach for modelling electricity demand at different layers of the energy hierarchy, including the low voltage level or at various locations also accounting for spatial dependencies. It would also be worth investigating the use of machine learning for accommodating the nonlinear relationship between post-processed weather ensemble predictions and load in a nonparametric modelling framework. Note 1. Although load data for 2018 is available, the corresponding weather data was not complete at the time, and we thus restricted the analysis to the complete data set we could obtain from both sources.