Large-scale statistical analysis and modelling of real and regulatory total energy use in existing single-family houses in Flanders

ABSTRACT Large-scale statistical studies on the gap between the real and regulatory energy use in residential buildings in Europe have shown that the regulatory calculation overestimated the real energy use, inflated true energy savings and undermined national energy policy making. Using data from 122,680 Flemish existing single-family houses, this research builds further on existing studies by contributing results for Flanders. The study also examines to what extent available aggregated variables explain the real annual total building energy use using statistical linear models and addresses the problem of multicollinearity and the importance of bootstrapped confidence intervals for model quality control. The overestimation of the real total energy use (and potential energy savings) by the Flemish regulatory method is exceedingly large compared to studies from other EU countries. The Flemish labels prove very poor indicators of the real energy use. Statistical linear models explain up to 46.6% of all variability and indicate that a significant extent of multicollinearity had to be corrected. Half of the variability has been left unexplained and has to be attributed to variables that were not available and the fact that the data were insufficiently accurate. Future analysis will explore whether more complex models identify more evidence.


Introduction
Approximately 40% of the energy use and 36% of the CO 2 emissions in Europe take place in buildings (EU, 2020). Hence, the building stock is considered the largest energy consumer in the EU. With the aim of reducing the energy use and CO 2 emissions for buildings in a cost-effective way by 2050, clear legislative frameworks on the building energy use (EU, 2003(EU, , 2010(EU, , 2012(EU, , 2018 have been introduced which impose building energy performance levels through labelling and reporting, pinpoint objectives for new and deeply retrofitted buildings and propose standardized building energy reduction recommendations for existing buildings (EN ISO 13790, 2008;EN ISO 52000, 2017). All EU member states have implemented their version of the directives at latest by 2009 (N.B. which some members accomplished more effectively than others (Andaloro et al., 2010)) and so national building energy registries have emerged and vastly increased ever since.
A range of EU studies has been published in the last decade focusing on the effectiveness of the EPB directive implementations (ASIEPI, 2009;BPIE, 2014BPIE, , 2020CENSE, 2010;CA EPBD, 2008, 2011, 2016, 2019, 2023. The Concerted Action EPBD in particular further enabled EU member states to share their information and experiences with adopting and implementing the European building energy legislation at the national level. The main shortcoming of the directive, as noticed by the EU studies, still was the looseness of the regulations in the directive, which left plentiful room for interpretation (i.e. regarding the applied requirements, indicators, calculation methods and assumptions) and made it nearly impossible to compare national building stock energy uses across nations in a fair manner (CA EPBD, 2008, 2011, 2016, 2019, 2023. Naturally, the calculated energy uses by the regulatory methods have been only an indication of the real energy use at stock level rather than an accurate estimate at building level, since they each used a standard average user profile (CA EPBD, 2008, 2011, 2016, 2019, 2023, a standard climate and standardized, simplified calculation procedures based on technical characteristics of the building. The question however has remained to what extent the regulatory calculated energy uses and performance indicators relate to the real energy uses, and if calculated energy savings associated with better building performance levels are fully obtained in practice (de Wilde, 2014;Guerra-Santin et al., 2009;Majcen et al., 2013aMajcen et al., , 2013bvan den Brom et al., 2018)?

State of the art
A number of studies have been published in recent years focusing on the gap between the real and regulatory calculated building energy use in The Netherlands (Majcen et al., 2013a(Majcen et al., , 2013bGuerra-Santin et al., 2009;van den Brom et al., 2018), France (Cayre et al., 2010(Cayre et al., , 2011, Germany (Loga et al., 2011;Schröder et al., 2014;Sunikka-Blank & Galvin, 2012), United Kingdom (Huebner et al., 2015;Kelly, 2011;Kelly et al., 2011aKelly et al., , 2011bLoucari et al., 2016), Switzerland (Branco et al., 2004;Cozza, Chambers, Geissler, et al., 2019, Cozza et al., 2020FSO, 2017;Streicher et al., 2018), Belgium (BFG, 2019;Delghust, 2015;Hens et al., 2010), Luxembourg (Merzkirch et al., 2014) and Denmark (Brogger et al., 2018). Most of these studies were based on statistical data analysis of large national registries combining several data sources. In The Netherlands, Majcen et al. (2013aMajcen et al. ( , 2013b used the energy performance certificates of over 340,000 cases provided by AgentschapNL (i.e. a public sector organization appointed by the Dutch Ministry of the Interior and Kingdom Relations) and combined it with real energy use data from CBS (i.e. Statistics Netherlands). Cayre et al. (2010Cayre et al. ( , 2011, in France, analysed close to 2000 energy performance certificates from residential buildings provided by EDF R&D through surveys. In Germany, Sunikka-Blank and Galvin (2012) and Loga et al. (2011) studied the energy performance ratings of respectively 3400 dwellings and 1702 dwellings. Kelly (2011) utilized 2531 English dwellings from the EHCS (English House Condition Survey) and combined it with real energy meter data from the FES (Fuel and Energy Survey). Cozza, Chambers, Geissler, et al. (2019a), , Cozza et al.(2020) used the energy performance certificates of 1172 buildings from the CECB-database (i.e. Swiss Cantonal Energy Certificate for Buildings database).
All these studies showed that the regulatory building energy calculation method overestimated the real thermal energy use and thus also the potential energy savings. This applied largely to existing, old, poorly insulated dwellings, yet less to dwellings with improved energy performance levels (Branco et al., 2004;Majcen et al., 2013a;Sunikka-Blank & Galvin, 2012). Both in Germany, The Netherlands, Switzerland and the UK, the average prediction error (i.e. the difference between the real and regulatory calculated energy use) shifted further into an under-estimation of the energy use when reaching high performance levels. This shift was often referred to by authors as the 'rebound effect' (Berkhout et al., 2000;Brogger et al., 2018;Druckman et al., 2011;Nässén & Holmberg, 2009) and reversely the 'prebound effect' (Galvin, 2013;Galvin & Sunikka-Blank, 2016;Sorrell et al., 2009;Sunikka-Blank & Galvin, 2012). The rebound effect occurred when a proportion of the energy savings was consumed by additional energy use (i.e. increased consumption of services). The prebound effect on the other hand was referred to as an under-consumption of services in old, poorly insulated houses.
While it may seem as if both terms were only related to a change in occupant behaviour (e.g. an increase in thermal comfort after an energy-saving measure due to the occupant choosing higher internal temperatures (Greening et al., 2000;Herring & Roy, 2007;Sorrell et al., 2009)), this was only partially the case. A number of studies indicated that a large portion of the prediction error was not caused by changes in occupant behaviour, but was due to simplifications and shortcomings in the regulatory calculation methods (Delghust, 2015;Guerra-Santin, 2010;Guerra-Santin et al., 2009;Kelly, 2011), the quality of the data in the registry (CA EPBD, 2016, 2019Guerra-Santin, 2010) and other basic physical effects (Zou et al., 2018). So, Deurinck (Deurinck, 2015) and Love (Love, 2014) e.g. studied how the rebound effect was related to changes in occupant behaviour and found that a large portion of the change in internal temperature occurred during unheated hours (i.e. the temperature drop between two heating periods gets smaller) and thus was a result of changes in the thermal efficiency of the building envelope and not of the occupant behaviour.
Although the existence of the gap between the real and regulatory calculated thermal building energy use is widely acknowledged, it has proven to be difficult to identify the key factors that influence it. Existing literature indicated that the main causes of the gap were inaccurate knowledge of input variables (Van Dronkelaar et al., 2016), limitations of the calculation method (Delghust, 2015;Kelly, 2011), uncertainty in weather variables and the inability to model occupant behaviour (Guerra-Santin, 2010;Tian et al., 2018;Van Dronkelaar et al., 2016;Zou et al., 2018). Further, Delghust (2015) stressed the influence that the energy performance assessor had on the reported energy rating (i.e. reporting bias) (and therefore also the gap between the real and regulatory calculated thermal building energy use). The reporting bias, namely a higher use of conservative standard values (associated with a lower assessment workload) in the assessments of less ambitious (less energy efficient) houses, was found to result in an artificial overestimation of the difference between high and low performance buildings.

Purpose of this research
Large-scale statistical studies on the gap between the real and regulatory calculated residential building energy use have been published for several EU countries, providing valuable insight about national building stock energy use and retrofit policies, the performance of national regulatory calculation methods and the explanatory power of variables in the national building energy registries. This research builds further on these studies by contributing results for Flanders and correlating them with the state of the art. For the Flemish building stock, it has been unclear whether and to what extent extant findings could be applied, as research so far focused on a few small samples and case studies of only new or deeply retrofitted dwellings (BFG, 2019;Delghust, 2015;Hens et al., 2010) and no large-scale data registries (linking theoretical energy performance assessment data and real consumption data) were previously available.
Also, as the national implementation of the EPB directive varies between countries (CA EPBD, 2016, 2019, 2023, this intrinsically means that also the prediction error (and therefore also the findings from statistical studies anchored within their local context) will vary between countries (e.g. intermittency and spatial correction factors are not included in the Belgian EPBD implementation(s), but are included in e.g. The Netherlands, Germany and the United Kingdom). Not only is it important to have knowledge about the true national stock's energy use and its relation with the reported energy performance label within their national energy performance calculation method. For the success of the national energy policies, it is even more important to highlight the relationship between the expected savings resulting from energy saving measures and the true savings that are actually achieved.

Flemish EPBD implementation
In the Flemish Region, the Flemish Energy and Climate Agency (VEKA) (VEKA, 2021) and the Ministry of Environment, Nature and Energy (DLNE, 2021) are the responsible public bodies for the Flemish EPBD implementation. For existing residential buildings (for sale and rent), the EPBD (EU, 2003(EU, , 2010(EU, , 2012 is implemented in the Flemish Energy Performance and Indoor Climate Decree (Energy decree) (FA, 2009) and certification started from respectively November 2008 (for sale) and January 2009 (for rent) (CA EPBD, 2011, 2016, 2019, 2023. For each building, an accredited EPC assessor, hired by the dwelling owner, calculates and reports the building's energy performance rating, using the regulatory white-box calculation tool (EPC software in Belgium) provided by the government, based on as-built data of the house, which is valid for a period of 10 years. In the case of missing as-built data (e.g. presence of insulation, insulation thicknesses, v 50values for the dwelling's air permeability etc.), conservative standard values are used in the calculation method.
The total, calculated annual primary energy use (TOT) includes the demands for space heating (SH), domestic hot water (DHW), space cooling (SC), auxiliary energy for building services (AUX) and the production of renewable energy (i.e. photovoltaic panels (PV)). The space heating and cooling energy is calculated using a single zone, quasi-steady-state, monthly calculation method, based on ISO 13790 (EN ISO, 13790, 2008;EPC Form., 2015). In order to rate the energy performance of existing residential buildings, the total calculated primary energy use per gross floor area (kWh/m 2 ·y) is translated into a label indicated by a letter (EPC-rating) (EPC Form., 2015; EPC Insp., 2015) ( Table 1).
The EPC also includes standardized retrofit recommendations that are automatically generated and tailored to the building, depending on the inputs. The energy use for cooking, lighting and domestic electric appliances is not included in the regulatory calculation method. The electricity production from local PV panels is taken into account. All calculated energy figures are converted into their primary-energy equivalents, with a conversion factor of 2.5 for electricity and 1.0 for natural gas. Also in Flanders, the EPC's play an important role in the long-term renovation strategy developed by the Flemish government (FG, 2016(FG, , 2020b (which in turn is part of the national energy and climate plan (NEKP, 2019)). The strategy implies that all dwellings need to achieve EPC-label A by 2050 (BPIE, 2014;FG, 2016FG, , 2020bNEKP, 2019). The current grant scheme (FG, 2020a; Fluvius, 2020) provides uniform subsidies (i.e. EPC label bonus for label improvement) for single-family houses with label E or worse who retrofit to label C or better for which the amount linearly increases from label C till A (after retrofit).
Compared to the regulatory calculation methods in Germany (DIN, 2007), The Netherlands (NEN, 2012)  Delghust, 2015;DIN, 2007;NEN, 2012). The authors thus expect the relation between the real and regulatory calculated energy use in Flanders to be worse compared to other countries and the prediction error to be larger.

Data overview and treatment
Data overview This research focused on (i) existing, EPC-rated (ii) single-family houses, (iii) with an EPC-rating assessment from between 2015 and 2018, (iv) with a reported construction year (i.e. since most standard values in the EPC calculation method depend on it). In total, 122,680 cases were sampled from the Flemish EPC registry for residential buildings (operated by VEKA), based on the four criteria above. The Flemish EPC registry is one centralized database (not publicly accessible, even for research purposes) with data from the regulatory energy performance calculation files of all registered buildings (and containing only houses constructed before 2006). It provides information about the building characteristics, technical system data and detailed building geometry data. The matching real energy use figures for natural gas and electricity as well as available declared PV characteristics were obtained from the Belgian distribution system operator Fluvius. The scope of the study was limited to single-family houses (i.e. no apartments). In order to have reliable figures of the real energy use, only homes were selected (1) with an individual heating system, (2) where SH and DHW were only covered by natural gas and/or electricity (according to the EPC-assessor) and (3) that were occupied and for which measured meter data was available for more than one year (i.e. no dwelling vacancy periods and no standard or estimate values from the energy utilities based on previous meter data). Also, the EPC registry entries had to be free from any major error or shortcoming with regard to the data (e.g. missing data or contradictions). Both for natural gas and electricity, normalized annual consumption figures for each single-family house were obtained for the years 2012-2019.
Data treatment, cleansing and filtering Data cleansing (based on analyses within and across the available datasets) revealed a substantial number of inconsistencies, indicating errors in data and contradictions between data files, changes to the dwelling after the moment of EPC assessment or missing data that required cautiousness with regard to other variables or even the case itself. Disputable data and all derived variables that could not be corrected were marked as missing data and excluded from further analysis and statistical modelling. The following four criteria were the main causes why cases were excluded from the sample: (i) Coupling of data from various databases (e.g. Fluvius real energy use data and data from the Flemish EPC registry). (ii) Inconsistencies in the PV-data.
(iii) Doubt about the reliability of the real energy consumption data. (iv) Single-family houses with energy sources other than natural gas and electricity for space heating (SH) and/or domestic hot water (DHW).
After data treatment, cleansing and filtering, the final sample (O) comprised 69,870 cases which corresponded to 56.953% of the cases in the originally received datasets. Extra information on the data treatment, cleansing and filtering was added in Appendix G.

Descriptive statistics
While the aim of the study was not to model the building stock, based on a representative sample, a better insight in the representativeness of the sample helped identify potential hiatus or biases in the correlation and modelling analyses. A few important descriptives are given below, others were gathered in the supplementary material (Appendix A).
The total Flemish single-family housing stock included 2.20 million single-family houses in 2020 (VEKA, 2020), and 1.92 million were built before 2006, of which approximately 560,000 homes were EPC-rated. The final sample (O) that was studied in this article, therefore, contains 3.2% of the total singlefamily housing stock and 12.5% of the total EPC-rated single-family housing stock (VEKA, 2020).
As can be seen in Figure 1, almost 40% of the singlefamily houses in the final sample were constructed between 1946 and 1970. Compared to all single-family dwellings in the existing single-family housing stock before 2006(Census, 2015, the construction years  were overrepresented in the final sample and the construction years before 1945 were underrepresented. Cases that were constructed before 2006 were not represented in the samples since they were covered by the EPB standards for new and thoroughly renovated buildings.
Overall, the final sample was considered representative for all the EPC-rating categories issued for the Flemish single-family housing stock. The sample contained slightly more homes in the energy-efficient categories and less homes in the worst energy rating categories. Cases with construction years  were overrepresented in the sample and cases with construction years <1945 were underrepresented.

Categorization
The final sample (O) comprised single-family houses with all possible combinations of technical and renewable energy systems (e.g. solar panels, condensing boiler, ventilation system) and energy carriers (i.e. gas and electricity) for different end uses (e.g. space heating, domestic appliances, indoor air quality, domestic hot water). The amount of collected data made it possible to define three subsamples, each containing enough cases for statistically valid analyses (Table 2) on more disaggregated levels, focusing on specific (groups of) energy uses (and carriers).
This article focused only on the statistical analyses of samples O1, O2 and A2 and statistical calculation models for samples O1 (dark grey). Sample O1 was outlined for analyses on the prediction error regarding the total primary energy use. Sample O2 was outlined for analyses on the prediction error regarding the domestic electricity consumption for all except for SH and DHW (i.e. homes with an energy carrier for SH or DHW other than electricity). Complementing O2, sample A2 was outlined for analyses on the energy use for SH and DHW (i.e. homes with only natural gas as an energy carrier for SH and DHW). Note that the real electricity consumption figures also included the electricity consumption for domestic appliances and lighting (and cooking in some cases) and that the real natural gas consumption figures possibly included the natural gas consumption for domestic cooking (N.B. however, it was not known which cases use natural gas for cooking), which were not included in the regulatory calculated energy use figures.

Data-driven statistical modelling
Model selection and practical usage Statistical linear regression models have been applied in this study as the majority of related studies in other EU countries used linear regression models (Brounen et al., 2012;Cozza et al., 2020;Meier & Rehdanz, 2010;Rehdanz, 2007). Therefore, in order to be able to compare results with the literature, the authors chose to test linear regression models as well. Also, much of the important available statistical variables (e.g. building volume, gross floor area, average U-value, air tightness, degree days, household size etc.) had a nearly linear correlation with the dependent variable (total building energy use). Apart from model prediction, the linear models also allowed to identify relationships between variables and test their significance which corresponded to the studies aims.
The linear models simplicity makes practical usage by stakeholders and/or non-experts more accessible and they do not even require a calculation tool or special software (i.e. a linear regression model can be written out easily as a one-line formula and implemented). Practically, stakeholders, such as planning authorities, could use the linear models to inform inhabitants, tenants and home owners about their individual annual building energy use (if the accuracy is high enough) and/or use them for predictions and scenario analysis (e.g. cost-effective energy reduction strategies) for larger groups of buildings (e.g. the social housing patrimony, city-scale analyses). Additionally, already in the development of such online tools (e.g. De Energie Centrale, 2021; VEKA, 2021), the models could serve as a reference of what variables have the most explanatory power and are easy enough to fill in without much effort. After all, apart from being accurate and having meaningful outputs, questionnaires by the Flemish Energy and Climate Agency showed that such online tools were only used by people if filling in all variables was easy and quick (BFG, 2021).
Data pre-processing for statistical modelling After data cleaning, extra pre-processing steps were needed to make the data ready for statistical modelling. To be able to include categorical variables in statistical models, one-hot encoding was used to translate them to dummy variables.
Then, the data underwent feature scaling to normalize the range of both the dependent and the independent variables. For this feature scaling a robust scaling (Equation 1) was preferred since it performed better in case of outliers in the data (N.B. the features contained outliers) (Roy, 2020).
with Q 2 (X) the median of the explanatory variable and IQR(X) the interquartile range of variable X (i.e.
Then, the dataset was divided into a training set and a test set by applying an 80-20%-ratio. The training set was used once to fit the model. The test set was then used to make an objective evaluation of the fit of the model with unseen input data and determine the level of model generalization.
Statistical method: ordinary least squares Statistical analyses and modelling on/with the data were all conducted in Python with the statistical packages 'scikit-learn' and 'statsmodels' in combination with the data analysis and visualization package 'pandas'. All reported correlation values were obtained using the non-parametric Kendall's τ rank correlation (Kendall, 1938) since it did not rely on any underlying distribution for the analysed variables (N.B. the distribution of several features did not come from a Gaussian process). A p-value of .05 was used for null hypothesis significance testing (two-tailed).
Initially, a linear Ordinary Least Squares (OLS) regression model had been built for both a set of 'general building variables' and a set of 'socio-demographic and weather variables'. Given the suspected issue of multicollinearity (which occurs when two or more predictor variables in a linear regression model are highly correlated), the variance inflation factors (VIF) had been inspected. VIF indicated how much the variance of an estimated regression coefficient increased if the explanatory variables were correlated. If uncorrelated, VIF = 1. In this article, a threshold of 5 had been used, as suggested in the literature (Chan et al., 2012). If VIFs greater than 5 were found in the OLS regression, then one (or more) of those correlated variables were excluded one-by-one stepwise depending on the regression coefficient and the individual VIF until a model was obtained with no collinearity issues. Correcting for multicollinearity was important since neglecting to do so meant that the regression coefficients could not have been reliably interpreted.
Further, all p-values and bootstrapped 95% CIs of the predictors were checked for irregularities and further ensured model quality. Only the predictors for which the p-values were <.05 and the CIs did not include nil were significant and thus kept in the model (N.B. if a Figure 1. Shares of single-family houses according to their construction period. Table 2. Overview of the possible subsamples from the final sample O with their intended use for specific analyses on the total energy use (O1), the domestic electricity use (O2) and the energy use for SH and DHW (A2). Further, the number of houses in the sample is specified as well as a description of the single-family houses in the respective sample. coefficient's CI included nil, the model did not know whether the influence of that variable on the model output was positive or negative (thus insignificant); therefore, that variable should be excluded from the model to ensure model quality). The exclusion of explanatory variables from the model was once more done one by one stepwise. After building the individual models (i.e. 'general building variables' and 'socio-demographic and weather variables'), models were extended and combined until resulting in a final model encompassing all explanatory variables, tested and adjusted for multicollinearity and significant regression coefficients. As all model input and target variables were normalized, the magnitude of the regression coefficients gave an indication of the variables explanatory power/relative importance in the regression model. The dependent/target variable in all of the OLS regression models was the annual real primary total energy use [kWh/y]. An overview of the independent variables (or predictors) was given in Appendix B.
In order to fulfil the necessary assumptions for linear regression models, the model input variables were checked for linearity, autocorrelation and multicollinearity; the residuals were checked for independency, homoscedasticity and normality (Figure 2) (Hastie et al., 2009;James et al., 2013).

Real versus regulatory energy use
First, a comparison was made between the real and regulatory calculated primary energy use with the primary conversion factors from the regulatory calculation method (2.5 for electricity and 1.0 for natural gas), for the cases that had both available natural gas and electricity consumption figures and where electricity was not used for SH or DHW. Since the regulatory calculation method did not take into account end uses such as household appliances, lighting or gas cooking, one might reasonably have expected the regulatory calculated energy uses to be lower. For the studied sample, the regulatory calculated natural gas consumption in Flemish single-family houses was 3 times higher than the real natural gas consumption, while the regulatory calculated electricity consumption was much lower (even close to nil) than the real electricity consumption of the same dwellings (Figure 3).
In the case of the electricity consumption, the fact that the regulatory method did not take into account the energy use for household appliances, for lighting and for cooking (N.B. it was not known which households use electricity or gas for cooking) could have explained most of the underestimation in the regulatory calculated electricity consumption. In the case of the natural gas consumption, there were generally only two end uses for natural gas (i.e. SH and DHW) (N.B. although some households used natural gas for cooking). Therefore, the difference in consumption reflected discrepancies in the technical and behavioural modelling assumptions and simplifications (e.g. standard values, formulas etc.) used to calculate the demand for SH and DHW (N.B. the difference would be even slightly greater if the households that cooked with natural gas could be excluded). It was however not the aim of this article to identify where these discrepancies come from.

Energy use versus energy rating
The gap between the real and regulatory calculated energy use per EPC-label was analysed for the total primary energy use (sample O1). Then supplementary, separate analyses were drawn for the energy use for SH and DHW (sample A2) and the domestic electricity consumption (sample O2).

Total primary energy use
The regulatory energy performance rating, the EPCscore, proved to be a significant indicator for the normalized total primary energy use according to sample O1. However, the association was only moderate (t = .243, p , .001). Yet, the rating was not a good indicator for the absolute total primary energy use [kWh/y] (t = .039, p , .001).
In Figure 4, a comparison between the real and regulatory calculated total energy use [kWh/y] is shown as a function of the energy label. As expected, the regulatory calculated energy use was steadily decreasing with the improvement of the energy performance of the building, whereas the real energy use behaved quite  and eN(0, s 2 ).
differently. There seemed to be no difference in energy use for the moderate to poor ratings (label C, D, E, F, G) and a considerable difference for the better ratings (label A and B) (however still smaller than what was regulatory calculated). For the poor energy ratings (label E and F), there was even a slight drop noticeable compared to moderate energy ratings (label C and D). Figure 5 shows a comparison between the normalized real and regulatory calculated total energy use [kWh/m 2 ·y]. Compared to Figure 4, the regulatory calculated energy use, normalized per gross floor area, was also steadily decreasing with the improvement of the energy performance of the building, but the difference in real energy use for the moderate to poor labels was slightly larger (i.e. the houses with poor labels were smaller than those with moderate labels (Appendix C)), yet still very little. These findings once more indicated that an advancement from an energy label G or F to E, D or C would not deliver the expected improvement.  Figures 4 and 5 furthermore show that single-family houses with moderate to poor performance (label B, C, D, E, F, G) used substantially less energy than calculated. Vice versa, houses with better energy performance (label A) used marginally more than calculated. This result supported previous findings in the literature (Cayre et al., 2010(Cayre et al., , 2011Cozza et al., 2020;Kelly, 2011;Majcen et al., 2013aMajcen et al., , 2013bSunikka-Blank & Galvin, 2012). The real total energy use was overestimated by the regulatory calculation method by on average 147% (i.e. the average prediction error, which was the regulatory calculated energy use minus the real energy use as a percentage of the real energy use), and by on average −9% and 217% for the energy labels A and F respectively. The spread of relative prediction error at case level (i.e. −19%-279%) indicated a major lack of fit.

Natural gas consumption
An interesting insight into the total primary energy use was gained by analysing the energy use for SH and DHW (sample A2) and the domestic electricity consumption (sample O2). In Figure 6, a comparison between the real and regulatory calculated energy use for SH and DHW [kWh/y] is shown as a function of the energy label for houses that have only natural gas as an energy carrier for SH and DHW. Similar to the findings for the total primary energy use, the regulatory calculated energy use was steadily decreasing with the improvement of the energy performance of the building. Yet, for the real natural gas consumption only little difference was observed for the energy ratings (label A till G). Figure 7 shows a comparison between the normalized real and regulatory calculated energy use for SH and DHW [kWh/m 2 ·y]. Compared to Figure 6, the regulatory calculated energy use was also steadily decreasing with the improvement of the energy performance of the building, but the difference in real energy use for the energy ratings was slightly larger (i.e. the houses with poor labels were smaller than those with better labels (Appendix C)), yet still rather little. This result once more supported previous findings in the literature (Cayre et al., 2010(Cayre et al., , 2011Majcen et al., 2013aMajcen et al., , 2013bMerzkirch et al., 2014). The considerable difference in total energy use for the houses with better energy ratings (label A and B compared to label C till G), observed in Figures 4 and 5, proved to barely result from the energy use for SH and DHW (as the decrease is very small and steady), but had mainly to be due to other building services (see further) (e.g. PV-panels).
Furthermore, the energy rating showed a moderate correlation with the real energy use for SH and DHW (t = .311, p , .001), which was larger than observed for the total energy use. The real natural gas consumption was overestimated by the regulatory calculation method by on average 266% (i.e. the average prediction error, which is the regulatory calculated energy use minus the real energy use as a percentage of the real energy use), and by on average 55% and 327% for the energy labels A and F respectively. Additionally, at case level, the spread of relative prediction error (i.e. 45-542% overestimation at case-level across EPC-ratings) indicated a major lack of fit and a continuous overestimation for all cases in the sample.

Electricity consumption
In contrast to what was observed for the natural gas consumption, the regulatory calculated electricity consumption (sample O2) was much lower compared to the real electricity consumption by a large margin ( Figure 3). In Figure 8, a comparison between the real and regulatory calculated domestic electricity consumption (expressed in primary energy use) is shown as a function of the energy label for houses that do not have electricity as an energy carrier for SH and DHW. Additionally, Figure 9 shows a comparison between the normalized real and regulatory calculated domestic electricity use [kWh/m 2 ·y]. The figures show that both the real and regulatory calculated electricity consumption had no relation with the current EPC-label. This was of course largely due to the fact that only the auxiliary energy use (mainly pumps and fans), PV energy production and space cooling were taken into account in the regulatory calculated electricity calculation while the real electricity consumption also included the electricity consumption for household appliances, lighting and, for some cases, cooking.
This result supported previous findings in the literature (Majcen et al., 2013a(Majcen et al., , 2013b. A slightly concave shaped curve for the mean real electricity consumption was noticeable in the normalized results in Figure 9 (as well as in Figure 8), which did suggest that households in single-family houses with poor energy labels (i.e. label E-G or > 300 kWh/m 2 ·y) consumed less electricity than households with a moderate energy label (i.e. label C-D or 200-400 kWh/m 2 ·y). From Figure 9, it was noticeable that the concave trend was only slightly related with gross floor area as there were only minor differences with Figure 8. In the case of dwellings with EPC-label A ('0-100' kWh/m 2 ·y), it was proven (Appendix Figure D1 and D2) that the drop in average real and regulatory calculated electricity consumption was due to the increased presence of PV-panels (which was linked with better energy labels). The presence of PV-panels therefore also explained the difference in total energy use for houses with better energy labels (label A and B compared to label C till G).

Statistical data-driven modelling
In this section, first and foremost, the predictive performance of the regulatory EPC calculation method was evaluated. Then, the individual regression models, expanded and combined regression models were reported. Note that for all models, the residuals had been inspected to ensure the quality of the models. Inspection of the QQ plots of the residuals showed that the residuals were nearly normal except for some outliers at both ends and that they were linear over a wide range of values. Furthermore, the residuals versus fitted values indicated that the residuals were nearly uncorrelated to the fitted values. For brevity, not all residual plots were presented. Appendix E in the   Figure 10 shows a scatter plot of the real and regulatory calculated total energy use [kWh/y]. In an ideal scenario, a linear function y = x should have closely described the relationship between both parameters. As expected, an ideal relation was not obtained. In fact, a negative R 2 (R 2 = −15%) between both indicated that the average real total energy use of the stock was, on average, a better prediction of the real total energy use of an arbitrary single-family house than the annual regulatory calculated total energy use (N.B. R-Squared could be negative only when the chosen model did not follow the trend of the data, so fits worse than a horizontal line (Mann, 2020;Motulsky, 2013)). Furthermore, the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) when using the regulatory calculated energy use as the prediction of the real energy use were respectively 45,808 and 35,325 kWh/y. When considering that the average real total energy use of the studied sample was 27,286 kWh/y, the MAE and RMSE results were 1.29-1.68 times larger.

Regulatory calculation method
General building characteristics OLS model General building variables explained 38.1% (adjusted R 2 ) of the variability in the real total energy use (i.e. on the test set). Two variables showed VIF values above the chosen threshold criterion so one of them had to be excluded one-by-one stepwise (i.e. building volume). Further, five variables had CIs including nil and a p-value >.05 meaning that some of those had to be excluded one-by-one stepwise as well (i.e. type of ventilation system, construction year, roof insulation?, number of floors and SH energy carrier). After exclusion of those variables, an OLS regression rerun on the remaining variables resulted in a model that explained 41.3% (R 2 adj ) of the variability in the real total energy use. Table 3 shows the standardized coefficients of the reduced OLS model (i.e. the standardized coefficients β OLS , their 95% CIs and the p-values). Five variables were significant: A larger dwelling size was associated with higher total energy use, the presence of renewable systems (i.e. PV-panels, heat pump and solar collector) was associated with using less total energy use, floor and air heating was associated with more total energy use compared to radiator space heating, having space cooling was associated with having higher total energy use and detached houses were associated with having higher total energy use compared to semi-detached houses while terraced houses were associated with having less total energy use compared to semi-detached houses.

Socio-demographic and weather OLS model
The socio-demographic and weather model explained 11.7% (R 2 adj ) of the variability in the real total energy use. Six variables showed VIF values above the chosen threshold criterion so some of those had to be excluded one-by-one stepwise (i.e. the number of children, average age of HH, HH with children, the number of 65+ per HH). Further, six variables had CIs including nil and a p-value >.05 meaning that some of those had to be excluded one-by-one stepwise as well (i.e. HH with children 5-12 and 13-18, HH with adults 30-44 and 65+, HH composition 2Ad 1Child and weighted annual GHI). After exclusion of those variables, an OLS regression rerun on the remaining variables resulted in a model that explained 13.6% (R 2 adj ) of the variability in the real total energy use. Table 4 shows the standardized coefficients of the reduced OLS model. Three variables were significant: A larger household size and a larger number of adults were associated with higher total energy use, singles were associated with less total energy use and a higher number annual heating degree days at the dwelling's location was associated with higher total energy use.

Extended building characteristic OLS model
In the next step, the general building characteristics OLS model was extended with a set of detailed building characteristics and intermediate results from the regulatory energy calculation method to test for increments  in explanatory power. For the general building characteristics model, only the variables that had remained after VIF-checks had been included. The extended building characteristics model explained 39.7% (R 2 adj ) of the variability in real total energy use. Twelve variables showed VIF values above the chosen threshold criterion so some of those had to be excluded one-by-one stepwise (i.e. EPC-calculated total energy use and the energy use for space heating, the heat transfer coefficient for transmission, the glass, window, wall, groundfloor, Figure 10. Scatter density plot of the real and regulatory calculated annual primary total energy use in single-family houses with only natural gas and/or electricity as energy carrier for both SH and DHW (each density line holds 10% of the data).  roof and building envelope surface and the total average U-value). Further, seven variables had CIs including nil meaning that some of those had to be excluded one-byone stepwise as well (i.e. EPC-calculated energy use for auxiliary, PV and space cooling, building compactness, U-values for window and roof and thermal mass). An OLS regression rerun on the remaining variables resulted in a model that explained 41.9% (R 2 adj ) of the variability in the real total energy use. This 0.6% increase in R 2 adj compared to the general building characteristics OLS model was not significant (p ≮ .05). Also the remaining detailed building characteristic variables were not very significant to the model, their coefficients were rather low. Table 5 shows the standardized coefficients of the reduced OLS model.

Combined OLS model
In a last step, the two different individual models were combined together for increments in explanatory power through adding additional variables. For the building and socio-demographic and weather model, only the variables that had remained after VIF-and CI-checks had been included. The combined 'general building and socio-demographic and weather' model explained 46.6% (R 2 adj ) of the variability in real total energy use. No variables showed VIF values above the chosen threshold criterion and no variables had CIs including nil. An OLS regression rerun on the remaining variables was thus not necessary. This 5.3% increase in R 2 adj compared to the model with only general building characteristic variables was significant (p < .01). Also in comparison to the socio-demographic and weather OLS model, the increase in R 2 adj was significant (p < .01). Table 6 summarizes the standardized coefficients for all variables that remained after VIFand CI-checks. Table 7 shows the R 2 adj of the individual OLS models and the extended and combined OLS models. As expected, building characteristic variables explained by far the most of the variability in real total energy use, on their own, and also when added to the socio-demographic and weather variables. Yet, certain variables, which were expected to be strong indicators of the total energy use (Brounen et al., 2012;Cozza et al., 2020), such as the thermal properties of the building envelope (U-value) were less important in the found models (due to them being mostly standard values instead of measured values). The available socio-demographic variables played a lesser but still significant role in explaining real total energy use. Extra detailed building variables did not add much information to the model to predict more of the variability in the real total energy use. Therefore, in explaining more of the variability (and avoiding collinearity problems), the findings seemed to suggest that the modeller can better  gather more different types of variables (i.e. building characteristics, socio-demographics, incomes, weather, appliance ownerships etc.) rather than more detailed variables within the same type of variables (e.g. extra detailed variables related to building characteristics). However, while this finding is valid for the available parameters, it might (also and in part) have resulted from a lack of accuracy of the detailed parameters as stored in the EPC-registry, resulting from inaccurate reporting by the assessors and/or resulting from the use of (conservative) standard values. Further, as a model with general building characteristics (i.e. building features that inhabitants can easily fill in themselves) performed equally well compared to a model including detailed building characteristics from the EPC-registry and intermediate results from the regulatory calculation, the final model could be used (e.g. in an online tool) as a primary model to inform inhabitants, tenants and home owners about their energy use, without needing extra inputs from the EPC-registry to improve performance.

Discussion
The strength of this study lies in the very large sample of Flemish single-family houses available for analysis, the availability of both natural gas and electricity consumption figures to explore and corroborate findings observed at the level of the houses' total energy use, and the availability, in addition to building characteristic data, of also the socio-demographic and weather data to train the OLS regression models. This research is the most comprehensive so far conducted in Belgium and since the EPBD implementation in Flanders is similar to the one in the Brussels Capital Region and Wallonia (unlike the implementation in e.g. The Netherlands, Germany, Switzerland and the United Kingdom), the results are also relevant for other regions in Belgium. Also, it is one of the few studies to explicitly address the issue of multicollinearity and the importance of bootstrapped confidence intervals to ensure model quality and reliable regression coefficients in classical OLS regression models.

Implications of the study
The analysis of the gap between the real and regulatory calculated total building energy use revealed that the regulatory calculation method overestimated the true energy savings by on average 399% (i.e. label F to A). This is important for energy policy making as the expected reduction in energy use from energy retrofit turns out to be only a fraction of what was planned. This result proves once again Cozza et al., 2020;Crawley et al., 2019;Filippidou et al., 2018;Mac Uidhir et al., 2020) the need to revise the calculation and assessment method applied for energy performance certificates, or at least to adapt the standard values used as part of it, if the aim is to more accurately estimate the energy use and the energy savings potential at building stock level. However, defining standard values as conservative values is sound from the perspective of an incentivising quality framework: good standard values would reduce the incentive for good design and thorough assessment. Incentivising and accurate performance evaluations can thus, to some extent, be conflicting aims within the EPC-framework. Also, the results seemed to indicate that such conservative standard values falsely inflate the theoretical energy savings potential. The current grant scheme (FG, 2020a;Fluvius, 2020) provides uniform subsidies (i.e. EPC label bonus for label improvement) for single-family houses with label E or worse who retrofit to label C or better, for which the amount linearly increases from post-retrofit label C till A. Based on the studies' findings, the existing approaches to subsidizing energy label improvements should be reviewed since in reality a shift from label F till label C, as currently assessed, is a very poor indicator for obtaining actual energy savings.
Further, only 46.6% of the variability in total energy use was explained by the final OLS regression model. This is considerably higher compared to other studies in Germany (Rehdanz, 2007), United Kingdom (Meier & Rehdanz, 2010), The Netherlands (Brounen et al., 2012), and Switzerland (Cozza et al., 2020), where only 23-40% of the variability was explained by OLS regression models with similar predictors. Yet, given the number of building-related, socio-demographic and weather predictors that were available in the database, the authors found the models' performance surprisingly limited. This raises the question what other factors (preferably available in existing big databases, since a surveyapproach would not be feasible) determine real total energy use, that were not available in this study. The very limited availability of user-related data did not allow for an estimation of user behaviour in the different households, while this is known to be an important factor defining real energy use in relation to the theoretical performance labels. Unsurprisingly, the performance of the OLS regression model is therefore too poor for inference Last, the study shows that multicollinearity will likely be an issue when doing OLS regression modelling at building stock level. Hence, it is important that any regression analysis checks for multicollinearity and reports the bootstrapped confidence intervals of the regression coefficients in order to end up with simple, reliable models and meaningful regression coefficients.

Limitations of the study
Three points concerning the quality of the data used should be noted. First, there is concern about the quality of the inspections on which the input data for the energy rating calculations are based. A study carried out by VEKA reported that in a sample of 302 EPC reports issued in 2013, 42% of the inspected EPC energy ratings were incorrect and the responsible EPC-assessors had to pay a fine or were suspended (CA EPBD, 2016). A subsequent study by VEKA in 2018 (CA EPBD, 2019) found that in a sample of 310 EPC reports issued, 92% of the energy performance ratings were correct or sufficiently enough in agreement with the assessment procedure (i.e. assessors were found to have correctly issued their EPC reports or received a warning but no fine), so the level of quality to which the official assessment procedure was followed had improved. It is however not clear what level of incorrectness has to be present to either receive a warning or a fine. The other 8% EPC assessors had to pay a fine or were suspended.
Secondly, the fact that an EPC assessor did an acceptable EPC-assessment does not guarantee that the reported characteristics are accurate. After all, a large number of building characteristics in the calculation method rely on standard values (especially in this study on existing EPC-rated houses with building permits from before 2006) (i.e. due to the EPC inspection protocol (EPC Insp., 2015) and the calculation method (EPC Form., 2015)) that are selected in function of limited available information or evidence (e.g. selected U-values in function of the age of the building). As discussed in the introduction, an earlier study by Delghust (2015) on a small building sample in Flanders also proved the major influence that the energy performance assessor had on the reported energy rating.
Thirdly, the Flemish energy utilities provided normalized real annual natural gas and electricity consumption figures per household between 2012 and 2019. From their normalization documentation and post-processing steps, it is however clear that only the natural gas consumption figures were climate-normalized, while electricity consumption figures were not. As the authors were not able to access the raw energy use figures (i.e. without normalization or for the actual metering periods), the consumption figures could not be corrected for this possible bias.

Conclusions
Using data from a large, regionally representative sample of 122,680 Flemish existing single-family houses, comparisons between the real and regulatory total energy use generally support previous findings in the literature. Yet, the overestimation by the Flemish building energy certification method has been exceedingly large compared to studies in most other EU countries, such as The Netherlands, Germany, United Kingdom and Switzerland (Figure 11 and Appendix F). Furthermore, this study has shown that the Flemish EPC-labels hold very poor correlation with (and generally prove to be very poor indicators of) the actual energy use for (and actual energy savings associated with) Flemish existing single-family houses (especially for the poorer energy labels). Therefore, caution is needed when using the Flemish EPC certificates, which have been initiated to 'provide more information about the energy use and efficiency of the dwelling as a Figure 11. Average gap (%) between the regulatory calculated and real total energy use per EPC-label (following the Flemish EPC label classification) for The Netherlands (NL), Germany (GE), the United Kingdom (UK), Switzerland (CH) and Flanders (FL).
tool to make it easy to compare and assess the energy performance of candidate homes and predict real energy savings resulting from tailored retrofit recommendations' (EPC Form., 2015;EPC Insp., 2015).
In order to obtain more accurate estimations of actual building energy uses, the authors suggest to revise the calculation and assessment method applied for energy performance certificates by utilizing at least more realistic standard values and taking into account zonal and intermittency effects in the formulas (as is already the case in The Netherlands, Germany and United Kingdom) or alternatively to switch to calculation methods developed based on real building energy use data.
Although the explanatory power of the trained regression models has proven to be considerably higher compared to related studies in the literature, their overall performance shows that the variability in actual total energy use that can be explained is still fairly limited. Even using all variables measuring a variety of predictor types, the models have only explained just under half of the variability in total energy use, indicating the need for more and better data (e.g. detailed v 50 -values, occupant behaviour, appliance ownership, income).
Results of the trained regression models have revealed that the explanatory power is still too poor to inform inhabitants, tenants and home owners in an accurate manner about their individual annual building energy use (e.g. through online tools and apps). Nonetheless, the trained regression models can certainly be utilized for prediction and scenario analysis for larger groups of buildings (e.g. the social housing patrimony, city-scale analyses) as the prediction errors average out, and the regulatory method's performance (which is currently used) is considerably worse. Stakeholders can therefore use the trained models as is (i.e. due to their simplicity), yet also as a reference of what variables have the most explanatory power (and are easy enough to fill in, without much effort, by non-expert users) for internal model development.
Finally, the study has shown important methodological implications, i.e. that checking and addressing multicollinearity and bootstrapped confidence intervals of the regression coefficients is crucial in performing regression analysis. The individual baseline regression models, extended and combined models showed the presence of a significant extent of multicollinearity. Also, a considerable number of the statistical variables showed bootstrapped confidence intervals including nil. Correcting both model quality measures required multiple model reruns to ensure reliable coefficients for individual predictors and models that can be interpreted and used by stakeholders.