An open workflow to gain insights about low‐likelihood high‐impact weather events from initialized predictions

Low-likelihood weather events can cause dramatic impacts, especially when they are unprecedented. In 2020, amongst other high-impact weather events, UK floods caused more than £300 million in damage, prolonged heat over Siberia led to infrastructure failure and permafrost thawing, while wildfires ravaged California. Such rare phenomena cannot be studied well from historical records or reanalysis data. One way to improve our awareness is to exploit ensemble prediction systems, which represent large samples of simulated weather events. This method, known as UNSEEN (UNprecedented Simulated Extremes using ENsembles), has been successfully applied in several scientific studies, but uptake is hindered by large data and processing requirements, and by uncertainty regarding the credibility of the simulations. Here, we provide a protocol to apply and ensure credibility of UNSEEN for studying low-likelihood high-impact weather events globally, including an open workflow based on Copernicus Climate Change Service (C3S) seasonal predictions. Demonstrating the workflow using European Centre for Medium-Range Weather Forecasts (ECMWF) SEAS5, we find that the 2020 March-May Siberian heatwave was predicted by one of the ensemble members; and that the record-shattering August 2020 California-Mexico temperatures were part of a strong increasing trend. However, each of the case studies exposes challenges with respect to the credibility of UNSEEN and the sensitivity of the outcomes to user decisions. We conclude that UNSEEN can provide new insights about low-likelihood weather events when the decisions are transparent, and the challenges and sensitivities are acknowledged. Anticipating plausible low-likelihood extreme events and uncovering unforeseen hazards under a changing climate warrants further research at the science-policy interface to manage high impacts.


1 | INTRODUCTION
Understanding the likelihood, trends, and driving processes of extreme hydro-meteorological events is crucial for decision-making (Salas et al., 2018; Slater et al., 2021). However, it is challenging to compute robust statistics for low-likelihood weather events from short historical records, especially in data-scarce regions. Instrumental records are typically only a few decades long and are not available everywhere (e.g., Alexander, 2016). Reanalysis products are increasingly employed to estimate extremes, as they blend observational datasets with model simulations into spatially and temporally coherent outputs, that is, 'maps without gaps' (e.g., European Centre for Medium-Range Weather Forecasts [ECMWF], 2018). For example, the ECMWF ERA5 reanalysis (Hersbach et al., 2020) has been used to estimate rainfall intensity-duration-frequency (IDF) curves globally (Courty et al., 2019), trends in extremes (Geirinhas et al., 2021; Kim et al., 2021), driving processes behind extreme events (Grazzini et al., 2020), and extreme weather indices (Kennedy-Asser et al., 2021; Wehner et al., 2020). Although reanalyses overcome spatial data scarcity, they can exhibit model deficiencies or inhomogeneities (Parker, 2016), and their typical length (~70 years) may still be a limiting factor when studying extreme events.
Whereas event attribution methods raise awareness about the anthropogenic influence on high-impact weather events that have occurred (e.g., Allen, 2003; Philip et al., 2020; Stott et al., 2016) and large ensembles of climate model simulations help project future high-impact events (e.g., Deser et al., 2020; Mankin et al., 2020), UNSEEN may raise awareness about low-likelihood high-impact weather events that could occur in the present climate. UNSEEN, therefore, has considerable potential to improve engineering design standards (e.g., Jain et al., 2020; Kent et al., 2022; Thompson et al., 2017), to detect and explain trends in rare extremes over recent decades (e.g., Kay et al., 2020; Kelder et al., 2020), and to develop event-based storylines of plausible yet unseen weather extremes (Matthews et al., 2016; Sillmann et al., 2021). However, uptake is hindered by large data and processing requirements, and there are challenges associated with providing credible, multiple 'maps without gaps' from initialized predictions (Box 2). This paper presents a protocol and open workflow to apply and ensure credibility of UNSEEN for studying low-likelihood high-impact weather events globally (Figure 1). The procedure begins by selecting the type of hydro-meteorological event of interest with requisite spatial and temporal scales (step 1). The type of event being studied informs the selection of the most appropriate prediction system (step 2). We then discuss how data can be retrieved (step 3) and pre-processed (step 4) before being evaluated (step 5). For events that are not deemed credible, practical solutions are discussed (step 6). The final step is to apply statistics and gain insights about plausible weather events, provided the large ensemble was deemed credible or identified issues could be resolved (step 7). A technical workflow (UNSEEN-open, documented at https://unseen-open.readthedocs.io) was developed for steps 3-5 during the ECMWF Summer of Weather Code 2020 (https://esowc.ecmwf.int/). This workflow facilitates the process of retrieving, pre-processing, and evaluating the latest ECMWF seasonal prediction system 5 (SEAS5; Johnson et al., 2019) but could be adapted for other modelling systems. Through the protocol and workflow, this paper aims to transparently detail the data and code that are applied, decisions that were made, challenges that were faced, and sensitivities that may influence the outcome of the UNSEEN approach.

BOX 2 Three challenges associated with generating an UNSEEN ensemble

The credibility of initialized ensemble predictions to represent large samples of weather events hinges on three common challenges faced by all prediction systems: the independence of the ensemble members, the stability of the model, and the fidelity of the simulations (Kelder et al., 2020; Thompson et al., 2017, 2019).

Independence. Ensemble member independence (i.e., the uniqueness of each model ensemble member) is closely linked to the spread and predictability of forecasts. When a forecast is initialized, ensemble member independence is low because the members differ only slightly in their initial conditions. The spread of the individual ensemble members increases over the forecasting horizon because each develops its own 'virtual world' induced by stochastic processes in the atmosphere. The importance of ensemble member independence is best explained through an example from extreme value analysis. Dam safety standards require preparedness for very unlikely scenarios, such as the 10,000-year inflow return value. Large ensemble hindcasts might be used to generate an UNSEEN ensemble that can capture such events. However, if the UNSEEN ensemble members are correlated, one might think that 10,000 years were simulated adequately, whereas the effective ensemble size is in practice much smaller.

Stability. Ensemble members may drift away from their initial climatology (near to an observed state) towards a steady virtual climatology (Covey et al., 2006; Hermanson et al., 2018; Sen et al., 2009, 2013). Such drifts are not caused by external forcing or internal low-frequency variability but by numerical errors (e.g., Liepert & Previdi, 2012; Lucarini & Ragone, 2011), model imbalances, and/or discontinuities (e.g., Rahmstorf, 1995). Drift is mostly present in physical ocean variables but can be evident in atmospheric properties (Sen et al., 2013). Hence, model instability (i.e., the presence of drift) may deteriorate the realism of the hindcast ensemble.

Fidelity. Model simulations are virtual representations of reality, and 'fidelity' refers to their ability to realistically simulate the target event(s). Hence, for robust analyses using climate model simulations, their 'virtual world' must realistically describe 'reality', that is, the extreme event being studied. Systematic errors, such as in cloud microphysics, tropical cyclones, convective precipitation, teleconnections, and synoptic regimes, may bias the simulation of extreme events in numerical prediction systems (e.g., Zadra et al., 2018). Processes that occur on scales smaller than the model grid cannot be resolved but must be parameterized, leading to lower fidelity (Sillmann et al., 2017). Furthermore, mechanisms such as self-intensification of droughts via land-atmosphere feedbacks are currently not well represented by climate models (Miralles et al., 2019). Therefore, an evaluation of model fidelity is crucial.
The following sections step through the protocol, illustrated by three worked examples of extreme events that occurred in 2020. During February, the United Kingdom endured floods that caused more than £300 million in damage and destroyed 3400 houses (Copernicus EMS, 2020). Later that year, prolonged heat over Siberia caused wildfires, invasion of pests, and infrastructure failure, as well as global impacts through the release of greenhouse gases from thawing permafrost (Ciavarella et al., 2021; Overland & Wang, 2021). Meanwhile, wildfires in California contributed to the (then) worst fire season on record (Pickrell & Pennisi, 2020), amplifying hardships faced by communities during the COVID-19 pandemic (Moore et al., 2020). Section 2 describes the seven steps of the protocol. Section 3 provides an overview for each of the three case studies. In Section 4, we discuss the challenges and sensitivities to be mindful of when applying UNSEEN. Section 5 concludes the paper with final remarks.
2 | THE PROTOCOL

2.1 | Step 1: Define the event

To apply the UNSEEN method, a hydro-meteorological event is first defined. The event definition depends on the scope of the analysis in terms of the target domain, timescale, and (meteorological) variable of interest. Any domain, timescale, and variable(s) can be selected, for example, to estimate design values or to quantify the likelihood of unprecedented events. The event definition should reflect the impact being studied. For example, larger spatial and temporal scales are used to study winter precipitation than summer precipitation over the United Kingdom (Kent et al., 2022), reflecting widespread flooding from winter storms as opposed to more localized convective storms in summer. In the examples below, events were defined to best represent the footprint of historical low-likelihood high-impact weather incidents, because we want to assess whether these phenomena could have been 'seen' before they occurred. For some cases, the definition is straightforward, such as for studying UK-average extreme precipitation in February (Section 3.3). In other cases, such as for the Siberian heatwave (Figure 2a,c and Section 3.1) and temperature anomalies during peak California wildfire activity (Figure 2b,d and Section 3.2), the domain and timescale may be informed by an assessment of the event anomaly or the region where historical records were broken. Detailed protocols for defining the extent, timescale, and meteorological variable representative of target events can be found in Philip et al. (2020).

2.2 | Step 2: Select an appropriate prediction system

Increasing computational resources and improved physical understanding of the Earth System have led to advances in seamless prediction systems over recent decades (Alley et al., 2019; Bauer et al., 2015; Hoskins, 2013; Palmer, 2019). The resolution and prediction timescale of ensemble prediction systems are important considerations to inform the choice of model (Figure 3). Predictions ranging from weeks to years provide high-resolution and independent simulated events, well suited for regional-scale multi-day to monthly events, such as heatwaves (Cowan et al., 2020; Kay et al., 2020; Thompson et al., 2019), cold spells, wind storms (Walz & Leckebusch, 2019), and extreme precipitation (Jain et al., 2020; Kelder et al., 2020; Thompson et al., 2017) (Figure 3, Table 1). For sub-daily extremes, such as ocean wind and wave extremes, convective storms, or wind gusts, high-resolution simulations are required to resolve sub-grid processes (Sillmann et al., 2017). In general, global medium-range simulations (10-15 days) are likely to be most appropriate for studying local, short-duration events (e.g., Breivik et al., 2013, 2014; Meucci et al., 2018; Osinski et al., 2016; Table 1), or additional downscaling might be needed (e.g., Guillod et al., 2018; Poschlod et al., 2021). For events with long persistence, such as droughts, the ensemble members from medium-range predictions are unlikely to be unique (low independence, Box 2). Hence, decadal predictions (1-10 years) are recommended for events with long memory (e.g., Hall et al., 2019, 2020; Kay et al., 2018).

2.3 | Step 3: Retrieve the ensemble hindcast and reference dataset

The UNSEEN-open technical workflow was developed for steps 3-5 of the protocol with a focus on the SEAS5 prediction system, but with other systems from the Copernicus Climate Change Service (C3S) also in mind (see Figure 1).

FIGURE 3 The appropriateness of hindcast ensembles for diverse types of events. This schematic shows which prediction systems are most likely to be appropriate for diverse types of extreme events. The horizontal axis represents seamless prediction timescales, where the arrows beneath indicate the different weather prediction systems covering the respective prediction timescales. The corresponding types of extreme event range from local, short-duration events requiring high-resolution simulations, to regional, persistent events involving long-duration simulations. The gradual shading indicates that multiple prediction systems might be equally appropriate for some types of events. Fading on the left-hand side is a reminder that the first few days of forecasts cannot be used because of low independence between ensemble members due to similar initial conditions. The schematic is based on Table 1.
TABLE 1 Hydro-meteorological extremes (variable) with spatial resolution and timescale that have been studied by pooling ensembles from medium-range, seasonal, and decadal prediction systems. Note: We present the spatial resolution of the most recent prediction systems for consistency, but some of the cited studies may have used earlier systems with lower resolutions. GCM, Global Climate Model; RCM, Regional Climate Model.

Data volumes and retrieval time are optimized by (1) retrieving precomputed monthly statistics (minimum, maximum, or average) instead of retrieving all forecasts in full; (2) selecting only the target domain and months, then converting those into the relevant initialization months and lead times required for the request; and (3) optimizing retrieval functions to the structure of the ECMWF MARS archive (see the ECMWF documentation). SEAS5 ensemble members and lead times are pooled to create the UNSEEN ensemble (e.g., Kelder et al., 2020). For example, UK February precipitation is forecasted from six initialization months (i.e., the preceding September to February; Figure 4a). For longer-duration 'target events', such as March-April-May average temperature over Siberia, there are fewer forecasts that can be pooled together (from four initialization months, that is, the preceding December to March; Figure 4b). We discard the first month of the forecast because ensemble members are still likely to be overly constrained by initial conditions (Kelder et al., 2020). In the end, we are left with five initialization months for monthly blocks (such as the United Kingdom and California examples) and three initialization months for seasonal blocks (such as the Siberia example). Pooling across the 25 ensemble members yields a potential increase to 125 samples (monthly blocks) and 75 samples (seasonal blocks) compared with a single observed period.
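To make the pooling arithmetic concrete, the minimal Python sketch below (a hypothetical helper, not taken verbatim from UNSEEN-open) lists the initialization months and lead times that cover a given target month, discarding lead time 1 as described above.

```python
# Minimal sketch of the pooling logic described above: for a given target
# month, list the SEAS5 initialization months whose forecasts cover it at
# lead times 2-6 months.
def init_months(target_month: int, min_lead: int = 2, max_lead: int = 6):
    """Return (initialization_month, lead_time) pairs covering the target.

    Lead time 1 is the initialization month itself and is discarded
    because members are then still overly constrained by initial conditions.
    """
    pairs = []
    for lead in range(min_lead, max_lead + 1):
        init = (target_month - lead) % 12 + 1  # wrap around the calendar
        pairs.append((init, lead))
    return pairs

# February (month 2): five usable initializations, September to January,
# giving 5 lead times x 25 members = 125 pooled samples per year.
print(init_months(2))  # [(1, 2), (12, 3), (11, 4), (10, 5), (9, 6)]
```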
SEAS5 is, at present, the latest seasonal prediction system of ECMWF, launched in November 2017. SEAS5 is run at a 36 km horizontal resolution and is upscaled to a 1° grid to create a homogeneous dataset with the same resolution for all C3S seasonal prediction systems. SEAS5 contains 51 ensemble members (25 members were available through C3S at the time of analysis). The historical seasonal predictions that are used to generate the UNSEEN ensemble consist of two datasets: archived operational forecasts since 2017 (years 2017-2020 are used) and hindcasts that were originally run to evaluate and calibrate the operational forecasts (years 1981-2016). Inhomogeneity between the hindcasts and forecasts is not expected, but can occur because of differences in initialization: SEAS5 hindcasts are initialized from ERA-Interim (Dee et al., 2011) and OCEAN5 (Zuo et al., 2018), whereas the operational forecasts use ECMWF operational analyses instead of ERA-Interim. For further details on SEAS5, see the ECMWF page (https://confluence.ecmwf.int/display/CKB/C3S+Seasonal+Forecasts) and Johnson et al. (2019).

2.4 | Step 4: Pre-process the data
In the pre-processing step, the retrieved files are merged into one multi-dimensional dataset (Hoyer & Hamman, 2017). This dataset can be stored as a NetCDF file containing the dimensions latitude, longitude, ensemble member, time (years), and lead time (initialization months). Then, a domain and timescale representative of the event being studied are selected. In the workflow, the resulting data array (with dimensions ensemble member, time, and lead time) is converted to a data frame (with variables ensemble member, time, and lead time) and stored as a CSV file to match ggplot functionalities in R. This step is provided in Python and is run on a local machine.
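As an illustration, a minimal xarray sketch of this step is given below; the file paths, variable name ("tprate"), and domain bounds are placeholders rather than the exact UNSEEN-open code.

```python
# Sketch of the pre-processing step under the assumptions stated above.
import xarray as xr

# Merge the per-initialization NetCDF files into one dataset with
# dimensions (latitude, longitude, member, time, leadtime).
ds = xr.open_mfdataset("data/seas5_*.nc", combine="nested",
                       concat_dim="leadtime")

# Select and average over the event domain (here an illustrative UK box;
# note that some datasets store latitude in descending order).
box = ds.sel(latitude=slice(60, 49), longitude=slice(-8, 2))
series = box["tprate"].mean(dim=("latitude", "longitude"))

# Flatten to a (member, year, leadtime) table and store as CSV for the
# R-based evaluation and ggplot figures.
series.to_dataframe().to_csv("UK_February_precipitation.csv")
```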

2.5 | Step 5: Evaluate the independence, stability, and fidelity

In the evaluation step, ensemble member independence, model stability, and model fidelity are tested (Box 2).

FIGURE 4 Schematic showing how forecasts initialized in different months can be pooled to extend the sample size for the same target event. The grey horizontal bars represent the seasonal forecasts, which are initialized every (leftmost) month and run for 6 months after their initialization. Dark shading indicates the relevant section of the target period (lead times greater than 1 month) used for the February UK precipitation (a) and March-May Siberian heat (b) case studies.

Thompson et al. (2017) developed the model fidelity test, and Thompson et al. (2019) discussed its general applicability, also in terms of ensemble member independence and model stability. Kelder et al. (2020) then developed methods for evaluating independence and stability for a case study of extreme precipitation events over Norway and Svalbard. Here, we build upon and extend these evaluation tests so they can be tailored to the selected event definition. We provide functions for testing the three criteria in the 'UNSEEN' R-package (https://github.com/timokelder/UNSEEN). We switch from Python to R because we believe R has better functionality for extreme value statistics. This section describes the evaluation tests.
Ensemble member independence is tested using a modification of the 'potential predictability' test, that is, the ability of the forecast to predict itself (Kelder et al., 2020; Lavers et al., 2014; Wilks, 2011). The correlation between each pair of ensemble members over the hindcast period is calculated, resulting in 300 distinct pairs per lead time (Kelder et al., 2020). A trend over the hindcast period can result in artificial detection of dependence. The independence test, therefore, includes detrending of values by first-differencing, whereby a new series is created from the differences between each successive value in a time series. Then, the non-parametric Spearman rank correlation between ensemble members is compared with the correlation arising by chance from uncorrelated members. This may be represented as a boxplot of the correlations between all pairs of ensemble members, with background values for each of the boxplot statistics given by those expected between all pairs of uncorrelated members.
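The snippet below sketches this test in Python under simplifying assumptions (synthetic data standing in for one lead time of the hindcast); the published implementation lives in the UNSEEN R-package.

```python
# Sketch of the independence test: first-difference each member, compute
# Spearman correlations for all member pairs, and compare against pairs of
# deliberately shuffled (uncorrelated) members as the chance baseline.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def pairwise_spearman(ens):
    """ens: array of shape (members, years) for a single lead time."""
    diffs = np.diff(ens, axis=1)  # first-differencing removes trends
    return np.array([spearmanr(diffs[i], diffs[j])[0]
                     for i, j in combinations(range(len(diffs)), 2)])

rng = np.random.default_rng(42)
ens = rng.normal(size=(25, 36))          # 25 members x 36 hindcast years
corr = pairwise_spearman(ens)            # 300 distinct pairs, as in the text
null = pairwise_spearman(rng.permuted(ens, axis=1))  # chance baseline
print(f"median pair correlation: {np.median(corr):.2f} "
      f"(chance: {np.median(null):.2f})")
```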
For example, for the Siberian heat case study, the independence test shows that there is stronger correlation between ensemble members than would be expected by chance (Figure 5). The dependence between ensemble members is most pronounced for the shortest lead time used (recalling from Section 2.3 that the first month of the forecast is removed to avoid dependence). The correlation is not caused by a trend because the time series have been detrended.
FIGURE 5 Boxplots show the median, interquartile range, 1.5× interquartile range, and outliers of correlation values (small squares). Grey horizontal shading denotes the correlation (median, interquartile range, and 1.5× interquartile range) that is expected by chance. Arrows indicate lead time 2, where the median and interquartile range are higher than would be expected by chance, revealing that the ensemble members are not fully independent.

Model stability is tested by comparing distributions between the different lead times (Kelder et al., 2020), which is performed on the original, raw data. For example, for the California wildfire danger case study, we find that August temperatures tend to drift over forecast lead time (Figure 6). First, the probability density function is plotted for each lead time (Figure 6a). This shows that lead time 6 seems to be colder in the tail of the distribution, which contains the extreme values of interest. Then, an empirical extreme value distribution is plotted (Figure 6b), which focuses more on the tail of the distribution. The extreme value distributions show that the drift is less pronounced for rare events. For more details about the model stability test, refer to Kelder et al. (2020).
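A minimal sketch of such a stability check, on synthetic data, is given below; the UNSEEN R-package provides the published version.

```python
# Sketch of the stability test: empirical return-period curves per lead
# time on raw (non-bias-adjusted) data; diverging curves indicate drift.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "leadtime": np.repeat([2, 3, 4, 5, 6], 25 * 36),  # 25 members x 36 years
    "value": rng.gumbel(size=5 * 25 * 36),            # synthetic stand-in
})

for lead, grp in df.groupby("leadtime"):
    x = np.sort(grp["value"].to_numpy())[::-1]        # descending order
    rp = (len(x) + 1) / np.arange(1, len(x) + 1)      # Weibull plotting position
    plt.semilogx(rp, x, label=f"lead time {lead}")

plt.xlabel("Return period (years)")
plt.ylabel("Value")
plt.legend()
plt.show()
```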
Model fidelity is tested by evaluating the consistency between the hindcast ensemble and a reference dataset. For illustrative purposes, the hindcast ensemble for February rainfall is bootstrapped into 10,000 series of equal length to the reference dataset, with all lead times pooled together (Kelder et al., 2020; Thompson et al., 2017). The mean, standard deviation, skewness, and kurtosis are calculated for each of the series. Histograms of these distribution characteristics are plotted, including their 95% confidence intervals. The range of the distribution characteristics within the hindcast ensemble can then be compared with the reference dataset (Figure 7).
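The sketch below illustrates the bootstrap logic on synthetic data; the published test is implemented in the UNSEEN R-package.

```python
# Sketch of the fidelity test: bootstrap the pooled hindcast into 10,000
# pseudo-series of the reference length and check whether the reference
# moments fall within the resulting 95% intervals.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(11)
unseen = rng.gumbel(size=125 * 40)   # stand-in for the pooled ensemble
obs = rng.gumbel(size=40)            # stand-in for the reference series

boot = rng.choice(unseen, size=(10_000, obs.size), replace=True)
for name, stat in [("mean", np.mean), ("std", np.std),
                   ("skewness", skew), ("kurtosis", kurtosis)]:
    lo, hi = np.percentile(stat(boot, axis=1), [2.5, 97.5])
    verdict = "within" if lo <= stat(obs) <= hi else "outside"
    print(f"{name}: reference {verdict} [{lo:.2f}, {hi:.2f}]")
```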

2.6 | Step 6: Resolve detected issues
If the above three tests are passed, the ensemble is considered credible for applications (Figure 1). However, if one or more tests fail, identified issues need to be resolved prior to further use. This section discusses potential solutions to resolve the issues, which are summarized in Table 2.
When the independence or stability tests are failed, the simplest solution is to remove the problematic lead times (Solution I1 in Table 2). If ensemble member dependence cannot be corrected by removing problematic lead times, for example, when dependence persists across all lead times, it is possible to assess whether forecasts are over-dispersive or under-dispersive (Solution I2 in Table 2) by calculating the signal-to-noise ratio and/or the relationship between ensemble mean root-mean-square error (RMSE) and ensemble spread (e.g., Weisheimer et al., 2019). Another desirable (but not always practical) approach is to assess the spread in large-scale physical drivers and surface states relevant to the hydro-climatic extreme being studied (Solution I3), such as sea-surface temperatures, sea-ice conditions, soil moisture, or atmospheric patterns. The spread shows the extent to which the ensemble is tied to slowly varying properties within the prediction system. A bounded ensemble can still provide valuable information; in fact, many weather generators are constructed to be constrained and bounded to typical weather types. Therefore, predictability is only an issue when it originates from the initial conditions. Initial-condition predictability implies that the ensemble members are not unique, whereas predictability from boundary conditions means that the ensemble members are unique but conditioned. Note that for events with short memory and low persistence, no initial-condition predictability is expected beyond 2 weeks (Lorenz, 1963). Finally, calculating the effective sample size (Solution I4) is recommended when dependence remains an issue (Breivik et al., 2013). The effective sample size represents the size of the independent sample to which a dependent sample is equivalent. For example, an ensemble consisting of 1000 years of weather events containing some dependence may effectively represent only 500 unique, independent years. For an ensemble with sample size N that expresses dependence (correlation between ensemble members, r), the effective sample size N* can be calculated following Breivik et al. (2013):

N* = N (1 - r) / (1 + r)

If model stability is an issue and cannot be corrected by removing problematic lead times (Solution S1 in Table 2), an option could be to scale each lead time to the distribution of the shortest lead time (Solution S2). When model fidelity is an issue, an additive (for temperature) or multiplicative (for precipitation) adjustment may be applied (Jain et al., 2020; Kelder et al., 2020; Thompson et al., 2019) (Solution F1). If issues with model fidelity remain, it is recommended to apply other evaluation tests (Solution F2) and to assess large-scale drivers and land surface feedbacks related to the extreme event (Solution F3).
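As a numerical check of the effective sample size formula above (Solution I4; the equation is our reconstruction following Breivik et al., 2013), the one-line helper below reproduces the worked example from the text.

```python
def effective_sample_size(n: float, r: float) -> float:
    """N* = N (1 - r) / (1 + r), following Breivik et al. (2013)."""
    return n * (1 - r) / (1 + r)

# A 1000-year ensemble with inter-member correlation r = 1/3 effectively
# contains about 500 independent years, as in the example above.
print(effective_sample_size(1000, 1 / 3))  # -> 500.0
```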
In this workflow, the fidelity test was used for its focus on rare extremes. The sensitivity of the model fidelity results to the method of assessment can be tested (Solution F2). A wide range of methods and tools exists to identify biases in the simulation of extreme events (Eyring et al., 2019), and these can be applied as tests for UNSEEN applications. For example, the 'ESMValTool' has been developed for climate model evaluation (Eyring et al., 2016), including extreme events (Weigel et al., 2021). Furthermore, metrics common to the evaluation of numerical weather prediction systems, such as forecast reliability and rank histograms, can be used for prediction systems across timescales (Palmer & Weisheimer, 2018; Suarez-Gutierrez et al., 2021; Weisheimer & Palmer, 2014). In addition to the statistical evaluation methods presented so far, it may be desirable to evaluate the large-scale drivers and feedback processes of the extreme events (Solution F3) and how they are represented in the model (e.g., van der Wiel et al., 2017; Vautard et al., 2019). For example, Kay et al. (2020) and Thompson et al. (2019) assessed the large-scale drivers of simulated unseen temperature events. When inconsistencies in the variability (standard deviation) or shape (skewness and kurtosis) remain, more advanced correction methods can be applied with caution, such as the Inter-Sectoral Impact Model Intercomparison Project (ISI-MIP; Warszawski et al., 2014) bias adjustment approach (Hempel et al., 2013; Lange, 2019), which is commonly used to study climate impacts (e.g., Mitchell et al., 2017). For more guidance on bias adjustment methods, see, for example, Cannon et al. (2020) and Maraun and Widmann (2018).

2.7 | Step 7: Apply statistics
When the previous steps have resulted in a credible large ensemble for analysis, extreme value theory can be applied to gain insight into low-likelihood events. This section discusses the steps that we take to analyse the event statistics.
One way to determine whether UNSEEN could have detected historical low-likelihood weather events is to simply assess whether one (or more) of the ensemble members exceeds the magnitude of the historical event. The likelihood of the historical event can then be estimated as the percentage of ensemble members that exceeded the threshold (e.g., Thompson et al., 2017). It is also possible to compare the UNSEEN ensemble member with the highest magnitude to the magnitude of the historical event, to demonstrate how severe an event could get based on a worst-case scenario from UNSEEN (e.g., Walz & Leckebusch, 2019). Furthermore, a threshold can be selected based on system vulnerabilities. For example, for the Siberian heatwave, we count the number of thawing events (MAM mean temperature >0°C) within the UNSEEN ensemble and the reference data. Another, more advanced way is to apply extreme value theory (Coles, 2001). We demonstrate this method for the record-shattering California-Mexico August temperatures. We use the UNSEEN-trends approach (Kelder et al., 2020) to estimate changes in the event by fitting a non-stationary generalized extreme value (GEV) distribution (Katz, 2013) to the pooled UNSEEN data and to the reference data, excluding the 2020 event. As in Kelder et al. (2020), we allow the location and scale parameters of the GEV distribution to vary linearly with time, whereas the shape parameter is assumed constant over time. In addition, we fit a stationary GEV distribution and test which distribution (stationary or non-stationary) better fits the data using a likelihood-ratio test. The parameters of the GEV distributions are estimated using maximum likelihood estimation (MLE), and statistical uncertainty is estimated as 95% confidence intervals based on the normal approximation, employing the extRemes package in R (Gilleland & Katz, 2016). The resulting distribution can then be used to determine the likelihood of the historical events, and whether, and by how much, the frequency and magnitude of such events have changed over recent decades.
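The paper uses the extRemes package in R; for illustration only, the Python sketch below fits the same kind of non-stationary GEV (location and scale linear in time) by maximum likelihood on synthetic data and applies a likelihood-ratio test against the stationary fit.

```python
# Sketch of the UNSEEN-trends statistics under the stated assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme, chi2

rng = np.random.default_rng(2020)
years = np.arange(1981, 2021)
t = (years - years.mean()) / (years.max() - years.min())
x = genextreme.rvs(c=0.1, loc=25 + 1.5 * t, scale=1.0, random_state=rng)

def nll(params, nonstationary):
    """Negative log-likelihood of a GEV with optionally time-varying
    location and (log-)scale."""
    if nonstationary:
        mu0, mu1, s0, s1, c = params
        mu, sigma = mu0 + mu1 * t, np.exp(s0 + s1 * t)  # exp keeps scale > 0
    else:
        mu0, s0, c = params
        mu, sigma = mu0, np.exp(s0)
    ll = genextreme.logpdf(x, c=c, loc=mu, scale=sigma)
    return -ll.sum() if np.all(np.isfinite(ll)) else np.inf

stat = minimize(nll, [x.mean(), np.log(x.std()), 0.1],
                args=(False,), method="Nelder-Mead")
nonstat = minimize(nll, [x.mean(), 0.0, np.log(x.std()), 0.0, 0.1],
                   args=(True,), method="Nelder-Mead")

lr = 2 * (stat.fun - nonstat.fun)  # likelihood-ratio statistic
print(f"p-value (2 extra parameters): {chi2.sf(lr, df=2):.4f}")
```

Note that scipy's shape parameter c is the negative of the shape parameter xi in the usual climate-science convention, which matters when comparing fitted values with extRemes output.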

3 | CASE STUDIES
We now present three case studies where we apply the UNSEEN protocol to the 2020 Siberian heatwave, temperature anomalies during peak California wildfire activity, and UK extreme precipitation events. We describe the steps taken to generate and evaluate the UNSEEN ensemble for each of the case studies. When issues are identified, the options to resolve them are discussed and appropriate solutions applied.

3.1 | Siberian heatwave
The detailed technical steps involved in producing this example may be followed at https://unseen-open.readthedocs.io/en/latest/Notebooks/examples/Siberian_Heatwave.html.
For the Siberian heat case study, our choice of domain and duration was informed by the location and season in which monthly temperature records were broken (Section 2 and Figure 2a). We selected the area bounded by 65-120°E, 50-70°N for the March-May (MAM) season. Seasonal predictions (SEAS5) were selected as the hindcast ensemble and reanalysis (ERA5) was chosen as the reference data. All forecasts simulating March-May monthly temperatures were retrieved and pre-processed (averaged and merged) to represent the event definition. Time series show that the 2020 event was the highest within the ERA5 record and exceeded the simulations within the UNSEEN ensemble (blue cross compared with grey boxplot in Figure 8a). One interpretation is that the 2020 event was rarer than the 75 ensemble members within UNSEEN; another is that UNSEEN does not represent the true likelihood of such an extreme event. Therefore, an evaluation of the applicability of UNSEEN for this event definition is crucial.
We find some ensemble member dependence for lead time 2 (Figure 8b) but no drift over any lead times (Figure 8c,d). The fidelity test using all lead times shows that there is a cool bias in SEAS5 MAM temperatures compared with ERA5 (Figure 8e). The standard deviation, skewness, and kurtosis are within the 95% confidence intervals, but the standard deviation is at the lower end (Figure 8f-h). Note that the difference between SEAS5 and ERA5 could also be due to temperature overestimation by ERA5 for this particular season and domain. However, Ciavarella et al. (2021) report little difference between ERA5 and GISTEMP 250-km anomalies (Hansen et al., 2010) for the Siberian heat event.
The UNSEEN March-May average Siberian temperature ensemble based on SEAS5 expresses low ensemble member dependence for lead time 2, a cool mean bias, and low variability compared with ERA5. Lead time 2 could be removed from the ensemble to avoid dependence (Solution I1 in Table 2), but we choose to keep these ensemble members to retain a large sample size, because the low median correlation values of ~0.2 fall within accepted correlation values of <0.25 previously found for floods (Brunner & Slater, 2022). As a result of this decision, the effective sample size may be slightly lower than the 75 members being used, because of the low dependence between ensemble members (see 'Independence' in Box 2). The dependence, and the sensitivity of the dependence result to other tests, could be further assessed (Solutions I2-4 in Table 2) but this is not deemed necessary in this case. We apply a mean bias adjustment (Solution F1) to resolve the cold bias of the UNSEEN ensemble (Figure S1). The UNSEEN ensemble remains conservative, with a low SEAS5 standard deviation (although not significantly different from ERA5 at the 95% confidence level). Further evaluation tests could be applied (Solution F2), and large-scale drivers and land surface feedbacks that might be unrealistic could be assessed (Solution F3), but these are beyond the scope of this paper. An evaluation of the feedbacks involving soil moisture or snow cover that contributed to the 2020 anomaly (Ciavarella et al., 2021; Overland & Wang, 2021) merits further research.
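For reference, the mean bias adjustment applied above (Solution F1) amounts to the sketch below; the UNSEEN R-package implements the published version.

```python
# Mean (additive) bias adjustment, appropriate for temperature: shift the
# pooled ensemble so its mean matches the reference climatology. For
# precipitation, a multiplicative factor would be used instead (Solution F1).
import numpy as np

def mean_bias_adjust(unseen, reference):
    return np.asarray(unseen) + (np.mean(reference) - np.mean(unseen))
```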
The resulting bias-adjusted UNSEEN ensemble is inherently conservative because of its low variability, yet it captures the 2020 event, along with five thawing events (MAM mean temperature >0°C), with a near possibility as early as the 1990s (Figure 9). In comparison, no thawing events had been observed within the reanalysis before 2020.

3.2 | Temperature anomalies during peak California wildfire activity
The technical steps to reproduce this example are available at https://unseen-open.readthedocs.io/en/latest/Notebooks/examples/California_Fires.html.
Wildfire activity cannot be studied from meteorological variables alone, because wildfire activity depends not only on weather conditions but also on fuel stock, ignition agents, and management (Flannigan et al., 2013). For example, weather conditions may be very dry, but without fuel or ignition source(s), wildfire activity is unlikely. However, weather-driven fire danger conditions can be studied from meteorological variables (e.g., Vitolo et al., 2020). For example, trends in temperature and precipitation are associated with a rising likelihood of wildfire conditions across California (Goss et al., 2020). In 2020, the wildfire season peaked in August, coinciding with record high-temperature anomalies (Figure 2b,d). Here, we demonstrate the applicability and potential of SEAS5 in estimating the likelihood and trend of such a temperature anomaly. We selected a contiguous, land-only region where August temperature anomalies were more than twice the climatological standard deviation based on ERA5, over the domain 100-125°W, 20-45°N (Figure 2b). The ERA5 time series shows a strong increase in August temperatures over 1981-2020 for this domain, which is also present in SEAS5 (Figure 10a). We find low ensemble member dependence in the UNSEEN ensemble for all lead times (Figure 10b). We also find that the model is not stable: for lead time 6 in particular, the model has a cold bias (Figures 6 and 10c,d). Lastly, we find that SEAS5 overestimates mean August temperatures when compared with ERA5, but that the standard deviation, skewness, and kurtosis are not significantly different at the 95% confidence level (Figure 10e-h).
Following these tests, we remove lead time 6 from the ensemble to resolve its instability (Solution S1 in Table 2) and apply a mean bias adjustment to resolve the warm bias (Solution F1), leaving 100 members in the pooled data. Removing problematic lead times is not an option to resolve the dependence (Solution I1 in Table 2) because the issue persists across all lead times. Further assessment of the independence between ensemble members was not deemed necessary in this case because of the low median correlation values (Figure 10b).
We then use extreme value statistics and find that the trend in 2-year temperature extremes, which can be detected well within short observational records, is similar between UNSEEN and reanalysis (Figure 11a). Both reanalysis and UNSEEN suggest a strong increase in the magnitude of 100-year temperature extremes (Figure 11b), but the statistical uncertainty is much larger within the 40-year reanalysis record (blue envelope in Figure 11b) than within the large sample size of UNSEEN (100 × 40 years, orange envelope in Figure 11b). When we compare the GEV distributions with the 'year' covariate set to 1981 as opposed to 2020, we find that the distribution of temperature for 1981 does not reach the magnitude of the 2020 event, whereas the distribution for the year 2020 does capture the event for both reanalysis and UNSEEN (Figure 11c). This result suggests that the temperature anomaly observed in 2020 could not have occurred a few decades ago and that it was still unlikely to occur in the present climate (i.e., the distribution for the year 2020), with a return period of more than 100 years, that is, a <1% chance of occurrence. This trend is consistent with record-breaking or 'record-shattering' temperatures being expected to occur more frequently in a rapidly warming climate (Coumou et al., 2013; Fischer et al., 2021; Power & Delage, 2019).

3.3 | UK extreme precipitation
The technical steps to reproduce this example are available at https://unseen-open.readthedocs.io/en/latest/Notebooks/examples/UK_Precipitation.html.

FIGURE 9 Thawing events over Siberia within the bias-adjusted UNSEEN ensemble compared with ERA5. As in Figure 8a, but after applying a mean bias adjustment to the UNSEEN ensemble. The horizontal line shows the threshold for thawing events (temperature >0°C).
Three storms hit the United Kingdom in February 2020, breaking the UK-average monthly precipitation record according to the Met Office (2020). Hence, we select country-averaged February precipitation for the UK case study. In this case, we employ the EOBS version 20.0e observational dataset as reference (Cornes et al., 2018) because precipitation observations (UK Met Office, 2020) suggest the reanalysis values may have underestimated the event. We upscale this dataset to the resolution of SEAS5 using bilinear interpolation and take the same UK spatial average as for SEAS5.
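A sketch of this regridding step with xarray is shown below; the file names and variable names are placeholders rather than the exact workflow code.

```python
# Sketch of regridding EOBS onto the 1-degree SEAS5 grid via bilinear
# interpolation, then taking the same UK spatial average as for SEAS5.
import xarray as xr

eobs = xr.open_dataset("eobs_v20.0e_rr.nc")       # placeholder file name
seas5 = xr.open_dataset("seas5_february_tp.nc")   # placeholder file name

# interp_like performs (multi)linear interpolation onto SEAS5's lat/lon grid.
eobs_coarse = eobs["rr"].interp_like(seas5, method="linear")
uk_mean = eobs_coarse.mean(dim=("latitude", "longitude"))
```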
The UK February precipitation time series shows that the 2020 event was not the highest on record within the EOBS dataset (Figure 12a), while it was the highest within the HadUK-Grid dataset (Davies et al., 2021;Hollis et al., 2019). The discrepancy likely arises from the number of observation stations being incorporated, with the local HadUK-Grid dataset containing more rain gauges. Note that later versions of EOBS may have incorporated more observation stations for the year 2020, but these versions were not available at the time of analysis.
We find that SEAS5 UK February precipitation ensemble members are independent (Figure 12b) and stable (Figure 12c,d). However, there is too little variability within SEAS5 when compared with EOBS (Figure 12e-h), raising concerns about model fidelity. An independent UNSEEN analysis of February 2020 UK precipitation using the Met Office decadal prediction system and observations also found a lack of fidelity, with observed variability outside the range of that simulated (Kay 2021, personal communication). A mean bias adjustment (Solution F1 in Table 2) does not help in this case, because it will not sufficiently adjust the standard deviation. The result will likely not be sensitive to the choice of evaluation test (Solution F2), such as a rank histogram or reliability diagram, given that the lack of variability is also evident in the time series (Figure 12a). Further evaluating the drivers (Solution F3) and comparing the results with other datasets (Section 4.2) would be recommended, as the realistic simulation of large-scale winter precipitation variability over the United Kingdom may be hampered by SEAS5's resolution. For example, Thompson et al. (2017) also found that DePreSys3 does not simulate the orographic enhancement over the Scottish Highlands; flat regions, such as southern England, are better simulated.
We do not take this case study further, as the generated ensemble of UK-average February precipitation did not pass the fidelity test and the issue could not be resolved. Note, however, that UNSEEN can successfully be applied to monthly winter precipitation over Southeast England. Furthermore, for a detailed analysis of the dynamics of the wet winter 2019/2020, including the attribution of the record-breaking February 2020 precipitation to climate change, see Davies et al. (2021) and Hardiman et al. (2020).

4 | DISCUSSION
This paper sets out a protocol for generating a credible, large ensemble for event definitions specified by a user (step 1). The protocol guides the user through the selection of an appropriate prediction system (step 2), the retrieval (step 3), pre-processing (step 4), and evaluation (step 5) of the data, and how to resolve detected issues (step 6). Finally, the protocol describes the use of statistics to gain insight into low-likelihood events (step 7). A technical UNSEEN-open workflow for steps 3-5 is presented using the ECMWF seasonal prediction system SEAS5. In this section, we reflect on the challenges and sensitivities to be mindful of when applying UNSEEN.

4.1 | Sensitivity to the choice of event
Here, the event definition is informed by the meteorological rarity of observed low-likelihood high-impact events. This approach does not necessarily best match target impacts. For example, local fire weather indices over California would be more informative than the temperature anomaly over the larger domain that we assessed, also noting that high impacts do not necessarily result from rare meteorological events (Van der Wiel et al., 2020). Furthermore, the outcome of UNSEEN is sensitive to the selection of the spatial-temporal scale of the event, as is well documented for event attribution studies (e.g., Angélil et al., 2014; Leach et al., 2020; Uhe et al., 2016). Finally, an event definition based on observed events is not available for unprecedented events. The outcomes of our case studies therefore (1) are not optimized to study impacts, (2) only hold for the specified spatial and temporal scales, and (3) can only be applied to events in hindsight. We believe these assumptions are justified to demonstrate how UNSEEN can be used to gain insight into low-likelihood events, for example, whether past events could have been detected with UNSEEN. When aiming to inform decision-making based on UNSEEN, sensitivities must be assessed and the event definition should best match target impacts, for example, through local expert knowledge of the domain and timescale that are most vulnerable to the target weather event. Practical factors may weigh in as well, such as our workflow considerations to use the average of, or daily max/min values within, monthly or seasonal blocks (Section 4.3). The protocol can help to understand the applicability of UNSEEN for the chosen event definition and can be used to test the sensitivity of the outcomes to various spatial-temporal scales.

4.2 | Model adequacy and availability
Ideally, the adequacy of the models should inform the selection of an appropriate prediction system (step 2 and Figure 3). However, the availability of model simulations may be another important consideration. The UNSEEN-open workflow was developed to generate and evaluate an UNSEEN ensemble from open-access Copernicus SEAS5 simulations. SEAS5 has been used in other UNSEEN studies because it provides a stable, homogeneous, global, high-resolution, large ensemble of weather variables (Hillier & Dixon, 2020; Kelder et al., 2020).
The outcomes from UNSEEN based on SEAS5 initialized ensemble predictions are not completely independent from outcomes based on reanalysis data. For example, the California-Mexico August temperature trends based on SEAS5 are conditioned on ERA-Interim (1981-2016) and the ECMWF operational analysis (ERA5 for 2017-2020; see 'Step 3: Retrieve'). Trends in low-likelihood weather events such as the August 2020 California-Mexico temperatures are hard to constrain from reanalysis data alone, and the large sample size from UNSEEN (SEAS5) can help. The prediction timescale used here is at least 1 month, so the ensemble members are less constrained to observation stations than reanalysis. UNSEEN is, therefore, less reliable than reanalysis, but it represents many weather realizations that may face less of an issue with assimilation inhomogeneity over time. Evaluating SEAS5 against ERA5 can confirm whether the large sample size from UNSEEN (SEAS5) is as reliable as ERA5, but both may equally face model errors.

4.3 | UNSEEN-open workflow considerations
In this workflow (steps 3-5 in Figure 1), SEAS5 monthly statistics are retrieved locally from the Copernicus Climate Data Store (Buontempo et al., 2020; Thepaut et al., 2018), which are openly available, freely accessible, and can be retrieved without an intermediary. The case studies in this paper include monthly average statistics from the CDS, but the workflow is sufficiently flexible to draw on monthly minimum or maximum data. For compound or multi-day events, daily data can be retrieved and processed to obtain the required metric.
There are two points of attention for users to consider when using SEAS5: (1) the ensemble size depends on the selected block length and (2) the ensemble represents the conditions of the most recent decades only. Forecasts run for 6 months and, therefore, an ensemble size of 125 members can be created for monthly blocks, 75 members for seasonal blocks, and events longer than 5 consecutive months are not possible without stitching forecasts (Section 2, step 3). When longer time periods are required to evaluate internal climate variability, century-long seasonal hindcasts with a similar set-up to SEAS5 but at lower resolution, such as the Coupled Seasonal Forecasts of the 20th Century (CSF-20C, Weisheimer et al., 2021), or the Atmospheric Seasonal Forecasts of the 20th Century (ASF-20C, Weisheimer et al., 2017) may be useful. The workflow is adjustable for other prediction systems, including medium/extended range, seasonal and decadal (Table 1), but, here, retrieval was optimized for Copernicus SEAS5.
At present, hindcast datasets are available for download and need to be pre-processed, which can be a time-consuming process. An open workflow as presented in this paper would benefit from having large volumes of data, such as the SEAS5 hindcast, accessible on demand via a cloud-based service (Pappenberger et al., 2021; Wagemann et al., 2018). An example of a cloud service for the meteorological and climate community, which in the future may be incorporated in the UNSEEN-open workflow to obviate retrieval of data, is the European Weather Cloud (Pappenberger & Palkovic, 2020).

4.4 | Sensitivity to evaluation metrics and solutions
The applicability of UNSEEN is determined by multiple, interrelated factors. Ensemble member independence, model stability, and model fidelity depend on the type of event being studied (variable of interest, spatial and temporal extent, and geographical location), as well as on the prediction system applied. The prediction timescale furthermore influences the independence and stability, as longer simulations are more independent but have a higher chance of drifting away from climatology. Different systems, and the way they have been downscaled, initialized, and coupled, may yield different biases. Therefore, it is recommended that our protocol is used to explore the applicability for the selected event definition and prediction system(s).
All three case studies of extreme weather events in 2020 express challenges with respect to the credibility of UNSEEN (independence, stability, and fidelity; see Box 2). We note that a wide range of evaluation metrics exists, especially for evaluating model fidelity (described under Solutions F2 and F3), and that the sensitivity of the identified challenges to the choice of evaluation metric could be further assessed. For two out of three case studies, we find that solutions could be applied to deem the UNSEEN ensemble credible for further analysis. We note that the outcome of UNSEEN is sensitive to the judgement of appropriate solutions. For example, here, we identified weakly dependent ensemble members for the Siberian heatwave and for the California-Mexico temperature anomalies. There is a trade-off between discarding useful information and keeping dependent members. Here, we chose to keep all members, as the ensemble member dependence was low. Had we removed certain lead times, the results would have been different.
The outcome of UNSEEN is furthermore sensitive to the type of bias adjustment. The UNSEEN ensemble may sample plausible extreme events that never occurred, and bias adjustment techniques may constrain the ensemble to observed extremes, thereby removing information about unseen events. Furthermore, observations are not the 'truth', given internal variability, resolution mismatch, and other sources of error (Casanueva et al., 2020; Wilby et al., 2017). Attention is therefore needed to evaluate which statistical properties of the extremes are being constrained to observations.

4.5 | Sensitivity to the analysis and framing
The outcome of UNSEEN is sensitive to the statistical analysis being applied. In particular, the estimation method, time window, and initialization method of UNSEEN (SEAS5) are factors that may influence estimated likelihoods and detected trends. Here, we allow the location and scale parameters to vary linearly with time (step 7). Other regression methods, other covariates than time, other reference data, and other prediction systems allowing longer time periods could be explored, but such analyses are beyond the scope of this paper.
Furthermore, correct framing of the results of an UNSEEN study is crucial. For example, for the Siberian heatwave, we did not want to apply extreme value theory or attach likelihood estimates to the event, because the UNSEEN ensemble was conservative, with a low standard deviation. Nonetheless, we are confident in saying that the 2020 March-May Siberian heatwave was predicted by one of the ensemble members prior to the event happening, along with other simulations of thawing events that had not yet been observed. The sensitivity of likelihood statistics to portfolio risks should also be considered: whereas the chance of a single high-impact weather event occurring might be very low, the chance of any type of high-impact weather event occurring anywhere in the world is substantially larger. For example, Thompson et al. (2017) showed a 7% chance of unprecedented winter monthly precipitation occurring in a given year for southern England, but 34% when also including the chance of unseen events over the Midlands, East Anglia, or northeast England.

4.6 | Scope for multi-model multi-method approaches
Most studies evaluating unprecedented extreme events have used single models to assess their magnitude and frequency, but such analyses are sensitive to model structures (e.g., van Kempen et al., 2021; Wilcke et al., 2020). Multi-model approaches have therefore been used in weather predictions, climate projections, and event attribution studies (Palmer et al., 2005; Philip et al., 2020; Tebaldi & Knutti, 2007). Jain et al. (2020) were the first to apply a multi-model ensemble in an UNSEEN approach, using the Climate-System Historical Forecast Project (Tompkins et al., 2017) to study extreme summer rainfall over India. Future work may investigate the extension of the UNSEEN-open workflow to include all seasonal prediction systems available in the CDS.

5 | CONCLUSION
Hindcast ensembles from weather predictions have considerable potential for advancing understanding of low-likelihood high-impact weather events. Estimates of rare extreme events or compound extremes can be improved through the large number of weather events that can be generated from these ensembles (UNSEEN; van den Brink et al., 2004). To improve uptake and ensure rigour of these methods, we provide a protocol and open workflow to apply UNSEEN for gaining insights about low-likelihood high-impact events. Demonstrating our protocol and open workflow using SEAS5, we show that a stress test of March-May thawing events over Siberia would have shown their plausibility within the UNSEEN ensemble before the 2020 event happened, at a time when the far-reaching impacts on permafrost peatlands were already widely known (e.g., Swindles et al., 2015). Assessing UK February monthly precipitation revealed an issue with the variability of SEAS5 for this event definition, illustrating how the protocol may help understand the limitations of UNSEEN and diagnose the lack of simulated precipitation variability in the underlying forecasting system. In the case of August 2020 temperatures during peak California wildfire activity, anomalies exceeded previous records by a considerable margin (Figure 2c,d). Such anomalous events can have large socio-economic consequences, especially when climate risk perception is driven by past experiences (Aerts et al., 2018; Weber, 2006). The UNSEEN approach reveals a strong trend in temperature extremes over the last 40 years, which has increased the likelihood of events like the August 2020 temperature anomalies in the present climate. UNSEEN shows the plausibility of such a record-shattering event in the present climate, but not in the past climate. This case study shows how UNSEEN may help to understand what kinds of unseen weather events could occur in the present climate, and thus in the near future.
Based on these case studies, we conclude that UNSEEN can provide new insights into low-likelihood high-impact events, but that there are several challenges and sensitivities of which to be mindful. It is, therefore, key to be transparent about all decisions made throughout the analysis, given the many sensitivities that can arise from these decisions. Our protocol and open workflow assist users in identifying challenges and sensitivities, and can help gain credible insights for target high-impact weather events. The results warrant further research and application of UNSEEN at the science-policy interface, to improve our preparedness for low-likelihood high-impact weather events.