Robust skill of decadal climate predictions

There is a growing need for skilful predictions of climate up to a decade ahead. Decadal climate predictions show high skill for surface temperature, but confidence in forecasts of precipitation and atmospheric circulation is much lower. Recent advances in seasonal and annual prediction show that the signal-to-noise ratio can be too small in climate models, requiring a very large ensemble to extract the predictable signal. Here, we reassess decadal prediction skill using a much larger ensemble than previously available, and reveal significant skill for precipitation over land and atmospheric circulation, in addition to surface temperature. We further propose a more powerful approach than used previously to evaluate the benefit of initialisation with observations, improving our understanding of the sources of skill. Our results show that decadal climate is more predictable than previously thought and will aid society to prepare for, and adapt to, ongoing climate variability and change. There is increasing demand for near-time climate predictions to provide guidance for adaptation planning at policy-relevant timescales. Although previous work has shown some skill in forecasting decadal surface temperature, it has proven more difficult to make predictions for precipitation and atmospheric circulation. By using a large, multi-model ensemble of climate models, a multi-institution team lead by Doug Smith of the Met Office Hadley Centre, UK were able to make skillful decadal predictions for near surface temperature, precipitation for the Sahel and broad swathes of Europe and Eurasia, and mean sea level pressure for many regions, with some exceptions being predictions for the South Atlantic and Southern Ocean. Further work is needed to understand whether the instances in which forecasts and observations differ are due to internal variability or external factors such as solar variability, volcanoes and anthropogenic aerosols.


INTRODUCTION
Human society and natural ecosystems are vulnerable to climate variability and change, 1 which impacts food security, freshwater availability, spread of pests and diseases, heat waves, droughts, floods, cyclones, wildfires, energy supply and demand, transport, migration and conflict. There is, therefore, an urgent, and growing, need for climate information [2][3][4] to inform the World Meteorological Organisation's Global Framework for Climate Services 5 and support the United Nations Sustainable Development Goals 6 and the Sendai framework for disaster risk reduction. 7 For climate information to be useful it should be credible. 8,9 Many planners are particularly interested in the coming decade, 10 and there is an increasing need for decadal climate predictions. 4,11 An important advantage of decadal climate predictions compared to centennial climate projections is that their credibility can be assessed by performing retrospective forecasts (also known as hindcasts) of the historical period and comparing them against subsequent observations. Although previous assessments have shown high skill in decadal forecasts of surface temperature, confidence in predictions of precipitation and atmospheric circulation, which are vital for many climate impacts, is much lower. 4,[12][13][14][15][16][17] However, recent developments in seasonal forecasting have highlighted the need for very large ensembles to achieve skilful predictions especially for precipitation and atmospheric circulation. [18][19][20][21][22][23] Here we take advantage of these developments to reassess decadal prediction skill using a larger multi-model ensemble than was previously available. We also propose a new approach for identifying the sources of skill in order to gain further confidence in forecasts.

RESULTS
The signal-to-noise paradox Although individual weather events cannot be predicted more than a couple of weeks ahead, slowly varying predictable components of the climate system, including ocean variability and changes in radiative forcing from greenhouse gases, aerosols and solar variations, influence the frequency, duration and intensity of weather events over the coming seasons to years. Hence some aspects of climate, for example, multi-year atmospheric circulation changes, 24,25 the frequency of extreme weather events 26 and total amounts of rainfall, 27 are potentially predictable over the coming decade. Such forecasts will not be perfectly deterministic because uncertainties are inevitable due to the chaotic nature of the atmosphere, and errors will be introduced by imperfect climate models and imperfect knowledge of the initial state of the climate system. Taking the mean of an ensemble of forecasts reduces these uncertainties, since the different ensemble members sample different realisations of chaotic variability and, to some extent, uncertainties in the initial conditions and model formulation if different climate models are included. Ensemble averaging reduces the unpredictable noise to reveal the predictable signal in the models. The level of skill, and hence the usefulness of forecasts, depends on the magnitude of the predictable signal relative to the unpredictable noise (the signalto-noise ratio). 28 Recent studies of seasonal and annual forecasts have revealed that the signal-to-noise ratio can be far larger in observations than in climate models. [18][19][20][21]23,29 This results in the surprising situation, referred to as the signal-to-noise paradox, 20,21,30 where a climate model can predict the real world better than itself despite being There are amendments to this paper an imperfect representation of the real world and a perfect representation of itself. Although this highlights a clear deficiency in climate models it also provides an opportunity to create skilful forecasts even with imperfect models by taking the mean of a very large ensemble in order to extract the predictable signal. Where the model signal-to-noise ratio is too small, the number of ensemble members needed to remove the noise and extract the signal will be larger than it would be if the signal-to-noise ratio were correct, and the amplitude of the resulting model predictable signal will be too small. Nevertheless, the ensemble mean signal may be highly skilful if it correlates with the observed variability, and its magnitude can be adjusted in a post-processing step to produce realistic forecasts. 19,29,31,32 If the signal-to-noise paradox applies to decadal predictions then previous studies [12][13][14][15][16][17] may have underestimated the skill. This is because the ensemble size may have been too small to remove the noise, or the skill measure used may have penalised errors in the magnitude of the predicted signal which could potentially be corrected before issuing forecasts. Indeed, averaging over many hindcast start dates to increase the ensemble size shows that decadal predictions capture many aspects of observed changes associated with large variations in the north Atlantic, 33,34 and recent results with a large ensemble from one model show significant skill for predicting rainfall over land in some regions. 35 We, therefore, reassess decadal prediction skill using a larger ensemble than was previously available, and focussing on anomaly correlation which is insensitive to errors in the magnitude of variability. In order to reduce modelling and initial condition uncertainties, we use multi-model hindcasts that start every year from 1960 to 2005 from the 5th Coupled Model Intercomparison Project (CMIP5 36 ) supplemented by three additional state-of-the-art decadal predictions systems to provide a total of 71 ensemble members from seven different climate models (Table 1, Methods). For the reasons above, we assess the correlation skill of the ensemble mean, for near-surface temperature, precipitation and mean sea level pressure. We average over forecast years 2-9 in order to avoid seasonal to annual predictability and focus specifically on decadal timescales (see Methods).
We illustrate our approach by first considering the North Atlantic Oscillation (NAO) for which seasonal to annual forecasts are clearly affected by the signal-to-noise paradox. [18][19][20]23,29 The observed time-series of annual 8-year running mean NAO (black curve in Fig. 1a) is characterised by an increase from the 1960s reaching a peak for the period 1988-1994 (trend = 0.87 hPa/ decade, p < 0.01) and a decrease thereafter (trend = −0.77 hPa/ decade, p < 0.01). The model forecasts (red curve in Fig. 1a) also show an increase from the 1960s peaking for the period 1988-1994 (trend = 0.13 hPa/decade, p < 0.01), though the decrease thereafter is not significant (trend = −0.02 hPa/decade, p = 0.62). The agreement between the timing of observed and forecast variability results in a highly significant correlation skill ( Fig. 1a inset, correlation r = 0.49, p = 0.02 allowing for autocorrelation in the time-series, see Methods). However, the model forecasts underestimate the observed amplitude of variability (note the different axes for the observations and models in Fig.  1a), and skill measured by the mean-squared-skill-score (MSSS 14 ) is positive but not significant (MSSS = 0.17, p = 0.30).
The discrepancy in significance between correlation and MSSS skill measures arises because the modelled signal-to-noise ratio is too small: the models capture the phase of observed variability reasonably well but MSSS is sensitive to errors in the amplitude. This is confirmed in Fig. 1b which shows that the correlation is far higher for predicting the observations (red curve) than for predicting an individual model member (blue curve). Hence the predictable fraction of the total variability is larger in the real world than in the models. This behaviour is also found in seasonal and annual forecasts, 18,20,23 and shows that the signal-to-noise paradox also applies to decadal forecasts of the NAO. Note also that the correlation increases slowly with ensemble size (red curve) confirming the need for large ensembles to obtain robust estimates of skill. Indeed, even with 71 members the forecasts are noisy (showing large inter-annual variability in overlapping 8 year periods, Fig. 1a red curve) and the correlation is still increasing, suggesting that higher skill would be achieved with an even larger ensemble. Indeed, a simple theoretical fit 18,37 suggests the correlation could exceed 0.6 for an infinite ensemble.
The signal-to-noise paradox can be quantified by computing the ratio of predictable components (RPC) between observations and models. 19,29,30 Since the correlation squared is the fraction of variance that is predictable, the RPC can be computed as the correlation skill for predicting the observations divided by the average correlation skill for predicting individual model members (where the square root has been taken for convenience). The expected value of the rpc should equal one for a perfect forecasting system; values greater than one are symptomatic of the signal-to-noise paradox where the real world is more predictable than models. The RPC for decadal predictions of the NAO is 6, compared to 2 or 3 for seasonal and annual timescales. 19,20,23 Although some of this difference might arise from our use of a multi-model ensemble, our results clearly highlight the need for large ensembles to obtain robust estimates of skill. This conclusion is further strengthened by maps of RPC for temperature, precipitation and pressure ( Fig. 1c- that the signal-to-noise paradox is widespread in decadal predictions, and generally larger for precipitation and pressure than for temperature. Assessing the impact of initialisation Decadal variations in climate can occur through internal variability of the climate system (the atmosphere, oceans, land and cryosphere), but they are also influenced by radiative changes from external factors, including greenhouse gases, aerosols (volcanic and anthropogenic) and solar variability. 38 Climate projections 39 aim to simulate the response to external factors, but cannot predict internal variability because the modelled internal variability is not aligned, on average, with that of the real world. In contrast, decadal predictions are initialised with observations of the ocean and the atmosphere, thereby aligning internal variability with reality and enabling some aspects of internal variability to be predicted. It is important to note that initialisation may also improve the response to external factors where these leave fingerprints, especially in the ocean, that are imperfectly simulated by models but can be corrected with observations. For example, if the NAO response to solar variability and/or volcanic eruptions is too weak in model simulations 30 then their impacts on the ocean circulation will likely be too weak. Initialisation with observations would correct these fingerprints thereby improving skill. Hence, differences between initialised predictions and uninitialized simulations cannot necessarily be attributed to internal variability. Nevertheless, assessing the impact of initialisation provides valuable information on whether skill is dominated by external factors or whether internal variability could be playing a role, thereby highlighting the most important factors to be considered in order to reduce uncertainties in forecasts.
Previous studies have found fairly limited improvements from initialisation, mainly in the North Atlantic with little impact over land. [12][13][14][15]17,24 However, the methods for comparing skill were not optimal resulting in a lack of power in the significance test and an underestimation of the impact of initialisation. The power of a significance test is defined as the probability that it will correctly detect a real signal, and depends largely on the size of the signal being tested, [40][41][42] referred to as the effect size. Previous studies have assessed simple differences (or ratios) of skill, but this approach does not maximise the effect size when part of the skill is common to both sets of forecasts. For example, the global warming trend can lead to high anomaly correlations for both initialised forecasts and uninitialized simulations, with very little difference between them and hence a small effect size and low power of the significance test (see example below). This common signal introduces a bias that is not taken into account in standard significance tests and diminishes their power. 43,44 In order to increase the effect size we directly test whether initialisation predicts any of the observed variability that is not already captured by the uninitialized simulations. Specifically, we create residual forecast and observed time-series by removing, via linear regression, the uninitialized ensemble mean from the initialised and observed time-series respectively. We then test whether these residuals are significantly correlated (see Methods, noting the adjustment needed to avoid introducing a bias). By construction, Fig. 1 The signal-to-noise paradox. a Time series of observed (black curve with red/blue shading showing positive/negative values) and model ensemble mean (years 2-9, red curve) 8-year running mean annual NAO index. Note different scales for observations (± 2hPa, left axis) and model ((± 0.4hPa, right axis). Inset shows the skill measured by MSSS and anomaly correlation (circles show the skill of the ensemble mean, whiskers show the 5-95% uncertainty range, see Methods). Straight lines show trends before and after the observed peak. b Anomaly correlation as a function of ensemble size for model ensemble mean forecasts of the observations (red curve with 5-95% confidence interval shaded, see Methods) and for model ensemble forecasts of an independent ensemble member (blue curve, averaged over all possible combinations). c-e The ratio of predictable components (RPC 19 ) between observations and models for year 2-9 forecasts of near-surface temperature, precipitation and mean sea level pressure. RPC values significantly greater than one are stippled (crosses and circles show 90 and 95% confidence intervals, respectively). The NAO is calculated 20 as the difference in pressure between two small boxes located around the Azores ( the residuals contain the variability in the observations and initialised predictions that is independent of the uninitialized simulations. Hence, their correlation is potentially much larger than the difference in correlation between initialised and uninitialized forecasts that contain common signals. By increasing the effect size our approach provides a more powerful assessment of the impact of initialisation, as demonstrated below. We illustrate our approach with predictions of boreal summer (JJA, the average of June, July and August over forecast years 2-9) temperature in the sub-polar North Atlantic (SPNA). For the whole year this region shows a robust benefit from initialisation, 45 but this is less clear for JJA (Fig. 2a). Both initialised and uninitialized forecasts capture the overall warming trend and are highly correlated with the observations (r = 0.98 and 0.95, respectively) but the benefit of initialisation is not detected by the difference in correlation which is small and not significant (Fig. 2a inset, r = 0.03, p = 0.29). However, observations and initialised forecasts are cooler than the uninitialized simulations in the late 1960s to early 1990s and warmer in other periods resulting in a significant correlation between residuals (Fig. 2b, r = 0.69, p = 0.05) and a clear improvement from initialisation. Furthermore, expanding our analysis to the whole globe shows that the enhanced power obtained by testing residuals reveals significant benefits from initialisation in many regions that are not seen by a simple difference in correlations (compare Fig. 2c, d), including over land in western Europe, western Africa and the Middle East.
Robust skill of decadal predictions We now provide new estimates of decadal prediction skill together with the impact of initialisation (Fig. 3). The impact of initialisation depends on both the skill of predicting the residuals and the fraction of the total variability that they represent. We, therefore, rescale the correlation between residuals (see Methods) and present the impact of initialisation as the ratio of predicted signal arising from initialisation divided by the total predicted signal (Fig. 3b, d, f).
Our results provide robust evidence of decadal prediction skill for precipitation and pressure in addition to temperature (Fig. 3a, c, e). Temperature shows high skill almost everywhere, the main exceptions being the north-east Pacific and parts of the Southern Ocean. Precipitation shows reasonable skill (r > 0.6) in the Sahel and in a broad band across northern Europe and Eurasia, with lower but significant skill in parts of North and South America and the Maritime Continent. Pressure is also significantly skilful in most regions, the main exceptions being the eastern South Atlantic and western Indian Oceans, Africa and parts of Eurasia and the Southern Ocean.
In agreement with previous studies, [12][13][14][15]17,24 temperature skill is particularly improved by initialisation in the SPNA and the Drake Passage region (Fig. 3b). However, we also find significant improvements over the western Indian Ocean, tropical and eastern Atlantic, and land regions including Europe, the Middle East and Africa. Initialisation significantly improves precipitation skill over the Sahel and western Amazon (Fig. 3d), but contributes little to the band of skill across northern Europe and Eurasia. Sea level pressure is mainly improved by initialisation in the North Atlantic (including both the Iceland and Azores nodes of the NAO, Fig. 1), north-east Europe and parts of the Southern Ocean (Fig. 3f). A detrimental impact of initialisation in the eastern South Atlantic and western Indian Oceans and northern Africa warrants further investigation given the significant improvement in temperature skill in these regions, suggesting that the relationship between surface temperature and atmospheric circulation anomalies is not simulated correctly by the initialised hindcasts. 46 Regional predictions Having established significant skill for decadal predictions of temperature, precipitation and pressure, we now investigate their ability to predict regional patterns. A key goal is to predict decadal variability in the Atlantic and Pacific, referred to here as Atlantic Multidecadal Variability (AMV, also known as the Atlantic Multidecadal Oscillation, AMO) and Pacific Decadal Variability (PDV, also referred to as the Pacific Decadal Oscillation, PDO, or the Interdecadal Pacific Oscillation, IPO). AMV switched from cool to warm conditions around 1995, 45 with some evidence for a recent return to cool conditions. 47-49 PDV switched to warm conditions around 1976, back to cool conditions around 1998, with some evidence for recent warm conditions. 50 AMV (PDV) was warm (cool) during 1998-2014 and cool (warm) during 1978-1994 such that the difference between these periods summarises many aspects of decadal predictions, including predictions of AMV, PDV, greenhouse gas warming, and associated climate impacts. As noted above, the magnitude of forecast anomalies must be adjusted if the signal-to-noise ratio is incorrect. Although several objective methods have been proposed 19,29,31,32 further work is needed to establish the best approach. We, therefore, follow the previous studies 18,20,21 and simply compare forecasts and observations using standardised anomalies (Fig. 4). Temperature differences between these periods are dominated by the global warming trend, which masks regional differences related to atmospheric circulation anomalies. To highlight regional patterns we remove the global average temperature from each 8 year period before computing temperature differences (Fig. 4a, b). This clearly highlights enhanced Arctic warming and greater warming over land than sea, which are robust patterns expected with global warming 39,[51][52][53] and are well captured by the predictions. The decadal predictions also capture the warm AMV, especially in the SPNA consistent with previous studies, [54][55][56][57] although the relative warming in the tropical Atlantic is not captured. A negative PDV pattern is evident in the observations, seen as relatively cool conditions in the eastern tropical Pacific surrounded by a horse-shoe of warming to the west, north and south. The decadal predictions show some aspects of this pattern in the north Pacific but relative warming in the west and south Pacific is not captured. The PDV pattern is only partially simulated, and further work is needed to reconcile generally low skill for predicting PDV 58,59 with an apparent ability to predict phase transitions. 60 Both predictions and observations show relatively cool conditions over most of the Southern Ocean and Antarctica, but the observed warming west of the Antarctic Peninsula is not captured by the predictions which instead show warming further east.
Decadal predictions capture many aspects of the observed precipitation changes (Fig. 4c, d) including wetter conditions in the Sahel, north-west South America, the Maritime Continent and across northern Europe and Eurasia, and drier conditions in South America and south-west USA. Many of these changes in land precipitation appear to be part of large scale patterns extending over the ocean, including a northward shift of the Atlantic Intertropical Convergence Zone (ITCZ) consistent with warm AMV 27,47,57,61,62 and an expansion of the Hadley circulation 63 which is expected as climate warms. 64 However, drier conditions in parts of Asia, the Middle East and East Africa are not captured by the predictions.
Predicted changes in pressure are clearly anti-correlated with temperature suggesting a heat low response to greater warming Fig. 4 Predicting regional patterns. a Composite difference between initialised year 2-9 temperature forecasts verifying during the periods 1998-2014 and 1978-1994. b As (a) but for the verifying observations. (c, d) As (a, b) but for precipitation. (e, f) As (a, b) but for mean sea level pressure. For each 8-year mean forecast period (years 2-9), observed and ensemble mean forecast anomalies were standardised at each grid point by dividing by their respective standard deviations before making composites. For temperature only, we first remove the global average from each grid point before standardising in order to accentuate regional patterns that are more likely to drive circulation changes than would globally uniform warming. Units are standard deviations of 8-year means. Stippling shows where the predictions and observations are significantly different (95% confidence) D.M. Smith et al.
(hence lower pressure) over land than ocean (Fig. 4a, e). This pattern is also evident in the observations (Fig. 4f), especially over the Pacific and North and South America. However, in contrast to the predictions, observed pressure increased in the SPNA, Europe, Asia and northern Africa, and reduced in the Indian Ocean. The predictions also do not capture low pressure in the western Pacific where warming is underestimated. Reduced pressure over the Southern Ocean is seen in both predictions and observations, consistent with an increase in the Southern Annular Mode driven by increased greenhouse gases and reduced Antarctic ozone, 65 although the predictions are zonal and do not capture regions of low pressure to the west of Africa and Australia. The ridge of high pressure extending north-east from the Caribbean is captured, along with reduced pressure in the subtropical North Atlantic consistent with a northward shift of the ITCZ.

DISCUSSION
Our results show that the signal-to-noise paradox, in which the small predictable signal in climate model ensembles is inconsistent with their high level of agreement with observations, is widespread in multi-model decadal predictions, especially for precipitation and mean sea level pressure. This highlights a clear deficiency in climate models that urgently needs to be addressed. However, it also provides an opportunity to make skilful forecasts with existing models by extracting the signal from the mean of a very large ensemble and adjusting its variance. Using a larger multi-model ensemble than was previously available we demonstrate significant decadal prediction skill for precipitation and atmospheric circulation in addition to surface temperature.
Understanding the sources of skill is important for reducing uncertainties in forecasts. We propose a new approach for Role of external forcing. a Correlation between ensemble mean uninitialized simulations and observations for near-surface temperature. b As (a) but after linearly detrending both observations and simulations. c, d As (a, b) but for precipitation. e, f As (a, b) but for mean sea level pressure. Correlations are computed for the same 8 year periods used to assess the initialised predictions (Fig. 3). Stippling in (a, c, e) shows where initialised predictions are more skilful, and in (b, d, f) where correlations are significant (crosses and circles show 90 and 95% confidence intervals, respectively) D.M. Smith et al. diagnosing the impact of initialisation that is more powerful than those used previously. This reveals significant benefits of initialisation including for temperature over Europe, Sahel rainfall, and north Atlantic pressure. However, the overall patterns of skill for temperature, precipitation and pressure are largely captured by the uninitialized simulations (compare Figs. 3a, c, e and 5), and improvements from initialisation generally occur in regions where the uninitialized simulations already have some skill. This could arise because: (1) improved skill arises from predicting internal variability in regions where there is an externally forced response; (2) externally forced skill in these regions is largely incidental 66 and skill in initialised predictions arises mainly from internal variability; (3) the variability is predominantly externally forced and improved skill arises from correcting the modelled response to external factors. The first situation is particularly expected where there is a long term trend driven by slow variations in greenhouse gases. However, uninitialized simulations also capture aspects of the variability around the linear trend (Fig. 5b, d, f), highlighting the need for improved understanding of the roles of other external factors, including solar variations [67][68][69] and volcanic [70][71][72][73] and anthropogenic aerosols. [74][75][76][77][78][79][80] We show that decadal predictions are able to capture many aspects of regional changes, including precipitation over land and atmospheric circulation in addition to surface temperature. There are also several differences between the forecasts and observations, and further work is needed to assess whether these are due to internal variability that was not predicted, incorrect responses to external factors or artefacts of initialisation. Nevertheless, the levels of skill are encouraging overall and support the recent establishment of operational decadal climate predictions endorsed by the World Meteorological Organization (WMO), building on the informal forecast exchange that has been running since 2010. 81 The Decadal Climate Prediction Project 17 contribution to CMIP6 will also provide ongoing forecasts with improved models and an even larger multi-model ensemble than used here. The World Climate Research Program (WCRP) Grand Challenge on Near Term Climate Prediction recently set out the case for operational decadal predictions 4 including an Annual to Decadal Climate Update to be produced each year as an important first step towards operational services. Such information promises to benefit society by informing preparedness for, and adaptation to, near-term climate variability and change.

Observations and models
Near-surface temperature observations are computed as the average from HadCRUT4, 82 NASA-GISS 83 and NCDC. 84 Precipitation observations are taken from GPCC 85 and mean sea level pressure from HadSLP2. 86 We assess a large multi-model ensemble (71 members, Table 1) of decadal predictions from seven forecast centres using hindcasts starting each year from 1960 to 2005 following the Coupled Model Intercomparison Project phase 5 (CMIP5) protocol. 36 To assess the impact of initialisation we compare with an ensemble of uninitialized simulations (up to 56 members) that use the same climate models. We create ensemble means by taking the unweighted average of all ensemble members and assess rolling 8year means defined by calendar years 2 to 9 from each start date of the initialised predictions. The forecasting systems start between 1st of November and January each year, giving a lead time of at least a year before the assessed forecast period to focus on decadal timescales and avoid skill arising from seasonal to annual variability. Both observations and models were interpolated to a 5°longitude by 5°latitude grid before comparison. The residuals o′ and f′ represent the variability that cannot be captured by the uninitialized simulations and their correlation (r′, also referred to as the partial correlation 87 ) measures the impact of initialisation. Because the skill of f and u can be dominated by a common component (such as the global warming trend), r′ is potentially much greater than the difference in skill, thereby increasing the effect size and increasing the power of the test. [40][41][42] It is important to note that errors in estimates ofô andf arising from a finite ensemble size of u can produce a positive bias in r′. An unbiased estimate can be obtained by using independent estimates of u to obtainô andf separately. We do this by dividing the uninitialized ensemble into two independent halves and then estimate the bias by comparing biased and unbiased estimates of r′. This process is repeated using the block bootstrap approach describe below, from which we compute the median bias. We find the bias to be small (of order 10%) and remove it from the values computed for the full ensemble. Hence, our approach extends that used previously to compare seasonal forecasts from different models. 87 The residuals could be highly correlated but represent only a small fraction of the predictable signal. We, therefore, present the impact of initialisation as the ratio of the predicted signal arising from initialisation divided by the total predicted signal, computed as

Significance
For a given set of validation cases, we test for values that are unlikely to be accounted for by uncertainties arising from a finite ensemble size (E) and a finite number of validation points (T). This is achieved using a nonparametric block bootstrap approach, 14,88 in which an additional 1000 hindcasts (or pairs of hindcasts for assessing skill differences) are created as follows: 1. Randomly sample with replacement T validation cases. In order to take autocorrelation into account this is done in blocks of 5 consecutive cases. 2. For each of these, randomly sample with replacement E ensemble members. 3. Compute the required statistic for the ensemble mean (e.g., correlation, difference in correlation, partial correlation, RPC). 4. Repeat from (1) 1000 times to create a probability distribution (PDF). 5. Obtain the significance level based on a 2-tailed test of the hypothesis that skill (or skill difference) is zero, or RPC is one. For skill as a function of ensemble size (Fig. 1b) we randomly sample without replacement to obtain the required number of ensemble members, and compute the average from 1000 repetitions. Uncertainties (pink shading) are computed based on single members since the samples are not independent for larger ensembles leading to an underestimated ranges.
Uncertainties in composite differences (Fig. 4) are computed using a standard t-test of the difference between two means.

DATA AVAILABILITY
The datasets analysed during the current study are available from the corresponding author on reasonable request. D.M. Smith et al.

CODE AVAILABILITY
The code used during the current study is available from the corresponding author on reasonable request.