Standard assessments of climate forecast skill can be misleading

Assessments of climate forecast skill depend on choices made by the assessor. In this perspective, we use forecasts of the El Niño–Southern Oscillation to outline the impact of bias correction on skill. Many assessments of skill from hindcasts (past forecasts) are probably overestimates of attainable forecast skill because the hindcasts are informed by observations over the assessed period that would not be available to real forecasts. Differences between hindcast and forecast skill result from changes in model biases from the period used to form forecast anomalies to the period over which the forecast is made. The relative skill rankings of models can change between hindcast and forecast systems because different models have different changes in bias across periods.

A skill score for a forecast quantifies how well the forecast did in repeated trials relative to a well-defined reference forecast. There are many different measures of model skill. Different measures reward and penalise different attributes of a forecast, and thus a particular model may do well on some measures but not others [24]. It is therefore important to use a range of skill measures. In the main text we used the random walk skill score, which is well suited to the comparison of forecast systems. It is also desirable to provide a measure of skill relative to simple baselines such as chance or past averages.
One such measure is the Gerrity score (GS) [60], which assesses ENSO as a categorical forecast (El Niño, neutral, La Niña) based on the range in which the Niño3.4 value lies. Another measure is the mean squared skill score (MSSS) [61], which assesses the forecast according to how far the forecast Niño3.4 values lie from the observed values, and is thus referred to as a distance-based skill score. We assess the MSSS for each model forecast relative to a climatological forecast (where likelihoods are set relative to past behaviour). The GS and MSSS skill scores for ENSO for a set of dynamical models and for a linear and logistic regression model are shown in Supplementary Figure 1.
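To make the distance-based score concrete, the sketch below computes an MSSS against a climatological reference. It is a minimal illustration with invented array names (fcst, obs, clim), not code from this study.

```python
import numpy as np

def msss(fcst, obs, clim):
    """Mean squared skill score of a forecast relative to a reference.

    fcst, obs, clim: 1-D arrays of Nino3.4 values aligned in time, where
    clim holds the climatological value for each verification month.
    Returns 1 for a perfect forecast and 0 for the reference forecast.
    """
    mse_f = np.mean((fcst - obs) ** 2)   # forecast mean square error
    mse_r = np.mean((clim - obs) ** 2)   # reference (climatology) error
    return 1.0 - mse_f / mse_r

# Toy example: two years of monthly values against a zero climatology.
rng = np.random.default_rng(0)
obs = rng.normal(size=24)
clim = np.zeros(24)                          # climatological forecast
fcst = obs + rng.normal(scale=0.5, size=24)  # imperfect forecast
print(msss(fcst, obs, clim))                 # positive: beats climatology
```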
For categorical forecasts, all models are better than chance for leads out to about 10 months. The dynamical models have mostly similar categorical scores to the linear and logistic regression models (given allowance for sampling uncertainty). For the distance-based skill measure, all models are better than a climatological forecast for leads out to about 6 months. Some of the dynamical models score similarly to the linear regression model, and some score worse. The cola model is an outlier, with less skill at short lead times (< 3 months) than all other models for both categorical and distance-based skill scores. This is related to a discontinuity in initial conditions for this model (see Supplementary Note 2).
Supplementary Figure 1: Skill of ENSO forecasts as a function of lead time (0* to 11 months); shading indicates where a model is more or less skillful than a chance forecast. Panel a shows the Gerrity score, $GS = \frac{1}{n}\sum_{i=1}^{3}\sum_{j=1}^{3} q(f_i, o_j)\, s_{ij}$, where $q(f_i, o_j)$ is the number of forecasts in category $i$ that had observations in category $j$, $s_{ij}$ are elements of a scoring matrix, and $n$ is the total number of forecasts. The scoring matrix tabulates the reward or penalty for each of the nine possible combinations in forecasting one of 3 categories and observing one of 3. $s_{ij}$ is configured such that GS = 0 for random and constant forecasts. The chance forecast provides an equal weighting to each of the three El Niño–Southern Oscillation (ENSO) categories. For GS, a perfect forecast scores 1 and a chance forecast scores 0. Panel b shows the mean squared skill score (MSSS), which measures the mean squared difference between the Niño3.4 forecasts and observations relative to a forecast of the climatological value of Niño3.4 for each month. The MSSS is based on the mean square error (MSE): $MSSS = 1 - MSE_f/MSE_r$, where the reference $r$ is taken to be a climatological forecast for each month $m$. For MSSS a perfect forecast scores 1 and a climatological forecast scores 0. The dynamical models are represented by the colored lines in the legend and the statistical models are the black lines. The shaded bands on each line represent a 5-95% range of values from a bootstrap resampling. Sampling variability is approximated from 1000 resampled estimates of the skill metric, generated by bootstrapping across forecast and ensemble member (prior to computing the ensemble mean). The skill scores are calculated using fair anomalies for hindcasts over the test period 1999-2015.
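For readers who want to reproduce the categorical score, the sketch below builds the Gerrity scoring matrix from the sample category probabilities following the standard Gerrity (1992) construction and applies it to a contingency table of counts. The table values are invented for illustration.

```python
import numpy as np

def gerrity_matrix(p_obs):
    """Gerrity (1992) scoring matrix from observed category probabilities."""
    K = len(p_obs)
    P = np.cumsum(p_obs)               # cumulative probabilities, P[K-1] = 1
    a = (1.0 - P[:-1]) / P[:-1]        # odds terms a_r for r = 1..K-1
    s = np.zeros((K, K))
    for i in range(K):
        for j in range(i, K):
            s[i, j] = (np.sum(1.0 / a[:i]) - (j - i) + np.sum(a[j:])) / (K - 1)
            s[j, i] = s[i, j]          # the scoring matrix is symmetric
    return s

def gerrity_score(counts):
    """GS from a contingency table counts[i, j] of forecasts in
    category i that verified in observed category j."""
    n = counts.sum()
    p_obs = counts.sum(axis=0) / n     # observed category frequencies
    s = gerrity_matrix(p_obs)
    return np.sum(counts * s) / n

# Invented 3-category table (La Nina, neutral, El Nino) for illustration.
counts = np.array([[20.0, 5.0, 1.0],
                   [6.0, 30.0, 7.0],
                   [2.0, 4.0, 25.0]])
print(gerrity_score(counts))           # 1 = perfect, 0 = chance/constant
```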
The cola model has a discontinuity in the supplied initial conditions for its forecasts around 1999 [15,37,39]. This discontinuity is potentially important because the reference period used in this work is prior to 1999 and the testing period is after 1999. The discontinuity applies only to the cola model and will degrade the skill of the cola model. We keep the cola model in the set of models here as it illustrates the role of initial condition errors relative to biases in forecast climatology with lead time. The discontinuity is evident when plotting the difference (error) between each model's initial Niño3.4 value and the observed value, as shown in Supplementary Figure 2. In a two-sample Kolmogorov-Smirnov test on the 1982-1998 and 1999-2015 error distributions, the null hypothesis (that the distributions are identical) is rejected at a p-value of 0.01 for cola and gmao; for all other models the error distributions of the two periods are indistinguishable at this p-value. For those other models, any change in model bias between the training and testing periods is not due to changes in initial condition error.
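The two-sample test reported here is standard and easily reproduced; below is a minimal sketch using scipy, with synthetic error series standing in for the initial-condition errors of the two periods.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic initial-condition errors (model minus observed Nino3.4 at
# initialization) for the two periods; the shift mimics a discontinuity.
rng = np.random.default_rng(1)
err_1982_1998 = rng.normal(loc=0.0, scale=0.3, size=17 * 12)
err_1999_2015 = rng.normal(loc=0.4, scale=0.3, size=17 * 12)

stat, p_value = ks_2samp(err_1982_1998, err_1999_2015)
if p_value < 0.01:
    print(f"reject identical distributions (D={stat:.2f}, p={p_value:.1e})")
else:
    print("error distributions indistinguishable at p = 0.01")
```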
The cola model has some (non-zero) change in model bias, ∆B, at low lead (leads 0*, 1), whereas the other models have almost no change in bias at this range. This can be seen clearly in row g of Fig. 4 for the month of March, for example, where most models (except cola) have almost no change in bias until lead 3. The issue is not simply that cola has some bias at low lead, as that is also true of the aer04 model, for example. The bias at low lead is consistent from training to testing period in the aer04 model, so its change in bias is small. For cola the bias at low lead is not consistent from training to testing period, resulting in a larger change in bias at low lead. This change in bias at low lead explains why cola performs poorly with fair-based methods at low lead times relative to the other models.
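In symbols, ∆B is the difference between the lead-dependent mean bias in the testing period and that in the training period. A minimal sketch under assumed array shapes (the names are illustrative, not from the study's code):

```python
import numpy as np

def bias_change(fcst_train, obs_train, fcst_test, obs_test):
    """Change in bias, delta B, per lead between training and testing periods.

    fcst_* and obs_*: arrays of shape (n_forecasts, n_leads) holding
    Nino3.4 forecasts and their verifying observations for each period.
    """
    b_train = np.mean(fcst_train - obs_train, axis=0)  # bias B, training
    b_test = np.mean(fcst_test - obs_test, axis=0)     # bias B, testing
    return b_test - b_train

# An aer04-like model can have sizeable but consistent bias (small delta B);
# a cola-like model has inconsistent low-lead bias (large delta B at lead 0*).
```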

S3
One might argue that the use of a distance measure to score forecasts in Fig. 2 favours the linear regression model, which is based on a fit that minimizes the distance of the regression fit from the observations. The dynamical models are of course not explicitly designed to 'fit' ENSO in this way. To provide an alternative regression that is not based on a distance fit, we show the random walk skill scores for a comparison with the categorical logistic regression model in Supplementary Figure 3, panels a-e. In this case all ENSO forecast anomalies are converted to ENSO categories, and the skill score rewards whichever forecast system predicts the correct category. The comparison of the linear regression forecasts with the logistic regression forecasts is represented by the black line, which is squarely in the white 'chance' zone, indicating that neither forecast is superior. For the fair variants, the dynamical models are either in the white zone or on the boundary between this zone and the zone where the models are less skillful than the logistic regression. This is a marginally better result for the dynamical models than for the comparison with linear regression (Fig. 2), but it does not change the qualitative conclusion that the dynamical models are not outperforming the simple regression forecasts. The results for lead 3 months are repeated for all lead times between 1 and 11 months in panels f-j. The random walk skill score is mostly constant as a function of lead time for the categorical forecast comparisons.
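As a rough illustration of how the categorical random walk comparison works, the sketch below steps the walk +1 when system A alone forecasts the verified category and -1 when system B alone does; the tie convention and the ±2√n 'chance' bounds are one common choice, not necessarily the exact convention used in the main text.

```python
import numpy as np

def categorical_random_walk(correct_a, correct_b):
    """Random walk comparing two categorical forecast systems.

    correct_a, correct_b: boolean arrays, True where each system
    forecast the observed ENSO category. Returns the cumulative walk
    and approximate bounds of the white 'chance' zone.
    """
    steps = correct_a.astype(int) - correct_b.astype(int)  # +1, 0, or -1
    walk = np.cumsum(steps)
    n = np.arange(1, len(steps) + 1)
    bounds = 2.0 * np.sqrt(n)   # walk within +/- bounds: neither superior
    return walk, bounds

# Toy usage with random "correctness" sequences of equal skill.
rng = np.random.default_rng(2)
a = rng.random(200) < 0.5
b = rng.random(200) < 0.5
walk, bounds = categorical_random_walk(a, b)
print(abs(walk[-1]) < bounds[-1])  # usually True: consistent with chance
```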
Supplementary Figure 3: Random walk skill scores for comparisons with the logistic regression model; shading indicates where a model is more or less skillful than the logistic regression.

S4
In this section we compare three simple fair variants (fair, fair-sliding, fair-all) using the same forecast systems. The comparison of fair and fair-sliding is shown in panels a, b, and c of Supplementary Figure 4. For the more sensitive models displaying a larger change in bias from training to testing periods (aer04, cola, florA, florB), at lead 3 and lead 6 (panels a and b), the random walk score tends progressively to favour fair-sliding over fair as forecast comparisons are made later and later in the hindcast period. Each successive forecast comparison steps further away from the period used to define the reference climatology for fair, whereas fair-sliding moves the reference climatology along behind the forecast and so continues to improve relative to fair. The random walk scores as a function of lead time are shown in panel c. For almost all the models there is progressively more advantage to fair-sliding relative to fair as lead time increases. Keeping the reference climatology current (fair-sliding) is increasingly important at longer leads.
The results for the comparison of fair-sliding and fair-all are shown in panels d, e, and f. The more sensitive models (aer04, florA, florB) get progressively better for fair-sliding relative to fair-all with successive forecast comparisons. That is, these models gain more from keeping their reference climatology current (as close as possible to the forecast time) than from an ever-growing reference climatology, even though the latter extends to the time of the forecast. The extra information in the longer fair-all reference climatology (much of it further from the forecast time) actually degrades the skill of these models. The skill advantage of fair-sliding over fair-all is mostly manifest at longer lead times. For the least sensitive models (cm3, cm4i), skill tends to be higher when the longer reference climatology (fair-all) is used.
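The three variants differ only in which prior years supply the reference climatology from which each forecast anomaly is formed. A schematic sketch follows, with illustrative names and a 17-year sliding window assumed purely for the example:

```python
def reference_years(fcst_year, train_start, train_end, window, method):
    """Years contributing to the reference climatology for one forecast.

    'fair':         the fixed training period only
    'fair-sliding': the `window` years immediately preceding the forecast
    'fair-all':     every year from the start of training up to the forecast
    """
    if method == "fair":
        return list(range(train_start, train_end + 1))
    if method == "fair-sliding":
        return list(range(fcst_year - window, fcst_year))
    if method == "fair-all":
        return list(range(train_start, fcst_year))
    raise ValueError(f"unknown method: {method}")

# Forecast launched in 2007, training period 1982-1998, 17-year window.
print(reference_years(2007, 1982, 1998, 17, "fair")[-1])          # 1998
print(reference_years(2007, 1982, 1998, 17, "fair-sliding")[0])   # 1990
print(reference_years(2007, 1982, 1998, 17, "fair-all")[-1])      # 2006
```

The anomaly for a given start month and lead is then the forecast minus the model's own hindcast climatology over the returned years at that same month and lead.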
In summary, the 'fair' method yielding the highest skill for the dynamical model hindcasts is fair-sliding, where the reference climatology slides along behind the forecast being made. This method is generally superior even to one in which all prior forecasts are used in forming the reference climatology. This result can be partly understood in terms of the periods tested. Most of the model forecast climatologies exhibit warming from the training to testing period, whereas the observations cool slightly over these periods. These differences between observed and modelled climatological trends will contribute large bias errors for a static reference climatology like fair [53]. The result favouring fair-sliding is probably also a particular feature of ENSO hindcasts/forecasts because of the nature of ENSO. That is, ENSO hindcasts are particularly challenging to evaluate for skill because the time series available to develop the model hindcast climatology is somewhat pathological, containing only a small number of very large events. As one changes forecast periods from one decade to another, one can easily pick up a large new event (a big El Niño or La Niña) in the forecast testing period that is not well sampled by the forecast climatology over prior periods. That makes it challenging to develop a reference climatology and increases the sensitivity of bias-corrected skill to the choice of reference period. As such, it pays for the reference climatology to be as close as possible to the testing period. For real forecasts, one might effectively decrease the contribution of bias errors to forecast degradation by updating the reference climatology to be as current as possible every time a new forecast is launched.

S5
The mean of the forecast anomalies as a function of lead time and calendar month, $\bar{f}^{\mathrm{fcst}}_{m,\tau}$, is shown in Supplementary Figure 5 for each of the 'fair' methods here (fair, fair-all, fair-sliding). For 4 of the 9 models (aer04, cola, florA, florB) the mean anomalies are biased large and positive, which degrades their forecast skill. The warm biases in the mean anomalies are largest for the fair method (row a) and smallest for fair-sliding (row c), consistent with the relative skill between the 'fair' methods shown in Supplementary Figure 4. The cm3, cm4i, nemo, and seas5 models are less sensitive than the other models in that they are least affected by the choice of reference period. This is related to the degree of constancy of climatological bias in these models.

Supplementary Figure 7 shows the skill of dynamical model forecasts of onset of El Niño and La Niña relative to linear regression forecasts using the unfair method. For this method the results do not change much from the original periods (panels b and e) to the switched periods (panels c and f). This stands in contrast to results using the fair method (Fig. 5), where the lead-time dependent changes in bias across different periods can act to help or hinder a forecast of onset. Panels c and f are the same as panels b and e, except that the training and testing periods have been switched.
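The quantity plotted in Supplementary Figure 5 is simply the average of the bias-corrected anomalies over start dates; a minimal sketch with an assumed array layout (not the study's code):

```python
import numpy as np

def mean_forecast_anomaly(anom):
    """Mean forecast anomaly by calendar month m and lead tau.

    anom: array of shape (n_years, 12, n_leads) of bias-corrected
    Nino3.4 anomalies indexed by start year, start month, and lead.
    Values near zero indicate an unbiased anomaly climatology; large
    positive values indicate a warm bias that degrades skill.
    """
    return anom.mean(axis=0)   # shape (12, n_leads)
```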