Importance of internal variability for climate model assessment

Benchmarking climate model simulations against observations of the climate is core to the process of building realistic climate models and developing accurate future projections. However, in many cases, models do not match historical observations, particularly on regional scales. If there is a mismatch between modeled and observed climate features, should we necessarily conclude that our models are de ﬁ cient? Using several illustrative examples, we emphasize that internal variability can easily lead to marked differences between the basic features of the model and observed climate, even when decades of model and observed data are available. This can appear as an apparent failure of models to capture regional trends or changes in global teleconnections, or simulation of extreme events. Despite a large body of literature on the impact of internal variability on climate, this acknowledgment has not yet penetrated many model evaluation activities, particularly for regional climate. We emphasize that using a single or small ensemble of simulations to conclude that a climate model is in error can lead to premature conclusions on model ﬁ delity. A large ensemble of multidecadal simulations is therefore needed to properly sample internal climate variability in order to robustly identify model de ﬁ ciencies and convincingly demonstrate progress between generations of climate models. npj Climate and Atmospheric Science (2023) 6:68 ; https://doi.org/10.1038/s41612-023-00389-0


INTRODUCTION
Climate models are remarkably successful in reproducing many earth-system phenomena such as atmospheric jet streams, oceanic currents, monsoons, the El Niño Southern Oscillation (ENSO), the North Atlantic Oscillation (NAO), and the response of global climate to external forcing [1][2][3] .From their basis in the Navier-Stokes equations of fluid dynamics, even extreme events like heatwaves and cold snaps, floods and droughts, cyclones, and storms all appear spontaneously in climate model simulations.In some cases, models also warn us of more intense extreme events than we have yet experienced but which could plausibly occur at any time in the current climate [4][5][6][7] .Scientists also use climate models to understand the physical mechanisms behind past changes in climate 8 , to understand and predict extreme events 9 , to project future climate change, and to inform governments and policymakers about the impacts of climate change 10 .
To gain confidence in model projections, the fidelity of models is assessed by comparing model simulations of the historical climate with observations.This trial of models using observations is core to identifying current model deficiencies and prioritizing areas for further development 11 .The scientific literature presents many examples where a mismatch between model and observed climate features has been reported, such as trends in regional rainfall amount and temperature [12][13][14] , multidecadal changes in atmospheric circulation 15 and climatology 16 , the frequency or magnitude of extreme events 17,18 , global teleconnections 19 , interaction between different modes of climate variability 20 or external forcing effects 21,22 .There can be many reasons why models disagree with observations.However, even in the case of perfect models, perfect boundary conditions, and perfect observations, a lack of agreement between the modeled and observed climate can still arise simply due to chaotic internal variability 13,[23][24][25][26][27][28][29] .
Due to the inevitable presence of internal variability, each realization of climate, in both observations and models, represents only one possible realization out of many 30,31 .When evaluating models against observations, this issue (known as sampling variability) is supposed to be handled through statistical tests, but the power of those tests relies heavily on the assumed chance process generating internal variability.This leads to a misinterpretation of significance tests, which is widespread in climate science, leading to overly lax criteria 32 .This problem is only exacerbated if the statistical tests do not take proper account of multidecadal internal variability, which is very difficult to quantify from single or small ensembles of models, or from observations.Many of the studies claiming a mismatch between models and observations, including some cited in the IPCC WG1 report 10 , continue to use only a single or a small ensemble of simulations from a given model.These studies generally show that the differences are statistically significant, typically for time periods that are considered sufficiently long.Here, we provide three illustrative examples to demonstrate how internal variability cannot be easily ruled out as a cause of commonly reported discrepancies even when decades of observations and model data are available.

METHODS
We use an ensemble of initialized climate simulations from the Climate Historical Forecasts Project (CHFP) database.The ensemble member rainfall from each model was spatially averaged over the selected domains (i.e.20-28°N, 76-87°E for north-central India and all-India land region) and bias-adjusted using the difference between the ensemble mean and observations for each model's hindcast period.Model fidelity was then tested using the UNprecedented Simulated Extremes in Ensembles (UNSEEN) method 4,27 and rainfall multimodel ensemble (MME) was created using a selected set of models that passed the UNSEEN tests and are statistically indistinguishable from the observations (GloSea5, CFS, and the MPI models for north-central India for 1950-2013, and GloSea5, ECMWF-S4, CFS, MIROC5, and MPI-LR for all-India for 1901-2013) (see Table 1 of Ref. 4 for details of models, Ref. 33 for details on the CHFP).We have also previously tested the fidelity of the CHFP models in representing JJA rainfall over India and found a reasonably realistic simulation of JJA rainfall in most of these models, i.e. the spatial distributions were comparable to the observations, models show statistically significant skill in simulating the interannual variability of JJA rainfall, and the ENSO-rainfall teleconnections were similar to the observations 4,34 .The ensemble mean trend was removed from each ensemble member time series for each model to remove any influence of climate change before creating MME.A total of 10,000 time series of length 51 years were resampled from MME by randomly selecting ensemble members in sets of three consecutive years to preserve the interannual autocorrelation in rainfall.Linear trends were calculated as the slope of the line of best fit for each time series to obtain 10,000 values of trends.Each trend value was then multiplied by the length of the time series (i.e.51 years) to obtain the change in JJA total rainfall over a 51-year period, shown on the x-axis (Fig. 1a, b).We also tested the sensitivity of the extreme values shown in Fig. 1 to the model variance by removing one of the models with the highest variance while resampling.We find no substantial influence of the differences in the models' variance on the extreme values presented in Fig. 1.
The rainfall MME created for Fig. 1b was used to randomly resample 10,000 time series of lengths 20 and 50 years for all-India rainfall and corresponding sea-surface temperatures (SSTs) for the Nino3.4region (5°S to 5°N, 170°W to 120°W).The correlation coefficient between all-India rainfall and corresponding Nino3.4SSTs was calculated for each of the 10,000 time series for both 20 (Fig. 4a) and 50 years (Figure not shown).Similar to rainfall, the SSTs were detrended before resampling to remove the influence of forced climate change on SSTs.For Fig. 4b, we randomly selected 30 correlation values from the 10,000 correlation values for the length of 50 years.We then calculated the average of the selected 30 correlation values to obtain the mean correlation and this was repeated 10,000 times to obtain the distribution in Fig. 4b.

DO MODELS REPRODUCE OBSERVED TRENDS?
Over the latter half of the 20th century, historical observations show a significant reduction in the summer monsoon rainfall over parts of India 14,[35][36][37] .However, most historical climate model simulations from different CMIP generations have shown a consistent increase in rainfall during this period and beyond into the 21st century under increasing greenhouse gas forcing 14,38,39 .
Several hypotheses have been proposed to explain the apparent mismatch in observed and model-simulated trends in monsoon rainfall over north-central India, including large observational uncertainty 40 , the recent warming of the Indian Ocean 41 , radiative forcing due to aerosols 42 or changes in land-use and land-cover 43 .In most cases, these factors are found to have systematic effects on the modeled monsoon rainfall.However, it is also possible that internal variability on multidecadal timescales contributes to this mismatch.
We find that the most extreme drying and wetting trends in the Indian Meteorological Department (IMD) observational record for both regional (Fig. 1a) as well as all-India rainfall (Fig. 1b) lie within the range of plausible trends from chaotic internal variability in the current climate.(Note that only the models that pass fidelity tests for observed Indian monsoon rainfall have been used for this analysis.)Even more extreme values than have been observed in the IMD record so far are possible, solely due to internal variability, in the absence of any systematic forced effects.For example, there is around a 1-in-160-year chance of a wetting trend of magnitude larger than the wettest trend, and a 1-in-18-year chance of a drying trend of magnitude larger than the driest trend in the observational record.Therefore, for this case, internal variability cannot easily be rejected as the cause of the models' apparent failure to capture the observed drying trend (also see Ref. 44,45 ).
Unprecedented climate extremes are often a manifestation of both internal variability and external forcing.However, in many cases, the internal variability is so large that it can easily negate or greatly overwhelm any forced response in climate trends, even on multidecadal time scales 27,46 ).For example, in this case, internal variability can be large enough to overwhelm the wetting trend due to greenhouse gas forcing and give temporary drying trends in monsoon rainfall.
Another example where the role of internal variability was ignored is the recent paper by Scafetta 47 , which claimed that no model with a climate sensitivity >3 °C was consistent with observed trends since 1979.However, this claim is based on comparing each model's ensemble mean change with an observational estimate (from ERA5) without taking either observational uncertainty or model internal variability into account.When these elements are included and appropriate statistical tests were employed (following Ref. 48), it was demonstrated that the majority of the models with sufficiently large ensembles, even some of those with high climate sensitivity of 5 °C, are consistent with the observations 49 .
There also exists a selection bias when comparing the most extreme trends in the observational record with model-simulated trends 50 .Inspection of observational time series, with no a priori reason to select the particular period of most extreme trends, followed by comparison with model-simulated trends for the same time period introduces a selection bias and the impression that models fail to produce observed trends.In many cases, those extreme trends in regional climate can appear at any time in the model simulation and not necessarily during the observed period, irrespective of any forced changes which are often smaller than the internal variability for short-or medium-term periods.It is therefore very difficult to argue that models cannot reproduce observed extremes by pre-selecting extreme periods in the observations and then testing models for the same historical period.Significance testing of these extreme trends is difficult in such cases and simple tests based on a single or limited ensembles of simulations to reject the null hypothesis of internal variability in favor of a model error are often invalid 51 .
Whilst the Indian rainfall trend presented here is only one example, many other cases of models apparently failing to reproduce observed trends for other regions and variables also exist 15,16,[52][53][54] .Therefore, it is necessary to re-examine such cases and carefully rule out internal variability as a cause of apparent mismatch between observed and modeled trends to robustly identify the true model errors.

HOW SHOULD WE TEST MODELS FOR EXTREME EVENTS?
In addition to the selection bias in time, there also exists a selection bias in space.If we preselect a particular extreme event from the observational record, which, by definition, is a rare event, and look for similar events in climate model simulations, then the chances of finding rare events of the same magnitude, duration, and spatial scale, at the same location will necessarily be low.However, in this case, it is premature to then conclude that models cannot simulate the observed extreme event.For instance, if we then use a large ensemble of climate simulations and search for a similar intense event with no a priori specification of exactly where it should happen, it is often possible to find a similar extreme event in the simulations, leading to very different conclusions about model fidelity 46,55 .
To illustrate this, we use the example of the German floods of 2021.The observed event in July 2021 was associated with daily mean rainfall reaching as high as 150 mm over parts of Germany (Fig. 2c).Searching ensembles of climate simulations from multiple models for European rainfall extremes of similar magnitude to that observed, reveals several instances with rainfall intensity in climate models reaching, and in some cases even exceeding the rainfall rate seen in the observational record (Fig. 3a-d).We can also examine the mean sea level pressure (MSLP) to determine if the simulated extreme corresponds to a realistic circulation pattern.In all cases, the extreme rainfall events in the models are co-located with extreme low-pressure regions (Fig. 3e-h), indicating low-level convergence and enhanced probability of intense rainfall, similar to the observed event which had low MSLP in the vicinity.
To illustrate the point further, we also examined a large ensemble from a single model, the CESM1 56 , as this isolates the impact of internal variability.While we find similar mid-latitude extremes occurring randomly anywhere over the selected domain in some simulations of the large ensemble, other simulations from the same model did not simulate any such events.This shows that a-priori constraining the regional scale and location in the model to that of the pre-selected local record event invalidates commonly used statistical significance tests and hence can lead to the erroneous conclusion that models cannot reproduce the observed extreme.One potential way to handle this issue is to employ a spatially aggregated probability perspective, for example as demonstrated by Ref. 46 , and aggregate rainfall distributions over a spatial domain (e.g. for Köppen Geiger climate zones or regions with similar topographic features or climate variability).However, any such aggregation raises other questions, as the distribution is no longer independent and identically distributed.

ARE TELECONNECTIONS CHANGING?
Large-scale internal variability from phenomena such as ENSO, NAO, or the Indian Ocean Dipole (IOD), are known to influence regional climate across the globe through teleconnections.These teleconnections lead to systematic changes in regional climate far from the center of action of the variability itself [57][58][59][60] and often contribute to the predictability of regional weather and climate events.For example, the ENSO-rainfall teleconnection is crucial for tropical rainfall prediction at seasonal lead times 34,61 .
Recent literature questions the stationarity of these and other teleconnections and in many cases argues that causal mechanisms, such as externally forced climate change, are driving systematic change in either the pattern or the strength of the teleconnection from one epoch to another, and that these changes are not represented in climate models.For instance, changes in the ENSOrainfall teleconnection have been reported for several tropical regions including India 62 , East Asia 63 , North America 64 , and Africa 65,66 , as well as many other large-scale teleconnections, such as the recently discovered connection between the Quasi-Biennial Oscillation and Madden Julian Oscillation 20,67 and the ENSO-Asian teleconnection 68 .
Several studies already highlight internal variability as a cause of the apparent mismatch between model and observed teleconnections [69][70][71][72][73][74] but many others continue to suggest that mismatch implies model error.Therefore, we re-examine one well-known example: ENSO and Indian summer monsoon rainfall teleconnection.
Figure 4 (a) shows that the distribution of ENSO teleconnections in rainfall resamples covers an enormous range of correlations on 20-year timescales (r = −0.90 to 0.47).A similar result holds for 50year timescales (r = −0.80 to 0.22).Note that this range occurs due to sampling variability rather than any true systematic nonstationarity, and includes the historical periods such as 1980-2010 when non-stationarity in the ENSO-Indian rainfall teleconnection has been reported in observations 62 .Whilst methodologies to calculate the ENSO-monsoon relationship vary in the literature, even the extreme examples of apparent non-stationarity in observed teleconnections sit well within the spread of plausible ENSO-monsoon teleconnections due to the internal variability of the climate system.The model resamples even show the possibility of a positive correlation on 20-year timescales; opposite in sign to the observed teleconnection.
In addition, there is also a growing body of literature suggesting that the ENSO-Indian rainfall teleconnection could change in the future under climate change 75,76 .However, Fig. 4 (b) shows that the mean ENSO-rainfall correlation for the CMIP6 multimodel ensembles (of size ~30) for both historical and future periods 72 sits within the range of internal variability (r = −0.31 to −0.51).Therefore, great care is needed before we can conclude with confidence that there is any robust change in the ENSO-monsoon relationship in the future.
Finally, given the broad range of possibilities in Fig. 4a, b, and the fact that simulated future changes are well within this range, it is unlikely that observational data will yield significant examples of changes in teleconnections that are extreme enough to rule out internal variability and detect any true non-stationarity on any reasonable timescale into the future.

A CALL FOR MORE RIGOROUS MODEL ASSESSMENT
We find that studies claiming a mismatch between model and observed climate phenomena are often too quick to ignore the null hypothesis that such apparent discrepancies between models and observations can arise due to chaotic internal variability.We have presented examples where this applies to changes on multidecadal timescales, such as global and regional trends, recent extreme events, and apparent changes in observed teleconnections.Assessing models against the rare and most extreme observed cases automatically introduces a selection bias into the process of model evaluation, which is particularly compounded for extreme events on small spatial scales.Limited observational records can easily show apparent non-stationary or spurious teleconnections due to internal variability and sampling error.Using a single simulation or small ensembles of simulations can easily lead to premature conclusions about model fidelity.Therefore, a large ensemble of multidecadal simulations is needed to properly sample initial condition uncertainty and convincingly demonstrate model failure to capture observed phenomena, as well as to assess the progress or deterioration in performance between older and newer generations of models.Initialized large ensembles are already being used for seasonal predictions 33,77 , and more recently for multidecadal predictions 78 and projections 53,79 .Using these ensembles to isolate internal variability from true model errors provides a powerful second application.

FUTURE OUTLOOK AND SUMMARY
While a large ensemble is desirable to account for fluctuations due to internal variability, we acknowledge that these are computationally expensive and may not be always available.Therefore, we also highlight other potential model evaluation methodologies that could be used.For instance, in contrast to picking a single period in observations and testing a single model simulation against that, we suggest using a longer time period, and sampling all possible periods of a fixed length within that interval.For example, sampling 20-year trends over 50 years for both observations and models and comparing the distribution of trends 4,51 .Comparing multimodel mean trends directly with the observations is also not a fair comparison 49 and for these cases, statistical tests, similar to the UNSEEN method 6 , could be employed to test model fidelity.
Grid point comparisons for extremes, such as calculating spatial distributions of extremes (e.g., 1-day maximum rainfall) and comparing those with the model spatial distribution, are likely to show apparent disagreement.A similar problem has been recognized as 'double-penalty' in high-resolution weather prediction where verification scores are penalized twice, i.e. for simulating a feature in the wrong place and not simulating a feature at the right place 80 .Simple approaches, such as pooling daily maximum rainfall values over a time period and spatial domain for both observations and models, and then comparing the distributions, or extending more sophisticated methodologies (such as Hi-RA 81 ) used in high-resolution weather prediction, could be considered for evaluating climate models for extremes.
There are also new emerging methodologies, such as resampling of observational records to create pseudo-observational large ensembles 5 , quantification of the degree of discrepancy between models and observations using Bayes factors 32 , spatial aggregation of regional extremes 46 , ensemble boosting methods where a climate model is reinitialized to generate large samples of extreme events at relatively low cost 82 , physical or process-based evaluation of models using storyline methods 32 , and intelligent use of machine learning methods to differentiate variability from model errors 83 , all of which could be employed for better control of internal variability and sampling error.
Finally, we emphasize that models will always contain errors.The proper sampling of internal variability is a necessary but not sufficient condition for assessing model fidelity; it is also crucial to assess if the model simulates the phenomena of interest and is fit for purpose.Certainly, not all disagreements between models and observations can be attributed to internal variability.for the ensemble size of 30 and for the SSP5-8.5 scenario.

Fig. 1
Fig. 1 Internal variability in rainfall trends.Change in June-July-August (JJA) total rainfall over a 51-year period from internal variability in a multimodel ensemble (MME) of climate predictions for (a) north-central India (20-28°N, 76-87°E) and (b) all-India (land-only) rainfall.The darker color indicates wetter trends.The dotted lines show the most extreme 51-year trends in the IMD observational record during 1950-2013 (the period for which drying trends were observed over north-central India) and 1901-2013 for all-India.The MME here is from the Climate Historical Forecasts Project (CHFP).See Methods for more details.

Fig. 2
Fig. 2 Observed extreme rainfall over Europe in July 2021.Daily mean rainfall (mm) over Europe for 12-15 July (a-d) and corresponding daily mean sea level pressure anomaly (hPa) (e-h) from E-OBS v26.0 dataset.The daily sea level anomalies are calculated with respect to monthly values for July 2021.

Fig. 3
Fig. 3 Simulation of extreme European rainfall in climate models.Daily mean rainfall (mm) over Europe for a selected day between 1950-2014 (a-d).Corresponding daily mean sea level pressure (hPa) is shown in (e-h).The CESM-LE case is from the CESM1 Large Ensemble and the remaining models are from CMIP6 ensembles.

Fig. 4
Fig. 4 Internal variability in the ENSO-monsoon relationship.Probability distribution of Pearson's correlation coefficient between detrended Nino3.4 sea-surface temperatures (SSTs) and all-India rainfall from the Climate Historical Forecasts Project (CHFP) multimodel ensemble.Panel (a) is for resampled rainfall and SST time series of length 20 years and shows the range of correlations corresponding to individual ensemble member time series.Panel (b) is for 50-year time series and correlation values were randomly resampled in sets of 30 and therefore represent the range of mean correlations for 30 ensemble members (see Methods).The dotted lines show some extreme examples of correlation values from the literature for corresponding timescales.Examples shown in panel (b) are from Ref.72 for the ensemble size of 30 and for the SSP5-8.5 scenario.