Comparing the model-simulated global warming signal to observations using empirical estimates of unforced noise

The comparison of observed global mean surface air temperature (GMT) change to the mean change simulated by climate models has received much public and scientific attention. For a given global warming signal produced by a climate model ensemble, there exists an envelope of GMT values representing the range of possible unforced states of the climate system (the Envelope of Unforced Noise; EUN). Typically, the EUN is derived from climate models themselves, but climate models might not accurately simulate the correct characteristics of unforced GMT variability. Here, we simulate a new, empirical, EUN that is based on instrumental and reconstructed surface temperature records. We compare the forced GMT signal produced by climate models to observations while noting the range of GMT values provided by the empirical EUN. We find that the empirical EUN is wide enough so that the interdecadal variability in the rate of global warming over the 20th century does not necessarily require corresponding variability in the rate-of-increase of the forced signal. The empirical EUN also indicates that the reduced GMT warming over the past decade or so is still consistent with a middle emission scenario's forced signal, but is likely inconsistent with the steepest emission scenario's forced signal.

| CMIP5 CGCMs used in this study. 'X' marks indicate that a particular CGCM run was incorporated into the ensemble mean (representing the forced signal) in the given experiment. Also shown are the hemispheric-to-global variability scaling factors that were applied to the low frequency component of unforced hemispheric surface temperature reconstructions (Fig 1c) to convert them to representations of GMT variability (Fig. 1d). One preindustrial control run was used for each CGCM to obtain the hemisphere to global scaling factor.     Supplementary Fig. S2d). (c), High frequency IMFs from the instrumental record lined up with low frequency IMFs from the reconstructed record. At this point the low frequency variability has been converted to a representation of GMT (step 3, Methods). (d), AR(2) simulated IMFs that attempt to emulate the approximate magnitude and frequency, but not necessarily the phase, of the corresponding IMFs in column c. , Same as (a) but comparing Southern Hemispheric variability to GMT variability. In both cases there is a statistically significant correlation (P values shown in the panels) indicating that information on unforced GMT variability can be gleaned from hemisphere mean surface temperature variability albeit with substantial uncertainty. The ratio of the y-value to the x-value, for each model, is the scaling factor listed in Table S1.

Test of methodology
Because the methodology for constructing the ESRUN has not been implemented previously, it was necessary to subject some of the fundamental concepts to basic tests in order to determine if the method was sound. For this task, a single CGCM was utilized (CSIRO-Mk3-6-0) because it had a large number of ensemble members (10) for the 'historical' experiment. We did not use the 'last millennium' experiment because no CGCMs had a large number of ensemble members for this experiment. For the following three demonstrations of concept, we treat the output of this CGCM as the "true" climate system.
1. Conserving physical modes of unforced variability. The ability of EMD (and subsequent AR (2) modeling of IMFs) to conserve physical modes of unforced variability on a variety of time scales was tested. This was done by simply decomposing CSIRO-MK3-6-0's unforced preindustrial control run with EMD, simulating the IMFs with AR(2) models, and summing the AR(2) simulations to create a stochastic realization of unforced GMT variability. This process was performed 100 times and the resulting synthetic time series were compared to the original unforced control run that they were attempting to emulate (considered the "true" unforced GMT variability in this experiment). The synthetic unforced GMT time series produced in this manner had a similar standard deviation ( Fig.  S5a), mean power spectra (Fig. S5b), and mean autocorrelation function ( Fig. S5c) compared to the true unforced variability produced by the control run. This result indicates that the physical modes of variability in a CGCM's control run were conserved by the described methodology. 2. Ability of Multiple Linear Regression to remove the forced signal. It has been suggested that external forcings may excite and/or modulate modes of unforced variability [3][4][5] . This might mean that GMT could have a nonlinear response to external radiative forcings. Therefore, it was necessary to test whether it was reasonable to use Multiple Linear Regression to remove the forced signal from temperature datasets. This was done by utilizing 10 forced runs of CSIRO-MK3-6-0 over the period from 1880-2011 (historically forced from 1850-2005 and forced with RCP 6 from 2006 to 2011). In this experiment, the observed temperature was represented by GMT realizations associated with individual forced ensemble members of CSIRO-MK3-6-0 and the external radiative forcings were the same as those shown in Fig. S2. Application of Multiple Linear Regression to each of the ensemble members then produced 10 unforced GMT time series that were used to produce 100 synthetic unforced GMT realizations (10 for each) according to the ESRUN methodology. This ensemble of 100 synthetic unforced GMT realizations was compared to the "true" unforced GMT variability of the CGCM's unforced control run. Note that the synthetic unforced GMT realizations produced in this experiment were blind to the "true" unforced variability of the CGCMs control run that they are attempting to emulate. These 100 synthetic unforced GMT time series had a similar standard deviation ( Fig. S5d) mean power spectra (Fig. S5e), and mean autocorrelation function ( Fig. S5f) compared to the "true" unforced variability produced by the control run. This similarity indicates that Multiple Linear Regression is able to remove the forced component of variability from the record while leaving the unforced variability behind. 3. Conversion of unforced hemispheric temperature variation to GMT variation. The conversion from unforced hemispheric mean surface temperature variability to unforced GMT variability was also tested. This test was conducted in a similar manner to experiment 2. In this experiment, however, the observed GMT was represented by Northern Hemisphere mean surface temperature realizations associated with individual forced ensemble members of CSIRO-MK3-6-0. Application of Multiple Linear Regression then produced 10 unforced time series that were again used to produce 100 synthetic unforced realizations (10 for each) using the ESRUN methodology. In this case, however, the IMFs were converted from hemispheric variability to GMT variability with a conversion factor of 0.77 (the mean Northern Hemisphere-to-GMT conversion factor in the CMIP5 ensemble (Table S1)). These 100 synthetic unforced GMT realizations were then compared to the "true" unforced GMT variability of the unforced control run. Note that, as in experiment 2, the synthetic unforced GMT realizations produced in this experiment were blind to the "true" unforced variability of the CGCMs unforced control run that they are attempting to emulate. The 100 synthetic unforced GMT time series produced in this manner had a similar standard deviation (Fig. S5g), mean power spectra (Fig. S5h), and mean autocorrelation function (Fig. S5i) compared to the "true" unforced GMT variability produced by the control run. This similarity indicates that the conversion from Hemispheric to GMT unforced variability may be achieved through a simple scaling factor.  Figure S6 compares the power spectral density of unforced GMT estimated from the instrumental record and unforced GMT estimated from the reconstructions. Reconstructions where not used for the simulation of high frequency variability in the ESRUN because the instrumental record is of sufficient length to characterize such variability and because many reconstructions obviously underestimate high frequency variability (probably to due temporal resolution limitations of several proxies). On the other hand, it is valuable to estimate low frequency variability using the reconstructions because they are of longer length than the instrumental record and they are from a time period characterized by weaker external forcing.

The Division between High and Low Frequency Variability
We chose 15 years as the division between high and low frequency variability because this serves as a natural division between ENSO-like variability (which has a characteristic timescale of ~3-7 years) and slower moving modes of variability such as the Atlantic Multidecadal Oscillation and Pacific Decadal Oscillation which have timescales of multiple decades 6 . Also the 15-year division allowed the first three IMFs from the instrumental record (Fig. S2d) to be included in creation of the ESRUN. The timescales of variability for these first three IMFs indicate that they likely arose from unforced internal dynamics. On the contrary, the 4 th instrumental IMF (Fig. S2d) cools from the late 1800s to ~1910 and then warms from ~1910 to ~1940. It is possible that this variability in particular is due to external forcings 7 that have not been completely removed by the Multiple Linear Regression procedure. Because of ambiguity between forced and unforced variability in the instrumental record at this and longer timescales, it is valuable to use additional estimates of low frequency variability provided by the reconstructions. Figures S7 and S8 show the primary results of the manuscript when the high/low frequency division is assigned to be 35 years which allows the first four IMFs from the instrumental record (Fig. S2d) to be included in the ESRUN. Figures S7 and S8 indicate that the primary results are similar for both cutoff frequencies but that there is slightly less energetic low frequency variability in the ESRUN when the cutoff frequency is 35 years. This may be because some of the low frequency variability in the instrumental record has been mistaken for forced variability and the radiative forcings have been implicitly over-fit to the observed GMT at the 15-35 year timescale.

Sensitivity of Results to AR Order Assignment
Figures S9 and S10 show the main results of the manuscript when Bayesian Information Criterion (BIC) 8 was used at Step 4 ( Fig. 4) to assign the autoregressive model order with the best balance between goodness-of-fit and model complexity. In this case, the model order was selected [among AR(1), AR(2),…,AR (10)] that minimized the BIC. Comparison of Figs. S9 and S10 with Figs. 2 and 3 respectively, indicates that the primary results of the manuscript are very similar when BIC is used to assign AR model order and when the model order is pre-assigned to be AR(2).
Figs. S11 and S12 also show very similar results when model order is pre-assigned to be AR (7). This is in contrast to Figs. S13 and S14 where model order was pre-assigned to be AR(1) and there is noticeably less low frequency variability present in the ESRUN.
Overall, these results indicate that AR(1) models are unable to simulate some low frequency variability present in the unforced GMT time series. However, results are very similar when AR(2), AR (7) or AR(X) [where X is assigned from 1 to 10 using BIC] models are used to create the ESRUN. Therefore, it appears that only marginal improvements are achieved by increasing model order past AR(2) and thus we present the AR(2) results in the main text of the manuscript. The ability of AR(2) models to simulate the unforced GMT time series is also illustrated in Figs. S3 and S5.

Reconstruction information
The following is information on the surface temperature reconstructions used in this study. The information below comes directly from Wahl, et al. 9 Ammann and Wahl 10 TITLE: Northern Hemisphere Average Annual Temperature Reconstruction DESCRIPTION_SUMMARY: Uses multiple proxy types, input into inverse regression-truncated EOF climate field reconstruction spanning entire globe at incomplete 5x5 deg grid. Only N.
Hemisphere average is reported here.