Possible causes of data model discrepancy in the temperature history of the last Millennium

Model simulations and proxy-based reconstructions are the main tools for quantifying pre-instrumental climate variations. For some metrics such as Northern Hemisphere mean temperatures, there is remarkable agreement between models and reconstructions. For other diagnostics, such as the regional response to volcanic eruptions, or hemispheric temperature differences, substantial disagreements between data and models have been reported. Here, we assess the potential sources of these discrepancies by comparing 1000-year hemispheric temperature reconstructions based on real-world paleoclimate proxies with climate-model-based pseudoproxies. These pseudoproxy experiments (PPE) indicate that noise inherent in proxy records and the unequal spatial distribution of proxy data are the key factors in explaining the data-model differences. For example, lower inter-hemispheric correlations in reconstructions can be fully accounted for by these factors in the PPE. Noise and data sampling also partly explain the reduced amplitude of the response to external forcing in reconstructions compared to models. For other metrics, such as inter-hemispheric differences, some, although reduced, discrepancy remains. Our results suggest that improving proxy data quality and spatial coverage is the key factor to increase the quality of future climate reconstructions, while the total number of proxy records and reconstruction methodology play a smaller role.


Proxy maps and local and hemispheric mean correlations
In this section we provide maps of the SH and NH proxy locations and display the local and hemispheric mean correlations of the proxy data. Figure S1: Top: Location of proxy records used in this study. NH and SH proxies are taken from ref. 1 and ref. 2 , respectively. The tropical records north of the equator are used for the SH reconstructions, because they have a stronger (and significant) relationship to SH mean temperature than to NH mean temperature 2 . Sizes of the circles represent the length of the record (between 111 and 1000 years) and the colors show the correlation with local temperature over 1911-1990 (details

Alternative NH targets, proxy data and reconstructions Sensitivity to NH target season and domain
We use a full hemispheric mean over the calendar year window as NH reconstruction target to allow comparing our results with ref. 2 , who used the reconstruction ensemble of ref. 3 for the NH. Ref. 1 use a May-August (MJJJA) extra-tropical land-only target for their reconstruction. While this extra-tropical temperature composite has a much larger variance, the two targets are highly correlated ( Figure S4) and so are the resulting reconstructions ( Figure S5). Our results and conclusions are not sensitive to the choice of the NH reconstruction target ( Figure S6 and Figure S28).  Alternative NH proxy data and reconstructions Figure S7- Figure S9 compare the real-world NH reconstruction based on the proxy network of ref. 1 with alternative reconstructions: A reconstruction based on the same CPS methodology but using the new NH proxy collection of ref. 4 and the reconstruction ensemble of ref. 3 .
The NH reconstruction used herein apparently has a relatively small Present-day -LIA amplitude compared to the other reconstructions ( Figure S8). Note that this is mainly because this reconstruction has relatively warm values during 1600-1650, which was used as LIA baseline period. Using an alternative baseline period would yield much more similar amplitudes between the three reconstructions (not shown).
Despite these differences, our results and conclusions are not affected by the choice of the NH reconstruction as shown by Figure S9. The NH reconstruction based on the dataset of ref. 4 shows slightly better agreement with the model data (e.g. higher NH vs. SH correlations). Note that this database has a higher number of records that are more homogeneously distributed across the NH, confirming our interpretation of the importance of well distributed proxy locations.

Pseduoproxy generation
In this section, we provide additional information and a more detailed description of the pseudoproxy generation described in the Methods section of the main text.
An overview of the different pseudoproxy generation methods applied herein is presented in Table S1. We generated a range of PPE using different concepts and amounts of noise. First, the model field is subsampled at the locations of the proxy records in the SH 2 ( Figure S1) and NH 1 5 and references therein). In addition, we also generate pseudoproxies with realistic noise levels for each proxy. This means that the SNR is calculated for each proxy individually based on its correlation with local temperatures in the real world (equation 5 in the Methods). This is in contrast to earlier studies 6,7 , where SNR was sampled from a distribution generated based on the correlations of all proxies in the global proxy database. We use the individual empirical correlation for each proxy because the time period covered by the proxies in our database varies considerably. Hence, for the quality of the reconstruction at the beginning of the reconstruction period, it plays an important role whether the longest proxies have a high or low SNR. For realistic results, individual SNRs are therefore preferable.
However, the realistic SNR values used herein (median 0.41 in the SH and 0.51 in the NH) are close to the ones reported by ref. 6 , who found a mean SNR of 0.45 using a similar method.
In the real world, the noise on proxy data is probably not white, as non-proxy noise can arise from temporally and even spatially correlated causes. Therefore it may be important that the pseudoproxies have similar spectral properties as the real proxies. We therefore created pseudoproxies that have the same AR1 coefficient as the corresponding real proxies 8,9 . We generated two sets of such AR1 proxies, one with realistic temperature correlations (SNRr-AR1 experiment #6) and one with SNR 1 (SNR1-AR1, #8) As described in the main text and methods, we generate additional pseudoproxies, where we set the signal to noise ratio in a way that replicates the correlations of the proxy with the field mean target instead of local temperatures. This is done again by creating pseudoproxies with realistic AR1 properties using realistic SNR (SNRrt-AR1 experiment #7) and SNR 1 (SNR1t-AR1 experiment #9).
As expected, the PPE experiments based on target correlations show improved results in terms of skill and amplitude, compared to the pseudoproxies based on local correlations, particularly for the SH ( Figure S10 and Figure S11). Experiment #9, using target correlations and SNR1 yields skill values that are clearly higher than those from all other experiments including the NoNoise experiment and real proxy reconstructions.
From these results we conclude that an SNR of 1 is unrealistically high when using target instead of local temperature correlation (experiment #9). In contrast, the underperformance (i.e. lower skill) of #5 and #6 (realistic local correlations) compared to the real proxies ( Figure S10 and Figure S11) is at least partly caused by the reduced correlation of local temperatures with the field mean target in the model world.
In the main text, we limit our interpretations to the experiments #1 (NoNoise), #8 (SNR1-AR1, label LocalCor in the main text) and #7 (SNRrt-AR1 label TargetCor), as these are most consistent with the realworld reconstructions in terms of reconstructions skill. To evaluate this consistency, we use five different verification skill metrics (formulae see Methods section of the main text): RE ( Figure 3) and r 2 for each ensemble member over the verification period sampled individually for each member; and RE, r 2 and RMSE of the ensemble median over the 1881-1910 early verification period. We calculate the mean difference between real proxy reconstructions and PPE for each skill metric across all model simulations used (temporally averaged across the reconstruction period, see also Figure S10 and Figure S11). The PPE are then ranked using the average value across all five skill metrics (after reversing the sign for RMSE, Table   S2).
Of the noise-perturbed PPE, the SRN1 and SNR1-AR1 experiment are closest to the real proxies in both hemispheres. These two experiments have similar skill metrics and reconstruction time series (not shown), but the AR1-based PPE have more realistic spectral properties leading to performance that is more similar to the real-world data for some metrics (e.g. Figure S26). We therefore select SNR1-AR1 as "best match" experiment. From the two experiments based on target correlations, the SNRrt-AR1 PPE is clearly favorable and is used as second "best match" experiment shown in the main text. Sensitivity plots for the Skill plots for each individual model simulation are shown below (Section S12).

Proxy-Sytem-Model -based pseudoproxies
As an additional validation to assess the robustness of our results to pseudoproxy generation, we calculate a set of additional PPE based on pseudoproxies derived using proxy system models (PSM).
We use tree-ring width and coral δ 18 O PSM pseudoproxies (PSM-PP) as described in Ref. 10 . Similar to Ref. 10 , PSM pseudoproxies are computed for tree-ring width using the VS-lite model 11 and the model of Ref. 12 for coral δ 18 O. The PSM pseudoproxies are generated for the CESM1-CAM5 model ensemble, using the input variables of monthly temperature and precipitation for the trees and annual sea surface temperature and sea surface salinity for the corals. The proxy network is based on that of the updated PAGES2k proxy network 4 and realistic PSM parameters are derived from the PAGES2k proxy data and historical observational data, as in Ref. 10 . For the PSM-based PPE, we use 397 tree-ring and 44 coral PSM-PP with latitude > 0°N to reconstruct NH mean climate (as opposed to the tree-ring network of Ref. 1 in the main text, see also Figures Figure S9-Figure S11). For the SH, we generate additional PSM-PP for those tree-ring records that were used as SH temperature predictors in the reconstruction of Ref. 2 , but are not included in the PAGES2k database (31 records). In the SH reconstruction of Ref. 2 , only three out of the eleven records extending back beyond 1400 are coral or tree-ring data. The other records are Lake sediments (3), marine sediments (1) or ice cores (4). To allow for a skillful reconstruction in the first few centuries of the reconstruction period, these data are required. The model simulations we use herein are not isotope-enabled and there are, to our knowledge, no generally applicable PSMs for sediment archives. Therefore, we use the same noise-based pseudoproxy data as in the main text (LocalCor experiment) for the 32 non-tree-ring and -coral records in the SH, leading to a total of 124 records. Across both hemispheres, this results in a total of 565 records used for this experiment, 94% of which are PSM-based. We use the same target temperature datasets and seasonal windows as for the other PPE.  Interestingly, the PSM PPE show a larger loss of low-frequency variance ( Figure S13b). This reduced present-day LIA amplitude is already evident in the PSM input proxy data (not shown). We hypothesize that it is caused by the fact that precipitation (tree-ring PSM) and sea surface salinity (coral PSM) are used besides local temperature to generate the PSM and these variables do not show the pronounced industrial-era warming trend that is inherent in temperature. Sea surface salinity incorporated in the coral PSM pseudoproxies (in addition to sea surface temperature) generally even trends in the opposite direction from temperature.
In contrast, the PSM PPE can reproduce the volcanic cooling and forcing response in D&A of the CESM model truth in the NH very well and much better than the LocalCor and TargetCor and NoNoise experiments (Figures Figure S15Figure S16). This is most likely caused by the fact that the PSM PPE is based on eight times more input proxy data (441 as opposed to 53) in the NH, which are also distributed over the entire hemisphere (except regions north of 72°N) in contrast to the mid-latitudes-only network used for the other PPE. This corroborates our conclusion that proxy distribution is key to capture the response to external forcing.
Along with the alternative reconstruction methods (SectionS10) and targets (section S3) and additional noise-based PPE (Section S2), this PSM PPE experiment using a different concept of pseudoproxy construction and a larger proxy network supports the conclusions drawn in the main text.

Reconstruction amplitude
In this section, we show additional plots for the present-day minus LIA amplitude ( Figure 3 in the main text). We show the amplitudes for each experiment listed in Table S1 across all model simulations and  the results for each model ensemble (HadCM3 and CESM1-CAM5) separately. The effect of the number of proxies available in PPE with randomly sampled locations is also shown in Figure S23 and Figure S24. Figure S17: Low frequency amplitudes of the SH reconstructions as in Figure 3b but for all PPE listed in Table S1.
Figure S18: Same as Figure S17 but for NH. Data from the NH real proxy reconstruction of Frank et al. 3 are shown in the first box on the left. Figure S19: Same as Figure S17 but only for the four simulations of the HadCM3 model. Figure S20: Same as Figure S17 but only for the ten simulations of the CESM model. Figure S21: Same as Figure S18 but only for the four simulations of the HadCM3 model. Figure S22: Same as Figure S18 but only for the ten simulations of the CESM model.  6. North -South correlations and differences Figure S25 compares the NH vs. SH differences and correlations with the results from ref. 2 , who used a large ensemble (n=524) of NH reconstructions based on different reconstruction techniques and proxy datasets, partially also including decadal-scale and lower frequency proxies 3 . The results are similar to the ones from the NH dataset used herein (black solid), providing further evidence that our results are robust to the choice of reconstruction method and proxy database.  Figure S26 includes all PPE and shows that our conclusions are robust to the choice of optimal noise levels. Note that the SNR1t-AR1 experiment has unrealistically low noise levels, as shown by the strongly overestimated verification skill ( Figure S10 and Figure S11). Only the AR1 experiments are able to generate NH-SH differences that are substantially different from the model truth. Figure S26: Same as Figure S25 but including all experiments listed in Table S1. Figure S27 shows the timing of global extreme decades as calculated as in ref. 2 . First, the fraction of decadally smoothed reconstruction ensemble members that exceed the ±1 std.dev. threshold is calculated for each hemisphere. Fractions from the SH and NH are then multiplied to yield global probabilities for extreme cold (blue) and warm (red) decades.

Superposed Epoch Analysis for volcanic eruptions.
This section provides additional Figures for the SEA analysis.  9. Additional plots for the D&A scale factors Figure S30: Same as Figure 6 in the main text but including also the period 1400-1999 for the D&A analysis (right panels). The stronger response to forcing during this period is due to the strong and common anthropogenic warming trend in models and reconstructions in both hemispheres.

Alternative reconstruction methods.
In this section we present plots based on the alternative reconstruction methods PaiCO 13 and BHM 14 .          Figure S30, the left panels correspond to the period shown in Figure 6 in the main text. Figure S44: Same as Figure 1 in the main text but for the simulation HadCM3 r2. Figure S45: Same as Figure 1 in the main text but for the simulation HadCM3 r3. Figure S46: Same as Figure 1 in the main text but for the simulation HadCM3 r4. Figure S47: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 2. Figure S48: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 3. Figure S49: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 4. Figure S50: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 5. Figure S51: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 6. Figure S52: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 7. Figure S53: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 8. Figure S54: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 9. Figure S55: Same as Figure 1 in the main text but for the simulation CESM1-CAM5 ensemble member 10. Figure S56: Same as Figure S10 but for the HadCM3 simulation only. Note that the early verification and RMSE skill are calculated for the reconstruction ensemble mean, so yield only one value per PPE. Figure S57: Same as Figure S11 but for the HadCM3 simulation only. Figure S58: Same as Figure S10 but for the HadCM3_r2 simulation only. Figure S66: Same as Figure S10 but for the CESM1 (member 2) simulation only. Figure S73: Same as Figure S11 but for the CESM1-CAM5 (member 5) simulation only.