Background & Summary

Reconstructions of past climate have become integral to climate assessments1. Such reconstructions employ a wide variety of mathematical techniques, ranging from purely statistical2 to data assimilation techniques that fuse observations and model output3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20. To establish their relative merits, these reconstructions must be benchmarked against reference datasets. This is routinely done on subsets of the instrumental period using cross-validation, but such efforts tend to underestimate the true spread of reconstructions in the pre-instrumental era21, indicative of overfitting.

While pre-instrumental intercomparisons of reconstruction methods have occasionally been carried out with real-world proxy observations22,23, such efforts are fundamentally limited by the lack of a true benchmark: pre-instrumental climates were not, by definition, observed directly, so these intercomparisons can only inform on convergence or divergence, but cannot provide any metric of their closeness to the true climate.

To sidestep this hurdle, pseudoproxy experiments (PPEs) have long been used as a laboratory to benchmark climate reconstruction methods. The heart of PPEs is to start from the output of long integrations of a global climate model and to apply mathematical transformations to this output to mimic the processes whereby paleoclimate proxies register these climate variations in space and time24. Because the original climate is specified, and sampled perfectly in space and time, the ability of a reconstruction to recover this climate is known. Moreover, as the generating process of these “pseudoproxies” is specified, it can be manipulated to yield insights into the sources of uncertainty contributing to reconstruction error. While simple PPE designs are informative, the more realistic the target climate and pseudoproxy generation process, the more relevant this benchmark becomes, so there is considerable potential in this avenue of research16,25,26,27,28,29,30,31.

Initial work used the simplistic assumption that paleotemperature proxies were a linear superposition of local temperature and Gaussian (white) noise, sampled uniformly in time32,33,34. Over time, more realistic pseudoproxy constructions were developed, involving other climate variables, more elaborate noise models, realistic spatiotemporal sampling, and noise levels approximating real proxy networks16,28,35,36. Recent work29 has leveraged more realistic proxy system modeling (PSM) frameworks37,38,39,40,41,42 to capture the essential physical, chemical, biological and geological processes that translate climate signals into the paleoclimate records that form the basis of climate reconstruction efforts (e.g. ref. 43). However, such models have yet to gain widespread use, so even recent efforts have sometimes employed a simplistic “temperature + noise” pseudoproxy design16,29,44.

The PAGES 2k Phase 2 global multi-proxy database (Fig. 1) has been widely used for studies of Common Era climate since its release43. It has played a central role in investigating multi-decadal and longer-term surface temperature variability45,46 and the spatiotemporal temperature patterns of various climatic epochs23,47 over the Common Era. In addition, it has served as the principal data source for the latest version 2.1 of the Last Millennium Reanalysis (LMR) products48, and has become a common network template for pseudoproxy studies31,49. However, these PPE-related studies used only a partial network and employed a simplistic “temperature + noise” design, and a systematic pseudoproxy emulation of the PAGES 2k network has yet to be produced. The PAGES 2k Phase 2 network has a number of known biases that present challenges to global annual mean temperature reconstruction50, and these biases need to be rigorously evaluated.

Fig. 1

The PAGES 2k Phase 2 network43.

Here, we do so by generating a pseudoproxy dataset that: (i) emulates the majority of the PAGES 2k Phase 2 network43, (ii) employs a more realistic data-generating mechanism with proxy system models (PSMs) and isotope-enabled climate model simulations, and (iii) explicitly separates sensor, archive, and observational effects. By combining various pseudoproxy designs, noise levels, and spatiotemporal sampling scenarios, we generate many digital avatars of the PAGES 2k network, supporting the evaluation of climate reconstruction methods in a wide variety of settings. To illustrate the use of this dataset, we show its application to a suite of climate field reconstructions10,48,51.

Methods

Reference climate

Our base climate utilizes the “iCESM1” last millennium simulation (iCESM-LM hereafter) generated by the isotope-enabled Community Earth System Model (iCESM)52. As an addition to the standard CESM, iCESM simulates the isotopic water fluxes transported between its five major isotope-enabled components, including the atmosphere model iCAM, the land model iCLM, the ocean model iPOP, the sea ice model iCICE, and the river runoff model iRTM. The atmosphere model iCAM tracks water tracers and isotopes in all phases through processes such as surface fluxes, boundary layer mixing, cloud physics, convection, and advection, and simulates precipitation δ18O variability with high fidelity53. The land model iCLM considers the water vapor flux and isotope fractionation in vegetated land surfaces54. Main processes include water isotope exchanges among soil, spaces under and above canopy, and leaves. The land and vegetation types and amount of canopy use a modern climatological mean with a constant seasonal cycle55. The ocean model iPOP transports water isotopes passively through resolved flow and parameterized turbulence, and the simulated seawater δ18O is validated under present-day climate conditions56. The sea ice model iCICE simulates the sinks of the isotopic water mass through melting and sublimation processes, and the sources through snowfall, sea ice growth, and vapor condensation52. All components coupled together provide a plausible simulation of the water isotope fields.

The iCESM-LM simulation applies the transient external forcings following the same setup as the CESM Last Millennium Ensemble (CESM-LME)57. The solar forcing comes from the total solar irradiance reconstruction by Vieira et al.58 patched with spectral variations from Schmidt et al.59. The last millennium volcanic forcing is based on the ice core-based index by Gao et al.60, while for the historical period, an eruption dataset by Ammann et al.61 is adopted. The greenhouse gas forcing, namely the concentrations of the main long-lasting greenhouse gases (i.e., CO2, CH4, N2O), is derived from Antarctic ice core analyses by Schmidt et al.59. For the land use and land cover boundary conditions, the reconstruction by Pongratz et al.62 and that by Hurtt et al.63 are merged to yield a consistent land use change. The orbital forcing is computed in the model based on Berger et al.64. The ozone forcing comes from the Whole Atmosphere Community Climate Model (WACCM), and the prescribed aerosol forcing is applied only over the historical period. For more details, please refer to CESM-LME57.

Proxy network

Figure 1 shows the PAGES 2k Phase 2 Network43. It consists of 692 records from 648 globally distributed sites, archived in trees, corals and sclerosponges, marine sediments, lake sediments, glacier ice, documentary sources, speleothems, boreholes, bivalves, and hybrid records. Each archive includes single or multiple observation types, among which tree ring width (TRW), maximum latewood density (MXD), coral and sclerosponge δ18O and Sr/Ca, lake varve thickness, and ice core δ18O are essential to Common Era temperature reconstructions (e.g., LMR), and PSMs for them have already been developed38. We therefore focus on these proxy types, and generate their emulations to form our pseudoproxy network (Fig. 2). For proxy sites located within the same model grid cell, the input climate signals are the same, while the generation mechanisms vary according to their proxy types.

Fig. 2

The spatiotemporal availability of the PAGES 2k pseudoproxy network with realistic and full temporal availability.

Proxy system modeling

Following the proxy system modeling framework37, we build our pseudoproxy network based on the iCESM output, leveraging the PSMs from the PRYSM toolbox38 and the CFR codebase65. The concept of PSMs encompasses both geophysical/chemical/biological process-based models, as well as statistical models; both can be either linear or nonlinear. In this study, both categories of PSMs have been adopted, depending on availability. A given PSM can only be applied if its inputs are within the scope of the available climate variables. In addition, the more complex the PSM, the more parameters it contains, and these parameters must generally be fitted to modern observations, lest they introduce additional sources of uncertainty.

As in all modeling endeavors, the choice of PSM is therefore a trade-off between “sins of omission” (excessive simplicity) and “sins of commission” (excessive complexity). The present dataset used the most complex PSMs where justified by scientific understanding and available data. When these conditions were not met, simpler PSMs were selected to avoid sins of commission or logistical hurdles (e.g. model fields available at too coarse a resolution).

Statistical PSMs, although highly idealized, are still based on scientific understanding of the geophysical/chemical/biological processes leading to the transduction of climate signals into proxy archives. As shown in Tardif et al.48, even linear, statistical PSMs for tree-ring width that include bivariate and seasonal dependence can yield vastly more realistic results than the traditional fitting to annual temperature.

Forward modeling of tree ring width (TRW)

Tree-ring width (TRW) is a major observation source to investigate the Common Era climate. In the PAGES 2k database, TRW represents the largest network with 354 records, most of which are located in the Northern Hemisphere. Depending on the location and species, TRW chronologies may record not only temperature variations but also moisture conditions, although the climatic signals can be modulated by biological memory effects49,50,66,67,68,69,70,71,72. The relationship between TRW and the environmental variables is thus complex, and TRW PSMs with various complexity levels have been developed since 2000, including TREERING200073, Vaganov-Shashkin (VS)74 and its simplified version VS-Lite75,76,77, MAIDEN (Modeling and Analysis In DENdroecology)78,79, and even the land model ORCHIDEE (ORganizing Carbon and Hydrology In Dynamic EcosystEms)80.

This work used VS-Lite developed by Tolwinski-Ward et al.75,76,77 to generate our pseudo-TRW network because of its overall skill79,81, simplicity, and capacity to be widely applied to the PAGES 2k sites. VS-Lite takes monthly temperature and precipitation signals as input, and emulates a threshold-dependent tree-ring monthly growth response to the climate with piece-wise linear growth response functions (Eq. (1)) determined by four parameters: the lower and upper thresholds for temperature and soil moisture, respectively:

$${g}_{V}(V)=\left\{\begin{array}{lcl}0 & {\rm{if}} & V\le {V}_{1}\\ (V-{V}_{1})/({V}_{2}-{V}_{1}) & {\rm{if}} & {V}_{1}\le V\le {V}_{2}\\ 1 & {\rm{if}} & {V}_{2}\le V\end{array}\right.,$$
(1)

where V represents T (temperature) or M (moisture). The overall growth response is then the minimum of these two response functions modulated by the insolation-based growth response (gE) (Eq. (2)), which is determined by the latitude of the site, and the final output TRW is the standardized series of the annual integration of the monthly growths, with an error term added on (Eq. (3)):

$$g={g}_{E}\ast \min \{{g}_{T},{g}_{M}\},$$
(2)
$${\rm{TRW}}={\rm{standardize}}\left({\int }_{{t}_{s}}^{{t}_{e}}gdt\right)+\zeta ,$$
(3)

where t represents time in months, ts and te denote the window for the integration, and ζ is a pink noise term (i.e. a stochastic process with a spectral density \(S(f)\propto {f}^{-\beta }\), with β a positive constant). Setting ts < 0 (a specific month in the previous year) can help mimic the biological memory effect or other unaccounted-for sources of low-frequency variability in TRW66,69,72,82,83,84,85. Following ref. 75, we set ts = −4 and te = 12, which represents an integration window from September of the previous year to December of the current year for the Northern Hemisphere, and from March of the current year to June of the next year for the Southern Hemisphere. The pink noise term is added to further mimic other non-climatic processes such as the detrending process of TRW records, following the formulation of colored noise proposed in ref. 86 with tuned spectral scaling slope87,88 β = 2 and SNR = 1 (signal-to-noise ratio defined in standard deviation24,26,89). Without this term, the scaling slope of the simulated TRWs will be significantly flatter than that observed in the real records, especially for the low-frequency band40. The need for this can be viewed in two ways. On the one hand, it suggests that tree-ring width records in the PAGES 2k database contain more low-frequency variability than expected from the climate signal and simple persistence structure present in VS-Lite alone, perhaps due to data processing (detrending and standardization) or unaccounted-for biological or ecological processes. Alternatively, this can be seen as a result of a “sin of omission” in VS-Lite, which incompletely represents the full range of biological processes important for the autocorrelation structure of temperature-sensitive tree-ring series.
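The structure of Eqs. (1) to (3) can be illustrated with a minimal Python sketch. It is not the VS-Lite implementation: the function names, the synthetic inputs, the threshold values, and the flat insolation response are all hypothetical; the monthly growth is summed over the calendar year rather than over the ts to te window; and the colored noise is produced by shaping white noise in Fourier space, a common technique that stands in here for the formulation of ref. 86.

```python
import numpy as np

def ramp(v, v1, v2):
    """Eq. (1): 0 below v1, linear between v1 and v2, 1 above v2."""
    return np.clip((np.asarray(v, dtype=float) - v1) / (v2 - v1), 0.0, 1.0)

def growth_index(temp, moist, g_e, T1, T2, M1, M2):
    """Standardized annual growth from monthly inputs shaped (n_years, 12).

    Simplification: months are summed over the calendar year instead of the
    t_s..t_e window, so no previous-year months enter the integral.
    """
    g = g_e * np.minimum(ramp(temp, T1, T2), ramp(moist, M1, M2))  # Eq. (2)
    annual = g.sum(axis=1)  # discrete stand-in for the integral in Eq. (3)
    return (annual - annual.mean()) / annual.std()

def colored_noise(n, beta, rng):
    """Unit-variance noise with S(f) ~ f^(-beta), shaped in Fourier space."""
    freqs = np.fft.rfftfreq(n)
    spec = np.fft.rfft(rng.standard_normal(n))
    spec[1:] *= freqs[1:] ** (-beta / 2.0)  # amplitude scales as f^(-beta/2)
    spec[0] = 0.0  # remove the mean
    x = np.fft.irfft(spec, n)
    return x / x.std()

# Hypothetical inputs and thresholds, for illustration only.
rng = np.random.default_rng(0)
n_years = 200
temp = 8.0 + 6.0 * rng.standard_normal((n_years, 12))
moist = 0.2 + 0.1 * rng.standard_normal((n_years, 12))
g_e = np.ones(12)  # flat insolation response (an assumption)

signal = growth_index(temp, moist, g_e, T1=4.0, T2=16.0, M1=0.05, M2=0.3)
snr = 1.0
trw = signal + colored_noise(n_years, beta=2.0, rng=rng) / snr  # Eq. (3)
```

Because the signal is standardized to unit variance, dividing unit-variance noise by the SNR gives the intended ratio of standard deviations.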

The four threshold parameters T1, T2, M1, M2 are crucial to the behavior of the model. We calibrate them against the CRUTS monthly temperature and precipitation observations90 version 4.05, using the Bayesian inference method elaborated in ref. 76. This essentially generates optimal posterior probability distributions for each threshold parameter by updating the prior distributions over Monte-Carlo iterations, and yields the estimate of each parameter following maximum likelihood estimation (MLE). With the calibrated parameters, iCESM-simulated monthly temperature and precipitation signals can be translated to the corresponding pseudo-TRW series. An example of the generated pseudo-TRW chronology and its comparison to the real-world counterpart in time and frequency domains is shown in Fig. 3.

Fig. 3

The dashboard for the tree ring width record “NAm_136” in dataset “ppwn_SNRinf_rta”. The unit “NA” stands for “not applicable” as the variable is a standardized index and thus unitless. “PSD” refers to power spectral density and is in the unit of power (squared unit of the proxy variable) per year.

Forward modeling of maximum latewood density (MXD)

Compared to TRW, maximum latewood density (MXD) more faithfully tracks growing season temperature history without distortions due to biological memory effects49,50,68,69,84,91,92,93,94,95. As there is not yet a published, tractable proxy system model for MXD, here we use a simple univariate linear model to emulate the behavior of MXD series:

$${\rm{MXD}}=a{T}_{{\rm{seasonal}}}+b,$$
(4)

where a represents a linear slope factor, Tseasonal the average temperature over the growing season, and b the intercept. The growing season is calibrated against the CRUTS dataset, version 4.05. Following ref. 48, the season that yields the optimal regression skill is picked from an expert-curated pool of growing season candidates, including the default calendar year option (Jan-Dec) and variants of warm seasons (i.e., Jun-Aug, Mar-Aug, Jun-Nov for Northern Hemisphere, and Dec-Feb, Dec-May, Sep-Feb for Southern Hemisphere) during which trees are expected to grow. An example of a generated pseudo-MXD chronology is shown in Fig. 4.
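A minimal sketch of this season selection, under stated assumptions, is given here. The function name `calibrate_mxd`, the synthetic data, and the use of R² as the measure of regression skill are illustrative choices, not the procedure of ref. 48. Only the Northern Hemisphere candidates are included, because seasons that cross the year boundary (e.g. Dec-Feb) would need extra index handling.

```python
import numpy as np

# Northern Hemisphere growing-season candidates (month numbers), as listed above.
NH_SEASONS = {
    "Jan-Dec": list(range(1, 13)),
    "Jun-Aug": [6, 7, 8],
    "Mar-Aug": [3, 4, 5, 6, 7, 8],
    "Jun-Nov": [6, 7, 8, 9, 10, 11],
}

def calibrate_mxd(monthly_temp, mxd, seasons=NH_SEASONS):
    """Fit Eq. (4) for each candidate season and keep the best one.

    monthly_temp has shape (n_years, 12) and mxd has shape (n_years,).
    Returns (season_name, a, b, r2); skill is measured by R^2 (an assumption).
    """
    best = None
    for name, months in seasons.items():
        t_seasonal = monthly_temp[:, [m - 1 for m in months]].mean(axis=1)
        a, b = np.polyfit(t_seasonal, mxd, 1)  # slope and intercept
        r2 = np.corrcoef(t_seasonal, mxd)[0, 1] ** 2
        if best is None or r2 > best[3]:
            best = (name, a, b, r2)
    return best

# Synthetic check: an MXD series built from Jun-Aug temperature.
rng = np.random.default_rng(1)
monthly_temp = 10.0 + 2.0 * rng.standard_normal((100, 12))
mxd_obs = 0.5 * monthly_temp[:, 5:8].mean(axis=1) + 0.01 * rng.standard_normal(100)
season, a, b, r2 = calibrate_mxd(monthly_temp, mxd_obs)
```

Once a, b and the season are fixed from the calibration data, the same seasonal average computed from the model temperature yields the pseudo-MXD series.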

Fig. 4

The dashboard for the maximum latewood density record “NAm_134” in dataset “ppwn_SNRinf_rta”. The unit “NA” stands for “not applicable” as the variable is a standardized index and thus unitless. “PSD” refers to power spectral density and is in the unit of power (squared unit of the proxy variable) per year.

Forward modeling of coral and sclerosponge δ18O

In contrast to trees, corals and sclerosponges mainly cover the tropical ocean regions and are thus of great importance to investigating tropical climate variability, including El Niño Southern Oscillation (ENSO)17,96,97,98,99,100,101. Following Brown et al.102, we use a bilinear model to simulate coral and sclerosponge δ18O based on the annual sea surface temperature (SST) and seawater δ18O (denoted as δ18Osw) signals:

$${\delta }^{18}{\rm{O}}=a{\rm{SST}}+b{\delta }^{18}{{\rm{O}}}_{{\rm{sw}}},$$
(5)

where a = −0.22 represents the linear slope factor, and b = 0.97002 the conversion factor from VSMOW to VPDB. Thompson et al.103 state that since the δ18Osw network is scarce, they use sea surface salinity (SSS) to estimate δ18Osw. However, a salinity-based PSM is reliant on the SSS/δ18Osw relationships that are known to be nonstationary and are based on extremely limited observational data104; the original formulation based on δ18Osw is thus preferable given that the iCESM output is leveraged. An example of the generated pseudo-coral/sclerosponge δ18O chronology is shown in Fig. 5.
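Eq. (5) is simple enough to write out directly. In the sketch below, the function name and the input values are hypothetical; only the coefficients a and b come from the text.

```python
def coral_d18o(sst, d18o_sw, a=-0.22, b=0.97002):
    """Eq. (5): bilinear coral/sclerosponge d18O from SST and seawater d18O."""
    return a * sst + b * d18o_sw

# Hypothetical annual values: a 1 degree warming at fixed seawater d18O.
cool = coral_d18o(sst=27.0, d18o_sw=0.5)
warm = coral_d18o(sst=28.0, d18o_sw=0.5)
change = warm - cool  # equals the slope a, i.e. a decrease of 0.22
```

With seawater δ18O held fixed, the modeled coral δ18O therefore falls by 0.22 for each degree of warming, and any change in δ18Osw is passed through scaled by b.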

Fig. 5

The dashboard for the coral δ18O record “Ocn_075” in dataset “ppwn_SNRinf_rta”. “PSD” refers to power spectral density and is in the unit of power (squared unit of the proxy variable) per year.

Forward modeling of coral and sclerosponge Sr/Ca

The skeletal trace element ratio Sr/Ca in corals and sclerosponges has a straightforward temperature interpretation. Following Corrège et al.105, we apply a simple univariate linear model based on the annual SST signal, but with fixed parameters:

$${\rm{Sr}}/{\rm{Ca}}=a{\rm{SST}}+b,$$
(6)

where a represents the linear slope factor with a Gaussian distribution with mean of −0.06 and standard deviation of 0.01, and b is the intercept with a mean value around 10.553 based on Table 1 in Corrège et al.105. In this study, we take a = −0.06 and b = 10.553. An example of the generated pseudo-coral/sclerosponge Sr/Ca chronology is shown in Fig. 6.
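As a worked illustration of Eq. (6) with the fixed parameters, the sketch below evaluates the model at a hypothetical SST of 28 and, separately, shows how the stated Gaussian spread of the slope could be sampled. The function name, the SST value, and the sampling step are assumptions; the dataset itself uses the fixed a and b.

```python
import random

def sr_ca(sst, a=-0.06, b=10.553):
    """Eq. (6): linear Sr/Ca response to annual SST."""
    return a * sst + b

fixed = sr_ca(28.0)  # -0.06 * 28 + 10.553 = 8.873

# Optional: propagate the slope uncertainty (mean -0.06, std 0.01).
random.seed(0)
sampled = [sr_ca(28.0, a=random.gauss(-0.06, 0.01)) for _ in range(1000)]
spread = max(sampled) - min(sampled)
```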

Table 1 The seasonality of each lake varve thickness site.
Fig. 6

The dashboard for the coral Sr/Ca record “Ocn_067” in dataset “ppwn_SNRinf_rta”. “PSD” refers to power spectral density and is in the unit of power (squared unit of the proxy variable) per year.

Forward modeling of ice core δ18O

Glacier ice cores mainly cover the polar and mountain regions, where trees cannot grow. They are usually well-preserved and span a long time interval with annual time resolution, and are important to investigate long-term climate change. For ice core δ18O, we apply the corresponding module in the PRYSM toolbox38, which is based on the work of Johnsen106, Whillans and Grootes107, Cuffey and Steig108, Johnsen et al.109, and Küttel et al.110.

Its sensor model takes precipitation-weighted δ18O to emulate the δ18O input to ice:

$${\delta }^{18}{O}_{{\rm{weighted}}}=\sum \left(p{\delta }^{18}{O}_{p}\right)/\sum p,$$
(7)

where p represents the monthly precipitation amount, and δ18Op the precipitation δ18O. The precipitation-weighted δ18O is then corrected based on the elevation difference between the proxy site and its nearest model grid point with a rate of −0.25‰ per 100 meters. Next, its archive model emulates the compaction and diffusion processes of isotopes in ice via a convolution with a Gaussian kernel G:

$${\delta }^{18}{O}_{{\rm{ice}}}=G\ast {\delta }^{18}{O}_{{\rm{weighted}}}.$$
(8)

An example of the generated pseudo-ice core δ18O chronology is shown in Fig. 7.
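The three steps above (precipitation weighting, elevation correction, Gaussian smoothing) can be sketched as follows. This is not the PRYSM code: the function names and synthetic inputs are hypothetical, the sign convention of the elevation correction (a site above the model grid point is made more depleted) is an assumption, and the kernel has a fixed width, whereas a physically based archive model would let the diffusion length vary with depth.

```python
import numpy as np

def precip_weighted_d18o(d18o_p, precip):
    """Eq. (7) for each year; inputs have shape (n_years, 12)."""
    return (precip * d18o_p).sum(axis=1) / precip.sum(axis=1)

def elevation_correction(d18o, site_elev_m, model_elev_m, rate_per_100m=-0.25):
    """Shift d18O by the lapse rate times the site-minus-model height difference."""
    return d18o + rate_per_100m * (site_elev_m - model_elev_m) / 100.0

def gaussian_smooth(x, sigma_years):
    """Eq. (8) with a fixed-width, normalized Gaussian kernel (a simplification)."""
    half = int(np.ceil(4 * sigma_years))
    k = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (k / sigma_years) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(x, half, mode="edge")  # avoid shrinking the series
    return np.convolve(padded, kernel, mode="valid")

# Hypothetical monthly inputs for 300 years.
rng = np.random.default_rng(2)
d18o_p = -30.0 + 3.0 * rng.standard_normal((300, 12))
precip = rng.gamma(2.0, 10.0, size=(300, 12))

weighted = precip_weighted_d18o(d18o_p, precip)
corrected = elevation_correction(weighted, site_elev_m=2800.0, model_elev_m=2400.0)
d18o_ice = gaussian_smooth(corrected, sigma_years=1.5)
```

The smoothing step is what attenuates high-frequency variance in the simulated ice core series.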

Fig. 7

The dashboard for the ice core δ18O record “Arc_029” in dataset “ppwn_SNRinf_rta”. “PSD” refers to power spectral density and is in the unit of power (squared unit of the proxy variable) per year.

Forward modeling of lake varve thickness

Varves, or annually laminated sediments, can be valuable temperature proxies for the Common Era due to their high resolution and because they can be found in areas where other annually-resolved archives are absent. Varve thickness and mass accumulation rate are directly related to sediment input and deposition, which in turn can be strongly related to climate in some lakes. However, many phenomena can affect varve thickness, and the relationship between climatic and environmental drivers and varve thickness is often complex and typically varies from lake to lake111. Temperature-driven varve thickness records are most common in the Arctic, where summer temperature can have strong and direct impacts on sediment transportation by melting winter snowpack and glaciers and extending the ice-free season.

The PAGES 2k Phase 2 database includes eight sites with varve thickness records interpreted to respond to temperature. Mechanistically simulating varve thickness is complex, highly site-specific, and not practical for most PPE studies. Nevertheless, most varve thickness records share characteristics that are readily and simply simulated. There are two key processes that we simulate. First, because varve thickness measures a depositional process, the distribution of varve thickness is zero-bounded and right-skewed, and is appropriately modeled with a Poisson or Gamma distribution112. Second, varve thickness records typically include substantial year-to-year memory. Unlike most sedimentary records, this is not due to bioturbation or post-depositional mixing (as this would destroy the laminations). However, glacial and sedimentary processes in the watershed and in the lake can be prone to significant memory, which affects the spectral characteristics of varve thickness records.

Here, we apply a simple model as below:

$${\rm{thickness}}=\Gamma \left({T}_{{\rm{seasonal}}}\right)+\Gamma (b),$$
(9)

where Γ(·) represents a mapping from the original distribution to a Gamma distribution, Tseasonal is the seasonally-averaged temperature calculated based on the seasonality metadata of each site (Table 1)43, and b is a realization of fractional Brownian motion with Hurst index H = 0.75 and SNR = 1, a combination we find fits well with the real records. An example of the generated pseudo-lake varve thickness chronology is shown in Fig. 8.
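One way to read Eq. (9) is sketched below, under several assumptions. The Gamma mapping is implemented as a rank-preserving substitution (the k-th smallest input receives the k-th smallest of n Gamma draws), fractional Brownian motion is built by cumulating fractional Gaussian noise generated through a Cholesky factorization, and the Gamma shape and scale are hypothetical placeholders. None of these choices is claimed to match the code used for the dataset.

```python
import numpy as np

def to_gamma(x, shape, scale, rng):
    """Rank-preserving map of x onto Gamma-distributed values."""
    draws = np.sort(rng.gamma(shape, scale, size=len(x)))
    ranks = np.argsort(np.argsort(x))  # rank of each element of x
    return draws[ranks]

def fbm(n, hurst, rng):
    """Fractional Brownian motion as the cumulative sum of fractional Gaussian noise."""
    k = np.arange(n)
    acov = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                  - 2 * np.abs(k) ** (2 * hurst)
                  + np.abs(k - 1) ** (2 * hurst))
    cov = acov[np.abs(k[:, None] - k[None, :])]  # Toeplitz covariance matrix
    fgn = np.linalg.cholesky(cov) @ rng.standard_normal(n)
    return np.cumsum(fgn)

# Hypothetical seasonal temperature series and Gamma parameters.
rng = np.random.default_rng(3)
n = 300
t_seasonal = 5.0 + rng.standard_normal(n)

signal = to_gamma(t_seasonal, shape=2.0, scale=1.0, rng=rng)
noise = to_gamma(fbm(n, hurst=0.75, rng=rng), shape=2.0, scale=1.0, rng=rng)
snr = 1.0
noise *= signal.std() / (snr * noise.std())  # enforce the SNR in standard deviation
thickness = signal + noise  # Eq. (9)
```

Both terms are strictly positive, so the simulated thickness is zero-bounded and right-skewed, while the fractional Brownian motion term supplies the year-to-year memory.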

Fig. 8

The dashboard for the lake varve thickness record “Arc_025” in dataset “ppwn_SNRinf_rta”. “PSD” refers to power spectral density and is in the unit of power (squared unit of the proxy variable) per year.

Pseudoproxy production workflow

Figure 9 shows the general procedure for pseudoproxy generation. The starting point is the isotope-enabled Community Earth System Model (iCESM) last millennium plus historical simulation52 (Section Reference Climate), chosen so that the isotope-related proxies can be simulated with minimal assumptions.

Fig. 9

Flow chart of the general procedure for pseudoproxy generation.

Environmental variables are taken from the iCESM output, including surface air temperature, precipitation rate, sea level pressure, precipitation δ18O, seawater δ18O, and sea surface temperature (SST). Proxy metadata are taken from the PAGES 2k dataset (Section Proxy network), including the location information, time axis, archive type, sensor, species, seasonality, etc.

These two sources of information (environmental variables and proxy metadata) are then fed to the PSMs for tree-ring width (TRW), maximum latewood density (MXD), coral/sclerosponge δ18O, coral/sclerosponge Sr/Ca, lake varve thickness, and ice core δ18O, which translate the climatic signals and generate the raw output in proxy space (Section Proxy system modeling).

The raw output is then treated as signal, and white noise is added with a set of signal-to-noise ratio (SNR, defined in standard deviation24,26,89) options (∞, 10, 2, 1, 0.5, 0.25)22,28, where SNR = ∞ is a noise-free case, SNR = 1 means that the signal and noise share an equal standard deviation, etc. We generate datasets with two types of temporal sampling: (1) full annual sampling over 850–2005 CE, and (2) the realistic temporal availability of each record (Fig. 2).
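The SNR convention used here (ratio of standard deviations) can be made concrete with a short sketch. The function name and the synthetic signal are hypothetical, and rescaling the noise sample so that the ratio holds exactly within the sample is an assumption rather than a documented step of the workflow.

```python
import numpy as np

def add_white_noise(signal, snr, rng):
    """Add Gaussian white noise so that std(signal) / std(noise) equals snr."""
    if np.isinf(snr):
        return signal.copy()  # SNR = infinity is the noise-free case
    noise = rng.standard_normal(signal.shape)
    noise *= signal.std() / (snr * noise.std())
    return signal + noise

rng = np.random.default_rng(4)
raw = np.sin(np.linspace(0.0, 20.0, 1156))  # stand-in for 850-2005 CE PSM output
versions = {snr: add_white_noise(raw, snr, rng)
            for snr in (np.inf, 10, 2, 1, 0.5, 0.25)}
```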

Because iCESM is a biased representation of reality, the pseudoproxies generated by this workflow inherit some of the same biases in low-order statistics like mean and variance. To facilitate comparison with real-world records, we apply a bias correction and variance matching against the real records, using the mean and variance of the real proxy measurements over the timespan they share with the pseudoproxy counterpart. Note that this step shifts and scales the time series, but has no impact on the shape of the statistical distribution (e.g., Gaussian or Gamma), nor on the spectral characteristics (i.e., scaling slopes and peaks) of the simulated proxies.
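A minimal sketch of such a shift-and-scale step is given here. The function name, the boolean mask marking the common timespan, and the synthetic series are hypothetical; the point is only that the transform is linear with a positive scale.

```python
import numpy as np

def match_mean_variance(pseudo, real_common, common):
    """Shift and scale pseudo so its mean/std over `common` match the real record.

    pseudo: full pseudoproxy series; common: boolean mask of years shared with
    the real record; real_common: real proxy values over those shared years.
    """
    mu_p, sd_p = pseudo[common].mean(), pseudo[common].std()
    return (pseudo - mu_p) / sd_p * real_common.std() + real_common.mean()

rng = np.random.default_rng(5)
pseudo = rng.gamma(2.0, 1.0, size=500)          # hypothetical pseudoproxy
common = np.zeros(500, dtype=bool)
common[300:] = True                              # real record covers the last 200 years
real_common = 12.0 + 3.0 * rng.standard_normal(200)

adjusted = match_mean_variance(pseudo, real_common, common)
```

Since every value is transformed by the same positive scale and offset, the ordering of the values and the normalized spectrum are unchanged.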

As a benchmark, we also generate pseudoproxies following the traditional temperature-plus-noise method: white noise is added to the temperature signal at the grid cell nearest each proxy site, using the same set of SNR options and the same two types of temporal sampling.

This workflow results in multiple pseudoproxy emulations of the PAGES 2k network, differing in:

  • Design: either the “temperature-plus-white-noise” model (tpwn) or the pseudoproxy models described above, with added white noise (ppwn).

  • Noise level: quantified by an SNR of ∞ (pure signal), 10, 2, 1, 0.5 or 0.25.

  • Temporal sampling: either full annual sampling over 850–2005 CE (fta), or the realistic temporal availability of Fig. 2 (rta).

Data Records

Table 2 lists the pseudoproxy datasets generated in this study, which we call “pseudoPAGES2k”. The dataset IDs indicate the properties of each dataset. For instance, “tpwn_SNR10_fta” means that the pseudoproxies are generated with the temperature-plus-white-noise method with SNR equal to 10 and full temporal availability, while “ppwn_SNR0.5_rta” means that the pseudoproxies are generated via the PSM hierarchy with added white noise, SNR equal to 0.5, and the realistic temporal availability. The datasets are archived at Zenodo113 (https://doi.org/10.5281/zenodo.7652533).

Table 2 A list of the “pseudoPAGES2k” pseudoproxy datasets.
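The naming convention can be decoded mechanically. The helper below is a hypothetical convenience, not part of the released code; it assumes every ID has the three underscore-separated fields shown in the examples, with “inf” denoting the noise-free case as in “ppwn_SNRinf_rta”.

```python
DESIGNS = {"tpwn": "temperature plus white noise",
           "ppwn": "PSM output plus white noise"}
SAMPLING = {"fta": "full temporal availability",
            "rta": "realistic temporal availability"}

def parse_dataset_id(dataset_id):
    """Split an ID such as 'ppwn_SNR0.5_rta' into its three properties."""
    design, snr_tag, sampling = dataset_id.split("_")
    if not snr_tag.startswith("SNR"):
        raise ValueError(f"unexpected SNR field: {snr_tag}")
    return {
        "design": DESIGNS[design],
        "snr": float(snr_tag[len("SNR"):]),  # float("inf") handles the noise-free case
        "sampling": SAMPLING[sampling],
    }

info = parse_dataset_id("ppwn_SNR0.5_rta")
```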

The “iCESM1” last millennium simulation (iCESM-LM) used in this study can be accessed at a data server hosted by Robert Tardif at the University of Washington (https://atmos.washington.edu/~rtardif/LMR/prior).

The PAGES 2k Phase 2 Network used in this study can be accessed at the National Centers for Environmental Information’s World Data Service for Paleoclimatology (https://www.ncei.noaa.gov/access/paleo-search/study/21171).

Technical Validation

To verify whether the generation procedure (Fig. 9) yields a realistic pseudoproxy emulation of the PAGES 2k database, we validate the generated pseudoproxies against the original records’ statistics, in both time and frequency domains. We emphasize that this is a validation specifically of the realism (and therefore utility) of the pseudoproxy generation procedure, rather than an evaluation of any single PSM or GCM, which has been done elsewhere. Instead, we aim to show that coupling these models, imperfect though they may be, can produce pseudoproxies that emulate key characteristics of the target series. In the time domain, a good pseudoproxy emulation should reproduce the probability distribution shape of the real proxies; this may be assessed via split violin plots. In the frequency domain, a good emulation should reproduce the power spectral density (PSD) of the target series, indicating the energy partitioning per frequency interval, particularly the periodic and continuum114 characteristics of the series.

Figures 3 to 8 show examples for specific records, one site per proxy type. Since the real records may be unevenly-spaced in time, we leverage the Weighted Wavelet Z-transform (WWZ) method implemented in Pyleoclim115, to obtain the PSD curves. As illustrated by the PSD plots and the probability distribution plots, the pseudoproxies show an overall good agreement with the real records, including, for instance, the steep attenuation of high-frequencies in the ice core δ18O record shown in Fig. 7, and the long tail distribution of the varve thickness record shown in Fig. 8. To validate thoroughly the spectral characteristics, Fig. 10 shows the spectral analysis by proxy types. It can be seen that overall good agreement is achieved between the pseudoproxies and their real counterparts, indicating a realistic emulation from the spectral perspective. This should result in more realistic assessments of the spectral characteristics of reconstruction skill. We emphasize that the procedure of bias correction and variance matching has no bearing on these aspects of the validation, as it simply adds a scale and offset to the pseudoproxies, without modifying their probability distribution shape or spectral characteristics.
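To make the notion of a spectral scaling slope concrete, the sketch below estimates β in S(f) ∝ f^(−β) from a log-log fit to a plain periodogram. This is a deliberately simpler technique than the WWZ method used for the figures: it only applies to evenly spaced series, and the function names and test series are hypothetical.

```python
import numpy as np

def periodogram(x, dt=1.0):
    """Raw periodogram of an evenly spaced series (zero frequency dropped)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    freqs = np.fft.rfftfreq(len(x), d=dt)[1:]
    psd = np.abs(np.fft.rfft(x))[1:] ** 2 * dt / len(x)
    return freqs, psd

def scaling_slope(freqs, psd):
    """Estimate beta from a straight-line fit of log(PSD) against log(f)."""
    slope, _ = np.polyfit(np.log(freqs), np.log(psd), 1)
    return -slope

rng = np.random.default_rng(6)
white = rng.standard_normal(2048)
random_walk = np.cumsum(rng.standard_normal(2048))

beta_white = scaling_slope(*periodogram(white))        # expected near 0
beta_walk = scaling_slope(*periodogram(random_walk))   # expected much steeper
```

Comparing such slopes between a pseudoproxy and its real counterpart is one simple numerical summary of the agreement that the PSD panels show graphically.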

Fig. 10

Spectral analysis of the pseudoproxy records in dataset “ppwn_SNRinf_rta” by proxy type. The gray curves denote the power spectral density (PSD, in the unit of power per year, i.e., squared unit of the proxy variable per year) of the real records, while the colored curves denote that of the pseudoproxy records.

Usage Notes

To illustrate the many potential uses of this dataset, we provide Jupyter notebooks (Code Availability) for the basic analysis and visualization of the dataset, as well as applications to climate field reconstruction. Specifically, we provide Python-based examples of:

  • Loading and visualizing the pseudoPAGES2k dataset.

  • Filtering the pseudoPAGES2k dataset according to various criteria.

  • Generating dashboards like Figs. 3 to 8.

  • Applying paleoclimate data assimilation (in the vein of the Last Millennium Reanalysis48) to the pseudoPAGES2k dataset, and using it to benchmark climate field reconstruction methods.

Other potential uses of this dataset and its production workflow include optimal sampling design116. A natural extension would be to add age uncertainties to these pseudoproxies, as done in ref. 31.