A statistics-based reconstruction of high-resolution global terrestrial climate for the last 800,000 years

Curated global climate data have been generated from climate model outputs for the last 120,000 years, whereas reconstructions going back even further have been lacking due to the high computational cost of climate simulations. Here, we present a statistically derived global terrestrial climate dataset for every 1,000 years of the last 800,000 years. It is based on a set of linear regressions between 72 existing HadCM3 climate simulations of the last 120,000 years and external forcings consisting of CO2, orbital parameters, and land type. The estimated climatologies were interpolated to 0.5° resolution and bias-corrected using present-day climate. The data compare well with the original HadCM3 simulations and with long-term proxy records. Our dataset includes monthly temperature, precipitation, cloud cover, and 17 bioclimatic variables. In addition, we derived net primary productivity and global biome distributions using the BIOME4 vegetation model. The data are a valuable resource for research areas such as archaeology and ecology, enabling studies of the long-term effects of glacial-interglacial climate cycles beyond the last 120,000 years.


Background & Summary
Studying the ecology and environment throughout past climatic changes often involves environmental reconstructions that are either based on paleoclimate proxies or on paleoclimate simulations. Unfortunately, even by today's standards, simulating the climate over periods of thousands or hundreds of thousands of years in a continuous way can still be a costly and time-consuming endeavour. Present or future climate simulations are based on comprehensive Global Climate Models (GCMs) that resolve processes at high temporal and spatial resolution, such as those used in the fifth IPCC Assessment Report 1 . Climate model reconstructions for longer, continuous periods back in time are, therefore, challenging. They have to span a much longer period and are, thus, computationally too expensive. Instead, GCMs provide snapshots for a specific time or short transients in the order of a few thousand years. For longer, transient simulations of tens or hundreds of thousands of years, we rely on simulations from Earth System Models of Intermediate Complexity (EMICs) 2,3 but they come at the cost of lower spatial resolution and a simplified representation of the climate system 4 .
Although there are some high-resolution paleoclimate data sets readily available for download, for example, WorldClim 5 , PaleoClim 6 , or ecoClimate 7 , their temporal coverage is limited to a few snapshots of key periods in the past, for example, the Mid-Holocene (6,000 years before present (BP)), the Last Glacial Maximum (21,000 years BP), or the Last Interglacial Period (130,000 years BP). An exception is PaleoView 8 , which covers the transient period of the last deglaciation, but this only goes back 21,000 years. Longer, continuous climate data sets of the past, based on HadCM3 9 snapshots, have become available more recently, for example, a Northern Hemisphere data set for the last 60,000 years 10 or a bias-corrected, high-resolution terrestrial climate data set of the last 120,000 years 11 .
Here, we used a linear regression model to extend existing HadCM3 climate simulations of the last 120,000 years to create climate reconstructions of the last 800,000 years. We then applied a bias correction 12 of the model output using present-day gridded observational data (CRU TS v. 4.04 13 ) to downscale the climate output to a final horizontal resolution of 0.5° 11,13 . This new data set is complementary to the aforementioned high-resolution terrestrial climate data set of the last 120,000 years 11 (which should be preferred for studies of the last glacial cycle, as it is based directly on the GCM output), and it extends the record for readers to explore the climate history of earlier periods.
In this paper, we present annual and monthly mean climatologies for the last 800,000 years in 1,000 year time steps ( Table 1). The data set includes air temperature, precipitation, total cloud cover, 17 bioclimatic variables 14 , as well as biomes and annual and monthly net primary productivity, the latter based on BIOME4 simulations 15 , run on the debiased climatologies. We validated the long-term climate change signal using time series of various proxy records.

Methods
Our climate reconstructions are based on a set of linear regression models for each of the HadCM3 model grid boxes (N = nlon × nlat = 96 × 73 = 7008). Each linear model predicts a climate variable, which can be either temperature, precipitation, or total cloud cover, i.e., the dependent variable. The independent variables, i.e., the forcing terms, of the model are three orbital parameters, atmospheric CO 2 , and a surface type mask (land, ocean, or land ice), five variables in total.
Each linear model uses 72 data points given by the HadCM3 snapshots throughout the past 120 thousand years (ka) 16,17 . These snapshots cover both the Last Glacial Maximum, one of the coldest glacial stages, and the Last Interglacial, one of the warmest interglacial stages during the Middle and Late Pleistocene. By applying available long-term forcing to the solutions of the linear models, we reconstructed the climate for periods before 120 ka. The forcing consists of CO 2 18 , interpolated to 1 ka intervals for the last 800 ka, orbital parameters, taken from a numerical solution of the Earth's orbit around the sun 19 , and surface type masks based on numerical ice-sheet model output 2 and a global sea-level record 20 . At this stage, the reconstructed climate of the last 800,000 years has the same coarse spatial resolution as the underlying HadCM3 snapshots. In a last step, we applied a bias correction (including spatial downscaling) for the terrestrial climate to derive a spatially explicit data set that covers the last 800 ka at 0.5° resolution (Fig. 1).

Fig. 1 Flowchart showing how the different data sets have been used as input for the different stages of the paleoclimate data generation: A linear regression combines 72 low-resolution (LR) HadCM3 snapshot simulations with the external forcings, i.e., CO 2 , orbital parameters, and surface type masks (ocean, land, land ice), which provides the basis of the long-term climate reconstructions using long-term forcings and surface type masks. The final bias correction procedure yields the high-resolution (HR), bias-corrected (BC) climate data set for the last 800 ka.

Fig. 2 (a) Time series of the four external parameters: CO 2 and orbital parameters for the last 800 ka and (b) the associated parameter space as scatter plot matrix (blue dots). The continuous CO 2 record is from the EPICA Dome C ice core in Antarctica 18 .
The orbital parameters are numerical solutions for the Earth's orbit and rotation in terms of eccentricity, precession, and obliquity 19 . The atmospheric component of HadCM3 has a horizontal resolution of 3.75° × 2.5°, 19 vertical levels, and a time step of 30 minutes. The ocean and sea-ice component of HadCM3 has a horizontal resolution of 1.25° × 1.25° and 20 vertical levels. HadCM3 simulations were run with a prescribed ice-sheet and continental geometry. We used output from the atmospheric component of the 72 available HadCM3 simulations covering the last 120,000 years in 2,000-year intervals from 120,000 to 24,000 years BP and in 1,000-year intervals from 22,000 years BP to present-day 16,17 .

Surface type mask: Ice-sheet extents, sea level and lakes. As ice-sheet extents for the period outside the HadCM3 snapshots, we used model outputs from CLIMBER-2/SICOPOLIS simulations 2 , for which Northern Hemisphere ice-sheet extents and heights are available for the last 800 ka in 1 ka intervals. For the more recent period from 122-0 ka, we used the ice-sheet configurations from the ICE-6G data set 21 (http://www.atmosp.physics.utoronto.ca/~peltier/data.php). Changes in the coastlines affecting the land-sea mask were derived from a global sea-level record 20 . We overlaid those changes on top of present-day coastlines, taken from the ETOPO1 data set 22 (https://ngdc.noaa.gov/mgg/global/global.html), while we preserved inland lakes, which were taken from the Global Lakes and Wetlands Database 23 (https://www.worldwildlife.org/pages/global-lakes-and-wetlands-database).
Training and test data. We divided the HadCM3 snapshots into a training (80%) and a test data set (20%).
The training data set was used to fit the linear model, while the test data was used for a comparison to the reconstruction. But instead of randomly dividing the snapshots into training/test data, we followed an approach aimed at preserving as much variance as possible in the training data, i.e., maximising the variance of the predictors. This is best illustrated by the phase plots of the parameters, i.e., the predictors (Fig. 2). The training data set covers the edges of each phase plane and thus maximises the phase space covered by the linear regression model. This choice of training data ensured that the linear regression model interpolated within the phase space and did not need to extrapolate for the test data.
The procedure was as follows. We calculated the covariance matrix of the full parameter set (n = 72), C full . Then, we randomly created a sample training data set (k = 58) for which we computed the covariance matrix C sample . If the eigenvalues of C sample were larger than the eigenvalues of C full , then the training sample data set contained at least as much variance as the full data set and this sample training data set was marked as a candidate for the final training set. After several iterations (N = 10,000), we summed up how many times each time slice had appeared within a candidate training set. We then ranked all time slices according to this number. In the final step, we picked the 80% top-ranked time slices as training data.

Table 2. List of data sets that can be found in the Open Science Framework repository 28 under the project's data directory.
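The selection procedure above can be sketched as follows. This is a minimal illustration with synthetic forcings, not the authors' actual code; the predictor matrix, random seed, and iteration count are placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

def rank_training_slices(X, k, n_iter=10_000):
    """Rank the n time slices by how often they appear in a random k-subset
    whose covariance eigenvalues all exceed those of the full data set."""
    n = X.shape[0]
    eig_full = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))
    counts = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        idx = rng.choice(n, size=k, replace=False)
        eig_sub = np.sort(np.linalg.eigvalsh(np.cov(X[idx], rowvar=False)))
        if np.all(eig_sub >= eig_full):  # candidate: keeps at least as much variance
            counts[idx] += 1
    return np.argsort(counts)[::-1]  # most frequently selected slices first

# 72 snapshots x 4 predictors (synthetic stand-in for the normalised forcings)
X = rng.standard_normal((72, 4))
order = rank_training_slices(X, k=58, n_iter=2_000)
train = order[:58]  # top-ranked 80% as training set
```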
Note that the division into training and test data was made only for the validation of the linear model. For the actual climate reconstructions with the linear regression model, we used all 72 snapshots to make full use of the complete set of available data.

The linear regression model. For each HadCM3 grid box, we fitted a linear regression model to a local series of each climatic variable of interest (temperature, precipitation, or cloud cover) with the following independent variables or forcings: atmospheric CO 2 concentrations (as a major greenhouse gas), three variables reflecting the orbital forcing 19 , and the surface type, which is either ocean, land, or land ice. The orbital parameters are obliquity ε and two combinations of eccentricity e and precession ω: e · sin ω, henceforth referred to as precession index I, and e · cos ω (precession index II); they are a generally accepted set of orbital forcings 24,25 . We chose temperature T, precipitation P, and total cloud cover C as dependent variables. The independent variables are given by the normalised forcings.
More formally, let Y(x, t) be a time series of a climate variable in a specific grid box x at time t. Our linear model should explain variations ΔY(x, t) of Y(x, t) around its temporal mean Ȳ(x):

ΔY(x, t) = Y(x, t) − Ȳ(x). (1)

To make the linear model well-conditioned, all independent variables were normalised: the mean was subtracted and the result was then divided by the standard deviation.
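The normalisation of the predictors can be sketched in a few lines (a generic standardisation; the CO2 values below are made up for illustration):

```python
import numpy as np

def normalise(forcing):
    """Standardise a forcing series: subtract the mean, divide by the std."""
    return (forcing - forcing.mean()) / forcing.std()

# Made-up CO2 concentrations (ppm) across a few snapshots
co2 = np.array([190.0, 210.0, 230.0, 260.0, 280.0])
z = normalise(co2)  # zero mean, unit standard deviation
```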
Precipitation and total cloud cover are bounded variables, which can lead to linear model predictions outside of physically meaningful ranges. A common procedure to prevent such out-of-bounds predictions is to transform the data beforehand. To prevent the linear model from predicting negative precipitation values, we applied a logarithmic transformation to precipitation, which maps values from (0, +∞) to (−∞, +∞). Thus, in the case of precipitation, the linear regression coefficients predict the response in terms of anomalies in the exponent. For total cloud cover, expressed as a fraction between 0 and 1, we chose a logit transformation, which maps values from (0, 1) to (−∞, +∞). Note that the minimum values for precipitation and total cloud cover, as simulated by HadCM3, are never exactly zero (or smaller), except at the poles, 90° N/S, which were excluded for that reason; therefore, the transformations always yield finite values. The decomposition of temperature T, precipitation P, and total cloud cover C, i.e., the ΔY(x, t) on the left-hand side of Eq. (1), is:

ΔT(x, t) = T(x, t) − T̄(x), (2)
ΔP(x, t) = log P(x, t) − log P‾(x), (3)
ΔC(x, t) = logit C(x, t) − logit C‾(x), (4)

where the overbar denotes the temporal mean of the (transformed) variable over the 72 snapshots. The linear regression model for each (transformed) anomaly is:

ΔY(x, t) = β1(x) CO2(t) + β2(x) ε(t) + β3(x) I(t) + β4(x) II(t) + β5(x) M(x, t). (5)

Here, β 1 to β 5 are the regression coefficients for the respective predictors (see Fig. 3 for maps of β coefficients). Surface type changes are captured by the categorical variable M(x, t) ∈ [ocean, land, land ice]. For example, around coastlines land grid boxes can turn into ocean grid boxes when sea level is high. Similarly, expanding ice sheets turn land grid boxes into ice-covered grid boxes, and the climate variable Y(x, t) may respond to different surface types in different ways. The categorical variables were encoded using Treatment coding. The first level, ocean, was chosen as the reference level and is by definition zero: β 5 (x) = 0. For the other two levels, land and land ice, a different β 5 (x) was assigned for each; in effect, a different intercept for ΔY(x, t).
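A single-pixel version of such a regression, including Treatment coding with ocean as the reference level, can be sketched with the statsmodels formula interface (which the paper's own code is based on). The synthetic data and coefficient values below are purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 72  # number of snapshots

# Synthetic normalised forcings and a categorical surface type for one grid box
df = pd.DataFrame({
    "co2": rng.standard_normal(n),
    "obliquity": rng.standard_normal(n),
    "prec_i": rng.standard_normal(n),   # e * sin(omega)
    "prec_ii": rng.standard_normal(n),  # e * cos(omega)
    "surface": rng.choice(["ocean", "land", "land ice"], size=n),
})
# Made-up temperature anomaly depending mostly on CO2 and obliquity
df["dT"] = 2.0 * df["co2"] + 0.5 * df["obliquity"] + 0.1 * rng.standard_normal(n)

# Treatment coding with "ocean" as the reference level (its beta5 is zero)
model = smf.ols(
    "dT ~ co2 + obliquity + prec_i + prec_ii + C(surface, Treatment(reference='ocean'))",
    data=df,
).fit()
print(model.params)
```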
We solved the linear model for each transformed variable, applied the extended forcing to generate the 800 ka climate reconstructions, and transformed the resulting data back to their original range according to Eqs. (2-4). At this stage, we have an extended HadCM3 model output of annual and monthly means for temperature, precipitation, and total cloud cover, still at the original HadCM3 resolution, for the last 800 ka.

Spatial downscaling to 0.5° and bias correction. The CRU TS climate data set. For the bias correction of the extended HadCM3 model output, as predicted by the previously described linear model, we used variables from the CRU TS (Climatic Research Unit Timeseries) data set (v. 4.04) 13 . Before the bias correction step, the extended model output was bi-linearly interpolated from its spatial resolution of 3.75° × 2.5° to the CRU TS resolution of 0.5° × 0.5°. CRU TS v. 4.04 contains monthly time series fields of precipitation, daily maximum and minimum temperatures, cloud cover, and other variables covering all land areas (except Antarctica) for 1901 to 2019. As reference period for the bias correction we chose 1961-1990. We applied the additive "delta" method, which is the most effective bias correction with respect to paleoclimate reconstructions 12 , to all climate variables predicted by the linear model (subscript LM) to create the final, bias-corrected climate data set Ŷ(x, t):

Ŷ(x, t) = Y_LM(x, t) + [Y_CRU TS(x) − Y_LM(x, t = 0)], (6)

where Y_CRU TS(x) is the observed 1961-1990 climatology and Y_LM(x, t = 0) is the linear-model reconstruction for the present day.
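A one-cell sketch of the additive delta method under these definitions (the numbers below are made up for illustration):

```python
def delta_bias_correction(y_lm, y_lm_present, y_obs_present):
    """Additive 'delta' method: add the simulated anomaly relative to the
    present-day reference onto the observed present-day climatology."""
    return y_obs_present + (y_lm - y_lm_present)

# One grid cell, temperature in degC (illustrative values only):
y_lm = 4.0    # linear-model reconstruction at some past time slice
y_lm_0 = 6.0  # linear-model value for the present-day reference period
y_cru = 8.5   # CRU TS 1961-1990 observed climatology
corrected = delta_bias_correction(y_lm, y_lm_0, y_cru)
print(corrected)  # 6.5: the observed climatology minus the 2 degC simulated cooling
```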
The BIOME4 vegetation model. We used the BIOME4 model 15 to calculate annual net primary productivity (NPP) and to determine the global distribution of biomes. BIOME4 is a coupled biogeography and biogeochemistry model that simulates the competition of different plant functional types (PFTs). It optimises the leaf area of each PFT as a function of NPP. The BIOME4 forcing consists of monthly values of temperature, precipitation, and sunshine percentage, as well as values of annual minimum temperature and atmospheric CO 2 concentrations. Sunshine percentage was computed using total cloud cover 26 . For atmospheric CO 2 we used the same data set as described earlier 18 . In its default setup, BIOME4 does not incorporate orbital variations that would affect top-of-atmosphere (TOA) insolation; instead, TOA insolation is approximated by a cosine function representative of present-day insolation only. We therefore updated the TOA insolation representation in BIOME4 so that it takes changes in the Earth's orbit into account 27 . Further inputs, which were kept constant through time, are water holding capacity and percolation rate.

Data Records
All data records are publicly available as NetCDF files in the project repository in the data directory 28 .

Monthly mean and annual mean climatologies. The climate variables that are part of our data set are listed in Table 1 and can be downloaded as NetCDF files (Table 2) from the Open Science Framework data repository 28 . All climate variables are available on a 0.5° × 0.5° resolution, i.e., the same regular grid as the CRU TS v4.04 data set 13 .

Bioclimatic variables.
For ecological applications such as species distribution modelling, bioclimatic variables are more relevant than the actual climate variables because they capture information about annual and seasonal climate conditions as reflected by temperature and precipitation 14 . Seventeen of the 19 bioclimatic variables can be directly derived from monthly mean temperature and precipitation data and are thus included in our data records (Tables 1, 2). Annual Mean Diurnal Range (BIO2) and Isothermality (BIO3) cannot be calculated from the available climate model data (HadCM3) and are therefore not included: for BIO2, we do not have the monthly minimum and maximum temperatures, and BIO3 depends on BIO2 (BIO3 = BIO2/BIO7 × 100).
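As an illustration, several of the included bioclimatic variables can be computed directly from twelve monthly means. The values below are invented, and note that with monthly mean temperature only (as here), BIO5/BIO6 are the means of the warmest/coldest month rather than absolute monthly extremes:

```python
import numpy as np

# Invented monthly mean temperature (degC) and precipitation (mm) for one cell
tmon = np.array([-5., -3., 2., 8., 14., 18., 20., 19., 14., 8., 2., -3.])
pmon = np.array([40., 35., 45., 55., 70., 80., 75., 70., 60., 55., 50., 45.])

bio1 = tmon.mean()   # BIO1: annual mean temperature
bio5 = tmon.max()    # BIO5: mean temperature of the warmest month
bio6 = tmon.min()    # BIO6: mean temperature of the coldest month
bio7 = bio5 - bio6   # BIO7: temperature annual range
bio12 = pmon.sum()   # BIO12: annual precipitation
```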
Net primary productivity and biomes. Our data record contains annual and monthly net primary productivity as well as categorical biome data, both of which were calculated with the BIOME4 vegetation model, using the 800 ka of reconstructed and bias-corrected climate (Tables 1, 2).

Technical Validation
Comparison to original HadCM3 simulations. We validated our reconstruction from the linear model against the original HadCM3 snapshots. As comparison metrics, we used the R 2 value, a goodness-of-fit estimator that measures the proportion of variance explained by the linear model, and the root mean squared error (RMSE), which measures how far the linear model predictions deviate from the HadCM3 test data (Fig. 4).
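The two metrics can be written compactly as follows (a generic sketch with invented values, not the authors' code):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error between predictions and test data."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Tiny made-up example: HadCM3 test values vs. linear-model predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
```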
Overall, our linear model is a better predictor for temperature than for precipitation and total cloud cover. Temperature responds more directly to local forcings than precipitation and cloud cover, because it is determined by the energy balance of downward and upward longwave and shortwave radiation and turbulent heat fluxes. The downward shortwave radiation depends on incoming solar radiation that is determined by orbital variations, whereas downward longwave radiation is determined by greenhouse gases such as CO 2 and water vapour, as well as cloud cover. Large-scale atmospheric circulation changes have a much smaller effect on temperature. The high R 2 and low RMSE values in most regions (Fig. 4a,b) mean that temperature is locally well constrained by global CO 2 and orbital variations and our linear model captures this effect well.
The matter is more complicated for precipitation and cloud cover. Both variables are directly affected by the hydrological cycle which itself depends on large-scale atmospheric dynamics, such as monsoonal systems in the tropics and subtropics, or mid-latitude storm systems. Local interactions between the atmosphere and the surface, such as evaporation and transpiration over the ocean, or deep convection over the tropics, matter to a lesser extent. Instead, processes and circulation features like moisture transport or the atmospheric Hadley cell dynamics determine the non-local response of precipitation (and cloud cover) to CO 2 or orbital variations to a much larger extent. Because of the larger dynamical component of the hydrological cycle, precipitation and cloud cover are much less constrained by the forcing than temperature. As a result, the linear model shows less predictive skill for precipitation and total cloud cover (Fig. 4c-f).
Our reconstruction represents only long-term climatologies of past climate changes, similar to other GCM snapshots of the past 11,16 . Therefore, the data set does not contain sub-millennial-scale variability.
The spatial and temporal covariance of the model output. Our climate reconstructions are based on a pixel-specific linear model, one for each of the HadCM3 grid boxes. By design, spatial autocorrelation is not an issue, which it would be if we were to analyse all points simultaneously. In that case, spatial autocorrelation would invalidate the linear model as the residuals would be spatially autocorrelated. However, HadCM3 exhibits a certain spatial structure, and such a pixel-by-pixel approach, where spatial grid cells are treated independently, cannot guarantee that this spatial structure, or covariance, is preserved. We can nevertheless show that the reconstructed climate fields exhibit the same spatial structure as the original HadCM3 model output. First of all, the regression coefficient maps (Fig. 3) indicate that the climate response to external forcings is spatially coherent. We calculated this spatial coherence in terms of the spatial covariance matrix (Fig. 5). It consists of the covariance between the time series of any two grid points, i.e., it is a 7008 × 7008 matrix (96 × 73 = 7008 grid points) of covariances over the 72 snapshots.
For both HadCM3 and the linear model, the covariance matrices are similar in structure and magnitude, and their differences are relatively small (Fig. 5). The covariance matrices are much more similar for temperature than for precipitation and total cloud cover, because the linear regression works better for temperature than for the other two climate fields. Overall, the covariance matrices of the linear model reconstructions are so similar to HadCM3 that we can conclude that the spatial structure is indeed preserved.
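Such a covariance comparison can be sketched for a reduced number of grid points, with synthetic fields standing in for the HadCM3 output and the linear-model reconstruction:

```python
import numpy as np

rng = np.random.default_rng(1)
n_time, n_grid = 72, 50  # e.g. 72 snapshots, a small subset of grid points

# Synthetic "HadCM3" field and a close "linear model" reconstruction of it
field_gcm = rng.standard_normal((n_time, n_grid))
field_lm = field_gcm + 0.05 * rng.standard_normal((n_time, n_grid))

# Covariance between the time series of any two grid points (n_grid x n_grid)
cov_gcm = np.cov(field_gcm, rowvar=False)
cov_lm = np.cov(field_lm, rowvar=False)
max_diff = np.abs(cov_gcm - cov_lm).max()  # small if spatial structure is preserved
```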
For any time series of observations (for example, as shown in Figs. 6, 7), we can assume that data points of that time series are temporally autocorrelated because of possible lag effects that derive from non-equilibrium climate dynamics. However, our reconstructions are based on snapshot simulations that are assumed to be in equilibrium, and any such lags are therefore omitted. The relationship between outputs of snapshots and their forcings can thus be treated as independent data points with no temporal autocorrelation.
Comparison to marine and terrestrial proxies. We compared the reconstructed climate data with marine proxies (before the downscaling/bias-correction step) as a means of highlighting how well the reconstructed long-term climatologies compare to empirical reconstructions.
Marine sediment cores are valuable archives of past sea surface temperature (SST) records. Because their associated bio-geochemistry is relatively straightforward, marine proxies can be utilised as paleo-thermometers and are thus well suited for a direct proxy-model comparison. For these proxies, we compared model-derived mean annual temperature (MAT) time series directly with proxy-derived SST time series and calculated the correlation between the two. Note that MAT and SST are not the same climatological quantities; SST is the temperature of the ocean surface and has a lower limit of about −1.8°C, the freezing point of saltwater. While we expect MAT and SST to co-vary in low and mid-latitudes, at higher latitudes, seasonal or perennial sea-ice cover makes a comparison between both variables problematic.

Fig. 9 Scatter plot of the 20 Middle and Late Pleistocene terrestrial proxies (x-axis) used in this study versus reconstructed mean annual temperature (y-axis). The respective correlation coefficient is shown on the top right in each plot and information about each proxy can be found in Table 4.
Although terrestrial proxies are rarely available as direct temperature estimates, we could still calculate the correlation between the model-derived and the proxy-derived time series. However, the interpretation of terrestrial proxies from a climate perspective can be problematic. For example, pollen-based vegetation reconstructions have been suggested to be less reliable as climate proxies, particularly for interglacials 29 . Other land-based proxies, such as dust deposits, integrate long-term climatic changes over large regions and hence do not necessarily capture climatic effects at their specific location.
For the comparison of our climate reconstructions with proxy reconstructions, we assembled long-term marine SST and terrestrial climate proxy reconstructions (Figs. 6, 7, 8, 9) that cover a period of at least 150 ka during the last 800 ka (Tables 3 and 4).

hadcm3_000-120_jan_regression_temp_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_feb_regression_temp_co2-ecospre-esinpre-obl.nc
…
hadcm3_000-120_dec_regression_temp_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_ann_regression_temp_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_jan_regression_prec_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_feb_regression_prec_co2-ecospre-esinpre-obl.nc
…
hadcm3_000-120_dec_regression_prec_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_ann_regression_prec_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_jan_regression_tcc_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_feb_regression_tcc_co2-ecospre-esinpre-obl.nc
…
hadcm3_000-120_dec_regression_tcc_co2-ecospre-esinpre-obl.nc
hadcm3_000-120_ann_regression_tcc_co2-ecospre-esinpre-obl.nc

Table 5. List of diagnostic regression results that can be found in the Open Science Framework repository 28 under the project's data/coeffs directory. These files contain useful statistical summary information such as R 2 , standard errors, or residuals for the individual, pixel-based linear regression models.

In locations where we have empirical proxies, both on land and over the ocean, our regression-based climate reconstructions match the original HadCM3 simulations well. These reconstructions can therefore be considered representative of the simulated HadCM3 climate. As a consequence, any differences with respect to the proxy records will persist in our reconstructions and, therefore, need to be removed.

Model bias. Because the linear model was fitted to match HadCM3 model outputs, the resulting bias of our reconstructions is similar to the HadCM3 model bias.
The bias of reconstructed temperature and precipitation with respect to the CRU TS data set has a similar spatial pattern (Fig. 10) and is as large as the bias shown for HadCM3 climate reconstructions of the last 60 ka 10 . This is also why our long-term reconstructions show the same bias towards the assembled paleoclimate proxy records as the original HadCM3 simulations (Figs. 6, 7). We conclude from this similarity that our present-day climate reconstruction is of the same quality as the original HadCM3 simulation it is based on. As discussed earlier, the bias was removed from our climate reconstructions using the "delta" method 12 .
Informed user notice. Our final dataset is only as good as i) the applied linear regression model and ii) the underlying climate model, HadCM3 in our case. How well the linear regression model performs for different variables and regions can be seen in Fig. 4. It usually works better for temperature, and thus for temperature-derived bioclimatic variables (see Table 1). For precipitation and total cloud cover, we suggest carefully assessing Fig. 4 to check whether the region of interest is well represented by the statistics-based reconstruction. The actual numbers for the goodness of fit and the model error, R 2 and RMSE, can be found in the diagnostic files listed in Table 5.
The quality of the underlying climate model, HadCM3, can be assessed in two ways: first, by looking at the model bias (Fig. 10), and second, by looking at how well HadCM3 compares to proxy records that go well back in time and show the longer-term climatic changes happening on glacial-interglacial time scales (Figs. 6, 7, 8, 9). However, most geological proxies are only useful for quantitative temperature comparisons; few terrestrial proxies exist for precipitation, and they mostly reflect qualitative changes, e.g., wetter vs. drier periods. For cloud cover, no such geological proxies exist.

Usage Notes
Examples on how to access the NetCDF files of the reconstructed climate using Python or R are provided in the examples directory of the project repository 28 .

Code availability
Model code for the linear regression as well as the code for the analysis and visualisation of figures is publicly available in the project repository 28 . NetCDF files have been processed using cdo 30 . We used the Python language for most of our scripts with a few bash scripts as wrappers. The workflow for the data generation process is managed by Snakemake 31 . The linear regression is based on the statsmodels package 32 . All visualisations are made with matplotlib 33 using cartopy 34 for maps. Other Python packages used are (in alphabetical order): adjustText 35 , BeautifulSoup4 (https://www.crummy.com/software/BeautifulSoup/), netCDF4 36 , numpy 37 , pandas 38,39 , scipy 40 , scikit-image 41 , and tqdm 42 .