Introduction

Microfossil data (pollen, diatoms, foraminifera, chironomids, testate amoebae, ostracods) are widely employed as proxy indicators of past environmental variations, with applications in palaeoclimatology, environmental monitoring, and the study of ecosystem sensitivity, resilience and anthropogenic impact. Since the 1970s, a range of quantitative approaches have emerged to infer palaeoenvironmental variables from microfossil assemblages1,2,3. Numerical palaeoenvironmental reconstructions are generally based on a modern calibration dataset, consisting of surface sediment (i.e., chronologically recent) samples of species assemblages, with modern environmental data attached to each surface sample. Palaeoenvironmental reconstructions are then prepared using a model trained with the calibration dataset and applied to samples of fossil assemblages. A persistent challenge, however, has been to validate these reconstructions, due to limited independent knowledge about past environmental conditions4. Cross-validation (CV) analyses of the modern calibration data help give an estimate of reconstruction ability for fossil samples5,6,7,8. In this work, we explore the ability of a range of quantitative algorithms to robustly reconstruct climate variables from fossil pollen data. Our particular focus is on the reconstruction of secondary climate variables, i.e., variables with a lesser ecological effect on the studied biological proxy, compared to the larger effect of the primary climate variable9,10.

Several factors motivate this focus on refining methods to extract reconstructions of secondary climate variables. First, secondary variables have a long history of use in palaeoclimatology, because the multivariate nature of micropalaeontological datasets permits application of multivariate statistical methods that can, in theory, independently reconstruct multiple environmental variables, with analysis of variance enabling designation of variables with primary, secondary, etc. explanatory power11. However, recent literature has highlighted important challenges in the reconstruction of secondary (and further) variables. These challenges include the risk of spurious reconstructions, due to the ecological model for the secondary model lacking independence from the effects of the primary variable10,12, the assumption that the covariance structure between environmental variables is preserved over time13, and the unreliability of common CV schemes in identifying models which can reconstruct secondary variables in the presence of spatially autocorrelated and ecologically significant nuisance variables6,8. These results have cast doubt on many published reconstructions of secondary variables. Yet, secondary variables are important for palaeoclimatology, because they reveal diagnostic signals of the past variability in vital climate factors, forcings, feedbacks, or atmospheric-oceanic circulation mechanisms, that would otherwise be undetectable. For example, while in northern latitudes climate reconstructions from pollen and chironomids usually indicate summer temperature as a primary variable, past changes in winter temperature or precipitation may be vital to detecting and understanding past variation in sea ice extent14, oceanic circulation15, or drought regimes16,17. In ideal situations, palaeoclimate information is available from multiple proxies, allowing an independent validation4 of reconstructions for hard-to-reconstruct variables such as moisture18 or winter temperature19.

Second, there is a strong basis in ecological theory for the reconstruction of secondary variables: species distributions and abundances are affected by multiple environmental variables, and each species is likely to have its own unique fundamental niche20,21. Taxon responses to past and present climate changes are highly individualistic22, and indicator taxa exist for a number of climate variables9,23,24. Thus some signal of multiple climate variables should be extractable from multivariate fossil datasets that include multiple indicator taxa25, though potentially challenging in practice.

Our final motivation is the on-going and rapid advances in a relatively new class of reconstruction approaches, consisting of machine-learning (ML) algorithms that use ensemble models of regression trees7,9,26,27. ML approaches have important theoretical strengths in the reconstruction of secondary variables. First, tree ensembles make selective use of predictor variables, and are thus able to focus on the potentially small number of useful indicator taxa for a secondary variable, while ignoring the numerous and abundant indicators of the primary variable. Second, the tree models give equal weight to rare and abundant taxa, which can help capture the signal of secondary variables in the variation of relatively rare fossil taxa7,27,28. Beyond these conceptual strengths, an increasing number of recent studies suggest regression tree ensembles can provide well-performing calibration models for microfossil proxies7,9,26,27,29 and for species distribution models applied to fossil pollen data30. Here, we employ three types of regression tree ensemble models: random forest (RF), boosted regression tree (BRT), and extremely randomized trees (ETREES). We also test another family of ML models, artificial neural networks, using both the traditional (NNET) and the Extreme Learning Machine (ELM) implementations. This is the first use of two of these ML methods (ETREES, ELM) in palaeoclimate reconstruction.

In this study, we test the ability of pollen–climate calibration models to detect the signals of both primary and secondary climate variables, while controlling for two key sources of bias. First, we select minimally correlated primary and secondary variables, to ensure the independence of the pollen–climate calibration models. Second, we evaluate the models using a new CV approach, a variation on h-block CV6,8 that employs a variable radius. In this approach, the models are tested in several CV cycles while omitting samples from an increasingly large radius (h) around the test sample. In this way, model performance is evaluated with increasingly poor analogues available for the test sample in the calibration data, providing a strong test of the generality and robustness of the models. This approach also allows h to be customized to the dataset at hand, to find the optimal tradeoff between removing spatial autocorrelation effects and maximizing calibration dataset size. We prepare the models using pollen–climate calibration datasets (surface sediment pollen samples and modern climate data) from North America and Europe (Fig. 1). Finally, we test these models in palaeo-reconstructions using four fossil pollen datasets of the present (Holocene) and last interglacial (LIG) periods from North America and Europe (Fig. 1, Table 1). We use eight quantitative methods to prepare the calibration models (Table 2), with the five ML-based methods compared against three classical approaches (weighted averaging (WA), weighted averaging-partial least squares (WAPLS), and the modern analogue technique (MAT)). We show that the pollen–climate calibration models developed here perform well for both primary and secondary variables. Further, we find regression-tree-based ML approaches to consistently outperform other reconstruction methods.

Figure 1
figure 1

Modern and fossil pollen datasets in North America and Europe.

Table 1 Fossil pollen datasets used for the study.
Table 2 Modelling tools used and their parameterization.

Results

Variable selection

We choose mean July temperature (Tjul; Fig. 2a) as the primary reconstructed variable for both North America and Europe, due to the strong effect of this variable on calibration species data variation. The secondary reconstructed variables used differ between North America and Europe, due to the different cross-correlation structures between climate variables on the two continents. Most notably, in North America summer and winter temperature related variables are strongly correlated, while in Europe they are not. We thus select mean January temperature (Tjan; Fig. 2b) as the secondary variable to be reconstructed in Europe, however in North America water balance (Fig. 2c), i.e. the difference between the total annual precipitation and evapotranspiration24, is reconstructed. These secondary variables are known to be ecologically important with respect to affecting plant distributions and abundances and each has a low correlation with the primary reconstructed variable Tjul (Fig. 2d).

Figure 2
figure 2

Spatial distribution and correlation of reconstructed variables. The maps show the modern values for (a) July mean air temperature (Tjul), (b) January mean air temperature (Tjan), and (c) Annual water balance. Panel (d) shows observed values and the Spearman correlation (ρ) for the reconstructed climate variables for the European and North American pollen–climate calibration datasets.

Cross-validations

Transitioning from leave-one-out CV (h = 0) to a small h of 100–200 km produces an initial decrease in model performance, indicated by an increase in RMSEP (Fig. 3a,b). This is expected from earlier h-block experiments6,7,8,26, due to the removal of pseudo-replicate samples that closely resemble the test sample due to spatial autocorrelation in nuisance variables. This initial decrease in performance varies between methods and is especially noteworthy for MAT in Europe, where MAT is the best-performing method in leave-one-out CV, but falls behind the best-performing methods with increasing h (Fig. 3a,b). At intermediate h values (~200–1000 km), prediction performance decreases only gradually, and the relative performance of the methods remains largely unchanged. At large h values (>~1000 km) model performance markedly deteriorates (Fig. 3a,b), as the number (Fig. 3c) and quality (Fig. 3d) of available analogues in the calibration data worsens. As at small h values, this deterioration is larger for some methods, particularly MAT and neural networks (NNET, ELM), compared to unimodal transfer functions (WA, WA-PLS) and regression tree ensembles (RF, BRT, ETREES) which perform better in prediction with poor analogues.

Figure 3
figure 3

Cross-validation (CV) results. Results are shown with a series of h-block CV’s, with h increasing from 0 to 1500 km at 100 km increments (note: h = 0 is equivalent to leave-one-out CV). Orange bars indicate the best h, estimated based on the range of a variogram fitted to the residuals of a weighted averaging (WA) model. (a) Root-mean-square error of prediction (RMSEP) for the primary variable in North America and Europe (July mean temperature). (b) RMSEP for the secondary variable (water balance in North America, January mean temperature in Europe). (c) Loss of calibration data in CV models, shown for individual sites and as the median for all sites. (d) Compositional distance (squared chord distance) to best pollen analogue in CV models, shown for individual sites and as the median for all sites.

This intermediate h range, and the corresponding RMSEP-vs-h plateau is expected to give an unbiased estimate of predictive ability, with the effect of pseudoreplicate samples minimized, but with the models still having sufficient data for prediction7. We independently estimated the correct h to use based on the range of a circular variogram fitted to the residuals of a WA model (as suggested in refs6,8). These estimates for correct h (200–600 km; orange bars in Fig. 3) fall on the observed RMSEP-vs-h plateaus (Fig. 3a,b), indicating a congruence among methods.

In the medium-h zone expected to give unbiased estimates, the tree-ensemble approaches (BRT, ETREES, RF) have the best performance and rank as the top three methods in all cases (Table 3). Among these three methods, BRT has the lowest RMSEP in three cases out of four, while ETREES has the lowest RMSEP for Tjul in North America. While the RMSEP differences between the tree-ensemble methods are relatively minor, in the European models for both Tjul and Tjan, BRT has a lower maximum bias by a considerable margin compared to ETREES and RF (Table 3). The three tree-ensemble methods are followed by WA, WAPLS, MAT and ELM in varying order, while NNET is the worst-performing method, ranking at bottom in three cases out of four and second-worst in the fourth case. The overall performance achieved is strong, with the RMSEP of the best-performing model, for each dataset-variable pairing, constituting between 6.5% (Tjul in North America) and 10.0% (water balance in North America) of calibration data gradient length. The coefficients of determination (R2) of the models range from 0.20 (NNET model for North American water balance) to 0.88 (ETREES model for North American Tjul). For further details, see Supplementary Figs S5S8.

Table 3 Cross-validated performance metrics for the individual pollen–climate calibration models.

Palaeo-reconstructions

In the reconstructions of Holocene climate variations at the two North American test sites (Fig. 4a,b), the main feature in Tjul is the early-Holocene rise, followed by a late-Holocene fall, showing the well-known mid-Holocene temperature maximum of the northern mid-latitudes31. The water balance curves, by contrast, show an early-Holocene decline, culminating in a dry period starting around 8 ka (depending on smoothing bandwidth considered), and followed by a gradual rise in water balance at around 6 ka. The early–mid Holocene maximum in aridity is consistent with numerous multi-proxy records from the Great Plains17,18,32,33,34. The strong differences in temporal pattern further suggests that both temperature and water balance signals can be separately deconvolved from mid-continental North American pollen records16,18,32,34.

Figure 4
figure 4

Palaeoclimate reconstructions. Reconstructions are shown for primary and secondary climate variables and prepared with eight calibration methods from each fossil dataset. The black dashed lines indicate the modern climate values at the fossil sites. The SiZer maps (lower panels) show the significant features of the reconstructions, based on the curve using the calibration method with the strongest CV performance for the climate variable in question (BRT or ETREES; Table 3). The reconstruction is smoothed at different bandwidths, with the bandwidth used at each point on the vertical axis indicated by the horizontal distance between the white lines. For each point in time and each bandwidth (h), red indicates a significant rising trend, blue a significant falling trend, purple a lack of a significant trend, and grey a lack of sufficient data for meaningful inference.

The European Holocene reconstructions (Fig. 4c) for Tjul and Tjan are broadly similar to each other. In the long-bandwidth (>1 ka) end of the SiZer map, the early-Holocene rise, mid-Holocene maximum, and late-Holocene decline are statistically significant features for both variables. The Tjul reconstruction thus shows the classical, mid-Holocene summer temperature maximum of the European high latitudes seen in palaeoclimate reconstructions35,36 and modelling37. The Tjan reconstruction is more difficult to validate compared to Tjul, because of broader uncertainty about whether the temporal variations of Tjan and Tjul should be correlated (e.g. due to GHG forcing) or anti-correlated (e.g. due to orbital forcing)31,38,39. Earlier reconstructions35,36 and modelling37 have shown only weak trends for Tjan in NE Europe, and reconstruction and modelling uncertainties are much greater for Tjan compared to Tjul. However, our mid-Holocene maximum in Tjan is consistent with fossil evidence for one well-understood winter temperature indicator in Northern Europe, the hazel (Corylus avellana), which shows a major, northward range expansion in Finland and Scandinavia during the mid-Holocene40.

By contrast, for the European LIG site (Fig. 4d), the main features at the long-bandwidth end of the SiZer map differ between Tjul and Tjan. The Tjul reconstruction shows an early-LIG rise, mid-LIG maximum followed by a late-LIG cooling. However, the Tjan reconstruction only shows a significant warming trend spanning much of the LIG. These trends are consistent with major identified forcings, as the first-order rising (Tjan) and falling (Tjul) temperature trends closely follow the changes in winter and summer insolation, respectively19. At shorter bandwidths, the beginning and end of an abrupt cold event at ca. 128–126 ka, correlated with changes in North Atlantic circulation19,41, are also significant features.

The agreement between different methods is generally good in all reconstructions, however increased spread is observed especially in early parts of the interglacials and the late-glacial section included in the Moon Lake record (Fig. 4d). The periods of larger spread between methods generally coincide with periods of increased modern analogue distances found for the fossil samples (Supplementary Fig. S9), likely due to non-climatic effects during pioneer vegetation stages or non-analogue climates not included in the calibration data42. However, NNET is in several instances a clear outlier. For the European fossil sequences (Fig. 4c,d), the NNET curves exhibit larger sample-to-sample noise compared to other methods, and for some sections of the Laihalampi sequence (Fig. 3c) also indicate much colder temperatures than other methods. In North America, NNET jumps between a handful of values in the Tjul reconstructions, and reconstructs no variation in water balance but only a single outlier value. For water balance in North America, a second outlier is seen in WA, for which the mid-Holocene decline is considerably smaller than for the multi-method median. The reason for the shallower water balance anomaly with WA might be the well-known weakness of this method in predicting for samples near the ends of the calibration-data gradient4,43. This tendency is evident in the residual pattern of the WA-based pollen–water balance model, which (along with NNET) has a considerably larger positive bias among all eight models at the dry end of the modern water balance gradient (Supplementary Fig. S6), likely contributing to a positive bias in the palaeoclimate reconstruction during the mid-Holocene dry stage.

Discussion

Based on these findings, we consider BRT or other regression tree ensemble machine-learning methods as highly promising tools for climate reconstructions from fossil pollen data. The other regression tree ensemble methods RF and ETREES show a broadly similar performance and behaviour in CV (Fig. 3) compared to BRT, and also produce similar palaeo-reconstructions (Fig. 4). The most important difference between BRT, RF, and ETREES is found in maximum bias, where BRT significantly improves on RF and ETREES in three cases out of four. The maximum bias usually affects the gradient ends7,26, and this is also seen in the residual patterns of the models prepared in this study (Supplementary Figs S5S8). This means BRT is relatively strong in predicting for samples located near the ends of the environmental gradient covered by the calibration data. Hence, BRT may be particularly useful for palaeo-reconstructions in situations where the fossil dataset is located towards a fringe of the available calibration data.

Beyond the strong performance, we note that BRT has numerous practical benefits in applications with microfossil datasets. First, among these calibration methods, BRT has unusually powerful tools to analyse the model structure. For example, the user can extract the percentage contributions and plot the modelled responses curves of each taxon9,26,44, which can be invaluable to estimate the effect of each taxon and to verify that the models are consistent with prior ecological knowledge19. For example, Table 4 shows that the secondary-variable models have a distinct structure compared to the primary-variable models, and well-understood indicators for the secondary variables, such as Corylus and Quercus for Tjan in Europe and Artemisia and Chenopodiaceae for water balance in North America, are employed. Second, the BRT models are not affected by monotonic data transformations performed on the predictor set (here, calibration species data). Third, BRT can handle complex responses and assumes no specific response shape, which is a benefit with microfossil calibration datasets which commonly show a mixture of linear, unimodal and multi-modal species responses. This is in contrast with parametric calibration methods assuming a specific response type, such as WA and WA-PLS which fit unimodal response functions regardless of the shape of the underlying response. Fourth, BRT implicitly incorporates interactions between predictors, e.g., situations where a given taxon is only useful in a subset of the calibration data, or indicates different environmental conditions in different subsets44.

Table 4 Relative contribution of the ten most important predictor taxa for the boosted regression tree pollen–climate models for July mean temperature (Tjul), January mean temperature (Tjan), and water balance.

However, BRT may not be the optimal choice with small datasets (n < 100) and/or data exhibiting strong linear responses45, a limitation also observed with relatively small microfossil proxy calibration datasets7. A further practical challenge of BRT is the non-trivial parameterization (Table 2), which requires CV testing with alternative parameterizations for each dataset44. Moreover, BRT uses relatively large ensemble models (ca. 2,000–20,000 trees) which can be a challenge especially with large datasets and in calculation-intensive CV schemes like the leave-one-out or the h-block, potentially involving calculation times of several CPU hours for a full CV cycle. Hence, for some applications, RF may represent a cost-effective alternative to BRT, with similar predictive performance (Fig. 3, Table 2) here reached with small ensemble models of 100 trees, and requiring no further parameterization. The practical benefits of BRT also largely apply to RF, as they arise from the general properties of regression tree based modelling28.

Among neural network models, we find ELM to improve on the classical NNET, which has the overall weakest performance of all methods (Table 2). In palaeo-reconstructions (Fig. 4), NNET has clear trouble producing diverse predictions for the fossil samples. For North-American water balance only a single value is reconstructed. For North-American Tjul the reconstructions are semi-discrete, with most data points falling on a handful of values, although the reconstructions generally follow the main features of the reconstructions with other methods. ELM generally outperforms NNET in CV (Fig. 3, Table 3) while also producing more realistic palaeo-reconstructions (Fig. 4). However, ELM fails spectacularly during CV for some h values (Fig. 3), suggesting great sensitivity to small variations in the calibration data. We thus recommend caution in the use of ELM until its behaviour is better understood. However, this result showcases another benefit of the variable-radius h-block CV, because it reveals an instability in ELM that would have been missed with a single CV cycle.

Good CV performance is no guarantee of reconstruction ability. Even if the CV scheme used gives a robust estimate of predictive ability in the modern world, fossil samples come with additional caveats, such as non-analogue climates not represented in the modern calibration data13,46 and taphonomic inconsistencies between some calibration and fossil data29. Thus, the criteria used here (low correlation to other environmental factors, a significant effect in calibration data, an ecological basis for selection) should be considered necessary but not sufficient guarantees that a useful palaeo-reconstruction is obtained for the variable in question. Large-radius h-block CV may approximate the problem of no analogues, due to spatial autocorrelation in the calibration species data: as calibration data is lost with increasing h (Fig. 3c) the predictions are done with increasingly poor analogues available for the test sample (Fig. 3d). In our results, we see important differences in the CV performance at high h values (>1000 km), with especially MAT and neural network based models (NNET, ELM) struggling compared to the other approaches (Fig. 3a,b).

Our reconstructions from fossil datasets (Fig. 4) are mainly proofs of concept, in which we check the results for major suspicious features, e.g., differences between methods, high noise, or major patterns inconsistent with prior knowledge. These are possibly “easy” test cases, with e.g. exceptionally strong opposite trends in winter and summer temperature forcing during the LIG19,37 and prior multiproxy evidence for major mid-Holocene aridity in the Great Plains of North America17,18,33,34,47 as well as prior applications of pollen-based palaeoclimatic transfer functions to separately reconstruct past temperature and moisture variations16,48,49. Whether robust and repeatable reconstructions can be achieved for secondary or tertiary variables across a larger body of micropalaeontological data remains a question for future research. These efforts should include not only the refinement of the proxy–climate calibration models, but also an increasing use of supporting multi-proxy data to control for proxy-specific biases.

While multivariate climate reconstructions from pollen and other microfossil proxies have been prepared for decades, important pitfalls have been identified in this approach, including the possible lack of model independence10 and issues in model validation6. Here we show that reconstruction of primary and secondary climate variables is indeed possible, at least for some variables and regions, given careful variable selection, consideration of the proxy ecology, and sufficiently sensitive reconstruction algorithms. We find major independent features in our primary and secondary variable palaeo-reconstructions, including the opposite summer and winter temperature trends of the LIG in Europe, as well as the mid-Holocene drought in North America coinciding with the temperature maximum. These opposite first-order trends in primary and secondary variables emerge despite weak positive correlations in the calibration data, which means the secondary-variable reconstructions are unlikely to be driven by calibration data correlations (sensu ref.10). Considering the independent features in the palaeo-reconstructions, the agreement with the identified climate forcings and complementary proxy data, the robust CV performance, and the ecological realism of the calibration models, we suggest current advanced machine-learning techniques are able to detect the independent signals of both primary and secondary climate variables. Hence, this work supports the judicious use of numerical techniques and microfossil data to reconstruct both primary and secondary climate variables.

Conclusions

  • Well-performing pollen–climate calibration models were achieved for secondary climatic variables (January temperature in Europe, water balance in North America), despite a conservative CV scheme, and in the absence of correlations with the primary climate variable (July temperature). Palaeoclimate reconstructions prepared from fossil datasets for the primary and secondary variables show independent features consistent with known climate forcings, palaeoclimate modelling, and other proxy data.

  • Among different calibration techniques, regression tree ensemble methods (BRT, RF, ETREES) generally perform best. BRT further outperforms RF and ETREES in maximum bias, particularly for samples located near the ends of the data gradient. In prediction for samples without good analogues in the training data, MAT and neural network models (NNET, ELM) perform considerably worse than the other methods.

  • Our study highlights the usefulness of variable-radius h-block CV as a practical and a neutral scheme for calibration model selection. This approach (1) removes the effect of spatial autocorrelation from the CV results, (2) helps identify the h range giving unbiased performance estimates, (3) tests model behaviour with poor modern analogues at large h values, and (4) may reveal the instability of a calibration method with small data variations.

  • Overall, these analyses show how concerns about the robustness of palaeoclimatic reconstructions due to effects of spatial autocorrelation and temporally varying cross-correlation can be allayed through use of newer ML approaches and careful attention to the CV method. Reconstruction of secondary variables is possible, at least for some regions and variables. Careful consideration of the underlying ecology of the biotic proxy being used, and of the modern cross-correlation structure of environmental variables, are vital to guarantee that useful and independent calibration models are obtained for both the primary and secondary variables.

Methods

Datasets

The North American modern pollen data (2254 samples) were derived from a subset of the North American Modern Pollen Database50,51, with samples removed if they source from regions that floristically differ from the north-central US, where the two fossil pollen sites are located. For this analysis, all samples from the southeastern US and western North America were removed, due to different species of Pinus and other taxa found in these regions and known interregional differences in species-climate relationships51. For Europe, we use an 807-sample pollen–climate calibration set derived from the European Modern Pollen Dataset52 (EMPD), including lakes from the northern part of the EMPD and using a harmonized taxonomy of 73 terrestrial pollen and spore types19. We extracted climate data for both the European and North American calibration samples from the CRU CL v. 2.0 climate grids53 with the raster library54 for R55, using bilinear interpolation based on four closest grid cells and lapse-rate corrected (6.4 °C/km) based on the difference between site elevation and grid cell elevation.

To test the pollen–climate models in palaeo-reconstruction we use four previously published fossil pollen datasets (Table 1). The North-American datasets cover the Holocene (and about 2 ka of the late-glacial period in the Moon Lake dataset) and are located at the prairie–mixed forest ecotone at the eastern fringe of the North American Great Plains. The European sites Laihalampi and Sokli are located in Finland within the boreal forest zone and cover the present and last interglacials, respectively. The North American datasets were acquired from the Neotoma Paleoecology Database. The Laihalampi dataset was provided by the original data author, while the Sokli dataset was published by the present authors19.

Variable selection

To guide the selection of reconstructed climate variables, we consider the following to be requirements for a useful variable. The variable should have:

  1. 1.

    a significant effect on species variation in the calibration data,

  2. 2.

    a low correlation with other ecologically significant variables, to guarantee that the effect can be independently modelled, and by extension, that no spurious features emerge in palaeo-reconstructions10, and

  3. 3.

    an autecological basis, to guarantee that the statistical effect (point 1) is not due to correlation with unknown, ecologically significant variables.

To find the two climate variables for each region (Europe and North America) that best meet these requirements, we use the following workflow:

  1. 1.

    Calculate a Spearman correlation matrix for a large number of climate variables. We included 20 variables with possible ecological influence, representing temperature (annual mean, December-to-February mean, mean of coldest month, January mean, June-to-August mean, mean of warmest month, July mean, number of frost days, growing degree days) precipitation (annual total, total for driest month, total for wettest month, total for warmest quarter, total for coldest quarter) and moisture availability (water balance (difference between annual precipitation and evapotranspiration24), Priestley-Taylor alpha (ratio of actual to potential annual evapotranspiration23)), and their seasonal variation (temperature range (range of monthly temperature means), temperature seasonality (SD of monthly temperature means × 100), precipitation seasonality (ratio of largest to smallest monthly precipitation total), continentality index56).

  2. 2.

    From the full set of 20 variables, pick a subset for which all between-variable absolute correlations are < 0.7. (Note: at this stage the variables are not taken as ecologically meaningful, merely that their independent effects can be modelled.)

  3. 3.

    Run an ensemble of modelling tools to estimate how well the variation in each climate variable is explained with the calibration species data. Rank the variables based on variance explained (mean R2 across for all modelling tools used). Here, we used an ensemble of 10 modelling tools (generalized additive models, conditional inference trees, random forests, extremely randomized trees, multivariate adaptive regression splines, generalized additive models by likelihood based boosting, gradient boosting with regression trees, gradient boosting for additive models, Extreme Learning Machine neural networks, and single-hidden-layer neural networks).

  4. 4.

    Check that each variable has an independent effect in non-metric multidimensional scaling; if not, exclude.

  5. 5.

    Check that each variable is ecologically credible; if not, exclude.

  6. 6.

    Pick the two best remaining variables.

The full results of these analyses are presented in Supplementary Figs S1S4 and Tables S1S4. The choice of reconstructed climate variables differs between Europe (Tjul, Tjan) and North America (Tjul, water balance), due to differences in cross-correlation structure among climate variables on the two continents.

In North America, summer and winter temperature related variables are strongly correlated (ρ = ~0.8) (Supplementary Fig. S1), precluding using both for reconstruction, and we only choose Tjul for further consideration. Within a subset of five variables with acceptable between-variable correlations (Supplementary Fig. S2), pollen–climate models for Tjul have the highest mean R2 (0.89) (Supplementary Table S1), and Tjul is thus selected as the primary variable. Tjul is followed in the R2 ranking by two moisture-related variables, precipitation of the warmest quarter (0.75) and water balance (0.71). Of these variables, we select water balance as the secondary variable despite a slightly lower R2, due to being a moisture-related variable incorporation evapotranspiration (see below), and due to having only a minimal (ρ = 0.05) although still statistically significant (p = 0.02) correlation with the primary variable Tjul, while for precipitation of the warmest quarter the correlation to Tjul is considerably higher (ρ = 0.46).

In Europe, summer and winter temperature related variables are much lower correlated (Supplementary Fig. S3), allowing both Tjul and Tjan (correlated at ρ = 0.28; p < 0.001) to be considered. Within the subset of five variables with acceptable between-variable correlations (Supplementary Fig. S4), Tjan and Tjul are the two variables with highest mean R2 values (0.79 and 0.72, respectively; Supplementary Table S3), and are thus selected as the two reconstructed variables. While Tjan has a higher R2 compared to Tjul in the cross validation, we designate Tjul the primary variable as our fossil datasets are located in the northern subset of the calibration data, and here the effect of summer temperature is clearly dominant to winter temperature9,26, and the signal of Tjul is thus expected to be stronger in these fossil datasets.

Our three climate variables (Tjul, Tjan, water balance) reflect principal limitations on plant growth and survival23,24,57. In seasonally variable environments, annual mean temperature does not represent the growing season or over-wintering conditions, which play a more central role in governing the distribution and abundance of plants58,59. In high- and mid-latitudes in the northern hemisphere, Tjul describes the temperature of the warmest month and overall growing season conditions, whereas Tjan indicates the wintertime conditions and general stress (related to overwintering survival) of the coldest period of the year. Predictors representing water availability for plants are often derived from mean annual precipitation, however precipitation is a poor surrogate for plant-available water. This is because water availability is strongly related to evaporation, for example in cold climates 500 mm rainfall per year produces a positive water balance (precipitation minus evapotranspiration), whereas in temperate systems the same amount of rainfall creates semi-arid conditions with a negative water balance. Thus, water balance represents a more accurate measure of plant available water compared with precipitation59.

Calibration methods

We use eight quantitative reconstruction approaches which can be divided into four methodological families. The models were run in R55 using parameterizations listed in Table 2. The CV runs were performed on the Taito supercluster of CSC – IT Center for Science Ltd., Espoo, Finland. For the code implementing the h-block CV run, see Supplementary Code.

The modern analogue technique2 (MAT) is a traditional non-parametric reconstruction approach with microfossil data. The method looks for n closest modern assemblages for each fossil sample, using a chosen compositional distance metric, and calculates the reconstructed palaeoclimate value as the mean (or weighted mean) of the values at the modern sample sites. Weighted averaging3 (WA) and weighted averaging-partial least squares43 (WAPLS) are closely related methods with a long history of use in microfossil-based palaeoclimate reconstructions. They are based on fitting unimodal response functions to the modern distribution of each taxon, and then calculating the palaeoclimate value based on the modern responses of all the taxa found in the fossil sample. MAT, WA, and WAPLS were implemented with the R package rioja60.

We also use two families of machine-learning based modelling approaches: regression tree ensembles and neural networks. The random forest61 (RF; implemented with the R package randomForest62) and extremely randomized trees63 (ETREES; for implementation see ref.64) are ensemble models of regression trees, in which a number of trees are calculated and the final prediction calculated as the mean of the predictions from the individual trees. RF and ETREES differ in the approaches used to create variety in the individual trees: while RF uses bootstrap samples of the entire training data for each tree, and uses a randomly selected subset of the entire predictor set when determining each tree split, ETREES selects the cut point at random. Our final tree-ensemble method, the boosted regression tree44,65 (BRT; implemented with the R package gbm66), differs in that the ensemble is built sequentially, with each added tree aiming to explain the residuals of the previously fitted ensemble. The BRT can thus be likened to an additive regression model in which the individual terms are regression trees. RF and BRT have seen some recent use with microfossil data7,9,19,26,27,29,30. To our knowledge, this is the first application of ETREES in this field. Finally, we use two variations neural network algorithms, the traditional implementation67 (NNET) and Extreme Learning Machine68 (ELM). The R packages nnet69 (NNET) and elmNN70 (ELM) both implement single-hidden-layer feedforward neural networks. NNET uses sigmoid activation functions in the hidden layer and optimizes the weights of all connections. ELM has a wider variety of alternative activation functions and uses random weights in the hidden layer. NNET has seen some use in palaeo-environmental reconstructions from microfossil proxies4 while ELM has not.

Cross-validation

A recognized limitation of many commonly used CV schemes is their susceptibility to over-estimate predictive ability in the presence of spatial autocorrelation in the calibration data6,8. The h-block CV6 has been suggested as a solution, in which in each CV iteration the training set omits not only the test sample, but also all samples within a specified radius (h) from the test sample. However, the h-block CV introduces a new challenge of choosing a correct h which removes pseudo-replicate samples but does not undermine performance by removing too much data8. Here, we adopt the approach of running h-block CV with a range of h (0–1500 km at 100 km increments). Of these, the h = 0 iteration is equivalent to the common leave-one-out CV. We observe the change in RMSEP with increasing h to estimate an h that represents a balance between removing pseudo-replicates but retaining sufficient data coverage7. As a guide to assessing the results, we also estimate the optimal h following ref.6, who suggest estimating h as the range of a circular variogram fitted to the residuals of a WA model in a leave-one-out CV. CV performance is summarized with 1) the root-mean-square error of prediction (RMSEP) and 2) maximum bias (the largest mean of prediction residuals found for any of the 10 equal length segments of the calibration data climate gradient), representing a “worst case” error for some segment of the calibration data climatic gradient.

Palaeoclimate reconstructions

Palaeoclimate reconstructions for the primary and secondary variable were prepared from the fossil datasets with each of the eight calibration models. To assess the presence of independent features in the reconstructions, we prepare SiZer maps71 (implemented with the R package SiZer72) based on the palaeoclimate curves. Here we use the palaeoclimate curve prepared with the method which showed the strongest CV performance in predicting the reconstructed climate variable on that continent (BRT or ETREES; Table 3). In the SiZer analysis, a family of smoothers with a range of bandwidths is first applied to the palaeoclimate curve. The derivative of the smoother of each bandwidth is then analysed for significant deviations from zero, to identify time segments at which the smoother has a statistically significant rising trend, a falling trend, or no trend. The results are displayed as a two-dimensional coloured raster (the SiZer map) where the y axis indicates the smoothing bandwidth considered, x axis the point in time, and the cell colour the presence of a significant rising trend (red), a significant falling trend (blue), the lack of a significant trend (purple), or the lack of data for meaningful inference (grey).