State-of-the-art global models underestimate impacts from climate extremes

Schewe, Jacob; Gosling, Simon N.; Reyer, Christopher; Zhao, Fang; Ciais, Philippe; Elliott, Joshua; Francois, Louis; Huber, Veronika; Lotze, Heike K.; Seneviratne, Sonia I.; van Vliet, Michelle T. H.; Vautard, Robert; Wada, Yoshihide; Breuer, Lutz; Büchner, Matthias; Carozza, David A.; Chang, Jinfeng; Coll, Marta; Deryng, Delphine; de Wit, Allard; Eddy, Tyler D.; Folberth, Christian; Frieler, Katja; Friend, Andrew D.; Gerten, Dieter; Gudmundsson, Lukas; Hanasaki, Naota; Ito, Akihiko; Khabarov, Nikolay; Kim, Hyungjun; Lawrence, Peter; Morfopoulos, Catherine; Müller, Christoph; Müller Schmied, Hannes; Orth, René; Ostberg, Sebastian; Pokhrel, Yadu; Pugh, Thomas A. M.; Sakurai, Gen; Satoh, Yusuke; Schmid, Erwin; Stacke, Tobias; Steenbeek, Jeroen; Steinkamp, Jörg; Tang, Qiuhong; Tian, Hanqin; Tittensor, Derek P.; Volkholz, Jan; Wang, Xuhui; Warszawski, Lila

doi:10.1038/s41467-019-08745-6

Download PDF

Article
Open access
Published: 01 March 2019

State-of-the-art global models underestimate impacts from climate extremes

Nature Communications volume 10, Article number: 1005 (2019) Cite this article

27k Accesses
176 Citations
262 Altmetric
Metrics details

Subjects

Abstract

Global impact models represent process-level understanding of how natural and human systems may be affected by climate change. Their projections are used in integrated assessments of climate change. Here we test, for the first time, systematically across many important systems, how well such impact models capture the impacts of extreme climate conditions. Using the 2003 European heat wave and drought as a historical analogue for comparable events in the future, we find that a majority of models underestimate the extremeness of impacts in important sectors such as agriculture, terrestrial ecosystems, and heat-related human mortality, while impacts on water resources and hydropower are overestimated in some river basins; and the spread across models is often large. This has important implications for economic assessments of climate change impacts that rely on these models. It also means that societal risks from future extreme events may be greater than previously thought.

Significantly wetter or drier future conditions for one to two thirds of the world’s population

Article Open access 11 January 2024

Climate damage projections beyond annual temperature

Article Open access 17 April 2024

A Large Ensemble Global Dataset for Climate Impact Assessments

Article Open access 14 November 2023

Introduction

Estimation of the total damages caused by climate change requires a quantification of climate impacts across a large range of economic and societal sectors. These sectors include agriculture¹, water resources², energy supply and demand³, human health⁴ and ecosystem services⁵. There are approaches that integrate damages across sectors, such as the highly idealised damage functions used in integrated assessment modelling⁶, but also more sophisticated, coupled economic modelling frameworks that combine individual sectoral models^7,8,9. However, these approaches are centred on gradual changes in physical and biophysical indicators—such as crop yields or water resources—and largely ignore the impacts of extreme climate and weather events.

This is a serious research gap because such events cause enormous damages¹⁰. For addressing it, sectoral impact models must be able to credibly represent the impacts of extreme events. The goal of this paper is to test whether this is the case in the complex, process-based impact models that are routinely being applied in global-scale climate impact assessment^1,2,3. While these models may be too costly to integrate them directly in cross-sectoral economic models, they are the benchmark for any simpler models. More generally, these complex impact models, in conjunction with global climate models, form the basis of much of our current knowledge about future global climate change impacts, as reflected in the Intergovernmental Panel on Climate Change reports, for instance. Whether they capture extreme events well is therefore a key concern even beyond the application in economic assessments.

And yet, it is not known how well the current suite of models can reproduce the multi-sectoral impacts of a given climatic extreme event. Global process-based impact models have been evaluated in terms of average quantities and sometimes in terms of inter-annual or intra-annual variability^11,12,13,14, but their performance under extreme conditions has rarely been tested at large spatial scale¹⁵, and never—to our knowledge—in a multi-sector setting. And since events that are very rare today may become much more frequent in the future¹⁶, testing for variability alone may not be enough.

Here, we choose the 2003 European heat wave and drought (EHWD) event as a test case. The EHWD was substantially stronger than previously observed events; it severely impacted several important sectors across a large geographical area, and its impacts are relatively well documented. We examine the impacts of the EHWD in a large ensemble of state-of-the-art impact models covering agriculture, water resources, terrestrial and marine ecosystems, energy, and human health, for the first time in a common modelling framework. For each of these sectors, we identify key observed impacts of the 2003 EHWD reported in the literature and/or recorded in public databases, and examine how closely the models—driven by observations-based climate data—reproduce those impacts. As a common impact metric, we choose the deviation of 2003 from the historical average, adjusted for long-term trends, and normalised by the historical standard deviation (except for human health; see Methods). We thereby circumvent potential biases in the baseline or the average inter-annual variability, and instead focus on the models’ ability to pick out the anomalous 2003 event from the rest of the time series.

The Results section first provides a climatological analysis of the 2003 EHWD, and then presents impact model results for each sector. In the Discussion section, we summarise and evaluate our findings across sectors, and discuss their implications for integrated assessments of climate change impacts and for future model development.

Results

The 2003 European heat wave and drought

The 2003 EHWD stretched over the entire summer, with large and persistent hot anomalies especially during June and August (Supplementary Fig. 1), and it extended across much of Western and Central Europe (Supplementary Fig. 2). The June–August average temperature anomalies (relative to 1961–1990) were extreme, reaching 2 °C (2 standard deviations (σ) above the mean) averaged over Europe and more than 5 °C regionally (more than 3σ, and in some locations 5σ)^17,18. Due to both decreased precipitation and increased evapotranspiration, the high temperatures were accompanied by an intense drought, and the dry soils in turn amplified the heat wave^19,20.

The EHWD had numerous impacts on the environment, economy and human health¹⁸. In this paper, we focus on large-scale impact indicators which can be compared between observations and global-scale models (Fig. 1). In terms of these indicators, the summer of 2003 is characterised by anomalies of up to 5σ in observed data. In particular, southern Europe saw extreme reductions in ecosystem gross primary productivity (GPP) and large excess human mortality rates. Substantial relative reductions in crop yields, river flow and hydropower production were experienced across the different parts of Europe affected by the EHWD. The ability of impact models to capture these large anomalies is summarised in Fig. 1, and discussed in the following sections.

The EHWD was exceptional compared to the historic record, but given continued global warming, comparable events are anticipated to occur about every 10 years by the middle of the 21st century, depending on future greenhouse gas emissions^21,22,23. Even when changes in the mean climate state are discounted—tantamount to assuming full adaptation to gradual climate change—relative deviations like the 2003 EHWD could still occur compared to the background climate state in the late 21st century²⁴. Furthermore, the impacts of the 2003 event could have been much greater if spring conditions had been drier¹⁹, e.g. similar to those that prevailed in 2011²⁵, which underlines the risk of even more-extreme events in the future. Being able to estimate the damages from such events is therefore crucial for assessments of future climate change impacts.

River flow and water resources

Due to the prolonged rainfall deficit (Supplementary Figs. 1–3), the summer of 2003 was anomalously dry in central Europe. This was visible in surface runoff and river discharge levels^26,27,28, which are indicators of renewable freshwater availability for ecosystems and human uses, such as irrigation or the cooling of thermal power plants²⁹. In August 2003, discharge in both the Rhine and Elbe rivers reached record low levels³⁰. Navigation was impeded on the German section of the Rhine on 37 days in 2003 due to low flow³¹. Satellite-based gravity measurements indicate an exceptional depletion of terrestrial water storage compared to 2002³².

We examine monthly average river discharge at major gauging stations across Europe where continuous data since 1979 is available from the Global Runoff Data Centre, and compare this to an ensemble of global hydrological models (Methods). The observed data indicate negative anomalies of ~1.5–2σ during August 2003 at five of the stations: in the Rhine (Lobith) and the Danube basin (Bratislava, Achleiten, and Inn at Passau-Ingling), as well as in the largest Swiss catchment, Aare (Figs. 1 and 2). These larger anomalies are reproduced closely by most model simulations, with a relatively small spread across the ensemble. The multi-model median is very close to the reported value at the Rhine, Tisza, Danube, and Aare stations. Conversely, many models overestimate the more moderate anomalies observed e.g. in the Elbe and Oder rivers, and the ensemble spread tends to be larger there (Fig. 2). Out of 12 stations, there are 5 stations where 75% or more of the models simulate a negative discharge anomaly larger than 1 standard deviation even though the observed anomaly is smaller. There is no station where the magnitude of the observed anomaly is underestimated by many models.

Results are similar in simulations that ignore the effects of human land use, dams and reservoirs, and water withdrawals (Supplementary Fig. 4), suggesting that present human modifications of the hydrological system do not substantially change the response of the system to this type of extreme event. This is consistent with the moderate effect of human interventions on streamflow and drought conditions in central and western Europe reported in earlier studies^33,34. Indeed, in the hydrological models that include human interventions, irrigation water demand would increase under drought, but this has little effect on river flow when the flow is already low and limits the amount available for withdrawal (Supplementary Fig. 5).

Model performance differs little between June, July and August (Supplementary Fig. 6). While there is some variation in which stations are matched best in each month, in those cases where there is a large anomaly (such as in the Danube in July) the model ensemble tends to reproduce that anomaly closely. Results are also insensitive to the choice of climate forcing data set (Supplementary Fig. 7). Our results are consistent with recent studies attesting a relatively good performance of global hydrological models for discharge and runoff variability in European catchments, compared to other regions^12,35. At the same time, the tendency towards false-alarms is in agreement with a dry bias induced by the models’ potential evaporation schemes³⁶. The fact that one of our models (number 6) appears consistently at the dry end of the ensemble may be because it applies a temperature-based evaporation scheme (Hamon) that has been shown to induce a particularly large bias when applied outside its calibration range³⁶.

Agriculture

Agricultural summer crop yields were poor in 2003 due to the combined effect of the drought and the excessive temperatures recorded since June^18,37. Low harvests incurred an estimated €13 billion of uninsured losses in the European Union (EU)³⁸. At the EU level, maize and wheat were the most damaged crops, with production shortfalls of 21 and 11%, respectively¹⁸. We examine maize and wheat crop yields simulated by an ensemble of 12 global gridded crop models, and compare them with data reported by FAOSTAT (obtained from http://faostat.fao.org/site/567/default.aspx on 30 August 2016) for central and southern European countries with continuous data since 1979. Consistent with previous studies³⁹ and early assessments by the COPA-COGECA agricultural association⁴⁰ (Supplementary Table 2), we find the 2003 EHWD was associated with large negative yield anomalies in the FAOSTAT data set in France, Germany and Italy for both maize and wheat; as well as in Spain for maize, and in Austria and Portugal for wheat (Fig. 3, black circles). Additionally, we find substantial yield reductions of one standard deviation or more for both crops in Hungary, and for wheat in Bulgaria, Greece and Romania. Hungary and Romania in particular are among the EU’s largest maize producers.

The ability of the crop model ensemble to reproduce these impacts is mixed (Fig. 3, boxes and numbers). Out of 5 countries with an observed negative maize yield anomaly larger than 1 standard deviation, there are 3 (4) countries where all models (more than 75% of models) underestimate that anomaly. For wheat, the numbers are 2 (7) out of 9 countries. The best-matching models differ from country to country, in line with a previous evaluation of these models¹¹ which also found mixed skill in reproducing overall inter-annual yield variability. There is broad agreement across models on the sign of the anomaly in most of the strongly impacted countries; except for wheat in Italy, France, Portugal and Greece, where nearly half of the models show a positive anomaly. In terms of magnitude, the large anomalies in France and Italy are underestimated by the entire ensemble for both crops. Generally, the agreement is somewhat better in Middle and Eastern European countries, such as Hungary, Bulgaria or Poland, compared to Western Europe. These general results are robust against changing the climate forcing data set, although the yield loss in Italy is better reproduced with an alternative data set (Supplementary Fig. 8). We also note that COPA-COGECA report a decline in maize yields in Austria by about 10% between 2002 and 2003, which is not reported in the FAOSTAT data (Supplementary Table 2); thus, the real observed value in Fig. 3 may be closer to the model ensemble mean than the FAOSTAT value shown by the black circle.

The poorer performance in Southwestern versus Eastern European countries may be due to more widespread irrigation in the former. The crop models assume full irrigation in irrigated areas and do not account for potential limitations in water availability due to drought, which induces an overestimation of irrigated yields, and thus biases total yields in countries with much irrigation, such as France or Italy (Supplementary Fig. 9). This is in line with a recent study showing that extreme heat leads to strong declines in maize yield in both observations and models only under rainfed conditions¹⁵. Moreover, vernalisation of winter crops (the requirement of cold temperatures for flowering) is a known problem in crop models⁴¹, and the positive sign of the simulated wheat yield anomalies in southern Europe in many models may be due to winter wheat flowering, maturing and being harvested early and thus escaping the heat wave⁴². Finally, as is the case with the other sectors, the crop model simulations were not specifically designed to reproduce 2003 yields, and used standard management and phenology settings; for instance, fixed planting dates and a constant number of heat units required for a crop to mature.

Terrestrial ecosystems

The summer of 2003 saw a major reduction in overall GPP in parts of Europe, based on vegetation model results^39,43 and remote sensing greenness index data^44,45. The largest anomaly was situated in France, but negative anomalies were also observed in northern Spain, Italy, and parts of the Balkan and Germany (Supplementary Fig. 10).

We evaluate an ensemble of global vegetation models over two overlapping rectangular regions (Fig. 4a): one (West) comprising France and northern Spain, where the centre of the negative GPP anomaly was located; and one (Central) including Germany, central Italy, and the western Balkans, which also saw large negative anomalies, but also including the Alps which saw a positive anomaly⁴⁶. As observational benchmark we use MODIS^47,48 GPP estimates derived from remotely sensed absorbed fraction of photosynthetically active radiation (fAPAR; Methods). We find negative anomalies of 4 to 5σ in MODIS-derived GPP during 2003 (Fig. 4b, black circles), consistent with previous analyses⁴⁵.

The magnitude of the observed anomaly is underestimated by 75% of models in the Central region, and by all models in the West region. In both regions the model ensemble spreads over a large range that also includes positive anomalies and near-zero values (Fig. 4b, numbers). There is no systematic difference between models that include and exclude human-induced changes to the natural vegetation pattern (land use change). The estimates are largely insensitive to the choice of climate forcing data set (Supplementary Fig. 11), despite substantial differences in annual shortwave radiation between data sets⁴⁹.

Our finding of a smaller sensitivity of GPP to drought in models than in observations is consistent with a previous multi-model study which, however, assessed a shorter time span⁴⁵. Inspecting individual models’ GPP time series (Supplementary Figs. 12 and 13), we note that those models that most closely reproduce the 2003 relative anomaly also exhibit a pronounced positive anomaly in 2007 and 2008, consistent with the MODIS estimates. The large spread across models for the reduction of GPP in the 2003 EHWD is despite a relatively high spatial correlation with MODIS-derived GPP globally in all these models⁴⁹.

Previous studies^50,51 have suggested that many global vegetation models do not capture the vegetation response to drought, that is, extremely low soil moisture and air humidity, especially for unprecedented or long-lasting cases. While the models adopt mechanistic, or process-based, approaches, most models include empirical parameterisations for temperature and moisture response functions, which may be less reliable outside the historical range of variability for which they have been calibrated. Moreover, not all ecosystem models account for heat stress effects on plant photosynthesis and water stress effects on respiration⁵², which might explain the underestimation of the EHWD’s impact by many models. Other possible reasons include the lack of differential species response to drought⁵³, the lack of deep rooting access to water for forests, irrigation and more generally a poor description of the phenology of cultivated vegetation⁵¹.

Energy

Low water availability and high water temperatures during the 2003 EHWD affected electricity supply due to declines in the output of hydropower plants, and impaired operation of thermoelectric power plants that require freshwater for cooling. Across Europe, more than 30 nuclear power plant units had to reduce their production in summer 2003 because of constraints for cooling water uses^54,55. Together with hydroelectric power restrictions, this led to a doubling in electricity spot market prices compared to the previous summer³⁰. Even wind power potential was at record low levels in several countries (Methods).

We analyse model simulations of hydropower plants’ usable capacity, i.e. the maximum available output at a given point in time (Methods). In the absence of observations of usable capacity, we compare these simulations to reported annual hydropower generation from the global database of the U.S. Energy Information Administration (EIA); while noting that a plant’s actual power generation can be lower than its usable capacity depending on management. We exclude countries with less than 20 plants or less than 100 MW total installed capacity. The observed data show substantial negative anomalies of around 1.5–2.5σ in 2003 for a number of countries, among them Italy which is one of the largest hydropower producers in Europe (Fig. 5). Sweden, Finland, and the UK registered large negative anomalies, too; consistent with the observation that northern Scandinavia and the British Isles experienced dry conditions in summer 2003, even though they were not at the centre of the heat wave. At the same time, Spain, Portugal, and Greece registered above-average power generation; consistent with near-average summer precipitation and slightly above-average precipitation during the rest of the year (Supplementary Figs. 2–3).

The simulated anomalies are very close to the observed ones for Italy and the UK, relatively close for Finland and the Czech Republic, and also closely reproduce the positive deviation in Greece. Large observed negative anomalies in Latvia and Belgium are not reproduced. For the remaining countries, the model mostly overestimates the negative anomalies; underestimates the positive anomalies; and simulates below-average usable capacity for some countries that saw near-average power generation, such as Switzerland and Germany. The latter may be because the underlying hydrological model does not capture enhanced glacial melting during the heat wave, which could offset runoff deficits downstream. In total, out of 9 countries with an observed negative hydropower generation anomaly larger than 1 standard deviation, there are 3 countries where the usable capacity model underestimates the magnitude of that anomaly. On the other hand, out of 25 countries, there are 10 countries where the model simulates a negative discharge anomaly larger than 1 standard deviation even though the observed anomaly is smaller in magnitude.

A previous evaluation of the hydropower model, using the same climate input, found relatively low root mean squared error in southern European countries, but larger values in some north European countries, with particularly poor performance in Latvia⁵⁶. Our analysis broadly confirms this regional pattern, while focusing explicitly on a drought event. We note the complexity in comparing actual hydropower generation with hydropower usable capacity, which is simulated by the model. In particular, annual power generation is strongly affected by power demand, which was above-average in the winter of 2003 due to low temperatures. The hydropower model focuses on the physical impacts of changes in water resources on usable power plant capacity; economic feedbacks of the assessed water constraints and related management or adaption options (e.g. on energy prices and the supply-demand portfolio) are not modelled. More importantly, no monthly observational data are available, so that we can only compare annual averages between model and observations. Simulated summer-only (June–August) anomalies are generally larger than for the annual average, and show a similar relative distribution across countries, which suggests that the anomalies seen in the annual average are largely due to the EHWD (Supplementary Fig. 14).

Human health

The 2003 EHWD caused a high human death toll mainly due to circulatory- and respiratory-related causes, especially among the elderly population⁵⁷. As many as 70,000 excess deaths were estimated for Europe, and 14,800 for France alone^58,59. We apply a set of city-specific statistical models that describe a linear relationship between maximum daily apparent temperature (AT_max) and daily mortality (see Methods). We calculate AT_max from daily mean and maximum temperatures and relative humidity, from the same observations-based climate data set used in the other sectors above, as well as two additional data sets (GSWP3 and PGFv2).

There is no comprehensive and consistent database of city-scale excess mortality to compare our estimates to. Instead, we compile results from earlier studies that specifically estimated the effect of the 2003 EHWD on natural mortality in European cities (Supplementary Table 3). These are based on actual mortality data reported for the 2003 summer e.g. by hospitals and city authorities, and are the closest available analogue to direct observations. In contrast, the models we applied⁶⁰ were trained on mortality and climate data from 1990–2000, and our estimates for 2003 are then derived based on the estimated model, reported baseline mortality rates, and the climate forcing.

We find that our statistical estimates are consistent with the range of literature values for Barcelona, London, and Rome, while they are substantially below the observed value(s) for Paris, Milan and Turin (Fig. 6). For Turin, our estimate is zero for all three climate forcing data sets, while an analysis^61,62 based on observed mortality found more than 60 excess deaths per 100,000 people occurred during the 2003 EHWD. For Valencia and Zurich, our models also underestimate observed values, although they come somewhat close when using the upper 95% confidence level for the slope of the temperature-mortality relation. Altogether, our models substantially underestimate the magnitude of the impact in 5 out of 8 cities for which excess mortality was observed. To our knowledge, this is the first independent evaluation of the performance of these models outside their calibration period.

Potential reasons for the discrepancies between our model estimates and observed mortality include compounding effects that the statistical models do not account for, including: the added effect of heat wave duration, the urban heat island effect, and air pollution (both with particulate matter and ozone); as well as the difference in spatial resolution between the climate data used to train the models (local airport weather stations) and the 0.5°-resolution forcing data sets (see Methods for an extended discussion).

Interestingly, while results are similar across the different climate forcing data sets for most of the cities, we find a strong dependence on the climate forcing for Milan. Unlike impacts on water resources or plant growth, which integrate the weather effect over timespans on the order of weeks to months and over larger areas, excess mortality can result from just a few days of extreme heat in a particular location (city). Thus, small-scale deviations between the different climate forcing datasets or, for that matter, between a gridded reanalysis dataset and the local observations, can lead to large differences in the mortality outcome.

Marine ecosystems

The 2003 EHWD also produced anomalously high sea surface temperatures (SST) in the seas around Europe, most pronounced in the north-western Mediterranean Sea (Supplementary Fig. 15). This surface warming affected subsurface temperatures, vertical mixing, and currents, with potential effects on marine ecosystems and, by extension, on fisheries⁶³. Indeed, several impacts, positive as well as negative, on local and regional marine ecosystems have been reported in the literature, including a die-off of Posidonia oceanica seagrass habitats⁶⁴ and rocky benthic communities⁶⁵; and effects on the reproduction of some fish species like Bluefin tuna⁶⁶, anchovy and round sardinella⁶⁷. However, no direct impacts on commercial fisheries are reported in the literature due to the EHWD, despite evidence for longer-term effects of warming on fisheries landings⁶⁸.

We simulated total consumer biomass at a 0.5° horizontal resolution with an ensemble of four marine ecosystem models⁶⁹. As there is no widely accepted long-term database of fish biomass appropriate to the scales relevant to the EHWD, to gauge the impact of the event on marine ecosystems and fisheries, we compare relative change in reconstructed catch⁷⁰ to the relative change in simulated consumer biomass, taking the assumption that harvest is a reasonable indicator of biomass. We find no discernible impact of the 2003 EHWD on fish biomass in any of the models, nor in the reconstructed catch data (Supplementary Fig. 16). In a few cases, models show a drop in 2004, which may be a lagged effect of the EHWD, but this is hard to verify in the absence of a longer time series, and not supported by the reconstructed catch data.

Nonetheless, the ocean reanalysis dataset used to force the marine ecosystem models does capture the 2003 anomalous warming in the Mediterranean and North Sea with a similar magnitude as in the observed data (Supplementary Fig. 17). It also shows a local maximum in plankton net primary productivity (NPP) in 2003 in the Mediterranean and the North Sea, and captures an observed phytoplankton abundance peak in the English Channel⁷¹ (Supplementary Fig. 18). At the same time, simulated NPP exhibits large year-to-year variability, and the 2003 value is generally less exceptional in a long-term context than for temperature (Supplementary Fig. 19).

Thus, literature, data and models appear to agree that the 2003 EHWD affected physical ocean parameters as well as plants and planktonic NPP, but that the signal did not propagate to higher trophic levels on a large scale immediately. This may be because the event was too brief or too weak to cause a significant change in upper trophic levels. Another reason may be that temporary warming can act on fish biomass in two opposite ways: on the one hand, higher temperatures tend to increase the amount of food that fish need to maintain growth rates, as well as their mortality due to predation, disease or senescence. On the other hand, short-term warming increases phytoplankton growth, which has a positive effect on fish biomass. The net effect may thus be small. Further evaluation of marine ecosystem model simulations resolving individual species, and/or forced with higher-resolution climate data e.g. from regional climate models, might illuminate whether the models agree with the data for the right reasons; and longer simulation runs would be needed to account for any lagged effects beyond 2004. Nonetheless, our results suggest that this modelling framework is applicable to extreme events like the 2003 EHWD in a similar fashion as in the terrestrial impact sectors.

Discussion

Our study is unique in assessing the ability of several state-of-the-art impact models, for multiple sectors, to replicate the impacts from a recent, high-impact climatic extreme event, the 2003 EHWD. When impacts occur simultaneously in different sectors, it is important that models replicate not only the average variability of sectoral impact indicators, but the actual timing and magnitude of particular events. The rationale of our study was therefore to approach the assessment of model ability differently—not through a classical evaluation of impact statistics over several decades^11,14,35—but instead to answer a what-if question: what if these models are used to predict the impacts of a single extreme event like the EHWD?.

The answer, according to our results, is that the total impacts of such an event would likely be seriously underestimated. This is despite some encouraging results in individual sectors. For instance, the observed low water levels in the Rhine, Danube, and Aare rivers are matched very well by the median across our hydrological models, and the spread across models is moderate. Furthermore, in most countries where a substantial reduction in hydropower generation is observed, our hydropower model also simulates a substantial reduction in usable capacity. With respect to human health, our statistical models yield heat-related excess mortality rates of a similar magnitude seen in some observed estimates for Barcelona, London and Rome. We also find that generally the choice of climate forcing data set is not a major source of uncertainty.

Importantly though, severe impacts on agriculture and ecosystem productivity are underestimated by a large margin by most of the sector models. None or few of the crop models (depending on the climate forcing) reproduce the extremeness of the negative maize yield anomalies, of three and four standard deviations, respectively, in France and Italy, which are both among the largest maize producers in Europe. The crop model ensemble also systematically under-predicts the severity of the EHWD’s impacts for wheat in Western Europe, implying that increasing temperatures and more frequent heatwaves due to climate change could have substantially more severe impacts in the region than typically assumed. Similar results are found for natural ecosystem primary productivity: only a few of the global vegetation models come close to the enormous negative anomaly seen in remote sensing-derived GPP in Western Europe, and some models even display a positive anomaly. For both the crop and GPP models, neither the multi-model mean nor any individual model reliably reproduce observations. The mortality models estimate only half of the observed rate of heat-related excess mortality in Paris, which saw by far the most fatalities of all cities.

Taking all sectors together, and across all countries, cities, regions and rivers studied, severe negative impacts (deviation of more than 1σ, or more than 10 excess deaths per 100,000 people) were observed in 38 cases (Fig. 1). In 20 of them, the severity of impacts is underestimated by the respective models, counting in those countries where 75% or more, but not all, of the crop models underestimate yield impacts. Leaving out those countries, we are left with 14 cases where the entirety of models underestimates the observed impacts. This summary statistic is necessarily crude, but highlights the magnitude of the problem.

While many of the largest impacts of the EHWD are thus underestimated by the sectoral model ensembles, we also find a number of “false alarms”, mainly for water resources and hydropower. Hydrological models show flow reductions of up to 2σ in the Weser, Elbe and Oder rivers, where no substantial anomalies were observed. The hydropower model shows large reductions in usable capacity in Switzerland, Poland and Germany, while reported hydropower generation was near average in all these countries. On a regional level, the underestimation of impacts across sectors is particularly evident in southern Europe; while in northern and eastern Europe, impacts on crop yields and river flow are matched relatively well, and some hydropower impacts are even greatly overestimated (Fig. 1).

Whilst our study is the most comprehensive systematic comparison to date between simulated and observed impacts for an extreme climatic event, it is not exhaustive in the treatment of different sectors, e.g. we did not include livestock production, vector-borne diseases or forest fires¹⁸. Assembling an even larger multi-disciplinary team could have partly addressed this. For some sectors suitable observational data for model comparison were unavailable (nuclear power generation). In all sectors, we have focused on large-scale impact indicators—e.g. average monthly discharge or country-level crop yield—which are close to what the models are designed to represent, and which we assume to reflect the total societal costs and key environmental impacts of the 2003 EHWD to a large extent. Smaller-scale characteristics of some of the impacts (e.g. minimum daily river flow or maximum daily excess mortality) could have important implications not captured by the more aggregate indicators, for example the overburdening of critical infrastructure for a short time. Representing such processes accurately will be a substantial challenge for the relatively large-scale models applied here.

We have investigated a single climate event, but our results are relevant beyond the case of the 2003 EHWD. In 2015 and 2018, central Europe was again hit by extremely hot and dry summer conditions, and projected trends in temperature and precipitation suggest that such events will likely occur more frequently in the future⁷². While each event is unique, similar events produce similar biophysical impacts: future drought-heat wave events will again lead to low river flows, reduce ecosystem productivity, and damage crop yields. As long as the model shortcomings exposed by our analysis persist, they will likely lead to biased estimates of extreme event impacts.

These shortcomings are related to the representation of both natural processes and human management in the models. For instance, human adaptation (or the lack thereof) can affect the severity of impacts for a given extreme event⁷³. Indeed, excess human mortality during the 2006 heatwave in Europe was much lower than 2003, possibly because of better preparedness⁵⁷. This might go some way towards explaining the underestimation of observed impacts by the heat-related mortality models: the 2003 EHWD was outside the climatological range that the models were trained on, and the range that the health system was prepared for. In process-based models, the capacity to respond to extreme climate conditions can be overestimated if resource constraints are not accounted for. An example is the lack of constraints on irrigation water in our crop model simulations. For instance, streamflow in the Vienne basin, which contains large shares of irrigated maize cropland⁷⁴, was anomalously low in summer 2003 (Fig. 2); and indeed, the maize yield anomaly in France is simulated more realistically when irrigation is switched off, than in the default model setup with irrigation (Supplementary Fig. 9). Accounting for such cross-sectoral interactions⁷⁵ and reflecting changing levels of societal adaptation in models will be critical for more accurate impact and damage estimates in the future.

Important natural processes such as heat stress effects on ecosystem photosynthesis are also missing in some models, while other processes such as potential evapotranspiration are parameterised through empirical relationships that may not hold outside past ranges of climatic variability³⁶. Such shortcomings will affect impact assessments of other climatic events, too. In addition, there are known issues that have not emerged in our study, but may prove problematic when assessing the impacts of extreme events of different types, or outside Europe. For instance, the global ecosystem models do not consider the potentially amplifying interactions between climate change and forest disturbances, such as wind storms⁷⁶; and some global hydrological models struggle to capture the magnitude and timing of processes such as transmission loss and snowmelt accurately³⁵. Finally, our models may perform worse in parts of the developing world where constraints from observational data are more limited.

Our results emphasise the importance of considering the uncertainties of current biophysical impact estimates. Modelling studies of future climate change impacts on mean conditions in different sectors^1,2,77,78 have shown that the uncertainties of biophysical impact estimates are of a similar magnitude as climate modelling uncertainty. A common practice has since been to use multi-model ensembles rather than single impact models, as is standard in climate modelling. We have shown that for an extreme event, even an entire ensemble of impact models can sometimes completely miss observations. This urgently calls for a shift in the agenda of model development and evaluation away from mean conditions towards extremes. In the meantime, a precautionary interpretation of impact estimates is recommended, where the most pessimistic model in an ensemble must be considered just as plausible as those in the centre of the distribution.

For quantifying the ultimate impacts of climate change on different aspects of the economy, or parts of society, different metrics need to be applied beyond the (mainly) biophysical indicators studied here; e.g., macro-economic damages, public health burdens, or effects on societal equality and political stability. However, any such analysis needs to rest on a firm assessment of the biophysical impact, and will be unreliable if the biophysical impact is not captured accurately. Our study thus provides important context to any subsequent analysis of economic and societal impacts.

Methods

Climate forcing data

The primary climate forcing dataset used in this study is the latest WATCH meteorological forcing data set (WFDEI)⁷⁹, which is based on ERA-Interim reanalysis data. Two other data sets are used for comparison, for some sectors: the Global Soil Wetness Project phase 3 forcing data set (GSWP3; http://hydro.iis.u-tokyo.ac.jp/GSWP3/)⁸⁰, based on the 20th century reanalysis (20CR); and the Princeton Global Forcing data set version 2 (PGFv2)⁸¹, based on NCEP/NCAR Reanalysis. The magnitude and spatio-temporal pattern of the 2003 summer temperature and precipitation anomaly is similar in all three data sets (Supplementary Figs. 1 and 2), and in agreement with other reanalysis products^17,18.

Model ensemble

Our multi-sectoral ensemble of impact models (Supplementary Table 1) includes 7 global hydrological models, 12 global gridded crop models, and 8 global vegetation models. In addition, we analysed the impacts on hydropower production as well as heat-related mortality, using single models; and potential impacts on fisheries using four marine ecosystem models. All terrestrial model simulations were driven by WFDEI climate forcing for the period 1979–2010, and followed a common protocol (ISIMIP2a, www.isimip.org), allowing us to compare models within a sector (such as agriculture) and to apply a common method of analysis across sectors. For some sectors, additional simulations were conducted, driven with PGFv2 or GSWP3 climate forcing. In those cases, ensembles are slightly different for the different climate forcings because not all models were run with all climate data sets. In Figs. 2–4, as well as the corresponding supplementary figures, models are uniquely identified by numerals (see Supplementary Table 1).

In contrast to the terrestrial models, the marine ecosystem models need not only atmospheric but also oceanic forcing variables, and are therefore driven by a reanalysis-based climate model run. While thus not completely consistent with the other sectors, these simulations demonstrate how our analysis framework can be expanded to the marine realm.

Importantly, the model simulations were not specifically designed to capture a particular event. Models were set up for best overall performance, and parameters were kept constant throughout the simulation period, with climate as the only source of annual-scale variations (except that the hydrological simulations account for the construction of dams and reservoirs over time). This is similar to simulations under future climate change scenarios, where e.g. agricultural management choices in any particular year are not known. Therefore, if models are found to accurately reproduce the impacts of the 2003 EHWD using the present set-up, this increases confidence in their ability to quantify the impacts of future extreme events represented in climate projections.

Impact metric

In all sectors except human health, we measure impacts as a given variable’s deviation in 2003 from the historical average, adjusted for long-term trends, and normalised by the historical standard deviation (see below for more details). That is, we measure the extremeness of the event in a given variable, not the absolute magnitude of change in that variable. If a model correctly simulated the absolute magnitude of change in a variable in 2003, but simulated the same magnitude of change every other year, then the agreement in 2003 may not be for the right reasons, and there would be little confidence in the model’s response to extreme events in future climate scenarios. In the paper we use the term “impact” to refer to the relative deviation, or extremeness, of an indicator, unless otherwise stated. An exception is heat-related excess mortality, which is already defined as a deviation from the expected baseline mortality.

Water resources

Monthly discharge data was obtained from the Global Runoff Data Centre (GRDC, http://www.bafg.de/GRDC/EN/02_srvcs/21_tmsrs/riverdischarge_node.html) for gauging stations along major rivers in the geographic region that was affected by the 2003 EHWD. Only stations with catchments of ~10,000 km² (equivalent to ~4 model grid cells) or larger were selected. Stations also had to have continuous monthly data available from GRDC for the whole period 1979–2008. Simulated daily discharge data was first averaged to monthly values. Then, for both simulated and reported discharge, a single month of interest was extracted (e.g., August), resulting in a yearly series of values for that month over 1979–2008. From this series, a linear trend was removed, and the standard deviation calculated. The 2003 relative discharge anomaly was then calculated as the difference between 2003 and the 1979–2008 average, divided by the standard deviation (with 2003 excluded from the calculation of the standard deviation). We note that none of the models consider changes in glacial meltwater runoff, which was an important factor during the 2003 EHWD in Alpine catchments²⁶.