Main

The European Union’s ambition to reach climate neutrality in 2050 as part of the European Green Deal1 requires a thorough transformation of the full energy–economy system. Insights required for this transition are obtained from various lines of research, such as analysis of the technical mitigation potential, the effectiveness of policy instruments and opportunities for system changes given the interests of stakeholders and institutional barriers2. An important part of quantitative information on mitigation pathways is obtained from model-based scenario analyses, such as those recently published in the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report3,4.

Still, a large spread is associated with the scenario output, which originates from many types of uncertainty5,6. Structural uncertainty stems from differences in numerous assumptions regarding, among others, technological innovation and uptake, market behaviour, preferences and changes in specific activities. Parametric uncertainties involve differences in parameter calibrations or result from differences in sectoral granularity and regional and temporal scale. Fundamental modelling choices may also vary, such as those concerning mathematical formulation (for example, optimization versus simulation frameworks), model structure and foresight. Substantial differences can be recognized across the model outcomes5, potentially even yielding contradictory observations6. To have a more accurate understanding of energy and climate policy scenario outcomes, it is important to have insights into both (1) where models differ substantially and where they agree and (2) how a model’s output relates to the overall ensemble, yielding insights on outliers and discrepancies. This is specifically important because studies commonly use individual models rather than large ensembles, both in scientific literature7 and in policy reports (for example, in national policy studies). Only when single-model results are contextualized by the model’s position in the larger ensemble can the reader interpret the output completely and correctly. Additionally, such quantification of the model’s position in the larger model ensemble allows for tracking model development8.

Both questions require a stylized set of results across the model range in which only a well-defined number of assumptions is varied. To this end, multi-model comparison exercises have been effective. These are carried out within confined projects3 and, for instance, in the long tradition9 of studies by the Energy Modelling Forum10,11, of which many scenario runs are collectively used in the Assessment Reports by the IPCC10,11. Still, the observed large model differences (especially in estimates of costs12, the diffusion of individual renewable energy technologies13 and demand sector development14,15) motivate more research in this area and emphasize the importance of interpreting single-model results in light of larger model ensembles.

Many multi-model comparison studies test the (un)certainty of an outcome by looking at the range across models in scenarios designed for other purposes, for example, to describe the effect of current mitigation measures6. While insightful for that particular question, quantifying and evaluating overall model behaviour requires analysis beyond typical scenarios and typical variables—hence the importance of analysis of diagnostic scenarios, expressed in diagnostic variables or indicators. Such practice is well established in the climate and atmospheric sciences16,17,18. Examples of indicators are the equilibrium climate sensitivity and the transient climate response, which are associated with diagnostic scenarios in which the CO2 concentration is doubled or quadrupled. In the emissions mitigation literature, examples of diagnostic multi-model studies are Kriegler et al. (2015) and Harmsen et al. (2021), which propose a limited set of diagnostic indicators that reflect crucial model behaviour aspects8,19. Using two diagnostic scenarios—with a constant and exponentially increasing carbon tax, respectively—Kriegler et al. condense the output of models into four core diagnostic indicators: (1) the relative abatement index, expressing the overall mitigation effort; (2) the carbon intensity over energy intensity, expressing the mitigation strategy; (3) the transformation index, expressing the required transformation; and (4) the cost per abatement value, expressing the policy costs per unit of marginal abatement. Harmsen et al. continued on this path by updating the calculations by Kriegler et al. for a larger number of models and model versions, adding an indicator on primary fossil fuel energy reduction and an analysis of model inertia. The latter yields a total of six key diagnostic indicators for energy system models and integrated assessment models (IAMs).

Some model aspects are more indicative of model behaviour than others: for example, model sensitivity to carbon taxes already reveals much about drivers of the model’s output, much like equilibrium climate sensitivity is a core classifier in climate models. Still, analysing individual model dimensions (or indicators) yields only a limited view because mitigation, policy, energy supply and energy demand are intertwined across many dimensions. For example, differences in solar power deployment under similar emissions levels can be understood only when information about wind power (as a potential competitor), carbon dioxide capture and storage (CCS) or energy demand reductions is provided. This motivates comparing multiple model dimensions at the same time. While Harmsen et al.’s six indicators are useful to classify models for each indicator individually (that is, one-dimensional), the aim of this paper is to characterize the overall model typology (that is, high-dimensional): moving beyond a mere long list of ‘individual’ diagnostic indicators, towards developing a comprehensive overview or ‘story’ of the model’s behaviour. This requires the analysis to go back and forth between many different aspects of the same model. In addition, a comprehensive overview of model behaviour would include an extension of the existing list of diagnostic indicators to a higher level of detail (Supplementary Information A.6).

Framework and fingerprints

Here we quantify typical model behaviour in a framework that we colloquially refer to as the model’s ‘fingerprint’. It assesses five key model behaviour dimensions: responsiveness, mitigation strategies, energy supply, energy demand and costs and effort. Each dimension is subsequently expressed in several diagnostic indicators, as shown in Fig. 1. The set of diagnostic indicators extends previous work8,19 with new indicators to obtain a more complete and multi-dimensional perspective on model behaviour (Supplementary Information A.6 for a comparison with earlier work). For example, the ‘responsiveness’ dimension describes more than only tax response (R1), adding the speed of response (R2–3) and scenario sensitivities (R4 and R5). Other notable extensions on previous work are including carbon capture (M3), the role of non-CO2 (M4), more details on the energy supply mix (Es1–7), electrification of end-use sectors (Ed1–3), hydrogen use (Ed4) and demand shifts (C3). In addition, all indicators are computed from a more extended set of diagnostic scenarios (Methods), containing scenario variation beyond only varying carbon price trajectories.

Fig. 1: Framework of model fingerprints.

Explanation of the framework that is used to compute model fingerprints. The five diagnostic indicator dimensions are shown in colours: responsiveness (green), mitigation strategies (yellow), energy supply (blue), energy demand (red) and costs and effort (purple). The framework uses the statistics of the ensemble to visualize the results of a single model, resulting in a circular diagram per model. The inner circle represents the median per indicator, calculated from the ensemble containing all model-scenario combinations (that is, including all other models as well). The outer circle and the centre indicate the median plus and minus two standard deviations, respectively. Example ranges (yellow shaded, not based on data) for indicators M1–M4 are shown: they show the range that the respective model covers across its scenarios. These ranges (Figs. 2 and 3) exclude the scenario that includes only current implemented policies (referred to as the DIAG-NPI scenario, Table 3), which is used only as a reference for R1, R2 and C1. Note that differences in this reference may cause differences in the indicators (discussed in Supplementary Information A.3), which is why we use the comparison with historical values as much as possible (whenever a reference is needed). The shape or indicator arc widths do not have any quantitative meaning themselves. For M3, direct air capture (DAC) and bioenergy with carbon capture and storage (BECCS) are included, and for the energy intensity in M2, we use gross domestic product (GDP) at purchasing power parity (PPP).

The framework’s principle is to compare a model’s scenario range (Fig. 1, yellow shaded) to the ensemble statistics in terms of its medians (η) and standard deviations (σ), yielding the range of typical (relative) model behaviour: the area towards the outer ring indicates above-median (up to η + 2σ) output and the area towards the circle centre indicates below-median (down to η − 2σ) output. Whereas scenario ensembles of energy models are rarely normally distributed, defining the range this way does ensure more intuitive equal ranges above and below the median. The diagnostic exercise performed to fill these diagrams includes eight different energy system models, involving ten diagnostic scenarios tailored to explore different model aspects. The scenarios (Methods and Table 3) have a high carbon price, aimed at approximating the Paris Agreement goals and the European Union’s climate neutrality goal.
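The mapping of one model's scenario values onto a fingerprint axis can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `fingerprint_range`, the dictionary layout, the use of the population standard deviation and all numbers are hypothetical. Only the axis definition, running from η − 2σ at the centre to η + 2σ at the outer ring, is taken from the text.

```python
from statistics import median, pstdev

def fingerprint_range(ensemble, model):
    """Return (low, high) of one model's scenario range on a fingerprint axis.

    ensemble maps a model name to its indicator values (one per scenario).
    Position 0.0 is the circle centre (eta - 2 sigma), 0.5 the ensemble
    median and 1.0 the outer ring (eta + 2 sigma). Values beyond the axis
    are clipped to its ends.
    """
    pooled = [v for values in ensemble.values() for v in values]
    eta = median(pooled)
    sigma = pstdev(pooled)  # population standard deviation (an assumption)
    positions = []
    for value in ensemble[model]:
        z = (value - eta) / sigma          # deviation in units of sigma
        positions.append(min(1.0, max(0.0, (z + 2.0) / 4.0)))
    return min(positions), max(positions)

# Hypothetical indicator values for three models and three scenarios each.
ensemble = {
    "model_a": [0.2, 0.3, 0.4],
    "model_b": [0.5, 0.6, 0.7],
    "model_c": [0.8, 0.9, 1.0],
}
low_b, high_b = fingerprint_range(ensemble, "model_b")
low_c, high_c = fingerprint_range(ensemble, "model_c")
```

In this toy ensemble, `model_b` straddles the median ring while `model_c` lies entirely in the above-median half of the axis.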

In Figs. 2 and 3, one can see the outcomes of the framework for the eight involved energy system model versions. For illustration purposes, we discuss the fingerprints of three model versions in detail before moving on to the more overarching patterns and typologies that the fingerprints reveal. Detailed model-by-model observations, as well as model acronym meanings, are provided in Supplementary Information C. In Fig. 2a, we see the fingerprint of the IMAGE model20, a process-based IAM. The model shows medium levels of carbon emissions abatement (R1) and carbon intensity reduction (M1). Still, it has a relatively low mitigation timescale (R2), high maximum mitigation speed (R3) and sensitivities (R4–5) and relies relatively more on energy intensity reductions and carbon capture (M2–3). Consequently, primary fossil use is higher than the ensemble median, notably coal (Es1). Relatively low values of solar and nuclear energy use (Es4, Es7) contrast with the high use of biomass (Es6, in most scenarios) in the primary energy mix, and electrification is high for transport (Ed1), medium for industry (Ed2) and low for buildings (Ed3). Transformation on the energy supply and demand sides depends on the scenario, while the costs remain approximately average (C1–3).

Fig. 2: Model fingerprints.

a–d, Model typologies or ‘fingerprints’ of the IMAGE (a), REMIND (b), MESSAGEix-GLOBIOM (c) and WITCH (d) models for Europe in 2050. The axis for each indicator ranges between the ensemble median (η) ± two standard deviations (σ). These medians and standard deviations are computed from the full ensemble, that is, the eight models and nine linearly increasing-price scenarios (excluding the current-policies scenarios), and the coloured shaded scenario ranges are for each individual model indicated by the panel titles. Data close to the centre reflect below-median results. Data towards the outer ring reflect above-median results. More details on the indicators are in Fig. 1.

Fig. 3: Model fingerprints.

a–d, Model typologies or ‘fingerprints’ of the TIAM-ECN (a), PRIMES (b), PROMETHEUS (c) and sector-coupled Euro-Calliope (d) models, for Europe in 2050, similar to Fig. 2. The axis for each indicator ranges between the ensemble median (η) ± two standard deviations (σ). These medians and standard deviations are computed from the full ensemble, that is, the eight models and nine linearly increasing-price scenarios (excluding the current-policies scenarios), and the coloured shaded scenario ranges are for each individual model indicated by the panel titles. Data close to the centre reflect below-median results. Data towards the outer ring reflect above-median results. More details on the indicators are in Fig. 1.

Figure 2b shows the fingerprint of the REMIND model21, which is a Ramsey-type general equilibrium growth IAM and thus quite different from IMAGE. Indeed, the fingerprint also looks very different, indicating substantial differences between the output of REMIND and that of IMAGE. Striking are the higher levels of abatement (R1), carbon intensity reductions (M1) and fast reaction to carbon pricing (low R2, high R3). Consistent with these observations, primary fossil use (Es1–Es3) is limited, whereas variable renewable energy and electrification in transport and buildings (Ed1, Ed3) are higher than ensemble medians. For industry, electrification is quite sensitive to scenario assumptions. The associated transformation index (C2) is, as expected, high, while the costs (C1) are lower than the ensemble median.

A third and final example we will discuss here is shown in Fig. 3b: the PRIMES model22, which is an energy system model that provides projections of energy demand, supply, prices and investments. This model does not react strongly to the various scenario assumptions, as reflected in the low scenario sensitivity of the primary energy mix (R4) and demand (R5) and in the narrow scenario ranges shown in the fingerprint diagram. PRIMES projects near-median levels for most indicators, with a few exceptions: a relatively low mitigation speed potential (R3), high energy intensity reduction (M2), high relative non-CO2 reductions (M4), low solar energy use (Es4) and notably a very high hydrogen use (Ed4).

Model typology

A summary of the general tendencies of the models is shown in Table 1. Our approach is similar to (but extends) that of previous literature8, in which a general classification of each model was given. The assessments in the table are directly based on the fingerprints in Figs. 2 and 3 and quantify the extent to which the models deviate from the ensemble.

Table 1 Overview of model typologies

The spread in Table 1 indicates the significance of the differences in reported model deviations. Compare, for example, the response to carbon taxing in REMIND (+1.8σ) with that in TIAM-ECN (−1.6σ), the energy intensities in WITCH (+2.1σ) and TIAM-ECN (−1.7σ), or the relative mitigation of non-CO2 emissions in IMAGE and REMIND (−1.1σ and −1.2σ) with that in TIAM-ECN (+1.8σ). Whereas such values are expected for individual model runs (as the values are expressed in standard deviations), this table reports the medians per model, revealing that the model outputs are indeed substantially different in certain respects. In other words, inter-model differences are high compared to intra-model (that is, scenario) differences. Note that Table 1 contains combinations of diagnostic indicators. For example, averaging electrification across the end-use sectors conceals the high model differences in individual sector electrifications such as transport in REMIND (+2.2σ) and industry in WITCH (+1.7σ). For more detail, we refer the reader to Figs. 2 and 3 and Supplementary Information B.1.
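The per-model medians in units of σ, and the comparison of inter-model with intra-model spread, can be illustrated with a short sketch. The function names, the choice of the population standard deviation and the toy numbers are assumptions made for illustration; they are not taken from the study's code or data.

```python
from statistics import mean, median, pstdev

def model_median_deviation(ensemble):
    """Median of each model's scenarios, in units of the ensemble sigma."""
    pooled = [v for values in ensemble.values() for v in values]
    eta, sigma = median(pooled), pstdev(pooled)
    return {name: (median(values) - eta) / sigma
            for name, values in ensemble.items()}

def spread_comparison(ensemble):
    """Return (inter-model spread, mean intra-model spread) for one indicator."""
    inter = pstdev([median(values) for values in ensemble.values()])
    intra = mean([pstdev(values) for values in ensemble.values()])
    return inter, intra

# Hypothetical indicator values: models differ strongly, scenarios only slightly.
ensemble = {
    "model_a": [0.10, 0.12, 0.14],
    "model_b": [0.50, 0.52, 0.54],
    "model_c": [0.90, 0.92, 0.94],
}
deviations = model_median_deviation(ensemble)
inter, intra = spread_comparison(ensemble)
```

With these toy values the inter-model spread is far larger than the average intra-model spread, which is the pattern the paragraph above describes for the real ensemble.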

To provide structure to the discussion, we sorted the models by their tax response (second column, R1): that is, carbon emissions in 2050 under a carbon tax scenario compared with those in the current-policies scenario of the same model. REMIND is by far the most responsive (green); it shows almost two standard deviations higher relative abatement and, in addition, has a high scenario response. This high responsiveness coincides with high carbon intensity reduction, renewables and electrification. It emphasizes carbon intensity reduction, focusing on CO2 emissions, while projecting average numbers for energy intensity reduction and carbon capture. Euro-Calliope partially shows similar results, also having a (moderately) high tax response, carbon intensity reduction and renewables rollout but to a lesser extent than REMIND. From a system point of view, these models show a consistent fingerprint; the high tax response (green) is achieved by replacing fossils (yellow) by renewables (blue) and high electrification (orange). IMAGE can also be regarded as a model with moderately high responsiveness: it projects relatively high carbon intensity reductions and, like REMIND, indicates relatively low non-CO2 reductions compared with CO2. However, as observed in Fig. 2, it differs from REMIND and Euro-Calliope by relying more on carbon capture (+1.4σ) and fossils (+0.7σ).

This consistent fingerprint can also be observed at the other end of the spectrum. This is clearly expressed in TIAM-ECN output, which has a low tax response, low carbon intensity reduction, high fossil use and low electrification. MESSAGEix also shows this mirror image to some extent, but it is closer to the rest of the ensemble than TIAM-ECN in this respect. While this model initially rapidly mitigates emissions (reflected in a low R2, approximating REMIND), the abatement becomes relatively slow over time, yielding an overall medium to low relative abatement (R1) of −0.5σ. Note that this is partially a consequence of differences in which years each model year represents: for MESSAGEix, 2050 represents the five preceding years. These abatement projections are accompanied by relatively low (sectoral average) electrification of −0.4σ, higher fossil use of +0.5σ and low carbon and energy intensity reductions (−0.8σ and −0.7σ, respectively; Supplementary Information B.1).

Besides MESSAGEix, the three other models with medium tax response (PRIMES, PROMETHEUS and WITCH) are the models with the highest energy intensity reductions while having low (WITCH) or medium (PRIMES, PROMETHEUS) carbon intensity reductions. This is most clearly visible in WITCH output, which projects energy intensity reductions more than two standard deviations above the ensemble median. The three models have another commonality: they are all moderately sensitive to scenario assumptions. However, in many other ways, their projections notably differ. WITCH relies more on carbon capture, in contrast to PROMETHEUS, which, in turn, has higher nuclear and bioenergy in its energy supply mix than any other model in this experiment. The latter is related to the high level of biofuel use in PROMETHEUS.

A commonality between WITCH, IMAGE and MESSAGEix can also be found; they all have rather high fossil use and carbon capture (the exception to this pattern seems to be TIAM-ECN, which does not project as much carbon capture). Interestingly, while high carbon capture intuitively would generate high abatement and carbon intensity reductions, this is not visible here, with, for example, MESSAGEix having +1.1σ carbon capture but −0.8σ carbon intensity reduction (Supplementary Information B.1).

In previous papers, scholars have tried to link model behaviour to the underlying model type in terms of solution approach (for example, partial or general equilibrium, recursive dynamic or inter-temporal). For example, Daioglou et al. (2020)23 tried to link bioenergy use to model type, and Harmsen et al. (2021)8 compared the output of six diagnostic indicators to the model types. However, in both studies, ‘no direct relationship between model type and model behaviour’ was found8. Even though this study contains more unique scenarios than previous studies, we again find no clear link between model behaviour and model type (although the set considered here includes fewer models). In Table 1, we see that inter-model differences are not only higher than inter-scenario differences in many of the indicators but also higher than inter-model type differences. Technology-rich energy system models can be among the most tax-sensitive models for abatement (Euro-Calliope) or among the less tax-sensitive ones (MESSAGEix); that is, being technology rich does not imply a certain level of carbon tax sensitivity. Similar patterns can be observed for general versus partial equilibrium models and inter-temporal versus myopic models. It is important to emphasize that while having more scenarios than in previous studies allows for a more representative fingerprint of each model, a clear limitation for trying to attribute model behaviour to its type is the limited number of models (eight). Adding more models could help in observing such relations, but models are rarely fully independent from each other, which is a limitation in these model comparison exercises in general.

Discussion and conclusion

In this paper, we propose a unique framework to characterize energy model typology by quantifying the models’ ‘fingerprints’ in scenario ensembles. We identified five key model dimensions along which models differ: model responsiveness, mitigation strategies, energy supply, energy demand and costs and effort, split further into a total of 24 diagnostic indicators. A number of these indicators are re-used from existing literature8,19 (Supplementary Information A.6), but we add more dimensions such as the energy supply mix, carbon capture, non-CO2 and demand responses, which allows for a more complete interpretation of model behaviour. The indicators are calculated and visualized relative to the ensemble statistics of a set of tailored diagnostic scenarios, yielding a comprehensive insight into each model’s fingerprint in relation to the other models.

The framework contextualizes results from individual models (or rather model versions, whose behaviour may also evolve24,25) by identifying typical model behaviour, which yields better interpretation and understanding of them. The latter, of course, is of vital importance to both researchers and policymakers8,19. Additionally, the substantial model differences, being commonly larger than inter-scenario differences, motivate caution on using individual model results without ensemble context in general. Specifically, we find highly tax-responsive models to also have the highest renewables, electrification and carbon intensity reduction. IMAGE is also relatively highly responsive but distinguishes itself with high fossil and carbon capture use. TIAM-ECN, having the lowest tax response, indeed also shows a low carbon intensity reduction. MESSAGEix shows a similar pattern (but closer to the rest of the ensemble), and both of these models additionally show a low energy intensity reduction. Most models with a medium tax response (WITCH, PROMETHEUS and PRIMES) show higher energy intensity reductions and moderate scenario responsiveness but vary in many other ways.

The framework also has limitations. First of all, the scenario projections are not statistical predictions; hence, the model-scenario ensemble cannot be used as such, and the ensemble median should not be interpreted as ‘most appropriate’. Analogously, model behaviour that is distinct from the rest of the ensemble (outliers) should not be treated as ‘least probable’, and neither should the sorting of models by carbon tax response (in Table 1) be read as a form of ‘ranking’. Whereas typical other multi-model comparison studies focus much more on robust messages across models, our aim is to identify and highlight their differences. The model differences presented here do not even necessarily illustrate a lack of consensus on future outcomes. Potentially, they also reveal that deep mitigation can be achieved in multiple ways (that is, a form of ‘policy freedom’), albeit each conditioned in various respects, motivating the high dimensionality of this analysis. Hence, for the purpose of exploring the future, model differences can be useful. This study aims only to map these differences; it does not identify their sources, which would also require the analysis of model inputs and parameters26,27, for example associated with efficiency differences in final energy (R5, Ed1–3 and C3). A second important remark is that the fingerprint framework and Table 1 contain quantities that are relative to the ensemble. Therefore, the fingerprints dilute information on absolute differences (motivating the importance of Supplementary Table 4), which should always be taken into account when using this framework. The relative nature of the framework also means that there is no single ‘absolute’ fingerprint of any model; it is always subject to the ensemble the model is being compared to. Structural bias in the community (that is, across all included models) will therefore not be easily detected. Adding models to the presented diagnostic exercise or choosing a different regional scope will modify the model fingerprints (for example, Supplementary Information B.4 for global results). Still, by choosing the diagnostic scenarios with a wide range of assumptions and a large variety of models, we do approximate a comprehensive overview of the behaviour of the eight models included (for the region of Europe). Finally, we also note that the scope of this research was Europe and deep mitigation scenarios. The model fingerprints will look different when applied to other regions or global results and to scenarios with little climate policy.

Even though we show only one application of the framework, it is intended as a general framework: the methodology, code and data are all flexible in nature. The model fingerprints for global output (rather than European) in Supplementary Information B.4 are an illustration of this, but the framework is also well suited as a diagnostic tool for model development by determining the fingerprints for different versions of a single model. To quantify model fingerprints at a more detailed sectoral level, the dimensions, indicators and even the scenarios can be adjusted accordingly. On diagnostic indicators, earlier literature8,19 stresses the importance of applicability to diverse models, beyond only the relevance, quantifiability and identification of heterogeneity. For the purpose of the latter, this framework with the current selection of diagnostic indicators is well suited, whereas for the former, the framework may require translation, for example, for models that cover only the electricity sector. This way the framework can form the basis of future model intercomparison projects.

Methods

Diagnostic experiment

We conducted a diagnostic experiment in which we ran ten different scenarios using eight different models. The models used are shown in Table 2. The models intentionally vary in many respects: coverage (global or Europe), scope (IAMs and energy system models), type (simulations and optimizations) and more. Whereas the focus of this work is Europe, most models in fact have a global coverage. For global models, all scenarios build on the global trajectory associated with currently implemented policies. Carbon prices are also applied globally, and whenever applicable, specific scenario assumptions are listed (in Table 3) for both the globe and Europe.

Table 2 Models used in the analysis
Table 3 Description of the diagnostic scenarios

The scenarios in this diagnostic experiment are described in Table 3. A current-policies (DIAG-NPI) baseline scenario is included for reference. All other scenarios are based on DIAG-NPI but, in addition, have a predefined fixed linearly increasing carbon price profile and integrate additional constraints or assumptions (third column in Table 3). One scenario has the linearly increasing carbon price next to current national policies implemented (DIAG-C400-lin) without any additional assumptions. Up to 2025, the carbon price profile cp(t) follows each model’s current policies (\(\mathrm{cp}_{\mathrm{NPI}}(t)\), from DIAG-NPI), after which it becomes a linearly increasing tax as follows, expressed in US dollars (2010) and with time t in years:

$$\mathrm{cp}(t)=\begin{cases}\mathrm{cp}_{\mathrm{NPI}}(t) & \text{if } t<2025\\ 130+18\times(t-2025) & \text{if } t\ge 2025\end{cases}\tag{1}$$

which results in a carbon price of US$580 t−1 CO2 in 2050 (US$400 t−1 CO2 in 2040). The eight variations of DIAG-C400-lin are explained in Table 3 and include one with a limit on the bioenergy potential (DIAG-C400-lin-LimBio) and one where electricity prices are artificially kept low (DIAG-C400-lin-HighElectrification).
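The price profile of equation (1) is simple enough to check numerically. The sketch below is illustrative only: the function name `carbon_price` and the flat placeholder for the model-specific current-policies price are hypothetical, while the intercept, slope and switch year come from equation (1).

```python
def carbon_price(t, cp_npi):
    """Carbon price in year t (US$2010 per tonne CO2), following equation (1).

    cp_npi is a callable returning the model's own current-policies price,
    which applies before 2025.
    """
    if t < 2025:
        return cp_npi(t)
    return 130 + 18 * (t - 2025)

def flat_npi(t):
    # Hypothetical placeholder; each model supplies its own DIAG-NPI prices.
    return 30.0

prices = {year: carbon_price(year, flat_npi) for year in (2020, 2025, 2040, 2050)}
```

Evaluating the linear branch gives 130 + 18 × 15 = 400 for 2040 and 130 + 18 × 25 = 580 for 2050, matching the values quoted above.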

Definitions of indicators

In this diagnostic exercise, we focus on Europe and 2050, but the framework can easily be adjusted for other regions or time frames. We consider only CO2 emissions and no other Kyoto gases and only for the energy sector (the only exception being indicator M4) because not all models from the ensemble include emissions from land use or other non-energy sectors in their scenario results. Another reason involves the focus of this study, which is on characterizing energy models rather than climate policy models in the broader sense. As shown to the bottom right of Fig. 1, some indicators are computed relative either to the current-policies scenario or to historical values of 2017–2021. The current-policies scenarios already differ substantially between models (Supplementary Information A.3), which can partially be explained by differences in regional aggregation (Supplementary Information A.1): for example, REMIND shows CO2 emissions reductions of 74% (for EU28) while MESSAGEix shows increases of 33% (also including Turkey). Such differences in the current-policies scenario will affect the indicators, but for some (that is, R1 and R2), it is required to compare to a scenario (from the same model) with a lower carbon price. A more elaborate discussion on using the current-policies scenario or historical data (2017–2021) as a reference is given in Supplementary Information A.3. We emphasize that we build upon earlier work8,19 in several of these indicators and that while some indicators discussed in this section are new definitions, the contribution of this paper lies in bringing many model dimensions together in one comprehensive model typology overview.

We refer to the first dimension as Responsiveness (green), which expresses the response of model output to various conditions in five indicators. The first indicator in this category is the sensitivity of the model’s calculated emissions abatement to carbon pricing (R1), which we quantify using the relative abatement index, that is, the relative reduction in emissions of carbon pricing scenarios with respect to the current-policies base scenario. With emissions from energy and industrial processes for scenario s and year t denoted by E(s, t), it reads:

$$R_{1}(s)=\frac{E(\mathrm{NPI},2050)-E(s,2050)}{E(\mathrm{NPI},2050)}\tag{2}$$
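As a worked instance of equation (2), the sketch below evaluates the relative abatement index for made-up emission levels. The function name and the numbers are hypothetical and serve only to show the arithmetic.

```python
def relative_abatement_index(e_npi_2050, e_scenario_2050):
    """R1 from equation (2): fractional 2050 emissions reduction versus DIAG-NPI."""
    if e_npi_2050 == 0:
        raise ValueError("reference emissions must be non-zero")
    return (e_npi_2050 - e_scenario_2050) / e_npi_2050

# Hypothetical 2050 emissions in Mt CO2 per year.
r1 = relative_abatement_index(2000.0, 500.0)
r1_net_negative = relative_abatement_index(2000.0, -100.0)
```

Here `r1` is 0.75, a 75% reduction relative to current policies. Net-negative emissions in the pricing scenario give a value above 1, as in the second call.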

We calculate R1 by comparing the linearly increasing carbon price scenarios (Table 3) to the current-policies scenario, but we note that the term ‘relative abatement index’ has been used previously for other pricing profiles as well: for example, Harmsen et al. use the same terminology for an exponentially increasing carbon price8, while we use a linearly increasing one (equation (1)). The second indicator (R2) is the timescale of carbon price response, which is defined as the number of years until the scenario has 66% lower CO2 emissions than the current-policies scenario has in the same year. A related (Supplementary Information A.2) third indicator is the maximum mitigation speed (R3), which is the maximum emissions reduction over any five-year increment, yielding insights into how fast the model can mitigate in short periods. The fourth (R4) and fifth (R5) indicators focus on scenario sensitivity for each model, resulting in a single value (per model) rather than a scenario range, indicating how the primary energy mix and the final energy demand across sectors vary among its scenarios. We write the fraction of an energy carrier c in the primary energy mix in scenario s as \(f_{c,s}\), its average across all scenarios as \(\overline{f_{c}}\) and the total number of unique carriers considered as \(n_{c}\). Then, R4 is computed as follows:

$${R}_{4}=\frac{1}{{n}_{c}}\sum _{c}{\rm{var}}\left(\frac{{f}_{c,s}}{\overline{{f}_{c}}}\right)$$
(3)

In other words, it represents the average of the inter-scenario variances per primary energy carrier fraction. R5 is computed similarly but uses final energy use across the industry, transportation and buildings sectors (instead of primary energy carrier fractions). If R4 and R5 are high, there is much inter-scenario variation in the primary energy mix or final energy demand in the model’s output, while if these values are low, the scenarios are more or less similar and the model is relatively static with respect to the scenario assumptions.
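
The computation in equation (3) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the results: the function name and input layout are hypothetical, the population variance is used (the text does not specify population versus sample variance), and carriers with a zero average share are skipped to avoid division by zero.

```python
from statistics import mean, pvariance


def scenario_mix_variability(fractions: dict[str, list[float]]) -> float:
    """R4 as in equation (3). fractions[c] lists the share of carrier c in
    the primary energy mix for each scenario of a single model."""
    per_carrier_variance = []
    for shares in fractions.values():
        average_share = mean(shares)
        if average_share == 0:
            continue  # assumption: ignore carriers absent from all scenarios
        normalised = [share / average_share for share in shares]
        # assumption: population variance across scenarios
        per_carrier_variance.append(pvariance(normalised))
    if not per_carrier_variance:
        raise ValueError("no carrier with a non-zero average share")
    return mean(per_carrier_variance)


# Illustrative shares for two scenarios of one hypothetical model.
example = {
    "coal": [0.50, 0.00],
    "gas": [0.25, 0.25],
    "wind": [0.25, 0.75],
}
print(round(scenario_mix_variability(example), 3))  # 0.417
```

In this example, gas contributes no variance because its share is identical in both scenarios, whereas coal dominates the result because it is phased out entirely in the second scenario.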

The second dimension to express model typology concerns the mitigation strategies (yellow in Fig. 1). Although all models rely to some extent on all mitigation strategies we categorize here, it is useful to quantify which are more prominent in model output relative to others in the ensemble. The first indicator in this dimension is the reduction in carbon intensity (M1, emissions divided by the final energy) compared to the baseline scenario. Similarly, the second mitigation strategy (M2) involves reduction in energy intensity, that is, final energy divided by the gross domestic product (purchasing power parity). These reductions are compared to the average historical values between 2017 and 2021, by taking an ensemble average across this period. The third mitigation indicator (M3) is expressed as the amount of carbon captured (including direct air capture) in 2050, focusing on the energy sector. The fourth (M4) is the quotient between non-CO2 emissions reduction and CO2 emissions reduction (using 100-year global warming potential values in line with the IPCC Sixth Assessment Report). Because only a few models reported non-CO2 emissions specifically from the energy sector, we calculate this quotient based on non-CO2 and CO2 emissions from all sources.

The third dimension involves energy supply, expressed in the primary energy mix (blue). The associated indicators are primary energy consumption per carrier calculated as fractions of the total primary energy in 2050: coal (Es1), oil (Es2), gas (Es3), solar (Es4), wind (Es5), biomass (Es6) and nuclear (Es7).

The fourth dimension focuses on the energy demand (red). We have chosen to separate the three major end-use sectors (transport, industry and buildings) when looking at their respective electrification (Ed1, Ed2 and Ed3). The fourth indicator quantifies the CO2 emissions in the electricity sector (Ed4), which is partially redundant given the primary energy mix (shown in the blue indicators), but we include it to provide a quick overview of the electricity sector specifically. To address the high inter-model variation on hydrogen28, Ed5 quantifies the fraction of hydrogen in the final energy mix across scenarios.

The fifth and final dimension of model typology concerns mitigation costs and effort (purple). Policy costs in energy system models can be calculated in different ways, such as consumption loss and the area under the marginal abatement cost curve; we use whichever is available for each model. The first indicator in this dimension concerns the costs per marginal abatement value (C1), which captures the ratio between cumulative additional policy costs (the scenario costs minus those in the current-policies scenario) and the value of emissions reduction; if this number is high, the abatement was expensive. Note that this metric is related to (a cumulative version of) the cost per abatement value metric in earlier work8,19 and is computed as follows:

$${C}_{1}(s)=\frac{{\rm{Cumulative}}\,{\rm{additional}}\,{\rm{policy}}\,{\rm{costs}}\,{\rm{between}}\,2020\,\mathrm{and}\,2050}{\overline{\mathrm{CP}}\cdot (E(s,2020)-E(s,2050))}$$
(4)

where \(\overline{\mathrm{CP}}\) is the average carbon price over the period 2020–2050, and E(s, t) denotes the CO2 emissions in scenario s at time t. The second indicator is the transformation index (C2), first proposed by Kriegler et al. (2015)19, which is the sum of the absolute changes in the fractions of the primary energy carriers in 2050 (compared to the values in the period 2017–2021). Similarly, C3 quantifies the changes in final energy demand across sectors.
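
To make equation (4) and the transformation index concrete, the following Python sketch computes C1 and C2 for illustrative inputs. All names and numbers are hypothetical. The sketch assumes that the cumulative additional policy costs are supplied already summed over 2020–2050, that the average carbon price is a simple unweighted mean over the reported time steps and that costs, prices and emissions are expressed in mutually consistent units.

```python
from statistics import mean


def cost_per_abatement_value(cumulative_additional_costs: float,
                             carbon_prices: list[float],
                             e_2020: float, e_2050: float) -> float:
    """C1 as in equation (4): cumulative additional policy costs divided by
    the average carbon price times the 2020-2050 emissions reduction."""
    # assumption: unweighted mean of the carbon price over the time steps
    abatement_value = mean(carbon_prices) * (e_2020 - e_2050)
    if abatement_value == 0:
        raise ValueError("abatement value must be non-zero")
    return cumulative_additional_costs / abatement_value


def transformation_index(shares_2050: dict[str, float],
                         shares_historical: dict[str, float]) -> float:
    """C2: sum of absolute changes in primary energy carrier fractions
    between the historical reference period and 2050."""
    carriers = set(shares_2050) | set(shares_historical)
    return sum(abs(shares_2050.get(c, 0.0) - shares_historical.get(c, 0.0))
               for c in carriers)


# Illustrative values: costs in million USD, prices in USD per t CO2,
# emissions in Mt CO2, so that price times emissions is in million USD.
c1 = cost_per_abatement_value(60_000.0, [0.0, 100.0, 200.0], 3000.0, 1000.0)
c2 = transformation_index({"coal": 0.0, "gas": 0.25, "wind": 0.75},
                          {"coal": 0.25, "gas": 0.5, "wind": 0.25})
print(round(c1, 3), c2)  # 0.3 1.0
```

Under these definitions C2 ranges from 0 (unchanged mix) to 2 (complete replacement of all carriers in the mix).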

Model typology table

Table 1 contains a quantification of typical model behaviour in terms of how it deviates from the ensemble. This ‘deviation metric’ is calculated as follows, for each indicator i and model m:

$${\rm{Deviation}}\,{\rm{metric}}\,(i,m)=\frac{{\eta }_{m}-\langle \eta \rangle }{\langle \sigma \rangle }$$
(5)

where η denotes the median, σ the standard deviation and angle brackets (⟨ ⟩) indicate the full ensemble. In other words, we compute the difference between the model’s median and the ensemble’s median and express it in units of the ensemble’s standard deviation. In a few columns of Table 1, we combine indicators for readability (Supplementary Table 5 provides the more detailed version of this table). For scenario responsiveness, we average the metric above for R4 and R5. Similarly, for electrification, we average it for Ed1, Ed2 and Ed3. In the column on energy and carbon intensity and the one on energy supply, simply the highest deviation is reported.
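
A minimal Python sketch of equation (5) is given below. It is an illustration under assumptions that the text does not fix: the ensemble is taken to be the pooled indicator values of all scenarios from all models, and the population standard deviation is used. The function name and the data are hypothetical.

```python
from statistics import median, pstdev


def deviation_metric(values_by_model: dict[str, list[float]], model: str) -> float:
    """Equation (5): difference between a model's median and the ensemble
    median for one indicator, in units of the ensemble standard deviation."""
    # assumption: the ensemble pools every scenario value of every model
    pooled = [v for values in values_by_model.values() for v in values]
    spread = pstdev(pooled)  # assumption: population standard deviation
    if spread == 0:
        raise ValueError("ensemble has zero spread for this indicator")
    return (median(values_by_model[model]) - median(pooled)) / spread


# Illustrative indicator values for two hypothetical models.
indicator_values = {"A": [1.0, 2.0, 3.0], "B": [3.0, 4.0, 5.0]}
print(round(deviation_metric(indicator_values, "A"), 3))  # -0.775
```

A negative value means that the model's median lies below the ensemble median for that indicator; the combined columns of Table 1 would then follow by averaging such values across the relevant indicators.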