Identifying energy model fingerprints in mitigation scenarios

Energy models are used to study emissions mitigation pathways, such as those compatible with the Paris Agreement goals. These models vary in structure, objectives, parameterization and level of detail, yielding differences in the computed energy and climate policy scenarios. To study model differences, diagnostic indicators are common practice in many academic fields, for example, in the physical climate sciences. However, they have not yet been applied systematically in mitigation literature, beyond addressing individual model dimensions. Here we address this gap by quantifying energy model typology along five dimensions: responsiveness, mitigation strategies, energy supply, energy demand and mitigation costs and effort, each expressed through several diagnostic indicators. The framework is applied to a diagnostic experiment with eight energy models in which we explore ten scenarios focusing on Europe. Comparing indicators to the ensemble yields comprehensive ‘energy model fingerprints’, which describe systematic model behaviour and contextualize model differences for future multi-model comparison studies.

The European Union's ambition to reach climate neutrality in 2050 as part of the European Green Deal 1 requires a thorough transformation of the full energy-economy system. Insights required for this transition are obtained from various lines of research, such as analysis of the technical mitigation potential, the effectiveness of policy instruments and opportunities for system changes given the interests of stakeholders and institutional barriers 2. An important part of quantitative information on mitigation pathways is obtained from model-based scenario analyses, such as those recently published in the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report 3,4.
Still, a large spread is associated with the scenario output, which originates from many types of uncertainty 5,6. Structural uncertainty stems from differences in numerous assumptions regarding, among others, technological innovation and uptake, market behaviour,

Article https://doi.org/10.1038/s41560-023-01399-1

energy supply and energy demand. For example, differences in solar power deployment under similar emissions levels can be understood only when information about wind power (as a potential competitor), carbon dioxide capture and storage (CCS) or energy demand reductions is provided. This motivates comparing multiple model dimensions at the same time. While Harmsen et al.'s six indicators are useful to classify models for each indicator individually (that is, one-dimensional), the aim of this paper is to characterize the overall model typology (that is, high-dimensional), moving beyond a mere long list of 'individual' diagnostic indicators towards a comprehensive overview or 'story' of the model's behaviour. This requires the analysis to go back and forth between many different aspects of the same model. In addition, a comprehensive overview of model behaviour would include an extension of the existing list of diagnostic indicators to a higher level of detail (Supplementary Information A.6).

Framework and fingerprints
Here we quantify typical model behaviour in a framework that we colloquially refer to as the model's 'fingerprint'. It assesses five key model behaviour dimensions: responsiveness, mitigation strategies, energy supply, energy demand and costs and effort. Each dimension is subsequently expressed in several diagnostic indicators, as shown in Fig. 1. The set of diagnostic indicators extends previous work 8,19 with new indicators to obtain a more complete and multi-dimensional perspective on model behaviour (see Supplementary Information A.6 for a comparison with earlier work). For example, the 'responsiveness' dimension describes more than only tax response (R1), adding the speed of response (R2-3) and scenario sensitivities (R4 and R5). Other notable extensions on previous work are the inclusion of carbon capture (M3), the role of non-CO2 (M4), more details on the energy supply mix (Es1-7), electrification of end-use sectors (Ed1-3), hydrogen use (Ed4) and demand shifts (C3). In addition, all indicators are computed from a more extended set of diagnostic scenarios (Methods), containing scenario variation beyond only varying carbon price trajectories.
The framework's principle is to compare a model's scenario range (Fig. 1, yellow shaded) to the ensemble statistics in terms of its medians (η) and standard deviations (σ), yielding the range of typical (relative) model behaviour: the area towards the outer ring indicates above-median (up to η + 2σ) output and the area towards the circle centre indicates below-median (down to η − 2σ) output. Although scenario ensembles of energy models are rarely normally distributed, defining the range this way ensures intuitive, equal ranges above and below the median. The diagnostic exercise performed to fill these diagrams includes eight different energy system models, involving ten diagnostic scenarios tailored to explore different model aspects. The scenarios (Methods and Table 3) have a high carbon price, aimed at approximating the Paris Agreement goals and the European Union's climate neutrality goal.
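The placement of an indicator value on a fingerprint axis can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the use of the population standard deviation and the clipping at ±2σ are assumptions made here for illustration only.

```python
import statistics

def fingerprint_position(value, ensemble_values, clip=2.0):
    """Express `value` as a deviation from the ensemble median, in units of
    the ensemble standard deviation, limited to the axis range [-clip, clip].

    Assumption: the population standard deviation is used; the source does
    not state which estimator the authors apply.
    """
    eta = statistics.median(ensemble_values)    # ensemble median
    sigma = statistics.pstdev(ensemble_values)  # ensemble standard deviation
    if sigma == 0:
        return 0.0  # degenerate ensemble: every value sits on the median
    z = (value - eta) / sigma
    return max(-clip, min(clip, z))

def model_range(model_values, ensemble_values):
    """Lowest and highest axis position over one model's scenarios, that is,
    the shaded band of that model on a single indicator axis."""
    positions = [fingerprint_position(v, ensemble_values) for v in model_values]
    return min(positions), max(positions)
```

For a hypothetical indicator with ensemble values 0 to 4, a model whose scenarios yield 1 and 3 would occupy a band of roughly ±0.7σ around the median, whereas a far outlier would be drawn at the edge of the axis.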
In Figs. 2 and 3, one can see the outcomes of the framework for the eight involved energy system model versions. For illustration purposes, we discuss the fingerprints of three model versions in detail before moving on to the more overarching patterns and typologies that the fingerprints reveal. Detailed model-by-model observations, as well as model acronym meanings, are provided in Supplementary Information C. In Fig. 2a, we see the fingerprint of the IMAGE model 20, a process-based IAM. The model shows medium levels of carbon emissions abatement (R1) and carbon intensity reduction (M1). Still, it has a relatively low mitigation timescale (R2), high maximum mitigation speed (R3) and sensitivities (R4-5) and relies relatively more on energy intensity reductions and carbon capture (M2-3). Consequently, primary fossil use is higher than the ensemble median, notably coal (Es1). Relatively low values of solar and nuclear energy use (Es4, Es7) contrast with the high use of biomass (Es6, in most scenarios) in the primary energy mix. Electrification is high for transport (Ed1), medium for industry (Ed2) and low for buildings (Ed3).

Transformation in preferences and changes in specific activities. Parametric uncertainties involve differences in parameter calibrations or are consequential to differences in sectoral granularity and regional and temporal scale. Fundamental modelling choices may also vary, such as those concerning mathematical formulation (for example, optimization versus simulation frameworks), model structure and foresight. Substantial differences can be recognized across the model outcomes 5, potentially even yielding contradictory observations 6. To have a more accurate understanding of energy and climate policy scenario outcomes, it is important to have insights into both (1) where models differ substantially and where they agree and (2) how a model's output relates to the overall ensemble, yielding insights on outliers and discrepancies. This is specifically important because studies commonly use individual models rather than large ensembles, both in scientific literature 7 and in policy reports (for example, in national policy studies). Only when single-model results are contextualized by the model's position in the larger ensemble can the reader arrive at a complete and correct interpretation of the output. Additionally, such quantification of the model's position in the larger model ensemble allows for tracking model development 8.
Both questions require a stylized set of results across the model range in which only a well-defined number of assumptions is varied. To this end, multi-model comparison exercises have been effective. This is done within confined projects 3 and, for instance, in the long tradition 9 of studies by the Energy Modelling Forum 10,11, of which many scenario runs are collectively used in the Assessment Reports by the IPCC 10,11. Still, the observed large model differences (especially in estimates of costs 12, the diffusion of individual renewable energy technologies 13 and demand sector development 14,15) motivate more research in this area and emphasize the importance of interpreting single-model results in light of larger model ensembles.
Many multi-model comparison studies test the (un)certainty of an outcome by looking at the range across models in scenarios designed for other purposes, for example, to describe the effect of current mitigation measures 6. While insightful for that particular question, quantifying and evaluating overall model behaviour requires analysis beyond typical scenarios and typical variables; hence the importance of analysis of diagnostic scenarios, expressed in diagnostic variables or indicators. Such practice is well established in the climate and atmospheric sciences 16-18.
Figure 2b shows the fingerprint of the REMIND model 21, which is a Ramsey-type general equilibrium growth IAM and thus quite different from IMAGE. Indeed, the fingerprint also looks very different, indicating substantial differences between the output of REMIND and that of IMAGE. Striking are the higher levels of abatement (R1), carbon intensity reductions (M1) and fast reaction to carbon pricing (low R2, high R3). Consistent with these observations, primary fossil use (Es1-Es3) is limited, whereas variable renewable energy and electrification in transport and buildings (Ed1, Ed3) are higher than ensemble medians. For industry, electrification is quite sensitive to scenario assumptions. The associated transformation index (C2) is expectedly high, while the costs (C1) are lower than the ensemble median.

A third and final example we discuss here is shown in Fig. 3b: the PRIMES model 22, which is an energy system model that provides projections of energy demand, supply, prices and investments. This model does not react heavily to the various scenario assumptions, as reflected in the low scenario sensitivity of the primary energy mix (R4) and demand (R5) and in the narrow scenario ranges shown in the fingerprint diagram. PRIMES projects near-median levels for most indicators, with a few exceptions: a relatively low mitigation speed potential (R3), high energy intensity reduction (M2), high relative non-CO2 reductions (M4), low solar energy use (Es4) and notably a very high hydrogen use (Ed4).

Model typology
A summary of the general tendencies of the models is shown in Table 1. We use a similar (but extended) approach to that in previous literature 8, where a general classification was given of each of the models. The assessments in the table are directly based on the fingerprints in Figs. 2 and 3 and quantify the extent to which the models deviate from the ensemble.
The spread in Table 1 indicates the significance of the differences in reported model deviations. Examples are the response to carbon taxing in REMIND (+1.8σ) versus that in TIAM-ECN (−1.6σ), the energy intensities in WITCH (+2.1σ) and TIAM-ECN (−1.7σ), or the relative mitigation of non-CO2 emissions in IMAGE and REMIND (−1.1σ and −1.2σ) compared to that in TIAM-ECN (+1.8σ). Whereas such values are

To provide structure to the discussion, we sorted the models by their tax response (second column, R1): that is, carbon emissions in 2050 under a carbon tax scenario compared with that in the current-policies scenario of the same model. REMIND is by far the most responsive (green); it shows almost two standard deviations higher relative abatement and, in addition, has a high scenario response. This high responsiveness coincides with high carbon intensity reduction, renewables and electrification. The model emphasizes carbon intensity reduction, focusing on CO2 emissions, while projecting average numbers for energy intensity reduction and carbon capture. Euro-Calliope partially shows similar results, also having a (moderately) high tax response, carbon intensity reduction and renewables rollout, but to a lesser extent than REMIND. From a system point of view, these models show a consistent fingerprint; the high tax response (green) is achieved by replacing fossils (yellow) by renewables (blue) and high electrification (orange). IMAGE can also be regarded as a model with moderately
This consistent fingerprint can also be observed at the other end of the spectrum. This is clearly expressed in TIAM-ECN output, which has a low tax response, low carbon intensity reduction, high fossil use and low electrification. MESSAGEix also shows this mirror image to some extent, but it is closer to the rest of the ensemble than TIAM-ECN in this respect. While this model initially rapidly mitigates emissions (reflected in a low R2, approximating REMIND), the abatement becomes relatively slow over time, yielding an overall medium to low relative abatement (R1) of −0.5σ. Note that this is partially a consequence of differences in what years are represented in each model year: for MESSAGE, 2050 represents the five preceding years. These abatement projections are accompanied by relatively low (sectoral average) electrification of −0.4σ, higher fossil use of +0.5σ and low carbon and energy intensity reductions (−0.8σ and −0.7σ, respectively; Supplementary Information B.1).
Besides MESSAGEix, the three other models with medium tax response (PRIMES, PROMETHEUS and WITCH) are the models with the highest energy intensity reductions while having low (WITCH) or medium (PRIMES, PROMETHEUS) carbon intensity reductions. This is most clearly visible in WITCH output, projecting energy intensity reductions of over two standard deviations higher. The three models have another commonality: they are all moderately sensitive to scenario assumptions. However, in many other ways, their projections notably differ. WITCH relies more on carbon capture, in contrast to PROMETHEUS, which, in turn, has higher nuclear and bioenergy in its energy supply mix than any other model in this experiment. The latter is related to the high level of biofuel use in PROMETHEUS.
A commonality between WITCH, IMAGE and MESSAGEix can also be found; they all have rather high fossil use and carbon capture (the exception to this pattern seems to be TIAM-ECN, which does not project as much carbon capture). Interestingly, while high carbon capture intuitively would generate high abatement and carbon intensity reductions, this is not visible here, with, for example, MESSAGEix having +1.1σ carbon capture but −0.8σ carbon intensity reduction (Supplementary Information B.1).
In previous papers, scholars have tried to link model behaviour to the underlying model type in terms of solution approach (for example, partial or general equilibrium, recursive dynamic or inter-temporal). For example, Daioglou et al. (2020) 23 tried to link bioenergy use to model type, and Harmsen et al. (2021) 8 compared the output of six diagnostic indicators to the model types. However, in both studies, 'no direct relationship between model type and model behaviour' was found 8. Even though this study contains more unique scenarios than previous studies, we again find no clear link between model behaviour and model type (although the set considered here includes fewer models). In Table 1, we see that inter-model differences are not only higher than inter-scenario differences in many of the indicators but also higher than inter-model type differences. Technology-rich energy system models can be among the most tax-sensitive models for abatement (Euro-Calliope) or among the less tax-sensitive ones (MESSAGEix); that is, being technology rich does not imply a certain level of carbon tax sensitivity. Similar patterns can be observed for general versus partial equilibrium models and inter-temporal versus myopic models. It is important to emphasize that while having more scenarios than in previous studies allows for a more representative fingerprint of each model, a clear limitation in trying to attribute model behaviour to model type is the limited number of models (eight). Adding more models could help in observing such relations, but models are rarely fully independent from each other, which is a limitation of these model comparison exercises in general.

Discussion and conclusion
In this paper, we propose a unique framework to characterize energy model typology by quantifying their 'fingerprints' in scenario ensembles. We identified five key model dimensions along which models differ: model responsiveness, mitigation strategies, energy supply, energy demand and costs and effort, split further into a total of 24 diagnostic indicators. A number of these indicators are re-used from existing literature 8,19 (Supplementary Information A.6), but we add more dimensions such as the energy supply mix, carbon capture, non-CO2 and demand responses, which allows for a more complete interpretation of model behaviour. The indicators are calculated and visualized relative to the ensemble statistics of a set of tailored diagnostic scenarios, yielding a comprehensive insight into each model's fingerprint in relation to the other models.
The framework contextualizes results from individual models (or rather model versions, whose behaviour may also evolve 24,25) by identifying typical model behaviour, which yields better interpretation and understanding of them. The latter, of course, is of vital importance to both researchers and policymakers 8,19. Additionally, the substantial model differences, being commonly larger than inter-scenario differences, motivate caution on using individual model results without ensemble context in general. Specifically, we find highly tax-responsive models to also have the highest renewables, electrification and carbon intensity reduction. IMAGE is also relatively highly responsive but distinguishes itself with high fossil and carbon capture use. TIAM-ECN, having the lowest tax response, indeed also shows a low carbon intensity reduction. MESSAGEix shows a similar pattern (but closer to the rest of the ensemble), and both of these models additionally show a low energy intensity reduction. Most medium tax-responsive models (WITCH, PROMETHEUS and PRIMES) show higher energy intensity reductions and moderate scenario responsiveness but vary in many other ways.

The framework also has limitations. First of all, the scenario projections are not statistical predictions; hence, the model-scenario ensemble cannot be used as such, and the ensemble median should not be interpreted as 'most appropriate'. Analogously, model behaviour that is distinct from the rest of the ensemble (outliers) should not be treated as 'least probable', and neither should the sorting of models by carbon tax response (in Table 1) be read as a form of 'ranking'. Whereas typical other multi-model comparison studies focus much more on robust messages across models, our aim is to identify and highlight their differences. The model differences presented here do not even necessarily illustrate a lack of consensus on future outcomes. Potentially, they also reveal that deep mitigation can be achieved in multiple ways (that is, a form of 'policy freedom'), albeit each conditioned in various respects, motivating the high dimensionality of this analysis. Hence, for the purpose of exploring the future, model differences can be useful. This study aims only to map these differences; it does not identify their sources, which would also require the analysis of model inputs and parameters 26,27, for example those associated with efficiency differences in final energy (R5, Ed1-3 and C3).

A second important remark is that the fingerprint framework and Table 1 contain quantities that are relative to the ensemble. Therefore, the fingerprints dilute information on absolute differences, which motivates the importance of Supplementary Table 4; this table should always be taken into account when using this framework. The relative nature of the framework also means that there is no single 'absolute' fingerprint of any model; it is always subject to the ensemble the model is being compared to. Structural bias in the community (that is, across all included models) will therefore not be easily detected. Adding models to the presented diagnostic exercise or choosing a different regional scope will modify the model fingerprints (for example, Supplementary Information B.4 for global results). Still, by choosing diagnostic scenarios with a wide range of assumptions and a large variety of models, we do approximate a comprehensive overview of the behaviour of the eight models included (for the region of Europe). Finally, we also note that the scope of this research was Europe and deep mitigation scenarios. The model fingerprints will look different when applied to other regions or global results and to scenarios with little climate policy.
Even though we show only one application of the framework, it is intended as a general framework: the methodology, code and data are all flexible in nature. The model fingerprints for global output (rather than European) in Supplementary Information B.4 are an illustration of this, but the framework is also well suited as a diagnostic tool for model development by determining the fingerprints for different versions of a single model. To quantify model fingerprints at a more detailed sectoral level, the dimensions, indicators and even the scenarios can be adjusted accordingly. On diagnostic indicators, earlier literature 8,19 stresses the importance of applicability to diverse models, beyond only the relevance, quantifiability and identification of heterogeneity. For the purpose of the latter, this framework with the current selection of diagnostic indicators is well suited, whereas for the former, the framework may require translation, for example, for models that cover only the electricity sector. This way the framework can form the basis of future model intercomparison projects.

for an exponentially increasing carbon price 8, while we use a linearly increasing one (equation (1)). The second indicator (R2) is the timescale of carbon price response, which is defined as the number of years until the scenario has 66% lower CO2 emissions than the current-policies scenario has in the same year. A related (Supplementary Information A.2) third indicator is the maximum mitigation speed (R3), which is the maximum emissions reduction over any five-year increment, yielding insights into how fast the model can mitigate in short periods. The fourth (R4) and fifth (R5) indicators focus on scenario sensitivity for each model, resulting in a single value (per model) rather than a scenario range, indicating how the primary energy mix and the final energy demand across sectors vary among its scenarios. We write the fraction of an energy carrier c in the primary energy mix in scenario s as f_{c,s}, its average across all scenarios as f̄_c and the total number of unique carriers considered as n_c. Then, R4 is computed as the average, over the n_c carriers, of the inter-scenario variance of each primary energy carrier fraction. R5 is computed similarly but uses final energy use across the industry, transportation and buildings sectors (instead of primary energy carrier fractions). If R4 and R5 are high, there is much inter-scenario variation in the primary energy mix or final energy demand in the model's output, while if these values are low, the scenarios are more or less similar and the model is relatively static with respect to the scenario assumptions.
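The verbal definitions of R2, R3 and R4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the timescale is counted from the first year of the supplied series, the maximum mitigation speed is taken as an absolute emissions drop, and the population variance is used for R4.

```python
from statistics import pvariance

def response_timescale(years, policy_emissions, baseline_emissions, threshold=0.66):
    """R2 sketch: years elapsed (from the first listed year, an assumption)
    until policy emissions are at least `threshold` below the
    current-policies emissions of the same year. None if never reached."""
    for year, e_pol, e_base in zip(years, policy_emissions, baseline_emissions):
        if e_pol <= (1 - threshold) * e_base:
            return year - years[0]
    return None

def max_mitigation_speed(years, emissions, window=5):
    """R3 sketch: largest absolute emissions drop over any `window`-year
    increment (absolute rather than relative drop is an assumption)."""
    by_year = dict(zip(years, emissions))
    drops = [by_year[y] - by_year[y + window] for y in years if y + window in by_year]
    return max(drops) if drops else None

def scenario_sensitivity(fractions):
    """R4 sketch: mean over carriers of the inter-scenario variance of the
    carrier fraction. `fractions` maps each carrier c to the list of its
    fractions f_{c,s}, one entry per scenario s (population variance assumed)."""
    variances = [pvariance(f_s) for f_s in fractions.values()]
    return sum(variances) / len(variances)
```

The same `scenario_sensitivity` function could be applied to sectoral final energy shares to obtain an R5-like value, again under the assumptions listed above.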
The second dimension to express model typology concerns the mitigation strategies (yellow in Fig. 1). Although all models rely to some extent on all mitigation strategies we categorize here, it is useful to quantify which are more prominent in model output relative to others in the ensemble. The first indicator in this dimension is the reduction in carbon intensity (M1, emissions divided by the final energy) compared to the baseline scenario. Similarly, the second mitigation strategy (M2) involves reduction in energy intensity, that is, final energy divided by the gross domestic product (purchasing power parity). These reductions are compared to the average historical values between 2017

DIAG-C400-lin-LimNuclear | Identical to DIAG-C400-lin | Nuclear power limited to today's levels and no new constructions are allowed except capacity already under construction.
DIAG-C400-lin-HighEff | Identical to DIAG-C400-lin | By means of a variety of lifestyle change or high efficiency options (models are free to choose), reach final energy values of 26 EJ in Europe in 2050 including bunkers and non-energy.
To align with earlier diagnostic scenario runs 8, 2010 US dollar is used as currency for the carbon pricing. All other prices and monetary variables are in 2020 Euro. More details can be found in the model-scenario protocol, publicly available 32. Because of the variety in model characteristics, not all models were able to run all prescribed scenarios. Still, over 90% of all theoretically possible model-scenario combinations are included in the database (Supplementary Information A.5).
Cumulative costs per abatement value* | Cumulative additional costs divided by the cumulative sum of carbon prices times the marginal CO2 emissions reductions (over period 2020-2050)
Hydrogen use | Hydrogen use in the final energy mix as a fraction of the total
Demand shifts † | Sum of absolute changes in final energy demand in end-use sectors

Fig. 2 | Model fingerprints. a-d, Model typologies or 'fingerprints' of the IMAGE (a), REMIND (b), MESSAGEix-GLOBIOM (c) and WITCH (d) models for Europe in 2050. The axis for each indicator ranges between the ensemble median (η) ± two standard deviations (σ). These medians and standard deviations are computed from the full ensemble, that is, the eight models and nine linearly increasing-

Fig. 3 | Model fingerprints. a-d, Model typologies or 'fingerprints' of the TIAM-ECN (a), PRIMES (b), PROMETHEUS (c) and sector-coupled Euro-Calliope (d) models, for Europe in 2050, similar to Fig. 2. The axis for each indicator ranges between the ensemble median (η) ± two standard deviations (σ). These medians and standard deviations are computed from the full ensemble, that

Examples of indicators are the equilibrium climate sensitivity and the transient climate response, which are associated with diagnostic scenarios in which the CO2 concentration is doubled or quadrupled. In the emissions mitigation literature, examples of diagnostic multi-model studies are Kriegler et al
due to the higher-dimensional intertwinement of mitigation, policy,

Table 1 | Overview of model typologies
The columns refer to individual or combined diagnostic indicators and the cells below them show how many standard deviations the model median differs from the ensemble median (Methods). Missing values are denoted with dashes. 'Medium' implies a deviation within ±0.5σ. 'High' and 'Low' imply larger positive and negative deviations, respectively. Deviations of more than ±1σ are emphasized in bold. Abbreviations in the first column: IT (inter-temporal), GE (general equilibrium), PE (partial equilibrium) and RD (recursive dynamics). For carbon and energy intensity (CI and EI) and in the column on energy supply, only the highest deviation is indicated. An extended version of this table with values for each diagnostic indicator can be found in Supplementary Information B.1.
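The labelling rules in this table note can be expressed as a small Python sketch. The function name and return format are hypothetical; only the ±0.5σ and ±1σ thresholds and the dash for missing values come from the note above.

```python
def classify_deviation(deviation_sigma, medium_band=0.5, bold_threshold=1.0):
    """Map a model-median deviation (in ensemble standard deviations) to a
    Table 1 style label and a flag for bold emphasis.

    Returns ("-", False) for missing values, mirroring the dashes in the table.
    """
    if deviation_sigma is None:
        return "-", False
    if abs(deviation_sigma) <= medium_band:
        label = "Medium"   # within ±0.5σ of the ensemble median
    elif deviation_sigma > 0:
        label = "High"     # larger positive deviation
    else:
        label = "Low"      # larger negative deviation
    emphasized = abs(deviation_sigma) > bold_threshold  # more than ±1σ
    return label, emphasized
```

Applied to the values quoted in the text, a tax response of +1.8σ would be labelled 'High' and emphasized, while an abatement of −0.5σ would fall at the edge of the 'Medium' band.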

Table 2 | Models used in the analysis
Model types (fourth column) are abbreviated: IAM (integrated assessment models), IT (inter-temporal), GE (general equilibrium), PE (partial equilibrium) and RD (recursive dynamics). More information can be found at https://www.iamcdocumentation.eu/index.php/IAMC_wiki. When referring to 'Europe', we mean the European Union together with the United Kingdom (that is, EU28), or the closest resemblance of this per model based on their regional aggregations.

Table 3 | Description of the diagnostic scenarios
Global primary modern bioenergy supply limited to 100 EJ, which is on the lower end of the range given in the Sixth Assessment Report by IPCC 32. Maximum European primary biomass use is limited to 7 EJ, which is close to current use 33. Biomass imports of Europe set to 0.
DIAG-C400-lin-LimCCS | Identical to DIAG-C400-lin | CCS (including BECCS and DACCS) limited to 2 Gt CO2 per year globally. Max CCS use in Europe limited to 250 Mt CO2 per year.