Global nature run data with realistic high-resolution carbon weather for the year of the Paris Agreement

The CO2 Human Emissions project has generated realistic high-resolution 9 km global simulations for atmospheric carbon tracers referred to as nature runs to foster carbon-cycle research applications with current and planned satellite missions, as well as the surge of in situ observations. Realistic atmospheric CO2, CH4 and CO fields can provide a reference for assessing the impact of proposed designs of new satellites and in situ networks and to study atmospheric variability of the tracers modulated by the weather. The simulations spanning 2015 are based on the Copernicus Atmosphere Monitoring Service forecasts at the European Centre for Medium Range Weather Forecasts, with improvements in various model components and input data such as anthropogenic emissions, in preparation of a CO2 Monitoring and Verification Support system. The relative contribution of different emissions and natural fluxes towards observed atmospheric variability is diagnosed by additional tagged tracers in the simulations. The evaluation of such high-resolution model simulations can be used to identify model deficiencies and guide further model improvements.

Measurement(s) atmospheric carbon dioxide, methane and carbon monoxide Technology Type(s) numerical simulation Factor Type(s) None Sample Characteristic - Organism long-lived greenhouse gases Sample Characteristic - Environment atmosphere Sample Characteristic - Location global atmosphere

Background & Summary
Reducing human-made emissions of CO 2 is at the heart of the climate change mitigation efforts in the Paris Agreement. In support of such efforts, the CO 2 Human Emission (CHE) project (www.che-project.eu) has designed a prototype system to monitor CO 2 fossil fuel emissions at the global scale. This challenging task requires the capability to detect and quantify the localised and relatively small signals of fossil fuel emissions in the atmosphere compared to the large variability of background CO 2 concentrations not directly affected by local sources, and to distinguish anthropogenic sources from vegetation fluxes [1][2][3] . Using observations of atmospheric constituents to estimate emissions 4,5 relies on a good understanding and accurate modelling of their atmospheric variability, which is largely determined by the weather-driven atmospheric transport together with surface biogenic fluxes and anthropogenic emissions. In the CHE project a library of nature runs of CO 2 and species co-emitted with CO 2 has been produced at different scales and with varying degrees of complexity 6 which complements previous nature runs 7 .
Nature runs are very high-resolution simulations that mimic nature, in that they provide a realistic representation of processes of interest, in this case those modulating atmospheric CO 2 variability. These simulations provide a reference for Observation System Simulation Experiments (OSSEs) 8 and Quantitative Network Design (QND) 9 . In OSSEs and QND studies, synthetic observations extracted from nature runs are used to assess the impact of different observing system configurations 10 . It is envisaged that such a monitoring system will rely on the use of a large variety of measurements including species co-emitted with CO 2 that can help to isolate the fossil fuel emissions 3,11 . The future CO2M (Copernicus CO2 Monitoring) satellite mission is purposely designed to provide a high-resolution imaging capability to detect CO 2 emission hotspots with high-precision observations of atmospheric CO 2 concentrations 2,3,12 . CO2M will complement a constellation of satellites 4 and a global in situ network 5 to quantify the atmospheric CO 2 variability from which emissions will be derived with atmospheric inversion systems.
Simulating a realistic distribution of CO 2 and co-emitters depends on the representation of the surface fluxes, chemical sources/sinks, and atmospheric transport. Here we use the Copernicus Atmosphere Monitoring Service (CAMS) high-resolution forecast of CO 2 , CH 4 and CO (https://atmosphere.copernicus.eu/charts/cams/carbon-dioxide-forecasts) which has been demonstrated to produce realistic and accurate variability of carbon weather [13][14][15] . The configuration of the nature run is shown in Fig. 1. Note that the CHE nature run is a free-running tracer simulation, unlike the CAMS high-resolution forecast, which is initialised daily from an atmospheric composition analysis.
The CHE nature run aims to support scientific studies that will shed light on the challenges of estimating CO 2 emissions with the goal to build a CO 2 monitoring and verification support capacity 3 . These challenges span a wide range of aspects from sparse observing systems, consistency between ocean/land observations from different satellite-view modes 16 , large variability in the biogenic signal 17 , large representativity errors in anthropogenic emissions 13 , transport errors 18 and stringent requirements of high-accuracy observations to estimate small signals with respect to large background values 16,19 . This global high-resolution dataset can provide a reference for testing different approaches to address those challenges.

Methods
Modeling framework. The CAMS high resolution forecasting system at the European Centre for Medium Range Weather Forecasts (ECMWF) 13,14,20 has been used to produce the nature run dataset which includes simulations of CO 2 , CH 4 and CO as illustrated in Fig. 1. It is based on the Integrated Forecasting System (IFS) model cycle 46R1 used to produce the operational weather forecast from June 2019 to June 2020 21 . The model has a reduced octahedral Gaussian grid 22 with a resolution of Tco1279 (corresponding to approximately 9 km) and 137 model levels. The simulations have been produced by running a sequence of 1-day IFS forecasts of the carbon tracers and weather. The weather forecasts are initialized with state-of-the-art re-analysis of meteorological fields (ERA5) 23 . The atmospheric tracers start from the CAMS re-analysis 24,25 initial conditions at the initial date of the dataset and from then onwards they are cycled from one forecast to the next in a free-running style. The different model components for the carbon weather forecast in the IFS, including the representation of the emissions for the different tracers, are listed in Table 1. All the emissions at the surface are prescribed except for the CO 2 biogenic fluxes which are modelled online 26,27 , providing consistency between the response of fluxes to atmospheric conditions and tracer transport 28 . There are various differences with respect to the CAMS operational high-resolution forecast in 2015: improved anthropogenic emissions [29][30][31] and natural CO 2 ocean fluxes 32 ; as well as an improved IFS model version 21 and initial conditions [23][24][25] . 
The configuration of the simulations with daily re-initialisation of the weather forecast and free-running tracers ensures consistency of the tracer evolution throughout the simulation by avoiding jumps in their concentrations brought by the assimilation of observations in the analysis, while maintaining a realistic and accurate simulation of their atmospheric transport and variability of the underlying biogenic fluxes from the model 26,27 .
Model output. The standard parameters available from the CHE nature run dataset are listed in Table 2 and Table S1 in Supplementary Information file 1. Additional experimental tagged tracers are provided to characterize the atmospheric enhancement associated with the natural surface fluxes and anthropogenic emissions (Table 3). The enhancement can be computed by subtracting the concentrations of the background tracer without the specific emission/flux from the tracer concentration with the flux/emission. This assumes that the transport is linear. It is worth noting that artificial negative enhancements can occur in the vicinity of plumes due to numerical oscillations associated with the cubic interpolation of the advection scheme around very steep gradients. This can be considered a numerical error in the simulation. The CO 2 tagged tracers are simulated without applying any mass fixer in order to ensure the signal comes only from the flux. The tagged tracers provide the enhancement during each 1-day simulation as they are re-initialised every day at 00 UTC in order to avoid growing errors associated with the mass conservation 33,34 . This means the flux enhancement is reset to zero at 00 UTC. Detailed information on those tracers is provided in Table 4. Figure 1b provides an overview of the different types of model output from the CHE nature run dataset and how these can be compared to other datasets including various types of observations 5,35,36 as well as atmospheric inversions/simulations of carbon tracers 9 .
Such a comparison can shed some light on the different components of the uncertainty in the simulations of carbon tracers coming from the surface fluxes, the atmospheric transport and the representativity error associated with the limited model resolution 14 . A complementary lower resolution ensemble of simulations 18 (25 km in the horizontal) has also been produced using the same model setup which provides information on emission uncertainty 30 , transport uncertainty and impact of meteorological uncertainty on biogenic fluxes. Two other major sources of uncertainty stem from the initial conditions of the carbon tracers at the beginning of the simulation 24,25 and the biogenic flux model 26,27 . An estimation of these uncertainties is provided in the Technical Validation section.
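The enhancement calculation described under Model output (tracer with the flux minus the background tracer without it) can be illustrated with a minimal Python sketch. The function names and the sample values are hypothetical and are not part of the dataset or its tooling; the sketch only shows the subtraction and a simple way to count the negative enhancements that the text attributes to numerical oscillations near steep gradients.

```python
def flux_enhancement(with_flux, background):
    """Per-point enhancement: tracer with the flux minus the background tracer."""
    if len(with_flux) != len(background):
        raise ValueError("both tracers must be sampled at the same points")
    return [w - b for w, b in zip(with_flux, background)]


def negative_fraction(enhancement):
    """Fraction of points with a negative (numerically spurious) enhancement."""
    return sum(1 for e in enhancement if e < 0) / len(enhancement)


# Hypothetical CO2 values (ppm) along a transect crossing a plume edge.
tracer_with_flux = [400.0, 400.2, 403.5, 399.9, 400.0]
background_tracer = [400.0, 400.0, 400.0, 400.0, 400.0]

enhancement = flux_enhancement(tracer_with_flux, background_tracer)
spurious = negative_fraction(enhancement)
print(enhancement, spurious)
```

Because the tagged tracers are reset at 00 UTC, such a difference would represent the enhancement accumulated within a single 1-day forecast rather than over the whole year.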
Example: Using tagged tracers to characterise anthropogenic plumes over land and ocean. In order to monitor anthropogenic CO 2 emissions, it is crucial to observe the CO 2 plumes emanating from the emission sources. These observations need to be based either on targeted field campaign observations 13 or on high resolution imaging satellites 10 . As satellites have different viewing geometries over land and ocean 16 , it is very important to understand how many of these plumes are located over land, ocean and coastal regions. Moreover, satellite observations only provide total column CO 2 over cloud-free regions. Table 4 provides an example of statistics on the proportion of anthropogenic plumes accumulated over a 24-hour period over land/ocean and the proportion of plumes under cloudy conditions for January and July 2015. These fossil fuel tagged tracers and other tagged tracers associated with the biogenic fluxes, ocean fluxes and biomass burning emissions are all included in the CHE nature run dataset (see Table 3).
Example: Insights into total column variability. The CO 2 , CH 4 and CO observing system is based on in situ observations, at the surface or from tall towers, and remote sensing observations from ground-based stations or satellites providing partial/total column observations. There are currently very few vertical profile observations from aircraft 37,38 and AirCore measurements 36,39 that can be used to link the two observation types. For low-resolution transport models assimilating both surface and total column observations in an atmospheric inversion framework, it can sometimes be challenging to combine the surface and total column variability for various reasons. These include errors in the remote sensing observations 16 , representation errors near the surface and model transport errors associated with vertical mixing 40 , atmospheric chemistry 41 , as well as long-range transport 42 and the impact of stratospheric intrusions 43 . The global nature run can be useful to characterize the column variability of carbon tracers 44 associated with transport. Figure 2 illustrates the potential use of the CHE nature run to explain the variability of XCO 2 , XCH 4 and XCO at 24 TCCON sites (https://tccondata.org). The coefficient of determination shows that the variance of the total column can be explained by the different layers in the column in the nature run. When the column is well mixed, the contribution from the different layers is similar. At the sites where the influence of local emissions or natural fluxes is strong, the layers near the surface dominate the variability. Long-range transport in the free troposphere and upper troposphere/lower stratosphere also plays an important role, as depicted by the green/orange bars with higher r 2 values than the near-surface layers in purple/red.
The dataset can also be used to assess the important contribution of the stratosphere in the variability of XCH 4 45 .
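The layer-by-layer diagnostic behind Fig. 2 can be sketched as follows, assuming the coefficient of determination is the squared Pearson correlation between a linearly detrended partial-column series and the linearly detrended total-column series. The exact detrending method used by the authors is not stated here, so linear least-squares detrending is an assumption, and all series below are hypothetical.

```python
def detrend(series):
    """Remove the least-squares linear trend from an evenly sampled series."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]


def r_squared_percent(x, y):
    """Squared Pearson correlation between two equally long series, in percent."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 100.0 * sxy * sxy / (sxx * syy)


# Hypothetical anomalies shared by the total column and a free-troposphere layer.
base = [0.0, 1.0, -1.0, 2.0, -2.0, 0.5]
total_column = [400.0 + 0.1 * t + b for t, b in enumerate(base)]
ft_layer = [405.0 + 0.2 * t + 2.0 * b for t, b in enumerate(base)]
sfc_layer = [410.0, 412.0, 409.0, 411.0, 413.0, 410.0]

r2_ft = r_squared_percent(detrend(ft_layer), detrend(total_column))
r2_sfc = r_squared_percent(detrend(sfc_layer), detrend(total_column))
print(r2_ft, r2_sfc)
```

In this constructed case the free-troposphere layer shares its detrended anomalies with the total column, so its r 2 is close to 100%, while the unrelated near-surface series explains less of the variance.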

Data Records
The CHE nature run dataset can be accessed through the ECMWF API following the examples provided in 46 . The data can be extracted on the native octahedral grid with the original resolution (Tco1279, corresponding to approximately 9 km) or on a regular latitude/longitude grid at a user-specified resolution. Both GRIB and NetCDF formats are available. The dataset extends from 26 December 2014 to 31 December 2015. The list of contents is provided in Table 2. All meteorological and tracer fields and surface fluxes have been archived with 3-hourly time steps with respect to the 00 UTC initialization of the weather forecast.
Step 0 of all the meteorological parameters represents the initial conditions taken from ERA5 23 . Atmospheric species (CO 2 , CH 4 and CO) at step 0 are equivalent to tracers from the previous day at step 24, because they are free-running from one 1-day forecast to the next as illustrated in Fig. 1. Note that the emissions of CO and the CO 2 emissions from aviation are not stored in the CHE nature run dataset, but they can be obtained from the Copernicus Atmosphere Data Store (https://ads.atmosphere.copernicus.eu).
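When stitching consecutive 1-day forecasts into a continuous tracer time series, the step-0 field of each day duplicates the step-24 field of the previous day. The sketch below builds the list of valid times from forecast dates and 3-hourly steps and skips those duplicates. It assumes that steps 0 to 24 are archived for every day; the function and variable names are illustrative only and do not correspond to the ECMWF API.

```python
from datetime import datetime, timedelta

# Assumed archive steps in hours relative to the 00 UTC initialisation.
STEPS_HOURS = range(0, 25, 3)


def valid_times(first_day, last_day, drop_step0_duplicates=True):
    """Return (forecast date, step, valid time) for each 3-hourly field.

    With drop_step0_duplicates=True, step 0 is kept only for the first day,
    since on later days it repeats step 24 of the previous forecast.
    """
    records = []
    day = first_day
    while day <= last_day:
        for step in STEPS_HOURS:
            if drop_step0_duplicates and step == 0 and day != first_day:
                continue
            records.append((day, step, day + timedelta(hours=step)))
        day += timedelta(days=1)
    return records


# First three forecast days of the dataset.
records = valid_times(datetime(2014, 12, 26), datetime(2014, 12, 28))
times = [valid for _, _, valid in records]
print(len(records), times[0], times[-1])
```

This de-duplication applies to the free-running tracers only. For meteorological parameters, step 0 holds the ERA5 initial conditions and therefore differs from step 24 of the previous forecast.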

Technical Validation
The dataset is based on the state-of-the-art operational NWP and CAMS forecasting system 21,47 which has been proven to produce reliable and accurate atmospheric CO 2 , CH 4 and CO variability [13][14][15] . The CHE nature run focuses on 2015, a year characterised by a pronounced decrease in the terrestrial carbon sink associated with the
Table 3. List of experimental CO 2 tagged tracers from the CHE nature run dataset. Each tracer is identified with a given experimental parameter ID. *** Note that the units of tagged tracers for the total column need to be converted from kg m −2 to ppm as described in 2D Atmospheric Composition parameters.
Fig. 2 Coefficient of determination (r 2 ) [%] of CO 2 , CH 4 and CO total column with different partial layers in the atmospheric column in January and July 2015 at 24 TCCON sites (tccon.org). The atmospheric layers are defined as follows: from surface to 400 m (SFC), from 400 m to 2 km (BL), from 2 km to 5 km (FT), from 5 km to 10 km (UTLS), from 10 km to the top of atmosphere (STRAT). All the column and partial column data have been detrended before calculating the coefficient of determination. All r 2 values shown are statistically significant with p-value < 0.01 except when the r 2 < 0.001.
(2022) 9:160 | https://doi.org/10.1038/s41597-022-01228-2
Example: Evaluation of CO 2 sources/sinks by vegetation. Biogenic CO 2 fluxes associated with vegetation over land can dominate atmospheric CO 2 variability on a wide range of time scales from diurnal, synoptic, seasonal to inter-annual 28 . They are a crucial component for the estimation of the background CO 2 underlying the fossil fuel plumes from emission hotspots. This background CO 2 has not been directly influenced by the plumes emanating from local anthropogenic sources, but it results from the larger-scale fluxes associated with biogenic sources and sinks over land.
The European Eddy Covariance (EC) ecosystem flux data collected and processed by the Integrated Carbon Observation System (ICOS) 53 are used to evaluate the uncertainty of modelled biogenic fluxes in the IFS (Fig. 3) which are bias-corrected 27 in the CHE nature run. These modelled fluxes are also compared to other flux products, such as FLUXCOM 54,55 (extended to include varying diurnal meteorology from ERA5) and the CAMS CO 2 inversion (v18r3) product 56,57 . The EC data were processed and the Gross Primary Production (GPP) and ecosystem respiration (Reco) estimated using the standard methods applied in FLUXNET 58 using the observed Net Ecosystem Exchange (NEE). Fig. 3 shows an overall underestimation of the seasonal cycle of NEE, GPP and Reco at the EC sites with typical errors of around 2 μmol m −2 s −1 . Synoptic-scale errors are smaller while the diurnal cycle has larger errors of around 4 μmol m −2 s −1 (not shown in Fig. 3). This underestimation is exacerbated by the anomalously high NEE and Reco observed during the European drought in 2015 (Fig. SB7.3 49 ). This type of evaluation can be used to understand the source of biogenic flux errors and improve the underlying biogenic models, as well as to quantify the uncertainty of prior fluxes for atmospheric inversions 59 .

Example: Simulation and observation mismatch in the total column of CO 2 , CH 4 and CO. The TCCON data 60 , which are widely used as a reference to evaluate biases in global measurements of CO 2 , CH 4 and CO total column averages (referred to as XCO 2 , XCH 4 and XCO) from space 16 , are used here to assess the inter-hemispheric gradient, seasonal cycle and synoptic day-to-day variability in the nature run dataset (Fig. 4). The large-scale patterns of variability on a monthly scale are generally well represented for the three species. The amplitude of the XCO 2 seasonal cycle is underestimated at most TCCON sites, with the summer trough being 1 to 3 ppm higher than observed. This is consistent with the general underestimation of the biogenic sink during the growing season shown in Fig. 3. XCH 4 is overestimated in spring/summer and underestimated in autumn/winter, due to errors in the seasonality of the chemical sink and emissions (e.g. wetlands, agriculture and biomass burning). XCO is underestimated in winter, which is a common feature in many models and emission data sets 61 , and overestimated in summer/autumn, often caused by the biogenic emissions of isoprene, which have a large impact on southern hemisphere and global background values 62 of CO. Other sources of error are associated with the chemical sources/sinks 61 and fire emissions 63 , as 2015 was an extreme year for CO because of Indonesian fires in autumn 64 . Part of the bias shown in Fig. 4 also comes from the CO 2 , CH 4 and CO initial conditions at the start of the nature run extracted from the CAMS re-analysis 24,25 . The random error in the sub-monthly variability (STDE in Fig. 4), associated with surface fluxes/emissions and atmospheric transport, is generally below 1.5 ppm for XCO 2 , 10 ppb for XCH 4 and 10 ppb for XCO, except at urban sites near emission hotspots such as Pasadena, Tsukuba and Paris.
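The two error measures discussed above can be sketched in a few lines of Python. The sketch assumes that the bias is the mean of the model-minus-observation differences and that STDE is the standard deviation of those differences within a month; the precise definition used for Fig. 4 is not given in this section, and the co-located values below are hypothetical.

```python
from statistics import mean, pstdev


def bias_and_stde(model, obs):
    """Mean and standard deviation of model-minus-observation differences."""
    if len(model) != len(obs):
        raise ValueError("model and observations must be co-located")
    diff = [m - o for m, o in zip(model, obs)]
    return mean(diff), pstdev(diff)


# Hypothetical co-located XCO2 values (ppm) at one site within one month.
model_xco2 = [400.5, 401.5, 399.7, 399.8]
obs_xco2 = [400.0, 400.5, 400.2, 399.8]

bias, stde = bias_and_stde(model_xco2, obs_xco2)
print(bias, stde)
```

Separating the two measures is useful because the bias reflects systematic errors such as the initial conditions or the seasonal cycle of the fluxes, whereas STDE isolates the day-to-day mismatch.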
Example: Fine-scale structure in vertical profiles. The vertical profiles of CO 2 , CH 4 and CO are illustrated in Fig. 5 with a comparison to AirCore observations 36,39 from the National Oceanic and Atmospheric Administration (NOAA) Global Monitoring Laboratory and the lower-resolution CAMS surface in situ inversion dataset 57,65,66 . While most global transport models used in atmospheric inversion systems have horizontal and vertical resolutions too coarse to represent the fine-scale vertical structure, the CHE nature run is able to capture the small-scale anomalies along the atmospheric column from the surface up to the lower stratosphere (50 hPa). The profiles on three consecutive days show the large variability associated with day-to-day synoptic transport, particularly for CO 2 . Capturing this type of vertical variability is important because it reflects the ability of atmospheric transport models to represent vertical mixing and long-range transport. Both need to be accurately represented in atmospheric inversions in order to infer surface fluxes accurately. Examples of anticorrelation between the near-surface CO 2 and XCO 2 are also shown in Fig. 5j (e.g. 7, 9, 15, 20, 21 and 24 June) which are associated with the advection of anomalously high/low CO 2 air in the free troposphere (above 700 hPa) and the opposite decrease/increase of CO 2 near the surface. This emphasizes the importance of tracer transport above the planetary boundary layer in explaining the variability of XCO 2 , as also shown in Fig. 2.

Code availability
The IFS forecast model and the Meteorological Archival and Retrieval System (MARS) software are not available for public use as the ECMWF Member States are the proprietary owners. However, the CHE global nature run dataset and the MARS data extraction features are freely available through ECMWF API (https://www.ecmwf.int/en/forecasts/access-forecasts/ecmwf-web-api) following a registration step (https://apps.ecmwf.int/registration/). The data can be accessed using Python (https://www.python.org). The commands and steps required are detailed in the Supplementary Information file 1 (S2).