Independent verification of reported fossil fuel emissions from atmospheric observations is critical to transparently tracking progress towards the reduction targets formalized in the Paris Agreement1,2,3,4. Uncertainty (1σ) of reported annual fossil fuel emissions is estimated to be around the 5% level5, allowing for methodological uncertainties in deriving CO2 fluxes from energy statistics and incomplete data, with larger uncertainty on decadal timescales due to potential systematic errors in reporting6. At the global scale, fossil fuel emissions can be verified in principle by quantifying all other terms in the global carbon budget5,7 as follows:


where LU is land use and land cover change, O is the ocean sink, B is the natural terrestrial carbon exchange and AGR is the atmospheric CO2 growth rate, and where F includes small contributions from the cement manufacture and the carbonation sink as well as other industrial processes. This approach has been hampered, however, by unexplained discrepancies in the budget on interannual and decadal timescales1,2,3,4,5, as reflected by δ, sometimes called the budget imbalance. In the 2021 assessment by the Global Carbon Project (GCP)5, δ has a near-zero mean but varies strongly from year to year between −0.73 and 0.8 GtC yr−1 (1σ range), with average imbalances as large as 0.6 GtC yr−1 during some decades, corresponding to roughly 6–8% of current fossil fuel emissions.

On interannual timescales, the main contributor to δ is unexplained exchanges of atmospheric CO2 with the land biosphere (Bnet = B − LU)5. Interannual variability in this exchange is mostly controlled by natural climate variability in the tropics and semi-arid extratropical regions8,9. Much of this variability is tied to the El Niño/Southern Oscillation (ENSO), but the complex nature of climate control has so far hampered our ability to fully account for these flux fluctuations. ENSO modulates terrestrial CO2 release via temperature- and moisture-driven changes in net primary productivity and, to a lesser extent, heterotrophic respiration and biomass burning10,11,12,13,14. Soil moisture plays a critical role in net biome exchange through land–atmosphere interactions that amplify temperature extremes9,15,16,17. However, local compensation of water-driven anomalies in photosynthesis and respiration and spatial anticorrelation of precipitation patterns limit the influence of water availability relative to temperature on global carbon exchange13,18. Interannual variability in ocean carbon sequestration is roughly 80% smaller than on land and follows different temporal patterns5,11,19.

There is also substantial unexplained variability in CO2 on decadal timescales5,20. Bastos et al.21 reviewed the unexplained low AGR in the 1940s and suggested a connection to increased land abandonment caused by World War II. Rafelski et al.22 evaluated the same period and identified coincident global cooling as an important contribution to a low AGR. Keeling et al.23 showed that an anomalous decrease in the natural carbon sinks in the mid-1980s was necessary to reconcile the stagnating AGR with a concurrent deceleration in fossil emissions. Keenan et al.24 and Ballantyne et al.25 discussed a temporary slowdown in the AGR after 2002. Both explain the decade-long slowdown by increased carbon sequestration into the terrestrial biosphere linked to the ‘warming hiatus’ and ongoing CO2 fertilization.

Two general approaches have been taken to model land carbon exchanges (LU and B), one using dynamic global vegetation models5,10,11 and one using linear regression of B against observational climate indices, such as average tropical land temperature or ENSO state11,13,14,19,22,26,27. However, both methods leave a substantial fraction of variability in the observations unexplained. While global regression models have the advantage of being less complex and less computationally expensive than dynamic global vegetation models, they do not fully account for (1) spatial heterogeneity in the temperature sensitivity of carbon exchange or (2) variations with season (for example, spring versus summer warming can have opposite effects on plant growth).

Information on spatial and seasonal details in the temperature sensitivity of land carbon exchange is available from the extensive set of atmospheric CO2 records observed across Earth’s surface. On the basis of 196 measurement stations, Rödenbeck et al.19,28,29 derived annually repeating, season-specific gridded temperature sensitivities of land carbon fluxes within an atmospheric inversion framework (‘NEE-T inversion’). The estimated sensitivities are ecologically meaningful and possess predictive power as they are (1) robust against cross-validation and (2) compatible within uncertainties with analogous regression coefficients from fully independent eddy covariance CO2 flux measurements. Rödenbeck et al.29 used these temperature sensitivities to study interannual variability of land carbon fluxes and showed that previously observed changes in temperature sensitivity12 were partially explained by the pattern of warming rather than ecosystem or physiological changes.

Here we make use of these spatially and seasonally resolved temperature sensitivities to improve on previous global regressions of the net land flux (Bnet,mod). We develop a predictive model of global carbon fluxes that can be applied on interannual to decadal timescales to study the stability of carbon cycle dynamics and to verify fossil fuel emissions.

Simple regression model of the net land sink

We estimate the annual net land carbon exchange Bnet (Fig. 1), using the model

$${B}_{\mathrm{net},\mathrm{mod}}=a\times \varGamma +b\times \mathrm{CO}_{2}+c$$

where the term \(a\times \varGamma\) represents temperature-driven carbon fluxes, with \(\varGamma =\mathop{\sum }\nolimits_{i=1}^{n}{\gamma }_{i}{T}_{i}\) being a weighted spatial average of local land temperature anomalies Ti with weights for each grid cell i of n being proportional to the local seasonally resolved temperature sensitivity γi taken from the CarboScope inversion run ‘sEXTocNEET_v2021’19. γi values repeat annually and have been shown to possess predictive power outside this study by Rödenbeck et al.19,29. Temperature anomalies Ti are taken from Berkeley Earth30 and have been decadally detrended by subtracting an exponential moving average with a 2.5-year time constant (Methods). This effectively forces climate-driven fluxes to behave as exchanges with a fast-turnover pool as in Rafelski et al.22. The term \(b\times \mathrm{CO}_{2}\) accounts for a long-term increasing sink, including but not limited to atmospheric CO2 fertilization. Average atmospheric CO2 concentrations are calculated as the mean of CO2 concentrations at Mauna Loa and the South Pole, which closely track global trends in the global NOAA CO2 record from sampling stations in the well-mixed marine boundary layer in recent years31 but are continuous from 1958 to 2021. Following work by Dohner et al.32, the global mean land use flux is assumed constant over the regression period and contained in the regression coefficient c, but c also adjusts for the non-zero pre-industrial values of \(a\times \varGamma\) and \(b\times \mathrm{CO}_{2}\).

Fig. 1: Re-evaluation of Earth’s carbon budget.
figure 1

a,b, Earth’s carbon budget recalculated using our ‘Weighted-T’ regression model of terrestrial carbon exchange and a simple ocean model (a) and a comparison between the observed and our modelled atmospheric growth rate of CO2 (b). Reported annual fossil fuel (Frep, including cement effects) emissions are from the Global Carbon Project 20215. The atmospheric CO2 growth rate (AGRobs) is calculated from the average of observed CO2 concentrations at Mauna Loa and the South Pole observatories39. The ocean sink (Omod) is determined using a pulse-response model driven with observed CO2 concentrations33. The net land carbon exchange (Bnet,mod) is derived from the ‘Weighted-T’ model (see text). Annual average fluxes (dots) are shown at the midpoint of each year and five-year running means (solid lines) are applied to all data to guide the eye. b, Comparison of the observed AGR to the growth rate calculated as AGRmod = Frep − Omod − Bnet,mod.

Source data

The coefficients a, b and c (Table 1) are derived from a linear regression of Bnet,mod against the residual land sink (Bnet,res):


where AGRobs is also calculated from the average of CO2 concentrations at Mauna Loa and the South Pole. Frep is GCP-reported fossil emissions including the cement carbonation sink5, Omod is the ocean carbon uptake calculated using the well-established global pulse-response model by Joos et al.33 and ϵ is the residual of the optimized model. The parameters are tuned to match the overall growth in CO2. The tuning also partly compensates for imperfections in the representation of the ocean sink O, but replacing Omod with the central ocean sink estimate from GCP has little impact (Methods and Extended Data Table 2). In the following, Bnet,mod according to equation (2) (here in combination with Omod) is called the ‘Weighted-T’ model. Values and uncertainties of all coefficients as well as the generalization error (that is, the root mean square error (RMSE) of the model when presented with new data) are derived using jackknifing and cross-validation, respectively, by repeatedly declaring non-overlapping five-year windows as validation data and the remaining 55 years as training data for the regression (Methods). This method prevents overfitting and ensures that the regression model has predictive power over interannual to decadal timescales.

Table 1 Regression results and performance metrics for various model configurations using different inputs for predictor Γ and equivalent calculations for the Global Carbon Project between 1960 and 2020

The Weighted-T model has a cross-validated RMSE of 0.50 ± 0.09 GtC yr−1 (1σ) and can explain approximately 75 ± 6% of the variance in the residual land sink (Bnet,res). RMSE values reduce to 0.16 ± 0.04 GtC yr−1 after filtering the optimized model residual ϵ to decadal timescales. A corresponding RMSE calculation of the budget imbalance δ from process models reported by the GCP5 yields substantially higher values of 0.76 ± 0.11 GtC yr−1 and 0.36 ± 0.14 GtC yr−1 on interannual and decadal timescales, respectively (Extended Data Figs. 1 and 2). For a fair comparison, given that these models are not explicitly tuned, we also detrend δ by subtracting a best fit \(\alpha \times \mathrm{CO}_{2}+\beta\). However, this fit only accounts for 6% of the variance and does not reduce the cross-validated RMSE. We hence consider δ and ϵ to be fundamentally comparable quantities. Performance gains of the Weighted-T model compared to GCP process models are greatest on decadal timescales as demonstrated by the smaller ratio of decadal to interannual RMSE (GCP/Weighted-T interannual: 0.76/0.50 = 1.52; versus decadal: 0.36/0.16 = 2.25). We also repeated the cross-validation with 10, 15, 20 and 30-year windows, and observed a substantial increase in the RMSE only for windows longer than 15 years, supporting that the model has predictive power on decadal timescales.

We also tested two simpler model configurations, replacing predictor Γ with either unweighted tropical mean land temperature (|lat|≤25, ‘Tropical-T model’) or the four-month lagged Niño 3.4 index (‘ENSO model’). The ENSO and Tropical-T models both yield higher RMSE values on interannual and decadal timescales than the Weighted-T model (Table 1). Allowing different lag times for Γ of up to one year for all model configurations does not improve the model fits. Notably, ϵ of the Weighted-T model is not temporally autocorrelated, in contrast to the residuals of the Tropical-T and ENSO models or the GCP budget imbalance3, leading to slower error accumulation and therefore higher predictive power on decadal timescales. Including global mean stratospheric aerosol optical thickness at 550 nm34 as an additional predictor for volcanic influences in the regression did not improve the performance of the Weighted-T model and only modestly improved the performance of the ENSO model, yet the Weighted-T model still outperforms all other model configurations (Extended Data Figs. 3, 4 and Extended Data Table 1). Similarly, excluding the three years affected by the Mount Pinatubo eruption (1991–1993) from our regression analysis improves the performance of all models, but the Weighted-T model continues to have the lowest RMSE among models (Extended Data Table 3).

Surprising stability of the carbon cycle

The atmospheric growth rate of CO2 calculated using the Weighted-T land model (Fig. 1) provides a unified understanding of a wide range of interannual and decadal features previously only considered in isolation. Accounting for spatial and seasonal heterogeneity in the sensitivity of land carbon fluxes to temperature and other correlated climate variables substantially improves our ability to reproduce annual CO2 growth compared to alternative models. The Weighted-T model successfully captures the large ENSO-related impacts on AGR, including the strong El Niño event in 1997–1998 and the subsequent La Niña, while also accounting for the impacts of temperature on the carbon cycle on other timescales. Except for the three years immediately following the eruption of Mount Pinatubo in 1991, the modelled AGR reproduces most decadal features of the observations, including the stagnating AGR in the 1980s following two decades of steady increase, as well as slow AGR growth in the 1990s and 2000s when the land sink and fossil emissions underwent large decadal changes. The poorer performance following Pinatubo indicates that influences neglected by the model, such as impacts of diffuse light on plant growth35, played an important role during this period.

The good performance of the Weighted-T model suggests a perhaps surprising stability of the sensitivity of the net carbon exchange to climate and CO2 forcing at the global level since 1958. The Weighted-T model achieves good agreement with observations despite assuming a constant local land response to temperature and atmospheric CO2 throughout 1958–2021. Although all regression coefficients in the land sink model are constants and the values of the local temperature sensitivity γi from the ‘NEE-T inversion’19 repeat annually, the residual ϵ of the Weighted-T model shows no trend and is homoscedastic (that is, the magnitude of the residual is stable over the time series). Our results therefore contrast with previous studies that highlight substantial changes in the physiological response of the biosphere to temperature12,25,36. Instead, the good agreement with observations obtained with the Weighted-T model supports previous findings that the apparent changes in land carbon sensitivity at the global scale are partly explained by shifts in the pattern of warming29. Although the \(a\times \varGamma\) term of the Weighted-T model accounts only for carbon exchange with fast-responding carbon pools (approximately 2.5-year turnover) which attenuate most but not all decadal variability in temperature, this term nevertheless accounts for much of the decadal behaviour of the model. Mechanistically, the success of the model indicates that short-lived pools dominate variability in land carbon fluxes over a broad range of timescales as suggested previously22. The term \(b\times \mathrm{CO}_{2}\) accounts for the multi-decadal response of the Weighted-T model to increasing CO2, changing climate and other correlated factors, but has less of a mechanistic underpinning and is therefore not suitable for long-term forecasting.

Verifying global fossil fuel emissions

The ability of the Weighted-T regression model to tightly reproduce the interannual variations in the natural part of the global carbon budget also provides a means to verify reported emissions at improved accuracy with observations of atmospheric CO2 concentrations1,3,4. The Paris Agreement has a five-year stocktake cycle in which ongoing emission reduction efforts are reviewed. Reported global emissions (Frep) can be compared to inferred emissions (Fif) calculated by combining equations (1)–(3):

$${F}_\mathrm{{if}}=\mathrm{AGR}_{\mathrm{obs}}+{O}_{\mathrm{mod}}+(a\times \varGamma +b\times \mathrm{CO}_{2}+c)$$

Consider the following counterfactual scenario (Fig. 2): global emissions were reported to be constant at a value of Frep = 8.95 GtC yr−1 after 2010, while in reality emissions continued to increase. In 2015 an interested party could use our regression model to hindcast emissions since 2010 (Fif) using available observations of temperature and CO2. The difference between the inferred Fif and reported Frep is detectable in the cumulative CO2 flux as an unexplained buildup of CO2 in the atmosphere that is already statistically significant by 2015 and continues to grow thereafter (Fig. 2b). In contrast, it would take almost ten years to detect a significant disagreement if, instead of the cumulative emissions (that is, the impact on atmospheric CO2 concentrations), we used the difference in Fif and Frep directly (that is, the impact on AGR, Fig. 2a). This is because cumulative uncertainty increases as the square root of the time difference (that is, \({\sigma }_{\mathrm{cum}}=\sqrt{\Delta t}\times \mathrm{RMSE}\)) while the emission signal (\(S=\Delta t\times F\)) grows linearly, as it is simply integrated over time3. Our scenario illustrates that, if reported emissions stayed constant at current levels of around 10 GtC yr−1, any misreporting of greater than \(\mathrm{CI}_{95 \% }\times \frac{\sqrt{\Delta t}\mathrm{RMSE}}{\Delta t\times S}=1.96\frac{\sqrt{5}\times 0.5}{5\times 10}=4.4\%\) could be detected at the 95% confidence level (\(\mathrm{CI}_{95 \% }=1.96\times{\sigma }_{\mathrm{cum}}\)) after Δt = 5 years and would manifest as a disagreement with observed CO2 concentrations of >1 ppm. Following previous approaches that use the GCP budget terms for emission verification which have a larger and temporally autocorrelated uncertainty given by its budget imbalance3,4, we could instead only confidently detect a two times larger misreporting of emissions (8.8%) over the same window (Methods). As we apply the verification only once (at the five-year window), there is no issue with repeat testing3. Nevertheless, even with the improved performance of the Weighted-T model, we are still unable to detect short-lived emission changes such as the 0.52 GtC reduction in 2020 global emissions compared to 2019 caused by the COVID-19 pandemic5,37.

Fig. 2: Fossil fuel emissions and time-integrated difference between inferred and reported fossil fuel emissions.
figure 2

a, Comparison between reported emissions (black line) and emissions inferred based on observed atmospheric CO2 (red crosses and red line). Uncertainty in annual inferred emissions is given by the RMSE and shown as 1σ and 2σ envelopes around the five-year running mean (dark and light grey, respectively). Annual reported emissions are obtained from the Global Carbon Project5. Beginning in 2020 we also show rescaled monthly emission estimates (green dots; Methods) from Carbon Monitor37 to highlight impacts of the COVID-19 pandemic. The solid blue line illustrates the counterfactual scenario of falsely reported constant emissions after year 2010 (see text). b, Integrated difference between annual inferred emissions (Fif), GCP-reported emissions (Frep, black line) and emissions reported in our counterfactual scenario (blue line). The y-axis can be directly mapped to a deviation between the observed atmospheric CO2 concentrations (in ppm) and concentrations expected from the reported emissions (second y-axis on the left, scaling factor is 2.124 GtC ppm−1, see text). The range of integrated emission differences that are plausible after 2010 if emissions were reported correctly are shown as grey bands (1σ and 2σ) around a dotted blue line. Note that the integration constant is arbitrarily chosen to anchor the time series to zero in 2010. Stratospheric aerosol optical depth at 550 nm (SAOD, orange line and area) is shown as a proxy for volcanic activity34.

Source data

Outside the period affected by the Mount Pinatubo volcanic eruption, Frep and Fif agree remarkably well since 1958, supporting the accuracy of reported emissions Frep. Figure 2 confirms the reported plateauing of emissions in the last decade and our regression model finds no evidence of inaccurate reporting of emissions since the Paris Agreement was adopted in December 2015. The model also does not detect any significant underreporting of emissions in the 1990s and 2000s, as suggested previously using atmospheric data employing a different approach1,2.

Our carbon cycle model can also be applied to study emissions before the start of direct atmospheric CO2 measurements using ice core reconstructions and gridded temperature fields available since 1900 (Fig. 2 and Extended Data Fig. 5). From 1900 until around 1935, annual inferred emissions are 0.59 GtC yr−1 higher than reported emissions, followed by the 1940s where inferred emissions are 0.3 GtC yr−1 lower. Although we cannot exclude contributions from unrepresented multi-decadal variability in the ocean sink and inaccurate temperature and/or CO2 input data, we suggest that these disagreements are most likely explained by neglected variations in land use emissions, which are assumed constant in our regression model but almost certainly varied to some degree over the twentieth century5. LU has a large uncertainty, so 0.59 GtC yr−1 higher LU emissions at the beginning of the twentieth century compared to the average after 1960 are plausible. Bastos et al.21 suggested that LU decreased anomalously in the 1940s because of the socioeconomic turmoil of World War II and explored several hypothetical LU scenarios to account for increased land abandonment. These scenarios yielded changes in AGR of up to 0.39 GtC yr−1, more than sufficient to explain the underestimation of emissions in our hindcast during the 1940s. Previous carbon cycle models, in contrast, produce discrepancies too large to be completely explained by the land abandonment scenarios in the analysis of Bastos et al.21. The improved performance of the Weighted-T model in the 1940s compared to models reviewed by Bastos et al.21 is probably explained by its ability to at least partially capture an increase in land carbon uptake associated with declining surface temperatures between 1940 and 1960, similar to the box model by Rafelski et al.22.

In summary, by combining a simple regression model of the net land sink and a pulse-response ocean model33, we built a framework that explains many of the important interannual and decadal features of the atmospheric CO2 rise since 1960 and also provides a new context for understanding variability from 1900 from 1960. The net land sink model accounts for the effect of CO2 fertilization (and/or other correlated long-term trends) and climate-driven CO2 flux variations which are calculated based on the spatially and seasonally resolved temperature sensitivity of biosphere carbon exchange determined previously from the CarboScope global atmospheric CO2 inversion19. The model framework can be used to verify reported global emissions to within 4.4% at the 95% confidence level over the five-year stocktake window of the Paris Agreement. Model parameters are constant yet the model’s performance over the last six decades is consistently strong, suggesting that there were no large surprises in the carbon cycle. Discrepancies in the early twentieth century point to larger temporal variability in the land use flux before 1960.


Regression model input

In the following we describe the data needed to set up the regression model of the net land sink. The local temperature sensitivity of land carbon fluxes and the ocean carbon sink model need to be evaluated at least at monthly resolution for numerical accuracy and to account for seasonality in plant growth. Therefore, the ocean and land inputs to equation (2) are first calculated monthly and then averaged to annual resolution for the regression analysis.

Rödenbeck et al.19 derived the local temperature sensitivity (γ*) of interannual carbon fluxes in a global CO2 inversion. Extending the classical carbon flux inversion in which time-dependent land carbon exchange is optimized, the ‘NEE-T inversion’ imposes a linear relationship between detrended temperature anomalies and interannual carbon fluxes in each land grid cell. The mean seasonal cycle of the carbon fluxes, the trend in seasonal cycle and the long-term flux trend are accounted for separately in the inversion. γ* is seasonally resolved (with a four-week decorrelation scale) but repeats every year, yielding 52/4 = 13 independent degrees of freedom in time. The resulting γ* field from CarboScope inversion ‘sEXTocNEET_v2021’ is provided on a spatial grid of 2.5° × 2° and at daily resolution, which we average monthly. To obtain the final weighting factors γ(i,mon) used in our regression modelling, we normalize the monthly averaged sensitivity \(\hat{\gamma }\) in each grid cell i out of n total cells according to:

$$\gamma \left(i,\mathrm{mon}\right)=\frac{1}{12}\frac{\hat{\gamma }\left(i,\mathrm{mon}\right)}{\mathop{\sum }\limits_{k=1}^{12}\mathop{\sum }\limits_{l=1}^{n}\hat{\gamma }\left(l,k\right)}$$

Monthly temperature fields were downloaded from Berkeley Earth30 on a 1° × 1° latitude–longitude grid. The data are re-gridded to match the resolution of the Jena CarboScope inversion results and subsequently decadally detrended in each grid cell by subtracting an exponential moving average of past temperatures with a 2.5-year time constant. This avoids causality problems associated with other filters, such as a centred running mean, which effectively includes information about the future, and allows the model to be readily updated because it produces valid output on the leading edge of available data. Subtracting the exponential moving average is essentially the same as assuming that temperature modulates the carbon flux into or out of a carbon pool with a 2.5-year turnover time, paralleling Rafelski et al.22.

For the main configuration of our net land sink model, the ‘Weighted-T’ model, Γ is calculated as \(\varGamma =\mathop{\sum }\nolimits_{i=1}^{n}{\gamma }_{i}{T}_{i}\), the sum of all n gridded temperature sensitivity weights γi and local temperature anomalies Ti over the globe. For alternative configurations of the model, we also calculate global mean land temperature and the Niño 3.4 index from the re-gridded temperature fields and use either for Γ.

De-seasonalized CO2 observations from the South Pole (SPO) and Mauna Loa (MLO, Hawaii) are available at monthly resolution from the Scripps CO2 Program after 195839. Their monthly average has been merged with a spline fit to the Law Dome ice core CO2 record available from the Scripps CO2 website to extend the time series to 180039,40.

The atmospheric growth rate of CO2 (AGR) is calculated as the 12-month-centred difference. For example, the AGR for May 2018 is calculated as the difference in CO2 concentrations between November 2018 and November 2017.

Monthly ocean carbon uptake is calculated using a pulse-response ocean model published by Joos et al.33, which only needs a history of atmospheric CO2 concentrations as input. The model fully accounts for the nonlinearity of carbon dissolution chemistry through a high-order parametrization and represents carbon exchange with the deep ocean via circulation through a simple pulse-response function. The model has been shown to effectively emulate the behaviour of three-dimensional ocean models for the twenty-first century at a fraction of the computational cost. The model is spun up from an equilibrium state in 1800 and with parameter values for the HILDA (HS + LS) model configuration given in Table 2 and section A2.2 of Joos et al.33. Model output is rescaled to match the integrated ocean sink between 1800 and 2021 of 176.03 GtC reported in the Global Carbon Project (GCP)5.

We use annual emissions from fossil fuel combustion and industrial processes reported by the GCP5. To reduce the number of terms in the budget, the influence of the cement carbonation sink is grouped into the Frep estimate.

Carbon Monitor37 began reporting monthly fossil emissions starting in 2019, which we use to highlight effects of the COVID-19 pandemic in Fig. 2. To remove effects of the seasonal cycle from the data and build a coherent time series with emissions reported by the GCP5, we first divided Carbon Monitor emissions after 2020 by their monthly value in 2019 and then multiplied that anomaly by the GCP annual emissions for 2019.

Regression model set-up and uncertainties

First, a regression model is fitted to all input data between 1960 and 2020 to get a best estimate of the coefficients a, b and c and the r-squared value of the model. Then, input data are split into M = 12 non-overlapping blocks of five years to determine uncertainty on model parameters and find the generalization error which measures how well the model performs on data it was not trained on. The grouping into blocks addresses potential concerns about autocorrelation in the data with a timescale of <5 years. M − 1 blocks, that is, the ‘training data’, are used to fit the regression model and obtain parameter estimates, and the remaining block, that is the ‘validation data’, is used to evaluate model performance according to the mean squared error (MSE). The splitting procedure is repeated using each block once as validation data and can be thought of as a combination of the statistical methods ‘jackknifing’ and ‘cross-validation’.

Jackknifing produces a suite of M = 12 subjugate models that were fitted to different choices of training data and used to determine uncertainties for different parameters of the regression model, including coefficients a, b and c and the R2 value. To account for the overlap in training data for different subjugate models, we report uncertainties δx for parameter x based on equation (6):

$${\delta }_{x}=\sqrt{\frac{M-1}{M}{\sum }_{i=1}^{M}{\left({x}_{i}-\bar{x}\right)}^{2}}$$

where \(\bar{x}\) is the mean of parameter x over all subjugate models i.

Cross-validation produces M estimates of the generalization error, that is MSEj, derived from different validation periods \(j=1,\ldots ,M\) (Extended Data Fig. 1). We find the average \(\overline{\mathrm{MSE}}\) and RMSE (\(\mathrm{RMSE}=\sqrt{\overline{\mathrm{MSE}}}\)) of all subjugate models as the best estimate of the generalization error. For decadal RMSE values, the regression residual ϵ is smoothed with a ten-year moving average, including at the edges where this introduces a slight bias, before the MSEj is calculated in the validation period.

The \(\overline{\mathrm{MSE}}\) and RMSE of the GCP5 budget imbalance are calculated similarly. GCP data are split into matching five-year blocks and the MSE is determined for each independent five-year block.

To allow quantitative comparison of different models and a statistical comparison to the GCP, some confidence bounds must be placed on the generalization error. Cross-validation produces a distribution of MSE values (MSEj) obtained for different choices of the five-year validation blocks (j). Although the validation blocks are independent, the training data for each subjugate model overlap strongly. Therefore, the variance of MSEj values captures only some of the uncertainty of the generalization error. To estimate the true generalization error we use a heuristic method which relies on rescaling the variance of MSEj values obtained from cross-validation38:

$$\mathrm{var}\left(\overline{\mathrm{MSE}}\right)\approx \left(\frac{1}{M}+\frac{{n}_{2}}{{n}_{1}}\right)\mathrm{var}\left(\mathrm{MSE}_{j}\right)$$

where n2 is the length of the validation period and n1 is the length of the training period for the model. Using this formula, we obtain values of 0.25 ± 0.08, 0.36 ± 0.12, 0.55 ± 0.31 and 0.58 ± 0.17 (1σ) for the MSE of the Weighted-T model, Tropical-T model, ENSO model and the GCP, respectively. We use the delta method (which involves a mapping according to a first-order Taylor expansion of the derivative) to estimate the corresponding RMSE values.

We perform two sensitivity tests to further investigate the role of volcanic activity and the ocean model in our regression analysis. We repeat the regression analysis including stratospheric aerosol optical depth34 as an additional predictor to better capture the influence of volcanic eruptions on land carbon exchange. Results and regression parameters are presented in Extended Data Table 1. Although the Weighted-T model itself does not benefit from the additional predictor and we observe slightly improved performance of the ENSO and Tropical-T model, the Weighted-T model continues to outperform both. When the period of the Mount Pinatubo eruption (1991–1993) is excluded from the regression analysis instead, the performance of all models increases (Extended Data Table 3) but the results remain qualitatively the same. We also repeat the regression analysis using the mean of GCP ocean sink estimates instead of the Joos et al. pulse-response model. This leads to a minor improvement of the RMSE of all model configurations but does not alter our conclusions (Extended Data Table 2).

To support our claim that the GCP budget imbalance and the residual of our regression model are comparable quantities, we correct the GCP budget imbalance with a fit of the imbalance to the atmospheric CO2 concentration. After cross-validation, this yields an RMSE of 0.8 ± 0.11 which is higher than the value calculated without the additional fitting correction. This shows that the atmospheric CO2 record has no predictive power for the GCP budget imbalance and that our results are not an artefact of additional tuning.

Finally, we repeat the regression analysis after changing the CO2 term to a logarithmic relationship, that is, \(b\times {\rm{ln}}({\rm{CO}}_{2})\), which does not substantially change the performance of the models.

GCP emission verification

Following refs. 3,4, we benchmark our ability to verify reported fossil fuel and land use change emissions from the GCP global carbon budget5 using its residual imbalance as a metric of uncertainty. The budget imbalance is temporally autocorrelated and best represented by an AR(1) process with an AR coefficient of 0.35 as determined by the ARMIA function of the Python statsmodel package. Monte Carlo simulations (n = 200,000) of the GCP budget imbalance represented as this AR(1)-processes find a cumulative uncertainty of 4.4 GtC (95% CI) after five years, or 8.8% assuming ongoing emissions of 10 GtC yr−1.