Background & Summary

Accurate quantification of gross primary production (GPP) through photosynthesis is essential for studies of ecosystem function, carbon cycle, human welfare, and net-zero carbon emission1,2,3,4. Various methods have been developed to estimate GPP at the global scale, which can be divided into three main categories: enzyme kinetic (process-based) models5,6,7, light use efficiency (LUE) models8,9,10,11,12, and data-driven approaches13,14,15,16,17. While a wide range of global GPP estimates is available, the significant discrepancies in GPP estimates generated by different methods remain one of the most uncertain aspects in quantifying the global carbon cycle18,19,20,21,22. Over the past decade, advances in global remote sensing of solar-induced chlorophyll fluorescence (SIF) have made it possible to inform on vegetation photosynthetic activity at a global scale23,24,25,26,27,28,29,30, providing new opportunities for accurate GPP estimates.

SIF is a small fraction of re-emitted light accompanying the absorption of photosynthetically active radiation (PAR) by excited chlorophyll-a molecules in the spectral range from 650 to 800 nm31. The first approved global mission designed explicitly for SIF measurement of terrestrial vegetation, the FLuorescence EXplorer (FLEX), was selected as the eighth Earth Explorer mission of the European Space Agency and will be launched in 202532. The global SIF datasets currently used are estimated from atmospheric sensors because they have the required spectral resolution and signal-to-noise ratio (details of the sensors are given in Table 1). However, the existing SIF records have long been limited by their low spatial resolution and sparseness in data acquisition. For instance, the Global Ozone Monitoring Experiment-2 (GOME-2)24 and the SCanning Imaging Absorption SpectroMeter for Atmospheric CHartographY (SCIAMACHY)26 provide spatially continuous coverage of SIF but with large footprint size (hence low spatial resolution, Table 1). Conversely, the Greenhouse Gases Observing Satellite (GOSAT)23 and the Orbiting Carbon Observatory-2 (OCO-2)25 offer much smaller footprint size, but with sparse and thus spatially discontinuous measurements.

Table 1 Space-borne instruments currently in orbit enabling SIF estimation.

The above dilemma is partially addressed by the TROPOspheric Monitoring Instrument (TROPOMI) on the Copernicus Sentinel-5P mission thanks to the significantly increased spatiotemporal resolution and data coverage27. TROPOMI has almost global coverage (except for small gaps between footprints) and high spatial resolution (3.5 km × 5.5 km at nadir)33. Compared with the earlier missions, TROPOMI has a daily revisit time to provide a significant increase in the number of clear-sky measurements. However, Sentinel-5P was launched in October 2017, and the TROPOMI SIF data are only available since April 2018, limiting its use for long-term applications.

This study uses machine learning algorithms to reconstruct TROPOMI SIF (RTSIF) for a longer period to alleviate the issue above. RTSIF is generated based on the Caltech TROPOMI SIF data27, the nadir bidirectional reflectance distribution adjusted reflectance (NBAR)34, land surface temperature (LST)35, and land cover data36 from the Moderate Resolution Imaging Spectroradiometer (MODIS), the PAR data37 from the Earth’s Radiant Energy System (CERES), and the vegetation type data38 from the International Satellite Land Surface Climatology Project, Initiative II (ISLSCP II). This dataset extends the time coverage of the TROPOMI SIF data and provides a long-term, high-resolution, and global SIF record. RTSIF is in good agreement with TROPOMI SIF and has been evaluated against the GOME-2 and OCO-2 SIF. We further demonstrate the consistency between RTSIF and tower measured SIF and GPP. The proposed dataset provides a new dataset for SIF evaluation and could benefit related ecosystem, carbon cycle, and net-zero carbon emission studies.

Methods

Framework overview

Figure 1 illustrates the overall framework used to generate RTSIF. Based on the LUE concept, SIF can be expressed as follows according to Zhang et al.39 and Zhang et al.40:

$${\rm{SIF}}={\rm{PAR}}\times {{\rm{fPAR}}}_{{\rm{chl}}}\times {\rm{FE}}$$
(1)

where fPARchl is the fraction of PAR absorbed by chlorophyll (\({{\rm{APAR}}}_{chl}\)) and FE is the fluorescence efficiency. Since SIF originates from the solar energy absorbed by chlorophyll-a molecules41, it is highly correlated with \({{\rm{APAR}}}_{chl}\), the product of \({{\rm{fPAR}}}_{chl}\) and PAR42,43,44. Previous studies have shown that \({{\rm{fPAR}}}_{chl}\) can be estimated from surface reflectance using radiative transfer models45, and thus PAR and surface reflectance have been widely used to reconstruct SIF46,47,48,49,50,51. Previous studies have also shown that the high correlation between SIF and APARchl is limited to unstressed conditions52, while drought and other environmental stresses can affect FE. LST can be used as a proxy of thermal stress in predictive models of SIF53,54,55,56. In this study, we further consider that including biome type may improve the prediction accuracy of the SIF model given the plant structural and physiological differences in different biomes and different photosynthetic pathways in C3 and C4 plants. We finally selected surface reflectance, PAR, LST, land cover, and C3/C4 fraction as input variables for the RTSIF modeling.

Fig. 1
figure 1

The workflow to generate RTSIF.

Data Sets

We used multiple datasets as input to generate RTSIF. All the datasets used are summarized in Table 2 and described in detail as follows.

Table 2 Datasets used in developing the machine learning model for RTSIF and their characteristics.

The Caltech TROPOMI SIF data between March 2018 and December 2020 were used for model training and evaluation. We followed the filtering scheme in the original reference27 to retain daily average clear-sky SIF data with cloud fractions less than 0.1, and excluded the data with a sun zenith angle (SZA) greater than 60° and a view zenith angle (VZA) greater than 70°. The ungridded data through the filtering scheme were aggregated to 0.05° grids at an 8-day resolution, the grid size of which was close to the footprint size of the TROPOMI SIF data. Averaging the multiple observations reduces the uncertainty in the original SIF retrievals by \(\sqrt{n}\) (n is the effective number of observations in the grid cell)25. For each 0.05° grid, only the SIF footprint covering the center of the grid was recorded as valid retrievals, and the SIF values were calculated only when more than four valid retrievals were included. We used the SIF values at 740 nm from the 743–758 nm retrieval window, which is optimal for high retrieval precision and low sensitivity to clouds33.

Ancillary input data including the MODIS land products, the CERES products, and the ISLSCP II products were used to generate RTSIF. The MODIS products included LST (MOD11C135), land cover (MCD12C136), and seven bands for nadir bidirectional reflectance distribution adjusted reflectance (NBAR; MCD43C434). To reduce the uncertainty in the SIF modeling, only high-quality MOD11C1 (QA < 2) and MCD43C4 (QA < 2) data were used and aggregated to an 8-day average. Gap-filling and smoothing algorithms were used to reconstruct the 8-day MOD11C1 and MCD43C4 data57 and replace the poor observations caused by bad atmospheric conditions. We used an updated land cover map (MCD12C1) for each year. PAR data (SYNI PAR37) from the CERES products were used, aggregated to 8-day, and interpolated to 0.05° using bilinear interpolation. The ISLSCP II C4 vegetation map was used for natural C4 vegetation distribution38, assuming that all the vegetation types within each 1° grid cell shared the same C3/C4 ratio.

Data-Driven approach

Extreme Gradient Boosting (XGBoost) is an enhanced version of the machine learning algorithm named Gradient Boosted Decision Tree (GBDT)58. It constructs enhanced trees that can handle complex nonlinear relationships59,60. As a boosting algorithm, XGBoost consists of multiple decision trees, each of which is trained with the residual error of the predicted result from the previous decision tree, and finally iterates the results of all the decision trees before producing the final result. Compared with other traditional GBDT algorithms that only use first-order derivatives, XGBoost performs a second-order Taylor expansion on the loss function between computed results and actual observations to accelerate the convergence of the model during training and provide higher efficiency in finding the optimal solution. In addition, XGBoost has a regularization term to control the complexity of the model, which can effectively avoid overfitting. The TROPOMI SIF and the input variables constitute a dataset containing a large number of data samples (about 36 million). The current machine learning algorithms have difficulties in processing large datasets using existing packages61, while XGBoost employs software and hardware optimization techniques to make it possible to process tens of millions of training data. In this study, XGBoost was implemented using the Python library XGBoost (https://github.com/dmlc/xgboost). Before training, each variable was standardized by its mean and deviation. We split the data into the training group (80%) and the testing group (20%). Many hyperparameters in XGBoost affect the model performance, and a grid search was performed for the hyperparameters with 10-fold cross-validation to find the best combination of the parameters based on the Root Mean Square Error (RMSE) metric62. The optimized hyperparameters are compiled in Supplementary Table S1.

Data Records

Our long-term global SIF dataset, RTSIF, is available at https://doi.org/10.6084/m9.figshare.19336346.v263. The data record contains global RTSIF data from January 2001 to December 2020 at a 0.05°/8-day resolution. There are 46 GeoTiff files per year, one for each 8-day period. The unit is mWm−2nm−1sr−1. The file name RTSIF_ < YYYY > - < MM > - < DD > .tif provides information on the year, month, and start date of the 8-day period. Considering that deserts and glaciers have no vegetation, those pixels are flagged.

Technical Validation

Model validation

We tested the performance of the XGBoost model with the optimal hyperparameters. The model reproduces the TROPOMI SIF with a determination coefficient R2 of 0.916, a RMSE of 0.059 mWm−2nm−1sr−1 during training, and an R2 of 0.907, and an RMSE of 0.062 mWm−2nm−1sr−1 during testing (Fig. 2), suggesting that our optimized XGBoost model is not overfitting. The slope of the fit between the reproduced and observed SIF values is close to 1, indicating that there is no systematic discrepancy. We also investigated the performance of the model for each land cover type defined in the MCD12C1 dataset. For most land cover types, the reproduced and observed TROPOMI SIF values have R2 values over 0.8 (Table. S2).

Fig. 2
figure 2

Performance of the XGBoost model in reproducing TROPOMI SIF over the training and testing data. The shading color represents the density of the scatterplot. Black lines represent the regression slope, and the red dotted lines represent the 1:1 line. The regression is forced to pass the origin. All values are in the unit of mWm−2nm−1sr−1.

We compared RTSIF and TROPOMI SIF for 1° selected grid cells representative of the 12 vegetated biomes (locations shown in Fig. S1b). RTSIF can accurately capture seasonal and interannual variations in TROPOMI SIF for most biome types. The standard deviation in the RTSIF data is typically smaller than that in the originally retrieved TROPOMI SIF, indicating reduced noise in the RTSIF dataset. RTSIF also fills the gaps where no TROPOMI SIF data are available (Fig 3).

Fig. 3
figure 3

Time series of RTSIF and TROPOMI SIF for selected 1° grid cells. All the samples from the training data and the testing data were used. The red line represents TROPOMI SIF, and the blue line represents RTSIF. The error bars represent the standard deviation of the TROPOMI SIF footprint and RTSIF used to generate 1° grid. The MODIS MOD12C1 land cover dataset was used to select these example grid cells. All the values are in the unit of mWm−2nm−1sr−1.

To further illustrate the spatial variation of RTSIF, we show the global mean and maximum values of RTSIF in 2019 (Fig. 4). The average daily SIF has the highest values in the tropics, intermediate values in southern China, central Europe, and the eastern United States, and the lowest values in barren regions. The maximum daily SIF is found mainly in the North American corn belt, South Asia, central Europe, and tropical rainforests, consistent with the high productivity in these regions64. The annual average SIF and the maximum daily SIF show similar spatial patterns as those in TROPOMI SIF.

Fig. 4
figure 4

Spatial pattern of average and maximum (90th percentile) daily values for RTSIF (a and b) and TROPOMI SIF (c and d) in 2019. All the values are in units of mWm−2nm−1sr−1.

Comparison of RTSIF with tower-based SIF

Recently several studies have reported SIF measurements from ground towers65,66,67,68,69, providing a valuable opportunity to verify the temporal variation observed in RTSIF. We compared the tower-based SIF observations at the Southern Old Black Spruce65 (53.98°N, 105.12°W) and the Niwot Ridge sites69 (40.03°N, 105.55°W) with RTSIF. The ground tower SIF data were collected using a scanning spectrometer (PhotoSpec) for far-red (745–758 nm) SIF and retrieved by the singular value decomposition (SVD) method scaled to 750 nm. For comparison, we scaled the ground SIF to 740 nm using a wavelength scaling factor of 1.17 and aggregated the hourly data to the daily timescale51. Our results show good agreement between RTSIF and tower-based SIF (Fig. 5), with an R2 of 0.754 at the Southern Old Black Spruce site and an R2 of 0.84 at the Niwot Ridge site. Although mismatches were found between RTSIF and SIF measurements at the Niwot Ridge Site, which is possibly due to inconsistency between tower footprint and RTSIF pixel size and landscape heterogeneity. RTSIF captures the seasonal changes of the tower-based SIF at both sites well reprocudes, successfully locating the timing of spring onset and autumn senescence.

Fig. 5
figure 5

Comparison between RTSIF, TROPOMI SIF, and tower-based SIF measurements. The red line is plotted using the daily average SIF collected by PhotoSpec, presented as 5-day moving averages. The green dots represent the average of the TROPOMI SIF footprint aggregated to 8-day. The blue dot represents the RTSIF value of the pixel where the site is located. The comparison between TROPOMI SIF and RTSIF with the tower-based SIF was based on 8-day averages (b and d). All the values are in the unit of mWm−2nm−1sr−1.

Comparison of RTSIF with other SIF products

We further compared the RTSIF dataset with the retrievals of OCO-2 SIF and GOME-2 SIF24,25 (Fig. 6). OCO-2 SIF was retrieved at 757 nm, and a wavelength scale factor of 1.56 was required to convert the wavelength from OCO-2 (757 nm) to 740 nm27. We used OCO-2 (2015–2020) and GOME-2 (2007–2019) SIF data and aggregated all the clear-sky and good-quality measurements to 1° with an 8-day temporal resolution by using the same cloud filtering threshold (less than 0.1). All the data show similar seasonal variations in the most selected areas of typical biomes except over broad-leaf evergreen forests. The disagreement is mostly due to the low signal-to-noise ratio of GOME-2, which led the GOME-2 SIF cannot capture seasonal changes (blue lines in Fig. 6b)50. In addition, the large footprint of GOME-2 SIF makes it more sensitive to cloud contamination in subpixels leading to underestimated SIF values70. Notably, GOME-2 SIF showed large fluctuations (even negative values) during the non-growing season at some sites caused by snow contamination (Fig. 6k,i)71,72. RTSIF agrees well with OCO-2 SIF as the training TROPOMI SIF with high signal-to-noise ratios and spatial resolutions has demonstrated agreement with OCO-2 SIF27 and fills the gap where OCO-2 SIF is discontinuous both spatially and temporally.

Fig. 6
figure 6

Time series of RTSIF, OCO-2 SIF, and GOME-2 SIF for selected regions. The blue line represents the RTSIF and the gray line represents the GOME-2 SIF. The red dots represent the OCO-2 SIF measurements which are not continuous. All the values are in the unit of mWm−2nm−1sr−1.

At the global scale, RTSIF shows good agreement with OCO-2 SIF and GOME-2 SIF in most regions with an R2 > 0.7 (Fig. 7a,b). The R2 between RTSIF and OCO-2 SIF is higher than that between RTSIF and GOME-2 SIF due to the reasons mentioned in the previous paragraph. The regression slopes of RTSIF with OCO-2 SIF and GOME-2 SIF are close to 1. However, in regions with persistent cloud cover (e.g., tropical rainforests and Western Europe), the regression slope of RTSIF with GOME-2 SIF is larger than 1 (Fig. 7d), suggesting that GOME-2 SIF is underestimated due to cloud cover in these regions. Although we filter the GOME-2 SIF data with a cloud fraction of 0.1, the large footprint size in GOME-2 SIF (~40 km) makes it impossible to remove all the subpixel cloud contamination51. Because our model is trained with clear-sky data (although these areas usually have high cloud coverage, there are still a large amount of clear-sky data), RTSIF is less affected by cloud cover. In addition, there is no significant increase in noise in the TROPOMI SIF due to the South Atlantic Anomaly (SSA)73, and RTSIF should reproduce SIF values for parts of South America with higher accuracy than OCO-2 and GOME-2 SIF. Overall it can be concluded that RTSIF provides consistent and spatially continuous SIF estimates compared to the other two products.

Fig. 7
figure 7

Comparison of the RTSIF, OCO-2 SIF, and GOME-2 SIF datasets. R2 and regression slope for RTSIF versus OCO-2 SIF (a and c) and GOME-2 SIF (b and d). The regression is forced to pass the origin. The white area represents the barren region. The data between 2015–2020 (OCO-2 SIF) and 2007–2019 (GOME-2 SIF) were used for comparison.

Comparison of SIF with Tower GPP estimates

To further evaluate the RTSIF product, we explored the relationship between RTSIF and GPP using GPP estimates from the FLUXNET 2015 Tier 1 dataset74. The daily GPP estimates were calculated using the average of GPP estimates from the nighttime (GPP_NT_VUT_REF) and daytime (GPP_DT_VUT_REF) partitioning methods75,76. Only the GPP estimates with more than four consecutive days of high quality (QA = 1) measurements were used when aggregated to an 8-day resolution. Considering the inconsistency between the flux tower footprint and the RTSIF pixel size, we only selected sites where the biome type in the RTSIF grid is homogeneous and the same as that at the flux tower site. We finally collected 76 sites from 171 flux sites with more than two years of GPP data. The detailed descriptions of these flux tower sites, including site code, location, and biome type are provided in Supplementary Table S3. There is a linear relationship between RTSIF and GPP in both 8-day and annual timescale (Fig. 8), indicating that RTSIF is tightly related to GPP.

Fig. 8
figure 8

Relationship between RTSIF and FLUXNET GPP at a 8-day timescale (a) and the annual scale (b). The shading color represents the density of the scatterplot. The regression is forced to pass the origin.

To investigate whether the SIF-GPP relationship is universal for different biomes, we compared the relationship between biome-specific RTSIF and GPP (Table S4 and Fig. S2). RTSIF was in good agreement with GPP for almost all biomes at the 8-day timescale, indicating strong SIF-GPP correlations for different biomes. The agreement between RTSIF and GPP was good at the annual scale in mixed forests, woody savannas, savannas, and grasslands. GPP and RTSIF showed an overall regression slope of 15.343 (g C m−2  day−1/mWm−2  nm−1  sr−1) in the 8-day timescale and 15.467 (g C m−2 day−1/mWm−2  nm−1  sr−1) in the annual timescale, with different biomes showing significant differences. Specifically, a larger slope was found in evergreen needleleaf forests due to their distinct canopy structure, resulting in stronger reabsorption of SIF.

Temporal patterns of the long-term RTSIF

We further investigated the seasonal variation of RTSIF. Fig. 9a demonstrates the seasonal variation of RTSIF in different latitudes. The northern and southern hemispheres show clear seasonal variations with repeated high values in summer. On the other hand, the tropical regions show persistently high SIF values across seasons. Globally averaged SIF shows clear seasonality (Fig. 9b).

Fig. 9
figure 9

Seasonal and interannual variation of daily SIF. (a) Latitudinal averages of SIF for each 8-day period. (b) The global average of SIF for each 8-day period. All the values are in units of mWm−2nm−1sr−1.

Between 2001 and 2020, the annual average of SIF increased in China and India, and decreased in parts of the tropical rainforest (southern Amazonia and eastern Brazil), consistent with findings in previous studies77,78,79,80 (Fig. 10a). The global average annual RTSIF over the last 20 years has a significant positive trend (0.3% yr−1, p < 0.01), consistent with those observed in other reconstructed SIF products47,50 (Fig. S3). The interannual variability and positive trend of RTSIF are similar to those observed for MODIS EVI (enhanced vegetation index)81 and VPM GPP57, but RTSIF shows larger interannual variabilities (Fig. 10b).

Fig. 10
figure 10

(a) Spatial distribution of the trends of annual average RTSIF during 2001–2020. Sen’s slope estimator is used to calculate the trend. Dots represent the locations where the trend is significant (p < 0.05) through a Mann–Kendall test. All the values are in the unit of mWm−2nm−1sr−1yr−1. (b) Inter-annual variations and trends of normalized global average RTSIF, EVI, and VPM GPP from 2001 to 2020.