Global spatiotemporally continuous MODIS land surface temperature dataset

Land surface temperature (LST) plays a critical role in land surface processes. However, as one of the effective means for obtaining global LST observations, remote sensing observations are inherently affected by cloud cover, resulting in varying degrees of missing data in satellite-derived LST products. Here, we propose a solution. First, the data interpolating empirical orthogonal functions (DINEOF) method is used to reconstruct invalid LSTs in cloud-contaminated areas into ideal, clear-sky LSTs. Then, a cumulative distribution function (CDF) matching-based method is developed to correct the ideal, clear-sky LSTs to the real LSTs. Experimental results prove that this method can effectively reconstruct missing LST data and guarantee acceptable accuracy in most regions of the world, with RMSEs of 1–2 K and R values of 0.820–0.996 under ideal, clear-sky conditions and RMSEs of 4–7 K and R values of 0.811–0.933 under all weather conditions. Finally, a spatiotemporally continuous MODIS LST dataset at 0.05° latitude/longitude grids is produced based on the above method.

Satellite and climate reanalysis data. We use the level-3 daily global LST products (MOD11C1/ MYD11C1, Collection 6) provided by two polar-orbiting sun-synchronous satellites, Terra and Aqua (10:30 AM/ PM and 1:30 AM/PM local solar time, respectively), from the NASA Earth Observing System, which provide temperature and emissivity values at 0.05° latitude/longitude climate model grids (CMGs) 34 . It has been verified that the accuracy of the improved MODIS collection 6 (C6) is better than that of C5 35 , and the errors are less than 1 K at most sites on a uniform surface. The temperature and emissivity values in MOD11C1 and MYD11C1 are derived by reprojection and by averaging the values in the daily MODIS LST/E product (MOD11B1/MYD11B1) at 6-km equal-area grids in the sinusoidal projection. The LST values aggregated to 6-km grids from those retrieved by the generalized split-window algorithm are used to supplement the LSTs retrieved by the day/night LST algorithm at grids where there are no valid pairs of day and night observations (usually in high-latitude regions). Figure 2 shows the global annual average of the number of days with cloud-free MODIS LST data from 2002 to 2020.
In this work, we also include the climate reanalysis data from ERA5-Land's skin temperature data (hereinafter referred to as ERA5-Land LST data) based on the HTESSEL model 36 . The ERA5-Land LST is the theoretical temperature that is required to satisfy the surface energy balance, which has the same physical meaning as the MODIS LST. ERA5-Land was produced by replaying the land component of the ECMWF ERA5-Land climate reanalysis dataset 37 . The temporal and spatial resolutions of ERA5-Land make this dataset very useful for all kinds of land surface applications. ERA5-Land provides data at 0.1° regular latitude/longitude grids with an hourly output frequency, and users can freely obtain data from the Copernicus Climate Change Service (C3S)

Fig. 1
Flowchart of the two-step framework developed for LST reconstruction.

Data interpolating empirical orthogonal functions (DINEOF) reconstruction method. The
DINEOF method is a self-consistent method that was first developed by Beckers et al. 40 to reconstruct incomplete oceanographic datasets based on the empirical orthogonal functions (EOF) method. Subsequently, Alvera-Azcárate et al. used the Lanczos method 41 to improve the DINEOF approach so that it could be used to reconstruct datasets with large amounts of missing data 42 16 and proved that the DINEOF method could accurately recover missing LSTs. The DINEOF method is an automatic, parameter-free, and self-consistent gap-filling method that is different from the traditional methods used in geoscience. It does not require any a priori knowledge of the original data and has high computational efficiency. The principle of the DINEOF method when applied to LST reconstruction can be explained as follows: Assuming that LST matrix is the initial matrix, which contains the observations and some unknown values corresponding to the missing data, the dimensions of LST matrix are s × t (s is the spatial dimension and t is the temporal dimension). The DINEOF method infers missing data based on the empirical orthogonal functions (EOFs) of the data. Here, we can use the singular value decomposition (SVD) method to obtain the EOFs of LST matrix : where U is the spatial EOF of LST matrix , V represents the temporal EOF of LST matrix , and Σ contains the singular values of LST matrix . Then, the missing data LST ij in LST matrix can be accurately reconstructed via Eq. (2).
ij k n n i n T j n 1 u n and v n are the nth columns of U and V, respectively, corresponding to the singular value ρ n . The specific technical process of reconstructing the missing LST data by using the DINEOF method is as follows: a. Randomly select some valid observations from the original dataset as the validation dataset to prepare for subsequent cross validation and remove the values of the validation pixels from the original dataset. b. Fill the missing data points with zeros to obtain a first guess of the missing data. c. Apply EOF decomposition to the matrix, extract the first k EOFs to reconstruct the original matrix, and replace the missing values by the LST ij obtained with the EOFs. Repeat this process until convergence is reached. d. Then, repeat the above process by increasing the number of computed EOFs from k = 1, 2, …, k optimal until the cross-validation accuracy exceeds the present value, and then determine k optimal as the optimal number of EOFs. e. Add the cross-validation dataset to the original dataset and use the optimal number of EOFs k optimal to repeat the whole procedure for the original dataset. Then, the missing data can be reconstructed accurately.
The spatial dimension s and the temporal dimension t of LST matrix are the key factors that affect the reconstruction accuracy of the DINEOF method. The variations in the DINEOF reconstruction error with different spatial and temporal LST dimensions are shown in Fig. 3. As the spatial dimension increases, the error of an LST reconstructed with the DINEOF method also increases significantly. The reason for this is probably that with the increasing size of the spatial dimension, the number of missing points becomes larger, and it is difficult to find the optimal number of EOFs for all missing points, which may affect the final reconstruction accuracy. In contrast, as the temporal LST dimension increases, the error of a DINEOF-reconstructed LST gradually decreases and becomes stable. Considering the computational efficiency and reconstruction accuracy, we finally determine that the spatial and temporal dimensions for reconstructing the global LST with the DINEOF method are 5° × 5° and 365 days (a whole year), respectively, and we then utilize a sliding window at the global scale and use the average as the final outputs for more robust LST estimation to produce the global spatiotemporally continuous ideal, clear-sky MODIS LST dataset with a 0.05° spatial resolution from 2002 to 2020.

Cumulative distribution function (CDF) matching-based correction.
Since clouds may reduce the solar radiation that reaches the surface, causing an ideal, clear-sky LST to deviate from the LST obtained under cloudy sky conditions, it is necessary to correct the ideal, clear-sky LSTs reconstructed by the DINEOF method to the real LSTs under cloudy sky conditions.
The cumulative distribution function (CDF), also called the distribution function, is the integral of the probability density function, and it can completely describe the probability distribution of a real random variable X. The cumulative distribution function matching (CDF matching) method was first proposed by Calheiros et al. 43 and then used to calibrate radar data, remote sensing observations, and precipitation data 44,45 . The CDF matching method utilizes a certain kind of reliable data to correct and fuse remote sensing data from other sources, which can improve the spatial or temporal resolution and the accuracy of the remote sensing data. The CDF matching method does not change the original relative change pattern of remote sensing data 46 , and it can adjust the overall range of data values to be close to the true value range. ERA5-Land is a reanalysis dataset that combines model data with observations from across the world by using the laws of physics. ERA5-Land LST is an all-weather LST dataset, including clear sky and cloudy sky conditions. Taking these LSTs into account, a method based on CDF matching is proposed here to use the ERA5-Land LST dataset to correct the clear-sky LSTs reconstructed by the DINEOF method to the real LSTs under cloudy sky conditions. The basic principle is as follows.
The LST at a specific time and date can be regarded as consisting of two parts: one is the long-term mean of the temperature at that time (climatological temperature), and the other is the deviation from that climatological temperature due to the weather (anomaly temperature) 17 . First, we calculate the climatological temperatures of the ideal, clear-sky satellite and the -ERA5-Land LSTs by using the reconstructed ideal clear-sky MODIS LST and ERA5-land LST data from 2002 to 2020: clim clear sky clear sky Considering the influence of clouds, the climatological temperatures of the ideal clear-sky satellite may deviate from the real climatological temperatures under cloudy sky conditions. Here, the climatological temperature of the ideal clear-sky satellite is corrected to the real climatological temperatures under cloudy sky conditions by Eq. (5):  is the climatological temperature of the real satellite under cloudy sky conditions, -LST clim clear sky is the mean annual climatological temperature of the ideal clear-sky satellite, and T clim is the mean annual climatological temperature of the ERA5-Land LST. Then, the anomaly temperature of the reconstructed ideal clear-sky MODIS LST and the ERA5-Land LST can be calculated: is the anomaly temperature of the reconstructed ideal, clear-sky MODIS LST, is the anomaly temperature of the ERA5-Land LST, and T(i) is the ERA5-Land LST, which corresponds to the reconstructed ideal, clear-sky MODIS LST.
Since ERA5-Land LST considers the influence of clouds and other factors on LST changes, we can propose a hypothesis: the anomalous temperatures obtained from satellite estimates under cloudy sky conditions and those of the ERA5-Land LST dataset should have the same cumulative distribution function: anom cloudy anom = Therefore, the anomaly temperature of the reconstructed ideal, clear-sky MODIS LST -LST anom clear sky can be corrected to that of the real satellite under cloudy sky conditions LST anom cloudy through the CDF matching method, and then the real satellite LST under cloudy sky conditions LST cloudy after correction can be obtained: Statistical metrics. Here, six statistical metrics are used to quantify the reconstruction performance: the Bias, root mean square error (RMSE), unbiased root mean square error (ubRMSE), root mean square difference (RMSD), unbiased root mean square difference (ubRMSD), and Pearson correlation coefficient ® , which are defined as follows: where n is the total number of samples involved in the comparison, LST ri is the reconstructed LST, and LST oi represents the LST corresponding to a different validation process, which can be the original satellite-derived LST, the ground-measured LST or the ERA5-Land LST. LST ri and LST oi are the means of LST ri and LST oi , respectively.

Data Records
The  Table 1.

Technical Validation
erformance under different land cover types. We randomly selected seventeen validation regions (red rectangles in Fig. 2) around the world according to their land cover types (based on the International Geosphere-Biosphere Programme (IGBP) classification scheme). The DINEOF ideal, clear-sky LST reconstruction method using cloud-free MODIS LST pixels (MYD11C1) was evaluated in these selected regions, and the spatial range of each of these selected regions was 5° × 5°. We eliminated the known clear-sky LSTs and used the DINEOF method to fill in the missing data; then, we compared the errors between the filled data and the original MODIS data. For the above areas with different land cover types, the comparison statistics (overall performance metrics) of the DINEOF method with respect to reconstructing ideal, clear-sky LSTs under synthetic clouds are shown in Fig. 4 and Table 2. Overall, the DINEOF method showed good performance, with an average R of 0.971, a minimum Bias of −0.001 K, a maximum Bias of 0.049 K, and an RMSE between 1.436 K and 2.688 K. The minimum Bias of −0.001 K was achieved in the closed shrubland, while the water areas had the smallest RMSE of 1.436 K, and the deciduous needleleaf forest had the highest correlation of 0.996. The worst correlation coefficient of 0.820 was found for the evergreen broadleaf forest due to it possessing the smallest temperature range throughout the year. From the above statistical metrics, it is shown that the DINEOF method generally effectively reconstructs missing LST information under all land cover types.

Validation against in-situ measurements.
To further evaluate the reliability of the CDF matching-based correction method, we compared the original satellite-derived LSTs and reconstructed LSTs with the in-situ measured data at 12 ground sites (black pentagrams in Fig. 2). As shown in Fig. 2  www.nature.com/scientificdata www.nature.com/scientificdata/ www.nature.com/scientificdata www.nature.com/scientificdata/ where ULR and DLR are the measured upwelling and downwelling longwave radiation, respectively, ε b is the broadband surface emissivity (BBE) at the location of the ground site, and σ is the Stefan-Boltzmann constant (5.67 × 10 −8 Wm −2 K −4 ). The details of the twelve ground sites are shown in Table 3.  Table 4. In Figs. 5(b) and 6(b), at most sites, the satellite-derived LSTs were consistent with the ground measurements, where R ranged between 0.867 and 0.986, the minimum bias was 0.001 K, the maximum bias was −5.595 K, and the RMSE was between 3.243 K and 6.754 K. Although each site had missing satellite observations on different days in 2019 and some sites had missing data for more than 200 days, all missing data were effectively filled in through the DINEOF method and the CDF matching method, completely reconstructing the LST time series data of each site. All of the DINEOF-reconstructed LSTs exhibited errors that were comparable to the errors of the satellite observations, where R ranged from 0.798 and 0.938, the smallest Bias was −0.570 K, the largest bias was 4.159 K, and the RMSE was between 4.326 K and 8.    Table 3. Detailed information about the twelve in-situ measurement sites.

SDS Name
www.nature.com/scientificdata www.nature.com/scientificdata/ www.nature.com/scientificdata www.nature.com/scientificdata/ www.nature.com/scientificdata www.nature.com/scientificdata/ DM sites increased by 0.001-0.042. The Biases and RMSEs at the BON, PSU and HL sites increased slightly, and the R values decreased slightly. The reason for this may be that the 0.1° spatial resolution ERA5-Land LST dataset was resampled to 0.05° to match the spatial resolution of MODIS LST, but the resampling process could not provide sufficient detailed information and may have introduced uncertainty to the CDF matching-based correction process. In addition, Figs. 5(a) and 6(a) exhibits good consistency between the reconstructed spatiotemporally continuous all-weather MODIS LSTs and the ground-measured LSTs. Overall, the combination of the DINEOF reconstruction method and the CDF matching-based correction method is an effective approach for MODIS LST reconstruction.
It is worth noting that in 2019, most of the twelve ground sites had more than 100 days of missed MODIS LST data. In particular, the BON, PSU and SXF sites lacked effective satellite observations for more than 200 days throughout the year, which illustrated the importance of LST reconstruction in practical applications.
Cross validation against the ERa5-Land estimates. After cross validating the LSTs corrected by the CDF matching-based method using the ERA5-Land LSTs at a global scale, we expressed the uncertainties by means of the Biases, RMSDs, ubRMSDs and R values for different land cover types, as shown in the boxplots in Fig. 7. The statistical metrics for the comparison between the daytime LSTs and the nighttime LSTs are given separately. In Fig. 8, the global spatial distribution of the Bias, RMSD, ubRMSD and R values of the corrected LSTs is shown. Figures 7 and 8 demonstrate that for most regions and land cover types in the world, the corrected LSTs exhibited a general consistency with the ERA5-Land estimates.
As shown in Fig. 7, the Biases and ubRMSDs at night were generally lower than those during the day for all land cover types. This may be because LSTs have larger spatial and temporal variabilities during the daytime, resulting in the final reconstruction accuracy being lower than that at night. The largest differences between the corrected MODIS LSTs and the ERA5-Land LSTs were observed over the water areas, where the largest mean RMSD was 7.308 K. The smallest R values between the corrected MODIS LSTs and the ERA5-Land LSTs were observed in the evergreen broadleaf forest-covered areas. In these evergreen broadleaf forest-covered areas, the lush foliage of trees hinders the reflection and emission of radiation from ground objects into space; in contrast, the transpiration of plants results in much water vapor, which also affects the radiation reflected and emitted by the ground. These are all important reasons that affect the observations of the satellite sensor when obtaining LSTs. The lack of effective satellite observations will inevitably affect the final reconstruction accuracy, and the correlation could be very low due to the shrinkage in the variability of LSTs. From the spatial distributions of the statistical metrics, the average Bias was −0.805 K, the average RMSD was 5.636 K, the average ubRMSD was 4.810 K and the average R was 0.863. In the equatorial area, the correlation coefficient R was significantly lower than that in other regions, which may be due to the perennial cloud coverage in these areas, as this results in proportions of missing data that are too high. In follow-up studies, the algorithm needs to be improved to solve these related problems.