A daily high-resolution (1 km) human thermal index collection over the North China Plain from 2003 to 2020

Human-perceived temperature (HPT) describes the joint effects of multiple climatic factors such as temperature and humidity. Extreme HPT events may reduce labor capacity and cause thermal discomfort and even mortality. These events are becoming more frequent and more intense under global warming, posing severe threats to human and natural systems worldwide, particularly in populated areas with intensive human activities, e.g., the North China Plain (NCP). Therefore, a fine-scale HPT dataset in both spatial and temporal dimensions is urgently needed. Here we construct a daily high-resolution (~1 km) human thermal index collection over NCP from 2003 to 2020 (HiTIC-NCP). This dataset contains 12 HPT indices and has high accuracy with averaged determination coefficient, mean absolute error, and root mean squared error of 0.987, 0.970 °C, and 1.292 °C, respectively. Moreover, it exhibits high spatiotemporal consistency with ground-level observations. The dataset provides a reference for human thermal environment and could facilitate studies such as natural hazards, regional climate change, and urban planning.


Background & Summary
Under global warming, the frequency and intensity of extreme temperature events (e.g., heatwaves and cold spells) are increasing in most parts of the world [1][2][3][4][5]. They pose severe impacts on human society and ecological systems [6][7][8][9][10], putting more people at risk of extreme heat or cold 11,12. For example, heatwaves led to more than 70,000 deaths in Europe in 2003 and 55,000 deaths in Russia in 2010 13,14. Extreme cold events also pose a great threat to human health and often cause a higher proportion of deaths than extreme heat 15,16.
Human-perceived temperature (HPT) has attracted much attention in recent years due to its close connection with human health 17,18. HPT is affected not only by air temperature but also by other meteorological variables (e.g., humidity and wind speed). For instance, HPT becomes higher on hot days with high humidity and lower on cold days under the additional influence of wind [19][20][21][22]. Relevant studies revealed that uncomfortable HPT could impair human health, for example through respiratory distress and reduced skin evaporation, and significantly increase the health risks and mortality in heat- and cold-exposed populations under the joint effects of multiple meteorological factors [23][24][25]. Li et al. 26 showed that global HPT increases faster than the actual air temperature. Faster increases in HPT were also confirmed by regional studies 27,28. However, previous studies mainly quantified the human thermal environment and assessed human heat exposure based on meteorological station data and homogeneous climatic datasets [29][30][31][32]. These station-based or raster-based datasets have a coarse spatial resolution (e.g., 0.5° × 0.5° or 2.5° × 2.5°) and insufficient temporal resolution (e.g., monthly), which cannot meet the requirements of studies that need long time series, full regional coverage, and fine spatial scales 28. A seamless HPT dataset with high temporal frequency and spatial resolution is still lacking and sorely needed.
A better understanding of the thermal environment helps prevent or mitigate heat/cold stress effectively 18. Various HPT indices have been proposed to investigate human thermal stress. For example, the wet-bulb temperature (WBT), effective temperature (ET), and heat index (HI) have been widely used for quantitative assessment of the human thermal environment 17,28,[33][34][35]. However, there is still a lack of unified evaluation standards or comprehensive datasets for assessing the human thermal environment and stress 36. Therefore, it is imperative to develop a new collection with multiple indices for evaluating the thermal environment at a fine scale (e.g., at daily frequency and ~0.00833° spatial resolution in a geographical coordinate system, which is equal to around 1 km near the equator).
Previous studies have projected that the morbidity and mortality induced by extreme temperatures may continue to rise, especially in populated Asia [37][38][39]. In particular, the North China Plain (NCP), as one of the most populated and urbanized areas of the world, is suffering from severe thermal stress 40,41. It is located in a typical East Asian monsoon region (113°E-121°E and 34°N-41°N) 40,42 and spans eleven provincial units of China, including one of the regions with the most intense human activities in China, i.e., the Beijing-Tianjin-Hebei region 43. NCP has a total population of 0.3 billion, which accounts for around one-fifth of the population of the country 44. The population exposed to heatwaves in this area was up to 0.66 million persons/day during 1961-2015 45. By examining the long-term changes of major urban agglomerations in China, Wang et al. 46 indicated that HPT in northern China grew faster than that in the south, particularly in NCP. Zhang et al. 47 generated a 1-km-resolution monthly human thermal index collection (HiTIC-Monthly) over China for 2003-2020. However, this monthly dataset cannot detect extreme weather events such as heatwaves and cold spells, which are usually defined based on daily temperatures. The lack of a daily HPT dataset limits accurate heat stress-related studies in this region, and a multi-index dataset of human thermal stress at a daily scale over NCP is urgently needed.
To fill this gap, we construct a daily high-resolution (~1 km) human thermal index collection over NCP from 2003 to 2020 (HiTIC-NCP). Compared with HiTIC-Monthly, it refines the temporal resolution from a monthly to a daily scale. It also incorporates different predictors by including new variables such as daily precipitable water vapor, which reflects the high-frequency (day-to-day) fluctuations of atmospheric humidity that induce excessive heat stress as perceived by the human body. This study is thus of great significance for regional climate change, urban planning, urban thermal environment, and public health research.

Overview.
The data used in this study mainly include meteorological station observations and four gridded datasets. The Light Gradient Boosting Machine (LightGBM) algorithm proposed by Ke et al. 48 is used to produce 12 human thermal indices. The workflow of this study is shown in Fig. 1. First, we derived daily HPT indices from daily values of temperature, humidity, and wind speed based on station observations, and preprocessed the daily gridded datasets. We then extracted the covariate values at the meteorological stations from the gridded datasets.

Meteorological observations.
The station-observed daily data, including surface air temperature, relative humidity, and wind speed from 2003 to 2020, were collected from the China Meteorological Data Service Center (CMDSC, http://data.cma.cn). All station data have been strictly quality-controlled based on the method provided by Xu et al. 49, including homogenizing climate data and processing outliers in the data series. The observational network includes 374 stations across NCP (Fig. 2). In this study, we calculated daily human-perceived temperature from meteorological observations and extracted the covariate values from gridded datasets at the same locations. The data were stored as one record per station per day, then processed and divided into training (80%) and validation (20%) sets for model calibration. The calculation of the 12 human thermal indices is described in Table 1.
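The 80%/20% partition of station-day records described above can be sketched as follows. The text does not say how records were assigned to each set, so the random shuffle, the fixed seed, and the names `split_records`, `train_fraction`, and `seed` are assumptions made for illustration only.

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle station-day records and split them into training and validation lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(records)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # assumption: a purely random assignment
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical records: (station_id, day_index) pairs for 5 stations over 20 days.
records = [(station, day) for station in range(5) for day in range(20)]
train, valid = split_records(records)
```

A purely random split lets the same station appear in both sets; holding out whole stations instead would be a stricter test of spatial generalization, at the cost of fewer training locations.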

Gridded datasets.
Four gridded datasets with a total of seven independent variables (Table 2) are used as input in the LightGBM model. These variables include land surface temperature (LST), population density, elevation, slope, aspect, precipitable water vapor, and day of the year.
LST is closely related to near-surface air temperature 50. In this study, we adopted an LST dataset named MODIS-like LST 51. This dataset was developed based on the Moderate Resolution Imaging Spectroradiometer (MODIS) LST dataset and a gap-filling algorithm 51 to make it seamless at a global scale in 1 km grids from 2003 to 2020. The average root mean squared error (RMSE) of this LST dataset is 1.80-2.03 °C and 1.23-1.45 °C for daytime and nighttime temperatures, respectively. The daily mean LST was calculated by averaging the corresponding daytime and nighttime LSTs. In addition, population distribution and regional population density have been shown to affect regional temperature [52][53][54][55], and we used a high-accuracy population density dataset provided by the WorldPop project team as a predictor 56.
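The daily mean LST described above is a cell-by-cell average of the daytime and nighttime grids. A minimal sketch, assuming each grid is a list of rows in °C with no missing cells (the function name `daily_mean_lst` and the sample values are hypothetical):

```python
def daily_mean_lst(day_lst, night_lst):
    """Average daytime and nighttime LST grids (°C) cell by cell for one day."""
    if len(day_lst) != len(night_lst):
        raise ValueError("day and night grids must have the same number of rows")
    mean_grid = []
    for day_row, night_row in zip(day_lst, night_lst):
        if len(day_row) != len(night_row):
            raise ValueError("day and night rows must have the same length")
        mean_grid.append([(d + n) / 2.0 for d, n in zip(day_row, night_row)])
    return mean_grid


# Hypothetical 2 x 2 grids for a single day.
day = [[30.0, 28.0], [26.0, 24.0]]
night = [[20.0, 18.0], [16.0, 14.0]]
mean_lst = daily_mean_lst(day, night)
```

Because the input LST product is gap-filled, the sketch does not handle missing values; a grid with gaps would need an explicit rule for cells where only one overpass is available.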
The Multi-Error-Removed Improved-Terrain (MERIT) Digital Elevation Model (DEM) 57 with a spatial resolution of 90 m was used in this study to extract slope and aspect. Compared with previous DEM data (e.g., SRTM and AW3D-30m), the errors of MERIT DEM are significantly reduced, and the accuracy in flat areas has been effectively improved [57][58][59]. In addition, precipitation has been proven to affect air temperature and humidity 60,61. Therefore, a global daily precipitable water vapor dataset with a spatial resolution of 1 km was used.
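The text does not state how slope and aspect were derived from the DEM. One common approach, shown here only as an assumed illustration, is central differencing on a regular grid. The sketch assumes rows run north to south, columns run west to east, and aspect is reported as the compass bearing (clockwise from north) of the downslope direction; the name `slope_aspect` and the sample grid are hypothetical.

```python
import math


def slope_aspect(dem, cell_size):
    """Return (slope, aspect) grids in degrees for the interior cells of a DEM.

    dem is a list of rows ordered north to south, each row running west to east.
    Border cells stay None because central differences need both neighbours.
    """
    n_rows, n_cols = len(dem), len(dem[0])
    slope = [[None] * n_cols for _ in range(n_rows)]
    aspect = [[None] * n_cols for _ in range(n_rows)]
    for r in range(1, n_rows - 1):
        for c in range(1, n_cols - 1):
            # Elevation gradients towards the east and towards the north.
            dz_east = (dem[r][c + 1] - dem[r][c - 1]) / (2.0 * cell_size)
            dz_north = (dem[r - 1][c] - dem[r + 1][c]) / (2.0 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_east, dz_north)))
            if dz_east != 0.0 or dz_north != 0.0:
                # Bearing of the steepest descent; flat cells keep aspect None.
                aspect[r][c] = math.degrees(math.atan2(-dz_east, -dz_north)) % 360.0
    return slope, aspect


# Hypothetical 3 x 3 DEM (metres) that rises 90 m per 90 m cell towards the east.
dem = [[col * 90.0 for col in range(3)] for _ in range(3)]
slope, aspect = slope_aspect(dem, cell_size=90.0)
```

In this example the centre cell has a 45° slope and faces west (aspect 270°), since the terrain descends towards the west.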
The LightGBM machine learning algorithm.
LightGBM is a gradient boosting decision tree (GBDT) based machine learning algorithm. Compared with other GBDT-based algorithms such as XGBoost and pGBRT, LightGBM is improved and optimized by the Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) techniques, which allow LightGBM to better retain samples with large gradients without significantly altering the original data distribution and to focus more on undertrained samples 48. LightGBM consumes much less memory and speeds up training by more than 20 times while maintaining nearly the same accuracy 48. Owing to its strong generalizability, fast processing speed, and high accuracy, LightGBM has been widely applied in different fields (e.g., climatology, air pollution, social psychology) 47,[62][63][64][65][66][67]. As shown in Fig. 1, the daily gridded datasets were fed to the trained LightGBM model for HPT prediction.

Hyperparameter tuning.
The accuracy of machine learning relies heavily on parameterization 68. Hyperparameter optimization configures the model by searching over candidate hyperparameter values and selecting the combination that achieves the best performance on the given dataset. A grid search-based hyperparameter tuning method was adopted in our study 69.
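The grid search described above can be sketched generically: every combination of candidate values is scored, and the combination with the lowest validation error is kept. The candidate values below are hypothetical, and `toy_validation_rmse` is a stand-in for the real step of training a LightGBM model and scoring it on the validation set; only `num_leaves` and `learning_rate` are actual LightGBM parameter names. The source does not report the grid or the selection metric that was used.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return the one with the lowest score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation RMSE of a model trained with params
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_rmse(params):
    """Stand-in scorer (assumption): pretends the error is lowest at 63 leaves, rate 0.05."""
    return abs(params["num_leaves"] - 63) / 100 + abs(params["learning_rate"] - 0.05)


# Hypothetical candidate values; the grid actually used is not given in the text.
param_grid = {"num_leaves": [31, 63, 127], "learning_rate": [0.01, 0.05, 0.1]}
best_params, best_score = grid_search(param_grid, toy_validation_rmse)
```

The cost of an exhaustive grid grows multiplicatively with the number of candidate values per hyperparameter, which is why grids are usually kept small or refined in stages.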
Accuracy evaluation.
Three statistical metrics were used to assess the performance of the trained model: the determination coefficient (R²), mean absolute error (MAE), and RMSE. The R², MAE, and RMSE values are computed as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (x_{p,i} - x_{o,i})^2}{\sum_{i=1}^{N} (x_{o,i} - \bar{x}_o)^2}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| x_{p,i} - x_{o,i} \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_{p,i} - x_{o,i})^2}$$

where N represents the number of data points, $x_{p,i}$ and $x_{o,i}$ are the i-th predicted and observed values, respectively, and $\bar{x}_o$ is the mean of the observed values.

Data records
The high spatial resolution (1 km) human thermal index collection at a daily scale from 2003 to 2020 over NCP (HiTIC-NCP) shows high consistency between observations and predictions, and the product has respectable accuracy. HiTIC-NCP contains 216 .rar files, and the individual files decompress into the NetCDF format.
The dataset is available on the general-purpose repository Zenodo at https://doi.org/10.5281/zenodo.7528001 70.

Technical Validation
Overall accuracy of 12 predicted human thermal indices.
We assessed the overall accuracy of 12 human thermal indices over NCP during 2003-2020 in terms of R², MAE, and RMSE. All indices exhibit desirable performance, with the averaged R², MAE, and RMSE being 0.987, 0.970 °C, and 1.292 °C, respectively. This demonstrates that the LightGBM model is effective in predicting human thermal indices (Fig. 6). The R² values range from 0.975 to 0.990, the MAE values are all smaller than 1.4 °C, and the RMSE values are smaller than 1.9 °C (Table 3). In particular, sWBGT has the highest R² value of 0.990, followed by HMI, WBT, and DI, and its MAE and RMSE values are 0.698 °C and 0.926 °C, respectively. ET has an R² of 0.987 and the lowest MAE and RMSE values (i.e., 0.598 °C and 0.785 °C, respectively). WCT and NET exhibit the highest MAEs (i.e., 1.17 and 1.37 °C, respectively) and RMSEs (i.e., 1.61 and 1.84 °C, respectively). This is likely caused by the involvement of wind speed in the calculation of WCT and NET (recall Table 1).

Spatial variations of prediction accuracy.
Figures 7-9 depict the distributions of R², MAE, and RMSE values of the 12 human thermal indices at individual stations of NCP during 2003-2020. As shown in Fig. 7, high R² values (i.e., ranging from 0.970 to 0.995) are seen at most stations across NCP, especially in the central, southwestern, and northern parts. The R² values of NET are relatively lower at most stations, likely because of the inclusion of wind speed. Figures 8 and 9 show that ET has the best results with the lowest MAEs and RMSEs (i.e., < 0.911 °C and < 1.161 °C, respectively, see Figs. 8d, 9d). Compared with the other 11 human thermal indices, NET has higher MAE and RMSE values (i.e., > 1.058 °C and > 1.415 °C, respectively, see Figs. 8k, 9k). Spatially, MAEs and RMSEs exhibit an increasing tendency from low to high elevation, with smaller values in inland areas with lower elevation and relatively larger values in mountainous areas with higher elevation.

Prediction accuracy in different periods.
Figure 10 illustrates the averaged R², MAE, and RMSE values of the 12 indices in individual years of 2003-2020. The R² values of all 12 indices are high in all years (i.e., > 0.955, Fig. 10a). Better annual predictions appear in WBT over 2008-2013 and sWBGT over 2008-2014, with R² exceeding 0.990. The lowest R² is observed in NET in 2003 (i.e., 0.956). Annual averaged MAEs and RMSEs of all indices in nearly all years are less than 1.5 °C and 2.0 °C, respectively (Fig. 10b,c), demonstrating that our product is reliable in all years.

Fig. 1 Framework of generating the daily human thermal index collection over the North China Plain (HiTIC-NCP). HPT: Human-perceived Temperature; LightGBM: Light Gradient Boosting Machine.

Fig. 2 Study area and the spatial distribution of meteorological stations in the North China Plain (NCP). The shaded color indicates elevation, and the black dots denote the stations.

Fig. 6 R², MAE, and RMSE values of 12 predicted human thermal indices over NCP from 2003 to 2020.

Fig. 10 Prediction accuracies of 12 human thermal indices over NCP in individual years from 2003 to 2020.

Table 1. Equations of the 12 thermal indices. T is air temperature (°C) at 2-meter height, RH is relative humidity (%), V is wind speed (m/s), and E_a is actual vapor pressure (kPa). The asterisk (*) means that an adjustment is needed. The unit of all human thermal indices is degree Celsius (°C).

Table 2. Gridded datasets and covariates used in this study.

Table 3. Overall R², MAE, and RMSE values of the 12 human thermal indices over NCP from 2003 to 2020.