A high-resolution daily gridded meteorological dataset for Serbia made by Random Forest Spatial Interpolation

We produced the first daily gridded meteorological dataset at a 1-km spatial resolution across Serbia for 2000–2019, named MeteoSerbia1km. The dataset consists of five daily variables: maximum, minimum and mean temperature, mean sea-level pressure, and total precipitation. In addition to daily summaries, we produced monthly and annual summaries, and daily, monthly, and annual long-term means. Daily gridded data were interpolated using the Random Forest Spatial Interpolation methodology, based on using the nearest observations and distances to them as spatial covariates, together with environmental covariates to make a random forest model. The accuracy of the MeteoSerbia1km daily dataset was assessed using nested 5-fold leave-location-out cross-validation. All temperature variables and sea-level pressure showed high accuracy, although accuracy was lower for total precipitation, due to the discontinuity in its spatial distribution. MeteoSerbia1km was also compared with the E-OBS dataset with a coarser resolution: both datasets showed similar coarse-scale patterns for all daily meteorological variables, except for total precipitation. As a result of its high resolution, MeteoSerbia1km is suitable for further environmental analyses.

Observational source data. OGIMET and Automated meteorological stations in Vojvodina (AMSV) are two observational datasets from which daily meteorological variables from the OGIMET data were used as dependent variables in the modelling process, while AMSV data was used for evaluation of the MeteoSerbia1km dataset.
OGIMET. OGIMET 4 is a Weather Information Service which provides data that includes historical daily summaries from surface synoptic observation (SYNOP) reports starting from the year 2000. SYNOP reports are  www.nature.com/scientificdata www.nature.com/scientificdata/ meteorological alphanumeric messages for reporting observations from more than 10,000 meteorological stations around the world. Reports are mostly available every 6 h (00, 06, 12 and 18 UTC), but for some stations every 3 or 1 h. The format of these reports is standardized and defined by the World Meteorological Organization (WMO). OGIMET daily summaries from 61 SYNOP stations, of which 28 are in Serbia, were used for the spatial interpolation of meteorological variables (Fig. 1). The remaining 33 stations in a 100-km buffer around the Serbian border were used for a more accurate spatial interpolation, especially in the areas near the Serbian border.
The outliers for OGIMET precipitation daily summaries that were four times larger than (a) the maximum of the surrounding observations, i.e., observations in a radius of 100 km and (b) the corresponding E-OBS value (see section E-OBS) were detected and removed.
A summary of the statistics for each of the meteorological parameters is given in Table 2. Gridded source data. The DEM (Fig. 1), topographic wetness index (TWI) and IMERG gridded data were used as independent (auxiliary) variables (covariates) in the modelling process for the daily meteorological variables, while the E-OBS dataset was used for evaluation of the MeteoSerbia1km dataset.

DEM and TWI.
A DEM at a spatial resolution of 1 km was created by combining SRTM 30 + and ETOPO DEM. A TWI at a spatial resolution of 1 km was derived from the SAGA GIS TWI algorithm and DEM. DEM and TWI both have a 1-km spatial resolution.
IMERG. IMERG 26 is an algorithm that combines information from multiple sources, such as satellite microwave precipitation estimates, microwave-calibrated infrared satellite estimates, precipitation gauges, and other precipitation estimators to estimate precipitation over the majority of the Earth's surface. One of the IMERG products is maps (grids) of daily precipitation estimates. The IMERG final run version V06B precipitation estimates were used for developing the PRCP model. IMERG estimates are a space-time covariate with a spatial resolution of 10 km and temporal resolution of one day. Earlier versions of the IMERG dataset, based on GPM, covered the period from 2014, but starting from version V06B, IMERG includes TRMM preprocessed data going back to June 2000. The IMERG dataset was used as a coarser scale covariate for precipitation. Therefore, the IMERG dataset was resampled to a 1-km spatial resolution using bilinear interpolation and DEM as a base layer. 27 is an ensemble dataset constructed through a conditional simulation procedure. For each of the 100 members of the ensemble, a spatially correlated random field is produced using a pre-calculated spatial correlation function. The mean across the members is calculated and is provided as the "best-guess" fields. E-OBS is a daily dataset with a spatial resolution of 10 km. Because E-OBS is based on observations from ECA&D and SYNOP meteorological stations, it was used for comparison with the daily MeteoSerbia1km dataset and the detection of precipitation outliers.

E-OBS. E-OBS
RFSI. RFSI 22 is a novel methodology for spatial interpolation based on the random forest machine learning algorithm 28 . In comparison with other random forest models for spatial interpolation, RFSI uses additional spatial covariates: (1) observations at n nearest locations and (2) distances to them, in order to include the spatial context in the random forest. RFSI model predictions can be written as:  where z(s ) 0 is the prediction at prediction location s 0 , = … x i m (s ) ( 1, , ) i 0 are environmental covariates at location s 0 , z(s i ) and d i are spatial covariates (i = 1, …, n), where z(s ) i is the i-th nearest observation from s 0 at location s i and d i = |s i - s 0 |. These spatial covariates proved to be valuable extensions for the random forest algorithm in improving its spatial accuracy. A detailed description of RFSI, including its performance and implementation procedure, is provided by Sekulić et al. 22 .
Model development and prediction. In order to prepare the data for RFSI modelling, all of the environmental covariates were overlaid with training observation locations for each day. Then, RFSI spatial covariates were www.nature.com/scientificdata www.nature.com/scientificdata/ created in the following way: for each day and for each training observation location, n nearest training observation locations were found and n pairs of covariates-observations at n nearest locations and distances to themwere calculated. Extracted overlaid values and n pairs of spatial covariates were assigned to the corresponding observations, making a dataset which was then used to fit an RFSI model.
Predictions were made in a similar manner to the development of the RFSI model. For each of the desired prediction days and locations (in this case pixels of the target grid), environmental covariates were extracted and observations at n nearest training locations and distances to them were calculated. Then, predictions were made using extracted values and n pairs of spatial covariates, and an already fitted RFSI model. The entire process of making the RFSI model and making predictions is presented in Fig. 2. It should be noted that the RFSI model can handle both regression and classification tasks.
Model tuning. In order to achieve the best possible prediction accuracy, hyperparameters for the RFSI models were tuned. The tuned hyperparameters were the number of variables to possibly split at each node (mtry), minimal node size (min.node.size) and ratio of observations-to-sample in each decision tree (sample.fraction), and the number of nearest observations (n.obs). The number of trees (ntree) hyperparameter was fixed and set to 250, according to Sekulić et al. 22 , as a larger value of ntree would not improve the RFSI model's accuracy. The splitrule (splitrule) hyperparameter was also fixed and set as variance for regression tasks, and gini index for classification tasks.
The hyperparameters were tuned using 5-fold leave-location-out cross-validation (LLOCV). Here 'leavelocation-out' means that all observations from a specific location (i.e. time series of observations from a station) were in the same fold, and 5-fold means that all of the locations were grouped into 5 groups (folds). Then, each of the folds was used once for validation. By doing so, the accuracy of the targeted spatial prediction was assessed 29 . Many different combinations of hyperparameters were tested and for each combination, 5-fold LLOCV was performed. In other words, for each of the hyperparameter combinations, the entire dataset was split into 5 folds. Each of the folds once represented a test fold, while the four remaining folds were used to fit the RFSI model with a hyperparameter combination. Finally, RMSE was adopted as a criterion for the selection of optimal hyperparameters. The RMSE was calculated for the entire dataset after the 5-fold LLOCV process, i.e., based on all observations and predictions from all 5 folds.
Modelling of daily meteorological variables. Temperature. Modelling the daily temperature variables, Tmax, Tmin, and Tmean, is a regression task. All daily temperature RFSI models are as follows: max min mean R n obs n obs , , where T max,min,mean (s 0 ) is the daily temperature (Tmax, Tmin, and Tmean) prediction at prediction location s 0 , f R denotes an RFSI regression model, GTT is the geometrical temperature trend, a function of latitude and day of the year (which was shown to be a valuable covariate for Tmax, Tmin and Tmean) 30 , DOY is a temporal covariate, i.e., the day of the year, and IDW is a local inverse distance weighting prediction based on the n.obs number of nearest observations (excluding the observed location). The tuned hyperparameters for each of the daily temperature models are given in Table 3. The IDW exponent (p) was also tuned. The n.obs hyperparameter was 10 for the Tmax model and 9 for the Tmin and Tmean models. www.nature.com/scientificdata www.nature.com/scientificdata/ Sea-level pressure. Modelling the daily SLP is also a regression task. The SLP RFSI model has fewer covariates than corresponding temperature models: where SLP(s 0 ) is the daily SLP prediction at prediction location s 0 . The tuned hyperparameters for the daily SLP model are given in Table 3. The n.obs hyperparameter was 9.
Precipitation. PRCP was modelled in two steps, i.e., with two models: (1) a classification model for the daily precipitation occurrence and (2)  where PRCP(s 0 ) is the daily PRCP prediction at prediction location s 0 , f C denotes the PRCP RFSI classification model with 0 and 1 as possible classes, Tmax, Tmin, and SLP are corresponding daily predictions from the MeteoSerbia1km dataset at location s 0 , and IMERG is the corresponding overlayed value from the IMERG dataset at location s 0 . Both precipitation models were fitted on the entire dataset with the same covariates. This means that zero precipitation observations were included in the regression model fitting. One reason for this was to include zero precipitation proximity in the regression model. As seen from Eq. 4, in PRCP prediction, the regression model was applied only in the locations where the classification model predicted the precipitation occurrence (class 1). The tuned hyperparameters for both the daily PRCP classification and regression models are given in Table 3. The n.obs hyperparameter for the classification model was 9, and for the regression model was 6.

Data Records
MeteoSerbia1km is a high-resolution daily meteorological gridded dataset for Serbia, consisting of Tmean, Tmax, Tmin, SLP and PRCP variables, for the period 2000-2019. As an example, prediction maps for July 27, 2014 are presented in Fig. 3. In addition, monthly and annual averages (totals for PRCP) were generated by aggregating daily datasets. Then, daily, monthly, and annual LTM were generated by averaging daily, monthly and annual datasets. Since the first five months of the year 2000 were missing from the IMERG dataset, the daily and monthly PRCP averages start from June, 2000. Therefore, the daily and monthly PRCP LTMs were calculated without the first five months of the year 2000, and PRCP annual averages and LTM were calculated without the year 2000. Additionally, only the data for leap years were available for generating the daily LTM for February 29.
The OpenStreetMaps country border (https://osm-boundaries.com/https://osm-boundaries.com/) of Serbia was used to ensure that the MeteoSerbia1km dataset covers the territory of Serbia. The entire dataset is at a 1-km spatial resolution, and is available in both, WGS84 and UTM34N projections. The dataset is stored in the GeoTIFF (.tif) format. Units of the dataset values are • temperature (Tmean, Tmax, and Tmin) -tenths of a degree in the Celsius scale (°C) • SLP -tenths of a mbar • PRCP -tenths of a mm Furthermore, all dataset values are stored as integers (INT32 data type) in order to reduce the size of the GeoTIFF files, i.e., temperature values should be divided by 10 to obtain degrees Celsius, and the same for SLP and PRCP to obtain millibars and millimeters.
The file naming convention adopted is provided in Table 4. It should be noted that the naming convention is different for different products with different temporal resolutions.
The dataset can be downloaded from ZENODO 31

technical Validation
Validation of daily datasets. The daily MeteoSerbia1km dataset was validated using nested 5-fold LLOCV, which combines nested k-fold 32 and leave-location-out cross-validation. For nested 5-fold LLOCV, as with the regular 5-fold LLOCV, the entire dataset was split into five folds. Each of the folds was used once for testing, while the four remaining folds were used for hyperparameter tuning with regular 5-fold LLOCV (see the Model tuning section). Four accuracy metrics, namely, the coefficient of determination (R 2 ), Lin's concordance correlation coefficient (CCC) 33 , the mean absolute error (MAE) and the root mean square error (RMSE) were calculated for  www.nature.com/scientificdata www.nature.com/scientificdata/ all daily meteorological variables for the stations in Serbia ( Table 5). Note that the coefficient of determination represents the amount of variance explained by the model: where ESS is the Error Sum of Squares, TSS the Total Sum of Squares, and z s ( ) i the mean of the observations. The SLP model had the highest accuracy, especially for stations in Serbia, followed by Tmax and Tmean. This is due to the fact that SLP and temperature are continuous variables and have strong spatial autocorrelation. Tmin showed slightly lower accuracy than Tmax and Tmean, and PRCP showed the lowest accuracy, which has also been reported in similar studies 27,34 . Furthermore, LLOCV accuracy is lower for stations outside of Serbia because of the well-known edge effect interpolation problem. Therefore, including stations outside of Serbia in LLOCV would not give an objective accuracy assessment of the MeteoSerbia1km dataset and would even reduce the accuracy.
The accuracy of both the two-step PRCP model with classification and the unique PRCP regression model was the same. The advantage of the PRCP two-step model with classification is that zero PRCP values were predicted   www.nature.com/scientificdata www.nature.com/scientificdata/ as exact zeros. Cohen's kappa coefficient 35 for the PRCP RFSI classification in Serbia was 0.779. The confusion matrix is shown in Table 6. In cases where the observed values were zero (class 0), only 4.21% of the final predicted values were larger than 1 mm, and 0.44% of them were larger than 5 mm. If the opposite case was true, in which the predicted values were zero (class 0), only 3.94% of the observed values were larger than 1 mm, and 0.86% of them were larger than 5 mm.   Table 6. Confusion Matrix for the PRCP RFSI classification model from the nested 5-fold LLOCV. Class 0 represents no precipitation, and class 1 represents precipitation occurrence. www.nature.com/scientificdata www.nature.com/scientificdata/ The average RMSE per station for the entire time period is presented in Fig. 4. Stations at the highest altitudes, Kopaonik (1,711 m) and Crni Vrh (1,037 m), had the largest average RMSE for all temperature variables. Additionally, Sjenica (1,038 m) and Zlatibor (1,029 m) had a high average RMSE for Tmin, which is the reason for the lower accuracy in comparison with Tmax and Tmean. On the one hand, microclimatic conditions at higher altitudes affect the temperature behaviour, so that overall spatial autocorrelation, and therefore the accuracy, is lower. On the other hand, the accuracy is higher at lower altitudes, especially in Vojvodina, the northern part of Serbia. This makes temperature datasets particularly suitable for agriculture. The average RMSE for SLP is low and equally distributed for the territory of Serbia, which is confirmed by the overall high accuracy ( Table 5). The average RMSE for PRCP is also equally distributed over the territory of Serbia. The time series of predictions from the nested 5-fold LLOCV and observations for the Belgrade station, for 2014, are presented in Fig. 5. The figure shows that differences between observations and predictions for Tmax, Tmean, and SLP are minor, whereas those for Tmin are somewhat larger, mostly because Tmin is slightly underestimated, as reflected in the lower accuracy in comparison with Tmax and Tmean (Table 5). For PRCP, the days without precipitation are predicted well, whereas the days with precipitation are slightly underestimated.
comparison with E-OBS. The E-OBS dataset was taken as a benchmark dataset because it was made by geostatistical simulation, i.e., spatial interpolation from ECA&D stations, which also includes SYNOP stations. The daily MeteoSerbia1km dataset was aggregated to a 10-km spatial resolution in order to match the pixels (grid) of the E-OBS gridded dataset. Then, for each of the raster pixels, a Pearson correlation coefficient (PCC) was calculated between the pixel time series of MeteoSerbia1km estimations and the pixel time series of E-OBS estimations. Maps of the PCCs for each meteorological variable are presented in Fig. 6. www.nature.com/scientificdata www.nature.com/scientificdata/ The MeteoSerbia1km dataset shows an overall high correlation with the E-OBS dataset for Tmax, Tmin, Tmean, and SLP (0.992, 0.989, 0.993, and 0.922 respectively) and similar coarse-scale spatial patterns, with slightly lower correlation around the Kopaonik and Crni Vrh stations (Fig. 6), where the LLOCV accuracy was the lowest (Fig. 4). The correlation for SLP was lower in the southwestern part of Serbia, probably because of the lack of SYNOP SLP stations in that area (Fig. 4). The MeteoSerbia1km dataset showed the lowest correlation with the E-OBS dataset for PRCP (0.551). The main reason for this is that precipitation is a complex variable, and different models can produce significantly different results. Another reason is that the E-OBS methodology does not include IMERG, which is an important predictor for the PRCP model and, consequently, predictions follow IMERG patterns. Bearing in mind that the accuracy of MeteoSerbia1km and E-OBS PRCP models does not differ much in RMSE and MAE, RFSI PRCP can be valuable for the areas where E-OBS cannot contribute or where a finer spatial resolution of 1 km is needed. Hence, the MeteoSerbia1km dataset describes the local variation of daily PRCP in Serbia better than E-OBS.

Test with stations in Vojvodina.
MeteoSerbia1km was also tested with independent AMSV stations that were not used for making RFSI models. The RMSE between AMSV stations and the corresponding MeteoSerbia1km values over Vojvodina for the period 2005-present period for Tmax, Tmin, Tmean, and PRCP was 1.6°C, 1.8°C 1.2°C, and 3.7 mm, respectively. In comparison with the results from LLOCV for the whole of Serbia (Table 5), the accuracy of MeteoSerbia1km temperature variables is slightly better, while the accuracy of MeteoSerbia1km PRCP is slightly worse. Lower RMSE for PRCP can be taken as a consequence of a denser network of AMSV stations than OGIMET stations and a large spatial variability of PRCP.

Usage Notes
MeteoSerbia1km is the first high-resolution daily gridded meteorological dataset for Serbia at a 1-km spatial resolution. The dataset can be used in a wide range of areas such as agriculture, insurance, forestry, climatology, meteorology, hydrology, ecology, soil mapping, urban planning, or any other research field that needs gridded data with a high spatial resolution.
MeteoSerbia1km is in the GeoTIFF format which makes it interoperable with any GIS software, such as SAGA GIS (http://www.saga-gis.org/), QGIS (http://www.qgis.org), ArcGIS (https://www.arcgis.com/), etc. It should be noted that MeteoSerbia1km values are multiplied by 10, so they should be divided by 10 to obtain values in basic www.nature.com/scientificdata www.nature.com/scientificdata/ units (°C, mbar and mm). Finally, the predictions for some days may show artifacts due to misrepresentation by meteorological stations.
The data are freely available under Creative Commons Licence: CC BY 4.0.
In order to make this work reproducible, a complete script in R and datasets used for the modelling, tuning, validation, and prediction of daily meteorological variables is available via the GitHub repository at https://github. com/AleksandarSekulic/MeteoSerbia1km.