PPDIST, global 0.1° daily and 3-hourly precipitation probability distribution climatologies for 1979–2018

We introduce the Precipitation Probability DISTribution (PPDIST) dataset, a collection of global high-resolution (0.1°) observation-based climatologies (1979–2018) of the occurrence and peak intensity of precipitation (P) at daily and 3-hourly time-scales. The climatologies were produced using neural networks trained with daily P observations from 93,138 gauges and hourly P observations (resampled to 3-hourly) from 11,881 gauges worldwide. Mean validation coefficient of determination (R2) values ranged from 0.76 to 0.80 for the daily P occurrence indices, and from 0.44 to 0.84 for the daily peak P intensity indices. The neural networks performed significantly better than current state-of-the-art reanalysis (ERA5) and satellite (IMERG) products for all P indices. Using a 0.1 mm 3 h−1 threshold, P was estimated to occur 12.2%, 7.4%, and 14.3% of the time, on average, over the global, land, and ocean domains, respectively. The highest P intensities were found over parts of Central America, India, and Southeast Asia, along the western equatorial coast of Africa, and in the intertropical convergence zone. The PPDIST dataset is available via www.gloh2o.org/ppdist.

(≤3-hourly) resolution, and near-global coverage (typically <60°N/S) 19,20 . However, satellite P retrieval is confounded by complex terrain [21][22][23] and surface snow and ice 22,24 , and there are major challenges associated with the detection of snowfall 25,26 . 2. Courty et al. 27 used P estimates between 1979 and 2018 from the ERA5 reanalysis 28 to derive probability distribution parameters of annual P maxima for the entire globe. Reanalyses assimilate vast amounts of in situ and satellite observations into numerical weather prediction models 29 to produce a temporally and spatially consistent continuous record of the state of the atmosphere, ocean, and land surface. However, reanalyses are affected by deficiencies in model structure and parameterization, uncertainties in observation error statistics, and historical changes in observing systems 5,30,31 . 3. Sun et al. 32 calculated P occurrence estimates directly from gauge observations, while Dietzsch et al. 33 used the gauge-based interpolated GPCC dataset 34 to estimate P characteristics for the entire land surface. Gauges provide accurate estimates of P at point scales and therefore have been widely used to evaluate satellite-and reanalysis-based P products (e.g., Hirpa et al. 35 , Zambrano-Bigiarini et al. 36 , and Beck et al. 37 ). However, only 16% of the global land surface (excluding Antarctica) has a gauge located at <25-km distance 38,39 , gauge placement is topographically biased towards low elevations 40,41 , and there has been a significant decline in the number of active gauges over recent decades 42 . Additionally, interpolation of gauge observations generally leads to systematic overestimation of low intensity events and underestimation of high intensity events 43,44 .
Here, we introduce the Precipitation Probability DISTribution (PPDIST) dataset, a collection of high-resolution (0.1°) fully global climatologies (1979-2018) of eight P occurrence indices (using different thresholds to discriminate between P and no-P) and nine peak P intensity indices (using different return periods) at daily and 3-hourly time-scales (Figs. 1b and 2b). The climatologies were produced using neural network ensembles trained to estimate the P indices using quality-controlled P observations from an unprecedented database consisting of 93,138 daily and 11,881 hourly gauges worldwide ( Fig. 1a and Supplement Fig. S2a). Ten global P-, climate-, and topography-related predictors derived from all three main P data sources -satellites, reanalyses, and gauges -were used as input to the neural networks to enhance the accuracy of the estimates (Table 1). Ten-fold cross-validation was used to obtain an ensemble of neural networks, evaluate the generalizability of the approach, and quantify the uncertainty. The performance was assessed using the coefficient of determination (R 2 ), the mean absolute error (MAE), and the bias (B; Fig. 3). The neural networks yielded mean validation R 2 values (calculated using the validation subsets of observations) of 0.78 and 0.86 for the daily and 3-hourly P occurrence indices, respectively, and 0.72 and 0.83 for the daily and 3-hourly peak P intensity indices, respectively ( Fig. 3 and Supplement Fig. S4). The neural networks outperformed current state-of-the-art reanalysis (ERA5 28 ) and satellite (IMERG 45,46 ) products for all P indices and performance metrics (Fig. 3). ERA5 consistently overestimated the P occurrence (Fig. 1c), while IMERG tended to overestimate the peak P intensity (Fig. 2d). The uncertainty in the neural network-based estimates of the P indices was quantified globally and was generally higher in sparsely gauged, mountainous, and tropical regions (Fig. 4). Overall, these results suggest that the neural networks provide a useful high-resolution, global-scale picture of the P probability distribution. It should be noted that although the climatologies are provided as gridded maps, the estimates represent the point scale rather than the grid-box scale, due to the use of gauge observations for the training 43,47 .
Since the eight PPDIST P occurrence climatologies were similar, we only present and discuss results for the >0.5 mm d −1 daily P occurrence climatology (Fig. 1b), although the others can be viewed by accessing the dataset. The P occurrence was particularly high (>0.5 mm d −1 on >50% of days) in windward coastal areas of Chile, Colombia, the British Isles, and Norway, along the Pacific coast of North America, and over parts of Southeast Asia (Fig. 1b) 16,[48][49][50][51] . Conversely, the P occurrence was particularly low (>0.5 mm d −1 on <1% of days) in the Atacama and Sahara deserts. At a daily time-scale using a 0.5 mm d −1 threshold, P was found to occur 34.4%, 24.9%, and 38.9% of days, on average, over the global, land, and ocean domains, respectively ( Fig. 1b). At a 3-hourly time-scale using a 0.1 mm 3 h −1 threshold, P was found to occur 12.2%, 7.4%, and 14.3% of the time, on average, over the global, land, and ocean domains, respectively (Supplement Fig. S2b). These estimates are in line with previous estimates derived from the satellite-based CMORPH product 16 and from the satellite-, reanalysis-, and gauge-based MSWEP product 52 . The higher average P occurrence over ocean than over land reflects the unlimited water supply 53 . The PPDIST > 0.5 mm d −1 daily P occurrence climatology (Fig. 1b) exhibits reasonable visual agreement with similar climatologies from previous studies despite the use of different data sources, spatiotemporal resolutions, and thresholds to discriminate between P and no-P (Sun et al. 32  Since the nine PPDIST peak P intensity climatologies were similar, we only present and discuss results for the 15-year return-period daily P climatology (Fig. 2b). Peak P intensities were markedly higher at low latitudes (<30°N/S; Fig. 2b), due to the higher air temperatures which lead to more intense P in accordance with the Clausius-Clapeyron relationship [55][56][57] . The highest 15-year return-period P intensities (>200 mm d −1 ) over land were found in parts of Central America, India, and Southeast Asia, along the western equatorial coast of Africa, and over oceans in the intertropical convergence zone (Fig. 2b) 17,48,58,59 . Conversely, the lowest 15-year return-period P intensities (<50 mm d −1 ) were found in the desert regions of the world, at high latitudes where frontal P dominates, and over southeastern parts of the Atlantic and Pacific oceans where semi-permanent anticyclones prevail 54,60 . The PPDIST 15-year return-period daily P climatology (Fig. 2b) exhibits reasonable visual agreement with a maximum one-day P climatology derived by Dietzsch et al. 33 (their Fig. 5d) from the gauge-based GPCC Full Data Daily dataset 34 , as well as with a 3.4-year return-period 3-hourly P climatology derived by Beck et al. 52 (their Fig. 7) from the satellite-, reanalysis-, and gauge-based MSWEP dataset 52 . www.nature.com/scientificdata www.nature.com/scientificdata/

Methods
Gauge observations and quality control. Our initial database of daily P gauge observations comprised 149,095 gauges worldwide and was compiled from (i) the Global Historical Climatology Network-Daily (GHCN-D) dataset (ftp.ncdc.noaa.gov/pub/data/ghcn/daily/) 61 ( 63 . Gauge data can have considerable measurement errors and therefore quality control is important [64][65][66] . To eliminate long series of erroneous zeros frequently present in daily GSOD records 67,68 , we calculated a central moving mean with a one-year window (assigning a value only if at least half a year of values were present), and discarded daily and 3-hourly P observations with a coincident moving mean of zero. To eliminate long series of P without zeros, indicative of biased reporting practices or quality control issues, we calculated a central moving minimum with a one-year window (assigning a value only if at least half a year of values were present), and discarded daily www.nature.com/scientificdata www.nature.com/scientificdata/ and 3-hourly P observations with a coincident moving minimum greater than zero. We only used observations between January 1st, 1979, and December 31st, 2018, and discarded gauges with (i) < 3 years of remaining data, (ii) the largest daily P > 2000 mm d −1 or 3-hourly P > 500 mm 3 h −1 (approximately the maximum recorded 24-h and 3-h P, respectively 69 ), (iii) the highest P value < 2 mm d −1 or <1 mm 3 h −1 , or (iv) the lowest P value > 1 mm d −1 or >0.5 mm 3 h −1 . The resulting database comprised 93,138 gauges with daily data (Fig. 1a) and 11,881 gauges with 3-hourly data (Supplement Fig. S2a). See Supplement Fig. S1 for record lengths of the daily and 3-hourly gauge data.
Precipitation occurrence and peak intensity indices. We considered eight P occurrence indices (related to the lower tail of the P distribution) and nine peak P intensity indices (related to the upper tail of the P distribution), all calculated for both the daily and 3-hourly time-scale. The eight P occurrence indices measure the percentage of time with P using thresholds of 0.1, 0.15, 0.2, 0.25, 0.35, 0.5, 0.75, and 1 mm d −1 or mm 3 h −1 to discriminate between P and no-P. The P occurrence was calculated using the percentileofscore function of the scipy Python module 70 . We considered multiple thresholds because (i) different researchers and meteorological agencies adopt different definitions of "wet day" and "rain day", (ii) the most suitable threshold depends on the application, and (iii) the estimation accuracy depends on the threshold 16,71 . www.nature.com/scientificdata www.nature.com/scientificdata/ The nine peak P intensity indices measure the magnitude of daily or 3-hourly P events for return periods of 3 days, 7 days, 1 month, 3 months, 1 year, 3 years, 5 years, 10 years, and 15 years. The magnitudes were calculated using the percentile function of the numpy Python module 72,73 . Percentiles were calculated according to p = 100 − 100/(365.25 · R) for the daily time-scale and p = 100 − 100/(365.25 · 8 · R) for the 3-hourly time-scale, where p is the percentile (%) and R is the return period (years). Gauges with <3 years of data were discarded per the preceding subsection. However, we only calculated the three highest return periods if the record lengths were longer than 5, 10, and 15 years, respectively. We did not consider return periods >15 years because (i) the number of observations would significantly decrease, and (ii) the IMERG dataset, which was used as predictor, spans only 20 years.
Neural network ensembles. We used feed-forward artificial neural networks based on the multilayer perceptron 74 to produce global maps of the P indices. Neural networks are models composed of interconnected neurons able to model complex non-linear relationships between inputs and outputs. Neural networks have been successfully applied in numerous studies to estimate different aspects of P (e.g., Coulibaly et al. 75 , Kim and Pachepsky 76 , and Nastos et al. 77 ). We used a neural network with one hidden layer composed of 10 nodes with the logistic sigmoid activation function. The stochastic gradient descent method was used to train the weights. To make the training more efficient, the inputs (i.e., the predictors; Table 1) were standardized by subtracting the means and dividing by the standard deviations of the global predictor maps. Additionally, the outputs (i.e., the P indices) were square-root transformed to reduce the skewness of the distributions, and standardized using means and standard deviations of the observed P indices.
To obtain an ensemble of neural networks, evaluate the generalizability of the approach, and quantify the uncertainty, we trained neural networks for each P index using ten-fold cross-validation. For each cross-validation iteration, the observations were partitioned into subsets of 90% for training and 10% for validation. The partitioning was random, although each observation was used only once for validation. Since the validation subsets are completely independent and were not used to train the neural networks, they allow quantification of the performance at ungauged locations.
We subsequently applied the trained neural networks using global 0.1° maps of the predictors, yielding an ensemble of ten global maps for each P index (one for each cross-validation iteration). The values were destandardized by multiplying by the standard deviation and adding the mean of observed P indices, and back-transformed by squaring the values, after which we used the mean of the ensemble to provide a 'best estimate' and the standard deviation to provide an indication of the uncertainty. As a last step, we sorted the eight  Table 1. Predictors used in the neural networks to estimate the P occurrence and peak P intensity indices. a Two mean annual P climatologies were used to account the uncertainty in mean annual P estimates. b The ERA5 and IMERG predictors represent estimates of the P index subject of the estimation. c Daily PPDIST estimates of the P index subject of the estimation were used as predictor for the 3-hourly estimates.
www.nature.com/scientificdata www.nature.com/scientificdata/ mean P occurrence estimates for each 0.1° grid cell to ensure that they continuously decrease as a function of the thresholds, and sorted the nine mean peak P intensity estimates to ensure that they continuously increase as a function of the percentiles. The uncertainty estimates were re-ordered according to the mean estimates. Table 1 presents the predictors used as input to the neural networks. The predictors were upscaled from their native resolution to 0.1° using average resampling. To highlight broad-scale topographic features, the elevation predictor (Elev) was smoothed using a Gaussian kernel with a 10-km radius following earlier studies 78,79 . The Elev predictor was square-root transformed to increase the importance of topographic features at low elevation 80 . The mean annual air temperature predictor (MAT) was included because higher air temperatures are generally associated with higher P intensities 56,81 .

Predictors.
Mean annual P predictors (MAP1 and MAP2) were included due to the high correlations found between mean and extreme values of P 14,82 . To account for the uncertainty in mean annual P estimates, we considered Fig. 3 Performance of the PPDIST dataset, the ERA5 reanalysis, and the IMERG satellite product in estimating the daily P occurrence and peak P intensity indices. For the PPDIST dataset, we calculated the mean of the ten validation scores (one for each cross-validation iteration). The training scores were not shown as they were nearly identical to the validation scores. All scores were calculated using square-root-transformed observed and estimated values. Supplement Fig. S4 presents an equivalent figure for the 3-hourly P indices. www.nature.com/scientificdata www.nature.com/scientificdata/ two state-of-the-art climatologies: CHPclim V1 83 and WorldClim V2 84 . Since the use of correlated predictors can increase training time 85 , we did not directly use both climatologies as separate predictors. Instead, we used square-root transformed mean annual P from WorldClim as predictor MAP1, and the difference in square-root transformed mean annual P between CHPclim and WorldClim as predictor MAP2. The square-root transformation was used to increase the importance of smaller values.
The ERA5 and IMERG predictors represent maps of the P index subject of the estimation derived from ERA5 and IMERG P data, respectively. The ERA5 and IMERG predictors were added because they provide valuable independent information about the P distribution with (near-)global coverage. All predictors cover the entire land surface, with the exception of IMERG, which was often not available at latitudes higher than approximately 60°N/S 46 . This issue was addressed by replacing missing IMERG estimates of the P indices with equivalent estimates from ERA5, resulting in a duplication of ERA5 at high latitudes. P distributional characteristics associated with different durations often exhibit similar spatial patterns 27,86 . To allow the 3-hourly estimates to take advantage of the much better availability of daily P observations (Figs. 1a and 2a), we used the daily climatology of the P index subject of the estimation as additional predictor for producing the 3-hourly climatology.
Performance metrics. We assessed the performance of the PPDIST dataset, the ERA5 reanalysis, and the IMERG satellite product in estimating the P indices using the gauge observations as reference. Prior to the assessment, the P indices of the products and the reference were square-root transformed to reduce the skewness of the distributions. For the PPDIST dataset, we calculated the average of the ten scores calculated using the ten independent validation subsets of observations (one for each cross-validation iteration). For ERA5 and IMERG, we simply calculated the scores using all available observations. The following three performance metrics were used: 1. The coefficient of determination (R 2 ), which measures the proportion of variance accounted for 87 , ranges from 0 to 1, and has its optimum at 1.
2. The mean absolute error (MAE), calculated as: where X i e and X i o represent estimated and observed values of the transformed P indices, respectively, i = 1, …, n denotes the observations, and n is the sample size. The MAE ranges from 0 to ∞ and has its optimum at 0. We used the MAE, instead of the more common root mean square error (RMSE), because the errors are unlikely to follow a normal distribution 88,89 .
3. The bias (B), defined as: B ranges from −1 to 1 and has its optimum at 0. Other widely used bias formulations, which contain the sum of only the observations in the denominator rather than the sum of both the observations and estimates 90,91 , were not used, because they yield unwieldy high B values when the observations tend to zero.

Data Records
The PPDIST dataset is available for download at figshare 92 and www.gloh2o.org/ppdist. The data are provided as a zip file containing four netCDF-4 files (www.unidata.ucar.edu/software/netcdf) with gridded observations, mean estimates, and uncertainty estimates of (i) the daily P occurrence (daily_occurrence_point_scale.nc), (ii) the daily peak P intensity (daily_intensity_point_scale.nc), (iii) the 3-hourly P occurrence (3hourly_occurrence_point_scale.nc), and (iv) the 3-hourly peak P intensity (3hourly_inten-sity_point_scale.nc). The grids have a 0.1° spatial resolution and are referenced to the World Geodetic Reference System 1984 (WGS 84) ellipsoid. The observational grids were produced by averaging the observations if multiple observations were present in a single grid cell. The netCDF-4 files can be viewed, edited, and analyzed using most Geographic Information Systems (GIS) software packages, including ArcGIS, QGIS, and GRASS. Figure 3 presents the performance of the PPDIST dataset, the ERA5 reanalysis, and the IMERG satellite product in estimating the daily P indices (Supplement Fig. S4 presents the performance in estimating the 3-hourly P indices). The PPDIST scores represent averages calculated from the ten independent validation subsets of observations (one for each cross-validation iteration). The PPDIST R 2 values ranged from 0.76 to 0.80 (mean 0.78) for the daily P occurrence indices, 0.85 to 0.88 (mean 0.86) for the 3-hourly P occurrence indices, 0.44 to 0.84 (mean 0.72) for the daily peak P intensity indices, and 0.72 to 0.88 (mean 0.83) for the 3-hourly peak P intensity indices ( Fig. 3 and Supplement Fig. S4). The 3-hourly R 2 values were thus higher than the daily R 2 values, likely because the 3-hourly gauge observations tend to be of higher quality 62 . The markedly lower R 2 for the 3-day return-period daily P intensity reflects the fact that the majority (58%) of the observations were 0 mm d −1 . For the ≥3-year return-period daily and 3-hourly P indices, the lower performance reflects the uncertainty associated with estimating the magnitude of heavy P events [93][94][95] . The PPDIST scores were superior to the ERA5 and IMERG scores for every P index and performance metric by a substantial margin. The PPDIST B scores were close to 0 for all P indices, indicating that the estimates do not exhibit systematic biases, apart from those already present in the (2020) 7:302 | https://doi.org/10.1038/s41597-020-00631-x www.nature.com/scientificdata www.nature.com/scientificdata/ gauge data due to wind-induced under-catch 64,96 and the underreporting of light P events 65 . Overall, these results suggest that the PPDIST dataset provides a reliable global-scale picture of the P probability distribution.

Technical Validation
The ERA5 and IMERG products yielded mean R 2 values of 0.59 and 0.58 for the daily P occurrence indices, 0.69 and 0.49 for the 3-hourly P occurrence indices, 0.65 and 0.56 for the daily P intensity indices, and 0.79 and 0.57 for the 3-hourly P intensity indices, respectively ( Fig. 3 and Supplement Fig. S4). The global patterns of P occurrence and peak P intensity were thus generally better estimated by ERA5 than by IMERG. However, ERA5 consistently overestimated the P occurrence and the lowest return periods, in line with evaluations of other climate models 30,37,97 . Additionally, ERA5 generally showed lower values for the highest return periods (Fig. 3), which is at least partly due to the spatial scale mismatch between point-scale observations and grid-box averages 43,47,98,99 . Conversely, ERA5 strongly overestimated the P intensity over several grid cells in eastern Africa in particular (Fig. 2c), likely due to the presence of so-called "rain bombs" caused by numerical artifacts 100 .
The overly high P occurrence of IMERG in the tropics (Fig. 1d) may be at least partly attributable to the inclusion of relatively noisy P estimates from the SAPHIR sensor on board the tropical-orbiting Megha-Tropiques satellite 46 . Similar to the TMPA 3B42 and CMORPH satellite products, IMERG overestimated the P occurrence over several large rivers (notably the Amazon, the Lena, and the Ob), likely due to misinterpretation of the emissivity of small inland water bodies as P by the passive microwave P retrieval algorithms 101 . The too low P occurrence at high latitudes reflects the challenges associated with the retrieval of snowfall 26,102 and light P 103,104 . Furthermore, IMERG tended to underestimate the P intensity over regions of complex topography (e.g., over Japan and the west coast of India; Fig. 2d), which is a known limitation of the product 23,105 . The too high P intensity in the south central US may stem from the way P estimates from the various passive microwave sensors are intercalibrated to the radar-based CORRA product in the IMERG algorithm 46 .

Usage Notes
The PPDIST dataset will be useful for numerous purposes, such as the evaluation of P from climate models (e.g., Dai 4 and Bosilovich et al 5 .), the distributional correction of gridded P datasets (e.g., Zhu and Luo 6 , Xie et al. 7 , and Karbalaee et al. 8 ), the design of infrastructure in poorly gauged regions (e.g., Lumbroso et al. 9 ), and in hydrological modelling where P intensity is required (e.g., Donohue et al. 106 and Liu et al. 107 ). However, some caveats should be kept in mind: 1. The uncertainty estimates included in the PPDIST dataset provide an indication of the reliability of the climatologies. The uncertainty is generally higher in sparsely gauged, mountainous, and tropical regions (Fig. 4). The uncertainty estimates may, however, be on the low end, because while we accounted for uncertainty in the mean annual P predictor, we did not account for uncertainty in the other predictors. 2. The P occurrence estimates using low thresholds to discriminate between P and no-P (≤0. 25 109 and possibly higher for other sensors with coarser footprints) and the underreporting of light P events 65 . 3. P observations from unshielded gauges (e.g., the US NWS 8" standard gauge) are subject to wind-induced gauge under-catch and therefore underestimate rainfall by 5-15% and snowfall by 20-80% 64,96 . This underestimation is reflected in the PPDIST estimates. The Precipitation Bias CORrection (PBCOR) dataset 110 (www.gloh2o.org/pbcor) provides a means to ameliorate this bias to some degree. 4. The PPDIST climatologies were derived from gauge observations and therefore represent the point scale.
Compared to point-scale estimates, grid-box averages tend to exhibit more frequent P of lower intensity 16,43,47,99 . Point-scale P occurrence estimates can be transformed to grid-box averages using, for example, the equations of Osborn and Hulme 111 , whereas the peak P intensity estimates can be transformed to gridbox averages using, for example, the areal reduction factors collated by Pietersen et al. 112 . 5. As mentioned before, ERA5 strongly overestimates the P intensity over several grid cells in eastern Africa in particular (Fig. 2c). Since ERA5 was used as predictor, these artifacts are also present, albeit less pronounced, in the PPDIST P intensity estimates (Fig. 2b). 6. Some of the PPDIST climatologies (e.g., the 15-year return-period daily P climatology; Fig. 2b) exhibit a longitudinal discontinuity at 60°N/S, reflecting the spatial extent of the IMERG data (Table 1). Additionally, some of the climatologies exhibit discontinuities at coastlines (e.g., along the west coast of Patagonia), due to the use of different data sources for land and ocean for the MAP1, MAP2, MAT, and SnowFrac predictors.

Code availability
The neural networks used to produce the PPDIST dataset were implemented using the MLPRegressor class of the scikit-learn Python module 113 . The P occurrence for different thresholds was calculated using the percentileofscore function of the scipy Python module 70 , whereas the P magnitudes for different return periods were calculated using the percentile function of the numpy Python module 72,73 . The other codes are available upon request from the first author. The predictor, IMERG, and ERA5 data are available via the URLs listed in Table 1. Most of the gauge observations are available via the URLs provided in the "Gauge observations and quality control" subsection. Part of the GSDR database and some of the national databases are only available upon request.