## Background & Summary

Only 3% of the total water on the Earth is considered fresh water, and approximately 30% of that is accessible as groundwater, which is vital for human development, ecosystems, the energy industry and other water-dependent activities1. Since the 1950s, it has been realised that nitrate (NO3N), which is the most common groundwater pollutant worldwide2,3,4, adversely affects human health5,6. Studies have shown a positive correlation between nitrate concentration in drinking water and colorectal cancer morbidity even when the nitrate concentration is far below the drinking water standard set by policies (50 mg/L of nitrate as NO3 in the European Union7, or 10 mg/L of nitrogen as the maximum contaminant level regulated by the United States Environmental Protection Agency)8,9. Nitrate has also been considered to be an environmental endocrine disruptor, as it has been shown to affect vertebrate reproduction and developmental processes in fishes10,11. Nitrate entering wetlands, rivers or lakes can lead to eutrophication, which may lead to algae overgrowth and fish loss12,13,14. Moreover, nitrate has an indirect impact on the economy. Studies from the early 1990s showed that, in response to groundwater pollution, many people took avoidance actions that can result in significant economic losses15. For example, in Wisconsin, USA, the direct medical cost for all adverse health consequences attributable to nitrates is estimated at between $23 million and $80 million per year16.

The main sources of nitrate in groundwater that cause these hazards include irrigated and rainfed agriculture and intensive animal farming17. Other sources, such as septic tanks and landfills, may leach nitrate locally18. In some urbanized areas, underground sewer leakage is also a source of nitrate in groundwater19. Nitrate pollution in shallow aquifers is mainly caused by fertilisation and the subsequent nitrate leaching20,21, which is the process of nitrate migration from the upper to the lower soil with soil water. Nitrate leaves the bottom of the soil, passes into the unsaturated zone (USZ) and finally enters the groundwater. The USZ, located below the soil and above the groundwater, is not only an important space connecting the surface and groundwater but also a necessary pathway for all kinds of pollutants to enter the groundwater. After nitrate enters the USZ from the bottom of the soils, the transformation of nitrate in the USZ mainly includes three processes, namely, adsorption22, nitrification and denitrification23. Recent literature has indicated an increasing global concern about the effects of nitrate leaching on the environment, particularly agro-ecosystems24, and especially about the nitrate legacy in the USZ, i.e., the nitrate time lag between leaving the bottom of the soil layer and arriving at the water table25. Some studies have termed this issue a ‘nitrate time bomb’26 and indicate that countries should consider it when assessing groundwater nitrate pollution and developing pollution control policies27.

To understand the nitrate legacy in the groundwater system, it is necessary to understand the nitrate transport velocity (vN) in the USZs and hence the nitrate lag time in the USZs. In previous studies, vN was regarded as one of the main factors affecting the nitrate concentration and distribution in the USZs of the study areas28, and nitrate was also regarded as an environmental tracer to understand the transfer processes in the USZs29. However, the vN values reported in these studies are limited to specific local research areas. At the global scale, there are few studies28,29,30,31,32,33,34,35,36,37 on vN estimation, and most of the research is concentrated on European aquifers35,36,37, especially British aquifers38. Although vN maps for the UK38 and the Loess Plateau of China39 have been generated, there is no spatial map representing the vN distribution for the whole world. Since the USZ vN is determined by many factors, such as rock types, permeability, porosity, and amount of groundwater recharge, it is highly region- or lithology-specific40, which makes it difficult to generate a reliable global dataset of vN in the USZs.

Based on a Nitrate Time Bomb (NTB) model38, we developed a global dataset of nitrate transport velocity in the USZs (GNV) and validated it using nitrate velocity data locally observed or derived from the literature review. To our knowledge, GNV is the first open-source global-scale USZ vN dataset. It can help scientists from other disciplines to better understand the nitrate legacy in the groundwater system at a large scale, thus contributing to the development of new methods that provide sound evidence for nitrate water pollution management.

## Methods

The development of the GNV consists of three steps: (1) Constructing an NTB model by preparing and inputting global datasets of rock permeability, rock porosity, and annual groundwater recharge. (2) Calibrating the NTB model based on a 22-zone baseline dataset of USZ vN created using locally measured vN data and global lithological data. (3) Validating the GNV dataset derived using the baseline USZ vN dataset.

### The Nitrate Time Bomb (NTB) model

The NTB model has been used to simulate the nitrate transport in the groundwater system at the national and global scales41, based on the information on nitrate leaching from the bottom of the soils, the thickness of the USZs, and the rock hydrogeological characteristics. The NTB model was used in this study to derive the GNV dataset. The NTB model was originally developed in the UK, where the transport velocity in the USZs was calculated as the ratio of average groundwater recharge to porosity42:

$${V}_{USZ,i}=\frac{{R}_{ec,i}}{{P}_{rock,i}\cdot R\cdot 1000}$$

where VUSZ,i (m/year) is the USZ vN at cell i; Rec,i (mm/year) is the groundwater recharge in cell i; Prock,i is the porosity of the rock at cell i; R is the retardation factor reflecting the influences of other factors, such as permeability, pore size, diffusion, dispersion and adsorption, on the USZ vN; and the factor 1000 converts the recharge from mm/year to m/year.
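As a worked illustration of the equation above, the following minimal sketch evaluates the velocity for a single cell. The function name and the input values (recharge of 200 mm/year, porosity of 0.25, R of 1) are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the GNV inputs.

```python
def usz_velocity(recharge_mm_per_year, porosity, retardation):
    """Return V_USZ in m/year: recharge / (porosity * R * 1000)."""
    if porosity <= 0 or retardation <= 0:
        raise ValueError("porosity and retardation factor must be positive")
    # The factor 1000 converts recharge from mm/year to m/year.
    return recharge_mm_per_year / (porosity * retardation * 1000.0)


# Hypothetical cell: 200 mm/year recharge, porosity 0.25, R = 1.
v_example = usz_velocity(200.0, 0.25, 1.0)   # 200 / 250 = 0.8 m/year
# A cell with zero recharge yields zero velocity.
v_dry = usz_velocity(0.0, 0.25, 1.0)
```

Doubling R in this sketch halves the velocity, which is how the retardation factor can absorb the combined effect of the factors not represented explicitly.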

### Global porosity data for constructing the NTB model

The global porosity data used in this study were derived from the GLHYMPS (Fig. 1a), a global near-surface hydrogeology dataset of permeability and porosity produced by synthesising and modifying existing global databases43,44. The nine classes of porosity data, which have an average polygon size of 107 km2 (including Antarctica), correspond to nine hydrogeological categories, i.e. unconsolidated sediments, coarse-grain unconsolidated sediments, fine-grain unconsolidated sediments, siliciclastic sedimentary, coarse-grain siliciclastic sedimentary, fine-grain siliciclastic sedimentary, carbonate, crystalline, and volcanic.

### Global annual groundwater recharge data for constructing the NTB model

The global annual groundwater recharge data used in this study were derived from a global hydrological and water resource model called PCR-GLOBWB45,46, which has spatial resolutions of 0.5° × 0.5° and 5′ × 5′ and is available at https://github.com/UU-Hydro/PCR-GLOBWB_model47,48. Similar to other large-scale hydrologic models, PCR-GLOBWB is essentially a “leaky bucket” model applied on a cell‐by‐cell basis49 by considering rainfall, evaporation, canopy interception, snow accumulation and snowmelt. The monthly groundwater recharge derived from PCR-GLOBWB was used to calculate the annual average recharge from 1958 to 2015 (Fig. 1b).
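The averaging step for one grid cell can be sketched as follows. This is a minimal illustration that assumes the monthly values are monthly recharge totals in mm/month and that the series covers whole years; the function name and the two-year sample series are hypothetical.

```python
def mean_annual_recharge(monthly_mm, months_per_year=12):
    """Average the yearly recharge totals of a monthly series (mm/year)."""
    if not monthly_mm or len(monthly_mm) % months_per_year != 0:
        raise ValueError("series must cover a whole number of years")
    n_years = len(monthly_mm) // months_per_year
    # Sum each block of 12 months to obtain one annual total per year.
    annual_totals = [
        sum(monthly_mm[y * months_per_year:(y + 1) * months_per_year])
        for y in range(n_years)
    ]
    return sum(annual_totals) / n_years


# Hypothetical two-year series: 10 mm/month, then 20 mm/month.
sample_series = [10.0] * 12 + [20.0] * 12
recharge_mean = mean_annual_recharge(sample_series)  # (120 + 240) / 2
```

For the period used in this study, the same calculation would run over the 58 years from 1958 to 2015 for every cell.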

### Regionally measured or modelled USZ vN data and global lithological data for generating the global-scale baseline USZ vN

To generate a global baseline dataset of USZ vN, the measured or modelled (but verified) vN data of regional USZs in different countries were collected and averaged from published literature (Supplementary Table 1)28,29,30,31,32,33,34,35,36,37,38,50,51. Figure 2 shows the distribution of the collected mean USZ vN data from the United States, China, the UK, Western Europe, Japan and Israel. These data were then expanded to a global-scale baseline USZ vN dataset based on the regional lithology and the global lithology data (GLiM)52. GLiM, which is available at the PANGAEA Database (https://doi.org/10.1594/PANGAEA.788537)53, represents the rock types of the Earth surface with 1,235,400 polygons. The lithological classification consists of three levels: the first level contains 16 basic lithological classes, while the other two levels contain 12 and 14 subclasses respectively, describing more rock details. Only the 16 basic classes of the first level of GLiM were used in this paper, including: Intermediate volcanic rocks, Basic volcanic rocks, Acid plutonic rocks, Metamorphics, Unconsolidated sediments, Siliciclastic sedimentary rocks, Basic plutonic rocks, Intermediate plutonic rocks, Mixed sedimentary rocks, Water Bodies, Pyroclastics, Carbonate sedimentary rocks, Acid volcanic rocks, No Data and Evaporites. According to the GLiM, the Earth is covered by 64% sediments (a third of which are carbonates), 13% metamorphics, 7% plutonics and 6% volcanics, and 10% is covered by water or ice52.

### Baseline datasets of USZ vN for calibrating and validating the NTB model

The first level of GLiM classification was used in this study as a base map to interpolate the regionally measured or modelled (but verified) USZ vN data into a global baseline dataset of USZ vN (vN_base) (Fig. 3), which is used as the observed/known vN to calibrate the NTB model. Figure 4 shows the flowchart for generating the vN_base. According to the principle of the NTB model, the vN is constrained by USZ lithology conditions, so we assumed that the same USZ lithology had the same average USZ vN. The collected regional monitored or modelled USZ vN data and their corresponding USZ lithologies were collated, and the world was divided into two parts according to the existence of USZ vN, namely, the regions with and without vN data. For the regions with vN data, we divided them into regions with different lithology classifications and then reclassified the lithology of these regions based on the GLiM classification (Supplementary Table 2), to calculate the mean vN values of the reclassified lithology. For the regions without vN data, we assigned the mean vN values of the same lithology types found in the regions with vN data. Finally, the global baseline dataset of USZ vN was generated using the mean vN data from all the regions.

The processing of the collected monitored or modelled USZ vN data was divided into three parts: (A) the USZ vN in the UK; (B) the USZ vN in Chalk and Triassic sandstone of Western Europe; and (C) the USZ vN in other regions. The processing of each part is described below:

(A) Since the UK has a complete database of USZ vN with a detailed description of aquifers that cover almost the whole country38, this UK database was used to derive the mean USZ vN of other regions in the world based on aquifer types. Therefore, the UK aquifers were reclassified using the basic lithological classification standard of the global GLiM data (Supplementary Table 2). For example, according to the spatial distribution, Chalk, Carboniferous, Cornbrash and Great Oolite of Lincolnshire and other lithologies in the UK belong to the Carbonate sedimentary rocks defined in the GLiM basic lithology. Lower Cretaceous Sands, Triassic Sandstones, Triassic and Permian and other lithologies belong to the Siliciclastic sedimentary rocks of the GLiM basic lithology. The Pliocene: Coralline Crag and Quaternary Norwich and Red Crags belong to the Mixed sedimentary rocks of the GLiM basic lithology. When more than one UK USZ lithology was classified into one GLiM lithological type after the reclassification, the mean vN value of these USZ lithologies was calculated and applied to calculate the mean USZ vN of other parts of the world.

(B) Because of special lithological classifications in Western Europe54 (Belgium, the former Federal Republic of Germany, Denmark, France, Ireland, Italy, Luxembourg, the Netherlands and the United Kingdom), the vN values in USZs of the Chalk and Triassic sandstone in Western Europe were determined based on the lithological classification of Western Europe.

(C) Regarding other countries that have USZ vN collected from literature, the World Administrative Region data55,56 and the lithology classification of GLiM were used to extrapolate the USZ vN values at the study areas in the literature to the lithological range within the boundary of the countries where the studies were undertaken (Supplementary Table 2).
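The core of this baseline construction, averaging the collected vN values per reclassified GLiM class and assigning that mean to regions of the same class, can be sketched as below. This is a simplified illustration rather than the processing chain used for vN_base: the function names, region names and velocity values are hypothetical, and the country and Western Europe boundary handling in parts (B) and (C) is omitted.

```python
from collections import defaultdict


def class_means(observations):
    """Mean velocity per GLiM class from (glim_class, velocity) pairs."""
    grouped = defaultdict(list)
    for glim_class, velocity in observations:
        grouped[glim_class].append(velocity)
    return {cls: sum(vals) / len(vals) for cls, vals in grouped.items()}


def assign_baseline(region_classes, means):
    """Give each region the mean of its class, or None if no data exist."""
    return {region: means.get(cls) for region, cls in region_classes.items()}


# Hypothetical observations (m/year) after reclassification to GLiM classes.
observations = [
    ("Carbonate sedimentary rocks", 1.0),
    ("Carbonate sedimentary rocks", 0.5),
    ("Siliciclastic sedimentary rocks", 2.0),
]
means = class_means(observations)
baseline = assign_baseline(
    {"region_a": "Carbonate sedimentary rocks", "region_b": "Evaporites"},
    means,
)
```

In this sketch, a class without any observation (here Evaporites) is left as `None`, mirroring the areas of the world that receive no baseline value.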

The main factors affecting the value of the retardation factor R include permeability, pore size, diffusion, dispersion and adsorption, which are constrained by lithology42. To accurately simulate spatially distributed USZ vN values, the R values for different lithological classifications need to be calibrated using vN_base. Therefore, according to the GLiM lithology classification and the vN_base of different countries, the global USZs were divided into 22 zones (excluding water, ice and snow, and zones with no vN_base value) (Fig. 5). The zoning method is as follows. Where a mean USZ vN is available for a whole country and is used directly as the vN_base (e.g., the UK), the region with the same vN_base forms one zone. Where a mean USZ vN is available for a lithology, is used directly as the vN_base, and the lithology boundary crosses several countries (such as Chalk and Triassic sandstone in Western Europe), the lithology forms one zone delimited by that boundary. Where a mean USZ vN is available for a lithology that lies within a single country (such as in China, the United States, Japan and Israel), and the vN_base is obtained by spatially extending this value over the GLiM basic class to which the lithology belongs, the corresponding GLiM lithology within that country forms one zone. For example, a mean USZ vN is available for the Loess of China, and the loess region belongs to the unconsolidated sediments of GLiM, so the unconsolidated sediments of China form one zone. The remaining areas, where no mean USZ vN is available and the vN_base is obtained by interpolation, are divided according to GLiM lithology. The division into 22 zones is therefore based on the existence of mean USZ vN data, the calculation method of vN_base, and lithology.

Compared with the 16 GLiM lithology classes, the 22-zone scheme distinguishes, within the same lithology, regions whose vN_base was obtained by different methods. This better constrains the retardation factor in regions where the vN_base comes directly from a locally available mean USZ vN, thus increasing the accuracy of the simulated velocities. The number and the lithology of the 22 zones are shown in Supplementary Table 3. The zone map provided regional constraints for deriving spatially distributed USZ vN values using the NTB model.

### Generating the global distributed USZ vN data (GNV)

Although some regionally monitored USZ vN data can be found in published literature, these data are too few to be used directly to derive spatially distributed USZ vN values for the rest of the world. Instead, the collected data were used to derive the baseline dataset of USZ vN, i.e., the vN_base dataset, for calibrating the NTB model, which was then used to generate the globally distributed USZ vN data (GNV). To calibrate the NTB model using vN_base in the 22 zones (described in the section above), the retardation factor R was calibrated separately in each zone through a Monte Carlo (MC) simulation in which the NTB model was run 100,000 times. In each NTB run, the absolute value of the difference between the baseline value and the spatially averaged simulated value was calculated to assess the accuracy of the modelled results. The sensitivity scatter plots of the 22 zones were produced by plotting this absolute difference against the NTB retardation factor (Fig. 6). For example, in Fig. 6(1), the number 1 corresponds to zone 1. We used the MC method to draw a random R value, Ri, ran the NTB model once, and obtained a mean simulated velocity (vN_sim) for zone 1. The absolute value of the difference between the mean vN_sim and the zone 1 vN_base is marked with a blue point in Fig. 6(1). After 100,000 runs, a total of 100,000 Ri values and 100,000 scatter points had been obtained. Among these scatter points, the point with the value closest to 0 indicates that the mean vN_sim is closest to the vN_base, and we refer to this mean vN_sim as the best mean vN_sim. The Ri corresponding to this point is the best R of zone 1, marked with a red triangle in Fig. 6(1).

The values of vN_base, the simulated velocity (vN_sim) and the retardation factor in the 22 zones are shown in Supplementary Table 3. After the R values of the 22 zones were determined, the NTB model was run again, and the GNV dataset was obtained. The GNV dataset generated using the NTB model is shown in Fig. 7.
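The calibration of R for a single zone can be sketched as a random search, as below. This is a minimal illustration under stated assumptions, not the code used to produce GNV: the uniform sampling distribution, the sampling range of 0.1 to 10, the three-cell zone and the vN_base of 0.5 m/year are all hypothetical, and only 10,000 draws are used to keep the example fast.

```python
import random


def zone_mean_velocity(recharge_mm, porosity, r):
    """Mean NTB velocity (m/year) over the cells of one zone for a given R."""
    velocities = [rec / (p * r * 1000.0) for rec, p in zip(recharge_mm, porosity)]
    return sum(velocities) / len(velocities)


def calibrate_r(recharge_mm, porosity, v_base, r_min, r_max, n_runs, seed=0):
    """Return the sampled R whose zone-mean velocity is closest to v_base."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    best_r, best_err = None, float("inf")
    for _ in range(n_runs):
        r = rng.uniform(r_min, r_max)    # assumed uniform sampling of R
        err = abs(zone_mean_velocity(recharge_mm, porosity, r) - v_base)
        if err < best_err:               # keep the point closest to zero
            best_r, best_err = r, err
    return best_r, best_err


# Hypothetical zone with three cells and a baseline velocity of 0.5 m/year.
zone_recharge = [100.0, 200.0, 300.0]    # mm/year
zone_porosity = [0.2, 0.25, 0.3]
best_r, best_err = calibrate_r(zone_recharge, zone_porosity, 0.5, 0.1, 10.0, 10_000)
```

Because the zone-mean velocity scales as 1/R in this model, the error has a single minimum, so a sufficiently dense random sample locates the best R without a gradient-based optimiser.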

## Data Records

The GNV dataset and its quality details are made available to the public free of charge in GeoTIFF format through an unrestricted public repository (Figshare57). The data are provided at a 5′ × 5′ spatial resolution with the velocity unit of m/year. The GNV dataset represents the global distribution of USZ nitrate transport velocities, which are mainly affected by rock types, rock hydrogeological characteristics and long-term groundwater recharge. Data quality information, which is discussed in the following section, is the precision estimation of the nitrate transport data based on the vN_base values in different zones. When new regionally measured or modelled USZ vN data become available, the repository will be updated with a newer version of the dataset.

## Technical Validation

Since the global rock porosity dataset is one of the input parameters when estimating USZ vN in the NTB model, a correlation analysis of the USZ vN distribution and the porosity data was performed. To eliminate the influence of zero groundwater recharge on this analysis, the zero USZ vN results caused by zero average annual groundwater recharge (e.g. Sahara Desert, Arabian Desert, Iranian Desert, Turkish Desert, Taklimakan Desert, Gobi Desert, Australian Desert, Namib Desert and Kalahari Desert) were not considered. Figure 8a shows that the mean value of vN is, on the whole, inversely proportional to the rock porosity, which is consistent with the basic formula of the NTB model42. However, when the porosity values are 0.15, 0.22 and 0.28, the value ranges of vN are smaller than those of other porosity values (Fig. 8b). To explain this phenomenon, we checked the lithology classification of GLiM and found that these three porosity values belong to the same lithology category, namely unconsolidated sediment58. We compared the spatial distribution of these three porosity classes with the spatial distribution59 of unconsolidated sediment subtypes in the Global unconsolidated sediment Map Database (GUM)58. Through statistical comparative analysis, it was found that, when undifferentiated sediments are excluded, the area of clay and silt (assuming that different particles in a mixture are mixed in equal volumes) within each of these three porosity classes accounts for more than 30% of the corresponding porosity area (Table 1). This shows that there is a large amount of clay and silt in the unconsolidated sediments with porosity values of 0.15, 0.22 and 0.28. The reference porosity values of clay and silt are 0.4 to 0.7 and 0.8 respectively60, which are much higher than the 0.15, 0.22 and 0.28 used to calculate the vN. To check this conclusion, we obtained example porosity values from the literature (Fig. 9)61,62,63,64. Figure 9 shows that the actual porosity values can be higher than those used in the NTB model. Based on the above analysis, the actual porosity of the unconsolidated sediments may be higher than the 0.15, 0.22 and 0.28 used in the NTB model, which would lead to an overestimation of vN. However, the vN calculation uncertainties introduced by porosity errors can be reduced by the retardation factor in the NTB model: when calibrating the NTB model using the baseline vN, the retardation factor can be adjusted to bring the modelled vN closer to the real vN.

To verify the accuracy of the GNV dataset derived in this study, we compared the simulation results with the baseline velocity in the 22 zones. Firstly, the average value of the simulation results in each zone was compared with the vN_base. Table 2 shows that the maximum error between the average value of vN and vN_base is 0.4252 in zone 21, whose main lithology is mixed sedimentary rocks. The scatter plot of the average vN against vN_base shows a strong positive correlation (Fig. 10, R2 = 0.9956). To further evaluate the accuracy of the GNV data, the interval of vN_base ± the standard deviation of vN in each zone was taken as the confidence interval, the vN values outside this interval were treated as outliers, and the proportion of outlier cells in each zone was calculated (Table 3). The results show that zone 6, which has the lithology of Chalk in Western Europe, has the largest proportion of outliers (40.92%). The outlier proportions in zones 8, 9, 10 and 13 are also relatively large, at 32.14%, 25%, 24.25% and 25.93%, respectively. However, these zones occupy only a small proportion of the global area, so the overall percentage of outliers is 5.50%, indicating that the accuracy of GNV, measured as the share of cells within the confidence interval, is 94.50%. Figure 11 shows the outlier proportion of each zone. The outliers in Western Europe and southern Britain have relatively large proportions. This is because the total areas of these zones are comparatively small, so a single grid cell takes up a large proportion of the zone, resulting in a relatively high proportion of outliers.
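The outlier test for one zone can be sketched as follows. This is a minimal illustration: the use of the population standard deviation is an assumption (the text does not state which estimator was used), and the function name and the five cell velocities are hypothetical.

```python
import statistics


def outlier_fraction(cell_velocities, v_base):
    """Share of cells outside v_base +/- one standard deviation of the cells."""
    sd = statistics.pstdev(cell_velocities)   # population SD (assumed)
    lower, upper = v_base - sd, v_base + sd
    n_outliers = sum(1 for v in cell_velocities if v < lower or v > upper)
    return n_outliers / len(cell_velocities)


# Hypothetical zone: five simulated cell velocities (m/year), vN_base = 0.5.
zone_cells = [0.4, 0.5, 0.5, 0.6, 1.0]
zone_outlier_share = outlier_fraction(zone_cells, 0.5)
```

With only five cells, a single outlying cell already accounts for 20% of the zone, which illustrates why small zones can show high outlier proportions.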

## Usage Notes

In this paper, the global-scale USZ vN dataset named GNV was generated using the NTB model based on the global porosity data and global groundwater average recharge datasets from 1958 to 2015. This GNV dataset was derived by constraining the NTB model and has been carefully analysed and verified using the measured values in various regions of the world from published literature.

Generally, information on nitrate transport velocity in the USZs is valuable for better understanding the legacy of nitrate in the groundwater system and for investigating and forecasting the impacts of nitrate in groundwater on the environment, human health, ecological quality, and plant and animal growth. In detail, the GNV dataset can be used by different numerical models, such as groundwater and USZ pollution transport models and surface water models, in conjunction with other datasets. For example, the GNV dataset can be combined with USZ thickness data to calculate the lag time in the USZs (the time for nitrate to travel from the bottom of the soils to the water table). Similarly, the GNV dataset could be used to estimate the time when the peak value of nitrate leaching reaches the water table, thus helping policymakers to prepare for a possible increase or decline of nitrate concentrations in groundwater in the future.
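The lag-time calculation mentioned above amounts to dividing the USZ thickness by the velocity. The sketch below illustrates this for one cell; the function name, the 20 m thickness, the 0.8 m/year velocity and the 1980 peak leaching year are hypothetical values used only for illustration.

```python
import math


def usz_lag_time_years(thickness_m, velocity_m_per_year):
    """Travel time (years) from the soil base to the water table."""
    if thickness_m < 0 or velocity_m_per_year < 0:
        raise ValueError("thickness and velocity must be non-negative")
    if thickness_m == 0:
        return 0.0
    if velocity_m_per_year == 0:
        # No recharge means nitrate does not move in this simple model.
        return math.inf
    return thickness_m / velocity_m_per_year


# Hypothetical cell: 20 m thick USZ, velocity of 0.8 m/year from the dataset.
lag_years = usz_lag_time_years(20.0, 0.8)     # 25 years
# A leaching peak in a hypothetical year 1980 would arrive around 2005.
peak_arrival_year = 1980 + lag_years
```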

This global study can help funders, policymakers and practitioners in a country better understand the feasible time scale over which the benefits of nitrate mitigation measures can be expected, thus guiding the setting of regional priorities in groundwater nitrate management plans at the country scale. However, further localised work needs to be undertaken to obtain detailed information when handling local groundwater nitrate pollution issues.

This calibrated GNV dataset is available in GeoTIFF and ASC formats, which makes it easy to import into ESRI ArcMap and other geospatial software.
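For users without GIS software, an ASC file can also be read with a few lines of code. The sketch below parses the plain-text Esri ASCII grid layout (a short key-value header followed by rows of cell values) from a string. The sample grid is hypothetical and tiny; the header and values of the actual GNV files are not reproduced here, and the function name is illustrative.

```python
def parse_ascii_grid(text):
    """Parse an Esri ASCII grid into (header dict, rows of floats or None)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = {}
    i = 0
    # Header lines are "key value" pairs whose key starts with a letter.
    while i < len(lines):
        parts = lines[i].split()
        if len(parts) == 2 and parts[0][0].isalpha():
            header[parts[0].lower()] = float(parts[1])
            i += 1
        else:
            break
    nodata = header.get("nodata_value")
    rows = []
    for line in lines[i:]:
        values = [float(tok) for tok in line.split()]
        # Replace the no-data marker with None so it is not used as a velocity.
        rows.append([None if v == nodata else v for v in values])
    return header, rows


# Hypothetical 3 x 2 grid of velocities in m/year.
sample_asc = """ncols 3
nrows 2
xllcorner -180.0
yllcorner -90.0
cellsize 0.0833333
NODATA_value -9999
0.5 0.8 -9999
1.2 -9999 0.0
"""
grid_header, grid_rows = parse_ascii_grid(sample_asc)
```

Cells returned as `None` correspond to areas without a velocity value and should be skipped rather than treated as zero.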

The limitation of this study is that deriving the GNV dataset relies on global annual groundwater recharge and global porosity data, so the uncertainties in these two datasets may be passed on to the GNV dataset. In addition, 8.30% of the global area in GNV has no value of nitrate velocity in the USZs due to the lack of measured vN data for the rock types in these areas. According to the classification of the 16 basic lithological types of GLiM, the lithological classes that have no measured USZ vN include Intermediate volcanic rocks, Acid plutonic rocks, Basic plutonic rocks, Intermediate plutonic rocks, Pyroclastics, Acid volcanic rocks and Evaporites. However, these data can be updated once the measured USZ vN for the rocks in these areas becomes available.