Bottom-up estimates of reactive nitrogen loss from Chinese wheat production in 2014

Excessive use of synthetic nitrogen (N) in Chinese wheat production results in high loss of reactive N (Nr; all forms of N except N2) to the environment, causing serious environmental issues. Quantifying Nr loss and its spatial variation is vital to optimize N management and mitigate loss. However, accurate, high-spatial-resolution estimates of Nr loss from wheat production are lacking due to limitations of data generation and estimation methods. Here, we applied the random forest (RF) algorithm to bottom-up N application rate data, obtained through a survey of millions of farmers, to estimate the Nr loss from wheat production in 2014. The results showed that the average total Nr loss was 52.5 kg N ha−1 (range: 4.6-157.8 kg N ha−1), which accounts for 26.1% of the total N applied. The hotspots of high Nr loss coincide with those of high applied N, including northwestern Xinjiang, central-southern Hebei, Shandong, central-northern Jiangsu, and Hubei. Our database could guide regional N management and be used in conjunction with biogeochemical models.
Measurement(s): reactive N loss. Technology Type(s): random forest model. Sample Characteristic - Environment: cropland. Sample Characteristic - Location: China.


Background & Summary
China is the largest synthetic nitrogen (N) fertilizer producer and consumer in the world, and applied more than 28 Tg N fertilizer to cropland in 2018 1. Furthermore, China applied 256 kg N ha−1 of fertilizer in 2016, which is 3.3 times the global average 2, while China's nitrogen use efficiency (NUE) is only 0.25, compared to 0.68 in North America and 0.42 worldwide 3. A high N input with a low NUE indicates that a considerable amount of N has been lost to the environment, mainly in the form of reactive N (Nr; all forms of N except N2), including nitric oxide (NO), nitrous oxide (N2O), and ammonia (NH3) emissions, nitrate (NO3−) leaching, and Nr runoff 4. This can cause substantial environmental problems, such as soil acidification 5, air pollution 6, and eutrophication 7. The Chinese government has implemented several policies to reduce the environmental risks associated with Nr loss from cropland, such as the "zero increase action plan for fertilizer use" and the "action plan for organic fertilizer instead of synthetic fertilizer". These measures are important to optimize N management, improve the NUE, and mitigate Nr loss in China. Understanding Chinese Nr loss at high spatial resolution is essential to address the variation in N management among crop systems and locations.
Previous studies that aimed to estimate Chinese Nr loss were partially successful 8,9; however, they had certain limitations that could be addressed. The first limitation concerned the method used for obtaining information on N fertilizer inputs. Fertilizer is distributed to specific locations and crops by regional regulatory bodies based on the total fertilizer input in the entire country or an individual region 10. Previous studies used information on N fertilizer inputs obtained from regional regulatory bodies to estimate Nr loss (top-down information). Although this method can provide rough spatial information for applied N and Nr loss, the application of N is highly location- and farmer-specific. Consequently, to improve spatial information on Nr loss, an N application rate survey should be used to obtain information from numerous farmers and locations (bottom-up information). The second limitation of previous studies was their focus on NO3− leaching and N2O and NH3 emissions, without consideration of other Nr loss pathways; this led to underestimation of the potential risks of Nr loss 11. For example, they did not consider NO, one of the most important potential precursors of nitric acid, which leads to acidification and eutrophication 11. The third limitation of previous studies was that they adopted uniform emission factors (EFs), such as IPCC Tier 1, to estimate the Nr loss of entire countries or regions, rather than considering spatial variation within a country or region 12,13. Nr loss is location-specific and strongly influenced  15 . These studies indicated that incorporating spatial variation could reduce uncertainties in Nr loss estimations and facilitate management and mitigation decisions. The fourth limitation of previous studies was that they lacked high-resolution Nr emission inventories for specific crops. Such inventories are indispensable for optimal N management.
Wheat is one of the major crops in China, playing a vital role in food security. The regions used for wheat production range from humid regions in the southeast to arid regions in the northwest, and from warm regions in the south to cool regions in the northeast. China accounts for around 20% of the global synthetic N fertilizer consumption for wheat 16. Given its substantial spatial variation and excessive N consumption, wheat production in China is an excellent target for Nr loss estimation methods that aim to overcome the above-mentioned limitations of previous techniques. Our study provides a comprehensive and high-resolution Nr database based on applied synthetic N. First, we developed RF models to predict the EFs of five loss pathways (NO, N2O, NH3, NO3−, and Nr runoff) based on a literature review. Second, we used N application rates derived from surveys of 2.23 million farmers to calculate Nr loss. High-resolution data on wheat production distribution in China 17 are presented at a 1 × 1 km grid scale. Our results could help farmers optimize N application within safe boundaries, support the development of mitigation measures against Nr loss in specific locations, and help evaluate the environmental effects of Nr loss from Chinese wheat production.
− leaching, and 64 of Nr runoff. We also extracted data on N application rates, and climate and soil variables (Fig. 1). Missing climate data were obtained from the China Meteorological Data Network (https://data.cma.cn/), missing values of soil organic carbon (SOC) and total N content were obtained from the National Scientific Fertilizer Network (http://kxsf.soilbd.com/), and missing soil silt, clay, sand content, bulk density, cation exchange capacity (CEC), and pH data were obtained from the Harmonized World Soil Database (HWSD) v. 1.2 (http://www.fao.org/soils-portal/soil-survey/soilmaps-and-databases/harmonized-world-soildatabase-v12/en).
Based on this dataset, the EFs of Nr loss pathways were calculated by the following equation:

$$EF_i = \frac{E_{\mathrm{treatment}} - E_{\mathrm{control}}}{N_{\mathrm{applied}}} \qquad (1)$$

where i = 1-5 represents NO, N2O, NH3, NO3− leaching, and Nr runoff, respectively. E treatment is the loss rate of experimental treatments with applied N fertilizer, E control is the loss rate of the experimental control without applied N fertilizer, and N applied is the N application rate corresponding to E treatment. The resulting data were used to develop RF models to predict the EFs of the five Nr loss pathways.
RF models. RF models outperformed empirical models in previous studies 15,18,19. We employed RF models to predict the EFs of NO, N2O, NH3, NO3− leaching, and Nr runoff. Environmental factors were selected via redundancy analysis 20. Redundancy analysis, a basic ordination technique for gradient analysis, produces an ordination summarizing the variation in several response variables that can be best explained by a matrix of explanatory variables based on multiple linear regression. We conducted redundancy analysis using Canoco 5 to further analyze the effects of 10 environmental factors on the different EFs: 4 soil physical factors (bulk density, silt, clay, and sand content), 4 soil chemical factors (pH, SOC, CEC, and total N content), and 2 weather factors (total rainfall and mean temperature during the wheat growing period). Ultimately, the dataset of each pathway contained an ensemble of different environmental factors (Table 1).
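As a hedged illustration of the EF calculation described at the start of this subsection, the sketch below computes an emission factor from one paired treatment/control observation. It assumes the EF is the fertilizer-induced loss (treatment minus control) divided by the N application rate, expressed as a fraction. The `Observation` type, the pathway labels, and the numeric values are hypothetical and are not taken from the compiled dataset.

```python
from dataclasses import dataclass

# Labels for the five loss pathways (i = 1-5); the spellings are our own choice.
PATHWAYS = ("NO", "N2O", "NH3", "NO3_leaching", "Nr_runoff")


@dataclass(frozen=True)
class Observation:
    """One paired field observation (hypothetical structure)."""
    pathway: str
    e_treatment: float  # loss rate with applied N fertilizer, kg N ha-1
    e_control: float    # loss rate without applied N fertilizer, kg N ha-1
    n_applied: float    # N application rate of the treatment, kg N ha-1


def emission_factor(obs: Observation) -> float:
    """Return the EF as a fraction of applied N (assumed form of the EF equation)."""
    if obs.pathway not in PATHWAYS:
        raise ValueError(f"unknown pathway: {obs.pathway}")
    if obs.n_applied <= 0:
        raise ValueError("N application rate must be positive")
    return (obs.e_treatment - obs.e_control) / obs.n_applied


# Illustrative values only, not measured data.
example = Observation("NH3", e_treatment=25.0, e_control=5.0, n_applied=200.0)
ef_example = emission_factor(example)
```

Rejecting non-positive application rates keeps control-only records, which have no fertilizer-induced loss to normalize, out of the EF dataset.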
When establishing the RF model, the first step was to select k features from a total of m (k < m) in the training dataset, to generate root node d and daughter nodes; the second step was to repeat the first step to generate a forest with n decision trees. Lastly, the testing dataset was used to create a final decision tree 21. We randomly split the dataset, consisting of paired environmental factors and EFs of each Nr loss pathway, into 10 parts of equal size. Of these parts, 7/10 were used to train the RF models for the different pathways and 3/10 were used to test the performance of the models. We used the "randomForest" R package (https://www.stat.berkeley.edu/~breiman/RandomForests/) to develop RF models in R software (https://cran.r-project.org/). To reduce random error, we ran each model 500 times and determined the performance based on the average value (Fig. 2).
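The evaluation procedure above (a random 7/10 vs 3/10 split, repeated 500 times, with performance averaged) can be sketched as follows. This is a minimal illustration, not the authors' R code: the model is passed in as a `fit` callable, and the mean-value baseline used here is only a stand-in for a random forest. The choice of RMSE as the performance metric, the function names, and the seed are our assumptions.

```python
import math
import random
from statistics import fmean


def split_70_30(n_rows, rng):
    """Shuffle row indices, cut them into 10 near-equal parts, use 7 to train and 3 to test."""
    if n_rows < 10:
        raise ValueError("need at least 10 rows to form 10 non-empty parts")
    idx = list(range(n_rows))
    rng.shuffle(idx)
    parts = [idx[k::10] for k in range(10)]
    train = [i for part in parts[:7] for i in part]
    test = [i for part in parts[7:] for i in part]
    return train, test


def fit_mean_baseline(x_train, y_train):
    """Stand-in model (assumption): always predicts the mean training EF."""
    mean_y = fmean(y_train)
    return lambda row: mean_y


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def repeated_holdout(x, y, fit, n_runs=500, seed=0):
    """Average test RMSE over n_runs independent random 70/30 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        train, test = split_70_30(len(y), rng)
        predict = fit([x[i] for i in train], [y[i] for i in train])
        scores.append(rmse([y[i] for i in test], [predict(x[i]) for i in test]))
    return fmean(scores)
```

Passing the model as a callable keeps the split-and-average logic independent of the learner, so the baseline could be swapped for a wrapper around any random forest implementation without changing the evaluation loop.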
Grid database. We categorized Chinese wheat production into four agroecological regions based on climate and soil variables: North China, North China Plain, South China, and Southwest China (Fig. S1) 22. The grid layer of wheat distribution was derived from ChinaCropArea1 km (https://doi.org/10.17632/jbs44b2hrk.2), which provides a 1-km-grid crop-harvest dataset for wheat across China 17. We selected the grid layer for 2014 and integrated nationwide climate and soil data, together with N application rates derived from the farmer surveys, into the grid layer (Fig. 1). We obtained climate and soil data from the same sources used for missing data. Climate data are in the form of 10-year averages 23. The climate and soil data were extracted into each grid and used as input variables for the RF models.
Predicting EFs and calculating Nr loss. The EF of each pathway was predicted in each grid by the corresponding RF model (Fig. 3). Nr loss was calculated by multiplying the predicted EFs by N applied' using the following equation:

$$\mathrm{Nr\ loss}_{i,j} = EF_{i,j} \times N'_{\mathrm{applied},j} \qquad (2)$$

where i = 1-5 represents NO, N2O, NH3, NO3− leaching, and Nr runoff, respectively, and j = 1, 2, 3, … represents the different grids. N applied' was obtained through a nationwide survey of farmers in 2014. For the survey, 3-10 villages were chosen from each county, and 30-120 random farmers were surveyed. In total, 2.23 million farmers from 1,050 counties were surveyed 22. The average N application rate was determined for each county, interpolated using Kriging, and plotted on a map of China. Finally, the average rates were extracted into the grid layer of Chinese wheat production (Fig. 4a). Total Nr loss (Fig. 4b) was calculated as the sum of the five Nr loss pathways, as in Eq. (3) (Fig. 5):

$$\mathrm{Total\ Nr\ loss}_{j} = \sum_{i=1}^{5} \mathrm{Nr\ loss}_{i,j} \qquad (3)$$
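The per-grid calculation described above (pathway loss as predicted EF times applied N, then a sum over the five pathways) can be sketched as below. This is a minimal sketch assuming EFs are stored as fractions of applied N. The grid identifiers, dictionary layout, and all numeric values are hypothetical placeholders, not entries from the database.

```python
# Labels for the five loss pathways (i = 1-5); the spellings are our own choice.
PATHWAYS = ("NO", "N2O", "NH3", "NO3_leaching", "Nr_runoff")


def nr_loss_by_pathway(efs, n_applied):
    """Loss per pathway in one grid: predicted EF (fraction) times applied N (kg N ha-1)."""
    return {p: efs[p] * n_applied for p in PATHWAYS}


def total_nr_loss(losses):
    """Total Nr loss in one grid: sum over the five pathways."""
    return sum(losses[p] for p in PATHWAYS)


# Hypothetical grids: predicted EFs and interpolated N application rates.
grids = {
    "grid_1": {
        "efs": {"NO": 0.005, "N2O": 0.005, "NH3": 0.10,
                "NO3_leaching": 0.10, "Nr_runoff": 0.04},
        "n_applied": 200.0,
    },
    "grid_2": {
        "efs": {"NO": 0.004, "N2O": 0.006, "NH3": 0.08,
                "NO3_leaching": 0.05, "Nr_runoff": 0.02},
        "n_applied": 150.0,
    },
}

# Total Nr loss (kg N ha-1) for every grid j.
totals = {
    grid_id: total_nr_loss(nr_loss_by_pathway(g["efs"], g["n_applied"]))
    for grid_id, g in grids.items()
}
```

Keeping the pathway losses as an intermediate dictionary mirrors the structure of the data file, which reports EFs and Nr loss per pathway alongside the total.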
Database structure. The Nr-wheat 1.0 database of Nr loss associated with Chinese wheat production consists of three files (Fig. 1). The 'data file' provides N application rates, EFs, and Nr loss for the five loss pathways (NO, N2O, NH3, NO3−, and Nr runoff). The 'source file' contains the studies from which data were extracted to develop the RF models, the code of the RF model, and the subregions of Chinese wheat production. The 'readme file' explains the abbreviations used in the 'data file' and 'source file', and provides the units of all included variables (Fig. 1).

Data Records
Data records are provided in three files: 'source file', 'readme file', and 'data file'. The 'source file' can be found in the Supplementary Information and contains all references used in the database (138 relevant papers), the code for the RF model, and the four subregions of Chinese wheat cultivation. We divided the relevant papers into 5 subsets based on loss pathways. The 'readme file' explains the abbreviations and units. The synthetic N application rates surveyed from farmers, the estimated EFs, and the Nr loss were integrated into a map and are provided in the 'data file'. The map includes 229,366 1 × 1 km grids, which cover around 94% of wheat crop
Table 2. Averaged values and ranges of EFs and loss for each pathway.