A global experimental dataset for assessing grain legume production

Grain legume crops are a significant component of the human diet and animal feed and have an important role in the environment, but the global diversity of agricultural legume species is currently underexploited. Experimental assessments of grain legume performances are required, to identify potential species with high yields. Here, we introduce a dataset including results of field experiments published in 173 articles. The selected experiments were carried out over five continents on 39 grain legume species. The dataset includes measurements of grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content and water use. When available, yields for cereals and oilseeds grown after grain legumes in the crop sequence are also included. The dataset is arranged into a relational database with nine structured tables and 198 standardized attributes. Tillage, fertilization, pest and irrigation management are systematically recorded for each of the 8,581 crop*field site*growing season*treatment combinations. The dataset is freely reusable and easy to update. We anticipate that it will provide valuable information for assessing grain legume production worldwide.

Latitude and longitude coordinates of the field sites included in the database. The Köppen-Geiger climatic classification 186 was used to link each field site to a grid cell with a resolution of 0.50 degrees of latitude by 0.50 degrees of longitude. Eighteen Köppen-Geiger climatic zones are considered: equatorial climates (red), arid climates (orange), warm temperate climates (green) and snow climates (blue). Within each main Köppen-Geiger climatic zone, each Köppen-Geiger climatic subzone is indicated by a color gradient.
www.nature.com/sdata/ SCIENTIFIC DATA | 3:160084 | DOI: 10.1038/sdata.2016.84
Management practices are systematically recorded for each crop*field site*growing season*treatment combination. When available, data on non-legume species grown at the same field site during the same growing season as grain legume species, and data on non-legume species grown after grain legumes in the crop sequence, are also included. Most of these non-legume species are cereals and oilseeds. The data are organized into a relational database with nine structured tables and 198 standardized attributes (Tables 2 and 3 (available online only)).
The dataset can be used for two types of quantitative analysis. First, the dataset can be used to compare the crop production of a broad range of grain legume species, on the basis of experimental data with diverse criteria (e.g., grain yield, aerial biomass and crop nitrogen content). Second, the dataset can be used to assess the crop production of cereal and oilseed species following grain legume species cultivated as preceding crops in the same crop sequences, based on a consideration of field data for various criteria. The dataset is freely available to facilitate such analyses. It could easily be updated in the future, by adding the results of new experiments not originally included in the dataset. It might also be interesting to expand the dataset to include legumes grown for purposes other than grain production (e.g., forage production) or legumes grown in intercropping systems. The global dataset should prove to be a useful support for experimental assessments of the agronomic and environmental performances of a large diversity of grain legumes.

Table 1. Number (percentage) of field sites, field site*growing season and field site*growing season*treatment combinations, and data for grain yield, aerial biomass, grain nitrogen content, aerial nitrogen content, fixed aerial nitrogen content, residual soil nitrogen content and water use, by main world regions. Regions are ranked in descending order of field sites. Grain yield includes data from the 'Crop_Yield_Grain' attribute. Aerial biomass includes data from both 'Crop_Biomass_Aerial' and 'Crop_Harvest_Index' attributes. Grain nitrogen content includes data from both 'Crop_N_Quantity_Grain' and 'Crop_N_Percentage_Grain' attributes. Aerial nitrogen content includes data from the 'Crop_N_Quantity_Aerial', 'Crop_N_Percentage_Aerial' and 'Crop_N_Harvest_Index' attributes. Fixed aerial nitrogen content includes data from both 'Crop_N_Fixed_Quantity_Aerial' and 'Crop_N_Fixed_Percentage_Aerial' attributes. Residual soil nitrogen content includes data from both 'Crop_N_Soil_Quantity_Percentage_Seeding' and 'Crop_N_Soil_Quantity_Percentage_Harvest' attributes. Water use includes data from the 'Crop_Water_Use_Balance', 'Crop_Water_Use_Balance_Efficiency_Grain' and 'Crop_Water_Use_Balance_Efficiency_Aerial' attributes. The total number (percentage) of available data and the total number (percentage) of missing data are calculated over all considered world regions.

Literature search
We carried out a systematic search of peer-reviewed journals for articles comparing grain legume yields (Fig. 2).
The title and abstract of each article were screened for eligibility according to six criteria: (1) article title and/or article abstract reporting one or several annual grain legume species grown as sole crops, (2) article title and/or article abstract reporting at least two grain legume species grown at the same field site during the same growing season, (3) article title and/or article abstract reporting at least one experiment conducted during one or several growing seasons, from the seeding stage to the harvest stage, (4) article title and article abstract referring to an article published in a peer-reviewed journal, (5) article title or article abstract written in English and (6) full-text article available. We selected 223 eligible full-text articles that met these first six criteria (Fig. 2).
Eligible full-text articles were then examined according to three additional criteria: (7) full-text article reporting raw data not duplicated in other articles or raw data that could be obtained by contacting authors, (8) full-text article reporting individual grain yield for each species and (9) full-text article reporting one or several experiments for which field site location or soil characteristics were precisely stated. We selected 60 full-text articles that met all nine criteria. This search was supplemented by screening the references cited in these 60 full-text articles. We also screened the references included in one meta-analysis about drought effects on food legume production 187 for eligibility. When reviewing the full-text articles identified from references screening, all nine selection criteria defined above had to be met for the new article to be considered eligible. Note that, according to criterion (2), experiments reporting data for a single grain legume species were excluded. This selection criterion was used to ensure the direct comparability of different grain legume species, and to avoid confounding species characteristics with environmental factors. Experiments testing single species cannot be used to compare several species, because field site and growing season characteristics (e.g., climate conditions, soil types and plant diseases) affect the growth and development of grain legumes.

Database structure
All data are recorded in a relational database (Data Citation 1). The Structured Query Language (SQL) is used to query and maintain the database. We used the open-access application Sequel Pro version 1.0.2 (http://www.sequelpro.com/). The data collected are grouped into nine related tables including 198 standardized attributes of five types: class, numerical, index, binary and date (Fig. 3 and Table 2). Within the database, the tables are organized according to a cascade path: each 'child' table is related to a 'mother' table. For instance, the 'Article' table is the 'mother' table for the 'child' 'Site' table (Fig. 3). The cascade path from each 'mother' table to each 'child' table is structured by a 'primary key' and a 'secondary key' (Fig. 3). A 'primary key' assigns an index to each row of the table, whether the table is a 'mother' table or a 'child' table. A 'secondary key' (commonly called a foreign key) assigns the 'primary key' of a 'mother' table to each row of a 'child' table. The cardinality from each 'mother' table to each 'child' table is based on 'one-to-one' and 'one-to-many' relationships (Fig. 3).
The database is structured into nine separate but related tables, stored as CSV-formatted files (Data Citation 1). Tables are related to each other via primary and secondary keys, as explained in Fig. 3. The names, types and definitions of attributes included in the nine tables are listed in Table 3 (available online only).
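The 'mother'/'child' cascade described above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the column names ('Article_ID', 'Site_ID', 'First_Author', 'Country') and the inserted rows are hypothetical and are not taken from the published tables, which hold many more attributes.

```python
import sqlite3

# Hypothetical, heavily reduced schema: column names and rows are
# illustrative assumptions, not the attributes of the published database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the key link
conn.executescript("""
CREATE TABLE Article (
    Article_ID   INTEGER PRIMARY KEY,   -- 'primary key' of the 'mother' table
    First_Author TEXT
);
CREATE TABLE Site (
    Site_ID    INTEGER PRIMARY KEY,     -- 'primary key' of the 'child' table
    Article_ID INTEGER NOT NULL
        REFERENCES Article(Article_ID), -- 'secondary key' to the 'mother' table
    Country    TEXT
);
""")

# One article reporting two field sites: a 'one-to-many' (1,n) relationship.
conn.execute("INSERT INTO Article VALUES (1, 'Author A')")
conn.executemany(
    "INSERT INTO Site VALUES (?, ?, ?)",
    [(1, 1, "Country X"), (2, 1, "Country Y")],
)

# Follow the cascade path from the 'mother' table to the 'child' table.
rows = conn.execute("""
    SELECT a.First_Author, COUNT(s.Site_ID)
    FROM Article AS a
    JOIN Site AS s ON s.Article_ID = a.Article_ID
    GROUP BY a.Article_ID
""").fetchall()
print(rows)
```

With the key constraint enabled, inserting a 'Site' row that points to a non-existent article raises `sqlite3.IntegrityError`, which is one way the cascade can be protected when the database is updated.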
The 'Literature_Search' table describes the step in the literature search at which each original article was selected (e.g., selection from the initial literature search or from references screening). The corresponding file is entitled 'Literature_Search.csv' (Data Citation 1), and includes 2 columns and 3 rows (including the row header for the names of attributes). The 'Article' table describes the references of the 173 selected articles (e.g., the name of the first author and the name of the journal). The corresponding file is entitled 'Article.csv' (Data Citation 1), and includes 8 columns and 174 rows (including the row header for the names of attributes).
The 'Site' table describes the characteristics of each field site considered in each article (e.g., latitude and longitude coordinates, soil texture, precipitation and temperature). The corresponding file is entitled 'Site.csv' (Data Citation 1), and includes 29 columns and 361 rows (including the row header for the names of attributes).
The 'Crop_Sequence_Trt' table describes each combination of crop sequences and management practices in the treatments studied at each field site (e.g., names of the species and their order in each crop sequence). The corresponding file is entitled 'Crop_Sequence_Trt.csv' (Data Citation 1), and includes 8 columns and 4,560 rows (including the row header for the names of attributes).
The 'Crop' table provides information about each crop (e.g., names of the species, seeding and harvest dates, number of replicates, grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content, water use, error terms and error types). The main attributes included in this central table are described below in the Data Records section. The corresponding file is entitled 'Crop.csv' (Data Citation 1), and includes 106 columns and 8,582 rows (including the row header for the names of attributes).
The 'Tillage' table describes tillage management for each crop (e.g., tillage tools, incorporation of preceding crop residues, seeding density and legume inoculation). The corresponding file is entitled 'Tillage.csv' (Data Citation 1), and includes 19 columns and 8,582 rows (including the row header for the names of attributes).
The 'Fertilization' table describes nitrogen, phosphate and potassium fertilizer management for each crop (e.g., names and doses of fertilizers). Only the total fertilizer dose is reported for each type of nutrient. The corresponding file is entitled 'Fertilization.csv' (Data Citation 1), and includes 7 columns and 25,744 rows (including the row header for the names of attributes).
The 'Weed_Insect_Fungi' table describes weeds, insects, and fungi management for each crop (e.g., mechanical treatment, names and doses of pesticides). The corresponding file is entitled 'Weed_Insect_Fungi.csv' (Data Citation 1), and includes 13 columns and 45,002 rows (including the row header for the names of attributes).
The 'Irrigation' table describes irrigation management for each crop (e.g., quantity of water applied and irrigation method). The corresponding file is entitled 'Irrigation.csv' (Data Citation 1), and includes 6 columns and 8,582 rows (including the row header for the names of attributes).
In addition to the nine CSV-formatted files (tables), downloadable from Dryad Digital Repository (Data Citation 1), the entire content of the database is also stored in a SQL-formatted file. The corresponding file is entitled 'Database.sql', and is also downloadable from Dryad Digital Repository (Data Citation 1). Examples of SQL queries for extracting data for each table are stored in a TXT-formatted file. The corresponding file is entitled 'Examples_SQL_Queries.txt', and is also downloadable from Dryad Digital Repository (Data Citation 1).
The names, types, and definitions of the 198 attributes included in the nine tables are reported in Table 3 (available online only).
The values (including error terms) and dates reported in graphics were digitized manually with the open-access application WebPlotDigitizer (http://arohatgi.info/WebPlotDigitizer/). The maximum error was estimated at 5.0% for the digitization of low-resolution images, generally from articles published before 1990. 'NA' indicates that data were 'Not Available' for the cell concerned. 'NULL' indicates a logical absence of data for attributes included in the 'Crop', 'Tillage', 'Fertilization', 'Weed_Insect_Fungi', and 'Irrigation' tables. For example, for the 'Fertilization' table, if no nitrogen fertilizer was applied to the crop (i.e., '0.00' was reported in the 'Fertilization_NPK_Dose' attribute), then 'NULL' was reported for the 'Fertilization_NPK_Dose_Product_Name' attribute.
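The distinction between 'NA' and 'NULL' matters when the CSV files are read programmatically, because the two markers should not be merged into a single missing value. The following is a minimal sketch of one way to keep them apart for a numerical cell; the function name and the status labels are assumptions made for illustration.

```python
from typing import Optional, Tuple


def parse_cell(raw: str) -> Tuple[str, Optional[float]]:
    """Classify a numerical cell read from one of the CSV files.

    Returns a (status, value) pair. The status labels are hypothetical:
    'not_available'  -> the article did not report the data ('NA'),
    'not_applicable' -> the data logically cannot exist ('NULL'),
    'value'          -> a number was reported.
    """
    text = raw.strip()
    if text == "NA":
        return ("not_available", None)
    if text == "NULL":
        return ("not_applicable", None)
    return ("value", float(text))


print(parse_cell("0.00"))
print(parse_cell("NA"))
print(parse_cell("NULL"))
```

Keeping the two statuses separate means that reporting rates, such as those in Table 1, are computed over the cells where a value could have existed.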

Data Records
We describe below the main attributes of the 'Crop' table because this table includes most of the experimental data extracted from the 173 selected articles. Information on other attributes (e.g., articles, field sites, combinations of crop sequences and management practices) is defined in Table 3 (available online only).
In the 'Crop' table, grain yield is by far the attribute including the highest number of data. This high reporting rate reflects the explicit requirement for presence of grain yield data during the article selection process (i.e., criterion 8). Reporting rates are lower for aerial biomass, grain nitrogen content, aerial nitrogen content, fixed aerial nitrogen content, residual soil nitrogen content and water use. Table 1 presents the total number (percentage) of available and missing data for these attributes over all crop*field site*growing season*treatment combinations.
When data were not reported for some attributes (e.g., aerial biomass or water use) in the selected articles, we systematically collected data for related attributes (e.g., harvest index or grain water use efficiency) in order to retrieve the missing data. For example, aerial biomass can be deduced from grain yield and harvest index, and water use can be deduced from grain yield and grain water use efficiency. When data were not available for any related attributes, we contacted the authors of the selected articles, and we asked them to provide us with additional raw data when available.
The name of each combination of crop sequences and management practices was based on the common names of the species, as in both the 'Crop_Sequence_Trt_Name' and 'Crop_Sequence_Trt_Species_Order' attributes in the 'Crop_Sequence_Trt' table. For instance, the name of a legume-cereal sequence without application of nitrogen fertilizer (0N) could be 'Garden pea-Common wheat, 0N' where 'Garden pea' and 'Common wheat' are the common names listed in the United States Department of Agriculture Plants Database (http://plants.usda.gov/java/) for Pisum sativum and Triticum aestivum, respectively. Malik et al. 105 and McEwen et al. 108 described several crop sequences including grain legumes and crop sequences including barrelclover (Medicago truncatula) or common oat (Avena sativa), both preceding common wheat. For these two articles, we excluded the crop sequences including barrelclover and common oat because these crops were grown for forage production.
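The two deductions mentioned above (aerial biomass from grain yield and harvest index, water use from grain yield and grain water use efficiency) can be written out as a short sketch. It assumes the usual definitions, which should be checked against each article: harvest index is the ratio of grain yield to aerial biomass (grains included), expressed as a fraction, and grain water use efficiency is grain yield per unit of water used. The function names and the example units are hypothetical.

```python
def aerial_biomass_from_harvest_index(grain_yield: float, harvest_index: float) -> float:
    """Deduce aerial biomass, assuming harvest_index = grain_yield / aerial_biomass."""
    if not 0.0 < harvest_index <= 1.0:
        raise ValueError("harvest index must be a fraction in (0, 1]")
    return grain_yield / harvest_index


def water_use_from_efficiency(grain_yield: float, grain_wue: float) -> float:
    """Deduce water use, assuming grain_wue = grain_yield / water_use."""
    if grain_wue <= 0.0:
        raise ValueError("water use efficiency must be positive")
    return grain_yield / grain_wue


# Made-up example: 2000 kg/ha of grain, harvest index 0.4,
# grain water use efficiency 8 kg/ha per mm of water.
biomass = aerial_biomass_from_harvest_index(2000.0, 0.4)
water_use = water_use_from_efficiency(2000.0, 8.0)
print(round(biomass, 1), round(water_use, 1))
```

If an article defines aerial biomass without grains (e.g., 'straw'), the first relation no longer holds as written, which is why the definition of the aerial components is recorded separately in the database.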

'Crop_Site_Growing_Season_ID' attribute
This attribute is an index identifying each species grown at a given field site during one or several growing seasons. Identical raw data were found to have been duplicated in two pairs of articles: Muchow et al. 114 and Sinclair et al. 153 on the one hand, and Heenan et al. 71 and Armstrong et al. 2 on the other. The duplicated raw data from Sinclair et al. 153 and Heenan et al. 71 were excluded because the number of crop*field site*growing season*treatment combinations was smaller in these two articles than in their duplicates.
'Crop_Species_Scientific_Name' and 'Crop_Species_Common_Name' attributes
These attributes give the scientific and common names of the species. The scientific name of each species was related to the common name listed in the United States Department of Agriculture Plants Database (http://plants.usda.gov/java/), to avoid confusion due to the use of different common names for the same species. In the absence of a common name for Brassica campestris, Lupinus atlanticus and Triticum sativum, the scientific names of these species were used as common names. For fallow periods, no scientific or common name could be given, and 'Fallow' was reported.

'Crop_Date_From_Seeding_To_Harvest_Day_Number' attribute
We calculated the number of days from seeding date to harvest date, with the open-access application Time and Date (http://www.timeanddate.com/). For data averaged across multiple growing seasons, we calculated the number of days from seeding date to harvest date for each growing season and then obtained the average by dividing by the total number of growing seasons.
Some articles approximated seeding date and harvest date by describing these events as occurring in the 'early', 'middle' or 'late' part of the month. We defined 'early' as the first 15-day period of the month (1st-15th), 'middle' as the 15th day of the month and 'late' as the second 15-day period of the month (15th-30th or 15th-31st). In these cases, the number of days from seeding to harvest was calculated by selecting the last day of the period concerned, i.e., the 15th day of the month for 'early' and 'middle' and the 30th or 31st day of the month for 'late'. Some articles reported only the number of days from seeding to harvest, without indicating precise dates or months. In these cases, we reported only the number of days from seeding to harvest. We used the expression 'NA NA NA' (i.e., 'Day Month Year' formatted expression) for both seeding and harvest dates.
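The date rules described above can be sketched as follows. This is an illustration, not the procedure actually used (the calculation was done with the Time and Date application): the function names and the example dates are hypothetical, and the last day of the month is taken from the calendar, so that a 'late' February date would resolve to the 28th or 29th, a case the rule above does not spell out.

```python
import calendar
from datetime import date
from typing import List


def resolve_day(year: int, month: int, part: str) -> date:
    """Turn an approximate 'early'/'middle'/'late' description into a date."""
    if part in ("early", "middle"):
        return date(year, month, 15)          # last day of the first period
    if part == "late":
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, last_day)    # last day of the month
    raise ValueError(f"unknown part of month: {part!r}")


def days_from_seeding_to_harvest(seeding: date, harvest: date) -> int:
    """Number of days between the seeding date and the harvest date."""
    return (harvest - seeding).days


def mean_days(days_per_season: List[int]) -> float:
    """Average over growing seasons, for data averaged across seasons."""
    return sum(days_per_season) / len(days_per_season)


# Made-up example: seeded in 'early' May, harvested in 'late' September.
seeding = resolve_day(2010, 5, "early")
harvest = resolve_day(2010, 9, "late")
print(days_from_seeding_to_harvest(seeding, harvest))
```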

'Crop_Following_Number' attribute
This attribute is used to distinguish preceding crops from following crops in the crop sequence. It takes three values: '0' (i.e., the main crop or the preceding crop, mostly grain legumes), '1' (i.e., the following crop, mostly cereals and oilseeds) and '2' (i.e., the crop after the following crop, mostly cereals and oilseeds).

'Crop_Multiple_Following_For_Same_Preceding' attribute
Some studies reported results for many different crops and management practices following the same preceding crop. The binary 'Crop_Multiple_Following_For_Same_Preceding' attribute was used to identify data associated with the same preceding crop.

'Crop_Across_Treatment_Averaged_Value' and 'Crop_Across_Treatment_Averaged_Value_Type' attributes
For species grown at the same field site during the same growing season, some articles reported only data averaged over combinations of treatments (e.g., cultivar*seeding date*presence of irrigation). We included these data provided that each type of individual treatment was precisely defined in the article. In all cases, we systematically reported whether or not the data were averaged over combinations of treatments. When data were averaged over combinations of treatments, the total number of replicates was calculated as the sum of the replicates for each of the treatments for which results were averaged.
Figure 3. One 'primary key' and one 'secondary key' are assigned to each table (except for the 'Literature_Search' table, which is exclusively a 'mother' table). Each table includes many attributes. For the sake of readability, attributes are indexed in italic from one 'mother' table to one or many 'child' tables along the cascade path of the database (Table 3 (available online only)). Arrows indicate relationships from one 'mother' table to one or many 'child' tables. For upward and backward matching between tables, each pair of numbers in brackets indicates the cardinality of the relationships between attributes. The cardinality may involve a 'one-to-one' (i.e., 1,1) relationship or a 'one-to-many' (i.e., 1,n) relationship. For upward matching, for instance, the cardinality (1,n) from the 'Article' [...]
For articles reporting data for several cultivars of the same species but without data averaging, the data were reported separately for each cultivar. For articles reporting data averaged over several cultivars of the same species, only the averaged data were included in the dataset. The total number of replicates was calculated by multiplying the number of replicates of each cultivar by the total number of cultivars.
'Crop_Across_Species_Same_Treatment_Value' and 'Crop_Across_Species_Same_Treatment_Value_Type' attributes
In some articles, different types of treatment were applied to species grown at the same site during the same growing season. Each different type of treatment was reported in this case.

'Crop_Replicate_Number' attribute
As mentioned above, when averaged data were reported in the articles, the number of replicates was equal to the sum of the replicates used to calculate each average.
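The two replicate rules stated above (a sum over averaged treatments, and a product for data averaged over cultivars with the same number of replicates) amount to simple arithmetic. The sketch below only restates them; the function names and example numbers are hypothetical.

```python
from typing import List


def replicates_for_treatment_average(replicates_per_treatment: List[int]) -> int:
    """Total replicates behind a value averaged over several treatments."""
    return sum(replicates_per_treatment)


def replicates_for_cultivar_average(replicates_per_cultivar: int, n_cultivars: int) -> int:
    """Total replicates behind a value averaged over several cultivars,
    each cultivar having the same number of replicates."""
    return replicates_per_cultivar * n_cultivars


# Made-up example: three averaged treatments with 4, 4 and 3 replicates,
# and three cultivars with 4 replicates each.
print(replicates_for_treatment_average([4, 4, 3]))
print(replicates_for_cultivar_average(4, 3))
```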

'Crop_Yield_Grain' attribute
This attribute corresponds to grain yield data, with a few exceptions. For Brassica chinensis (pak choi), Citrullus lanatus (watermelon), Gossypium hirsutum (upland cotton), Ipomoea batatas (sweet potato) and Solanum lycopersicum (garden tomato), the yields reported are the economic yields. For Arachis hypogaea (peanut), pods are included in grain yields. In all other situations, the yield data given correspond to grain yields. Mutant non-nodulating legume cultivars, shading treatment and under-sowing treatment were excluded from the database. When grain yield data of following crops were confounded between the effect of preceding species and the effect of nitrogen fertilizer dose, these data were also excluded. Data were reported in 96% of all crop*field site*growing season*treatment combinations. Grain yield varied strongly both between grain legume species and between articles for a given species (Fig. 4a). Median grain yield was lowest for Vigna subterranea (bambarra groundnut) and highest for Trigonella foenum-graecum (sicklefruit fenugreek).

'Crop_Biomass_Aerial' attribute
This attribute corresponds to aerial biomass data. Data were reported in 27% of all crop*field site*growing season*treatment combinations. Aerial biomass varied considerably both between grain legume species and between articles for a given species (Fig. 4b). Median aerial biomass was lowest for Vigna aconitifolia (moth bean) and highest for Trifolium repens (white clover).

'Crop_Yield_Grain_DM_Percentage' and 'Crop_Biomass_Aerial_DM_Percentage' attributes
These two attributes correspond to the percentage of dry matter to which grain yield and aerial biomass correspond, respectively. When only the percentage of dry matter corresponding to aerial biomass was available and grains were included in aerial biomass, we assumed that the grains accounted for the same percentage of dry matter as the aerial biomass.

'Crop_Harvest_Index' attribute
This attribute was reported in the database to calculate aerial biomass at physiological maturity from grain yield. Data were reported in 4% of all crop*field site*growing season*treatment combinations (Fig. 4c). Median harvest index was lowest for Vicia villosa (winter vetch) and highest for Vicia faba (fababean).

'Crop_N_Quantity_Grain' and 'Crop_N_Quantity_Aerial' attributes
These two attributes correspond to the quantity of nitrogen in grains and aerial components, respectively. For the 'Crop_N_Quantity_Grain' attribute, data were reported in 10% of all crop*field site*growing season*treatment combinations. For the 'Crop_N_Quantity_Aerial' attribute, data were also reported in 10% of all crop*field site*growing season*treatment combinations. As for the previous attributes, grain and aerial nitrogen quantities varied both between grain legume species and between articles for a given species (Fig. 5a,b). Median grain nitrogen quantity was lowest for Vigna subterranea (bambarra groundnut) and highest for Lupinus albus (white lupine). Median aerial nitrogen quantity was lowest for Vicia narbonensis (purple broad vetch) and highest for Lupinus mutabilis (sweet tarwi).

'Crop_N_Fixed_Percentage_Aerial' attribute
This attribute corresponds to the percentage of aerial nitrogen fixed by legume species. 'NA' was systematically reported for non-legume species. Data were reported in 3% of all crop*field site*growing season*treatment combinations (Fig. 5c). Median fixed aerial nitrogen percentage was lowest for Cajanus cajan (pigeonpea) and highest for Trifolium repens (white clover).
Two further attributes record the method used to determine the percentage of aerial nitrogen fixed by legume species (e.g., the 15N isotope dilution method or the A-value method), and the scientific name of the non-fixing reference species. Some articles used a legume reference species rather than a non-legume reference species. In all cases, the legume reference species was a mutant non-nodulating legume cultivar that did not fix atmospheric nitrogen.

'Crop_Biomass_Aerial_Stage_Detailed', 'Crop_Biomass_Aerial_Stage_Simplified', 'Crop_N_Fixed_Percentage_Aerial_Stage_Detailed' and 'Crop_N_Fixed_Percentage_Aerial_Stage_Simplified' attributes
These attributes correspond to the phenological stages at which aerial biomass and the percentage of fixed aerial nitrogen (or the quantity of fixed aerial nitrogen with the 'Crop_N_Fixed_Quantity_Aerial' attribute) were determined. The 'Crop_Biomass_Aerial_Stage_Detailed' and 'Crop_N_Fixed_Percentage_Aerial_Stage_Detailed' attributes correspond to the detailed phenological stage originally stated in the article. The 'Crop_Biomass_Aerial_Stage_Simplified' and 'Crop_N_Fixed_Percentage_Aerial_Stage_Simplified' attributes correspond to a simplified phenological stage divided into 'Before physiological maturity' and 'Physiological maturity'.

'Crop_Protein_Quantity_Percentage_Grain' attribute
This attribute corresponds to the percentage or the quantity of protein in grains. In the selected articles, these protein contents were often calculated by multiplying the percentage or the quantity of nitrogen in grains by a constant. However, this constant differed between articles. Note that only a few articles referred to the percentage or the quantity of protein. We reported the percentage or the quantity of protein in grains independently of the percentage or the quantity of nitrogen in grains.
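Because the nitrogen-to-protein constant differed between articles, any conversion between the two quantities has to take the constant as an explicit input. The sketch below illustrates this; the function name is hypothetical, and the value 6.25 used in the example is only the widely used general conversion factor, not a constant taken from the selected articles.

```python
def protein_from_nitrogen(nitrogen: float, factor: float) -> float:
    """Convert a nitrogen percentage (or quantity) into protein.

    The factor is deliberately a required argument: it must come from
    the article concerned, since it differed between articles.
    """
    if factor <= 0.0:
        raise ValueError("conversion factor must be positive")
    return nitrogen * factor


# Made-up example: 4.0% nitrogen in grains with a factor of 6.25.
print(protein_from_nitrogen(4.0, 6.25))
```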

'Crop_N_Balance_Simplified' attribute
This attribute corresponds to the simplified nitrogen balance originally calculated in the articles (e.g., the difference between the quantity of nitrogen in grains and the quantity of fixed aerial nitrogen). Nitrogen balance data were only reported if the attributes used to calculate them were not directly available from raw data (e.g., the quantity of nitrogen in grains and the quantity of fixed aerial nitrogen). This was the case for only three articles.

'Crop_N_Soil_Quantity_Percentage_Seeding' and 'Crop_N_Soil_Quantity_Percentage_Harvest' attributes
These two attributes correspond to the percentage or the quantity of soil nitrogen at seeding and at harvest, respectively.

'Crop_N_Soil_Quantity_Percentage_Seeding_Type', 'Crop_N_Soil_Quantity_Percentage_Seeding_Depth', 'Crop_N_Soil_Quantity_Percentage_Seeding_Date', 'Crop_N_Soil_Quantity_Percentage_Harvest_Type', 'Crop_N_Soil_Quantity_Percentage_Harvest_Depth' and 'Crop_N_Soil_Quantity_Percentage_Harvest_Date' attributes
These attributes correspond to (i) the type of nitrogen (e.g., nitrogen or nitrate or mineral), (ii) the depth of soil used to determine the percentage or the quantity of soil nitrogen and (iii) the date at which soil measurements were made. These attributes were reported at both seeding and harvest.

'Crop_Water_Use_Balance' attribute
This attribute corresponds to the water use or the water balance, according to the equation given in the selected articles. Data were reported in 6% of all crop*field site*growing season*treatment combinations. Water use (or water balance) varied both between grain legume species and between articles for a given species (Fig. 6). Median water use (or water balance) was lowest for Vigna aconitifolia (moth bean) and highest for Lablab purpureus (hyacinthbean).

'Crop_Harvest_Index', 'Crop_N_Percentage_Grain', 'Crop_N_Percentage_Aerial', 'Crop_N_Harvest_Index', 'Crop_N_Fixed_Quantity_Aerial', 'Crop_Water_Use_Balance_Efficiency_Grain' and 'Crop_Water_Use_Balance_Efficiency_Aerial' attributes
These seven attributes were reported in the database to calculate missing data: aerial biomass, quantity of nitrogen in grains, quantity of nitrogen in aerial components, percentage of fixed aerial nitrogen, and water use.
Different aerial components were included in the aerial biomass, the percentage or the quantity of aerial nitrogen, and the efficiency of aerial water use or aerial water balance. Five further attributes were used to record the aerial components originally reported in the articles. When the 'shoot', 'straw' and 'stubble' terms were used to define the aerial components in the articles, we assumed that the grains were not included in the aerial components. This information was reported for (i) the aerial biomass in the 'Crop_Biomass_Aerial_Definition' attribute, (ii) the percentage of aerial nitrogen in the 'Crop_N_Percentage_Aerial_Definition' attribute, (iii) the quantity of aerial nitrogen in the 'Crop_N_Quantity_Aerial_Definition' attribute, (iv) the quantity of fixed aerial nitrogen in the 'Crop_N_Fixed_Quantity_Aerial_Definition' attribute, and (v) the efficiency of aerial water use or aerial water balance in the 'Crop_Water_Use_Balance_Efficiency_Aerial_Definition' attribute.

'Crop_N_Balance_Simplified_Equation' and 'Crop_Water_Use_Balance_Equation' attributes
For these two attributes, we reported the equations used to calculate simplified nitrogen balance and water use or water balance, respectively.

Attributes relating to error terms and error types
When available, we systematically reported error terms and error types associated with data about grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content and water use. For the 'Crop_Yield_Grain' attribute, the 'Crop_Yield_Grain_Error' attribute indicates the error term and the 'Crop_Yield_Grain_Error_Type' attribute indicates the error type for a given item of grain yield data for a given crop in the 'Crop' table. For 48% of grain yields, both error terms and the numbers of replicates were reported. For 47% of grain yields, only the number of replicates was reported.
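Recording the error type alongside the error term and the number of replicates makes it possible to put error terms on a common scale before analysis. As a minimal sketch, a standard error of the mean can be converted into a standard deviation with the standard relation SD = SE × sqrt(n). The type labels 'SD' and 'SE' are assumptions made for illustration; the labels actually stored in the 'Crop_Yield_Grain_Error_Type' attribute should be taken from Table 3.

```python
import math


def to_standard_deviation(error: float, error_type: str, n_replicates: int) -> float:
    """Express an error term as a standard deviation.

    'SD' and 'SE' are hypothetical labels for standard deviation and
    standard error of the mean; other error types are not handled here.
    """
    if error_type == "SD":
        return error
    if error_type == "SE":
        if n_replicates < 1:
            raise ValueError("number of replicates must be at least 1")
        return error * math.sqrt(n_replicates)
    raise ValueError(f"unsupported error type: {error_type!r}")


# Made-up example: a standard error of 0.5 obtained from 4 replicates.
print(to_standard_deviation(0.5, "SE", 4))
```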

Technical Validation
Each article was read carefully at least three times by the same person, to determine the type and the quantity of data reported by the authors. Once the data had been extracted, all the data reported in the tables were checked at least three times by the same person, to identify possible mistakes. SQL subset queries were systematically performed, to check the structural validity and coherence of class, numerical, index, binary and date attributes within each table, and to check the relationships between 'mother' and 'child' tables. Once the set of data was complete, SQL queries were carried out, to compare the entire content of the database with the original data reported in the selected articles. We systematically and manually checked for outliers in order to detect possible mistakes made during data extraction. We returned to the original articles as many times as needed to check the accuracy of the data. We checked the qualitative and quantitative contents of all class, numerical, index, binary and date attributes by importing each table in turn into the R software (version 3.2, https://cran.r-project.org/), and by visualizing data distribution for each attribute in turn. When the meaning of the data reported in the articles was unclear, authors were directly contacted and asked to provide additional information about their experimental protocols. Authors were also asked to provide additional data, particularly if large numbers of treatments had been averaged in their articles. Overall, 17 authors provided us with additional information and raw data (see the Acknowledgements section).
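One of the structural checks mentioned above, verifying the relationships between 'mother' and 'child' tables, can be illustrated with a query that looks for 'child' rows whose 'secondary key' matches no 'mother' row. This is a sketch only: it uses an in-memory SQLite database, and the column names and rows are hypothetical.

```python
import sqlite3

# Hypothetical reduced tables; no key constraint is declared, as would be
# the case after a plain import of the CSV files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Article (Article_ID INTEGER PRIMARY KEY);
CREATE TABLE Site (Site_ID INTEGER PRIMARY KEY, Article_ID INTEGER);
INSERT INTO Article VALUES (1), (2);
INSERT INTO Site VALUES (10, 1), (11, 2), (12, 3);
""")


def find_orphan_sites(connection: sqlite3.Connection) -> list:
    """Return Site_ID values whose Article_ID matches no row in Article."""
    query = """
        SELECT s.Site_ID
        FROM Site AS s
        LEFT JOIN Article AS a ON a.Article_ID = s.Article_ID
        WHERE a.Article_ID IS NULL
        ORDER BY s.Site_ID
    """
    return [row[0] for row in connection.execute(query)]


orphans = find_orphan_sites(conn)
print(orphans)  # site 12 refers to article 3, which does not exist
```

An empty result for every 'mother'/'child' pair indicates that the cascade path is intact.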

Usage Notes
The dataset is based on a compilation of experimental data published in 173 articles over the last 50 years. To our knowledge, this dataset is unique and constitutes the most comprehensive agronomic dataset for grain legume crops worldwide. The dataset can be analyzed to assess performances for a broad diversity of grain legume species, and to provide global rankings for these species in terms of grain yield, aerial biomass, harvest index, aerial nitrogen fixation, nitrogen content in aerial components, nitrogen balance, and water use. It can also be used to assess the effect of including different grain legumes as preceding crops, before cereals and oilseeds in the same crop sequences. Global species rankings were recently estimated for energy crops 188, but never for grain legumes. Rankings of grain legume species could be directly derived from our dataset by using standard meta-analysis methods based on random-effects models 188. Attributes describing environmental factors (e.g., climate conditions and soil types) and management practices (e.g., tillage, fertilization, pest management and irrigation) can be used to analyze the variability of grain legume performances over field sites, growing seasons, and management practices.
Our dataset covers several contrasting geographical areas. It can be used to target suitable grain legume species for cultivation in particular pedoclimatic conditions. In the context of climate change, the database represents a useful resource to compare the production of grain legume species in drought-prone environments, or to identify innovative agricultural techniques for improving grain legume cultivation under yield-limiting abiotic and biotic stresses.
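As an illustration of the random-effects approach mentioned above, the following sketch pools experiment-level effect sizes with the DerSimonian-Laird estimator. The choice of this estimator, the use of log yield ratios as effect sizes, and the numerical values are assumptions made for the example; the text does not specify which random-effects method should be used.

```python
import math

def random_effects_mean(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator.

    effects   : one effect size per experiment (e.g., a log yield ratio)
    variances : the within-experiment variance of each effect size
    Returns (pooled effect, its standard error, between-experiment variance).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]          # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    tau2 = 0.0
    if k > 1:
        # Cochran's Q and the method-of-moments estimate of tau^2
        q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
        c = sw - sum(wi ** 2 for wi in w) / sw
        tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Illustrative values only: log ratios of the yield of one species to a
# reference species in three experiments, with their variances.
pooled, se, tau2 = random_effects_mean([0.10, 0.30, -0.05], [0.01, 0.02, 0.015])
print(pooled, se, tau2)
```

Repeating this calculation for each species against a common reference, then sorting the pooled effects, would give one possible ranking.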
Subsets of the dataset can be used to address regional issues. Figure 7 presents six regional networks including the pairs of grain legume species frequently compared at the same field sites during the same growing seasons, and the grain legume species that were not frequently compared with each other. Such networks can be used to identify the species for which reliable comparisons are feasible, and those for which limited data are available. A quantitative analysis can then be computed to determine regional rankings of grain legume species. This approach could be used to identify highly productive species, and to compare them with major regional grain legume crops (e.g., garden pea in Europe or soybean in North America). Our dataset could thus shed new light on the potential value of as yet underused grain legumes from regional to global scales.
As geographical coordinates of the experiments were systematically reported, our dataset can be connected to large-scale climate and soil maps, and to Geographic Information Systems. An example is shown in Fig. 1, where the Köppen-Geiger climatic classification is indicated for the field sites included in the database. Similar maps could be easily produced using other global classifications of agroecological zones (e.g., the Global Agro-Ecological Zones Data Portal, http://gaez.fao.org/Main.html#), or soil typology (e.g., the Soils Portal of the Food and Agriculture Organization of the United Nations, http://www.fao.org/soils-portal/soil-survey/soil-maps-and-databases/harmonized-world-soil-database-v12/en/).
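Linking a field site to a gridded map amounts to finding the grid cell that contains its coordinates. The sketch below does this for a regular grid of 0.50 degrees, the resolution used for Fig. 1. The function name and the assumption that cells are aligned on the -90/-180 degree origin are ours, so the alignment of any particular climate or soil raster should be checked before use.

```python
import math

def grid_cell_center(lat, lon, res=0.5):
    """Return the centre (lat, lon) of the res x res degree cell containing a site.

    Assumes a regular grid whose cell edges start at -90 (latitude) and
    -180 (longitude); this alignment is an assumption, not a property
    stated in the text.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    # Clamp so that lat = 90 or lon = 180 falls in the last cell.
    row = min(math.floor((lat + 90.0) / res), int(180.0 / res) - 1)
    col = min(math.floor((lon + 180.0) / res), int(360.0 / res) - 1)
    return (-90.0 + (row + 0.5) * res, -180.0 + (col + 0.5) * res)

# Example with arbitrary coordinates (not taken from the dataset).
print(grid_cell_center(48.85, 2.35))
```

The returned cell centre can then be used as a join key against any table that stores a climatic or soil class per grid cell.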
The dataset is also useful for comparing productivity levels of native and non-native grain legume species used as raw materials for food and feed across diverse geographic regions. Grain yield data can be converted into crude protein or energy contents metabolizable for livestock animals (e.g., pigs and poultry) using, for example, the Feedipedia Animal Feed Resources Information System (http://www.feedipedia.org/).
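The conversion from grain yield to crude protein yield is a simple product, sketched below. The function name, the yield unit (t/ha) and the numerical values are assumptions for illustration; the protein and dry matter contents are placeholders, not values taken from Feedipedia, and should be replaced by the figures for the species of interest.

```python
def crude_protein_yield(grain_yield_t_ha, protein_pct_dm, dry_matter_pct=100.0):
    """Convert a grain yield into a crude protein yield (same unit as the input).

    grain_yield_t_ha : grain yield, assumed here to be in t/ha
    protein_pct_dm   : crude protein content, in % of dry matter
    dry_matter_pct   : dry matter content of the grain, in % (100 if the
                       yield is already expressed on a dry matter basis)
    """
    return grain_yield_t_ha * (dry_matter_pct / 100.0) * (protein_pct_dm / 100.0)

# Placeholder values: 2 t/ha of grain at 90% dry matter and 25% crude protein.
print(crude_protein_yield(2.0, 25.0, dry_matter_pct=90.0))
```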
In the future, the dataset could be expanded in different ways. Results of new experiments comparing grain legume species can be easily included in our database. So far, we have focused on legume species produced for grains, but legumes grown for forage can also be included in the database without changing the relational database structure. In many world regions such as Africa, Asia and South America, agricultural grain legumes are frequently intercropped. Data from intercropping experiments could be further included in our dataset. Note that the relational structure of the database is relatively rigid, and should be modified with great care. The addition of a new table can have consequences for the relational framework and the cardinality relationships. However, new data or new attributes can be easily added to existing tables.
The CSV format is well suited to analyzing data with standard statistical software such as R (https://cran.r-project.org/). However, because of the cascade path between tables and the cardinality relationships between attributes (Fig. 3), data extraction can be easily performed using SQL queries. An example query is presented below for extracting binary data indicating absence ('0') or presence ('1') of tillage management for grain legume species included in the article indexed '29' in our dataset: SELECT   Other examples of SQL queries are shown in the TXT-formatted file entitled 'Examples_SQL_Queries.txt', downloadable from Dryad Digital Repository (Data Citation 1).
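A query of the kind described above can be sketched with Python's built-in sqlite3 module. This is not the authors' query: the 'Tillage' table, the key and attribute names ('IDCrop', 'IDArticle', 'Crop_Species', 'Tillage_Presence') and the rows are hypothetical, and the actual names should be taken from the table descriptions and from 'Examples_SQL_Queries.txt'.

```python
import sqlite3

# Hypothetical excerpt of two related tables; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Crop (
        IDCrop INTEGER PRIMARY KEY,
        IDArticle INTEGER,
        Crop_Species TEXT
    );
    CREATE TABLE Tillage (
        IDTillage INTEGER PRIMARY KEY,
        IDCrop INTEGER REFERENCES Crop(IDCrop),
        Tillage_Presence INTEGER  -- 0 = absence, 1 = presence
    );
    INSERT INTO Crop VALUES
        (1, 29, 'Species A'), (2, 29, 'Species B'), (3, 30, 'Species C');
    INSERT INTO Tillage VALUES (1, 1, 1), (2, 2, 0), (3, 3, 1);
""")

# Follow the path from the child table back to the crops of one article.
rows = conn.execute("""
    SELECT c.Crop_Species, t.Tillage_Presence
    FROM Crop AS c
    JOIN Tillage AS t ON t.IDCrop = c.IDCrop
    WHERE c.IDArticle = ?
    ORDER BY c.IDCrop
""", (29,)).fetchall()

for species, tillage in rows:
    print(species, tillage)
```

Passing the article index as a query parameter, rather than writing it into the SQL string, makes it easy to repeat the extraction for other articles.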