French crop yield, area and production data for ten staple crops from 1900 to 2018 at county resolution

Agricultural performance is influenced by environmental conditions, management decisions and economic circumstances. It is important to quantify their respective contribution to allow for detecting major hazards to production, projecting future yields under climate change and deriving adaptation options. For this purpose, time series of agricultural yields with high spatial and long-term temporal resolution are a primary requisite. Here we present a data set of crop performance in France, one of Europe’s major crop producers. The data set comprises ten crops (barley, maize, oats, potatoes, rapeseed, sugarbeet, sunflower, durum wheat, soft wheat and wine) and covers the years 1900 to 2018. It contains harvested area, production and yield data for all 96 French départements (i.e. counties or NUTS3 level) with a total number of 375,264 data points. Entries until 1988 have been digitized manually from statistical yearbooks. The technical validation indicates a high consistency of the data set within itself and with external resources. The data set may contribute to an enhanced understanding of the manifold influences on agricultural performance.

Bernhard Schauberger 1,2,3,6 ✉, Hiromi Kato 4,6, Tomomichi Kato 4,5 ✉, Daiki Watanabe 4 & Philippe Ciais 2

Background & Summary
Future food provision may be challenged by several factors: climate change, a growing global population, shifting dietary patterns, increasing soil degradation and higher pressure on land 1-3. These strains are already perceptible and their impact on agriculture will likely grow in the future. To better understand and quantify these influences, a comprehensive database of historical agricultural performance is essential. We present such a data set for France, a major crop producer that accounted for 5%, 2%, 8%, 14%, 4% and 8% of the global production of wheat, maize, barley, sugar beet, sunflower and rapeseed in 2014, respectively. This paper describes crop performance in France throughout the 20th and early 21st centuries (1900-2018; 1900-2016 for wine). Ten crops are available for subnational administrative units (département, corresponding to counties at the NUTS3 (http://ec.europa.eu/eurostat/web/nuts/overview) or GADM2 (http://gadm.org/) level, with an average area of 5,675 km²; henceforth: department). Each entry comprises cultivated area, production and yield data. The crops are barley, maize, oats, potatoes, rapeseed, sugarbeet, sunflower, durum wheat, soft wheat and wine. Four of them (barley, oats, rapeseed and soft wheat) have distinct spring and winter cultivar records, resulting in a total of 18 crop-cultivar types. This data set contains a total of 375,264 data points at department level that were collected from regional statistical offices in France and, for the years until 1988, manually digitized over the course of two years. Yields (in tonnes dry mass, t DM) were calculated from production and area data since the annotations in the statistical yearbooks were often erroneous. All data were subjected to outlier filtering (see Methods). After filtering, there are 120,942 entries for yields, 127,344 entries for area and 126,978 entries for production.
We evaluate data quality internally and by comparison with other established data sources. This data set is a unique resource due to its long time frame, its high spatial detail and the availability of area, production and yield data.
The data set presented here has been used in two previous studies. The first describes the trends in French yields and discusses possible reasons for recently observed stagnation tendencies 4 , while the second identifies major weather-related hazards for crop production in France 5 . For further discussions about the crop performance data we refer to these studies.

Methods
Crop data. Crop area (in hectares, ha, for sown areas) and production (in kg) statistics at departmental level from 1900 until 1988 were collected from books of national agricultural statistics ('Statistique agricole annuelle' or 'Annuaire de statistique agricole') compiled by the French Ministry of Agriculture; detailed references are provided in the supplementary information. Numbers were manually digitized from photocopied versions of the original paper documents. Data from 1989 to 2018 were derived from digital statistics from the Agreste database ('Statistique agricole annuelle' compiled by the Service de la Statistique et de la Prospective (SSP), Secrétariat Général du Ministère de l'Agriculture, de l'Agroalimentaire et de la Forêt (MAAF), France); details are provided in the supplementary information. Yields were calculated from total production and sown area for each department, because the yield values printed in the old statistics books were apparently often incorrect. Yields are given in kilograms per hectare (kg/ha, for sown area) for dry mass with 10-16% moisture content, depending on the crop.
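The yield calculation described above is a plain ratio of production to sown area. A minimal sketch follows; the function name and the handling of missing or zero areas are illustrative assumptions, not taken from the data set's own processing code.

```python
from typing import Optional


def yield_kg_per_ha(production_kg: Optional[float],
                    area_ha: Optional[float]) -> Optional[float]:
    """Return yield in kg/ha, or None when it cannot be computed.

    Assumption: a missing value or a non-positive area gives a missing yield.
    """
    if production_kg is None or area_ha is None:
        return None
    if area_ha <= 0:
        return None
    return production_kg / area_ha


# Hypothetical department-year: 3,600 t harvested from 1,200 ha sown.
example = yield_kg_per_ha(3_600_000, 1_200)
print(example)  # 3000.0
```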
Data are available for ten crops: soft wheat (spring and winter separately), durum wheat, maize, oats (spring and winter), rapeseed (spring and winter), barley (spring and winter), potatoes, sugarbeet, sunflower and wine. The split into spring and winter crops eventually results in 18 distinct crop-cultivar types. Time frames with available data and the correspondence between French and English names are provided in Table 1.
The shapes of French departments have changed over time. We use the 96 mainland (Metropolitan France) departments in their current form and map historical values to modern departments as follows. Corsica was a single department until 1975 but was then split into Corse-du-Sud and Haute-Corse. Data for Corsica until 1975 were split equally (area, production) or copied (yield) to both new departments. Seine and Seine-et-Oise were two departments until 1967, but were then subdivided into seven new departments on 1 January 1968. To account for this, we consider the values of the seven new departments (Essonne, Hauts-de-Seine, Paris, Seine-Saint-Denis, Val-de-Marne, Val-d'Oise, Yvelines) only from 1968 on and unite the two old departments into one counterfactual unit ("Seine_SeineOise" in the data tables) until 1967.
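The two harmonization rules can be sketched as follows. The record layout (dictionaries with `area_ha`, `production_kg` and `yield_kg_ha`) is hypothetical. Recomputing the yield of the merged unit from summed production and area is also an assumption, made because yields elsewhere in the data set are derived this way.

```python
def split_corsica(area_ha, production_kg, yield_kg_ha):
    """Split one historical Corsica record between the two modern departments.

    Area and production are halved; the yield is copied to both.
    """
    half = {
        "area_ha": area_ha / 2,
        "production_kg": production_kg / 2,
        "yield_kg_ha": yield_kg_ha,
    }
    return {"Corse-du-Sud": dict(half), "Haute-Corse": dict(half)}


def merge_seine(seine, seine_et_oise):
    """Unite the Seine and Seine-et-Oise records into 'Seine_SeineOise'.

    Assumption: area and production are summed and the yield is recomputed.
    """
    area = seine["area_ha"] + seine_et_oise["area_ha"]
    production = seine["production_kg"] + seine_et_oise["production_kg"]
    merged_yield = production / area if area > 0 else None
    return {
        "Seine_SeineOise": {
            "area_ha": area,
            "production_kg": production,
            "yield_kg_ha": merged_yield,
        }
    }


# Hypothetical values, for illustration only.
corsica = split_corsica(1000.0, 2_000_000.0, 2000.0)
merged = merge_seine(
    {"area_ha": 100.0, "production_kg": 200_000.0},
    {"area_ha": 300.0, "production_kg": 900_000.0},
)
```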
Multiple cropping per year within this set of crops is accounted for by separate area data, but is practically nonexistent in France 6 .
Fig. 1 Aggregated yield (a,b), area (c,d) and production (e,f) data. Crops are split by seasonal types for display reasons. Yields for sugarbeet, potatoes and wine (for wine also production) have been scaled with 0.1 for display reasons (indicated in the legends). Yield units are t/ha, area units are hectare (ha) and production units are tons except for wine where these are hl/ha (yields) and hl (production), respectively (both before scaling). Wine data only run from 1900 to 2016.
Quality filters. Some yield values had to be considered outliers, even after checking for digitizing errors.
There were four criteria for defining an outlier. First, absolute yield values above a threshold that is currently physiologically unreachable were removed; threshold values were 15 t/ha for barley and durum wheat, 200 t/ha for sugarbeet and potatoes, 20 t/ha for maize, oats and wheat, 10 t/ha for rape and sunflower and 200 hl/ha for wine. These thresholds were chosen to eliminate visually obvious outliers likely due to mismatches between area and production records. The values are set slightly above current maximum attained yields, thus remaining permissive and removing only obvious errors in this first step. Additionally, all yield values for winter rape in 1944, spring rape in 1968 and spring barley in 1980 were removed due to wrongly reported values in the yearbooks. This first step removed a total of 167 yield data points. Second, the top 1% of yield values across all departments per decade were removed. Third, values outside the mean ± four standard deviations of each crop-department time series (for yield, area and production separately) were removed. Fourth, and finally, a variance filter similar to that of the third step was applied within each decade of a single time series, removing values outside the decadal mean ± two (for yield) or three (area, production) decadal standard deviations. The latter three filters removed, on average, 3.6% of the yield and 0.2% of the area or production data, respectively (Table 1). The median number of yield outliers per department was 43 (out of 1,260 data points on average), ranging from 4 (Hauts-de-Seine) to 255 (Nord), with an interquartile range of 35-50 outliers. Outliers were masked as missing values to avoid introducing a bias from any correction. In the accompanying data sets we provide two versions of the full data set, one without any corrections ("RAW") and one where the filters described above have been applied ("FILTERED").
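The four filter steps can be sketched for the yields of a single crop as below. This is an illustration under several assumptions that the text leaves open: the steps are applied in sequence, so each one sees the data already masked by the previous one; decades are calendar decades (1900-1909 and so on); the 1% share is rounded down; and the sample standard deviation is used. The function names and the `{(department, year): yield}` layout are hypothetical, and `max_yield` must be given in the same unit as the yields.

```python
from collections import defaultdict
from statistics import mean, stdev


def mask_beyond_sigma(data, keys, k):
    """Set to None all values more than k sample std devs from the mean."""
    values = [data[key] for key in keys if data[key] is not None]
    if len(values) < 2:
        return  # a standard deviation needs at least two values
    centre, spread = mean(values), stdev(values)
    for key in keys:
        value = data[key]
        if value is not None and abs(value - centre) > k * spread:
            data[key] = None


def filter_yields(yields, max_yield, k_series=4.0, k_decade=2.0):
    """Apply the four outlier steps to {(department, year): yield or None}."""
    data = dict(yields)  # work on a copy; the input stays untouched

    # Step 1: mask yields above the crop-specific absolute threshold.
    for key, value in list(data.items()):
        if value is not None and value > max_yield:
            data[key] = None

    # Step 2: mask the top 1% of yields per decade across all departments.
    by_decade = defaultdict(list)
    for (dept, year), value in data.items():
        if value is not None:
            by_decade[year // 10].append((dept, year))
    for keys in by_decade.values():
        n_drop = len(keys) // 100  # assumption: round the 1% share down
        ranked = sorted(keys, key=lambda key: data[key], reverse=True)
        for key in ranked[:n_drop]:
            data[key] = None

    # Step 3: mean +/- k_series std devs over each department's full series.
    by_dept = defaultdict(list)
    for dept, year in data:
        by_dept[dept].append((dept, year))
    for keys in by_dept.values():
        mask_beyond_sigma(data, keys, k_series)

    # Step 4: mean +/- k_decade std devs within each decade of each series.
    by_dept_decade = defaultdict(list)
    for dept, year in data:
        by_dept_decade[(dept, year // 10)].append((dept, year))
    for keys in by_dept_decade.values():
        mask_beyond_sigma(data, keys, k_decade)

    return data


# Hypothetical series: nine ordinary years and one suspicious spike in 1969.
series = {("B", 1960 + i): 3000.0 for i in range(9)}
series[("B", 1969)] = 6000.0
cleaned = filter_yields(series, max_yield=20000.0)
```

In this toy series the spike passes the four-sigma series filter and is masked only by the stricter two-sigma decadal filter. For area and production, the same helper would be called with three decadal standard deviations, as stated in the text.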
Validation. Nationally aggregated area, production and yield data from our data set were validated with national data from 1961 to 2018 provided by the FAO (http://faostat3.fao.org/home/E). Area and production data for crops with separate spring and winter data were summed at department level to test agreement with area and production data digitized for the 'total' crop.

Data records
Time series length, the number of data points and outlier numbers are provided in Table 1. All results presented afterwards refer only to the filtered data set without outliers. The most complete time series are available for soft wheat, oats, barley, potato, maize and wine. National yield (area-weighted), area and production trends as aggregates over all departments are displayed in Fig. 1. Trends for the bottom and top 5% percentiles as well as the difference between them, i.e. the 90% confidence interval for expected yields, are shown in Fig. 2.
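The area-weighted national yield is the sum of department yields multiplied by their areas, divided by the total area. A minimal sketch follows, assuming that departments with a missing yield or area are skipped for that year; the function name and input layout are illustrative.

```python
def national_yield(records):
    """Area-weighted mean yield over (yield, area) pairs for one year.

    Assumption: pairs with a missing yield or a missing or non-positive
    area are left out of both the numerator and the denominator.
    """
    weighted_sum = 0.0
    total_area = 0.0
    for dept_yield, area in records:
        if dept_yield is None or area is None or area <= 0:
            continue
        weighted_sum += dept_yield * area
        total_area += area
    return weighted_sum / total_area if total_area > 0 else None


# Two hypothetical departments: 3,000 kg/ha on 100 ha and 5,000 kg/ha on 300 ha.
print(national_yield([(3000.0, 100.0), (5000.0, 300.0)]))  # 4500.0
```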
All data described here are available via GFZ Data Services, under https://doi.org/10.5880/PIK.2021.001 and with a CC-BY 4.0 license 7 (see Usage Notes). There are two gzipped tarballs, one with filtered data ("FILTERED") and one with unfiltered ("RAW") data (see Methods). Within each set, the data are organised in tables in plain text files, with one table per crop-cultivar type in which all three data types (area, production, yield) are combined. This results in 18 tables per filter type. Semicolons (";") are used as separators. Diacritic letters in French location names were standardized to the Latin alphabet. Table entries

Technical Validation
Nationally aggregated yield time series were compared with FAO yield data, available from 1961 to 2018. Yields were aggregated from departments with area weighting. For crops with distinct spring and winter types only total yields were compared. Barley, maize, oats, potatoes, rapeseed, sugarbeet, sunflower and soft wheat were available in both data sets; the other crops are not listed by the FAO. All correlation coefficients (Pearson's r) for yield, area and production are at least 0.99, with only five exceptions; all are above 0.95 (Table 2). All correlations are significant with p < 1e-5. These high correlations indicate that the subnational data are reasonable. It has to be considered, though, that FAO statistics for France are compiled from subnational data, so the two data sets are not independent. The high correlations therefore mainly attest to the quality of the digitization.
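The comparison with FAO data rests on Pearson's r between two national time series. A minimal sketch using only the standard library follows; dropping years with a missing value in either series is an assumption, and the function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long series, skipping missing pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    s_yy = sum((y - mean_y) ** 2 for _, y in pairs)
    if s_xx == 0 or s_yy == 0:
        return None  # a constant series has no defined correlation
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical national yields (t/ha) from the two sources for four years.
ours = [6.1, 6.4, None, 7.0]
fao = [6.0, 6.5, 6.8, 7.1]
r = pearson_r(ours, fao)
```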
Summed area and production data for crops with separate spring and winter data agree well with area and production data, respectively, for the 'total' time series. Pearson's r is at least 0.98 in all cases for area and production, pointing to high consistency in the data. All disagreements are minor and biased to higher area or production values, respectively, when summed from spring and winter data. This may point to some information lacking in the 'total' time series, but not on a practically relevant level for national aggregation.
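The direction of the disagreement between summed seasonal data and the 'total' series can be quantified as a mean signed relative difference. The sketch below assumes `{year: value}` dictionaries and skips years where any of the three values is missing; names and layout are hypothetical.

```python
def mean_relative_bias(spring, winter, total):
    """Mean of (spring + winter - total) / total over years with complete data.

    A positive result means the seasonal sum exceeds the 'total' series.
    """
    diffs = []
    for year, total_value in total.items():
        s, w = spring.get(year), winter.get(year)
        if total_value is None or total_value <= 0 or s is None or w is None:
            continue
        diffs.append((s + w - total_value) / total_value)
    return sum(diffs) / len(diffs) if diffs else None


# Hypothetical areas (ha) for two years.
bias = mean_relative_bias(
    spring={2000: 10.0, 2001: 20.0},
    winter={2000: 90.0, 2001: 85.0},
    total={2000: 100.0, 2001: 100.0},
)
```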
The fraction of outliers, using the criteria defined in the Methods section, was below 4.6% for all crops and below 4% for most (Table 1). The overall fraction of outliers, which we assume to be annotation errors in the statistical yearbooks, is 3.6% for yields. Outlier numbers for area and production are much lower (0.2%, on average), but outlier detection is more difficult in these time series since values may vary widely between departments and years without being unreasonable.
Notably, we assume that the values from the early period before World War II are trustworthy in principle, as France has a long tradition (since Napoleonic times) of centralized administration with harmonized national directives, including for statistics, in each department. Moreover, the outlier filters did not identify a higher rate of errors during the early period than during later years. Thus, we assume that the area, yield and production data are of sufficient quality to inspect trends and changes in variability also in the early decades of the 20th century.
This data set does not distinguish between rainfed and irrigated yields, which may be a drawback when analyzing, for example, weather influences on crop production. However, the area equipped or used for irrigation was not recorded in the handbooks. Statistical methods in the regional statistical offices are not known to have changed over time, so values can be compared across the complete time frame.

Usage Notes
The French yield data set described here is available to the general public without any restrictions except citation of this data descriptor paper and the data set 7 (CC-BY 4.0; Creative Commons License with attribution). The full license text is available with the data set.
The online repository contains two versions of the data, filtered and unfiltered (see Methods for details). We recommend using only the filtered data, but we also supply the unfiltered original data to allow for custom filters where appropriate.
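Because the tables are plain text with semicolons as separators, they can be read with a standard CSV parser. The sketch below is illustrative only: the column names, the sample rows and the "NA" marker for masked values are assumptions and should be checked against the actual files in the repository.

```python
import csv
import io


def read_crop_table(handle, missing=("", "NA")):
    """Read a semicolon-separated table into a list of row dictionaries.

    Assumption: masked values appear as an empty field or as "NA".
    """
    reader = csv.DictReader(handle, delimiter=";")
    rows = []
    for row in reader:
        rows.append({key: (None if value in missing else value)
                     for key, value in row.items()})
    return rows


# Hypothetical in-memory stand-in for one crop-cultivar table.
sample = io.StringIO(
    "department;year;area;production;yield\n"
    "Ain;1900;1000;1500000;1500\n"
    "Ain;1901;NA;NA;NA\n"
)
rows = read_crop_table(sample)
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` points to a table extracted from one of the two archives.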