Harmonised LUCAS in-situ land cover and use database for field surveys from 2006 to 2018 in the European Union

Accurately characterizing land surface changes with Earth Observation requires geo-located ground truth. In the European Union (EU), a triennial surveyed sample of land cover and land use has been collected since 2006 under the Land Use/Cover Area frame Survey (LUCAS). A total of 1351293 observations at 651780 unique locations for 106 variables along with 5.4 million photos were collected during five LUCAS surveys. Until now, these data have never been harmonised into one database, limiting full exploitation of the information. This paper describes the LUCAS point sampling/surveying methodology, including collection of standard variables such as land cover, environmental parameters, and full resolution landscape and point photos, and then describes the harmonisation process. The resulting harmonised database is the most comprehensive in-situ dataset on land cover and use in the EU. The database is valuable for geo-spatial and statistical analysis of land use and land cover change. Furthermore, its potential to provide multi-temporal in-situ data will be enhanced by recent computational advances such as deep learning.

Schematic overview of the LUCAS and harmonisation methodologies. The left side illustrates the sampling at the basis of the production of the LUCAS primary data. The top right side shows the raw base data (micro data). The process of harmonising is contained within the multi-year harmonised aggregation block and is the subject of the following two sections. The bottom right presents the four main outputs associated with this manuscript (more in section Data Records): a harmonised, legend-explicit, multi-year, ready-to-use version of the LUCAS micro data (section Overview of multi-year harmonised LUCAS survey database 49 ), a database with all cardinal-direction landscape and point photos collected during the surveys, including their respective EXIF attributes (section Overview of EXIF photos database, EXIF table 49 , photos on https://gisco-services.ec.europa.eu/lucas/photos/), the survey geometries 49 and an R package to generate the data 51 .
to access are classified by photo-interpretation in the office, using the latest available ortho-photos or Very High Resolution (VHR) images. Although most of the points assigned a priori for in-situ assessment can effectively be visited in the field, some cannot be reached, either because of lack of access to the point or because the point location is more than 30 minutes' walking distance from the closest point reachable by car. Those points are thus photo-interpreted on ortho-photos or VHR images in the field by the field surveyor. Furthermore, a significant difference sometimes exists between the theoretical LUCAS point and the actual GPS location reached by the surveyor. Observations are collected for the LUCAS point, while the photos are taken at the actual GPS location. Both locations and the distance between them are recorded.
Previous LUCAS use cases and shortcomings. In the scientific literature, LUCAS land cover and land use survey data have been used to derive statistical estimates 2 and to describe land cover/use diversity at regional level 12 , and the LUCAS sampling frame has been used as a basis for various applications, including assessing the availability of crowd-sourced photos potentially relevant for crop monitoring across the EU 13 . LUCAS was designed to derive statistics for area estimation (e.g. 3 and 10 ). Recently, several researchers have started to use LUCAS data in large scale land cover mapping processes, especially as a source of training and/or validation data for supervised classification approaches at regional/national scale [14][15][16][17][18][19][20] .
Several drawbacks become apparent when working with the original LUCAS datasets. While inconsistencies could be due to the enumerators' subjective interpretation of the legends and to the legend itself, they are also related to the complexity of the field survey: a large number of surveyors (>700) and complex documentation for the enumerators (>400 pages combining all the documents), translated into 20 languages. These drawbacks hinder the further use of the LUCAS data by the scientific community as a whole and in particular by users who are active in the emerging fields of big data analytics, data fusion, and computer vision. Such drawbacks include:
• Inconsistencies and errors between legends and labels from one LUCAS survey to the next, which hamper temporal analysis.
• Missing internal cross-references in the datasets that would facilitate computation and the linking of observed variables, photos, etc.
• The original full resolution photos taken at each surveyed point are not available for download.
• The lack of a single entry point or consolidated database, which hampers automated processing and big data analysis.
Therefore, we have gone through an extensive process of cleaning by semantic and topological harmonisation, along with connecting the originally disjoint LUCAS datasets in one consolidated database with hard-coded links to the full-resolution photos, openly accessible along with this paper.

Methods
Having contextualized the LUCAS survey, we proceed with describing the full methodological workflow to harmonise the data, as schematically shown in Fig. 1. The Sampling and Survey sub-figures provide an overview of the methodological framework of the LUCAS data collection itself (see previous section Background & Summary). The Data aggregation and Results sub-figures illustrate the work carried out in this study. The datasets collected during the five surveys (in 2006, 2009, 2012, 2015, 2018) are the main LUCAS products available (more in section Micro data collection and documentation (Protocol 1)). These datasets and their respective data documentation were used to create the multi-year harmonised database. The harmonisation process is described below and summarised in Table 1. In support of Table 1, Table 2 lists the name changes, Table 3 lists the new columns added, Online-only Table 1 lists the missing columns added, and Online-only Table 2 lists the variable re-coding. The results are consolidated in one single consistent and legend-explicit table along with hard-coded links to the full resolution photos (stored on the GISCO server, https://gisco-services.ec.europa.eu/lucas/photos/). The LUCAS primary data includes alpha-numerical variables and field photographs linked to the geo-referenced points.
Micro data collection and documentation (Protocol 1). The first step is to collect the data from the source for each survey year (see Table 1 47 ), which contains information on variable name, data type and description in a more consolidated fashion, making it easier to find information about the relevant variable.
The third and final step in Protocol 1 is the generation of the mapping files used for value recoding. The workflow maps the ascertained relationship between those variables that are the same but have changed in name or alpha-coding between surveys. To recode all variables coherently from one survey to the next, the original data is changed permanently. All transformations are done by recoding ordinal variables to be compliant with the encoding of variables used in the last survey (2018). These mappings serve as a blueprint for the transformation and data integration described in Protocol 2.
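The role of a mapping file in value recoding can be illustrated with a minimal Python sketch. It is not the authors' implementation (the published workflow relies on SQL and R); the CSV layout (`old_code`, `new_code`), the placeholder codes, the column name `lc1` and the helper names are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical mapping file: one row per legacy code and its 2018 equivalent.
# The codes are placeholders, not values taken from the LUCAS legend.
MAPPING_CSV = """old_code,new_code
OLD_A,NEW_A
OLD_B,NEW_B
"""

def load_mapping(text):
    """Read a two-column CSV into a dict {old_code: new_code}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["old_code"]: row["new_code"] for row in reader}

def recode_column(records, column, mapping):
    """Return copies of the records with `column` recoded.

    Values without an entry in the mapping are kept unchanged.
    """
    recoded = []
    for record in records:
        new_record = dict(record)
        new_record[column] = mapping.get(record[column], record[column])
        recoded.append(new_record)
    return recoded

mapping = load_mapping(MAPPING_CSV)
records_2009 = [
    {"point_id": "1", "lc1": "OLD_A"},
    {"point_id": "2", "lc1": "UNCHANGED"},
]
records_recoded = recode_column(records_2009, "lc1", mapping)
```

Keeping the mapping in a file rather than in code makes each recoding decision traceable back to the reference documents.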
Micro data harmonisation (Protocol 2). The harmonisation workflow, alongside the database consistency checks performed, is shown in Fig. 2, and the code is described in section Code availability. The general principle of the harmonisation workflow was to convert all the field legends to fit the latest database layout, i.e. that of 2018 (the next LUCAS is planned for 2022).
Some notable changes in the source tables had to be made in order to make the harmonisation and subsequent merger into one complete table possible. This was accomplished with the above-mentioned instance-mapping files (section Micro data collection and documentation (Protocol 1)). All manipulations executed over the separate tables prior to the merger are listed in Table 1
Table 3). Performed on all tables using the Add_new_cols() function.
5. Upper case: iteratively converting all characters of selected fields to upper case. Performed on all tables using the Upper_case() function.
6. Re-code variable: iteratively re-coding selected variables according to the mapping CSV files, which were designed by referring back to the reference documents. Performed on all tables but 2018 using the Recode_vars() function.
7. Order columns: iteratively ordering all columns according to the template from the 2018 survey. Performed on all tables but 2018 using the Order_cols() function.
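The column-level steps listed above (adding columns, upper-casing and ordering) can be sketched in a few lines of Python. These are hypothetical stand-ins written for illustration only: they are not the Add_new_cols(), Upper_case() and Order_cols() functions of the R package, and the column names (`point_id`, `year`, `lc1`, `new_var`) are assumed.

```python
# Hypothetical Python stand-ins for the column-level harmonisation steps.
# Column names are illustrative, not the package's actual schema.

def add_new_cols(rows, template):
    """Add every template column a row lacks, filled with None (null)."""
    return [{**dict.fromkeys(template), **row} for row in rows]

def upper_case(rows, fields):
    """Upper-case string values of the selected fields; leave others untouched."""
    return [
        {k: v.upper() if k in fields and isinstance(v, str) else v
         for k, v in row.items()}
        for row in rows
    ]

def order_cols(rows, template):
    """Reorder columns to follow the template.

    Expects every template column to be present; other columns are dropped.
    """
    return [{col: row[col] for col in template} for row in rows]

TEMPLATE_2018 = ["point_id", "year", "lc1", "new_var"]
rows_2009 = [{"lc1": "b11", "point_id": "42", "year": 2009}]

harmonised = order_cols(
    upper_case(add_new_cols(rows_2009, TEMPLATE_2018), fields={"lc1"}),
    TEMPLATE_2018,
)
```

Running the steps in this order means the ordering step can assume that every template column already exists in each table.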

Merge and post-processing (Protocol 3). The third part of the harmonisation process includes the merging of the harmonised tables of each survey year, plus the additional steps listed below, before exporting the final data outputs. The workflow ends with the output exports. The table is exported as CSV and the geometries as shapefiles. The full workflow depends on two software prerequisites: a running PostgreSQL server and an installation of R (more about the versions used in section Code availability). The pipeline is provided as an R package for ease of reproducibility and transparency (section Code availability).

Full resolution LUCAS photos. In addition to the alphanumerical and geometry information of the survey, a complete database with full-resolution point and landscape photos was set up with photos retrieved from Eurostat. This archive was organised as a table with all the exchangeable image file (EXIF) variables for each of the images, including a unique file path, as stored on the Eurostat GISCO server, for easy retrieval by other researchers. Besides the EXIF attributes, each photo is also hard-coded with the respective point ID of the LUCAS point and the year of survey. The photos' metadata were extracted with ExifTool (v 10.8) 48 , resulting in a database of photos that was compared for completeness with the survey data records. The hard-coded HTTPS links to each photo in the consolidated database allow for large data volume queries and selection tasks.

Data Records
The first section, Storage, describes each dataset provided along with this manuscript, including the table, photo, and geometry databases, along with the R package created to compile and construct all the data. The second section, Overview of multi-year harmonised LUCAS survey database, provides an outline of the resulting harmonised database, and the last section, Overview of EXIF photos database, provides an overview of the photo database.

Storage.
1. Multi-year harmonised LUCAS survey data. The harmonised database (available for download here 49 and also archived as compressed folder here 50 ) contains 106 variables and 1351293 records, each corresponding to a unique combination of survey year and field location. The same dataset is also available as a separate file per year for users interested only in one specific survey. The database is provided with a Record descriptor (Online-only Table 3).
The total number of surveyed points has increased significantly from the 2006 pilot study (168401) to 2015 (340143) (Table 4). This rise is mainly due to the increase in thematic richness, scope, and scale of the study, from what was primarily an evaluation of agricultural areas (2006) to a more holistic and exhaustive inspection of the EU territory. Further, the total number of surveyed countries increased from 11 in 2006 to 28 in 2018 (Table 4). Over the five surveys, 1031813 observations (76.36%) were done in-situ. Out of these in-situ observations, 94% were surveyed within 100 m of the theoretical LUCAS point and 6% were more than 100 m away from the point. The proportion of points where actual in-situ data was collected has decreased from 92.18% in 2006 to 63.67% in 2018. Furthermore, 10.92% of the points (i.e. 147574) that were visited in-situ turned out not to be accessible in practice and were photo-interpreted in the field. The number of points surveyed per country and per year ranged from 79 (Malta) to 48215 (France). Finally, over the five surveys, 1677 points were outside national territory, i.e. "NOT EU", corresponding to water outside national borders or to countries including Russia, Turkey, Albania and Switzerland. Figure 3 provides the cumulative frequency of level-3 classes (out of 77 classes in total) assigned to the surveyed points, sorted by reference year.
Land Cover/Land Use (LC/LU) classification specifications can be found in the new reference document, containing the harmonised C3 legend (see Harmonized C3 legend in 49 ).
The classification system follows rules on spatial and temporal consistency: it can be applied and compared both between locations in the EU and between survey years. Additionally, excluding 2006, it is 'as much as possible' compatible with other existing LC/LU systems, e.g. those of the Food and Agriculture Organization (FAO) and the statistical classification of economic activities in the European Community (NACE) (2009-2018), and it fulfills the specifications of the Infrastructure for Spatial Information in Europe (INSPIRE) (2015-2018). To inform about changes between two consecutive surveys, the data providers describe the adjustments to the terminology in the documentation. The 3-level legend system is arranged hierarchically: the first level (letter group) corresponds to the eight main classes obtained by ortho-photo-interpretation during the second level stratification phase (Fig. 1); the second and third levels, representing subcategories of these main classes, are indicated by a combination of the letter group and further digits.
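The hierarchical coding makes it straightforward to aggregate level-3 classes to coarser levels. The sketch below assumes that a level-3 code consists of one letter followed by two digits (for example "B11"), so that each additional character adds one level; this layout and the helper name `legend_levels` are assumptions for illustration and should be checked against the harmonised C3 legend.

```python
from collections import Counter

def legend_levels(code):
    """Split a level-3 code such as 'B11' into (level 1, level 2, level 3).

    Assumes one letter followed by exactly two digits.
    """
    if len(code) != 3 or not code[0].isalpha() or not code[1:].isdigit():
        raise ValueError(f"unexpected level-3 code: {code!r}")
    return code[0], code[:2], code

# Aggregate a handful of illustrative level-3 codes to level 1.
codes = ["B11", "B12", "B21", "C10"]
level1_counts = Counter(legend_levels(code)[0] for code in codes)
```

The same prefix logic can be used to group observations at level 2 by taking the first two characters.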
The number of point visits is shown in Table 5. Some LUCAS points were visited only once across the five surveys (n = 332605), while others were visited each time, thus totaling five visits (n = 35204). In total, 651780 locations were visited at least once. Figure 4 shows a map with the visit frequency for each point over Europe.
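A visit-frequency summary of this kind can be derived directly from the (point ID, survey year) pairs. The following sketch uses made-up records and a hypothetical helper name; it only illustrates the counting logic, not the authors' code.

```python
from collections import Counter

def visit_frequency(records):
    """Return a Counter mapping number of visits -> number of locations.

    `records` is an iterable of (point_id, year) pairs; duplicate pairs
    are counted once.
    """
    visits_per_point = Counter(point_id for point_id, _year in set(records))
    return Counter(visits_per_point.values())

# Made-up example: point 1 visited twice, point 2 once, point 3 three times.
records = [(1, 2006), (1, 2009), (2, 2018), (3, 2009), (3, 2012), (3, 2015)]
frequency = visit_frequency(records)
unique_locations = sum(frequency.values())
total_visits = sum(n_visits * n_points for n_visits, n_points in frequency.items())
```

Summing the counts gives the number of unique locations, and the visit-weighted sum gives back the number of records.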
Overview of EXIF photos database. The available photos (N, E, S, W, P, i.e. North, East, South, West, and Point) were catalogued, totaling 5440459 photos for the 5 surveys (see Table 6 for detailed distribution). The lucas_harmo_exif.csv table contains the essential and available LUCAS EXIF information (27 variables , gps_lat), the digital cameras with GPS could also capture the location where the photos were taken as well as the orientation, i.e. the azimuth angle. In the first surveys, the digital camera and the GPS were separate devices. The orientation was determined with a traditional compass. The data were used to cross-validate the geo-location reported during the survey. To assess the availability of this information, the EXIF information of the 5440459 photos was retrieved. As summarised in the two last columns of Table 6
It was decided that having this information in a separate table is more sensible in terms of storage size and accessibility, whereby cross-table checks can easily be performed by executing joins between the tables based on point ID and year of survey. By combining the information from the two tables (i.e. the multi-year harmonised LUCAS survey database and the EXIF table database), one arrives at a large set of labeled examples, corresponding to images of the 77 different types of recorded land cover.
The background RGB imagery for (c) and (d) is obtained from "Map data ©2019 Google".
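The join between the survey table and the EXIF table on point ID and year of survey can be sketched as follows. The column names (`point_id`, `year`, `lc1`, `file_path`) and the example rows are assumptions for illustration; the actual names should be taken from the record descriptor.

```python
def label_photos(survey_rows, exif_rows, label_column="lc1"):
    """Attach the surveyed label to each photo via the (point_id, year) key.

    Photos without a matching survey record are skipped.
    """
    survey_index = {(row["point_id"], row["year"]): row for row in survey_rows}
    labeled = []
    for photo in exif_rows:
        match = survey_index.get((photo["point_id"], photo["year"]))
        if match is not None:
            labeled.append({**photo, label_column: match[label_column]})
    return labeled

# Made-up rows; file paths are placeholders, not real photo locations.
survey_rows = [
    {"point_id": 1, "year": 2015, "lc1": "B11"},
    {"point_id": 1, "year": 2018, "lc1": "B16"},
]
exif_rows = [
    {"point_id": 1, "year": 2018, "file_path": "placeholder_north.jpg"},
    {"point_id": 1, "year": 2018, "file_path": "placeholder_point.jpg"},
    {"point_id": 9, "year": 2018, "file_path": "placeholder_orphan.jpg"},
]
labeled_photos = label_photos(survey_rows, exif_rows)
```

Because several photos share one survey record, the join is one-to-many from the survey table to the EXIF table.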

Technical Validation
The first part of this section briefly summarises the LUCAS field surveys quality check. The section then focuses on analyses carried out specifically to support the technical quality of the multi-year harmonised LUCAS database process.
The LUCAS surveyed observations are subject to detailed quality checks (see LUCAS metadata 52 and the data quality control documents available for 2009 53 , 2012 54 , 2015 55 ). First, an automated quality check verifies the completeness and consistency after field collection. Second, all surveyed points are checked visually at the offices responsible for collection. Third, an independent quality controller interactively checks 33% of the points for accuracy and compliance against pre-defined quality requirements, including the first 20% of observations for each surveyor, to prevent systematic errors during the early collection phase.
The presented data consolidation effort seeks to enhance the quality of an existing product. Ensuring data quality by harmonisation throughout the years is thus essential. Data quality was ensured by taking into account validity, accuracy, completeness, consistency, and uniformity throughout data processing (Fig. 2):
• Validity of the harmonised database was ensured via data type (for which information can be found in the record descriptor) and a unique constraint on a composite key (consisting of the point ID and year of survey).
• Accuracy of the data relies on the source data, for which the quality was assessed as described in the previous paragraphs.
• Completeness checking shows that, since several variables have been added over the years, many missing values exist. In such cases, fields were populated with null values. Consistency across surveys has been enhanced. All surveys were harmonised towards the 2018 survey.
• Consistency of the presented dataset was internally ensured through running checks at various stages of processing.
• Uniformity checks revealed that the geographical coordinates in columns th_long and th_lat show different locations between some survey years. In the interest of complete uniformity, it was decided to have the values of these variables hard coded from the LUCAS grid. Because the LUCAS grid is a non-changing feature of all LUCAS surveys, the location of each point remains the same throughout the years. Thus any discrepancy between the recorded theoretical location of a LUCAS point in the micro data and the grid must be corrected. This was done for all but 64 points from 2006, which were recorded at an inaccurate location and were thus removed from the grid.
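The validity check on the composite key can be illustrated outside the database as well. In the published workflow the constraint is enforced in PostgreSQL; the Python sketch below only mirrors the idea, and the column names `point_id` and `year` are assumed.

```python
from collections import Counter

def duplicate_keys(rows, key=("point_id", "year")):
    """Return the sorted composite keys that occur more than once."""
    counts = Counter(tuple(row[column] for column in key) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

# Made-up rows: the pair (2, 2009) appears twice and violates uniqueness.
rows = [
    {"point_id": 1, "year": 2006},
    {"point_id": 1, "year": 2009},
    {"point_id": 2, "year": 2009},
    {"point_id": 2, "year": 2009},
]
violations = duplicate_keys(rows)
```

An empty result means every record is uniquely identified by its point ID and survey year.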
To further assess the spatial accuracy of the data, the theoretical point from the LUCAS grid (th_long, th_lat) and the actual GPS measurement of the survey observation point (gps_lon, gps_lat) were compared. This is important for several reasons. Firstly, it allows one to ascertain the real distance between the point actually surveyed and the point supposed to be surveyed, which is, in a sense, a proxy for the quality of the surveyed observation itself. Secondly, it is an accuracy check of the surveyed distance between the theoretical point and the survey observation point, as collected by the surveyor, "as provided by the GPS (in m)" (column obs_dist), against the distance between the same points as calculated from the data (column th_gps_dist). It must be noted that for the 2006 survey the variable obs_dist was collected as a range, whereas for the other years it represents the actual value of the distance. Because of this lack of uniformity, it was decided to hard code the values for 2006 to match exactly with the calculated distance. In this way we ensure consistency in the data type of the column, at the cost of losing the nuance of the original data. This procedure explains why 2006 shows a 100% match between recorded and calculated distance (Table 7), whereas 2009 shows a match of 96.3%, meaning that the values did not match in only 3.7% of the cases. In carrying out this comparison it became apparent that the percentage of matching distances has increased throughout the years, probably due to better precision of positioning sensors. Thus the total amount of error in 2018 is reduced to a negligible 0.31%. Furthermore, the comparison was instrumental in flagging and removing a number of records that have inaccurate GPS coordinates, most probably due to sensor malfunction.
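The comparison between a recorded and a calculated distance can be sketched as below. This is an illustration under stated assumptions, not the authors' computation: the haversine formula on a spherical Earth is swapped in for whatever distance function the database workflow uses, and the matching tolerance of 1 m is a hypothetical choice, since the criterion for a "match" is not stated here.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def distances_match(obs_dist, th_gps_dist, tolerance_m=1.0):
    """True if recorded and calculated distances agree within the tolerance.

    The tolerance is an assumed value chosen for illustration.
    """
    return abs(obs_dist - th_gps_dist) <= tolerance_m

# Made-up coordinates: theoretical point versus GPS fix (th_long, th_lat, gps_lon, gps_lat).
th_gps_dist = haversine_m(4.0, 50.0, 4.0, 50.0001)
```

A spherical model is usually adequate for separations of tens of metres, although a projected or ellipsoidal calculation would be more precise.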
Cross-checking with the source data, we found that the error is indeed present in the source data, rather than introduced during processing, something which would have been hard to spot otherwise. The distribution of these calculated distances, alongside an equivalent distribution of the surveyed distances, can be found in Fig. 6. The first and third quartiles of the calculated distance are 1.1 and 21.2 meters, meaning that only a fourth of the points have a distance greater than 21.2 meters. For the surveyed distances the ranges are similar: the corresponding quartiles are 1.0 and 30.0 meters. From the distributions we see that there is a lot more nuance in the values of the calculated distances. This makes sense, as they are decimal numbers, each of which occurs less frequently than the integers used for the surveyed distances. The values shown in the red part of the histogram of surveyed distances represent the values from 2006, which are copied from the calculated distance in order to hard code a numerical value in place of the categorical value of the variable in the source data. The theoretical grid of LUCAS point locations is stable over time. However, depending on the survey conditions, the terrain and the accuracy of the GPS positioning, the surveyor may not be able to reach the point. This results in effective variations of the position of the observer through time (Fig. 7).
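How a quartile summary of such distances relates to the share of points above or between the quartiles can be checked with a short sketch. The input values are synthetic, and the "inclusive" quantile method is an assumed choice; other methods give slightly different cut points.

```python
import statistics

def quartile_summary(distances):
    """Return (q1, q3, share between q1 and q3, share above q3)."""
    q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    n = len(distances)
    share_between = sum(q1 <= d <= q3 for d in distances) / n
    share_above = sum(d > q3 for d in distances) / n
    return q1, q3, share_between, share_above

# Synthetic distances of 1, 2, ..., 100 metres.
q1, q3, share_between, share_above = quartile_summary(list(range(1, 101)))
```

For this synthetic input half of the values lie between the first and third quartiles and a quarter lie above the third quartile, which is the general pattern to expect from a quartile summary.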
In addition to the theoretical grid and survey point location, this data descriptor provides the East-facing transect geo-location data. No additional geo-located spatial information is collected in the transect module and this is probably a shortcoming in the survey design resulting from trade-offs between the cost of the survey and its objectives. The theoretical transect line (with the same geometry as the one provided with this data descriptor) is displayed on the ground document of the surveyor. The surveyor has then to walk on the line and to record the successive land cover and landscape elements as described in the survey methodology. The only geo-location accuracy information relevant for the transect module is thus the same as presented previously, i.e. distance between the theoretical point and the GPS measured surveyed point. Then the successive land covers and landscapes surveyed along the 250-m line are collected as a sequence without distance or geo-located information.

Usage Notes
To summarize, the work documented in this data descriptor consists of 49
Table 7. Percentage (%) of points for which the distances between the theoretical point from the LUCAS grid (th_long, th_lat) and the actual GPS measurement (gps_lon, gps_lat) taken during surveying and calculated post factum match or not.
The harmonised LUCAS product reduces the complexity and layered nature of the original LUCAS datasets. In doing so, it valorizes the effort of many surveyors, data cleaners, statisticians, and database maintainers. The database's novelty lies in the fact that, for the first time, users can query the whole LUCAS archive concurrently, allowing for comparisons and combinations between all variables collected during the relevant reference years. The homogeneity of the product facilitates the unearthing of temporal and spatial relations that were otherwise jeopardized by the physical separation between survey results. Moreover, by avoiding the burden of combing through the cumbersome documentation, the user is now free to concentrate on the research, thereby facilitating scientific discovery and analysis. Naturally, the product suffers from the shortcomings inherent in the source data, such as any inadequate surveying, surveyor or technology-related errors of precision while taking coordinates or measurements, etc. The harmonisation process itself also reveals some inconsistencies in the source data. For instance, certain variables could not be harmonised between survey years. These are mostly related to measurements of percentage or extent of coverage. Whereas in the early stages of LUCAS surveyors were asked to fill in a multiple choice questionnaire listing a range of values, in subsequent surveys the surveyor was asked to fill in the actual value in quantified units.
This situation applies mostly, though not exclusively, to the 2006 survey, which makes it impossible for these variables to be translated into the user friendly version; therefore in these cases the variables of 2006 must remain in their original coding. Additional information can be found in the comments section of the record descriptor.
Another shortcoming is the change of hierarchy of the LUCAS classification system between the different surveys, mainly concerning LC/LU, as well as LC and LU types. A table is provided to document this shortcoming (see special remarks in the table "LC (LU) changes" in the file LUCAS-Variable_and_Classification_Changes.xlsx 49 ).

Code availability
To guarantee transparency and reproducibility, the harmonisation workflow was carried out with open-source tools, namely PostgreSQL (9.5.17)/PostGIS (2.1.8 r13775) and R (3.4.3) 56 . The code is provided as an R package containing 17 functions along with the documentation on 51 . The LUCAS package includes all the scripts and documentation (also provided in pdf). Additionally, along with the package, a script (main.R) builds the harmonised database step by step. The workflow is schematically shown in Fig. 2. All the processing is done with SQL, with only column reordering and consistency checks being done in R. The code is freely available under the GPL (>= 3) license.