European primary forest database v2.0

Primary forests, defined here as forests where the signs of human impacts, if any, are strongly blurred due to decades without forest management, are scarce in Europe and continue to disappear. Despite these losses, we know little about where these forests occur. Here, we present a comprehensive geodatabase and map of Europe’s known primary forests. Our geodatabase harmonizes 48 different, mostly field-based datasets of primary forests, and contains 18,411 individual patches (41.1 Mha) spread across 33 countries. When available, we provide information on each patch (name, location, naturalness, extent and dominant tree species) and the surrounding landscape (biogeographical regions, protection status, potential natural vegetation, current forest extent). Using Landsat satellite-image time series (1985–2018) we checked each patch for possible disturbance events since primary forests were identified, resulting in 94% of patches free of significant disturbances in the last 30 years. Although knowledge gaps remain, ours is the most comprehensive dataset on primary forests in Europe, and will be useful for ecological studies, and conservation planning to safeguard these unique forests.

www.nature.com/scientificdata
efforts for harmonizing data 24,30 , only recently has the first map of primary forests been released for Europe 31,32 together with a first assessment of their conservation status 21 .
In a previous effort, we assembled a first European Primary Forest database (EPFD v1.0) that included 32 local-to-national datasets, plus data from a literature review and a survey, resulting in the mapping of a total of ~1.4 Mha of primary forest 31 . This was only about one fifth of the estimated 7.3 Mha of undisturbed forest still occurring in Europe, excluding Russia 10 . Also, most of the data collected in our v1.0 database were not open-access, and could thus not be used without the explicit consent of their respective copyright holders.
Here, we build on those efforts to progress further towards a complete map of Europe's primary forests. First, we secured permission from all data holders to release all data as open access. Second, we aggregated and harmonized 16 additional regional-to-continental spatial datasets to now cover a total of 48 independent datasets. The EPFD v2.0 contains 18,411 non-overlapping primary forest patches (plus 299 point features) covering an area of 41.1 Mha (37.4 Mha in European Russia alone; Fig. 1) across 33 countries (Table 1) 33 . Key improvements of this new database include (1) filling major regional gaps, including European Russia, the Balkan Peninsula, the Pyrenees and the Baltic region, (2) mapping potential primary forests for Sweden and Norway (additional 16,311 polygons and 2.5 Mha - Fig. 2), two key regions where complete inventories are currently unavailable, and (3) an update of our literature review to January 2019.

Methods
Primary forest definition. Defining primary forests is controversial, and a range of different definitions have been put forward over the years 22 . In this paper, as in our previous work, we follow the FAO definition that defines primary forests 34 as "naturally regenerated forest of native tree species, where there are no clearly visible indications of human activities and the ecological processes are not significantly disturbed".
We operationalized this definition using the framework proposed by Buchwald 35 , where 'primary forest' is used as an umbrella term to include forests with different levels of naturalness, such as primeval, virgin, near-virgin, old-growth and long-untouched forests. Based on this framework, a forest qualifies as primary if the signs of former human impacts, if any, are strongly blurred due to decades (at least 60-80 years) after the end of forest management 35 . This time limit, however, depends on how modified the forest was at the starting point, and only applies in the case of traditional management, such as patch felling, partial coppicing, or selective logging. Stands regenerating naturally after a clear cut would therefore require a longer time period to be considered a primary forest (i.e., 60-80 years plus the length of a typical rotation cycle). Our definition of primary forests, therefore, does not imply that these forests were never cleared or disturbed by humans. We consider this to be in line with the Convention on Biological Diversity (CBD, https://www.cbd.int/forest/definitions.shtml), acknowledging that the concept of primary forests has a different connotation in Europe than in the rest of the world.
Finally, our collection of primary forests includes mainly old-growth, late-successional forests, but also some early seral stages and young forests that originated after natural disturbances and natural regeneration, without subsequent management. In the case of large primary forest tracts (>250 ha), our polygons can also locally include land not covered by trees.
Data collection. To create the EPFD v2.0, we first expanded and updated the literature review on primary forests we had originally carried out for EPFD v1.0 31 , which only considered the period 2000-2017, and excluded European Russia. Specifically, we added all scientific studies published between January 2000 and January 2019 for Russia, and those published in 2017-2019 for the rest of Europe. We identified relevant publications in the ISI Web of Knowledge using the search terms "(primary OR virgin OR old-growth OR primeval OR intact) AND forest*" in the title field. Based on our own interpretation of commonly used forest terms, we deliberately excluded terms such as "unmanaged" (meaning: not under active management), "ancient" (never cleared for agriculture) or "natural" (stocked with naturally regenerated native trees). These terms indicate conditions that are necessary, but not sufficient for considering a forest as primary. Finally, we refined our search using geographical and subject filters. The literature search returned 129 candidate papers. After screening their content, we added 23 primary forest stands (10 in European Russia, 13 in the rest of Europe), from 13 studies (four from European Russia, and nine from the rest of Europe).
Building the EPFD v1.0 31 involved reaching out to 134 forest experts. For v2.0 we contacted an additional 75 experts with knowledge on forests or forestry, and invited them to add spatially-explicit data on primary forests to our database. We focussed on experts from geographical regions poorly covered in v1.0. We received 56 answers, which led to the incorporation of 16 new datasets in our map. Given the context-dependency of definitions used in regional mapping projects, new datasets were only included if we could find an explicit equivalence between country-specific forest definitions and our definition framework 35 . This was done after discussing with data contributors the criteria and categories used for constructing their datasets, which we then mapped onto our definition framework. Depending on the datasets, these criteria included: (1) forest age or structural variables 19,23,36 , (2) legal designation 25 or year since onset of protection 37 , (3) time since last anthropogenic disturbance 38 , or (4) the lack of human impacts and infrastructure 39 .
We integrated all data into a geodatabase, which contains primary forests either as polygons (if information on the forest boundary was available) or point locations (when having only an approximate centre location). We set 0.5 hectares as the minimum mapping unit, although only a few of the datasets already contained in v1.0 contained polygons smaller than 2 ha (i.e., the minimum mapping unit originally used). If available, we included a set of basic descriptors for each patch: name, location, naturalness level (based on 35 ), extent, dominant tree species, disturbance history and protection status. In total, our map harmonizes 48 regional-to-continental datasets of primary forests (Online-only Table 1). All data are open-access 33 , except for three datasets that we kept confidential, either for conservation or copyright reasons. These datasets are: 'Hungarian Forest Reserve monitoring' (ID 17, custodian: Ferenc Horváth); 'Ancient and Primeval Beech Forests of the Carpathians and Other Regions of Europe' 40,41 (ID 34, copyright: UNESCO), and 'Potential OGF and primary forest in Austria' (ID 48, custodian: Matthias Schickhofer). Additional non-open access polygons also exist for the dataset 'Strict Forest Reserves in Switzerland' (ID 30, custodian: Jonas Stillhard). These data are here referred to for transparency, but are neither included in the statistics and summaries reported here, nor in any of the remote-sensing analysis below.
Post-processing. To provide common descriptions for all features contained in the geodatabase, we integrated the basic descriptors detailed above with a range of attributes derived by intersecting all polygons or points of primary forests with layers of: 1) biogeographical regions, 2) protected areas, 3) forest type, and 4) forest cover.
We used the map of biogeographical regions 42 to assign each primary forest point or polygon to one of the following ten classes: 1. Alpine, 2. Arctic, 3. Atlantic, 4. Black Sea, 5. Boreal, 6. Continental, 7. Macaronesia, 8. Mediterranean, 9. Pannonian, 10. Steppic. Similarly, we derived information on protection status and time since onset of protection for each primary forest polygon or point based on the World Database on Protected Areas (WDPA, https://www.protectedplanet.net). We simplified the original IUCN classification to three classes: 1. strictly protected (IUCN category I); 2. protected (IUCN categories II-VI + not classified); 3. not protected. This is a conservative aggregation recognizing the fact that, in certain contexts, logging and salvage logging are allowed inside national parks, at least in the buffer zone. In the case of polygons, we considered a primary forest patch as protected if >75% of its surface was within a WDPA polygon. When better information on the protection status of a forest patch was available directly from data contributors, we gave priority to this source. We also assigned each primary forest polygon or point to one of the forest categories defined by the European Environment Agency 43 . The spatial information was derived by simplifying the map of Potential Vegetation types for Europe 44 , after creating an expert-based cross-link table 21 , which ties together forest categories and potential vegetation types reported in […] evergreen forest; 10. Coniferous forests of the Mediterranean, Anatolian and Macaronesian regions; 11. Mire and swamp forest; 12. Floodplain forest; 13. Non-riverine alder, birch or aspen forest. For each primary forest polygon (but not for points), we reported the two most common forest categories. Finally, we extracted for each primary forest polygon the actual share covered by forest. We did this because larger primary forest polygons in high naturalness classes can encompass land temporarily or permanently not covered by trees. We used a tree cover density map for the year 2010 for these regions from 45 . All post-processing was performed in R (v3.6.1) 46 .
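As an illustration of the protection-status rule described above, the following Python sketch collapses IUCN categories into the three classes and applies the >75% overlap threshold. It is not the code released with the database (which used R); the function names, the input format (a list of category and overlap-area pairs) and the way overlaps with several protected areas are combined are assumptions made for the example, and overlap areas are assumed not to double-count the same ground.

```python
from typing import List, Tuple

# IUCN category I comprises the sub-categories Ia and Ib.
STRICT_CATEGORIES = {"I", "Ia", "Ib"}


def simplify_iucn(category: str) -> str:
    """Collapse an IUCN category label into the simplified class.

    Any protected area that is not category I (II-VI, or not classified)
    falls into the generic 'protected' class.
    """
    return "strictly protected" if category in STRICT_CATEGORIES else "protected"


def patch_protection(patch_area_ha: float,
                     overlaps: List[Tuple[str, float]],
                     threshold: float = 0.75) -> str:
    """Classify a patch from its overlaps with protected areas.

    overlaps: (IUCN category, overlapping area in ha) pairs (hypothetical format).
    Assumption: strict protection is assigned only if the strictly protected
    share alone exceeds the threshold; otherwise all overlaps are pooled.
    """
    strict = sum(a for c, a in overlaps if simplify_iucn(c) == "strictly protected")
    total = sum(a for _, a in overlaps)
    if strict / patch_area_ha > threshold:
        return "strictly protected"
    if total / patch_area_ha > threshold:
        return "protected"
    return "not protected"


print(patch_protection(100.0, [("Ia", 40.0), ("IV", 40.0)]))
```

Note that the threshold is a strict inequality, so a patch with exactly 75% of its surface inside protected areas is classified as not protected.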
Potential primary forests of Sweden and Norway. For Sweden and Norway, where abundant geographic information was available on forest distribution, we created maps of potential (but so far unconfirmed) primary forests. For Sweden, we derived a workflow to create a map of potential primary forests as detailed in Fig. 3. This yielded 14,300 polygons covering a total area of 2.4 Mha.
For Norway, even though we were able to include two datasets of confirmed primary forests, additional primary forest is expected to exist. Therefore, we derived a map of potential primary forests, based on the "Viktige Naturtyper" dataset from the Norwegian Environment Agency 47 , which maps different habitat types of high conservation value both inside and outside forested areas. We extracted all polygons larger than 10 ha classified as "old forest types" (="gammelskog"), i.e., forests that have never been clearcut and are in age classes of 120 years or older. This yielded 2,011 polygons covering a total area of 0.1 Mha.
Importantly, these layers were neither directly integrated in our composite map, nor used to calculate country level statistics as they only represent a first approximation of the primary forest situation in these countries, so far without ground validation. Yet, we included these layers in our geodatabase with the goal of directing future ground-based mapping efforts.

Data Records
The EPFD v2.0 33 is composed of 48 individual datasets (Online-only Table 1) and the two layers of potential primary forests for Sweden and Norway. We integrated the 48 datasets into two composite feature classes, after excluding all duplicated/overlapping polygons across individual datasets.
1) EU_PrimaryForests_Polygons_OA_v20
• Composite feature class combining the forest patches classified as "primary forest" based on polygon data sources described in Online-only […] 33 .
The file format is ESRI personal geodatabase (.mdb). Each feature class in the geodatabase follows the structure described in Online-only Table 2. A full description of each individual dataset is reported in the metadata file 'DATASET_overview_v2.0_20201030.docx', available at the same link.

Technical Validation
We benchmarked our data against country-level statistics on primary forest extent. Although we had no direct control of the raw data contained in our database, the fact that all our information on primary forest locations derives either from peer-reviewed scientific literature, or was field-checked by trained researchers and/or professionals suggests high data reliability. We made sure to have a common understanding with data contributors about forest definitions (i.e., 34,35 ), and only included a dataset in the EPFD if we could find an explicit equivalence with the forest definitions we used. Additional information on the harmonization process is reported for individual datasets in the metadata accompanying our geodatabase.
An additional, wall-to-wall validation of our database using remotely-sensed information is currently impossible. Remote sensing data only cover the last 35 years, and even if high resolution laser ranging (LIDAR) might become available in the future, at the moment no reliable workflow exists for mapping primary forests from such multi-sensor data. The alternative is field work, which is clearly unfeasible given the huge area covered by our database, the large number of polygons, and the cost and time effort that would be required for a statistically valid ground sample of data. Still, remote sensing data can be helpful for checking whether a patch of primary forest underwent human disturbance after it was delineated and that is why we implemented a semi-automatic procedure based on Landsat satellite-image time series (1985-2018) (see below).
Benchmarking against country-level statistics. Our database contains most of the geographical information currently available on primary forests in Europe, but we do not claim this data is complete. To benchmark the completeness of our map, we calculated the ratio between the area of primary forest in our database at country level, and the estimated area of "forest undisturbed by man" from the indicator 4.3 in the Forest Europe report 10 or, for those countries where this information is not available, from FAO's Forest Resources Assessment 48 . Although the definition of "forest undisturbed by man" in Forest Europe is consistent with our definition of primary forest 10 , it must be noted that these country-level estimates stem from national inventories or other studies, and data quality varies from country to country 49 . The comparison presented here should, therefore, be taken with caution (Fig. 4).
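The completeness benchmark described above is a simple ratio, sketched below in Python. The country names and areas are hypothetical placeholders, not values from the database or from Forest Europe; the function name and the use of None for countries without a usable reference estimate are likewise assumptions of this example.

```python
from typing import Dict, Optional


def completeness(mapped_ha: float, reference_ha: Optional[float]) -> Optional[float]:
    """Ratio of mapped primary forest to the country-level reference estimate.

    Returns None when no reference estimate exists or when it is zero,
    since the ratio is undefined in those cases.
    """
    if reference_ha is None or reference_ha == 0:
        return None
    return mapped_ha / reference_ha


# Hypothetical inputs: (mapped area, reference "forest undisturbed by man"), in ha.
countries: Dict[str, tuple] = {
    "Country A": (12_000.0, 60_000.0),
    "Country B": (5_000.0, 4_000.0),   # map exceeds the reported estimate
    "Country C": (300.0, 0.0),         # reference reports no primary forest
}

for name, (mapped, reference) in countries.items():
    ratio = completeness(mapped, reference)
    label = "undefined" if ratio is None else f"{ratio:.0%}"
    print(f"{name}: {label}")
```

A ratio above 100% is possible and simply means that the map contains more primary forest than the reference source reports.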
Forest Europe reports no primary forest for some western European countries (Spain, France, Belgium, Netherlands, Germany, United Kingdom and Ireland), although for most of these countries we did find information on at least a handful of primary forest sites. The coverage of our map was also higher than expected for some Eastern European countries (e.g., Ukraine, Belarus, Lithuania), as well as Norway and Finland, known for hosting large areas of primary forests. Data completeness was lower for some central European countries. In the case of Czechia, Slovakia, Poland and Romania, our data only accounted for 20-100% of the country-level estimates from Forest Europe 10 . For Austria, Switzerland and Hungary, instead, additional data on primary forests exists but it is not currently open-access, and therefore not considered here. The largest data gaps were in Sweden, Italy, Bulgaria, Estonia, Denmark and Russia, where our map accounted for less than 10% of the primary forest reported in Forest Europe 10 . The low data completeness found for Denmark likely depends on the inclusion of minimum-intervention forest reserves in Forest Europe (see 49 ) that were harvested until recently and therefore do not qualify as primary forests according to our definition.
Assessing recent human disturbance with remote sensing. Since our data were collected continuously over the last two decades, we cannot exclude that some forest patches may have undergone human disturbance after data collection. This is particularly relevant for areas where primary forests are lost at high rates, such as the Carpathians, Russian Karelia, or Northern Fennoscandia 18-20 . To assess to what extent this might be an issue, we used the open-access Landsat archive and the LandTrendr disturbance detection algorithm 50,51 , using Google Earth Engine 52 (Fig. 5). Specifically, we 1) quantified the proportion of polygons in our map that underwent disturbance between 1985 and 2018, i.e., Landsat 5 operating time, 2) visually checked a stratified random selection of these disturbed polygons to quantify the prevalence of anthropogenic vs. natural disturbance, and 3) estimated the proportion of polygons in our map not meeting the necessary, but not sufficient, condition for being classified as primary (i.e., not being affected by anthropogenic disturbance within the last 35 years).
For each polygon contained in the map of primary forests, we extracted the whole stack of available Landsat images (~1985-today), and ran the LandTrendr 53 algorithm. LandTrendr identifies breakpoints in spectral time series, separates periods of disturbance or stability, and records the years in which disturbances occurred. To avoid problems due to cloud cover, changes in illumination, and atmospheric condition, we used all available images from the growing season of each year (1 May through 15 September) to derive yearly composite images 54 . As our spectral index, we used Tasseled Cap Wetness (TCW), as this index is particularly sensitive to forest structure 55 , is robust to spatial and temporal variations in canopy moisture 56 , and consistently outperforms other spectral indices, including Normalized Difference Vegetation Index 53 , for detecting forest disturbance 50,57-59 . As input parameters for the LandTrendr algorithm when detecting forest disturbances, we used a prevalue of −300 TCW units, a minimum disturbance magnitude of 500 TCW units, and a maximum duration of 4 years.
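To make the compositing step more concrete, the Python sketch below computes Tasseled Cap Wetness for a single pixel and reduces growing-season observations to one value per year. It is only an illustration and does not reproduce the Google Earth Engine scripts. Two assumptions are made: the TCW coefficients are the widely used Crist (1985) reflectance-factor coefficients for Landsat TM, which the text does not explicitly name, and the yearly composite is taken as a median, whereas the compositing method actually used is described in ref. 54. Reflectance is assumed to be scaled by 10,000, so that values are comparable in magnitude to the TCW units quoted above.

```python
from datetime import date
from statistics import median
from typing import Dict, List, Sequence, Tuple

# Assumed coefficients (Crist 1985, Landsat TM reflectance factor), in band
# order: blue, green, red, near infrared, shortwave infrared 1 and 2.
TCW_COEFFS = (0.0315, 0.2021, 0.3102, 0.1594, -0.6806, -0.6109)


def tasseled_cap_wetness(bands: Sequence[float]) -> float:
    """Linear combination of the six reflective bands of one pixel."""
    return sum(c * b for c, b in zip(TCW_COEFFS, bands))


def in_growing_season(day: date) -> bool:
    """True for acquisitions between 1 May and 15 September (inclusive)."""
    return date(day.year, 5, 1) <= day <= date(day.year, 9, 15)


def yearly_composite(obs: List[Tuple[date, Sequence[float]]]) -> Dict[int, float]:
    """Median growing-season TCW per year for one pixel (median is an assumption)."""
    per_year: Dict[int, List[float]] = {}
    for day, bands in obs:
        if in_growing_season(day):
            per_year.setdefault(day.year, []).append(tasseled_cap_wetness(bands))
    return {year: median(values) for year, values in sorted(per_year.items())}


# Hypothetical pixel observations (scaled reflectance), for illustration only.
observations = [
    (date(2000, 6, 1), (0, 0, 0, 0, 1000, 0)),
    (date(2000, 7, 1), (0, 0, 0, 0, 2000, 0)),
    (date(2000, 8, 1), (0, 0, 0, 0, 3000, 0)),
    (date(2000, 12, 1), (0, 0, 0, 0, 9000, 0)),  # outside the season, ignored
]
print(yearly_composite(observations))
```

The resulting yearly series is the kind of input a temporal segmentation algorithm such as LandTrendr operates on; the segmentation itself is not sketched here.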
After running LandTrendr, we eliminated noise by applying a minimum disturbance threshold (2 ha). We then visually inspected a stratified random selection of primary forest polygons highlighted as 'disturbed' by LandTrendr using very-high-resolution images available in Google Earth. For each biogeographic region, we randomly selected 20% of disturbed polygons up to a maximum of 100 polygons per region. Depending on the size of the polygons, we inspected up to 5 randomly selected disturbed pixels within each disturbed polygon with a minimum distance between pixels of 1 km. Based on the spectral and physical characteristics of the disturbed patch (brightness, shape, size), and on ancillary information derived from the Google Earth imagery, we assigned disturbance agents as either anthropogenic (i.e., forest harvest, infrastructure development) or natural (e.g., windstorm, bark beetle outbreak, fire; Figs. 6, 7). We conservatively considered a polygon as anthropogenically disturbed if at least a third of the points we checked for that polygon were anthropogenically disturbed. To avoid introducing an observer bias, all polygons were checked by the same photo-interpreter (FMS).
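The sampling design and the decision rule in this paragraph can be sketched as follows. This Python example is a hedged illustration, not the procedure's original code: the function names, the input format, the fixed random seed, and the choice to round the 20% sample size upwards are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_sample(disturbed: Sequence[Tuple[Hashable, str]],
                      percent: int = 20,
                      cap: int = 100,
                      seed: int = 42) -> Dict[str, List[Hashable]]:
    """Draw `percent` of the disturbed polygons per region, at most `cap`.

    disturbed: (polygon id, biogeographical region) pairs (hypothetical format).
    The sample size is rounded up with integer arithmetic (an assumption).
    """
    rng = random.Random(seed)
    by_region: Dict[str, List[Hashable]] = defaultdict(list)
    for polygon_id, region in disturbed:
        by_region[region].append(polygon_id)
    sample = {}
    for region in sorted(by_region):
        ids = by_region[region]
        n = min(cap, (len(ids) * percent + 99) // 100)
        sample[region] = rng.sample(ids, n)
    return sample


def is_anthropogenic(point_labels: Sequence[str]) -> bool:
    """Polygon-level rule: at least a third of the checked points are anthropogenic."""
    if not point_labels:
        return False
    flagged = sum(1 for label in point_labels if label == "anthropogenic")
    return 3 * flagged >= len(point_labels)  # integer form of flagged/n >= 1/3


# Hypothetical input: 10 disturbed polygons in one region, 1,000 in another.
polygons = [(f"alp-{i}", "Alpine") for i in range(10)]
polygons += [(f"bor-{i}", "Boreal") for i in range(1000)]
picked = stratified_sample(polygons)
print({region: len(ids) for region, ids in picked.items()})
print(is_anthropogenic(["natural", "natural", "anthropogenic"]))
```

With these inputs the cap takes effect in the larger region, where 20% would otherwise amount to 200 polygons.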
Out of the 17,309 polygons checked with LandTrendr, 4,734 (27.3% of total) experienced major disturbances between 1985 and 2018. The proportion of disturbed area was greater than 10% in 2,904 polygons. We visually inspected a total of 712 pixels across 268 primary forest polygons, corresponding to 1.5% of the total number of polygons and 5.7% of the disturbed polygons. We attributed a total of 149 pixels, across 61 primary forest polygons, to anthropogenic disturbance, i.e., 22.7% (bootstrapped standard error = 2.5%) of the polygons we checked (Table 2, Fig. 7). We thus estimated the total number of primary forest polygons being anthropogenically disturbed by multiplying the total number of polygons with the proportion of disturbed polygons (27.3%) and the share of these disturbed polygons attributed to anthropogenic causes (22.7%). This suggests our map contains 1,077 anthropogenically disturbed polygons (95% CIs [847,1323]), which corresponds to 6.2% (95% CIs [4.9%, 7.6%]) of the total number of polygons. Disturbed polygons were concentrated in the Russian Federation (especially in Archangelsk region, Karelia and Komi republics), Southern Finland, and the Carpathians (Fig. 7; Table 2). The Boreal and Alpine biogeographical regions had the highest number of disturbed polygons (both in total, and when considering only those with evident anthropogenic disturbance). The regions with the highest share of anthropogenically disturbed polygons were the Continental and Boreal regions. The sample size in Macaronesia was too low to provide a reliable estimation of the incidence of human disturbance.
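The arithmetic behind this estimate can be retraced with the counts reported above. The short Python sketch below does so; the variable names are ours, and the analytic binomial standard error is included only as a rough cross-check of the reported bootstrapped value, not as the method actually used.

```python
import math

# Counts reported in the text.
n_checked = 17_309        # polygons processed with LandTrendr
n_disturbed = 4_734       # polygons flagged as disturbed, 1985-2018
n_inspected = 268         # disturbed polygons inspected visually
n_anthropogenic = 61      # inspected polygons attributed to human disturbance

share_disturbed = n_disturbed / n_checked
share_anthropogenic = n_anthropogenic / n_inspected

# Total polygons x proportion disturbed x anthropogenic share of the sample.
estimated = n_checked * share_disturbed * share_anthropogenic
share_of_map = estimated / n_checked

# Analytic binomial standard error of the anthropogenic share (cross-check only;
# the text reports a bootstrapped standard error).
se_binomial = math.sqrt(share_anthropogenic * (1 - share_anthropogenic) / n_inspected)

print(f"disturbed share:      {share_disturbed:.1%}")
print(f"anthropogenic share:  {share_anthropogenic:.1%}")
print(f"estimated polygons:   {estimated:.1f}")
print(f"share of all polygons: {share_of_map:.1%}")
print(f"binomial SE:          {se_binomial:.2%}")
```

The unrounded product comes out at roughly 1,077.5 polygons, in line with the reported 1,077 and 6.2%, and the analytic standard error is close to the bootstrapped 2.5%.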
These estimates should be considered as lower bounds, because only the disturbance events with a magnitude sufficient to be captured with LandTrendr and occurring in 1985-2018 could be identified. Since this is not a formal validation, the results presented here should not be extrapolated to primary forests not included in our map. Finally, because our database was built with a bottom-up approach, we are unable to exclude the existence of remaining bias or interpretation error, which might have propagated through the successive steps required to build it. As such, we warn users about possible heterogeneity in data quality, accuracy and completeness across datasets.

Usage Notes
All data files are referenced in a geographic coordinate system (lat/long, WGS 84, EPSG code: 4326). The provided files are in a personal geodatabase, and can be accessed and displayed using standard GIS software such as QGIS (www.qgis.org/en).
All datasets listed in Online-only Table 1 […] from the dataset 'Strict Forest Reserves in Switzerland' (ID 30, custodian: Jonas Stillhard). In the case of the dataset 'Ancient and Primeval Beech Forests of the Carpathians and Other Regions of Europe' 40,41 (ID 34, Custodian: UNESCO), this data is freely available online, but its copyright does not allow redistribution. We refer the interested reader to the website https://www.protectedplanet.net/903141 for the original data.
Comments and requests for updates to the dataset are collected and discussed in the GitHub forum: https://github.com/fmsabatini/PrimaryForestEurope.
Table 2. Recent human disturbance in primary forest polygons, summarized by biogeographical region. † The number of disturbed polygons is higher than the total number of polygons because some polygons expanding over more than one biogeographical region were split. PF: Primary Forest.

Code Availability
The code to reproduce the composite layers, for post-processing and for assessing recent human disturbance with remote sensing is available together with the database in Figshare (https://doi.org/10.6084/m9.figshare.13194095.v1) 33 . We included seven scripts:
• 00_ComposeMap.R - Identifies overlapping polygons across individual datasets.
• 03_PostProcessing.R - Extracts additional information on each primary forest.
• 05_Summary_stats.R - Calculates summary statistics of primary forests.
• 06_DisturbanceAssessment_Step1_exportIntermediateChangeImg.txt - Runs LandTrendr in Google Earth Engine, tiles the area of interest, creates Change-Images for each tile, and exports these as intermediate .tif files containing the LandTrendr metrics.
• 07_DisturbanceAssessment_Step2_extractPolygonValuesFromChangeImg.txt - Extracts LandTrendr metrics for each forest polygon from Change-Images and exports as .csv.
Python (.py) scripts were run in ESRI ArcGIS (v10.5) and are available also as ArcGIS Models inside the Geodatabase. R (.R) scripts were run using R (v 3.6.1) 46 . The remaining .txt scripts were run in Google Earth Engine.