WOODIV, a database of occurrences, functional traits, and phylogenetic data for all Euro-Mediterranean trees

Trees play a key role in the structure and function of many ecosystems worldwide. In the Mediterranean Basin, forests cover approximately 22% of the total land area and host a large number of endemics (46 species). Despite its particularities and vulnerability, the biodiversity of Mediterranean trees is not well known at the taxonomic, spatial, functional, and genetic levels required for conservation applications. The WOODIV database fills this gap by providing reliable occurrences, four functional traits (plant height, seed mass, wood density, and specific leaf area), and sequences from three DNA regions (rbcL, matK, and trnH-psbA), together with modelled occurrences and a phylogeny for all 210 Euro-Mediterranean tree species. We compiled, homogenized, and verified occurrence data from sparse datasets and collated them on an INSPIRE-compliant 10 × 10 km grid. We also gathered functional trait and genetic data, filling existing gaps where possible. The WOODIV database can benefit macroecological studies in the fields of conservation, biogeography, and community ecology.
Measurement(s): occurrence • Trait • DNA
Technology Type(s): Sampling • digital curation • DNA sequencing assay
Sample Characteristic - Organism: Viridiplantae • trees
Sample Characteristic - Environment: forest biome
Sample Characteristic - Location: Mediterranean Region • Europe
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13553519

www.nature.com/scientificdata
The raw dataset obtained from gathering occurrences from all sources included a total of 1,248,701 occurrence records distributed across the participating countries.
The raw occurrence data were aggregated at a resolution of 10 × 10 km on an INSPIRE 14 compliant grid (SCR 4258). This gridding procedure provided a way to standardize data from different sources. We selected this spatial grain because it was the finest resolution available for some countries of the study area (e.g. Slovenia, Croatia, Greece). Sources of occurrence data with a resolution coarser than 10 × 10 km (e.g. Atlas Florae Europaeae 15 ) were not considered. The considered area includes 10,042 grid cells with at least one occurrence record (Fig. 1a). The occurrence dataset provided by the WOODIV database, i.e. records of species considered native in a given grid cell, aggregated on the 10 × 10 km grid with duplicate species records within a cell removed, includes 140,279 occurrences.
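The aggregation step can be illustrated with a minimal Python sketch. It assumes that coordinates have already been projected to a metric system (for example the LAEA projection used by the INSPIRE grid); the `cell_id` label format, the function names, and the toy coordinates are hypothetical and are not taken from the WOODIV scripts.

```python
CELL_SIZE_M = 10_000  # side of a 10 x 10 km grid cell, in metres


def cell_id(easting_m, northing_m, cell_size_m=CELL_SIZE_M):
    """Return an illustrative label for the grid cell containing a projected point."""
    col = int(easting_m // cell_size_m)
    row = int(northing_m // cell_size_m)
    return f"10kmE{col}N{row}"


def aggregate(records):
    """Collapse raw (species, easting, northing) records into unique (cell, species) pairs."""
    return sorted({(cell_id(x, y), species) for species, x, y in records})


# Toy records with invented coordinates (not WOODIV data).
raw_records = [
    ("Quercus ilex", 3_654_120.0, 2_101_880.0),
    ("Quercus ilex", 3_658_990.0, 2_105_010.0),      # same cell as the record above
    ("Quercus ilex", 3_661_000.0, 2_105_010.0),      # neighbouring cell
    ("Pinus halepensis", 3_654_500.0, 2_109_999.0),
]

gridded = aggregate(raw_records)
for cell, species in gridded:
    print(cell, species)
```

Using a set before sorting is what removes duplicate species records within a cell, so four raw records collapse into three gridded occurrences here.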
Modelled occurrence data. The WOODIV database provides modelled occurrences of the species from the Médail et al. 1 checklist. From the 10 × 10 km gridded observed occurrence data, we modelled the distribution of each species across the Euro-Mediterranean area using Species Distribution Models (SDM). SDM statistically relate species occurrence records to environmental variables to predict the potential distribution of species 16 .
Due to the extent of the study area, we only related species occurrence to climate gradients 17 . Bioclimatic variables were extracted from the CHELSA database V1.2 18 available at a resolution of 30 arc-sec (http://chelsaclimate.org/) and then averaged to a 10 × 10 km resolution. The selection of the environmental predictors for niche modeling is a source of uncertainty in model predictions that can be reduced with sound statistical methods and ecological knowledge of the target species 19 . We also focused on proximal predictors that directly influence species distribution and selected a low number of predictive variables to reduce the issues of model overfitting and multicollinearity 20 . We selected four bioclimatic variables that previous studies had reported to be relevant predictors of the distribution of plant species, especially in environments such as those that characterize the Mediterranean Basin 21-24 : "Minimum temperature of the coldest month" (Bio06, in °C) quantifies potentially lethal frost events and more generally, stress due to low temperatures; "Total annual precipitation" (Bio12, in mm) approximates average water availability; "Precipitation of the driest month" (Bio14, in mm) describes the extremes associated with drought events and stress due to low water availability, and "Temperature seasonality" (Bio04, no dimension) describes the variability of temperature during the year. All selected predictors showed VIF (variance inflation factor 25 ) values below 5, indicating that a given predictor was not correlated with any linear combinations of the other predictors (VIF Bio04 = 1.68, VIF Bio06 = 2.06, VIF Bio12 = 1.53, and VIF Bio14 = 2.07).
We related species occurrence to these four bioclimatic variables using the Random Forest algorithm 26 . As only presence data are archived in the WOODIV database, we randomly sampled a number of pseudo-absences equal to the number of observed occurrences 27 . This random selection of pseudo-absences was repeated 10 times for each species. When comparing the floras, occurrence data in the Italian Peninsula, Sardinia and/or Sicily were highly unrepresentative of the distribution of some species (n = 84; see Supplementary Table 3). To overcome this potential bias in the models, we did not include these regions in the model calibration step (Supplementary Table 3). The model was projected in these areas after having tested the similarity in the variables between the projection dataset (Italy, Sicily, and Sardinia) and the fitting dataset (the rest of the study area). Indeed, when model predictions are projected into regions not analyzed in the fitting data, it is necessary to measure the similarity between the new environments and those in the training sample 28 , as models are less reliable when predicting outside their domain 29 . Similarity analyses computed using ExDet 30 indicated that all covariables in the projected area are within the univariate range of the fitting area and that there is no change in correlation between covariables (NT1 and NT2 = 0).
Each of these 10 datasets (per species) was then randomly split into two datasets to evaluate model performance on pseudo-independent data 31 : 70% of the data was used to calibrate models and the 30% remaining data was used to evaluate model performance using the True Skill Statistic (TSS 32 ) and the Area Under the Curve (AUC) of the receiver-operating characteristic (ROC) plot 33 metrics. This split-sample step was repeated 10 times resulting in 100 models per species.
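The replication scheme (10 pseudo-absence draws, each split 10 times into 70% calibration and 30% evaluation data) can be sketched as follows. This is only an illustration in Python, not the biomod2 workflow used by the authors: the function name, the seed, and the choice to draw pseudo-absences only from cells without a presence record are assumptions.

```python
import random


def build_replicates(presence_cells, background_cells, n_pa_sets=10, n_splits=10,
                     calib_fraction=0.7, seed=42):
    """Return a list of (calibration, evaluation) datasets, one pair per replicate."""
    rng = random.Random(seed)
    # Assumption: pseudo-absences are sampled among cells with no presence record.
    candidates = sorted(set(background_cells) - set(presence_cells))
    replicates = []
    for _ in range(n_pa_sets):
        # As many pseudo-absences as observed presences.
        pseudo_absences = rng.sample(candidates, len(presence_cells))
        labelled = [(c, 1) for c in presence_cells] + [(c, 0) for c in pseudo_absences]
        for _ in range(n_splits):
            shuffled = labelled[:]
            rng.shuffle(shuffled)
            cut = round(calib_fraction * len(shuffled))
            replicates.append((shuffled[:cut], shuffled[cut:]))
    return replicates


# Toy cell identifiers (hypothetical).
presence = [f"cell{i}" for i in range(50)]
background = [f"cell{i}" for i in range(1000)]
reps = build_replicates(presence, background)
print(len(reps))  # 10 pseudo-absence sets x 10 splits
```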
For each of the 171 modelled species, a mean model (from the 100 replicates) was then used to predict potential species distribution. Predicted probabilities of occurrence were finally converted into presence/absence using the threshold maximizing the TSS. We fitted all models in the R environment 34 using the biomod2 package 35,36 .
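The True Skill Statistic is defined as sensitivity + specificity − 1, and the binarization step picks the probability cut-off that maximizes it. The sketch below shows this logic in plain Python; the function names, the candidate thresholds (0.01 to 0.99), and the toy probabilities are assumptions for illustration, and both classes are assumed to be present in the evaluation data.

```python
def tss(observed, predicted):
    """True Skill Statistic = sensitivity + specificity - 1 for binary labels."""
    tp = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 1)
    fn = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 0)
    tn = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 0)
    fp = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def to_presence_absence(probabilities, threshold):
    """Convert probabilities of occurrence into 1 (presence) or 0 (absence)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def best_threshold(observed, probabilities):
    """Return the candidate threshold that maximizes the TSS (first one on ties)."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: tss(observed, to_presence_absence(probabilities, t)))


# Toy evaluation data (invented values).
observed = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
threshold = best_threshold(observed, probabilities)
binary = to_presence_absence(probabilities, threshold)
print(threshold, tss(observed, binary))
```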
The WOODIV database provides modelled occurrences of each of the 171 species for each 10 × 10 km grid cell (Fig. 1a). Thirty-two species with fewer than 10 occurrence records were not modelled (Supplementary Table 3). Among these 32 species, 21 are small-ranged species whose distribution is limited to a few grid cells (Supplementary Table 3). The observed occurrence records for these 21 species can be considered representative of their distribution, and we therefore recommend using the non-modelled records for analyses of these species. The occurrences of the remaining 11 species should be considered unrepresentative of their distribution.
Functional data. Four functional traits were considered in this project: adult plant height (Height), seed mass (SeedMass), specific leaf area (SLA), and wood density (StemSpecDens). These traits have been proposed to reflect a global spectrum of plant strategies 37,38 : height is a commonly measured proxy for individual size and reflects several aspects including resource acquisition, competitive ability, or dispersal capacity. SeedMass represents the trade-off between fecundity, seed survival, and dispersal. SLA (the ratio between leaf area and dry mass) is correlated to photosynthetic capacity and leaf life span and is an indirect measure of the return on investments in carbon gain compared to water loss. StemSpecDens is a key component of woody plant growth linked to the mechanical support of the stem and its growth rate.
We compiled the values for these traits at the species level for the trees from the Médail et al. 1 checklist, referring mostly to two databases: TRY 9 and BROT 2.0 39 . Supplementary values were obtained from more specific databases (Global Wood Density Database 40 , Kew Seed Information Database 41 ) or from the scientific literature and atlases 42-61 . In total, 92% of the entries were extracted from TRY, 7% from BROT 2.0, and the remainder were retrieved from the other sources. The original ID of records from the TRY and BROT databases is provided so that users who need contextual information can refer to the complete observation.
The WOODIV database lacks all trait data for only 6 of the 210 species from the checklist. The database provides an R script that can be used to estimate missing trait values using the taxonomic classification if needed.
Genetic data. Three different DNA regions from the plastid genome corresponding to the most commonly used DNA barcode regions 62-64 were considered in this project: the ribulose-bisphosphate carboxylase large-subunit gene (rbcL), the maturase-K gene (matK), and the psbA-trnH intergenic spacer (trnH).
In a first step, we collected all sequences available in GenBank (https://www.ncbi.nlm.nih.gov/genbank/) for the three DNA regions, at the species level, for the species from the Médail et al. 1 checklist (rbcL: n = 650 sequences for 146 species, matK: n = 644 sequences for 127 species, trnH: n = 493 sequences for 129 species). To fill the gaps, we obtained DNA from fresh samples collected in the field or gathered from herbarium specimens (Supplementary Table 4). DNA extraction and sequencing were performed at INRA-URFM, Avignon (France) and the National Research Council (IBBR-CNR), Florence (Italy) (rbcL: n = 233 for 125 species, matK: n = 162 for 91 species, trnH: n = 200 for 120 species). Methods used for DNA isolation and Sanger sequencing are described by Albassatneh et al. 65 . When more than one sequence was available for a given DNA region/species, a sequence alignment was performed to check data quality and a taxon-consensus sequence was generated. Consensus sequences were built using the IUPAC-IUB ambiguity code 66 for a total of 119 (rbcL), 109 (matK), and 110 species (trnH), respectively (Fig. 2c). All newly created sequences were uploaded to GenBank.
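As an illustration of how a taxon-consensus sequence can be derived from aligned sequences with the IUPAC ambiguity code, here is a minimal Python sketch. The rule applied (any disagreement at a site yields the code covering all observed bases; gaps are ignored unless every sequence has a gap) is an assumption, not necessarily the exact procedure followed by the authors, and the toy sequences are invented.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(aligned_sequences):
    """Build a consensus sequence from equal-length aligned sequences."""
    if len({len(s) for s in aligned_sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*aligned_sequences):
        bases = frozenset(b for b in (c.upper() for c in column) if b in "ACGT")
        # Assumption: a column holding only gaps or unknown characters stays a gap.
        result.append(IUPAC[bases] if bases else "-")
    return "".join(result)


# Toy alignment (invented): sites 2 and 4 vary between C and T.
aligned = ["ACGTA", "ACGTA", "ATGCA"]
print(consensus(aligned))
```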
The WOODIV database lacks DNA-region sequence data for only 6 of the 210 species from the Médail et al. 1 checklist (Supplementary Fig. 1).
Uneven taxon sampling focused on a single biogeographic area, such as ours, can bias phylogenetic inferences 67 . Our goal here is to provide DNA sequence data that can be readily re-used to estimate, e.g. comparable phylogenetic diversity indices, not phylogenetic inferences per se. To illustrate our DNA-sequence data and to facilitate their use in future analyses (to calculate phylogenetic diversity, for example), we constructed a molecular phylogeny encompassing the 204 Euro-Mediterranean tree species for which sequence data were available. Each gene was independently aligned using the MAFFT program 68 and parsed using the program Gblocks 69 to exclude the segments characterized by several variable positions or gaps from final alignments. An appropriate substitution model of sequence evolution was selected for each of the three plastid DNA regions using the Akaike Information Criterion (AIC) as implemented in the JModeltest 2 program 70 . The optimal substitution model identified was the same for all three sequences: GTR + I + G. We obtained a concatenated matrix with 1615 aligned bases. We ran a maximum likelihood analysis 71 as implemented in the RAxML V8 program 72 . The DNA sequence matrix of 1615 sites was analyzed using three partitions with the GTRGAMMAI model (GTR + Gamma substitution model + proportion of invariant sites). We searched for the optimal tree by running at least 20 independent maximum likelihood analyses; full analyses also included 100 bootstrap replicates 72 . For users who would like to work on the complete pool of 210 tree species, we also built a 210-species phylogram including all Euro-Mediterranean trees. The six missing species for which no DNA-region sequence was available were added to the phylogenetic tree using the Simulation with Uncertainty for Phylogenetic Investigations (SUNPLIN) method 73 , with 100 replicates.
The geometric median tree was computed from the set of 100 replicates with the medTree function from the R package treespace 74 . Both the median tree and the set of 100 replicates are provided in the WOODIV database, together with the molecular tree with 204 species.

Data Records
The data are available on the figshare data repository https://doi.org/10.6084/m9.figshare.13952897.v2 75 and are comprised of twenty files and two R scripts divided into six folders (Fig. 3), all named following the pattern "WOODIV_filename.ext".
The "SPECIES" folder includes three datasets in comma-separated values (csv) format (Online-only Tables 1, 2): the "Species_code" file matches the species code used in the WOODIV database and the scientific name as defined by Médail et al. 1 ; the "Nomenclature" file includes the nomenclature data of all taxa, from the order to the species or subspecies level, and synonymous names if any; the "Status" file indicates which taxon is endemic or cultivated following the Médail et al. 1 definition (see Methods section).
The "OCCURRENCE" folder includes five datasets in csv format and one R script (Online-only Tables 1, 2). The "Occurrence_data" file includes all observed occurrences of species at the grain size of 10 × 10 km aligned with the INSPIRE LAEA grid, the associated country, and the code of the source from where the data was extracted; the "SDM" file includes the modelled occurrences of the (171) species at the 10 × 10 km grid cell-size aligned with the INSPIRE LAEA reference grid; the "Occurrence_source" file matches the source code to the full description of the source; the "Aggregation" file indicates if taxa can be merged (e.g. collapsing all subspecies level data to the species or species' group level); the "Country" file shows whether the taxon is present (native or introduced) or absent in each country; the "working_file_generation" R script combines all these datasets into a global dataset.
The "TRAITS" folder includes two datasets in csv format and one R script (Online-only Tables 1, 2): the "Trait_data" table includes the functional trait values, the code of the source from which they were extracted, and, when relevant, the source database from which the data were extracted, as well as the ID within this database; the "Trait_source" table matches the source code to the full description of the source; the "trait_table_generation" R script provides the method to average the trait values at the species level and to replace missing values with the mean trait values of the higher taxonomic level, while recording the level used in a table. Supplementary Table 5 indicates, for each species/trait pair, the taxonomic level at which the trait value was assessed with the current data and code.
Fig. 3 (caption fragment): Occurrence data and the associated information and script are in the yellow box, nomenclature information in the grey box, DNA-region sequences data in the blue box, phylogenetic data in the purple box, the functional data and associated script in the dark orange box, and spatial data files in the dark green box. Contents of provided files are described in Online-only Tables 1 and 2.
The "SEQUENCES" folder includes one dataset in csv format and three text-based files for representing nucleotide sequences (FASTA). The "Sequence_source" file shows the GenBank reference number of each DNA-region sequence together with the data source (either GenBank or WOODIV); the sources of the samples sequenced by the WOODIV consortium are listed in Supplementary Table 4; the fasta files refer to the sequences (unique or consensus) for each species and DNA-region used to build the phylogenetic tree of the 204 species and named according to the DNA-region.
The "PHYLOGENY" folder includes four phylogenetic trees. The "Phylogeny_204spp_BS" file includes the phylogram of the 204 species for which at least one of the three DNA-region sequences was available with bootstrap values, in nexus format. The "Phylogeny_204spp" file includes the same phylogram without bootstrap in Newick format. The "Phylogenies_210spp_100rep" file includes the 100 replicates of the phylogeny of the 210 species from the checklist, in Newick format. The "Phylogeny_210spp_median" file includes the median tree from the 100 replicates for the 210 species, in Newick format.
The "SPATIAL" folder has two subfolders: the "Study_area_shape" subfolder includes a polygon shapefile (EPSG: 3035) delimiting the study area, while the "10 × 10 km_grid" subfolder includes a polygon shapefile (EPSG: 3035) displaying the part of the INSPIRE LAEA grid that covers the study area.

Technical Validation
Observed and modelled occurrence data. The first step of data validation when gathering occurrence data is to agree on a taxonomic backbone. We followed the list of accepted names and their synonyms compiled by Médail et al. 1 for all Euro-Mediterranean tree species. The WOODIV database includes a taxonomy table which provides the nomenclature from different taxonomic references: EURO + MED Plant Base (http://www.emplantbase.org/home.html), Browicz 76 , and the World Checklist Kew (http://wcsp.science.kew.org).
Errors in georeferenced data are common, but many of them can be easily detected 77 . We systematically filtered the data to discard records (i) with missing latitude or longitude or (ii) falling outside the study area covered by the data source (e.g. outside the borders of a country for a national atlas), and standardized the projection system if needed. In other cases, coordinates for species records appear correct but could fall outside the known and validated range of a species, mostly due to uncertain or erroneous taxonomic identification. These cases are more complicated to detect, requiring validation by an expert and/or comparison with an independent dataset to distinguish a false identification from a validated location. This step is often neglected due to lack of time or because the expertise is not available. In the WOODIV project, we implemented these two time-consuming validation steps: (i) using independent data provided at the country level to discard records falling outside the known species range. This step was led by the botanists using the country checklist of trees in Mediterranean Europe published by the same authors 1 . For Spain only, botanists also compared the spatial distribution of occurrences available from the GBIF platform with the occurrence maps provided by Flora Iberica 61 and Flora-On (https://flora-on.pt/); (ii) having botanists from each of the 13 countries and islands check the resulting occurrence maps to discard dubious records. This validation step resulted, for example, in the deletion of records of planted trees such as those of Abies pinsapo Boiss. planted outside the native range in southern Spain.
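The two automatic filters (missing coordinates, records outside the area covered by the source) can be sketched in a few lines of Python. A rectangular bounding box is used here as a crude stand-in for the actual borders of a country or atlas area, which is an assumption made for brevity; the record fields, the box, and the toy values are hypothetical.

```python
def filter_records(records, bbox):
    """Keep records that have coordinates and fall inside bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat) in decimal degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    kept = []
    for record in records:
        lon, lat = record.get("lon"), record.get("lat")
        if lon is None or lat is None:
            continue  # filter (i): missing latitude or longitude
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue  # filter (ii): outside the area covered by the source
        kept.append(record)
    return kept


# Toy records and a rough, invented bounding box for a national atlas.
atlas_bbox = (13.0, 45.0, 17.0, 47.0)
records = [
    {"species": "Fagus sylvatica", "lon": 14.5, "lat": 46.1},
    {"species": "Fagus sylvatica", "lon": None, "lat": 46.1},   # missing longitude
    {"species": "Fagus sylvatica", "lon": 2.3, "lat": 48.9},    # outside the box
]
clean = filter_records(records, atlas_bbox)
print(len(clean))
```

Records that pass these filters can still be misidentified, which is why the expert validation described above remains necessary.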
To assess gaps in the occurrence data within the WOODIV dataset, we compared our occurrence data with the data of the Atlas Florae Europaeae (AFE) 15 . The AFE provides distribution data for vascular plants in Europe at a 50 × 50 km resolution. We compared the occurrence distribution only for species present in both the AFE and the WOODIV data (n = 104). For each of these species, we checked whether our dataset provides occurrences in the grid cells where the AFE reported presence. We mapped the AFE grid cells where occurrences are missing in our dataset (Fig. 1b) and those where our dataset reports occurrences that the AFE does not (Fig. 1c). Overall, the comparison with the AFE (based on 10,585 occurrences in the 50 × 50 km AFE grid cells for the 104 species) showed that we added more occurrence information (n = 5405, i.e. +51.1%) than we missed (n = 2186, i.e. 20.7%), indicating that our database contributes substantial new information on Euro-Mediterranean trees. The most important gaps in the data occurred in Italy and in Montenegro (inland, as we collected additional field data in the coastal area).
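The comparison amounts to set differences between (species, cell) pairs, and the reported percentages are both expressed relative to the 10,585 AFE occurrences. The Python sketch below checks that arithmetic and shows the set logic on toy data; the function name, species codes, and cell labels are hypothetical.

```python
def compare(reference, candidate):
    """Count shared, missed, and added (species, cell) pairs relative to a reference."""
    reference, candidate = set(reference), set(candidate)
    return {
        "shared": len(reference & candidate),
        "missed": len(reference - candidate),   # in the reference only
        "added": len(candidate - reference),    # in the candidate only
    }


# Figures reported in the text.
afe_total = 10_585
added = 5_405
missed = 2_186
pct_added = round(100 * added / afe_total, 1)
pct_missed = round(100 * missed / afe_total, 1)
print(pct_added, pct_missed)  # 51.1 20.7

# Toy example with invented species codes and 50 x 50 km cell labels.
afe = {("QILE", "A1"), ("QILE", "A2"), ("PHAL", "B1")}
woodiv = {("QILE", "A1"), ("PHAL", "B1"), ("PHAL", "B2"), ("PHAL", "B3")}
summary = compare(afe, woodiv)
print(summary)
```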
All species distribution models were tested for their predictive ability on the evaluation dataset using both the AUC and the TSS metric (Supplementary Table 3). A filter was applied to modelled occurrences based on the presence or absence of species in each country as indicated by Médail et al. 1 . Thus, modelled but unconfirmed occurrences of species, namely "false occurrences", were converted to absences.
Trait data. Datasets from juvenile stages were systematically discarded. Trait measures were checked for consistency in the unit (m for Height, mg for SeedMass, g·cm⁻³ for StemSpecDens, and mm²·mg⁻¹ for SLA). Categorical coded values (e.g. high, medium, heavy) and extreme outliers were removed. For species with shrub and tree forms, maximum or range values of Height were taken only for tree forms. When coordinates were provided in the databases, we filtered out those from outside of the Euro-Mediterranean Basin in order to keep trait variation observed within the region. Finally, redundancies between the different sources were checked and duplicated entries were removed to keep only one entry.
Genetic data and phylogeny. For each taxon, sequences were quality checked and edited using CodonCode Aligner (CodonCode Co., MA, USA) to trim and remove low-quality regions. For sequences from GenBank, long sequences were preferred. For INRA-URFM and IBBR-CNR sequences, the quality of the chromatograms was visually checked, and ambiguous nucleotides were called using the uncertainty code. All sequences were blasted and matched with the closest relatives. Sequences falling outside genus sections were removed from the data set. Multiple sequence alignments were built using the program MAFFT 42 and parsed using the program Gblocks 43 to exclude the segments characterized by several variable positions or gaps from final alignments. The monophyly of families and genera was checked in the inferred phylogeny. In case of non-monophyly, the sequences were blasted again to validate them. We compared the topology of orders and above in our tree with the tree published in APG IV to make sure that the topologies were mainly congruent. The slight discrepancies we observe with reference phylogenies are mostly in families that are notoriously phylogenetically complicated, with incomplete lineage sorting and frequent speciation events, as in the Rosaceae and the Fagaceae.

Usage Notes
Two summary tables can be generated from the different tables of the WOODIV database following the workflow presented in Fig. 3, using the scripts included in the database. The first table, named "working file", is generated by the "WOODIV_working_file_generation.R" script, which relates all information regarding the species and occurrences. As a first step, the observed recorded ("Occurrence_data") and the modelled ("SDM") occurrences are merged into one, with a variable indicating the type of data for each occurrence (observed or modelled). Then the classification of each species ("Nomenclature" data) is added to the table, using the species code as an index. The next step inserts the information about taxa aggregation ("Aggregation" data) at the species or species' group level as described in the "Methods" section. The status of the species in each country (native or introduced) is added for each occurrence using the "Country" data. The last information added to the table is the cultivated or endemic status of the species ("Status" data). Other variables (e.g. the scientific name of each species for each occurrence) or filters (e.g. to select only the SDM outputs) can be easily generated from this resulting table ("working file").
The second table includes a summary of the functional traits for each species. The "WOODIV_trait_table_generation.R" script can be used to compute the mean value for each trait and each species ("Mean trait by species" table) from all trait measures included in the "Trait_data". In addition, the "Nomenclature" data can be used to impute values for species with no value for a given trait based on the taxonomic classification by taking the mean values of higher rank. Genus, family, or order levels are currently implemented in the script.
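The imputation logic can be sketched as follows. This is a Python illustration of the idea, not a translation of the "WOODIV_trait_table_generation.R" script: it assumes that the higher-rank value is the mean of the available species means, and all trait values below are invented.

```python
from statistics import mean


def species_means(measurements):
    """Average all (species, value) trait measurements per species."""
    by_species = {}
    for species, value in measurements:
        by_species.setdefault(species, []).append(value)
    return {species: mean(values) for species, values in by_species.items()}


def impute(species, taxonomy, sp_means, ranks=("genus", "family", "order")):
    """Return (value, level used); fall back to the lowest rank that has data."""
    if species in sp_means:
        return sp_means[species], "species"
    for rank in ranks:
        group = taxonomy[species][rank]
        values = [m for sp, m in sp_means.items() if taxonomy[sp][rank] == group]
        if values:
            return mean(values), rank
    return None, None


taxonomy = {
    "Quercus ilex": {"genus": "Quercus", "family": "Fagaceae", "order": "Fagales"},
    "Quercus suber": {"genus": "Quercus", "family": "Fagaceae", "order": "Fagales"},
    "Quercus faginea": {"genus": "Quercus", "family": "Fagaceae", "order": "Fagales"},
    "Fagus sylvatica": {"genus": "Fagus", "family": "Fagaceae", "order": "Fagales"},
}
# Invented wood density values, for illustration only.
measurements = [("Quercus ilex", 0.75), ("Quercus ilex", 1.0), ("Quercus suber", 0.625)]

means = species_means(measurements)
print(impute("Quercus ilex", taxonomy, means))     # measured at the species level
print(impute("Quercus faginea", taxonomy, means))  # imputed from the genus mean
print(impute("Fagus sylvatica", taxonomy, means))  # imputed from the family mean
```

Returning the level alongside the value mirrors the recommendation to record which taxonomic rank each imputed value comes from.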
The "working file" table, the "mean trait by species" table, the 210-species phylogenetic tree, and the spatial layers are organized to make several analyses easy to perform, such as diversity maps: the "working file" table can be filtered to keep either the observed or the modelled occurrences and converted into a community matrix giving the number of occurrences of each taxon in each cell, using the cell ID as row and the aggregation level as column. Diversity metrics can be derived from this matrix (e.g. the number of occurrences or of taxa by cell), as well as phylogenetic and functional metrics using appropriate tools and functions. To derive the latter, the match and ordering between the taxa labels of the occurrences, traits, and phylogenetic data must be carefully checked (e.g. using the organize.syncsa function in the SYNCSA R package or the match.phylo.comm and match.phylo.data functions in the picante R package). Maps can then be generated using the cell grid layer to spatialize the metrics.
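A minimal Python sketch of the conversion from a long occurrence table to a community matrix, followed by taxon richness per cell and a label check against the phylogeny, is given below. The row layout (cell ID, taxon code, occurrence type), the taxon codes, and the tip labels are hypothetical and do not reproduce the column names of the WOODIV files.

```python
def community_matrix(occurrences):
    """Build a cell x taxon count table from (cell, taxon) pairs."""
    cells = sorted({cell for cell, _ in occurrences})
    taxa = sorted({taxon for _, taxon in occurrences})
    counts = {cell: {taxon: 0 for taxon in taxa} for cell in cells}
    for cell, taxon in occurrences:
        counts[cell][taxon] += 1
    return cells, taxa, counts


def richness(counts):
    """Number of taxa with at least one occurrence in each cell."""
    return {cell: sum(1 for n in row.values() if n > 0) for cell, row in counts.items()}


# Toy "working file" rows: (cell ID, taxon code, occurrence type).
working_file = [
    ("c1", "QILE", "observed"),
    ("c1", "PHAL", "observed"),
    ("c2", "QILE", "observed"),
    ("c2", "QILE", "modelled"),
    ("c3", "PHAL", "modelled"),
]
observed = [(cell, taxon) for cell, taxon, kind in working_file if kind == "observed"]
cells, taxa, counts = community_matrix(observed)
cell_richness = richness(counts)
print(cell_richness)

# Check that every taxon in the matrix has a matching tip label in the tree.
tree_tips = {"QILE", "PHAL", "FSYL"}
unmatched = set(taxa) - tree_tips
print(unmatched)
```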
As biodiversity data are rapidly accumulating, new information will become available. The same standardized cleaning and filtering processes can be applied to upcoming occurrences and traits data, and the future updates of the database will be uploaded as new versions of the database on the same figshare data repository 75 once a year. If a user has an error to report or a suggestion to improve the database, the corresponding author can be contacted.

Code availability
Two R scripts are available with the data files in the database. The "WOODIV_working_file_generation.R" script in the "OCCURRENCE" folder combines all the information about species occurrences and nomenclature into one table to run the analyses. The "WOODIV_trait_table_generation.R" script in the "TRAITS" folder uses the species nomenclature to compute species mean traits and to impute values when no data are available. Both run under R version 3.6 (last tested under version 3.6.2 34 ).