The global spectrum of plant form and function: enhanced species-level trait dataset

Here we provide the ‘Global Spectrum of Plant Form and Function Dataset’, containing species mean values for six vascular plant traits. Together, these traits – plant height, stem specific density, leaf area, leaf mass per area, leaf nitrogen content per dry mass, and diaspore (seed or spore) mass – define the primary axes of variation in plant form and function. The dataset is based on ca. 1 million trait records received via the TRY database (representing ca. 2,500 original publications) and additional unpublished data. It provides 92,159 species mean values for the six traits, covering 46,047 species. The data are complemented by higher-level taxonomic classification and six categorical traits (woodiness, growth form, succulence, adaptation to terrestrial or aquatic habitats, nutrition type and leaf type). Data quality management is based on a probabilistic approach combined with comprehensive validation against expert knowledge and external information. Intense data acquisition and thorough quality control produced the largest and, to our knowledge, most accurate compilation of empirically observed vascular plant species mean traits to date.


Background & Summary
Plant traits are the morphological, chemical, physiological or phenological properties of individuals 1. They determine how plants as primary producers capture, process and store resources, how they respond to their abiotic and biotic environment and disturbances, and how they affect other trophic levels and the fluxes of water, carbon and energy through ecosystems 2–8.
Despite the overwhelming diversity of plant forms and life histories on Earth, single plant organs, such as leaves, stems, or seeds, show comparatively few essential trait combinations 9. Evidence for recurrent trait syndromes beyond the level of single organs has been rare, restricted geographically or taxonomically, and often contradictory. Díaz et al. 9 addressed this gap by analyzing the worldwide variation in six major traits critical to growth, survival and reproduction, namely: plant height (H), stem specific density (SSD), leaf area (LA), leaf mass per area (LMA), leaf nitrogen content per dry mass (N mass) and diaspore (seed or spore) mass (SM). They found that occupancy of the six-dimensional trait space is highly constrained and is captured in a two-dimensional global spectrum of plant form and function, indicating strong correlations and trade-offs among traits. These results provide a foundation and baseline for studies of plant evolution, comparative plant and ecosystem ecology, and predictive modelling of future vegetation based on continuous variation in essential plant functional dimensions.
Here we provide the trait dataset that served as the basis for the analysis of the global spectrum of plant form and function presented in Díaz et al. 9: the 'Global Spectrum of Plant Form and Function Dataset' (hereafter 'Global Spectrum Dataset'). The dataset is predominantly based on trait records compiled in the TRY database 10,11 and provides trait values corresponding, to the extent possible, to mature and healthy plants grown under natural conditions within the species distribution range. The dataset provides species mean values for the six plant traits mentioned above plus leaf dry matter content, which was used for the imputation of stem specific density. The dataset covers >46,000 of the approximately 391,000 vascular plant species known to science 12. Despite the rapid development of large plant trait datasets, the Global Spectrum Dataset stands out in terms of coverage and reliability.
First, it provides quantitative information for a very high number of species, including about 5% of them with 'complete coverage' (all six traits). Second, it represents a unique combination of probabilistic outlier detection
# A full list of authors and their affiliations appears at the end of the paper.

Definition of representative trait records. The six core quantitative traits show intraspecific variation, caused among other factors by different ontogenetic stages and growth conditions. The dataset focuses on mean trait values for species rather than on intraspecific variation, and was intended to represent species mean trait values for mature and healthy (not obviously unhealthy) plants grown under natural conditions within the species distribution range. Leaf traits were intended to represent young but fully expanded and healthy leaves from the light-exposed top canopy. Trait records not conforming to these requirements, i.e. records from plants grown in laboratories under experimental conditions and records measured on juvenile plants, were excluded from the dataset. This decision was made based on the respective metadata in the TRY database (see below). Categorical traits were derived from the TRY Categorical Traits Dataset (https://www.try-db.org/TryWeb/Data.php#3), enhanced by field data and various literature sources.
The datasets contributing via TRY to the quantitative traits are described in Supplementary Table S1, which contains data from refs. 4,
Data integration and quality management.
Semantic integration of terminologies from different datasets. Ecological studies address a large number of different questions at different scales, and researchers often work independently with little coordination among them. This results in idiosyncratic datasets using heterogeneous terminologies 14. The first step was therefore a semantic integration of terminologies. The core traits were standardized according to the definitions and measurement protocols provided in the Thesaurus Of Plant Characteristics (TOP) 14 and the 'New Handbook for Standardised Measurement of Plant Functional Traits Worldwide' 6,15. The metadata for plant and organ maturity (juvenile, mature), health (healthy, not healthy), growth conditions (natural conditions, experimental conditions), and sun- versus shade-grown leaves were harmonized across datasets.
Consolidation of taxonomy. Species names were standardized and attributed to families according to The Plant List (http://www.theplantlist.org), the commonly accepted list for vascular plants at the time of publication of Díaz et al. 9 , using TNRS 234,235 , complemented by manual standardization by experts. Attribution of families to higher-rank groups was made according to APG III (2009) (http://www.mobot.org/MOBOT/research/APweb/).

Conversion and correction of units, and exclusion of errors.
Different datasets often used different units for the same trait. After conversion to the standardized unit per trait, differences among datasets, sometimes of an order of magnitude or more, became obvious. These differences could often be traced back to errors in the original units and were corrected. Obvious errors (e.g. impossible trait values such as LMA < 0 g/m²) were excluded from the dataset.

Data imputation.
To increase the number of species with values for all six core traits, trait records for SSD, LMA, N mass and SM were complemented by trait values derived from records of related traits:
- Imputation of SSD. Trait records for SSD are available for a very large number of woody species, but only for very few herbaceous species. To incorporate this fundamental trait in the analyses by Díaz et al. 9, we complemented SSD of herbaceous species using an estimation based on leaf dry matter content (LDMC), a much more widely available trait, and its close correlation to stem dry matter content (StDMC, the ratio of stem dry mass to stem water-saturated fresh mass). StDMC is a good proxy of SSD in herbaceous plants with a ratio of approximately 1:1 199, despite substantial differences in stem anatomy among botanical families 236, including those between non-monocotyledons and monocotyledons (where sheaths were measured). We used a data set of 422 herbaceous species collected in the field across Europe and Israel, and belonging to 31 botanical families, to parameterize linear relationships of StDMC to LDMC. The slopes of the relationship were significantly higher for monocotyledons than for other angiosperms (F = 12.3; P < 0.001, from a covariance analysis); within non-monocotyledons, the slope for Fabaceae was higher than that for species from other families (F = 4.5; P < 0.05, from a covariance analysis). We thus used three different equations to predict SSD for 1963 herbaceous species for which LDMC values were available in TRY (Table 1): one for monocotyledons, one for Fabaceae, and a third one for other non-monocotyledons. Estimated data are flagged.
- Imputation of LMA. Trait records for SLA (leaf area per leaf dry mass) were converted to LMA (leaf dry mass per leaf area): LMA = 1/SLA.
- Imputation of N mass. Trait records for leaf nitrogen content per leaf area (N area) were converted to records of leaf nitrogen content per leaf dry mass (N mass) if records for LMA were available for the same observation (leaf): N mass = N area/LMA.
- Imputation of SM. To be able to include trait data for pteridophytes in the analyses in Díaz et al. 9, diaspore mass values were estimated based on published data for spore radius (r). We assumed that spores would be approximately spherical, with volume = (4/3)πr³, and that their density would be 0.5 mg mm⁻³ (refs. 237–240). Although these assumptions are imprecise, we are confident they result in spore masses within the right order of magnitude and several orders of magnitude smaller than the seed mass of spermatophytes. Most data were from Page 237, data for Sadleria pallida were from Lloyd 238, for Pteridium aquilinum from Conway 239, and for Diphasiastrum spp. from Stoor et al. 240.
Probabilistic outlier detection. The hierarchical taxonomic classification of plants into families, genera and species has been shown to be highly informative with respect to the probability of trait values 241–243. We therefore used it to conduct outlier detection at each of these levels. The six core traits provided in the Global Spectrum Dataset are approximately normally distributed on a logarithmic scale 10. We therefore assume that, on a log scale, trait values are samples from normal distributions. A normal distribution is symmetric about its mean, with 99.73% of data expected within the mean ± 3 standard deviations and 99.99% within the mean ± 4 standard deviations. Using these wide confidence intervals ensures that extreme values that correspond to truly extreme values of traits in nature are not mistakenly identified as outliers and therefore excluded from the dataset.
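The coverage figures quoted for the normal distribution can be checked numerically. The snippet below is a minimal sketch using only the Python standard library; the function name `coverage_within` is our own and does not come from the TRY tooling.

```python
import math


def coverage_within(k: float) -> float:
    """Fraction of a normal distribution lying within mean +/- k standard deviations."""
    # For a standard normal variable Z, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2.0))


for k in (3, 4):
    inside = coverage_within(k)
    outside = 1.0 - inside
    print(f"+/- {k} SD: {100 * inside:.2f}% inside, {100 * outside:.4f}% outside")
```

The two-sided tail probabilities are roughly 0.27% beyond 3 standard deviations and well below 0.1% beyond 4, which is why values outside these ranges are treated as suspect.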
The z-score indicates how many standard deviations a record is away from the mean: z = (x − μ)/σ, where x is the log-transformed trait value, and μ and σ are the mean and standard deviation of log-transformed trait values for the respective taxon. Trait values with absolute z-scores >4 (>3) have a probability of less than 0.1% (0.3%) of being true values of the normal distribution. These trait values are most probably caused by errors not yet detected for these individual records, e.g., wrong unit, decimal error of trait value, wrong species (e.g. by mistake attributing a herb species name to a height measured on a tree), problems related to the trait definition, or non-representative growth or measurement conditions. We acknowledge, however, that our choice of z-score cutoff is arbitrary.
In many cases the number of trait values per taxon (e.g. a given species) was too small for a representative sample and did not provide a reliable estimate of the standard deviation (see Fig. 1). To circumvent this problem, we used the average standard deviation of trait values at the given taxonomic level, e.g., species, genus, family or all vascular plants. This average is an approximation of the standard deviation to be expected for an individual taxon if a sufficient number of observations were available (Fig. 1) 10.
This probability-based data quality assessment on the different levels of the taxonomic hierarchy is routinely conducted within the TRY database for all traits with more than 1000 records. The z-score values for each trait record are made available on the TRY website and the highest absolute value is provided with each data release.
Trait values with an absolute z-score >4 (more than 4 standard deviations from at least one taxon mean) were excluded from the dataset unless their retention could be justified from external sources. Trait records with an absolute z-score of 3 to 4 (3 to 4 standard deviations from at least one taxon mean) were checked for plausibility by domain experts among the authors, and retained or excluded accordingly.
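The triage rule described above can be illustrated with a short sketch. It is not the TRY implementation: the function names, the use of log10, the example heights and the pooled standard deviations (0.2 and 0.3) are all hypothetical, and we assume the taxon mean is computed from records other than the one being tested.

```python
import math
from statistics import mean


def z_score(value: float, taxon_values: list[float], avg_log_sd: float) -> float:
    """z-score of one record on a log10 scale (base is an assumption).

    avg_log_sd stands in for the average standard deviation at the given
    taxonomic level, used instead of a per-taxon estimate.
    """
    log_mean = mean(math.log10(v) for v in taxon_values)
    return (math.log10(value) - log_mean) / avg_log_sd


def max_abs_z(value: float, levels: list[tuple[list[float], float]]) -> float:
    """Largest absolute z-score across taxonomic levels (species, genus, ...)."""
    return max(abs(z_score(value, vals, sd)) for vals, sd in levels)


def triage(abs_z: float) -> str:
    """Apply the cutoffs from the text: >4 exclude, 3 to 4 expert check."""
    if abs_z > 4:
        return "exclude unless justified"
    if abs_z >= 3:
        return "expert check"
    return "retain"


# Hypothetical plant heights in metres for one species and its genus.
species_heights = [1.0, 1.2, 0.9, 1.1]
genus_heights = species_heights + [2.0, 0.5, 1.5]
levels = [(species_heights, 0.2), (genus_heights, 0.3)]

for record in (1.05, 100.0):  # 100.0 mimics a unit error (cm entered as m)
    print(record, triage(max_abs_z(record, levels)))
```

Taking the maximum across levels mirrors the wording "from at least one taxon mean": a record is flagged if it is extreme relative to any level of the hierarchy.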
Exclusion of duplicate trait records. Duplicate trait records were identified on the basis of the following criteria: same species (after standardization of taxonomy), similar trait values (accounting for rounding errors after semantic integration, unit conversion and data complementation), and no information on different measurement locations or dates.
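One possible reading of these three criteria is sketched below. The `Record` fields, the relative tolerance of 0.1% and the example values are assumptions made for illustration; the source does not state the tolerance actually used.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    species: str                 # name after taxonomic standardization
    trait: str
    value: float                 # in the standardized unit for the trait
    site: Optional[str] = None   # None means no location metadata
    date: Optional[str] = None   # None means no date metadata


def differs(a: Optional[str], b: Optional[str]) -> bool:
    """True only if both values are known and they disagree."""
    return a is not None and b is not None and a != b


def is_duplicate(a: Record, b: Record, rel_tol: float = 1e-3) -> bool:
    """Same species and trait, similar value, no evidence of a different site or date."""
    same_key = a.species == b.species and a.trait == b.trait
    similar = math.isclose(a.value, b.value, rel_tol=rel_tol)
    return same_key and similar and not (differs(a.site, b.site) or differs(a.date, b.date))


def drop_duplicates(records: list[Record], rel_tol: float = 1e-3) -> list[Record]:
    """Keep the first record of each duplicate group (quadratic, fine for a sketch)."""
    kept: list[Record] = []
    for r in records:
        if not any(is_duplicate(r, k, rel_tol) for k in kept):
            kept.append(r)
    return kept


# Hypothetical LMA records (g/m2); the second differs only by rounding.
a = Record("Quercus robur", "LMA", 85.0)
b = Record("Quercus robur", "LMA", 85.004)
print(len(drop_duplicates([a, b])))
```

Records that carry different site or date metadata are kept even when their values match, since identical values measured at different places are legitimate independent observations.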
Calculation of species mean trait values. The resulting dataset was used to calculate species mean trait values, without further stratification along, e.g., datasets or measurement sites. As trait distributions of the six core traits have been shown to be log-normal 9 , the mean species trait values were calculated after log-transformation of the trait values (geometric mean).
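Averaging after log-transformation and back-transforming gives the geometric mean. The following minimal sketch shows the calculation; the seed-mass values are hypothetical and chosen only to show how the result differs from an arithmetic mean.

```python
import math


def geometric_mean(values: list[float]) -> float:
    """Mean on the log scale, back-transformed to the original unit."""
    if not values:
        raise ValueError("at least one trait value is required")
    if any(v <= 0 for v in values):
        raise ValueError("trait values must be positive to be log-transformed")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical seed-mass records (mg) for one species.
seed_mass = [1.0, 10.0, 100.0]
print(geometric_mean(seed_mass))          # geometric mean
print(sum(seed_mass) / len(seed_mass))    # arithmetic mean, for comparison
```

For right-skewed, approximately log-normal data the geometric mean is less sensitive to a few very large values than the arithmetic mean, which is the motivation for using it here.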
Addition of categorical traits. Data for the categorical traits were added and, where in doubt, checked against expert knowledge and independent external information from specialized websites on the Internet.
Final validation of taxonomy and mean trait values. Taxonomy was checked once more manually against The Plant List and APG III. The ten most extreme species mean values of each trait (smallest and largest) were checked manually for reliability against external sources. Finally, outliers of species mean traits, identified after categorization of species according to the categorical traits and in bi- and multivariate trait space, were validated against external sources (see Díaz et al. 9 Fig. 2, Extended Data Fig. 3, and Extended Data Fig. 4).

Data Records
The dataset is available under a CC-BY license at the TRY File Archive. The dataset consists of two data files.
The quantitative species-level trait information is based on about 1 million trait records (see Table S1), measured on >500,000 plant individuals (the number of different Observations in References; see below). One trait record reported in the datasets is often based on several replicated measurements from different representative individuals at a site. The New Handbook for Standardised Measurement of Plant Functional Traits Worldwide 6 recommends measurements on 10 to 25 individual plants or leaves, depending on the trait. Therefore, in the cases that followed this or related protocols, a trait record in the original database probably represents the site-specific mean trait value for a given species. Reporting only the site-specific mean trait value was standard procedure in older publications and aggregated databases, assuming a common approach to replicated measurements on different individuals. More recent datasets tend to provide all individual measurements, among other reasons because this allows better treatment of intraspecific trait variation.
The present dataset was derived from 157 datasets (Table S1). Trait records can be traced to ca. 2500 original publications (see References_original_sources.xlsx). All species are complemented with higher-level taxonomic information; 92.5% and 84.8% of species are attributed to categories according to woodiness and basic growth-form, respectively. The raw data are available via the TRY Database (https://www.try-db.org/TryWeb/Home.php).
Fig. 4 The coverage of species per trait with respect to woodiness (woody versus non-woody incl. semi-woody). The coverage in the GIFT database 247,248, a comprehensive baseline of plant growth form, is included for external comparison (see ref. 11 for more details). In parentheses: the number of species with data for the trait and the number of species for which woodiness could be determined.
References.xlsx. This file contains the references for all trait data that contributed to the core traits of the Global Spectrum Dataset via the TRY database. Where datasets contributed to TRY had themselves been compiled from original publications, the table also provides the references of these original publications. The references are linked to the data in the species mean trait dataset via unique species identifiers and trait names.
The sum of replicates in the species mean trait table is about 100,000 trait records fewer than the sum of 979,924 trait records in References and Supplementary Table S1, because the species mean trait table contains mean trait values and information on the number of trait records only for those species-trait combinations that were retained after data cleaning and imputation.

Technical Validation
The dataset has global coverage in geographic and climate space (Fig. 2, also Díaz et al. 9 Extended Data Fig. 1), although with known gaps 9–11. The numbers of species characterized per trait are similar to those in the TRY Database version 5, published in 2019 11. This indicates the efficiency of data collection and curation for the Global Spectrum Dataset. All species mean trait values (Table 2) are within the ranges published in Kattge et al. 10. Histograms of trait frequency distributions are provided in Fig. 3. The coverage of species per trait with respect to woodiness is presented in Fig. 4. The dataset has so far been used in Díaz et al. 9, where the data show a high internal consistency in bi- and multivariate analyses: known bivariate relationships were well reproduced (Díaz et al. 9 Extended Data Figs. 3 and 4) and individual species were located on the first axes of the principal component analysis in positions expected from general knowledge about these species (Díaz et al. 9 Fig. 2).

Usage Notes
In case the dataset is used in publications, both this paper and Díaz et al. 9 should be cited.
The six quantitative traits compiled here (plus LDMC) are among the best-covered quantitative traits in the TRY database. However, as is typical for these kinds of observational data, the numbers of records per species are unevenly distributed: few species mean trait values are based on a large number of records, while a large fraction of the species mean estimates is based on only a few records or a single trait record (see the difference between mean and median number of trait records per species and trait in Table 2; the number of trait records per species mean is also indicated in the dataset file 'Species_mean_traits.xlsx'). The representativeness of these mean values should be treated with caution, because the trait measurements are samples from the variation of traits within species, which for some traits can be substantial 10. However, as mentioned above, one trait record is often based on several trait measurements on characteristic individuals and therefore represents a site-specific species mean value. In the context of large-scale analyses, the variation within species has been shown to be considerably smaller than the variation between species 10.

Code Availability
Does not apply.