AusTraits, a curated plant trait database for the Australian flora

We introduce the AusTraits database - a compilation of values of plant traits for taxa in the Australian flora (hereafter AusTraits). AusTraits synthesises data on 448 traits across 28,640 taxa from field campaigns, published literature, taxonomic monographs, and individual taxon descriptions. Traits vary in scope from physiological measures of performance (e.g. photosynthetic gas exchange, water-use efficiency) to morphological attributes (e.g. leaf area, seed mass, plant height) which link to aspects of ecological variation. AusTraits contains curated and harmonised individual- and species-level measurements coupled to, where available, contextual information on site properties and experimental conditions. This article provides information on version 3.0.2 of AusTraits which contains data for 997,808 trait-by-taxon combinations. We envision AusTraits as an ongoing collaborative initiative for easily archiving and sharing trait data, which also provides a template for other national or regional initiatives globally to fill persistent gaps in trait knowledge.

AusTraits has been developed as a standalone database, rather than as part of the existing global database TRY 12, for three reasons. First, we sought to establish an engaged and localised community, actively collaborating to enhance coverage of plant trait data within Australia. We envisioned that a community would form more readily to fill gaps in national knowledge of traits with local ownership of the resource. While we will never have a counterfactual, a vibrant community excited to be part of this initiative has indeed been established, and coverage for Australian species is much higher than has been achieved since TRY's inception. Local ownership also aligns well with funding opportunities and national research priorities, and enables database coordinators to progress at their own speed. Second, we wanted to apply an entirely open-source approach to the aggregation workflow. All the code and raw files used to create the compiled database are available, and the database is freely available via a third-party data repository (Zenodo), which is itself built for long-term data archiving and has an established API. Finally, we targeted primary data sources, where possible, whereas TRY accepts aggregated datasets. The hope was that this would increase data quality by removing intermediaries and making duplicates easier to identify.
While independent, the overall structure of AusTraits is similar to that of TRY, ensuring the two databases will be interoperable. Both databases are founded on similar principles and terminology 18,19. Increasingly, researchers and biodiversity portals are seeking to connect diverse datasets 15, which is possible if they share a common foundation.
We envision AusTraits as an ongoing collaborative initiative for easily archiving and sharing trait data about the Australian flora. Open access to a comprehensive resource like this will generate significant new knowledge about the Australian flora across multiple scales of interest, as well as reduce duplication of effort in the compilation of plant trait data, particularly for research students and government agencies seeking to access information on traits. In coming years, AusTraits will continue to be expanded, with integrations into other biodiversity platforms and expansion of coverage into historically neglected plant lineages in trait science, such as pteridophytes (lycophytes and ferns). Further, through international initiatives, such as the Open Traits Network, linkages are being forged between plant datasets and a variety of other organismal databases 15.

Methods
Primary sources. AusTraits version 3.0.2 was assembled from 283 distinct sources, including published papers, field measurements, glasshouse and field experiments, botanical collections, and taxonomic treatments. Initially we identified a list of candidate traits of interest, then identified primary sources containing measurements for these traits, before contacting authors for access. As the compilation grew, we expanded the list of traits considered to include any measurable quantity that had been quantified for at least a moderate number of taxa (n > 20).
For a small subset of sources from herbaria, which provide text descriptions of taxa, we used regular expressions in R to extract measurements of traits from the text. A variety of expressions were developed to extract height, leaf/seed dimensions and growth form. Error checking was completed on approximately 60% of mined measurements by visually inspecting the extracted values relative to the textual descriptions.

Trait definitions. A full list of traits and their sources appears in Supplementary Table 1. The list of traits in AusTraits was developed gradually as new datasets were incorporated, drawing from original source publications and a published thesaurus of plant characteristics 19. We categorised traits based on the tissue where they are measured (bark, leaf, reproductive, root, stem, whole plant) and the type of measurement (allocation, life history, morphology, nutrient, physiological). Version 3.0.2 of AusTraits includes 358 numeric and 90 categorical traits.
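The text-mining step described under primary sources can be illustrated with a minimal sketch. The authors worked in R and their actual expressions are not reproduced in this article; the Python pattern below, the function name extract_height and the example phrases are all hypothetical, and only show how a height range might be pulled from a flora-style description and converted to metres.

```python
import re

# Hypothetical pattern (not the authors' expression): matches phrases such as
# "1-3 m high", "to 25 m tall" or "30 cm high".
HEIGHT_PATTERN = re.compile(
    r"(?:(?P<min>\d+(?:\.\d+)?)\s*[-–]\s*)?"   # optional lower bound and dash
    r"(?:to\s+)?"                              # optional "to"
    r"(?P<max>\d+(?:\.\d+)?)\s*"               # upper bound (or single value)
    r"(?P<unit>mm|cm|m)\s+(?:high|tall)",      # unit followed by high/tall
    re.IGNORECASE,
)

# Conversion factors to metres.
UNIT_TO_M = {"mm": 0.001, "cm": 0.01, "m": 1.0}


def extract_height(description):
    """Return the mined height range in metres, or None if nothing matches."""
    match = HEIGHT_PATTERN.search(description)
    if match is None:
        return None
    factor = UNIT_TO_M[match.group("unit").lower()]
    min_text = match.group("min")
    return {
        "min_m": float(min_text) * factor if min_text else None,
        "max_m": float(match.group("max")) * factor,
    }


if __name__ == "__main__":
    for text in ["Shrub 1-3 m high, leaves ovate.", "Tree to 25 m tall."]:
        print(text, "->", extract_height(text))
```

Patterns like this miss unusual phrasings and can misread numbers that refer to other organs, which is why manual checking of mined values against the source text, as reported above, remains necessary.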
Database structure. The schema of AusTraits broadly follows the principles of the established Observation and Measurement Ontology 18 in that, where available, trait data are connected to contextual information about the collection (e.g. location coordinates, light levels, whether data were collected in the field or lab) and information about the methods used to derive measurements (e.g. number of replicates, equipment used). The database

Fig. 1 The data curation pathway used to assemble the AusTraits database. Trait measurements are accessed from original data sources, including published floras and field campaigns. Features such as variable names, units and taxonomy are harmonised to a common standard. Versioned releases are distributed to users, allowing the dataset to be used and re-used in a reproducible way.
Harmonisation. To harmonise each source into the common AusTraits format we applied a reproducible and transparent workflow (Fig. 1), written in R 355, using custom code and the packages tidyverse 356, yaml 357, remake 358, knitr 359, and rmarkdown 360. In this workflow, we performed a series of operations, including reformatting data into a standardised format, generating observation ids for each set of linked measurements, transforming variable names into common terms, transforming data into common units, standardising terms (trait values) for categorical variables, encoding suitable metadata, and flagging data that did not pass quality checks. Details from each primary source were saved with minimal modification into two plain text files. The first file, data.csv, contains the actual trait data in comma-separated values format. The second file, metadata.yml, contains relevant metadata for the study, as well as options for mapping trait names and units onto standard types, and any substitutions applied to the data in processing. These two files provide all the information needed to compile each study into    Fig. 1, to incorporate new data and correct identified errors, leading to a high-quality, harmonised dataset. After importing a study, we generated a detailed report which summarised the study's metadata and compared the study's data values to those collected by other studies for the same traits. Data for continuous and categorical variables are presented in scatter plots and tables respectively. These reports allow first the AusTraits data curator, and then the data contributor, to rapidly scan the metadata to confirm it has been entered correctly, and to scan the trait data to ensure it has been assigned the correct units and that categorical trait values are properly aligned with AusTraits trait values.
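The core of the harmonisation step (mapping source columns onto standard trait names, converting units, and generating observation ids) can be sketched as follows. This is a minimal Python illustration rather than the authors' R workflow: the dictionary stands in for a parsed metadata.yml, and its layout, the column names, the standard units, the id format and the data values are all assumptions made for the example.

```python
import csv
import io

# Assumed standard units per trait (illustrative only).
STANDARD_UNITS = {"plant_height": "m", "leaf_length": "mm"}

# Multiplicative factors from a source unit to a standard unit.
CONVERSIONS = {("cm", "m"): 0.01, ("m", "m"): 1.0,
               ("cm", "mm"): 10.0, ("mm", "mm"): 1.0}

# Hypothetical stand-in for the parsed contents of one study's metadata.yml.
METADATA = {
    "dataset_id": "Falster_2005",
    "taxon_column": "species",
    "traits": {
        "height_cm": {"trait_name": "plant_height", "unit_in": "cm"},
        "leaf_len_cm": {"trait_name": "leaf_length", "unit_in": "cm"},
    },
}

# Hypothetical stand-in for the study's data.csv (invented values).
DATA_CSV = """species,height_cm,leaf_len_cm
Taxon one,250,4.2
Taxon two,,11
"""


def harmonise(data_csv, metadata):
    """Reshape one study into long format with standard names and units."""
    records = []
    reader = csv.DictReader(io.StringIO(data_csv))
    for row_number, row in enumerate(reader, start=1):
        # One observation id per source row links its measurements together.
        observation_id = f"{metadata['dataset_id']}_{row_number:03d}"
        for column, mapping in metadata["traits"].items():
            raw = row[column].strip()
            if not raw:
                continue  # skip missing measurements
            trait = mapping["trait_name"]
            unit = STANDARD_UNITS[trait]
            factor = CONVERSIONS[(mapping["unit_in"], unit)]
            records.append({
                "dataset_id": metadata["dataset_id"],
                "observation_id": observation_id,
                "original_name": row[metadata["taxon_column"]],
                "trait_name": trait,
                "value": float(raw) * factor,
                "unit": unit,
            })
    return records


if __name__ == "__main__":
    for record in harmonise(DATA_CSV, METADATA):
        print(record)
```

Keeping the mapping in a separate metadata structure, rather than editing the data file, means the raw data stay close to what the contributor supplied and every transformation remains inspectable.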

key | value
dataset_id | Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. 'Falster_2005'.
methods | A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from referenced source. Methods can include descriptions such as 'measured on botanical collections', 'data from the literature', or a detailed description of the field or lab methods used to collect the data.
year_collected_start | The year data collection commenced.
year_collected_end | The year data collection was completed.
description | A 1-2 sentence description of the purpose of the study.
collection_type | A field to indicate where the majority of plants on which traits were measured were collected: in the 'field', 'lab', 'glasshouse', 'botanical collection', or 'literature'. The latter should only be used when the data were sourced from the literature and the collection type is unknown.
sample_age_class | A field to indicate if the study was completed on 'adult' or 'juvenile' plants.
sampling_strategy | A written description of how study sites were selected and how study individuals were selected. When available, this information is copied verbatim from a published manuscript. For botanical collections, this field ideally indicates which records were 'sampled' to measure a specific trait.
source_primary_citation | Citation for primary source. This detail is generated from the primary source in the metadata.
source_primary_key | Citation key for primary source in 'sources'. The key is typically of format 'Surname_year'.
source_secondary_citation | Citations for secondary source. This detail is generated from the secondary source in the metadata.
source_secondary_key | Citation key for secondary source in 'sources'. The key is typically of format 'Surname_year'.

www.nature.com/scientificdata

Taxonomy. We developed a custom workflow to clean and standardise taxonomic names using the latest and most comprehensive taxonomic resources for the Australian flora: the Australian Plant Census (APC) 13 and the Australian Plant Name Index (APNI) 361. These resources document all known taxonomic names for Australian plants, including currently accepted names and synonyms. While several automated tools exist for updating taxonomy, such as taxize 362, these do not currently include up-to-date information for Australian taxa. Updates were completed in two steps. In the first step, we used both direct and then fuzzy matching (with up to 2 characters difference) to search for an alignment between reported names and those in three name sets: 1) All accepted taxa in the APC, 2) All known names in the APC, 3) All names in the APNI. Names were aligned without name authorities, as we found this information was rarely reported in the raw datasets provided to us. Second, we used the aligned name to update any outdated names to their current accepted name, using the information provided in the APC. If a name was recorded as being both an accepted name and an alternative (e.g. synonym) we preferred the accepted name, but also noted the alternative records. For phrase names, when a suitable match could not be found, we manually reviewed near matches via web portals such as the Atlas of Living Australia to find a suitable match. The final resource reports both the original and the updated taxon name alongside each trait record (Table 2), as well as an additional table summarising all taxonomic name changes (Table 6) and further information from the APC and APNI on all taxa included (Table 7). Any changes in taxonomy are exposed within the compiled dataset, enabling researchers to review these as needed.
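The two-step name alignment described under Taxonomy can be sketched in Python as below. This is an illustration, not the authors' R workflow. It assumes that "2 characters difference" means a Levenshtein edit distance of at most 2, and that direct matches are sought across all three name sets before any fuzzy matching is attempted; the name sets and synonym map are small toy examples, not extracts from the APC or APNI.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def align_name(reported, name_sets, max_distance=2):
    """Step 1: direct match first, then fuzzy match, in name-set order."""
    for label, names in name_sets:
        if reported in names:
            return {"aligned": reported, "source": label, "method": "direct"}
    for label, names in name_sets:
        if not names:
            continue
        best = min(sorted(names), key=lambda n: edit_distance(reported, n))
        if edit_distance(reported, best) <= max_distance:
            return {"aligned": best, "source": label, "method": "fuzzy"}
    return None  # left for manual review


def accepted_name(aligned, synonym_to_accepted):
    """Step 2: replace an outdated name with its currently accepted name."""
    return synonym_to_accepted.get(aligned, aligned)


# Toy name sets, ordered as in the text (illustrative only).
ACCEPTED = {"Eucalyptus saligna", "Corymbia maculata"}
KNOWN = ACCEPTED | {"Eucalyptus maculata"}
NAME_SETS = [("APC accepted taxa", ACCEPTED),
             ("APC known names", KNOWN),
             ("APNI names", KNOWN)]
SYNONYMS = {"Eucalyptus maculata": "Corymbia maculata"}

if __name__ == "__main__":
    for name in ["Eucalyptus salinga", "Eucalyptus maculata"]:
        hit = align_name(name, NAME_SETS)
        print(name, "->", hit, "->", accepted_name(hit["aligned"], SYNONYMS))
```

Returning the name set and matching method alongside the aligned name mirrors the paper's emphasis on exposing every taxonomic change so that users can review it.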

Data records
key | value
dataset_id | Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. 'Falster_2005'.
original_name | Name given to taxon in the original data supplied by the authors.
cleaned_name | Name of the taxon after implementing any changes encoded for this taxon in the metadata file for the corresponding 'dataset_id'.
taxonIDClean | Where it could be identified, the 'taxonID' of the 'cleaned_name' for this taxon in the APC.
taxonomicStatusClean | Taxonomic status of the taxon identified by 'taxonIDClean' in the APC.
alternativeTaxonomicStatusClean | The status of alternative records with the name 'cleaned_name' in the APC.
acceptedNameUsageID | ID of the accepted name for taxon in the APC or APNI.
taxon_name | Currently accepted name of taxon in the APC or in the APNI.

Access. Static versions of AusTraits, including version 3.0.2 used in this descriptor, are available via Zenodo 363. Data is released under a CC-BY license enabling reuse with attribution, being a citation of this descriptor and, where possible, original sources. Deposition within Zenodo helps make the dataset consistent with FAIR principles 364. As an evolving data product, successive versions of AusTraits are being released, containing updates and corrections. Versions are labeled using semantic versioning to indicate the change between versions 365. As validation (see Technical Validation, below) and data entry are ongoing, users are recommended to pull data from a specific release, to ensure results in their downstream analyses remain consistent as the database is updated.
The R package austraits (https://github.com/traitecoevo/austraits) provides easy access to data and examples on manipulating data (e.g. joining tables, subsetting) for those using this platform.
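For readers unfamiliar with semantic versioning, the short Python sketch below shows how two release labels of the form major.minor.patch can be compared to see which component changed. The function names are hypothetical, the version strings other than 3.0.2 are invented for illustration, and what each level of change means for AusTraits specifically is not stated here.

```python
def parse_version(text):
    """Split a 'major.minor.patch' label into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def change_type(old, new):
    """Name the most significant version component that differs."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v == old_v:
        return "none"
    if new_v < old_v:
        raise ValueError(f"{new} is older than {old}")
    for label, a, b in zip(("major", "minor", "patch"), old_v, new_v):
        if a != b:
            return label


if __name__ == "__main__":
    pinned = "3.0.2"  # the release described in this article
    for candidate in ["3.0.3", "3.1.0", "4.0.0"]:  # hypothetical later labels
        print(pinned, "->", candidate, ":", change_type(pinned, candidate))
```

Recording the exact release label used in an analysis is what makes results reproducible as later releases add data and corrections.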
Data coverage. The number of accepted vascular plant taxa in the APC (as of May 2020) is around 28,981 13. Version 3.0.2 of AusTraits includes at least one record for 26,852 taxa (~93% of known taxa). Five traits (leaf_length, leaf_width, plant_height, life_history, plant_growth_form) have records for more than 50% of known species (Fig. 2a). Across all traits, the median number of taxa with records is 62. Supplementary Table 1 shows the number of studies, taxa, and families with data in AusTraits, as well as the number of geo-referenced records, for each trait. Looking across traits and tissue categories, coverage declined gradually, with moderate coverage (>20%) for more than 50 traits (Fig. 2). Coverage for root, stem and bark traits declined much faster than trait measurements for other plant tissues (Fig. 2b).
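The headline coverage figure follows directly from the two taxon counts given above, as the short Python sketch below shows. The helper for counting traits above a coverage threshold uses invented per-trait counts purely to illustrate the calculation; the real counts are in Supplementary Table 1.

```python
APC_ACCEPTED_TAXA = 28_981   # accepted vascular plant taxa in the APC (from the text)
AUSTRAITS_TAXA = 26_852      # taxa with at least one record in version 3.0.2


def coverage_percent(taxa_with_records, total_taxa):
    """Percentage of known taxa that have at least one record."""
    return 100.0 * taxa_with_records / total_taxa


def traits_above(taxa_per_trait, total_taxa, threshold_percent):
    """Names of traits whose taxon coverage exceeds the threshold."""
    return sorted(
        trait for trait, n_taxa in taxa_per_trait.items()
        if coverage_percent(n_taxa, total_taxa) > threshold_percent
    )


if __name__ == "__main__":
    print(f"Overall coverage: {coverage_percent(AUSTRAITS_TAXA, APC_ACCEPTED_TAXA):.1f}%")
    # Hypothetical per-trait taxon counts, for illustration only.
    example_counts = {"trait_a": 20_000, "trait_b": 7_000, "trait_c": 62}
    print(traits_above(example_counts, APC_ACCEPTED_TAXA, 20.0))
```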
The most common records are non-geo-referenced records from floras; these are trait values representing a continental or regional mean (or spread) and hence are not linked to a location. Yet, geo-referenced records were available for several traits for more than 10% of the flora (Fig. 3a). Coverage is notably higher for geo-referenced measurements of some tissues and trait types, such as bark, stems and roots, relative to non-geo-referenced measurements (Fig. 3).
Trait records are spread across the climate space of Australia (Fig. 4a), as well as geographic locations (Fig. 4b). As with most data in Australia, the density of records was somewhat concentrated around cities or roads in remote regions.
Overall trait coverage across an estimated phylogenetic tree of Australian plant species is relatively unbiased (Fig. 5), though there are some notable exceptions. One exception is for root traits, where taxa within Poaceae have large amounts of information available relative to other plant families. A cluster of taxa within the family Myrtaceae which are largely from Western Australia have little leaf information available.
Comparing coverage in AusTraits to the global database TRY, there were 76 overlapping traits. Of these, AusTraits tended to contain records for more taxa, but not always; multiple traits had more than 10 times the number of taxa represented in AusTraits (Fig. 6). However, there were more records in TRY for 25 traits, in particular physiological leaf traits. Many traits were not overlapping between the two databases (Fig. 6). We noted that AusTraits includes more seed and fruit nutrient data, possibly reflecting the interest in Australia in understanding how fruit and seeds are provisioned in nutrient-depauperate environments. AusTraits also includes more categorical values, especially variables documenting different components of species' fire response strategies, reflecting the importance of fire in shaping Australian communities and the research to document the different strategies species have evolved to succeed in fire-prone environments.

Technical Validation
We implemented three strategies to maintain data quality. First, we conducted a detailed review of each source based on a bespoke report, showing all data and metadata, by both an AusTraits curator (primarily Wenk) and the original contributor (where possible). Measurements for each trait were plotted against all other values for the trait in AusTraits, allowing quick identification of outliers. Corrections suggested by contributors were combined back into AusTraits and made available with the next release. Version 3.0.2 of AusTraits, described here, is the sixth release.
Second, we implemented automated tests for each dataset, to confirm that values for continuous traits fall within the accepted range for the trait, and that values for categorical traits are on a list of allowed values. Data that did not pass these tests were moved to a separate spreadsheet ("excluded_data") that is also made available for use and review.
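The automated tests described above amount to a range check for continuous traits and a membership check for categorical traits, with failing records set aside rather than deleted. The Python sketch below illustrates that logic; it is not the authors' test suite, and the bounds, allowed values and records are hypothetical.

```python
# Hypothetical trait definitions: bounds and allowed values are illustrative.
TRAIT_DEFINITIONS = {
    "plant_height": {"type": "numeric", "min": 0.001, "max": 130.0},
    "life_history": {"type": "categorical", "allowed": {"annual", "perennial"}},
}


def passes_checks(record, definitions):
    """True if the record's value is valid for its trait definition."""
    definition = definitions.get(record["trait_name"])
    if definition is None:
        return False  # unknown trait name
    if definition["type"] == "numeric":
        try:
            value = float(record["value"])
        except (TypeError, ValueError):
            return False  # non-numeric value for a continuous trait
        return definition["min"] <= value <= definition["max"]
    return record["value"] in definition["allowed"]


def partition(records, definitions):
    """Split records into those kept and those moved to excluded_data."""
    kept, excluded = [], []
    for record in records:
        (kept if passes_checks(record, definitions) else excluded).append(record)
    return kept, excluded


if __name__ == "__main__":
    example = [
        {"trait_name": "plant_height", "value": "2.5"},
        {"trait_name": "plant_height", "value": "2500"},   # likely a unit error
        {"trait_name": "life_history", "value": "annual"},
        {"trait_name": "life_history", "value": "anual"},  # misspelt category
    ]
    kept, excluded = partition(example, TRAIT_DEFINITIONS)
    print(len(kept), "kept;", len(excluded), "excluded")
```

Keeping excluded records in a separate table, as the paper describes, lets contributors and users review borderline values instead of losing them silently.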
Third, we provide a pathway for user feedback.
AusTraits is an open-source community resource and we encourage engagement from users on maintaining the quality and usability of the dataset. As such, we welcome reporting of possible errors, as well as additions and edits to the online documentation for AusTraits that make using the existing data, or adding new data, easier for the community. Feedback can be posted as an issue directly at the project's GitHub page (http://traitecoevo.github.io/austraits.build).

Usage Notes
Each data release is available in two formats: first, as a compressed folder containing text files for each of the main components; and second, as a compressed R object, enabling easy loading into R for those using that platform.
Using the taxon names aligned with the APC, data can be queried against location data from the Atlas of Living Australia. To create the phylogenetic tree in Fig. 5, we pruned a master tree for all higher plants 366 using the package V.PhyloMaker 367 and visualised it via ggtree 368. To create Fig. 4a, we used the package plotbiomes 369 to create the baseline plot of biomes.