LCVP, The Leipzig catalogue of vascular plants, a new taxonomic reference list for all known vascular plants

The lack of comprehensive and standardized taxonomic reference information is an impediment to robust plant research, e.g. in systematics, biogeography or macroecology. Here we provide an updated and much improved reference list of 1,315,562 scientific names for all described vascular plant species globally. The Leipzig Catalogue of Vascular Plants (LCVP; version 1.0.3) contains 351,180 accepted species names (plus 6,160 natural hybrids) within 13,460 genera, 564 families and 84 orders. The LCVP (a) contains more information on the taxonomic status of global plant names than any other similar resource, and (b) significantly improves the reliability of our knowledge, for example by resolving the taxonomic status of ~181,000 names compared to The Plant List, to date the most commonly used plant name resource. We used ~4,500 publications, existing relevant databases and available studies on molecular phylogenetics to construct a robust reference backbone. For easy access and integration into automated data processing pipelines, we provide an R package (lcvplants) with the LCVP.


Background & Summary
Due to substantial progress in the last decade in improving plant taxonomy with phylogenetic findings, an updated global taxonomic reference list was urgently required. To date, the most commonly used reference list of vascular plant names is The Plant List (TPL, http://www.theplantlist.org/), hosted by the Royal Botanic Gardens, Kew. TPL contains 1,166,054 vascular plant names, including 308,397 accepted names, 304,419 of them angiosperms. ~760,000 names in TPL are synonyms, including 244,017 unresolved names. The Leipzig Catalogue of Vascular Plants (LCVP) presented here significantly updates the global knowledge of plant names, not only compared to TPL (see Table 1), and is thus a major improvement for global plant research. It is based on existing databases (see Online-only Table 1) and an additional 4,500 publications (see the full literature package consisting of three different files as part of the publicly available LCVP data set at https://idata.idiv.de/ddm/Data/ShowData/1806 and Step 2 below for more details), which helped to clarify the status of plant names (i.e. accepted, synonym, taxonomic placement; see Methods). In the end, 4,059 publications provided relevant and robust additional information, e.g. changes in names and/or their status. A guiding principle during the compilation of the LCVP was to avoid polyphyletic genera, which are frequent in TPL, either by splitting genera (e.g. separating Goeppertia from Calathea) or fusing them (e.g. Stapelia and Duvalia in Ceropegia). However, we did not recombine any species name in the LCVP, and in cases of unclear phylogenetic position of genera we used the conservative (i.e. existing) name.
Taxonomists, ecologists and conservation biologists often work with many species (names) and cannot keep pace with the rapid progress in (plant) systematics, boosted by molecular phylogenetic methods 1. These researchers often rely on taxonomic reference lists as tools to translate taxon names to accepted species names via accepted synonyms.
Comprehensive taxonomic lists, such as the LCVP 2, are essential to standardize names in databases compiled from various sources, relying on a robust 'translation' of species names into one scheme. The TRY database of functional plant traits (TRY 3; www.try-db.org) is one of the most prominent examples, containing trait information for about 150,000 vascular plant species. Other global databases using plant name reference lists focus on plant co-occurrence patterns, such as sPlot, containing about 1.1 million vegetation surveys (4: ~55,000 species), or use any plant species occurrence information, such as the Global Biodiversity Information Facility (~315,000 vascular plant species; www.gbif.org) or the Botanical Information and Ecology Network (BIEN 5: ~348,000). The Global Inventory of Floras and Traits (GIFT 6: ~268,000; http://gift.uni-goettingen.de/home) and the inventory of the Global Naturalized Alien Flora (GloNAF 7: ~14,000; glonaf.org) focus on plant distribution information from regional floras or floristic inventories.
Generally, such databases were compiled from heterogeneous data sources varying in time of publication and place of origin. The underlying sources may be primary or secondary literature, written by scientists with excellent to no plant taxonomic background, thus combining data with various degrees of complexity and uncertainty. The merging of these databases works via species identities and thus depends on the use of accepted species names. These databases typically tap phylogenetic information contained in taxonomic reference lists via available tools supporting automated matching and error checking (i.e. taxon scrubbing). There is a variety of R packages (e.g. taxonstand 8; taxize 9; RBIEN 10) and online tools (e.g. Global Name Resolver http://resolver.globalnames.org/ or the Taxonomic Name Resolution Service 11 http://tnrs.iplantcollaborative.org/TNRSapp.html) that help researchers check their taxonomic information (see 12 for a review of some of those tools). However, most of these tools rely on TPL as a reference list, which has not been updated for almost a decade and originated at a time when phylogenetic information on many genera did not exist.
Global taxonomic name databases are useful in their own right, jointly create synergies that have transformed ecology into a synthetic and global science, and can help identify knowledge gaps 13. For example, functional biogeography combines information on community composition, plant species distribution and functional traits of the component species to make inferences on determinants of global trait distribution 14. While there is high potential for exciting research using up-to-date taxonomic information, it can only be as good as the input data and the ability of the user to understand the advantages and shortcomings of the data coming from those resources. For example, missing taxonomic background often leads to neglecting the importance of citing authors of names and inevitably leads to inconsistencies when data from different sources are matched. LCVP 2 shows that matching plant taxonomic names without author names could produce up to 10% mismatches (i.e. ~10% of all LCVP plant taxon names are identical but ultimately refer to different accepted plant taxa).
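The effect of dropping author names can be illustrated with a minimal R sketch. The table below is a toy example with invented placeholder names (not LCVP records), and the column names are assumptions; it only shows how identical author-less names that point to different accepted taxa could be counted.

```r
# Toy lookup table; all names are invented placeholders, not LCVP records.
toy <- data.frame(
  name     = c("Aus bus", "Aus bus", "Aus cus", "Dus eus"),
  author   = c("L.", "Mill.", "L.", "Sm."),
  accepted = c("Aus bus L.", "Xus bus (Mill.) Sm.", "Aus cus L.", "Dus eus Sm."),
  stringsAsFactors = FALSE
)

# Number of distinct accepted names reachable from each author-less name.
n_targets <- tapply(toy$accepted, toy$name, function(x) length(unique(x)))

# Names that become ambiguous once the author is dropped.
ambiguous <- names(n_targets)[as.vector(n_targets) > 1]

# Share of table rows whose author-less name is ambiguous.
share_ambiguous <- mean(toy$name %in% ambiguous)
share_ambiguous  # 0.5 for this toy table
```

In this toy table, "Aus bus" matches two different accepted names, so a match without the author cannot be resolved unambiguously.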

Methods
The creation of the LCVP involved three major steps. (1) We did a thorough search of available and relevant plant taxonomic databases (Online-only Table 1) to produce a raw data table of names. (2) Based on the relevant scientific evidence in the literature, we decided for each name whether it is accepted, synonymous or unresolved in the LCVP (see Step 2: Decision making for more details). Additionally, we harmonized and corrected taxonomic names orthographically. (3) We implemented the LCVP in an R package (LCVP), which is accessible under an MIT license from GitHub (https://github.com/idiv-biodiversity/LCVP) and will ensure a coherent versioning of the list and future updates. Furthermore, we provide a utility function to use LCVP for taxonomic name resolution (lcvplants), which is also available under the same license from GitHub (https://github.com/idiv-biodiversity/lcvplants).
Step 1: Producing the raw data table. TPL provided the core of the raw data table for published vascular plant names, primarily supplemented by the International Plant Names Index (IPNI, https://www.ipni.org/). IPNI provides a list of published names and their source, but does not provide any information on accepted or synonymous names. We used additional major and minor databases (see Online-only Table 1 for the databases used). All additional names and potential synonyms found in those databases were incorporated in the raw data table.
Step 2: Decision making. The raw data table, with more than two million entries of plant taxon names, contained a high number of orthographic errors, inconsistencies and contradictory opinions concerning the status of the names. A rough guideline for the acceptance of names was a subjective assignment of quality and reliability to the source. Generally, changes were only applied when the authors of the respective publications clearly suggested those changes. For conflicting information, we usually ascribed a higher reliability rank to the most recent publications. Additionally, we usually preferred publications with (a) a more thorough literature section and (b) a more comprehensive synonymy history over those without. A complete synonymy history should include and properly cite not only the latest accepted taxon, but also the taxonomic history of all names connected to this taxon (e.g. if it is a recombined taxon) with all homonymic (i.e. species epitheton is the same) and heteronymic (i.e. genus name is the same) synonyms. Since phylogenies based on morphological data alone are prone to homoplasy, only phylogenetic studies that also based their taxonomic decisions on molecular data were taken into account. We did not create new species name combinations. In case of conflicting evidence on the phylogenetic placement or species name, due to e.g. different methods to build phylogenetic trees, species names were marked "comb. ined." following the basionym author.
The following examples illustrate how we treated name changes. The genera Dracaena and Sansevieria are closely related 15: Sansevieria seems to be clearly nested within Dracaena, but the differences between both genera are continuous. Lu et al. 15 separated the Hawaiian species of Dracaena into a new genus, Chrysodracon, but did not recombine Sansevieria with Dracaena. The argumentation and data presented in 15 were thorough and comprehensive, and thus we accepted the authors' arguments, kept Sansevieria and Dracaena as distinct genera and separated the Hawaiian species of Dracaena into the new genus Chrysodracon. In another case, Borchsenius et al. 16 showed that Calathea in the traditional description was polyphyletic. In order to keep Ischnosiphon and Monotagma as distinct genera, being the sister clade to a smaller Calathea clade including the type species, the larger clade of Calathea was put into the then resurrected genus Goeppertia. The argumentation and presentation in 16 were robustly based on a molecular phylogeny producing well supported clades. As a consequence, we accepted the recombination of the much larger clade as suggested in 16.
We also applied changes to the spelling of species names. Generally, we recommend checking the species names prior to automated list treatments, following the guidelines given in 17 and the rules of the current version of the International Code of Nomenclature for algae, fungi, and plants (Shenzhen Code 18). We followed the Shenzhen Code using standardized orthography of epitheta across genera and families, e.g. warscewiczii (neither warscewitzii nor warszewiczii). Only upper-case letters from 'A' to 'Z', lower-case letters from 'a' to 'z' and the hyphen '-' should be used in scientific names; special characters are not valid and are to be avoided (Isoëtes -> Isoetes, Köberlinia -> Koeberlinia). Authors were given in their short form as provided by IPNI. For further standardization and easier use in automated workflows, we omitted spaces within author names (C. ). This refers to the recommendation of the Shenzhen Code, Art. 46 c. We tried to include only natural hybrids (i.e. no cultivars; based on expert judgement of LCVP authors) in the LCVP. Since hybrids were not the focus of the LCVP, we only marked them with '_x', either following the genus name or the epitheton, to recognize them as such, but we did not give any parent taxa information.
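The spelling conventions above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the procedure used to build the LCVP: the helper names (to_ascii, mark_hybrid, is_valid_part) are hypothetical, the transliteration table covers only the two examples given in the text, and the conversion of the Unicode multiplication sign (U+00D7) into the '_x' suffix is our assumption of how such names could be brought into the described format.

```r
# Transliteration table (an assumption; covers only the examples in the text).
from <- c("\u00eb", "\u00f6")   # e with diaeresis, o with umlaut
to   <- c("e", "oe")

# Replace each special character by its ASCII transliteration.
to_ascii <- function(x) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Move a hybrid sign written as U+00D7 to the '_x' suffix convention.
mark_hybrid <- function(x) {
  # Sign before the genus: the genus receives the suffix.
  x <- sub("^\u00d7\\s*(\\S+)", "\\1_x", x, perl = TRUE)
  # Sign before the epitheton: the epitheton receives the suffix.
  x <- sub("^(\\S+)\\s+\u00d7\\s*(\\S+)", "\\1 \\2_x", x, perl = TRUE)
  x
}

# TRUE if a single name part uses only A-Z, a-z and '-',
# optionally followed by the hybrid marker '_x'.
is_valid_part <- function(x) grepl("^[A-Za-z-]+(_x)?$", x, perl = TRUE)

to_ascii("K\u00f6berlinia")                # "Koeberlinia"
mark_hybrid("Lycopodium \u00d7 habereri")  # "Lycopodium habereri_x"
is_valid_part("Iso\u00ebtes")              # FALSE
```

An explicit table is used instead of a generic transliteration routine because the two examples follow different rules (ë becomes e, whereas ö becomes oe).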
In most cases, we adopted the names used by the taxonomic expert (i.e. reference author who is usually a person with a publication record within a certain taxonomic group). However, there are many taxa belonging to genera or species which have not been phylogenetically analyzed yet. For those, we adopted the most frequently used taxon name from the recent literature. Despite a major effort, there are still names which we could not resolve.
As part of the LCVP data package we also provide, at https://idata.idiv.de/ddm/Data/ShowData/1806, three different files related to the literature that we used to decide upon species names in the LCVP. We provide a complete bibliography (as a .bib file and as a full-text PDF) of all ~4,500 literature references ordered by plant families. We focused on literature published from 1994 onwards, when molecular phylogenies became widespread 19,20. The third file is a table directly matching >104,000 individual taxa and literature, used to inform the applied name changes for the respective taxa.
Step 3: Implementation in R. Besides providing LCVP as a downloadable text table 2 with this article, we also provide LCVP as an R package for easy integration into analysis pipelines. Due to the large size of the data we provide a pure data package, LCVP, and a separate tool package, lcvplants, with a fuzzy matching algorithm for taxonomic name resolution. Both can be downloaded and installed via GitHub. The LCVP data package contains only three files: the dataset of plant names and their taxonomic status, a package of the literature references used to compile the list (consisting of three files) and a metadata description file. The lcvplants package contains one user-level function to perform fast fuzzy matching for taxonomic name resolution using the LCVP data 2. This taxonomic name resolution is implemented in a user-friendly way and can be done with a few lines of code (see https://idiv-biodiversity.github.io/lcvplants/articles/taxonomic_resolution_using_lcplants.html for a tutorial):
```r
# install LCVP and lcvplants from GitHub
install.packages("devtools")
library(devtools)
devtools::install_github("idiv-biodiversity/LCVP")
devtools::install_github("idiv-biodiversity/lcvplants")

# load the package
library(lcvplants)

# run analyses
LCVP("Hibiscus vitifolius")
```
Input data. For taxonomic name resolution an individual name or a vector of names can be provided. There are no limits on the number of names submitted at a time, but we recommend submitting fewer than 5000 names at a time to ensure a reasonable computation time. For the input data, following the International Code of Nomenclature for algae, fungi, and plants (Shenzhen Code: https://www.iapt-taxon.org/nomen/main.php), genus, epithet, infraspecies rank, infraspecies name and authorities need to be separated by spaces (e.g. Draba mollissima var. kusnezowii N.Busch). Special characters (such as ü, á, ø, etc.) are only allowed for the authority names. Infraspecific names have to be preceded by their rank (e.g. "subsp.", "var.", "forma", "ssp.", "f.", "subvar.", "subf."). The genus name and the epitheton need to be provided; the infraspecific ranks and authority names are optional but can improve the results. If the genus or the epitheton is composed of two words, they have to be separated by a hyphen (e.g. Hibiscus rosa-sinensis L.). Hybrid names use the characters '_x' at the end of the genus or epithet name (e.g. Spartocytisus_x filipes Webb & Berthel., Lycopodium habereri_x House); annotations in other formats such as 'x' or 'x_' before the names are automatically changed into the required format. The commonly used special Unicode character '×' (U+00D7) for indicating hybrids is not accepted (e.g. Crassocephalum × picridifolium).
Fuzzy matching. The lcvplants package performs a string comparison between the user-submitted names and LCVP using a fuzzy matching algorithm to resolve orthographic errors. The fuzzy matching algorithm can be applied to the genus name, the epitheton, the infraspecific names and the authority (see Online-only Table 2 for a description of the options for customization), and runs in the following order: (1) Submitted name standardization. The submitted name is split into parts using a space as delimiter: the genus (first word) and the epitheton (second word). If there are more than three words in the submitted name and the third word is any of "subsp.", "var.", "forma", "ssp.", "f.", "subvar." or "subf.", the fourth word will be recognized as the infraspecies name. Otherwise all the words after the epitheton will be recognized as the authority. (2) Genus resolution with a user-specified threshold of allowed mismatches (i.e. the number of letters that can disagree between submitted and matched name). (3) Epitheton resolution. If a match for the submitted genus name is found, a similar matching will be done to find the correct epitheton. (4) Infraspecific name and authority resolution. If genus and epitheton resolution were successful, the fuzzy matching will also be applied to infraspecific names and authority names (if supplied). (5) The results for all submitted names will be combined into the output table, returned by the function and printed to the screen.
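Steps (1) to (3) can be sketched in base R as follows. This is a simplified illustration and not the lcvplants implementation: the helper names (parse_name, fuzzy_pick, resolve) are hypothetical, the reference table holds only three names quoted in this article, and the number of disagreeing letters is approximated by the edit distance from utils::adist, which is an assumption about how mismatches are counted.

```r
# Infraspecific rank markers listed in the text.
ranks <- c("subsp.", "var.", "forma", "ssp.", "f.", "subvar.", "subf.")

# Step 1: split a submitted name into its parts using whitespace.
parse_name <- function(name) {
  w <- strsplit(trimws(name), "\\s+")[[1]]
  has_infra <- length(w) > 3 && w[3] %in% ranks
  rest <- if (has_infra) w[-(1:4)] else w[-(1:2)]
  list(
    genus     = w[1],
    epithet   = w[2],
    rank      = if (has_infra) w[3] else NA_character_,
    infra     = if (has_infra) w[4] else NA_character_,
    authority = paste(rest, collapse = " ")
  )
}

# Closest candidate within `max_dist` edits, or NA if there is none.
fuzzy_pick <- function(query, candidates, max_dist = 1L) {
  if (is.na(query) || length(candidates) == 0L) return(NA_character_)
  d <- as.vector(utils::adist(query, candidates))
  if (min(d) > max_dist) return(NA_character_)
  candidates[which.min(d)]
}

# Tiny reference table built from names quoted in the text (not the full LCVP).
ref <- data.frame(
  genus   = c("Hibiscus", "Hibiscus", "Draba"),
  epithet = c("vitifolius", "rosa-sinensis", "mollissima"),
  stringsAsFactors = FALSE
)

# Steps 2 and 3: resolve the genus first, then the epitheton within that genus.
resolve <- function(name, max_dist = 1L) {
  p <- parse_name(name)
  genus <- fuzzy_pick(p$genus, unique(ref$genus), max_dist)
  if (is.na(genus)) return(NA_character_)
  epithet <- fuzzy_pick(p$epithet, ref$epithet[ref$genus == genus], max_dist)
  if (is.na(epithet)) return(NA_character_)
  paste(genus, epithet)
}

resolve("Hibiscuss vitifolius")  # "Hibiscus vitifolius"
resolve("Zzzzz vitifolius")      # NA: no genus within one edit
```

Restricting the epitheton search to the matched genus keeps the candidate set small, which is one plausible reason for resolving the name parts in this order.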
Output data. The output is a data.frame of the submitted and matched taxon names with additional information on the taxonomic status. If the option 'save' is activated (Save = TRUE), the output will additionally be saved in a comma-separated file (.csv) in the working directory or the path specified with the 'out_path' option. If a name could not be resolved in the LCVP, the respective row in the output data.frame is empty except for the 'Submitted_Name' and the 'Score' fields; the latter gives detailed information on which parts of the name could not be matched. See Online-only Table 3 for a description of the output fields.
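Because the output is a data.frame and batches of fewer than 5000 names are recommended, long name lists could be processed in batches and the results combined. The following is a minimal sketch under stated assumptions: resolve_in_batches and mock_resolver are hypothetical helpers, and the mock resolver merely stands in for any function that returns one data.frame per vector of names.

```r
# Run a resolver over a long vector of names in batches and row-bind
# the per-batch data.frames into a single output table.
resolve_in_batches <- function(names, resolver, batch_size = 4999L) {
  batches <- split(names, ceiling(seq_along(names) / batch_size))
  do.call(rbind, lapply(unname(batches), resolver))
}

# Stand-in resolver used only to make the sketch runnable; it returns
# just the 'Submitted_Name' column mentioned in the text.
mock_resolver <- function(x) {
  data.frame(Submitted_Name = x, stringsAsFactors = FALSE)
}

out <- resolve_in_batches(rep("Hibiscus vitifolius", 10000), mock_resolver)
nrow(out)  # 10000
```

In practice, the name-resolution function of lcvplants would take the place of the mock resolver.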

Technical Validation
We tested whether all synonyms lead to an accepted name or another synonym. One major issue with TPL is the high number of unresolved names. There, a link from a synonym sometimes points to another synonym, leading to unresolved loops. LCVP only links to accepted names, not to the taxonomic predecessor. If taxon A is a synonym of taxon B and it turns out that taxon B is a synonym of taxon C, the accepted name given for taxon A is taxon C, not B. We treated invalid names as synonyms and assigned them to their appropriate accepted name.
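The rule that every synonym links directly to its accepted name can be checked by following synonym links to their end point. The sketch below is a toy illustration, not the validation code used for the LCVP: the table uses the placeholder taxa A, B and C from the text plus a hypothetical loop (D and E), and accepted names are encoded as pointing to themselves, which is our assumption.

```r
# Toy synonym table: each name points to the name it is a synonym of;
# accepted names point to themselves.
links <- c(A = "B", B = "C", C = "C", D = "E", E = "D")

# Follow links until an accepted (self-referencing) name is reached.
# Returns NA if the chain loops or leaves the table.
resolve_accepted <- function(name, links) {
  seen <- character(0)
  while (!(name %in% seen)) {
    if (!(name %in% names(links))) return(NA_character_)
    target <- links[[name]]
    if (identical(target, name)) return(name)
    seen <- c(seen, name)
    name <- target
  }
  NA_character_  # loop detected, e.g. D -> E -> D
}

accepted <- vapply(names(links), resolve_accepted, character(1), links = links)
accepted[["A"]]  # "C": A is linked to the accepted name, not to B
```

Names for which the result is NA (here D and E) form an unresolved loop of the kind described above.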
Most of the still unresolved species names in LCVP were originally published in the 19th century. There is a high probability that the majority of them are synonyms, e.g. because of historic transfer errors from one publication to the other. An extraordinarily high number of unresolved names can be found in Asteraceae.
Comparison to TPL. Due to the improved name resolution and increased name information in general in LCVP compared to TPL, any workflow including taxonomic harmonization of plant names will very likely yield more robust and reliable results for e.g. species richness patterns and matches between different data sources. For an easier comparison between LCVP and TPL, LCVP includes information on whether taxon name entries are identical, differ in the cross-reference to a synonym, differ only orthographically either in the name or the author, or whether a name is new in the LCVP and not present in TPL. This unique information makes it possible for the users of TPL to update their names according to the LCVP, because all differences are clearly stated in the column 'status' of the LCVP. Kew Gardens' research effort to standardize plant names has recently focused on their new flagship program, Plants of the World Online (POWO, http://www.plantsoftheworldonline.org/), which includes a new taxonomic reference backbone (Alan Paton from Kew Gardens, pers. comm. July 2019). Given that this is becoming the successor of TPL (see http://www.plantsoftheworldonline.org/about), we also compared the available POWO list with LCVP (POWO access date: November 2018; directly provided by Kew). This POWO version contains ~335,000 accepted species names and ~458,000 names of vascular plants marked as synonyms, so LCVP also contains significantly more species name information than POWO (this comparison includes only vascular plants and excludes infraspecific taxa, since LCVP covers only vascular plants and this POWO version does not include taxa below species level).
TPL and the tested POWO version cover all plants, LCVP only vascular plants. With the information currently available to us, LCVP contains more information about vascular plant names (e.g. more resolved names, more accepted species, more synonyms) than TPL and POWO. A user is more likely to resolve a given vascular plant name with LCVP than with the given versions of TPL and POWO. Any future updated versions of LCVP and POWO will change these numbers and might strengthen different purposes of use for each reference list, and could ideally lead to a harmonized global backbone. LCVP also covers infraspecific names, which are not covered in the tested POWO version. The information in LCVP on which genus a species belongs to, and thus which accepted name should be used, is based on taxonomic, but also on the most recent phylogenetic (i.e. mainly genetic) information. TPL has not been updated for many years and is mainly based on taxonomic information (i.e. not molecular phylogenies). With respect to the usability of LCVP, we see advantages compared to the POWO version we tested, which to our knowledge offers neither an R package nor any other functionality for (semi-)automatic name checking or fuzzy name matching.

Code availability
The LCVP generally consists of (1) the LCVP itself, available as an R data package (version 1.0.3 as of July 2020) and as a tab-delimited text file, and (2) the R package lcvplants. The LCVP version 1.0.3 is available in both Microsoft Excel and text formats in the iDiv data portal (https://idata.idiv.de/ddm/Data/ShowData/1806; https://doi.org/10.25829/idiv.1806-40-3009). A developmental version of the LCVP and the lcvplants package are publicly available via GitHub (https://github.com/idiv-biodiversity/lcvplants). We will continuously update the LCVP and plan to release a new version every two to three years. We plan to closely collaborate with plant synonymy services and tools such as BIEN, GNR and the R packages taxonstand and taxize to include LCVP as a reference option. Requests for integrating LCVP can be made via the project's GitHub (https://github.com/idiv-biodiversity/LCVP/issues).