Replacing Sanger with Next Generation Sequencing to improve coverage and quality of reference DNA barcodes for plants

We estimate the global BOLD Systems database holds core DNA barcodes (rbcL + matK) for about 15% of land plant species and that comprehensive species coverage is still many decades away. Interim performance of the resource is compromised by variable sequence overlap and modest information content within each barcode. Our model predicts that the proportion of species-unique barcodes reduces as the database grows and that ‘false’ species-unique barcodes remain >5% until the database is almost complete. We conclude the current rbcL + matK barcode is unfit for purpose. Genome skimming and supplementary barcodes could improve diagnostic power but would slow new barcode acquisition. We therefore present two novel Next Generation Sequencing protocols (with freeware) capable of accurate, massively parallel de novo assembly of high quality DNA barcodes of >1400 bp. We explore how these capabilities could enhance species diagnosis in the coming decades.


Supplementary Fig S3. Influence of Ecological Study Set (ESS) size on barcode diagnostic performance.
A model constructs a comprehensive reference barcode (CRB) resource from 721 whole chloroplast genomes and associated variants derived from the BOLD Systems 3 database to represent all barcodes in a hypothetical region. The model then takes subsets of barcodes from the CRB to represent incomplete coverage of barcodes in the reference database (Interim Reference Barcode, IRB) and also a random set of unknown samples from the CRB to represent Ecological Study Sets of varying size. The panels below indicate the various outcomes: perceived identifications (red); ambiguous (blue); unknowns (black); true (pink); false (orange) when identification of samples taken from ESS sets of varying size (long dashed line = 50 species; square dashed line = 100 species; fine dashed line = 150 species; solid line = 200 species) is attempted by exact matches to IRBs of varying size (x-axis) for: a) rbcL, b) matK and c) rbcL + matK. 1100) were generated using a random number generator (www.random.org). Non-linear regression of the resultant data was performed using GraphPad Prism V8.00 (La Jolla). To evaluate the effect of filtering short reads prior to alignment, we removed sequences <300bp (both loci) and <500bp (rbcL) or <600bp (matK) from the original data set and repeated the experiment.

Supplementary Methods S2
To model the importance of incomplete species coverage in the reference database, we first constructed an artificial Comprehensive Reference Barcode (CRB) database to represent the entire land plant flora of a hypothetical region. Here, we recovered full-gene sequences for rbcL and matK from 735 chloroplast genomes from the National Center for Biotechnology Information database (ncbi.nlm.nih.gov) of which 721 species contained both genes (Supplementary Table S3 online). These 721 rbcL and matK sequences were trimmed in Geneious 8.1 to match the nominal barcode regions internal to primers rbcLa_f 46 and rbcLa_rev 47 and matK xF 28 and MALPR1 40 . We simulated intraspecific variation by including all barcodes retrieved from BOLD Systems for these same species. In addition, we replaced ambiguous (N) and missing bases using the fulllength sequences retrieved from the chloroplast genomes (Supplementary Table S3 online). Thus, all barcodes in our database were full-length.
To simulate the process of building a reference barcode database from scratch we created 200 separate Incomplete Reference Barcode (IRB) sets from the CRB, with each set containing the same number of species (N s ) in the range 70-721 species. This was achieved using the Python code IRB Assembly. IRB Assembly (available at http://bit.ly/next-gen-barcode) contains a 'Regional CRB dataset' representing the CRB. This dataset is encoded as a map, where the key is represented by the species/genus name, and the value is a barcode. If the key is not unique, i.e. a species (or genus) appears more than once in the CRB the value is then an array of barcodes. To generate IRBs from the Regional CRB dataset we first generate the IRB_Map, a dataset similar in structure to Regional_CRB. For each pre-determined IRB size s, a number of s keys are randomly selected from the Regional_CRB dataset, and are added to IRB_Map. CRBs and IRBs were assembled for rbcL, matK and the concatenated barcode rbcL + matK .
We entered these CRBs and IRBs into a model (IRB size) to characterize changes to the diagnostic properties of rbcL and matK as the IRB expands. The model (described below) identified barcodes as species-unique (true) when they possessed only one species name in both the IRB and CRB. However, if the barcode sequence was attached to two or more names in the IRB (also in the CRB), it was deemed ambiguous. Cases where discordance occurred between IRB and CRB assignments (species-unique barcodes in the IRB but not the CRB) were deemed false (negatives). The IRB Size model plotted the frequency of all diagnostic outcomes as the IRBs expanded from 70 to 721 species. This process was repeated with the species names replaced by the generic epithet alone to provide plots for genus-level diagnosis. The Python code IRB Size (available at http://bit.ly/next-gen-barcode) employs the Regional_CRB and IRB_Map, for species and genus respectively.
The code goes through each key in IRB_Map and, if it has associated more than one value, that key is deemed ambiguous. If the key has associated one value in IRB_Map, but more than one value in Regional_CRB, then it is deemed false (negative). The code calculates the frequencies of these assignments as the number of keys in IRB_Map increases.

Supplementary Methods S3
Real-world identification using DNA barcodes requires comparison between unknown specimens and reference barcodes.

Supplementary Table S2. Species pairs used to retrieve full-length barcodes from chloroplast genomes and barcode databases.
List of 101 generic species pairs for which sequence data of matK and rbcL loci were downloaded from complete chloroplast genomes from NCBI, and from barcoding entries of BOLD.