A reference library for Canadian invertebrates with 1.5 million barcodes, voucher specimens, and DNA samples

The reliable taxonomic identification of organisms through DNA sequence data requires a well-parameterized library of curated reference sequences. However, it is estimated that just 15% of described animal species are represented in public sequence repositories. To begin to address this deficiency, we provide DNA barcodes for 1,500,003 animal specimens collected from 23 terrestrial and aquatic ecozones at sites across Canada, a nation that comprises 7% of the planet’s land surface. In total, 14 phyla, 43 classes, 163 orders, 1123 families, 6186 genera, and 64,264 Barcode Index Numbers (BINs; a proxy for species) are represented. Species-level taxonomy was available for 38% of the specimens, but higher proportions were assigned to a genus (69.5%) and a family (99.9%). Voucher specimens and DNA extracts are archived at the Centre for Biodiversity Genomics where they are available for further research. The corresponding sequence and taxonomic data can be accessed through the Barcode of Life Data System, GenBank, the Global Biodiversity Information Facility, and the Global Genome Biodiversity Network Data Portal.

by weather conditions and, for remote locations, by accessibility of the traps for servicing (see Supplementary File 1). Malaise traps were typically serviced weekly by Centre for Biodiversity Genomics (CBG) or Parks Canada staff by replacement of the sampling bottle with a bottle containing fresh preservative.
Additional specimens in the 'National Parks' subset (N = 300,828) were collected as part of the CBG's Standardized Sampling Program. It ran from 2012 to 2014 in 23 National Parks, National Park Reserves, and National Urban Parks and employed a standard set of sampling methods that targeted a wider diversity of invertebrate fauna than Malaise traps 40. In each Standardized Sampling (SS) locality, three representative sites were chosen based on a variety of biotic and abiotic factors, such as habitat type, vegetation, and elevation. The protocol was implemented over a one-week period and involved the deployment of a standard array of traps at each site: 1-2 Malaise traps, 1 flight-intercept trap, 10 pan traps, and 10-20 pitfall traps. In addition, 1-3 substrate samples were taken for Berlese funnel extraction, and a total of 60 min of sweep netting was performed over the week.
The remaining 56,832 (5.7%) specimens were obtained through opportunistic collecting in terrestrial, freshwater, and marine habitats using UV lights, dip nets, plankton nets, sieves, aspirators, mustard extraction, and freehand collecting. [...] 48; and bioblitz 43 [...] A summary of the collection method(s) used for each program is provided in Supplementary File 2. The collection method for each specimen is included in the data resources, as well as information on trap type, weather conditions, habitat, and any deviations from the normal collection protocol when this information is available. These programs covered 39 protected areas, including provincial parks, municipal parks, conservation reserves, ecological reserves, research reserves, and Nature Conservancy of Canada properties. Adding the collections in 43 national parks, a total of 1,132,347 occurrence records were derived from 82 protected areas across Canada.
www.nature.com/scientificdata
Specimen processing and DNA barcode analysis. The CBG has an efficient workflow for collecting, sorting, processing, and DNA barcoding specimens for reference library construction. As detailed protocols are outlined in other publications 49,50, only summary details are provided here (Fig. 1).
Following collection, and prior to sorting, bulk samples and specimens were stored in −20 °C freezers, remaining in or transferred to 95% ethanol. All specimens from a trap sample or collection event were prepared for DNA barcoding, except in those cases where initial inspection suggested the presence of a very large number of specimens of a particular species. In these cases, 5 to 95 representatives of each morphospecies were prepared for sequence analysis and excess specimens were retained in ethanol at −20 °C. Larger specimens were pinned and one leg was removed for DNA extraction; smaller specimens were placed directly into 95% ethanol in either a) a sample tube rack, where a leg was later tissue-sampled for DNA extraction or b) a microplate, where the entire specimen was used for DNA extraction with an added step of recovering exoskeletal remains after non-destructive lysis and DNA extraction ('voucher recovery' 51 ).
Subsequent barcode analysis was performed following standard methods 50,52; the stages include tissue lysis, DNA extraction, PCR amplification of the 658 base pair (bp) fragment of the cytochrome c oxidase subunit I (COI) gene, cycle sequencing, and subsequent Sanger sequence analysis. The resultant sequences, electropherograms, and primer details for all specimens were uploaded to BOLD.

Barcode Index Numbers.
For all sequences uploaded to BOLD, the records were assigned to operational taxonomic units called Barcode Index Numbers (BINs) by the Refined Single Linkage (RESL) algorithm implemented on BOLD 53. Individual records are either assigned to an existing BIN or found a new BIN, but they only enter the RESL analysis if they meet the following criteria: greater than 300 bp coverage of the barcode region, less than 1% ambiguous bases, and no stop codon or contamination of the sequence. For inclusion into an existing BIN, sequence records must include >300 bp of the barcode region (between positions 70 and 700 of the BOLD alignment), while records that establish a new BIN must include >500 bp of the barcode region. The RESL algorithm runs monthly on all qualifying barcode sequences in BOLD, which contained 7.7 million animal specimen records and 0.66 million BINs as of November 2019. BIN designations and assignments generated by RESL on BOLD are accessible for independent validation through the 'BIN pages' that aggregate the specimen and sequence information of each BIN's members (e.g. the eastern yellowjacket wasp, Vespula maculifrons (Buysson): https://doi.org/10.5883/BOLD:AAD5593).
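The eligibility criteria for RESL can be summarized in a short sketch. This is an illustration only, not the BOLD implementation: the function names are hypothetical, and the sketch simply counts all non-gap bases rather than measuring coverage between positions 70 and 700 of the BOLD alignment.

```python
# Illustrative sketch of the RESL eligibility rules described in the text.
# Assumption: sequence length is the count of non-gap characters, not true
# coverage of positions 70-700 in the BOLD alignment.
AMBIGUOUS = set("NRYKMSWBDHV")  # IUPAC ambiguity codes


def barcode_stats(seq):
    """Return (number of bases, fraction of ambiguous bases), ignoring gaps."""
    bases = [b for b in seq.upper() if b not in "-."]
    n = len(bases)
    ambiguous = sum(b in AMBIGUOUS for b in bases)
    return n, (ambiguous / n if n else 1.0)


def resl_eligibility(seq, has_stop_codon=False, contaminated=False):
    """Classify a record by what it may do in a RESL run."""
    n, ambiguous_fraction = barcode_stats(seq)
    if has_stop_codon or contaminated:
        return "ineligible"
    if n <= 300 or ambiguous_fraction >= 0.01:
        return "ineligible"
    # >500 bp is needed to establish a new BIN; >300 bp suffices to join one.
    return "may_found_new_bin" if n > 500 else "existing_bin_only"
```

Under these assumptions, a clean 600 bp sequence may found a new BIN, a clean 400 bp sequence can only join an existing BIN, and a 300 bp sequence does not enter the analysis.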
Taxonomic assignment. Prior to processing, most specimens were identified to an order level based on morphology. After processing, each record went through a taxonomic assignment and verification workflow (see Supplementary File 3). Following a record's assignment to a BIN, if that BIN contained specimens identified to a single family, genus or species, it received this identification. In cases of taxonomic discordance, the identification was applied above the level of disagreement. For example, if a BIN containing two members had one specimen assigned to genus A and the other to genus B, but both belonged to family C, the specimen would only be identified to the family level.
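The rule of applying an identification only above the level of disagreement can be sketched as follows. The record layout (dictionaries keyed by rank) and the function name are hypothetical, and the sketch assumes that members lacking a name at a given rank are simply ignored at that rank.

```python
# Illustrative sketch of BIN-based consensus taxonomy.
# Assumption: each BIN member is a dict with optional 'family', 'genus'
# and 'species' keys; unnamed ranks are ignored rather than treated as conflict.
RANKS = ("family", "genus", "species")  # coarse to fine


def bin_consensus(members):
    """Return the names on which all identified members agree.

    Ranks are checked from coarse to fine, and the walk stops at the first
    rank with no name or with more than one distinct name.
    """
    consensus = {}
    for rank in RANKS:
        names = {m[rank] for m in members if m.get(rank)}
        if len(names) != 1:
            break
        consensus[rank] = names.pop()
    return consensus


# The example from the text: genus A versus genus B, both in family C.
example = [{"family": "C", "genus": "A"}, {"family": "C", "genus": "B"}]
print(bin_consensus(example))  # {'family': 'C'}
```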
For specimens without a BIN assignment or where the taxonomy associated with the BIN was only to a family level, specimen sequences were compared to the complete reference library on BOLD using its Identification (BOLD-ID) Engine (available at http://v4.boldsystems.org/index.php/IDS_OpenIdEngine). A list of the top 99 sequence matches for each specimen was returned, and the taxonomy was applied where present and without discordance (as in BIN taxonomy assignment described above). Species-level identifications were assigned at ≥98% sequence similarity, genus-level identifications at ≥95% similarity, and family-level identifications at ≥90% similarity.
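The similarity thresholds translate directly into a lookup. The sketch below is a minimal illustration with a hypothetical function name; it returns only the finest rank that a given similarity permits and does not model the additional check for discordance among the top matches.

```python
# Illustrative mapping from percent sequence similarity to the finest rank
# at which an identification may be assigned (thresholds from the text).
THRESHOLDS = (("species", 98.0), ("genus", 95.0), ("family", 90.0))


def finest_rank_allowed(similarity_pct):
    """Return 'species', 'genus', 'family', or None if below all thresholds."""
    for rank, cutoff in THRESHOLDS:
        if similarity_pct >= cutoff:
            return rank
    return None
```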
Specimens still lacking an identification at the family level were placed into a Neighbor-Joining tree of identified records in the same order, constructed on BOLD (see Supplementary File 4 for an example). If an unnamed specimen fell within a distinct haplogroup cluster, the lowest taxonomic level of agreement was applied to the specimen. If this approach was also unsuccessful, specimens were identified morphologically where possible, either by in-house experts or through loans to taxonomic specialists (e.g. Canadian National Collection of Insects, Arachnids, and Nematodes; Smithsonian Institution's National Museum of Natural History; see Acknowledgments for key taxonomic specialists).
Specimen, DNA, and image storage. All voucher specimens in the dataset were archived in a secure, microclimate-controlled Specimen Archive (BIOUG). All specimen provenance data, timing of processing, and storage locator information were digitized in a custom-designed institutional database (see Technical Validation below) to allow the efficient pre-laboratory processing, data submission, archival storage, and retrieval of specimens. All vouchers are available for loan for further research, and the data are accessible in various data portals (see Data Records below).
The DNA extracts produced during barcode analysis are stored within a DNA Archive, either in −80 °C freezers or dried in a trehalose or PVA-based cryoprotectant 54 and held in −20 °C freezers. Information on these DNA extracts is stored in a MS Access database. Tracking of the DNA extracts through the DNA barcoding analytical steps was also captured by a custom-built PostgreSQL-based Laboratory Information Management System (BOLD-LIMS). The data necessary for the preparation of the specimen core and GGBN extension files were exported from the DNA Archive database and BOLD (see Data Records below).
Representatives of each BIN were photographed to build a digital image library to aid taxonomic validation. Specimens were photographed at high resolution and the images were made accessible through both the specimen and BIN pages on BOLD under a Creative Commons No Rights Reserved (CC0 1.0) license.
(2019) 6:308 | https://doi.org/10.1038/s41597-019-0320-2

Data Records
Records summary. Although the specimens were sourced from localities spanning ~4500 km in latitude and ~7000 km in longitude, sampling coverage was strongest in southern Canada (Fig. 2a). Sampling coverage varied more than 20-fold among the 13 provinces and territories, with N = 13,225 (0.9%) for Nunavut versus N = 425,049 (28.3%) for Ontario. Most of the specimens (~98%) were from terrestrial habitats, followed by freshwater (~1.5%) and marine (~0.5%) environments.
Most specimens associated with this data release are available for loan or further study in the Centre for Biodiversity Genomics Collection (BIOUG). A small percentage (2.3%) of specimens were damaged or lost during processing but, in nearly all cases, other representatives of that BIN were recovered. In total, 210,585 (14.0%) specimens were photographed and these images can be accessed on both the individual specimen and BIN pages. Most BINs (N = 58,126; 90.5%) in the data release are represented by an image of at least one voucher. When paired with Neighbor-Joining (NJ) trees, these images are critical for taxonomic validation and identification refinement (see Supplementary Files 4 and 5 for a NJ tree and associated images for one group of Canadian net-winged insects). The image library may also be useful as a training dataset for machine learning algorithms designed for specimen identification utilizing images (e.g. 55 ).
This data release is taxonomically extensive as it includes representatives for 14 phyla, 43 classes, 163 orders, 1123 families, and 6186 genera. A very high proportion of the specimens have taxonomic assignments at the family (99.9%) and genus (69.5%) levels, but fewer (N = 571,902; 38.1%) could be assigned to a species (Table 1). Of the 1,500,003 specimens included in the resource, 1,457,334 (97.2%) were either placed into an established BIN on BOLD or founded a new one, for a total of 64,264 BINs. As a proxy for species, this BIN total represents a substantial gain for the Canadian species inventory. The last thorough compilation for all invertebrates 56,57 indicated only 41,941 Canadian species and an estimated fauna of 78,821 species. Similarly, the more recent compilation of all terrestrial invertebrates by Langor 58 assembled 44,100 described species with 27,000-42,600 remaining undiscovered and/or undescribed. Flies (Diptera) dominate both specimens (N = 875,215; 58.3%) and BINs (N = 27,525; 42.8%) in the current reference library, followed by bees, wasps, ants and allies (Hymenoptera) and moths and butterflies (Lepidoptera) (Table 1). The 'Other Localities' subset included 26.9% more families than the 'National Parks' subset although it included half as many specimens. Taxonomic resolution (measured at the species level) also varied slightly between the subsets, with the 'National Parks' subset at 35% identified to a species versus 44% for the 'Other Localities' subset. This variation in resolution is apparent between taxonomic categories as well; just 5-16% of mites and ticks (Acari) have a species assignment versus 99-100% for spiders (Araneae) and 82-87% for moths and butterflies (Lepidoptera).
A closer examination of the 'National Parks' subset reveals the recovery rate and overall complexity of the barcode-based workflow. In total, 1,148,787 specimens were processed from collecting events in these sites, but just 1,002,170 (87.2%) qualified for inclusion in the data release, for the following reasons. Firstly, 132,933 (11.6%) specimens were not successfully sequenced, with the order Hymenoptera comprising the largest proportion of failures (N = 46,103 failed specimens; recovery rate = 73.7%), followed by Diptera (N = 34,161; 94.8%), Acari (N = 17,189; 66.9%), and Hemiptera (N = 14,203; 76.9%). Secondly, ten sequence records contained stop codons, indicating that a pseudogene was likely sequenced instead of the COI barcode region; their low incidence (0.001%) indicates that nuclear mitochondrial pseudogenes (NUMTs; see 59) rarely complicate the recovery of COI through Sanger sequencing, likely because the copy number of NUMTs is far lower than that of the mitochondrial target. Thirdly, 4,799 specimens were flagged as possible contaminations or misidentifications. Fourthly, 6,100 specimens were excluded because their sequence was <300 bp or had >1% ambiguous bp in the barcode fragment, or because they lacked both a BIN and a family assignment. Lastly, as part of the taxonomic assignment workflow, 2,737 specimens were permanently transferred to other institutions, so their vouchers are unavailable at the CBG.
Because collecting efforts in the national parks varied in frequency and length (Fig. 2b, Supplementary File 1), there was considerable variation in the number of BINs and specimens captured per park (Online-only Table 1, Fig. 4). Values ranged from a low of 77 BINs and 715 specimens at Auyuittuq National Park to 6 [...] captured 8,878 BINs and 93,378 specimens, and the combination of ATBIs and bioblitzes captured 10,721 BINs and 83,277 specimens. The sampling methods employed at each national park differed in some cases as well, further contributing to the disparity. As expected, these sampling methods each captured a differing subset of the local fauna, but in combination, they led to more comprehensive collections (Supplementary File 6).
Records access. The specimen and sequence data for all 1,500,003 records are available on BOLD in public datasets (see list in Online-only Table 1, where specimens are grouped by national park and major collection programs; Fig. 3). The record for each specimen includes its date and locality of collection, its taxonomic assignment, and voucher specimen details. The record also includes trace files, quality scores, nucleotide sequence for the COI barcodes, and corresponding GenBank accession numbers. Condensed versions of the 'National Parks' and 'Other Localities' subsets, including full lists of GenBank accessions, are available in 60. As noted earlier, 210,585 (14.0%) of the records possess a photograph of the specimen, all with the Creative Commons No Rights Reserved (CC0 1.0) license. Each specimen record has been publicly released and is searchable in the Public Data Portal on BOLD (www.boldsystems.org/index.php/Public_BINSearch) or downloadable by utilizing BOLD's API (www.boldsystems.org/index.php/resources/api). Additionally, BOLD users can log in and search for any specimen(s) from the BOLD Workbench (http://www.boldsystems.org/index.php/Login/page). BOLD's various methods of delivering the data permit a wide range of queries and subsequent analyses (BOLD data are available as a tab-separated values file in 60).
All sequences in this data release have been submitted to GenBank. A full list of GenBank accessions for the 'National Parks' and 'Other Localities' subsets is available in 60. From the GenBank homepage (https://www.ncbi.nlm.nih.gov/genbank/), accessions can be searched as a comma-separated list. The entire dataset can be accessed through the NCBI's BioProject PRJNA472144 (www.ncbi.nlm.nih.gov/bioproject/472144) 61.
After final validation, specimen data were uploaded to the Global Biodiversity Information Facility (GBIF; http://www.gbif.org) as a Darwin Core Archive 62

Technical Validation
Inclusion in data release. Following taxonomic and sequence curation, specimens were required to pass one of two criteria before inclusion in the release dataset. First, specimens assigned to a BIN by the RESL algorithm (see Methods) were included. Second, if a specimen did not receive a BIN assignment, it was included in the dataset as long as its sequence was at least 300 bp long with <1% ambiguous base pairs, and led at least to a family-level assignment. No specimen whose sequence record was contaminated, had a stop codon, or was flagged by a member of the BOLD community (see Taxonomic Validation below) was included in the dataset.
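The two inclusion routes and the exclusion flags can be expressed as a single predicate. The sketch below is a hedged illustration rather than the pipeline used for the release: the `Record` class and its field names are hypothetical.

```python
# Illustrative sketch of the release-inclusion rule described in the text.
# Assumption: the Record fields below are hypothetical stand-ins for the
# corresponding BOLD record attributes.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    seq_length: int                 # sequence length in bp
    ambiguous_fraction: float       # fraction of ambiguous bases (0-1)
    bin_uri: Optional[str] = None   # e.g. "BOLD:AAD5593" if RESL assigned a BIN
    family: Optional[str] = None    # family-level identification, if any
    contaminated: bool = False
    has_stop_codon: bool = False
    community_flagged: bool = False


def include_in_release(r):
    """True if the record passes either inclusion route and carries no flag."""
    if r.contaminated or r.has_stop_codon or r.community_flagged:
        return False
    if r.bin_uri is not None:       # route 1: assigned to a BIN by RESL
        return True
    # Route 2: no BIN, but a long, clean sequence with a family-level name.
    return (r.seq_length >= 300
            and r.ambiguous_fraction < 0.01
            and r.family is not None)
```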
Sample tracking. Using a custom-built collection information management system (CIMS), the specific location and storage medium for each specimen was captured at the time of its submission to the CBG's collection archive. Unlike most natural history collections, specimens are arranged in order of processing to permit rapid submission of new specimens (up to 40,000 per week), to facilitate specimen retrieval (e.g. for photography), and to optimize the use of cabinet space. Because every specimen in the archive is databased, it is possible to query the CIMS (e.g. by a list of BINs, or a taxon for a particular geographical area) and quickly assemble all specimens required for an external loan or for examination by a visiting researcher.
Taxonomic validation. Multiple curatorial efforts were undertaken to validate taxonomic assignments.
Taxonomic conflicts within BINs were investigated and resolved where possible. This review often led to a persistent flag in BOLD stating that the record is contaminated or misidentified (flagging works much like a wiki; see 68). The list of matches provided by the BOLD-ID Engine was checked for taxonomic discordances indicative of contaminated samples or misidentified specimens, and the corresponding data records were flagged. Neighbor-Joining (NJ) trees of similar taxa (at the order level) were constructed on BOLD to reveal unexpected placements of taxa; this also included evaluation of an image library paired to the tree to facilitate the recognition of specimens whose phenotype was incongruent with their taxonomic assignment (see Supplementary Files 4 and 5 for an example NJ tree and associated images for Canadian net-winged insects, Neuroptera).
All species-level identifications were validated against current nomenclature. The first validation pass included comparisons against national or regional checklists (e.g. 69 for true bugs; 70 for beetles; and 71 for moths and butterflies). Taxa that did not match with authoritative checklists were verified against online resources such as the Catalogue of Life, WoRMS, ITIS, GBIF, or the World Spider Catalog. Remaining names were searched on a case-by-case basis in the taxonomic literature. Any synonyms or misspellings that were detected were corrected to the valid name.
Sequence validation. DNA sequences submitted to BOLD are first translated into amino acids and are then compared against a Hidden Markov Model of the COI protein. This pre-screening identifies gaps that provoke a frameshift or a stop codon, and other sequencing or editing errors. Sequences found to possess potential errors were manually re-edited or re-assembled from chromatogram trace files in CodonCode Aligner, which often enabled the correction of errors made during the initial sequence editing. Sequences with confirmed gaps leading to frameshifts were excluded from the dataset. After initial submission to NCBI, staff at GenBank reported any residual errors detected with their validation tools, allowing their correction before final submission.
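A much simpler check than the Hidden Markov Model screen gives a feel for how stop codons expose frameshifts and pseudogenes. The sketch below is not the BOLD pre-screening: it only scans an ungapped nucleotide sequence for in-frame stop codons, and it assumes the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG are the only stop codons. Some phyla in the library use other translation tables, so the stop set would need adjusting for them.

```python
# Illustrative stop-codon scan; a simplified stand-in for the HMM-based
# pre-screening described in the text, not a reimplementation of it.
# Assumption: invertebrate mitochondrial code (NCBI table 5), where TGA codes
# for tryptophan and AGA/AGG for serine, leaving TAA and TAG as stops.
STOP_CODONS = {"TAA", "TAG"}


def stop_codon_positions(seq, frame=0):
    """Return 0-based start positions of in-frame stop codons (ungapped input)."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]


def clean_frames(seq):
    """Return the forward reading frames (0, 1, 2) that contain no stop codon."""
    return [frame for frame in range(3) if not stop_codon_positions(seq, frame)]
```

A genuine COI barcode is expected to have one forward frame free of stop codons; a sequence with stops in every frame is a candidate for re-editing or exclusion.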

Usage Notes
The DNA barcode reference library presented here, covering nearly 65,000 species of Canadian invertebrates, should have wide utility in supporting specimen identifications through barcoding and metabarcoding. Its primary use will undoubtedly derive from its capacity to assign unknown specimens and samples to a taxon. This step is key in producing accurate and reproducible data in metabarcoding studies 72,73. The present DNA barcode reference library should also aid in quality control and validation for whole genome analysis by detecting misidentified samples and revealing cases of contamination (e.g. 74). While the library will be most useful for work in Canada, a third of the species found in the Nearctic, and about 5% of the Holarctic fauna, occur in Canada, so the library will have utility across the Holarctic region. In fact, given its taxonomic breadth (14 phyla, 43 classes, 163 orders, and 1123 families), it should be useful for studies worldwide, particularly for terrestrial invertebrates. It should also be valuable as a model for library construction in other countries and for other environments (e.g. soils, oceans), in Canada and elsewhere. In all applications, the accessibility of the library in various repositories 60,61,64, paired with the ongoing curation and refinement of taxonomic assignments by the biodiversity science community, further ensures its value will increase through time.

Code availability
The Barcode of Life Data System (BOLD; www.boldsystems.org) 8 was used as the primary workbench for creating, storing, analyzing, and validating the specimen and sequence records and the associated data resources 48 . The BOLD platform has a private, password-protected workbench for the steps from specimen data entry to data validation (see details in Data Records), and a public data portal for the release of data in various formats. The latter is accessible through an API (http://www.boldsystems.org/index.php/resources/api?type=webservices) that can also be controlled through R 75 with the package 'bold' 76 .
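For readers working outside R, a query URL for the public API can be assembled with the Python standard library. This sketch rests on assumptions that should be checked against the current BOLD API documentation: the `API_Public/combined` endpoint path and the `taxon`, `geo`, `bin`, and `format` parameter names are taken from the BOLD v4 web services as we understand them and may have changed. The helper name is hypothetical, and the code only builds the URL; it does not contact the server.

```python
# Illustrative URL builder for a BOLD public API query (no network access).
# Assumptions: the endpoint path and parameter names below follow the BOLD v4
# public web services and should be verified before use.
from urllib.parse import urlencode

BOLD_API_BASE = "http://www.boldsystems.org/index.php/API_Public"  # assumed


def combined_query_url(taxon=None, geo=None, bin_uri=None, fmt="tsv"):
    """Build a URL requesting combined specimen and sequence records."""
    params = {"taxon": taxon, "geo": geo, "bin": bin_uri, "format": fmt}
    # Drop unset parameters, then percent-encode the rest.
    query = urlencode({k: v for k, v in params.items() if v})
    return f"{BOLD_API_BASE}/combined?{query}"


print(combined_query_url(taxon="Vespula maculifrons", geo="Canada"))
```

The resulting URL could then be fetched with any HTTP client and the tab-separated response parsed with the `csv` module.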