High-throughput sequencing of multiple amplicons for barcoding and integrative taxonomy

Until now, the potential of NGS for the construction of barcode libraries or integrative taxonomy has been seldom realised. Here, we amplified (two-step PCR) and simultaneously sequenced (MiSeq) multiple markers from hundreds of fig wasp specimens. We also developed a workflow for quality control of the data. Illumina and Sanger sequences accumulated in the past years were compared. Interestingly, primers and PCR conditions used for the Sanger approach did not require optimisation to construct the MiSeq library. After quality controls, 87% of the species (76% of the specimens) had a valid MiSeq sequence for each marker. Importantly, major clusters did not always correspond to the targeted loci. Nine specimens exhibited two divergent sequences (up to 10%). In 95% of the species, MiSeq and Sanger sequences obtained from the same sampling were similar. For the remaining 5%, species were paraphyletic or the sequences clustered into divergent groups on the Sanger + MiSeq trees (>7%). These problematic cases may represent coding NUMTS or heteroplasms. Our results illustrate that Illumina approaches are not artefact-free and confirm that Sanger databases can contain non-target genes. This highlights the importance of quality controls, working with taxonomists and using multiple markers for DNA-taxonomy or species diversity assessment.


INTRODUCTION
While next-generation sequencing (NGS) is commonly used to analyse bulk environmental samples (metabarcoding) 1-3 , Sanger sequencing remains the standard approach in generating DNA barcode libraries 4 . This is unfortunate as the cost-effective acquisition of barcode sequences from hundreds of specimens identified to species by expert taxonomists could accelerate the construction of accurate reference libraries and increase their completeness 2,5 .
As it generates up to 25 million paired end reads (2*300bp), the Illumina MiSeq platform makes possible the sequencing of several hundreds of individuals on a set of informative barcodes. This allows for the increase not only in the number of species but also in the number of specimens included in reference databases, which is crucial, as a better coverage of the geographical range of the species and a better characterisation of the intraspecific variability lead to more accurate identification 6,7 . Two-step polymerase chain reactions (PCR) are easy methods that can be used to generate amplicon libraries for Illumina sequencing. In the first PCR reaction the targeted DNA region is amplified using specific primers flanked by tails (Fig. 1). These tails allow for a second PCR reaction to add Illumina adaptor sequences and indexes to multiplex samples 8 . Theoretically, two-step PCR approaches provide an opportunity to build on existing experience and tools (e.g. primers and PCR conditions), which make them very attractive.
Combining two-step PCR approaches and high-throughput sequencing may contribute to circumvent some of the main pitfalls of barcoding revealed by many studies 9 . Indeed, heteroplasms 10,11 ; NUMTS (NUclear MiTochondrial DNA segments) 12 , endosymbionts 13 , parasitoids 1 4 or contaminants may be sometimes preferentially amplified by the primer pair used and are frequently sequenced using Sanger methods. Using NGS, these non-target loci may be simultaneously amplified with the targeted COI, sequenced within the sequencing depth and better identified by post sequencing analyses.
Furthermore, combining two-step PCR and MiSeq sequencing may also help to increase the number of genes sequenced for barcoding. Indeed to circumvent the main pitfalls associated with the use of a single, mitochondrial gene, it has been acknowledged that an increase in the number of genes analysed is desired, though most studies still rely on COI only 9 . This increase is even more recommended when it comes to DNA-based species delimitation [15][16][17] or phylogeography. However, the addition of loci often comes at the expense of sampling 18 . By combining multiplexing techniques with high throughput sequencing, researchers may no longer need to choose between more samples or more characters. Finally, adding one or a few nuclear genes aside the standard mitochondrial fragment (COI) may facilitate the identification of mtDNA introgression 19 .
Recently, genome skimming, the low-coverage shotgun sequencing of total genomic DNA 20 has been proposed as a next generation barcoding tool 21 . However, a switch to databases including the complete genome sequence of each organism on Earth is still unrealistic due to unaffordable costs.
Furthermore high consumable costs, increased demands on data storage, analytical issues, as well as potential difficulties in obtaining material transfer agreements 21 , challenge the implementation of this method. In any case, identification of random scaffolds is not possible with current databases. Thus, when genome skimming was used to capture the genomic diversity of bulk arthropod samples 22 , ca 70% of the recovered scaffolds could not be identified to species with existing databases. Therefore, there should be a gradual and step-wise implementation of genome skimming.
In this light, taking advantage of current databases seems more realistic, especially to make use of the huge effort undertaken over the past 15 years in compiling millions of COI sequences for hundreds of thousands species (e.g. the International Barcode of Life project iBOL). Finally, in many groups of living organisms, COI or a couple of genetic markers provide an accurate identification, even if problems do exist in some groups 23-25 .
Here, we focused on a group of chalcid wasps (Hymenoptera, Agaonidae, Ceratosolen) for which we have accumulated Sanger sequences on two mitochondrial [COI and cytochrome b (Cytb)], and one nuclear markers [elongation factor-1a (EF1a)], over the past 20 years and on which we have a strong taxonomic expertise that is essential to detect mismatches between morphological and molecular identification. Using a two-step PCR approach (Fig. 1) and Illumina MiSeq sequencing, we amplified and sequenced the same three markers (Table 1) on 115 species of Ceratosolen (369 specimens). We process raw data using a custom workflow including quality control steps (Figs 2&3) and compared our results to the Sanger data set.
The first objective of this study was to test the feasibility of the method. Then, we wanted to determine the best strategy to analyse MiSeq raw data for reference database construction and/or DNA-based identification. Indeed, with the thousands of sequences per sample produced by the MiSeq platform, sequence correction is not a burden anymore, but other issues may appear that need to be considered.
On the one hand, thank to sequencing depth, chances of actually getting sequences of the target locus are higher compared to Sanger sequencing. On the other hand, non-target loci (i.e. pseudogenes, heteroplasmic sequences) are also sequenced and target DNA region must be sorted out from the rest of the sequences. More specifically, one may wonder whether the cluster that contains the largest proportion of reads always corresponds to the targeted loci. Two studies suggested it might be so in most cases 2,4 , but other analyses are required. Finally, at some point, Sanger and Illumina sequences will both be used in reference databases, for integrative taxonomy, or for DNA-based identification of specimen. Consequently, identifying potential issues during data reconciliation was the third objective of this study.

MiSeq library construction and sequencing
Amplification success for each gene region (bands on the gel at the expected size after the first PCR step) is summarized in Tables 2&3. The success of PCR was higher for the mitochondrial genes. A PCR amplification product was observed for 80.9% of the species for COI-long, 86.1% for COI-short, 85.2 % for Cytb, and 77.4% for EF. As might be expected, we found a negative correlation between amplification success and time elapsed since specimen collection (Fig. 4). Interestingly, the overall amplification success between Sanger and MiSeq data sets were similar, though longer primers were used in the two-step PCR approach. DNA extraction seemed to have failed for 47 specimens (no PCR amplification product visible on gel). Analyses of the per-sequence quality scores showed that the sequencing quality of 40,1% (resp. 25.9%) of the forward (resp. reverse) reads reached Q30. We observed increased error rates towards the end of the reads (especially reverse reads). As a consequence, the paired reads did not overlap for EF, though the sequenced product (563bp) fall into the range of the MiSeq Reagent Kits v3. A total of 18,688,278 Illumina paired-end reads were obtained with an average number of raw reads per sample of 38,913 (range = 673 -158,278).

Quality control of clusters of reads
The number of clusters of reads varied among samples and genes (Tables 2 & 3). More clusters were obtained when paired-end reads did not overlap, probably because of the increased error rates towards the end of the reads. After completion of our workflow (Figs 2&3; Table 4), ca 76% of the specimens have a sequence for the three-targeted genes and at least one sequence was retained for 94.8% (COI), 82.6% (Cytb) and 84.3% (EF) of the species. Sequencing what appeared as negative PCR amplifications allowed saving up to 11 species for EF, 9 for COI and 3 for Cytb. On the other hand, no sequence were obtained for about 6.6% (Cytb), 5.1 % (COI) and 3.7% (EF) of the samples for which an amplicon was visible on the gel (Table 4).
Translation to amino acids showed that 57.2% (COI-long, forward reads), 8.9% (COI-long, reverse reads), 76.3% (COI-short), 93.9% (Cytb), 89.0% (EF, forward reads), 15.5% (EF, reverse reads) of the major clusters obtained from positive PCR were coding (Tables 2 & 3). Among clusters obtained from positive PCR and that passed the translation step, 100% of the COI-short, Cytb, EF clusters as well as 89.7% (COI-long, forward reads) and 84.6% (COI-long, reverse reads) clusters blasted with Agaonidae sequences on NCBI. Non-homolog sequences mostly belong to symbionts (Wolbachia) or parasites (nematodes). Finally, among clusters obtained from positive PCR products and that passed the translation step, an average of 2.6% only had a consensus sequence identical to another species of Ceratosolen and may represent contamination or conversion of indexes. Therefore, the cluster that contained the largest proportion of reads did not necessarily represent a valid sequence.
After completion of the workflow, a few specimens (2.4%) were represented by two consensus sequences in the final MiSeq data set: one specimen for COI, for which sequences of COI-long and COI-short were different (JRAS03502_0153, Fig. S1) and eight specimens for Cytb, for which the major and the second major clusters had different consensus sequences (JRAS02196_0155, 56 ; JRAS01683_0151, 55, 56 ; JRAS02370_0151, 55, 56; Fig. S2). Phylogenetic inference revealed that one of the copies was (almost) identical to Sanger sequences while the other clustered apart with an average pairwise sequence divergence ranging from 7.3% to 10.3% (Fig. 4, S1, S2, Table 4). These cases are problematic as no objective criteria allow the removal of one of the sequences from the final data set. Lastly, when combining Sanger and MiSeq datasets, 9 species formed paraphyletic assemblages or clustered into two divergent (>7%) groups of sequences: 3 on the COI tree (Fig. S1), 4 on the Cytb tree (Figs 5, S2), 2 on both COI and Cytb trees, Table 4). Although two copies of EF have been reported in Hymenoptera 26 , no problematic case was detected on the tree obtained from the analysis of the EF data set (Fig. S3, Table 4).

DISCUSSION
In this study, we showed that a two-step PCR approach followed by Illumina sequencing may help to increase the number of sampled species and specimens in existing barcode databases. By including nuclear genes, this method could allow accurate identification of specimens within species complexes where mitochondrial markers may be misleading (e.g. introgression 6,7 ). Interestingly, primers and PCR conditions used to generate Sanger data sets did not need optimisation in order to be used for MiSeq library preparation. Moreover, this approach does not require costly investments in laboratory equipment and supplies. Provided that adapters/index and primers are compatible (e.g. no hairpin structure), researchers can keep on working with markers they have previously selected for informativeness 27 .
Increased error rates towards the end of the reads (especially reverse reads) have made the bioinformatic processing of data less convenient with a necessary switch to algorithms that allow clustering of sequences with different length. Nevertheless, processing remains feasible and fast with available programs (48 hours were required on 8-cores of a 16-cores Linux, 2.9GHz, 64GB RAM computer to process raw data). This increase in error rate could be due to accumulation of phasing and pre-phasing events throughout the sequencing process 28 . When contacted, the Illumina technical support team acknowledged the issue and informed us that they were working to fix it. Progress is being made and work associated with resolving this issue is continuing. It would thus be reasonable to think that in the next couple of years read length will increase and amplicons of any size could be sequenced.
Results from this pilot study suggest that it may be helpful to sequence PCR products with no visible bands on the gel (potentially negative PCR). Indeed, we obtained sequences that passed our quality control steps for 86 specimens for which no amplification product was detected on the gel. When using Sanger method, sequencing what seemed to be negative PCR products was discouraged because of sequencing cost. This aspect now becomes affordable with NGS methods. From a practical point of view, pooling positive and negative PCR products can lead to low concentration of the library (<2nM before denaturation, which is the minimum concentration recommended by the Illumina protocol).
Here we obtained 0.24 nM but used Tris-HCL to neutralize the increased NaOH concentration as suggested in the NextSeq protocol of Illumina. Neutralization should also work for other MiSeq libraries provided than PCR success is not too low (here the average PCR success was 70%). As for the classical PCR approach, the negative correlation observed between amplification success and time elapsed since specimen collection argues in favour of rapid DNA extraction after fieldwork instead of long-term storage of specimens in EtOH.
In this study, we amplified three genomic regions on 369 samples but other experimental design may be used to better fit with researcher needs (more markers with less samples or more samples with less markers). The ca 25M reads generated by the MiSeq platform indeed open up many possibilities. That says, sequencing depth allocated to each marker should be large enough to allow sequencing of the target region. Indeed, sequencing depth of NGS methods statistically alleviates the effect of base calling errors but also increases chances of getting non-target loci (e.g pseudogenes, non-homologous locus). Our study shows that quality control steps are required to make sure the sequences included in the data sets are accurate. Indeed, clusters that contained the largest proportion of reads can contain frame shift mutations and/or stop codons or belong to non-target organisms (e.g. symbionts or parasites). Amplification of sequences from symbionts or parasites occurred almost exclusively when we used primers derived from the universal Folmer' primers to amplify COI (LCO1490, HCO2198 29 ) (up to 25.9% of the clusters obtained from positive PCR products and that passed the translation step).
This shows that caution must be taken when using COI for a metabarcoding approach to assess species diversity.
The power of our approach coupled with its simplicity makes it attractive, but good practice designed to detect issues with Sanger sequences are still relevant. At least a translation to amino acids and a comparison to existing database (e.g. through BLAST) should be performed before sequence validation. While it has been underlined that pseudogenes, heteroplasmic sequences or sequences from symbionts or parasites may be obtained 4 , contamination during library preparation is less discussed.
Indeed, contamination is difficult to detect and requires taxonomic knowledge and sequencing of both mtDNA and nuDNA markers to be distinguished from mtDNA introgression. Our results confirm that several markers should be sequenced for species diversity assessment to avoid underestimation of the number of species.
While it may be possible to identify cross-contamination or amplification of non-coding copies (e.g nuclear mitochondrial pseudogenes, numts), sorting out paralogs in which no stop codon or frameshift mutation is detected may be difficult. In this study, we found nine specimens for which we were not able to select just one from the two copies that passed our quality control steps. These specimens were thus represented by two sequences in the final MiSeq data set. Origin of these sequences is difficult to assess. PCR or sequencing mistakes cannot be ruled out and use of replicate sequencing may reduce the noise in data processing 30 but NUMTS or heteroplasms can also explain such pattern as NUMTS On the final trees (Sanger + Miseq sequences), 4.3% (COI) and 5.2% (Cytb) of the species for which at least one consensus sequence passed quality control were paraphyletic or clustered into two divergent groups of sequences (>7%). In some cases, for Cytb, such pattern could be explained by the fact that two primer pairs were used to generate the Sanger data set, while only one pair was used for the MiSeq part. However the same pattern was observed when the same pair was used for the Sanger and the MiSeq data set. All these cases are difficult to explain (PCR/sequencing errors, NUMTS, heteroplasmic sequences ?). They showed that filtering methods to select sequence clusters may influence the results and could lead to an overestimation of species diversity. In conclusion, these results are a reminder of how important it is to take a close look at the data, work in close relation with expert taxonomists and consider more than one marker for DNA-taxonomy or species diversity assessment.
To conclude, the approach presented here may contribute to the acceleration of the global efforts and may also contribute to improving the state of completeness and accuracy of the present database. We also advocate capitalizing on the huge investments made to construct barcode databases (BOLD) and in practice be conservative and pragmatic in maintaining the genes and the methods mastered by scientists, while shifting to next generation sequencing.

Study group
We unprecedented sampling of species. We have previously sequenced these specimens on COI, Cytb and EF1α.

Sampling
One-hundred twelve species of Ceratosolen were included in the data set, of which about half (n=62) are undescribed. Three species were taken as outgroups : two in the genus Kradibia (sister group of the genus Ceratosolen 35 ) and one in the genus Tetrapus. All material were collected alive from 1996 to 2015, fixed in 75% ethanol and identified morphologically by JYR.
Sanger : DNA from two to three specimens per species was extracted over the past 20 years.
MiSeq : On average, DNA from three specimens per species was extracted. A total of 369 individual specimens were included in the library.

DNA extraction
DNA was isolated using the Qiagen DNeasy® 96 Blood and Tissue Kit (Qiagen, Germany) according to manufacturer's protocols. Individual specimens were incubated overnight at 56°C in the lysis buffer before performing the next extraction steps. In the end, DNA was recovered in a total of 100 µL of AE buffer (two elution steps of 50 µL AE buffer each).
With very few exceptions, sequences were obtained from the non-destructive extraction of a single wasp specimen (corpse kept as voucher). When destructive extraction was used, vouchers were selected among specimens sampled from the same tree and the same fig after careful identification by JYR. Destructive extraction was performed for the Miseq library. Vouchers are deposited at CBGP, Montferrier-sur-Lez, France.

Sanger data set
Two mitochondrial protein-coding genes [the 5' end of the cytochrome c oxidase subunit I (COI) « barcode fragment » and part of the cytochrome b (Cytb)] and one nuclear protein-coding gene [elongation factor-1a (EF1a)] were included in the study. Amplification and sequencing protocols followed Cruaud et al. 36 for Cytb and COI and Cruaud et al. 37 for EF1a. The two strands for each overlapping fragment were assembled using Geneious v6.1.6 38 . All sequences that we obtained for the target species were included in the data set. Sequences were aligned using MAFFT v7.222 39  Resulting trees were visualised and annotated using TreeGraph 2 44 . Following visual inspections of trees, contaminations (100% identical sequences for samples belonging to different species between which hybridization is not possible) were removed from the data set.

Illumina MiSeq library preparation
Our library preparation approach involved two PCR steps with different primer pairs, as suggested in the Illumina protocol for 16S Metagenomic Sequencing Library Preparation. The first PCR step is performed to amplify the targeted DNA region. In this step, the primer pairs used contain a standard Illumina sequencing primer, a 0 to 3 bp "heterogeneity spacer" (as suggested in Fadrosh et al., 45 ) and the gene-specific primer (Fig. 1)

Analyses of the MiSeq data set.
Step 1, from read filtering to clustering (Fig. 2 clusters containing less than 10 sequences were excluded from the data sets using VSEARCH.
Difference in read lengths due to quality trimming leaded to an overestimation of the number of clusters by SWARM for COI-long and EF1a forward and reverse reads. Therefore only the results obtained with CAP3 were subsequently analysed.

Analyses of the MiSeq data set.
Step 2, quality control of clusters of reads (Fig. 3) For each gene region, the consensus sequence of each cluster was aligned with the corresponding Sanger data set using MAFFT v7.222(default parameter). When paired-end reads did not overlap (COI-long and EF1a), clusters of reads 1 and clusters of reads 2 were analysed separately. At this step of the process, 6 data sets were assembled : COI Sanger + COI-short MiSeq, COI Sanger + COI-long MiSeq reads 1, COI Sanger + COI-long MiSeq reads 2, Cytb Sanger + Cytb MiSeq, EF Sanger + EF reads 1 MiSeq, EF Sanger + EF reads 2 MiSeq.Alignments were translated to amino acids using Geneious v6.1.6 to detect frameshift mutations and premature stop codons. Non-coding sequences were removed from the data set. NCBI-BLAST was used to identify sequences that did not belong to the target group. Phylogenetic trees were then inferred for each gene region using RAxML. Resulting trees were visualised and annotated using TreeGraph 2 44 . Visual inspection of trees was carried out to identify contaminations, which were subsequently removed from the data sets. Reads 1 and 2 for COI and EF were then merged into a single data set (gaps were inserted between the non overlapping reads). Finally,COI-long and COI-short data sets were merged and potential discrepancies were pointed out. MEGA7 57 was used to calculate average divergence (p-distance) between sequence groups for problematic cases.  O  v  e  r  l  a  p  p  i  n  g  p  a  i  r  e  d  -e  n  d  r  e  a  d  s  (  C  O  I  -s  h  o  r  t  a  n  d  C  y  t  b     1 as revealed by a visual inspection of the gel after the first PCR step 2 The cluster that contains the largest proportion of reads/sequences is called the "major cluster". 3 At this stage of the process, "major cluster" stands for the cluster that contains the largest proportion of reads/sequences AND whose consensus sequence successfully passed the translation to amino acids step. 4 as revealed by NCBI-BLAST (e.g. symbionts, parasites, predators or laboratory aerosol contamination) 5 as revealed by visual inspection of trees. In this case, sequences belong to the target group (Ceratosolen) but are identical to sequences from another species (100% BP). May be due to crosscontamination during library preparation or conversion of indexes due to mixed clusters on the flow cell (clonal clusters derived from more than one template molecule), but also to mtDNA introgression (which is undetectable without taxonomic knowledge of the group or comparison with nuDNA data sets).