Towards a global DNA barcode reference library for quarantine identifications of lepidopteran stemborers, with an emphasis on sugarcane pests

Lepidopteran stemborers are among the most damaging agricultural pests worldwide, able to reduce crop yields by up to 40%. Sugarcane is the world’s most prolific crop, and several stemborer species from the families Noctuidae, Tortricidae, Crambidae and Pyralidae attack sugarcane. Australia is currently free of the most damaging stemborers, but biosecurity efforts are hampered by the difficulty in morphologically distinguishing stemborer species. Here we assess the utility of DNA barcoding in identifying stemborer pest species. We review the current state of the COI barcode sequence library for sugarcane stemborers, assembling a dataset of 1297 sequences from 64 species. Sequences were from specimens collected and identified in this study, downloaded from BOLD or requested from other authors. We performed species delimitation analyses to assess species diversity and the effectiveness of barcoding in this group. Seven species exhibited <0.03 K2P interspecific diversity, indicating that diagnostic barcoding will work well in most of the studied taxa. We identified 24 instances of identification errors in the online database, which has hampered unambiguous stemborer identification using barcodes. Instances of very high within-species diversity indicate that nuclear markers (e.g. 18S, 28S) and additional morphological data (genitalia dissection of all lineages) are needed to confirm species boundaries.

Genetic distances and the barcoding gap. As genetic distance underpins some species delimitation analyses such as ABGD 34 and RESL 35 , we investigated diversity in our haplotypes dataset by computed mean and maximal Kimura-2-Parameter genetic distances both within and between clades. Six comparisons between species had a minimum interspecific K2P distance of less than 0.03 (Table 1). Of the 49 non-singleton species, there were 27 with maximum within-species diversity above 0.02 (Table 2).

Species delimitation.
Species delimitation was performed on the haplotypes dataset and the genus-specific datasets using the GMYC, mPTP, bPTP, RESL and ABGD methods. Varying the relative gap width (X) or prior maximal intraspecific distance (PMID) values had a marked effect on the number of taxa estimated in the ABGD delimitations, ranging from 1 taxon (X = 1.5, PMID 0.0215 or X = 1, PMID = 0.0359) to 188 taxa (X = 1, PMID = 0.0017). The ABGD estimates most in line with the other delimitation methods' estimates ranged from 94 taxa (X = 1, PMID = 0.0215) to 188 taxa (Fig. 2). The bPTP MrBayes method delimited the highest number of taxa, at 197. The GMYC single threshold method estimated 145 taxa, while the GMYC method with multiple thresholds delimited 192 species. The mPTP method delimited 107 and 122 species using the RAxML and MrBayes trees respectively, and the RESL method delimited 170 taxa. The different delimitation methods applied had varying rates of success in matching current taxonomy (Fig. 3). Species were categorised as 'matching' (all individuals in one delimited group and no individuals identified as other species included), 'merged' , (grouped with one or more other species), 'split' (two or more groupings containing the one identified species), or 'complex' , (the species is split and at least one partition is merged with at least one other species), following Kekkonen et al. 49 , and we add a further category 'single' , for taxa which are in the 'match' category but are represented by a single identified specimen in our dataset.
The multiple threshold GMYC method had the highest number of split taxa (28) of all methods (Fig 4-8). The GMYC single threshold method had the highest number of matches of the non-ABGD methods (30), while the ABGD methods exceeded this: PMID = 0.0129X = 1 (34) and X = 1.5 (33), and ABGD PMID = 0.0215X = 1 (35). The ABGD analyses were sensitive to changes in the PMID and X values, producing a range of delimitations ranging from entirely merged (ABGD PMID = 0.0359X = 1 and PMID = 0.0215X = 1.5, both delimiting one taxon across all specimens) to highly split (188 taxa).
Congruence among methods was high in many species. Based on the haplotypes dataset, in 29 out of the 64 species, at least 11 out of 12 methods agreed on whether the taxon was matching, single, split, merged or a complex ( Table 3). Three of the seven high priority species were in this category: Scirpophaga excerptalis was split in all 12 delimitations, Chilo terrenellus matched in all 12 and Sesamia inferens was split in 11 out of 12 methods. Of the remaining four high priority species, Sesamia grisescens matched in 10 out of 12 methods, Chilo sacchariphagus and Chilo auricilius lent towards split (9 split/3 match), and Chilo infuscatellus was about even (7 split/5 match). Overall, the delimitations highlighted that diversity is likely to be underestimated among these high priority species, as most species either matched with current taxonomy or were split into multiple species. In the delimitations based on the Chilo, Sesamia and Scirpophaga subtrees, results were similar (Supplementary Table 1).
To test whether sampling bias influenced the number of species delimited in each taxon, we performed regression analyses on five of the whole haplotypes tree delimitations: GMYC single threshold, mPTP MrBayes, bPTP RAxML, RESL and ABGD 0.00774 TN X = 1. These five methods encompass the narrowest possible range of total number of species delimited (121-183, Fig. 2), while still including one delimitation from each method. In each case, a regression analysis was performed between the number of matching and split taxa delimited, and the number of specimens present of that species in the dataset. Singleton taxa were excluded to prevent biasing towards 'matched' taxa (as singletons cannot be split) and merged and complexed taxa were also excluded. Analyses were conducted using the Data Analysis package in Microsoft Excel. In all cases, there was a significant correlation between the number of taxa delimited and the number of samples included in the database (Table 4). However, r 2 values in all cases were relatively low; although the value for RESL was high (0.75), this was due primarily to one outlier value (the large number of species delimited in Scirpophaga excerptalis), and without this species the r 2 value was 0.46.

Discussion
High-threat species identified by Sallam 14 were generally found to have high levels of intraspecific diversity. Sesamia inferens occurs in South, South-East and East Asia, New Guinea and the Solomon Islands, and is a pest of sugarcane and several cereals 50 . Although some studies have investigated its genetic diversity within parts of this range, particularly in China 47,51 , its overall genetic diversity is not well characterised. We found the species to be paraphyletic in all of the haplotypes dataset analyses, being split into two clades (Fig. 5). This species has the highest maximum intraspecific genetic distance in our dataset, at 11% K2P, and all delimitation methods  www.nature.com/scientificreports www.nature.com/scientificreports/ applied split the taxon into at least two species, (e.g., RESL analysis split it into 6). This strongly indicate that our S. inferens specimens are actually from two different species. We include no sequences from the type locality (Sri Lanka 13 ), but the clade from India and Pakistan is geographically closer to the type locality than the clade from China. It should be noted, however, that the India/Pakistan clade consisted only of sequences downloaded from BOLD, so we are unable to assess morphologically whether they might be a different species. More broadly, the Asian Sesamia includes 15 described species 52 , however its systematics is confused, and a revision combining morphological, ecological and molecular data is needed.
Scirpophaga excerptalis occurs throughout East, South-East and South Asia 53 . S. excerptalis formed a monophyletic clade in our haplotypes dataset analyses. S. excerptalis had a high maximum intraspecific divergence of 10.7% K2P, and was split into multiple species in all but one delimitation analysis. We identified ten of the S. excerptalis specimens in this study using genitalia dissections, including representatives from the three major clades. These results suggest either that S. excerptalis is a species complex, or that the mitochondrial gene tree does not match the species tree. Further work is required to test these possibilities.
Chilo infuscatellus occurs throughout Asia and parts of the Oceanian region 54 and is the main pest of sugarcane in China 55 . In our dataset this species exhibited high intraspecific diversity, (maximum 6.2%). Species delimitation methods either matched current taxonomy (9 analyses) or split the taxon into at least six species (15 analyses). As we did not perform any genitalia dissections on our material for this species, we cannot discount the possibility that some of these specimens are misidentified.
Chilo sacchariphagus occurs in southern and south-eastern Asia, south-eastern Africa, Mauritius, Reunion and Madagascar 56 . Species delimitation analyses favoured splitting (18 analyses) over matching current taxonomy (6 analyses). C. sacchariphagus was divided into three groups in ten delimitation analyses, which our genitalia dissections indicate correspond to the three subspecies C. sacchariphagus sacchariphagus 22 , C. sacchariphagus indicus 57 and C. sacchariphagus stramineellus 58 . These results have been confirmed by genitalia dissections for the first two subspecies. However, while C. sacchariphagus stramineellus can be differentiated by male genitalia, none of the dissected specimens of this subspecies have yielded COI sequences to date.
Chilo auricilius was recovered as monophyletic in all phylogenetic analyses. The distance to its closest non-conspecific neighbour, Chilo orichalcociliellus, is 6.93% K2P, which is sufficiently high to distinguish them when DNA barcoding. Species delimitation analysis favoured splitting (18 analyses) over matching current taxonomy (6 analyses).
Eight definitively identified S. grisescens sequences were included in the haplotypes dataset, all from Papua New Guinea. The closest distance from S. grisescens to its nearest congeneric, S. inferens, was 5.22% K2P distance. Twenty species delimitations matched current taxonomy, with one delimitation merging the species with S. inferens and three splitting it into multiple species. Two additional specimens (am12397 and am12399) clustered with S. grisescens in the tree, but as they were larval, without morphological identification and 2.94% divergent from the other specimens we considered them to be Sesamia aff. grisescens. Adult specimens would be useful in exploring whether these specimens are conspecific or not.
Four definitively identified C. terrenellus individuals occurred in our haplotypes dataset, all from Papua New Guinea. The closest distance from C. terrenellus to its nearest congeneric, C. partellus, was 7.91% K2P distance. All 24 species delimitation analyses matched the current taxonomy of C. terrenellus. Eleven sequences from Indonesian and Papua New Guinean specimens cluster very close to C. terrenellus and either represent this species or the morphologically similar species C. louisiadalis. Dissection of a larger series will be needed to confirm the identity of this clade.
Of the species of lesser biosecurity concern, 20 were also found to have maximum within-species divergences of more than 2% (Supplementary Material 1). www.nature.com/scientificreports www.nature.com/scientificreports/ Some species were found to have low levels of inter-specific diversity. Bathytricha species are not well studied, with no taxonomic publications on the genus (other than a species checklist) since the species were described in the late 19 th and early 20 th century. Although a COI-only phylogeny is not definitive, high intraspecific divergence and paraphyly within B. truncata indicates it may represent at least two different species, while the high degree of similarity between B. leonina and B. monticola suggests that further investigation of the differentiation of these species is needed.
Chilo thyrsis is known only from Tanzania, while Chilo orichalcociliellus has a much broader distribution across south-east and central Africa 59 . In its original description, C. thyrsis was described as "Externally very similar to Chilo argyrolepia" 60 , and C. argyrolepia has been subsequently synonymized with C. orichalcociliellus 59 . C. thyrsis is merged with C. orichalcociliellus in 13 delimitations and is separate in 11, which does not indicate strong support for either the separation or merging of the taxa. As we only include two specimens of C. thyrsis in our dataset, we can only draw limited conclusions here, but the acknowledged close relationship between these two species may indicate they are recently diverged.
Scirpophaga nivella has a broad distribution across South and South-East Asia, southern and eastern China, Australia and the Western Pacific 53 . Scirpophaga innotata is known from Indonesia and the Philippines 61 , and also Malaysia and northern Australia 62 . In all but three of our phylogenetic analyses, S. innotata formed a clade inserted into S. nivella, rendering the latter paraphyletic. In the haplotypes BEAST tree, they formed sister clades, and in the FastTree and BEAST Scirpophaga-only analyses, S. nivella formed a clade inserted into S. innotata, rendering it paraphyletic. Fourteen of 24 delimitations grouped S. nivella and S. innotata as one species. A minimal K2P distance of 2.19% between the two species is lower than the level of intraspecific diversity we found in other species of the genus, like S. excerptalis.
Acrapex minima and Acrapex albivena are dealt with in Le Ru et al. 63 . In that study, a phylogenetic tree reconstructed based on four mitochondrial genes and two nuclear genes strongly differentiated the two species.
Given the high levels of intraspecific diversity found in several species in this study, delimitations matching current taxonomy may not be the most successful, but rather an underestimate of the true species number. A full assessment of the taxonomic status of these species will require nuclear and morphological data, as the COI barcode is comparatively short and susceptible to the skewed inheritance patterns resulting from Wolbachia   Figure 4. RAxML tree based on the haplotypes dataset (Part 1 of 5). Bars to the right of the tree indicate species delimitation groupings according to the 12 different species delimitation methods. The group of bars on the left are delimitations based on the whole haplotypes sequence dataset, bars on the right are based on the genusspecific analyses. Numbers in brackets indicate how many additional copies of this haplotype were present in the 1297 sequence dataset. Asterisks indicate sequences that we found to be misidentified, and which have now been corrected on BOLD (new species names appear in this figure). Names (or numbers in brackets) in bold indicate specimens where we performed genitalia dissections to confirm identity.
www.nature.com/scientificreports www.nature.com/scientificreports/ infection 64,65 . Nevertheless, we can make some assessments of the relative merits of these methods as applied to this dataset. The ABGD method produced a broad range of delimitations depending on the prior maximal intraspecific distance (PMID) selected. Given that range, and our inability to independently assess which PMID www.nature.com/scientificreports www.nature.com/scientificreports/ is the most realistic, we exclude the ABGD method from the following comparisons. On this basis of matching current taxonomy, GMYC single threshold was the best method, with the highest combined number of matching and single taxa. If instead the criterion for successful delimitation is agreement with the consensus among the different methods we applied, the highest scoring method was again GMYC single threshold, with one disagreement with the consensus out of 64 taxa in the haplotypes dataset. The next best method was mPTP MrBayes, with two disagreements, then RESL with six. In the genus-level subtree analyses, the highest scoring methods are mPTP MrBayes and bPTP MrBayes, with two disagreements each out of 30 taxa, and GMYC single threshold and mPTP RAxML, with 4 disagreements each. Ultimately, congruence in delimitations across multiple methods remains the best method for assessing delimitation accuracy 40 , and we found this across all delimitations in several species (Table 3).
It is difficult to assess whether sampling was sufficient to delimit species accurately. When dealing with mitochondrial-only data, introgression and selective sweeps may make any amount of COI-only data insufficient to make an accurate assessment of species-level diversity. In the case of GMYC, Talavera et al. 66 found that the most significant factor in sampling was capturing the extremities of each species' diversity. Given the number of taxa included in our study with extreme within-species diversity above 5% (Table 2), we can be confident that at least for some species we have captured sufficient diversity. Future sampling efforts should be directed towards those species for which our sampling is poor, particularly Sesamia grisescens and Chilo terrenellus.
Our regression analyses indicate that there is a correlation between the number of individuals sampled with the number of taxa delimited in each species, although the r 2 values in three out of four cases were under 0.6, and in the last case lowers to less than 0.6 when one outlier is removed. Such a correlation is not unexpected, given that no taxon can ever have more species delimited than it has samples. The low r 2 values indicate that variables not in the model are having an effect on the relationship between number of sequences and number of taxa delimited; these other variables almost certainly include the actual absence or presence of cryptic diversity in these taxa.
Of the pairs of species with less than 3% minimum inter-specific diversity (Table 1), one species appears on the Sallam 14 list: Chilo orichalcociliellus (Low threat, similar to C. thyrsis). Including only two C. thyrsis sequences in the dataset also does not allow us to properly explore the level of diversity in that group, and whether it is generally poorly differentiated from Chilo orichalcociliellus. In addition, our dataset lacks sequences from some species identified by Sallam 14 as posing a low or medium threat to Australian sugarcane, which should be a high priority for future sequencing efforts. Apart from these caveats, the identity of all other stemborers of economic risk to Australia included in this study can be safely established through the use of the COI barcode.
When dealing with sequences downloaded from online public databases, one cannot verify the accuracy of specimen identifications (photographs, unless of genitalia dissections, are of little use in identifying stemborers). We found several instances of clear misidentification in our trees, where individuals identified as one species clustered closely with species in a different genus, or even a different family. These errors, which have now been corrected on BOLD (see Supplementary Table 2), are a cautionary tale for the uncritical use of public databases  Table 3. Summary of agreement among the twelve species delimitation methods applied to the haplotypes dataset. MATCH = delimitation agrees with current taxonomy, MERGE = taxon groups with one or more other species, SPLIT = taxon split into multiple species, COMPLEX = taxon split and at least one partition merged with another species. High congruence lists the delimitation of taxa where all twelve delimitation methods agreed, or where all but one agreed (these appear in brackets). The seven species posing the highest level of threat to Australia according to Sallam 13 are in bold and underlined; medium threat taxa are underlined but not in bold.  www.nature.com/scientificreports www.nature.com/scientificreports/ for quarantine identifications. Ideally, reference DNA barcode datasets should be established for quarantine pests and be validated through independent review processes to ensure the veracity of each specimen's species identity. This is difficult in taxa such as Chilo which lack modern integrative revisionary taxonomic studies and associated identification resources. As for our own specimens, although we aimed to perform a genitalia dissection to confirm the identity of at least one individual from each cluster, we were unable to perform this in all clusters. We were also unable to verify the identity of juvenile specimens in most cases, although this study will help future researchers develop larval morphology keys through providing an improved barcode identification tool www.nature.com/scientificreports www.nature.com/scientificreports/ for juvenile specimens. This study has resolved instances of misidentification and indicated the possible need for taxonomic revision, both operational factors that must be resolved in robust barcoding systems 67 . Global analyses coupled with morphological taxonomic study are necessary, and incremental refinements to reference DNA barcode datasets should be performed as more data accumulates 68 .
The high levels of diversity that we find in this study, and the tendency in several cases for that diversity to be correlated with geography, indicate that barcoding could be used in this group to determine the source population of a specimen. This might be particularly important in situations where different populations require different biosecurity approaches. For example, Eldana saccharina populations in Africa are host to different parasitoid species, and are differentiated geographically and by their COI barcodes 69 , and whitefly (Bemisia tabaci) biotypes are known to have different pesticide resistance profiles 70 . Further work is required to determine whether divergent COI clusters in diverse species require different biosecurity responses.   . Bars to the right of the tree indicate species delimitation groupings according to the 12 different species delimitation methods. The group of bars on the left are delimitations based on the whole haplotypes sequence dataset, bars on the right are based on the genusspecific analyses. Numbers in brackets indicate how many additional copies of this haplotype were present in the 1297 sequence dataset. Asterisks indicate sequences that we found to be misidentified, and which have now been corrected on BOLD (new species names appear in this figure). Names (or numbers in brackets) in bold indicate specimens where we performed genitalia dissections to confirm identity.
www.nature.com/scientificreports www.nature.com/scientificreports/ Similar wide-ranging studies of North American and European Lepidoptera have tended to find considerably lower intraspecific diversity than we find here. In a study by Yang et al. 71    www.nature.com/scientificreports www.nature.com/scientificreports/ were below 6%, and only three instances were found above 4%. Huemer et al. 72 examined 1004 species (4925 sequences) of butterfly in Austria and Finland, finding the highest maximum intraspecific distance was 9.6% K2P, with only 3 instances above 8%; 12.3% of included species included more than 2% maximum intraspecific divergence. Hausmann et al. 73 included 1395 sequences across 331 species of the Geometridae fauna of Bavaria, and found 9.2% maximum intraspecific divergence with seven species greater than 4%. In contrast, in our study of 1297 sequences across 64 species we find 27 species with maximum intraspecific K2P distances above 2%, 13 above 4%, and up to 11% in S. inferens. The unusually large intraspecific diversity in COI sequences observed for many species in this study needs to be resolved through the analysis of appropriate nuclear gene sequences and morphological work, to rigorously reassess species boundaries.
Overall, we find that COI DNA barcoding initiatives aimed at identifying stemborers of economic interest are likely to be successful. Four out of the seven species of greatest economic significance to Australia were found to have intraspecific distances >6%: Chilo infuscatellus, C. sacchariphagus, Scirpophaga excerptalis and Sesamia inferens. Species delimitation efforts in this large, unevenly sampled single-locus dataset were mixed, although several species exhibited congruent delimitations across methods. Non-monophyly within species was rare, encountered only three times, indicating that tree-based clustering may be a useful way to assign species identity to unknown individuals. Errors in identification found in online databases underline the importance of expertly identified voucher specimens and curation of sequence collections in establishing robust reference databases for accurate DNA barcode based identifications. This study is the first step in that direction for the lepidopteran stemborers of sugarcane and cereals.

Methods
Specimens. Specimens were collected by the authors or donated by colleagues from many countries. We attempted to sample as broadly as possible from cereal and sugarcane growing regions around the world, prioritising the "high risk" species of Sallam 14 . Two thirds of the specimens sequenced for this study were adults and one third were larvae. Adults are usually the only life stage reliably identified using current morphology techniques, but sampling adults usually precludes the collection of host plant information. In this case 68% of the adults sequenced for this study, including most of those from Africa, were collected from host plants as larvae and laboratory reared.
Species identification. Where possible, adults and larvae were identified to species level in the field based on experience, morphological appearance and/or ecological and geographic distribution, and were gifted to us with this existing identification. After DNA sequencing and preliminary phylogenetic analysis (see below), adult specimens for which the morphological (field) identifications disagreed with the DNA barcode identifications were reassessed based on external morphological appearance. Our next step was to examine at least one specimen from each putative species or each DNA barcode cluster (whichever was the less inclusive group) and reassess its morphological identification (e.g. for Scirpophaga excerptalis, 11 specimens were examined). Genitalia dissections were conducted on adult specimens and compared with available images of type specimens (for certain Chilo species), original species descriptions, taxonomic revisions, and published resources for stemborer identification. The main literature referred to: Barrion 74 24 .
For sequences downloaded from BOLD, which includes data mined from GenBank, we used the species identification provided. However, samples which we had good reason to believe had been incorrectly identified were excluded from some analyses, as described below. We contacted BOLD about such samples and their identifications on the database were changed. DNA extraction. DNA was extracted from adult moths either from a single leg or from a whole abdomen if a genitalia dissection was required. For larvae, depending on the specimen size, either a proleg or a piece of abdominal integument and associated muscle, or in some cases the entire rear half of the specimen (up to 10-20 mg) was sampled. To avoid any cross contamination, dissection instruments and forceps were wiped with laboratory tissue, dipped in ethanol and flame-sterilized between samples. DNA was extracted using commercial silica-gel membrane-based kits, either Qiagen DNeasy (Qiagen, Chadstone, Australia) or Bioline Isolate II Genomic DNA Isolation Kit (Bioline, Eveleigh, Australia) following the manufacturers' instructions, except for whole abdomen dissections we used two to three times the recommended volumes of tissue digestion buffer and proteinase-K, and stored the excess volume at −80 °C for potential later use. PCR amplification. PCR amplification used the protocols and primers described in Mitchell 88 , with some PCRs using the Folmer primers 89 . Samples were amplified using the primer pair AMbc0f1m and AMbc0r1m, and PCR products were visualised on a 1.5% agarose gel, stained with 1 drop of Biotium GelRed (Gene Target Solutions, Dural, Australia) per 50 mL of gel mix. Samples which did not show a band were re-amplified, using the primers M13F and AMbc0r2m, using 1 μL of the PCR product from the first amplification as a template. If this re-amplification failed, then no further amplification attempts were made. PCR protocols for initial amplifications and re-amplifications were the same and used the following reaction mixture per well: 2.29 μL MilliQ water, 7.5 μL of 10% Trehalose solution, 1.5 μL 10x reaction buffer (no MgCl 2 ), 0.75 μL MgCl 2 , 0.3 μL of dNTP mix, 0.3 μL forward and reverse primer at 5 μM each, 0.06 μL Platinum Taq and 2 μL template, (1 μL for reamplifications).
sequencing. Sequencing 91 , and there is a need to distinguish it from exotic species. Similarly, Acrapex Hampson, 1894 (Noctuidae) was included as the Australian species currently placed in this genus appear closely related to Asian Sesamia species, while Busseola Thurau, 1904 (Noctuidae), Rivula Guenée, 1845 (Erebidae) and Cnaphalocrocis Lederer, 1863 (Crambidae) were added as they contain significant cereal pest species for which we had obtained specimens.
This dataset, including specimen collection data, sequences and sequence trace files, is available on the Barcode of Life Data System website (BOLD) 33  We used a set of sequences from the study on Chinese sugarcane borers by Wang et al. 47 supplied to us by the senior author. When the specimens we sequenced were added to those downloaded from BOLD and the sequences from Wang et al., this produced a final dataset of 1297 individuals, including representatives from all seven of the 'high threat' species and 11 of the 15 'medium threat' species and two of the 14 'low threat' species identified by Sallam 14 .
Sequences were aligned in Geneious using Multiple Alignment using Fast Fourier Transform (MAFFT) 92 . The resulting alignments were cropped to a length of 667 bp. Only sequences longer than 486 bp, the minimum barcode standard length 93 , were used, however exceptions were made to this rule for two Chilo sacchariphagus specimens: ww06216 and ukzn0269 (at 476 bp and 413 bp respectively). The first sequence was included because our preliminary analysis showed it to occupy a long branch and be of possible taxonomic interest, and the latter was included because it was the only C. sacchariphagus sequence in our dataset from the type locality, Mauritius. Of the 1297 individuals in this final dataset, 1089 were initially identified to species level (Supplementary Material 3). GMYC 37,66,94 and mPTP 39 , are known to encounter difficulties with datasets including identical sequences. As identical sequences are also often removed when performing delimitations to speed up the analysis (e.g., in bPTP 95 ) we removed such duplicates using USEARCH 9.2.64 96 , removing the shortest of sequences with "maximum differences = 0", "maximum substitutions = 0", or "minimum match percentage identity = 100". We then checked the resulting 623 sequence dataset in a Geneious distance matrix to determine whether any 100% identity sequences remained, and a further three sequences were removed after this. We verified that no species had been eliminated from the dataset through this procedure. The resulting 620 sequence dataset, henceforth the 'haplotypes dataset' , was the main dataset used for species delimitation.
Preliminary analysis of the alignment was carried out using FastTree 2.1.5 97,98 in Geneious, using default settings, to generate approximately maximum-likelihood trees. This analysis was the one used to identify likely misidentified sequences. Nucleotide substitution models were tested using PartitionFinder2 99 , using the greedy algorithm 100 and the PhyML phylogeny estimator 101 , implemented on the CIPRES science gateway computing platform 102 . The best model was chosen based on the Bayesian Information Criterion. This was SYM + I + G for codon position 1, TRN + I + G for codon position 2 and GTR + G for codon position 3 in the full dataset, with a separate partition for each codon position.
In order to better investigate the effect of sample size and diversity on species delimitation, three further datasets were used: Chilo only, Scirpophaga only and Sesamia only subsets of the haplotypes tree. These datasets were formed by taking the smallest possible clades including all identified samples of those genera from the FastTree tree. This means that these datasets contain a mix of samples identified as being of that genus, and those that were not identified as being of that genus but grouped with them (i.e., putatively misidentified or unidentified sequences). A single BOLD sequence labelled as Sesamia submarginalis was excluded from the Sesamia analysis due to its deep divergence from other Sesamia samples. Model selection was also run on these datasets: for Chilo, this was TRN + G for codon position 1, F81 + G for codon position 2 and GTR + G for codon position 3; for Scirpophaga this was TRN + G for position 1, F81 for position 2 and TIM + I + G for position 3; for Sesamia this was TRN + I for position 1, F81 + I for position 2 and TIM for position 3.
Maximum-likelihood analyses were carried out using RAxML 8.2.10 103 on the CIPRES science gateway. In each case, the tree was estimated using 100 random stating points, and levels of bootstrap support at nodes were calculated using a bootstrapping analysis with 1000 pseudoreplicates.
Bayesian analyses were carried out using MrBayes 3.2.6 104 on the CIPRES science gateway. The analyses were terminated automatically when the standard deviation of split frequencies dropped below 0.01, so the number of generations was different in each analysis, (haplotypes dataset: 18,580,000, Chilo: 4,765,000, Scirpophaga: 2,495,000, Sesamia: 480,000). Samples were taken every 1000 steps, and the first 10% of samples were discarded as burnin. In each case two independent analyses were conducted, each consisting of one cold chain and seven heated chains.
BEAST analyses were carried out on the CIPRES portal, using the estimate for the rate of evolution in the insect COI gene from Papadopoulou et al. 105 Analyses used an MCMC chain of 10,000,000 steps, with a burnin of 1000 steps.
Trees from the maximum-likelihood analyses and Bayesian analyses were visualized and Figures generated using FigTree 1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).
The haplotypes dataset alignment was exported to TaxonDNA/Species Identifier 1.8 106 , and all individuals with an aberrant position on the tree with species level identification, i.e., likely misidentifications, were removed (2019) 9:7039 | https://doi.org/10.1038/s41598-019-42995-0 www.nature.com/scientificreports www.nature.com/scientificreports/ prior to calculating the Kimura 2 parameter distance 107 within and between species; 18 such sequences were found in the haplotypes dataset. Specimens not identified to species were also excluded from this analysis. Mean intraspecific and mean interspecific distances were calculated in the same dataset using MEGA7 108 , using K2P distances, uniform rates among sites, and the default 500 bootstrap replicates to calculate standard error.
Species delimitation. Five different species delimitation methods were applied to each dataset to further investigate instances where barcode diversity was inconsistent with current taxonomy, and to help determine how many species there are among the unidentified specimens in the tree.
ABGD was carried out using the online version of ABGD software 34 (http://wwwabi.snv.jussieu.fr/public/ abgd/abgdweb.html). Default settings were used, following the approach of Kekkonen and Hebert 111 , however distance matrices based on TrN distance calculated in MEGA7 were used as input, as the TrN model of evolution was more applicable to our dataset than JC or K2P, based on our Partitionfinder2 results. All analyses were run twice, using two different relative gap width (X) settings, X = 1.5 (the default) and X = 1. Only the recursive results were used as they allow for different gap thresholds among taxa 34 .
bPTP delimitation was carried out using the bPTP.py module v0.51 38 in Python v2.7.14 112 , using both the MrBayes and RAxML trees in all cases. mPTP delimitation was conducted using the mPTP webserver was used for this analysis (http://mptp.h-its. org/#/tree), using the MrBayes and RAxML trees as input. Trees that had any multifurcations first randomly resolved into 0-length bifurcating branches in Mesquite v3.5 113 .
RESL delimitation 35 was carried out online at the BOLD website, using the default settings of the "cluster sequences" function.

Data Availability
DNA Sequences, raw sequence trace files and specimen collection data is available on BOLD as public project LSTEM. The 508 COI sequences produced in this study have been submitted to GenBank, accession numbers MK566231 -MK566738. The 508 sequence dataset is also available for direct download from BOLD using the https://doi.org/10.5883/DS-LSTEM. Full sequence alignment: Supplementary Material 3.