Disentangling the taxonomy of the subfamily Rasborinae (Cypriniformes, Danionidae) in Sundaland using DNA barcodes

Sundaland constitutes one of the largest and most threatened biodiversity hotspots; however, our understanding of its biodiversity is afflicted by knowledge gaps in taxonomy and distribution patterns. The subfamily Rasborinae is the most diversified group of freshwater fishes in Sundaland. Uncertainties in their taxonomy and systematics have constrained its use as a model in evolutionary studies. Here, we established a DNA barcode reference library of the Rasborinae in Sundaland to examine species boundaries and range distributions through DNA-based species delimitation methods. A checklist of the Rasborinae of Sundaland was compiled based on online catalogs and used to estimate the taxonomic coverage of the present study. We generated a total of 991 DNA barcodes from 189 sampling sites in Sundaland. Together with 106 previously published sequences, we subsequently assembled a reference library of 1097 sequences that covers 65 taxa, including 61 of the 79 known Rasborinae species of Sundaland. Our library indicates that Rasborinae species are defined by distinct molecular lineages that are captured by species delimitation methods. A large overlap between intraspecific and interspecific genetic distance is observed that can be explained by the large amounts of cryptic diversity as evidenced by the 166 Operational Taxonomic Units detected. Implications for the evolutionary dynamics of species diversification are discussed.

Sundaland constitutes one of the largest and most threatened biodiversity hotspots; however, our understanding of its biodiversity is afflicted by knowledge gaps in taxonomy and distribution patterns. The subfamily Rasborinae is the most diversified group of freshwater fishes in Sundaland. Uncertainties in their taxonomy and systematics have constrained its use as a model in evolutionary studies. Here, we established a DnA barcode reference library of the Rasborinae in Sundaland to examine species boundaries and range distributions through DNA-based species delimitation methods. A checklist of the Rasborinae of Sundaland was compiled based on online catalogs and used to estimate the taxonomic coverage of the present study. We generated a total of 991 DNA barcodes from 189 sampling sites in Sundaland. Together with 106 previously published sequences, we subsequently assembled a reference library of 1097 sequences that covers 65 taxa, including 61 of the 79 known Rasborinae species of Sundaland. Our library indicates that Rasborinae species are defined by distinct molecular lineages that are captured by species delimitation methods. A large overlap between intraspecific and interspecific genetic distance is observed that can be explained by the large amounts of cryptic diversity as evidenced by the 166 Operational Taxonomic Units detected. Implications for the evolutionary dynamics of species diversification are discussed.

Results
Sequencing of the DNA barcode marker Cytochrome Oxidase 1 (COI) yielded a total of 991 new sequences (Table S2) from 189 sampling sites distributed across Sundaland (Fig. 2). Together with 106 DNA barcodes mined from GenBank and BOLD (Table S3), we assembled a DNA barcode reference library of 1,097 sequences from 65 taxa of Rasborinae and 1 taxon of Sundadanionidae (Sundadanio retiarius). The number of specimens analyzed per species ranged from 1 to 143, with an average of 14.6 sequences per species and only six species represented by a single sequence. The sequences ranged from 459 bp to 651 bp long, with 99 percent of the sequences being above 500 bp length, and no stop codons were detected, suggesting that all the sequences correspond to functional mitochondrial COI sequences. DNA barcodes for 61 of the 79 nominal species of Rasborinae reported from Sundaland were recovered (approximately 78%) corresponding to the 7 Rasborinae genera currently recognized (Table S1). The present study achieved a complete coverage at the species level for the genera Boraras (2 species), Brevibora (3 species), Kottelatia (1 species), Trigonopoma (2 species) and Trigonostigma (3 species). In turn, two out of the three Pectenocypris species (66%) and 48 out of the 65 Rasbora species (74%) currently recognized in Sundaland were collected (Table S1). Geographically, our dataset includes 86% of the Rasborinae of Borneo (38 out of 44), all the Rasbora species of Java (4 species) and 68% of the Rasborinae species of Sumatra (26 out of 38) were collected (Table S1). Finally, four undescribed taxa are highlighted, two taxa of Rasbora in Java, one taxon of Trigonostigma in Borneo (Table S2) and an additional Rasbora taxon, previously assigned to R. paucisqualis in the literature (Table S3).
Species delimitation analyses provided varying numbers of Operational Taxonomic Units (OTUs) among methods (Fig. 3): 129 for PTP, 95 for mPTP, 178 for GMYC, 191 for mGMYC, 175 for ABGD and 146 for RESL (Table S3). Our consensus delimitation scheme yielded 166 OTUs, including 165 OTUs for the 65 Rasborinae taxa, 2.5-fold more than by using morphological characters. The number of OTUs observed within species ranged from two for 22 species to 11 for Trigonostigma pauciperforata (Table 1). Based on the results of the species delimitation analyses, a re-examination of the original species identity associated with 105 DNA barcodes mined from BOLD and GenBank revealed 13 cases of conflicts that likely originated from mis-identifications (Table S4). These concerned the genera Boraras (four records, two species), Brevibora (two records, two species) and Rasbora (seven records, six species). Along the same line, 12 uncertain identifications were revised for the genera Rasbora (10 records, five taxa) and Trigonostigma (two records, one taxa).
The examination of the maximum K2P genetic distances for species with multiple OTUs and within OTUs revealed large differences with maximum K2P distances ranging between 0.26 and 13.64 within species and between 0.00 and 2.37 within OTUs (Table 1). This trend was largely confirmed by the distribution of the genetic distances at both species and OTUs levels ( Fig. 4). At the species level, the distribution of the maximum intraspecific K2P genetic distance broadly overlap with the distribution of the K2P distances to the nearest neighbor (Fig. 4A,B, Table S5) and no barcoding gap is observed. On average, the nearest neighbor K2P genetic distances   www.nature.com/scientificreports www.nature.com/scientificreports/ are only 3.5-fold higher than the maximum intraspecific K2P distances. Plotting genetic distance for each species provides little improvement as a substantial number of species display maximum intraspecific K2P genetic distances higher than the minimum distance to the nearest neighbor ( Fig. 4C). At the OTU level, the overlap is drastically reduced peaking between 0 and 0.99 for maximum intraspecific K2P distances and ranging between 1.0 and 1.99 for the K2P distance to the nearest neighbor (Fig. 4D,E). The distribution width of the maximum intraspecific K2P distance is much more restricted for OTUs than species and fewer cases of maximum intraspecific distance higher than the minimum distance to the nearest neighbor are observed (Fig. 4F). At the OTU level, the nearest neighbor K2P genetic distances were 7.2-fold higher on average than the maximum intraspecific K2P genetic distances.
Range distributions inferred from the new records generated for this study indicate that most type localities are embedded in the observed species range (Fig. 5). The degree of overlap between species range, however, largely varies among genera with little or no overlap observed for Boraras, Pectenocypris, Trigonostigma and most Rasbora species while a substantial amount of overlap is observed for Brevibora and Trigonopoma species (Fig. 5).

Discussion
This study represents the most comprehensive molecular survey conducted for the subfamily Rasborinae 19,34 . Our DNA barcode reference library consists of 65 Rasborinae species distributed across 7 genera and covering 78% of the Rasborinae diversity reported from Sundaland. DNA barcoding delivers reliable species-level identifications when taxa possess unique COI sequence clusters characterized by multiple private mutations. This condition was met for all the Rasborinae species examined here and no cases of retention of ancestral polymorphism were detected 35 . However, this clearly contrasts with multiples discrepancies observed within the set of previously published COI sequences obtained on GenBank and BOLD. About 25 percent of these records were either misidentified or associated with uncertain identifications. Such mis-identifications were expected considering the morphological uniformity within some Rasborinae genera, particularly in the genus Rasbora where multiple cases of taxonomic conflicts have been highlighted already 16,[36][37][38][39] . Unexpectedly, most of the conflicts we detected were within the larger species of Rasbora, particularly those of the Rasbora argyrotaenia group and the R. sumatra group, and not within closely related smaller species such as members of the R. trifasciata group (Fig. 1). In facts, conflicts in species level population assignments have been previously reported for the R. argyrotaenia group in Java and Bali where R. lateristriata and R. baliensis have been confounded for decades as recently revealed by re-examination of species boundaries and distribution through DNA barcodes 16 . Other morphologically similar species of the Rasbora argyrotaenia group have been previously confused with R. lateristriata, such as R. elegans, R. spilotaenia and R. chrysotaenia. These species are difficult to separate due to overlapping meristic counts and coloration patterns 40 . Our study, however, highlights that these species have disjunct range distributions (Fig. 5) and cluster into well-differentiated mitochondrial lineages (Fig. S1, Table S3). Several of the detected misidentifications also involve species from different Rasbora species groups 24 such as Rasbora dusonensis, from the R. argyrotaenia group, that has been previously mistaken for R. sumatrana from the sumatrana group and R. myersi, from the R. sumatrana group, that has been confounded with R. dusonensis from the argyrotaenia group. Despite being distantly related (Fig. S1), these species show overlapping meristic counts and similar coloration patterns with no dark spots on the body 40 . This result further calls for a broader assessment of the monophyly of the different Rasbora groups, previously identified by Liao 24 , as they are poorly supported by our study (Fig. S1).
The observed average ratio of 3.5 between intraspecific and interspecific distances is very low compared to earlier values found for the Javanese ichtyofauna, where minimum nearest neighbor genetic distances are on average 28-fold higher than the maximum intraspecific genetic distances 27 . This value is also very low in comparison to previous large-scale fish DNA barcode surveys [41][42][43][44][45][46] . This deviation can be attributed to a substantial amount of cryptic diversity revealed by our species delimitation analyses. For 61 species, delimitated on the basis of morphological characters, and validated by a match between species range distributions and type localities, we recovered a total of 166 OTUs. When accounting for this cryptic diversity the ratio between the minimum nearest neighbor and maximum intraspecific distances rose to 7.5. Earlier large scale surveys in Sundaland already pointed to substantial levels of cryptic diversity [28][29][30][31]33 and it has also been demonstrated that small-size species are more sensitive to fragmentation, experience faster genetic drift and as such accumulate cryptic diversity at a faster rate than large-size species 45,47 . Along the same line, small-size species are more frequently confounded and lumped together, a bias that tend to inflate the proportion of hidden diversity 48 .
We found very high numbers of OTUs with deep genetic divergences (up to 13.64% in Trigonopoma gracile) in a number of species (ranging from 7 to 11) such as in Rasbora bankanensis, Rasbora einthovenii, Rasbora trilineata, Trigonopoma gracile and Trigonopoma pauciperforatum. These five species also display some of the widest range distributions in Sundaland with OTUs occurring in Borneo, Sumatra, Peninsular Malaysia and several small islands across the Java sea (R. bankanensis, Fig. 5(16); R. einthovenii, Fig. 5(19); R. trilineata, Fig. 5(8); T. gracile, Fig. 5(5); T. pauciperforatum, Fig. 5(4)). However, the scarcity of OTU range overlap for those species suggests ongoing population fragmentation across the species range distribution (Tables S2 and S3). This pattern is likely connected to the complex geological history of Sundaland which over the last 10 Million years was influenced by the subduction activity of the Asian and Australian plates and the resulting intense volcanic activity which produced multiple volcanic arches 5 . Furthermore, climatic fluctuations during the Pleistocene induced major sea levels changes leading to merging of Sundaland landmasses during glacial maxima and multiple fragmentations during glacial sea level low-stands 7,8 . In such dynamic landscapes, complex patterns of distribution and high lineage diversity are to be expected 10 . The influence of eustatic fluctuations in Sundaland is exemplified by Rasbora bankanensis, Rasbora einthovenii, Rasbora trilineata, Trigonopoma gracile and Trigonopoma pauciperforatum all of which display wide range distributions among watersheds neighboring the Java sea. Those have been repeatedly connected during glacial maxima (Fig. 5 (5), 5(8), 5(16) and 5(19)). This pattern strongly contrasts with the lower genetic diversity and restricted range distribution of the species occurring in the Eastern part of Borneo such as Rasbora vaillantii (2020)  www.nature.com/scientificreports www.nature.com/scientificreports/ ( Fig. 5(10)), R. laticlavia (Fig. 5(10)), R. trifasciata (Fig. 5(15)) and R. reophila (Fig. 5(20)) or species occurring in the Western part of Sumatra such as Rasbora vulcanus (Fig. 5(9)), R. maninjau (Fig. 5(9)), R. jacobsoni (Fig. 5(9)), R. tawarensis (Fig. 5(10)); R. chrysotaenia (Fig. 5(11)) and R. arundinata (Fig. 5(11)) and species in Java and Bali such as Rasbora sp1 (Fig. 5(10)), R. sp2 (Fig. 5(14)), R. lateristriata (Fig. 5(14)) and R. baliensis (Fig. 5(14)). These parts of Borneo, Sumatra and partially Java were disconnected from the central region of Sundaland around the Java sea during the Pleistocene. This trend highlights the sensitive status of the endemic Rasborinae species in the peripheral areas of Sundaland due to their highly restricted distribution ranges. The present study also argues against translocation programs for the most widespread species, considering the high proportion of cryptic diversity, if species and OTUs identity are not determined through DNA barcodes 16,31 . conclusions The subfamily Rasborinae is the most diverse freshwater fish group of Sundaland and therefore represents an excellent model to explore the evolutionary response of local freshwater biotas to a dynamic geological history and repeated eustatic fluctuations. Affected by taxonomic confusions for decades, the genus Rasbora has been left www.nature.com/scientificreports www.nature.com/scientificreports/ aside of recent large-scale molecular studies aimed at exploring the diversification of aquatic biotas in Sundaland. Our comprehensive DNA barcode reference library for the subfamily enables further evolutionary studies on the diversification of the group, in particular within the genus Rasbora, which allowed us to trace evolutionary dynamics at the local scale in Sundaland 16 . The contrasting patterns of molecular diversity and species range distributions between Rasborinae species inhabiting the watersheds neighboring the Java sea and the species located on the Eastern part of Borneo call for a larger assessment of their dynamics of species proliferation based on broader genomic analyses. Clearly, future studies will also have to address the systematics of the Rasborinae as no evidence supporting the monophyly of Rasbora nor the different Rasbora species groups are detected here.

Material and Methods
Sampling and collection management. Material used in the present study is the result of a collective effort to assemble a global Rasborinae DNA barcode reference library through various field sampling efforts conducted by several of the coauthors in Sundaland over the past decade. Specimens were captured using gears such as electrofishing, seine nets, cast nets and gill nets across sites that encompass the diversity of freshwater lentic and lotic habitats in Sundaland (Fig. 2). Specimens were identified following original descriptions where available, as well as monographs 40,49 . Species names were further validated using several online catalogs 50,51 . Specimens were photographed, individually labeled and voucher specimens were preserved in a 5% formalin solution. Prior to fixation a fin clip or a muscle biopsy was taken and fixed separately in a 96% ethanol solution for further genetic analyses. Both tissues and voucher specimens were deposited in the national collections at the Museum Zoologicum Bogoriense (MZB), Research Center for Biology (RCB), Indonesian Institute of Sciences (LIPI).

Assembling a checklist of the Sundaland Rasborinae.
A checklist of the Rasborinae species occurring in Sundaland was assembled from available online catalogs including Fishbase 51 and Eschmeyer's Catalog of Fishes 50 as detailed in Hubert et al. 15 . This checklist was used to estimate the taxonomic coverage of the present DNA barcoding campaign and to identify type localities for each species. The following information was included: (1) authors of the original description, (2) type locality, (3) latitude and longitude of the type locality, (4) holotype and paratypes catalog numbers, (5) distribution in Sundaland. This information is available as online Supplementary Material (Table S1).
Sequencing and international repositories. Genomic DNA was extracted using a Qiagen DNeasy 96 tissue extraction kit following manufacturer's specifications. A 651-bp segment from the 5′ region of the cytochrome oxidase I gene (COI) was amplified using primers cocktails C_FishF1t1/C_FishR1t1 including M13 tails 52 . PCR amplifications were done on a Veriti 96-well Fast (ABI-AppliedBiosystems) thermocycler with a final volume of 10.0 μl containing 5.0 μl Buffer 2× 3.3 μl ultrapure water, 1.0 μl each primer (10 μM), 0.2 μl enzyme Phire Hot Start II DNA polymerase (5U) and 0.5 μl of DNA template (~50 ng). Amplifications were conducted as followed: initial denaturation at 98 °C for 5 min followed by 30 cycles denaturation at 98 °C for 5 s, annealing at 56 °C for 20 s and extension at 72 °C for 30 s, followed by a final extension step at 72 °C for 5 min. The PCR products were purified with ExoSap-IT (USB Corporation, Cleveland, OH, USA) and sequenced in both directions. Sequencing reactions were performed using the "BigDye Terminator v3.1 Cycle Sequencing Ready Reaction" and www.nature.com/scientificreports www.nature.com/scientificreports/ sequencing was performed on the automatic sequencer ABI 3130 DNA Analyzer (Applied Biosystems). DNA barcodes obtained at the Naturhistorisches Museum Bern were generated as previously described in Conte-Grand et al. 33 .
The sequences and associated information were deposited on BOLD 53 and are available in the data set DS-BIFRA (Table S2, dx.doi.org/10.5883/DS-BIFRA). DNA sequences were submitted to GenBank (accession numbers are accessible directly at the individual records in BOLD). An additional set of 106 Rasborinae COI sequences were downloaded from GenBank (Table S3).

Figure 5.
Maps depicting species distribution ranges as established based on the present sampling sites (black margin) and type localities (white margin) following the checklist generated for this study (Table S1) www.nature.com/scientificreports www.nature.com/scientificreports/ Genetic distances and species delimitation.   54 pairwise genetic distances were calculated using the R package Ape 4.1 55 . Maximum intraspecific and nearest neighbor genetic distances were calculated from the matrice of pairwise K2P genetic distances using the R package Spider 1.5 56 . We checked for the presence of a barcoding gap, i.e. the lack of overlap between the distributions of the maximum intraspecific and the nearest neighbor genetic distances 57 , by plotting both distances and examining their relationships on an individual basis instead of comparing both distributions independently 58 . A neighbor-joining (NJ) tree was built based on K2P distances using PAUP 4.0a 59 in order to visually inspect genetic distances and DNA barcode clusters (Fig. S1). This NJ tree was rooted using Sundadanio retiarius.
Several alternative methods have been proposed for delimitating molecular lineages [60][61][62][63] . Each of these methods have pitfalls, particularly when it comes to singletons (i.e. delimitated lineages represented by a single sequence) and a combination of different approaches is increasingly used to overcome potential pitfalls arising from uneven sampling 16,43,[64][65][66] . We used four different sequence-based methods of species delimitation. For the sake of clarity, we refer to species identified based on morphological characters as species while species delimited using DNA sequences are referred to as Operational Taxonomic Unit (OTU) [67][68][69] . OTUs were delimitated using the following algorithms: (1) Refined Single Linkage (RESL) as implemented in BOLD and used to generate Barcode Index Numbers (BIN) 62 , (2) Automatic Barcode Gap Discovery (ABGD) 61 , (3) Poisson Tree Process (PTP) in its multiple rates version (mPTP) as implemented in the stand-alone software mptp_0.2.3 63,70 , (4) General Mixed Yule-Coalescent (GMYC) in its multiple rate version (mGMYC) as implemented in the R package Splits 1.0-19 71 . RESL and ABGD used DNA alignments as input files while a ML tree was used for mPTP and a Bayesian Chronogram based on a strict-clock model using a 1.2% of genetic distance per million year 72 for mGMYC. The mPTP algorithm uses a phylogenetic tree as an input file, thus, a maximum likelihood (ML) tree was first reconstructed using RAxML 73 based on a GTR + Γ substitution model. Then, an ultrametric and fully resolved tree was reconstructed using the Bayesian approach implemented in BEAST 2.4.8 74 . Two Markov chains of 50 millions each were ran independently using the Yule pure birth model tree prior, a strict-clock model and a GTR + I + Γ substitution model. Trees were sampled every 10,000 states after an initial burnin period of 10 millions. Both runs were combined using LogCombiner 2.4.8 and the maximum credibility tree was constructed using TreeAnnotator 2.4.7 74 . Identical haplotypes were pruned for further species delimitation analyses.