DNA-based taxonomy of a mangrove-associated community of fishes in Southeast Asia

The Merbok Estuary comprises one of the largest remaining mangrove forests in Peninsular Malaysia. Its value is significant as it provides important services to local and global communities. It also offers a unique opportunity to study the structure and functioning of mangrove ecosystems. However, its biodiversity is still partially inventoried, limiting its research value. A recent checklist based on morphological examination, reported 138 fish species residing, frequenting or subject to entering the Merbok Estuary. In this work, we reassessed the fish diversity of the Merbok Estuary by DNA barcoding 350 specimens assignable to 134 species initially identified based on morphology. Our results consistently revealed the presence of 139 Molecular Operational Taxonomic Units (MOTUs). 123 of them are congruent with morphology-based species delimitation (one species = one MOTU). In two cases, two morphological species share the same MOTU (two species = one MOTU), while we unveiled cryptic diversity (i.e. COI-based genetic variability > 2%) within seven other species (one species = two MOTUs), calling for further taxonomic investigations. This study provides a comprehensive core-list of fish taxa in Merbok Estuary, demonstrating the advantages of combining morphological and molecular evidence to describe diverse but still poorly studied tropical fish communities. It also delivers a large DNA reference collection for brackish fishes occurring in this region which will facilitate further biodiversity-oriented research studies and management activities.


DNA-based delimitation.
Sequence length for all 350 generated barcodes was longer than 600 bp with no indels or stop codon detected. The nucleotide composition showed a mean percentage of 18.32% (G), 27.97% (C), 24.07% (A), and 30.7% (T). More than half of the species (56%, 76 species) were represented by multiple specimens while 59 species were represented by a single specimen (Table 1). Mean number of specimens per species was 2.59. Increment in the K2P genetic divergence was directly related to the hierarchical taxonomic relationship: within species mean divergence = 0.85% (SE = 0.01), within congeners mean divergence = 16.7%, (SE = 0.01) and within families mean divergence = 18.17% (SE = 0) ( www.nature.com/scientificreports/ (4.12%) ( Table 3). Barcoding gap analysis demonstrated that almost all species represented by multiple sequences are supported by a barcode gap (Fig. 3). Notably, only one species, D. indicium, had its maximum intraspecific distance (9.05%) similar to its nearest neighbour distance (9.04%). Both Bayesian Inference (BI) (Fig. 4) and Maximum Likelihood (ML) ( Figure S1) trees were fully resolved exhibiting minimal differences in topologies. Node-supports in the BI tree were overall higher than in ML tree leading us to use the BI tree to visualize our Molecular Operational Taxonomic Unit (MOTU) delimitation results (Fig. 4). The three MOTU delimitation analyses (using RESL, ABGD and GMYC methods) yielded moderately variable numbers of MOTUs, although always higher than our initial 134 morphology-based species. The RESL analysis revealed 139 MOTUs assigned to dedicated BINs. The ABGD analysis identified the same 139 MOTUs (P = 0.0010-0.0599) within the initial partition for all substitution models (Table S2). The single-threshold GMYC analysis recognised 140 MOTUs that were taxonomically concordant with those obtained with the other two    www.nature.com/scientificreports/ analyses except for one species, Hyporhamphus quoyi, that is partitioned into two MOTUs. All incongruences between MOTUs and morphology-based species delimitation are highlighted in Fig. 4 (red bars) and detailed in Table 3. In seven (eight with GMYC) cases, two MOTUs were delimitated within one morphology-based species (see above the case of Hyporhamphus quoyi with GMYC). In two occasions, we found two of our morphologybased species sharing the same MOTU: Alepes melanoptera and Caranx sexfasciatus (BIN "BOLD:AAB5775") and Dichotomyctere nigroviridis and Dichotomyctere cf. fluviatilis (BIN "BOLD:AAF2344") ( Table 3). Within each of these two species-pairs, interspecific genetic divergence was < 2% resulting in the recognition of only one MOTU.

Discussion
Species delimitation. One of the premises of DNA barcoding is the detection of the so-called "barcode gap", which can be estimated in comparing the maximum intraspecific distance with the minimum interspecific distance (also known as the nearest neighbour genetic distance) 39 . The presence of a gap within a morphological species is good evidence for species-level cryptic diversity 40 . However, the absence of gap between two morphological species is indicative either that they are different forms within one species or of shared ancestral polymorphism and/or hybridization followed by introgression between these two species. In this case, a multi-gene (i.e. genomic) approach will help to determine the reciprocal taxonomic status of the two morphological species. Employing multiple "automatic species delimitation" methods and schemes in clustering the generated DNA barcodes provide an efficient approach in identifying putative species (= MOTUs). Even though these methods may have individual pitfalls, especially in analysing singletons, they can yield a robust outcome when combined 41 . Despites different analytical assumptions supporting each method, all three methods yielded similar results: RESL and ABGD analyses delimitated each 139 MOTUs in our dataset whereas the GMYC analysis identified 140 MOTUs. These results demonstrate a robust pattern of MOTUs in our dataset; even the GMYC method which is known to overestimate MOTUs counts compared to other methods 42 , delimitated only one additional MOTU. Because both RESL and ABGD analyses had closer correspondence to the number of species defined by morphological identification, we based our discussion on species account on these two methods.
Our results show that DNA barcoding (using COI gene) and morphology-based approach converge on the delimitation of 123 species (about 90% of the examined species) in Merbok Estuary region. DNA barcoding approach further revealed possible cryptic diversity within six species whereas it did not detect significant difference between two pairs of morphological species. Such results call for further taxonomic studies.
The mean conspecific K2P divergence (0.85%) was 20-fold lower than the mean congeneric divergence (16.7%). This increase in genetic divergence with increment in taxonomic levels is logical 35 . However, both mean genetic estimates are higher than those previously recorded in other regions. Most molecular assessment of marine fishes displayed conspecific divergence within the range of 0.25-0.39% whereas congeneric divergence were within the range of 4.56-9.93% 22-24,36,43 , but 25 found similar pattern of high average conspecific and congeneric divergence within the Indo-Pacific coral reef fishes (1.06% and 15.34%, respectively).
Taxonomic conundrum. We found that seven of our morphological species comprised two MOTUs: Eleutheronema tetradactylum (inter-MOTU COI-based genetic distance = 16.66%), Osteomugil perusii (14.24%), Planiliza subviridis (13.44%), Deveximentum indicium (9.05%), Lagocephalus lunaris (5.62%), Gerres oyena (4.29%) and Lutjanus russellii (4.12%). Such high intraspecific genetic divergence suggests either misidentification or the presence of morphologically cryptic species 25,44 . The first possibility is unlikely because the morphological examination of incriminated specimens, based on existing keys, seems consistent. Therefore, such genetic variability may more likely be the signal of hidden diversity. Large genetic differentiation has been reported in E. tetradactylum (family Polynemidae) among allopatric populations within the Indian Ocean 45 . Our results are consistent with 45 , further indicating that differentiation in this lineage is not only allopatrically but, also, sympatrically distributed. Recent molecular taxonomic studies on the family Mugilidae in which are included O. perusii and P. subviridis, evidenced a very high level of cryptic diversity in the Indo-West Pacific region 46,47 . Several mullet species (P. subviridis and O. perusii are among them) are, actually, each, a complex of several morphologically similar species for which extensive taxonomic revisions are needed. The taxonomy of D. indicium (family Leiognathidae) is still in flux with continual descriptions of new species in several genera, including Deveximentum 48 . The taxonomy of the genus Lagocephalus is difficult and the current identification key is likely incomplete making the delimitating between species challenging. Our results indicate the presence of two sympatric species under D. indicium in Merbok Estuary. Gerres oyena (family Gerreidae) and L. russellii (family Lutjanidae) exhibit intraspecific differentiation of lower magnitude than those observed for the first five species discussed above, although still well above the threshold of 2%. Lutjanus russellii natively occurs in this region 49 but it is also farmed in Merbok estuary. Aquaculture activities regularly import non-native seeds from various sources, with no or poor records of origins. The divergence observed within this species (4.12%) could be the consequence of the presence of both native and alien (escaped from aquaculture farms) individuals in Merbok estuary 15 .
Two cases of shared MOTUs between species were detected involving the pairs Alepes melanoptera and Caranx sexfasciatus (BOLD:AAB5775), and Dichotomyctere nigroviridis and Dichotomyctere cf. fluviatilis (BOLD:AAF2344). The first case is striking because A. melanoptera and C. sexfasciatus are morphologically easily distinguishable (specimens are housed in the USMFC collections and available for morphological verification) and the two COI sequences (one from each of these two species) are only slightly different, which seems to exclude the possibility of a contamination. This observation warrants future investigation based on more specimens.
The second case is interesting because the marking patterns of the specimens of D. nigroviridis and D. cf. fluviatilis are distinctly different 15 . However, the genetic distance between these two species is only 1.1%. We hypothesize that, in this case, the COI-based genetic differentiation (< 2%) between D. nigroviridis and D. cf. www.nature.com/scientificreports/ fluviatilis does not reflect their actual taxonomical status. Recent hybridisation among these two closely related species and incomplete lineage sorting of a recent, on-going speciation event could account for this observation 50 .

Scientific Reports
Guimarães-Costa et al. 51 who studied the fish diversity in the Parnaíba Delta, also suggested that the rate of molecular variation does not necessary accompany recent (sympatric) speciation event that lead to morphological differentiation.
Towards the establishment of a comprehensive DNA barcoding library of the fish community of Merbok Estuary. Precise identification of organisms is a prerequisite for assessing the biological and ecological status of an ecosystem. The current study illustrates yet another example of the complementarity of the morphological and molecular techniques to achieve this goal. DNA barcoding offers a quick and easy approach in aquatic diversity assessment and requires minimal expertise in conventional taxonomy 52,53 . Comprehensive DNA barcode reference library is crucial in any biodiversity assessment for providing selective autecological and biogeographic information for comparative analysis with previous assessment. Even though DNA databases like BOLD 54 and GenBank 55 are publicly available, a localised taxon-specific reference library is synoptically important as it is easier to curate and is a more practical reference for a focused site.
Our DNA barcodes reference library associated with voucher collections previously established 15 can be used for further biological evaluation and biomonitoring effort in Merbok Estuary and nearby regions. Future research endeavours to assess ecosystem health status in which a reference DNA barcoding library is needed, such as COI-based environmental DNA (eDNA) surveys or metabarcoding assays, can use this database. The barcode data generated in this study will contribute to the local as well as regional conservation efforts of fish diversity. Notwithstanding, to improve the resolution of the taxonomic coverage of the mangrove-associated of the fish community of Merbok Estuary, the number of DNA barcodes for the singleton specimens and also the not-yet examined species should be increased through more sampling and increased number of sites within the estuary and around.
Of the 134 species examined in this study, 61 species (~46%) were identified with high commercial value 56 . Protection planning and proper fishery management of these species are vital. Furthermore, we manage to barcode an invasive species-the Mozambique tilapia, Oreochromis mossambicus; its monitoring should be conducted either using traditional methods or eDNA methods.
We DNA barcoded a rich and diverse mangrove-associated fish community. Of the 134 species initially identified based on morphology, barcodes of 123 species support their validity. We found hidden diversity within seven species whereas the divergences between two pairs of valid species are below the interspecific threshold standard calling for further taxonomic studies. The comparison with previous species lists in and around this region 49 shows that our taxonomic coverage in Merbok Estuary is certainly not complete, although the degree of incompleteness is unknown. Further researches are needed to expand the results of this study, especially towards small, elusive, transient and non-commmercial fish species. The establishment of a local DNA barcoding reference library is an essential step for future studies of fisheries, conservation and ecological management of this important site.

Methods
Ethics statement. This project was conducted according to the relevant national and international guidelines and did not involve any endangered or protected fish species. All fish specimens were either collected from the local fishermen, caught using non-invasive fishing gear by the authors, or bought from the local market. This study was carried out following the recommendations and approval by the Universiti Sains Malaysia Animal Ethics Committee.

Sample collection.
A total of 441 specimens were sampled between December 2018 to October 2019 at multiple locations along the Merbok Estuary and its vicinity (Fig. 1). Specimens were collected either from local fishermen (who use the barrier-net method locally called 'pompang'), direct sampling by dip-net or bought from the major fish landing site (Kuala Muda Whispering Market). All specimens were caught within Merbok River and its adjacent waters. Samples collected from the fish landing site were retrieved from fishing vessels that operate within Zone A (from the shoreline up to 5 nautical miles) and Zone B (from 5 to 12 nautical miles) 57 . Information on the sampling localities (geographical coordinates) is shown in Table S1. Other collection data-dates, taxonomy and details of voucher specimens can be retrieved from the online project datasheet implemented in BOLD with project code-DBMR.
Sample processing and morphological identification. A fin clip from each fresh specimen was taken and stored in 90% ethanol. Voucher specimens were fixed in 10% formalin for at least one week and then transferred into 70% ethanol for long term storage. All specimens were catalogued and deposited at the Museum of Biodiversity, Universiti Sains Malaysia.
Morphology-based species identifications and nomenclature follow 15 15 ). We were unable to unequivocally assigned few specimens to a valid described species using available keys. In these cases, we used either "sp. " or "cf. ". www.nature.com/scientificreports/ We did not barcode five species listed in 15 : Sardinella gibbosa, Zenarchopterus buffonis, Gerres macracanthus, Drepane longimana, and Johnius belangerii, but we sequenced one specimen of Cryptocentrus sp., which was not listed in 15 . A total of 134 morphological species were considered in this study (Table 1).

Laboratory analyses.
Genomic DNA was extracted using DNeasy Blood & Tissue kit (Qiagen, Germany) following the given protocol of animal tissue DNA extraction. The purity and concentration of the isolated DNA were measured using a microvolume UV spectrophotometer (Quawell Q300, Quawell, CA) and stored at − 20 °C until further use. An approximately 650 bp fragment of the mitochondrial COI gene region was amplified using the combinations of the following primers previously designed by 22  Each sample was amplified in a final volume of 25 µL, containing 5.5 µL of 5x MyTaq™ Reaction Buffer Red (Bioline GmbH, Germany), 0.5 µL of each primer (100 ng/µL), 0.25 µL 5U Taq polymerase (iNtRON Biotechnology Inc., Korea), 2.5 µL of genomic DNA (50 ng/µL) and adequate nuclease-free water to complete the final reaction volume. Each amplification set was performed with the inclusion of a negative control (no template DNA) with thermal cycling conditions as follows: initial denaturation at 94 °C for 4 min; followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 48 °C for 50 s, and extension at 72 °C for 1 min; then a final extension at 72 °C for 10 min. The PCR products were then fractioned by 2% gel electrophoresis to check for successful amplification. All positive amplifications were then sent for purification and sequencing to Apical Scientific Sdn. Bhd. (Selangor, Malaysia) operating the ABI PRISM 3730XL automated sequencer and the ABI PRISM BigDye terminator cycle sequencing kit v3.1 (Applied Biosystems, Foster City, CA). Bidirectional sequencing was employed to decrease the probability of sequencing errors.
Data analyses. Each generated chromatogram was manually screened prior to DNA alignment in MEGA X 58 . The sequences were proofread and independently aligned and then inspected for deletions, insertions and stop codons using the same software.
A total of 350 COI sequences were determined in this study. To assess the taxon discrimination between all specimens, pairwise genetic distances were calculated within and between species, genera, and families based on the Kimura 2-parameter (K2P) distance model 59 using the analytical tools available in the BOLD system platform. To depict a graphical representation of the genetic relationships of the sequences, Bayesian Inference (BI) and Maximum Likelihood (ML) analyses were run in BEAST 2 60 and raxmlGUI 2.0 61 program, respectively. The GTR+I+G substitution model was determined as the best one in PartitionFinder 2 62 , as implemented in the CIPRES portal 63 . The BI tree was constructed with the GTR+I+G substitution model, empirical base frequencies with four gamma categories, employing a relaxed lognormal clock and the birth-death model. Two Markov Chain Monte Carlo (MCMC) chains of 40 million were run independently, sampled every 1000 generations and the first 20% were discarded as burn-in. Both run performances were then assessed for convergence (ESS > 200) using Tracer 1.7.1 and combined using LogCombiner 2.4.8 before the final tree was constructed using TreeAnnotator 2.4.7, within the BEAST 2 package 60 . The ML tree was also built based on the GTR+I+G model with 1000 nonparametric bootstrap replicates. Both constructed trees were then viewed and edited in FigTree 1.4.4 64 .
Three different sequence-based methods were used to delimit the Molecular Operational Taxonomic Units (MOTUs) from the analysed sequences-(1) Refined Single Linkage (RESL), (2) Automatic Barcode Gap Discovery (ABGD), and (3) Generalized Mixed Yule Coalescent (GMYC). The first analysis was done within the BOLD platform using the RESL algorithm 65 to assign sequences to a dedicated Barcode Index Numbers (BINs which are MOTUs). Next, the ABGD 39 analysis was run at the webserver (https:// bioin fo. mnhn. fr/ abi/ public/ abgd/ abgdw eb. html) to census divergence within the analysed dataset for species delimitation. The ABGD analysis was run with the following settings: relative gap width X=1.0, intraspecific divergence (P) values range from 0.001 to 0.0059 for all the distance metrics, while all other parameter values were kept as default. Finally, the GMYC method 66 was employed with the fully resolved, BI ultrametric tree using only unique haplotypes (see above for the reconstruction method). The haplotype dataset used in the GMYC analysis was built in collapsing all 350 individual COI sequences into 258 unique haplotype sequences using ALTER 67 . A single-threshold GMYC analysis was run in RStudio 68 with the 'splits' package 69 .

Data availability
All the COI sequences determined in this study have been uploaded in BOLD 54 under the DBMR project (dx. doi.org/10.5883/DS-SRDBMR) and deposited in GenBank 55 (Accession nos. MW498499-MW498843). www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.