Introduction

Estuaries and coastal wetlands that feature mangrove ecosystems are transition zones linking terrestrial and freshwater habitats with the sea1. Mangrove ecosystems deliver essential ecosystem services, including shoreline protection, nutrient production and fisheries resources. Consequently, they play a vital role in supporting the socio-economic pursuits of local communities2. Unfortunately, this crucial human-nature relationship is threatened by habitat pollution and destruction, and by overfishing3. It is also affected by other factors such as species invasions and climate change1.

Less disturbed tropical estuaries, especially their mangrove areas, generally harbour rich, unique and complex faunal communities4 that combine salinity-tolerant resident species with regular or occasional visitors. The visitors are mainly marine species that use this ecosystem to feed, shelter, breed or nurse their young5. Inventorying and monitoring biodiversity in these ecosystems is essential for long-term sustainability because biodiversity, through complex species-species interactions, confers stability and resistance to disturbance and potential invasion1. However, biodiversity is still poorly documented in many mangrove ecosystems, particularly those of Southeast Asia, which hampers further research on their functioning and management.

Malaysia is part of the Sundaland biodiversity hotspot, which is recognized for its astounding levels of diversity and endemism6,7. Considering only fishes, ref. 8 reported a total of 1418 marine and brackish species in Malaysian waters, occupying various coastal habitats, including the threatened mangrove ecosystems9. One of the largest remaining intact patches of mangrove forest is located within the Merbok Estuary, north-west Peninsular Malaysia, facing the Strait of Malacca (Fig. 1). The estuary was gazetted as a permanent forest reserve, the Sungai Merbok Mangrove Forest Reserve, in 1951, and is the second largest mangrove forest in Peninsular Malaysia after the Larut Matang Forest Reserve. The Merbok Estuary and its surroundings constitute a dynamic and productive ecosystem, which supports the world’s highest mangrove species diversity per unit area within a contiguous habitat, with 39 of the estimated 70 true mangrove species described globally10. This area also represents important resource grounds for local populations11,12.

Figure 1

Sampling localities across the study area, which covers the Merbok Estuary (Merbok River) and Muda River. Sampling sites: 1, Kuala Muda Whispering Market; 2, Pompang Sungai Merbok; 3 and 4, Pompang Batu Lintang; 5, Semeling Bridge. The inset map shows the location of the study area within Peninsular Malaysia. Maps were generated using QGIS v.3.4.11 and edited in Adobe Photoshop CC 2019.

Owing to its biological, ecological and socio-economic importance, the Merbok Estuary has been the focus of research over the last two decades, including several biodiversity inventories (trees and gastropods13; shrimps11; fishes12,14; mangrove trees10). The latest ichthyological survey inventoried 138 fish species from 47 families in the estuary and adjacent marine environment, revealing a rich fish fauna15. However, because of taxonomic uncertainties when morphological characters are considered alone, the identification of some species was challenging, especially in speciose families such as Mugilidae, Gobiidae and Eleotridae15. Furthermore, cryptic diversity is frequently encountered in highly biodiverse tropical regions16,17, and some morphology-based species may conceal more than one species. In Merbok as elsewhere, a precise account of species diversity is a prerequisite for further research, and numerous studies have highlighted the complementarity of morphological and molecular approaches in revealing biodiversity18,19,20. To date, no attempt has been made to compare morphology-based results on fish diversity with a genetics-based approach in this species-rich mangrove community.

Since its introduction in past decades, DNA barcoding has emerged as a global molecular taxonomic method for fishes, based on a standard molecular marker: a ~650 base pair fragment of the mitochondrial cytochrome c oxidase subunit I gene (COI)21. Several regional DNA barcoding studies have demonstrated its efficacy in delimiting marine fish species, for instance in Australia22, the South China Sea23 and the Indian Ocean24, and among Indo-Pacific coral reef fishes25. DNA barcoding has proven to be a reliable method for detecting cryptic and potentially new fish species26,27,28,29, identifying larval fishes30,31,32,33, and tracing food origins34.

In this study, we assemble a reference library of DNA barcodes from 350 fish individuals from the Merbok Estuary and its adjacent waters, in order to describe the fish diversity of this region and to complement previous morphology-based results. Comprehensive species lists built on integrative taxonomy have wide applications, including ecosystem health management, biodiversity monitoring and conservation, and aquaculture and fishery management35,36. All of these uses pertain to the Merbok Estuary.

Results

Fish diversity

A total of 350 specimens (out of 441 collected) were successfully sequenced for the COI gene, representing 134 morphological species, 94 genera, 47 families, 17 orders, and two classes, Chondrichthyes and Actinopterygii (taxonomic list shown in Table 1). Two of these species (Cryptocentrus sp. and Johnius sp.) could only be identified to genus level using morphology, whereas for three other species, Dichotomyctere cf. fluviatilis, Brachygobius cf. kabiliensis, and Cynoglossus cf. cynoglossus, we used open nomenclature.

Table 1 List of morphology-based species from Merbok Estuary region studied through DNA barcoding with the number of specimens examined (n), the BOLD IDs of their respective COI sequences, and the museum catalogue numbers of each species.

The most diverse order was Perciformes (42 species, representing 31.1% of the total number of species), followed by Carangiformes (21 species, 15.5%), Gobiiformes (18 species, 13.3%), and Clupeiformes (16 species, 11.9%) (Fig. 2a). At the family level, Gobiidae had the highest species richness with 15 species (11.1%), followed by Carangidae (12 species, 8.9%), Engraulidae (9 species, 6.7%), and Ariidae (8 species, 5.9%) (Fig. 2b). The most diverse genera were the anchovy genus Stolephorus and the flatfish genus Cynoglossus, with five species each, followed by the grouper genus Epinephelus with four species. According to the International Union for Conservation of Nature (IUCN) Red List, five species are “Near Threatened” (four chondrichthyans: Telatrygon zugei, Brevitrygon walga, Gymnura poecilura and Chiloscyllium indicum; and one actinopterygian: Arius gagora), whereas the others are listed as “Least Concern” or “Data Deficient”. One recorded species, Oreochromis mossambicus (the Mozambique tilapia), is an alien invasive species (AIS) originating from the African region37.

Figure 2

Species count rankings according to (a) orders and (b) families recorded in this study.

DNA-based delimitation

Sequence length for all 350 generated barcodes was longer than 600 bp, with no indels or stop codons detected. The nucleotide composition showed mean percentages of 18.32% (G), 27.97% (C), 24.07% (A), and 30.7% (T). More than half of the species (56%, 76 species) were represented by multiple specimens, while 59 species were represented by a single specimen (Table 1). The mean number of specimens per species was 2.59. K2P genetic divergence increased with taxonomic level: mean divergence within species = 0.85% (SE = 0.01), within congeners = 16.7% (SE = 0.01) and within families = 18.17% (SE = 0) (Table 2).
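As an illustration of how a nucleotide composition summary of this kind can be computed, the following minimal sketch pools base counts over a set of sequences. The function name and the toy sequences are hypothetical and are not data from this study; pooling all bases (rather than averaging per-sequence percentages) is an assumption made here for simplicity.

```python
from collections import Counter


def base_composition(sequences):
    """Return the percentage of A, C, G and T pooled over all sequences.

    Gaps and ambiguity codes are ignored. Pooling across sequences is an
    assumption of this sketch, not necessarily what BOLD or MEGA reports.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


# Hypothetical toy sequences (not from the Merbok dataset).
toy_sequences = ["ACGTTTCC", "ACGTTACC"]
composition = base_composition(toy_sequences)
for base, percent in composition.items():
    print(f"{base}: {percent:.2f}%")
```

By construction, the four percentages returned by this sketch sum to 100.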

Table 2 K2P divergence values from 350 analysed specimens with increasing taxonomic levels.

Deep intraspecific K2P divergences, exceeding the standard threshold distance of 2% (refs. 21, 38), were observed in seven species: Eleutheronema tetradactylum (16.66%), Osteomugil perusii (14.24%), Planiliza subviridis (13.44%), Deveximentum indicium (9.05%), Lagocephalus lunaris (5.62%), Gerres oyena (4.29%) and Lutjanus russellii (4.12%) (Table 3). Barcoding gap analysis demonstrated that almost all species represented by multiple sequences were supported by a barcode gap (Fig. 3). Notably, only one species, D. indicium, had a maximum intraspecific distance (9.05%) similar to its nearest-neighbour distance (9.04%).

Table 3 List of morphological species comprising two MOTUs (= BINs) or sharing one MOTU. The summary statistics include the BIN of each MOTU, their maximum intraspecific distance and distance to the nearest neighbour (i.e. minimum interspecific distance).
Figure 3

Scatterplot of maximum intraspecific K2P distances vs. the nearest neighbour K2P distances.

Both the Bayesian Inference (BI) (Fig. 4) and Maximum Likelihood (ML) (Figure S1) trees were fully resolved and showed minimal differences in topology. Node supports were overall higher in the BI tree than in the ML tree, so we used the BI tree to visualize our Molecular Operational Taxonomic Unit (MOTU) delimitation results (Fig. 4). The three MOTU delimitation analyses (RESL, ABGD and GMYC) yielded slightly different numbers of MOTUs, all higher than our initial 134 morphology-based species. The RESL analysis revealed 139 MOTUs assigned to dedicated BINs. The ABGD analysis identified the same 139 MOTUs (P = 0.0010–0.0599) within the initial partition for all substitution models (Table S2). The single-threshold GMYC analysis recognised 140 MOTUs that were taxonomically concordant with those obtained with the other two analyses except for one species, Hyporhamphus quoyi, which was partitioned into two MOTUs. All incongruences between MOTUs and morphology-based species delimitation are highlighted in Fig. 4 (red bars) and detailed in Table 3. In seven cases (eight with GMYC, which additionally split Hyporhamphus quoyi), two MOTUs were delimited within one morphology-based species. On two occasions, two of our morphology-based species shared the same MOTU: Alepes melanoptera and Caranx sexfasciatus (BIN “BOLD:AAB5775”), and Dichotomyctere nigroviridis and Dichotomyctere cf. fluviatilis (BIN “BOLD:AAF2344”) (Table 3). Within each of these two species pairs, interspecific genetic divergence was < 2%, resulting in the recognition of only one MOTU.

Figure 4

Bayesian Inference gene tree based on the 350 DNA barcodes with delineated MOTUs. Colour bars indicate (from left to right): morphological species (blue), MOTUs delineated by RESL (orange), ABGD (purple), and GMYC (green). Red bars indicate discrepancies among the different schemes (either morphology-genetics discrepancies or genetics-genetics discrepancies).

Discussion

Species delimitation

One of the premises of DNA barcoding is the detection of the so-called “barcode gap”, which can be assessed by comparing the maximum intraspecific distance with the minimum interspecific distance (also known as the nearest-neighbour genetic distance)39. The presence of a gap within a morphological species is good evidence for species-level cryptic diversity40. Conversely, the absence of a gap between two morphological species indicates either that they are different forms of one species, or that they share ancestral polymorphism and/or have hybridized with subsequent introgression. In such cases, a multi-gene (i.e. genomic) approach will help to determine the taxonomic status of the two morphological species relative to each other.
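The comparison described above can be sketched in a few lines. This is only an illustrative check, not the analysis run on the BOLD platform; the function name and the second set of values are hypothetical, whereas the D. indicium distances (9.05% and 9.04%) and the 2% threshold are taken from the text.

```python
THRESHOLD_PERCENT = 2.0  # conventional K2P cut-off cited in the text


def assess_species(max_intraspecific, nearest_neighbour,
                   threshold=THRESHOLD_PERCENT):
    """Flag deep intraspecific divergence and the presence of a barcode gap.

    Distances are K2P percentages. A barcode gap is taken here to mean that
    the nearest-neighbour (minimum interspecific) distance exceeds the
    maximum intraspecific distance.
    """
    return {
        "deep_intraspecific": max_intraspecific > threshold,
        "barcode_gap": nearest_neighbour > max_intraspecific,
    }


# Values reported in this study for Deveximentum indicium.
d_indicium = assess_species(9.05, 9.04)

# Hypothetical species with low intraspecific variation (illustration only).
typical = assess_species(0.5, 12.0)

print(d_indicium)
print(typical)
```

With these inputs, D. indicium is flagged as deeply divergent and lacking a barcode gap, whereas the hypothetical species shows a clear gap.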

Employing multiple “automatic species delimitation” methods to cluster the generated DNA barcodes provides an efficient approach to identifying putative species (= MOTUs). Even though each method may have its own pitfalls, especially when analysing singletons, they can yield a robust outcome when combined41. Despite the different analytical assumptions underlying each method, all three yielded similar results: the RESL and ABGD analyses each delimited 139 MOTUs in our dataset, whereas the GMYC analysis identified 140 MOTUs. These results demonstrate a robust pattern of MOTUs in our dataset; even the GMYC method, which is known to overestimate MOTU counts compared with other methods42, delimited only one additional MOTU. Because the RESL and ABGD results corresponded more closely to the number of species defined by morphological identification, we based our discussion of species accounts on these two methods.

Our results show that DNA barcoding (using the COI gene) and the morphology-based approach converge on the delimitation of 123 species (about 90% of the examined species) in the Merbok Estuary region. DNA barcoding further revealed possible cryptic diversity within seven species, whereas it did not detect significant differences between two pairs of morphological species. Such results call for further taxonomic studies.

The mean conspecific K2P divergence (0.85%) was about 20-fold lower than the mean congeneric divergence (16.7%). Such an increase in genetic divergence with taxonomic level is expected35. However, both mean estimates are higher than those previously recorded in other regions. Most molecular assessments of marine fishes have reported conspecific divergences in the range of 0.25–0.39% and congeneric divergences in the range of 4.56–9.93% (refs. 22, 23, 24, 36, 43), although ref. 25 found a similar pattern of high average conspecific and congeneric divergence in Indo-Pacific coral reef fishes (1.06% and 15.34%, respectively).

Taxonomic conundrum

We found that seven of our morphological species comprised two MOTUs: Eleutheronema tetradactylum (inter-MOTU COI-based genetic distance = 16.66%), Osteomugil perusii (14.24%), Planiliza subviridis (13.44%), Deveximentum indicium (9.05%), Lagocephalus lunaris (5.62%), Gerres oyena (4.29%) and Lutjanus russellii (4.12%). Such high intraspecific genetic divergence suggests either misidentification or the presence of morphologically cryptic species25,44. The first possibility is unlikely because the morphological examination of the specimens concerned, based on existing keys, appears consistent. Such genetic variability is therefore more likely a signal of hidden diversity. Large genetic differentiation has been reported in E. tetradactylum (family Polynemidae) among allopatric populations within the Indian Ocean45. Our results are consistent with ref. 45 and further indicate that the divergent lineages within this species occur not only in allopatry but also in sympatry. Recent molecular taxonomic studies of the family Mugilidae, which includes O. perusii and P. subviridis, have revealed a very high level of cryptic diversity in the Indo-West Pacific region46,47. Several mullet species (P. subviridis and O. perusii among them) are in fact each a complex of several morphologically similar species for which extensive taxonomic revisions are needed. The taxonomy of D. indicium (family Leiognathidae) is still in flux, with continual descriptions of new species in several genera, including Deveximentum48. The taxonomy of the genus Lagocephalus is difficult, and the current identification key is likely incomplete, which makes species delimitation challenging. Our results indicate the presence of two sympatric species under D. indicium in Merbok Estuary. Gerres oyena (family Gerreidae) and L. russellii (family Lutjanidae) exhibit intraspecific differentiation of lower magnitude than that observed for the first five species discussed above, although still well above the 2% threshold. Lutjanus russellii occurs natively in this region49 but is also farmed in the Merbok Estuary. Aquaculture activities regularly import non-native seed from various sources, with poor or no records of origin. The divergence observed within this species (4.12%) could result from the presence of both native and alien individuals (the latter escaped from aquaculture farms) in the Merbok Estuary15.

Two cases of shared MOTUs between species were detected, involving the pairs Alepes melanoptera and Caranx sexfasciatus (BOLD:AAB5775), and Dichotomyctere nigroviridis and Dichotomyctere cf. fluviatilis (BOLD:AAF2344). The first case is striking because A. melanoptera and C. sexfasciatus are easily distinguishable morphologically (specimens are housed in the USMFC collections and available for morphological verification) and the two COI sequences (one from each of these two species) are only slightly different, which seems to exclude the possibility of contamination. This observation warrants further investigation based on more specimens.

The second case is interesting because the marking patterns of the specimens of D. nigroviridis and D. cf. fluviatilis are distinctly different15. However, the genetic distance between these two species is only 1.1%. We hypothesize that, in this case, the COI-based genetic differentiation (< 2%) between D. nigroviridis and D. cf. fluviatilis does not reflect their actual taxonomic status. Recent hybridisation between these two closely related species and incomplete lineage sorting following a recent, ongoing speciation event could account for this observation50. Guimarães-Costa et al.51, who studied fish diversity in the Parnaíba Delta, also suggested that molecular variation does not necessarily accompany recent (sympatric) speciation events that lead to morphological differentiation.

Towards the establishment of a comprehensive DNA barcoding library of the fish community of Merbok Estuary

Precise identification of organisms is a prerequisite for assessing the biological and ecological status of an ecosystem. The current study provides another example of the complementarity of morphological and molecular techniques in achieving this goal. DNA barcoding offers a quick and easy approach to aquatic diversity assessment and requires minimal expertise in conventional taxonomy52,53. A comprehensive DNA barcode reference library is crucial in any biodiversity assessment because it provides selected autecological and biogeographic information for comparison with previous assessments. Even though DNA databases such as BOLD54 and GenBank55 are publicly available, a localised, taxon-specific reference library remains important because it is easier to curate and is a more practical reference for a focused site.

Our DNA barcode reference library, associated with the previously established voucher collections15, can be used for further biological evaluation and biomonitoring efforts in the Merbok Estuary and nearby regions. Future research assessing ecosystem health that requires a reference DNA barcode library, such as COI-based environmental DNA (eDNA) surveys or metabarcoding assays, can use this database. The barcode data generated in this study will contribute to local as well as regional efforts to conserve fish diversity. Nevertheless, to improve the taxonomic coverage of the mangrove-associated fish community of the Merbok Estuary, the number of DNA barcodes for species represented by singletons, and for species not yet examined, should be increased through additional sampling at more sites within and around the estuary.

Of the 134 species examined in this study, 61 (~46%) were identified as having high commercial value56. Protection planning and proper fishery management of these species are vital. Furthermore, we managed to barcode an invasive species, the Mozambique tilapia Oreochromis mossambicus, which should be monitored using either traditional or eDNA methods.

We DNA barcoded a rich and diverse mangrove-associated fish community. Of the 134 species initially identified on the basis of morphology, 123 were supported by their barcodes. We found hidden diversity within seven species, whereas the divergences between two pairs of valid species fell below the standard interspecific threshold, calling for further taxonomic studies. Comparison with previous species lists in and around this region49 shows that our taxonomic coverage of the Merbok Estuary is certainly incomplete, although the degree of incompleteness is unknown. Further research is needed to expand on the results of this study, especially for small, elusive, transient and non-commercial fish species. The establishment of a local DNA barcoding reference library is an essential step for future studies of fisheries, conservation and ecological management of this important site.

Methods

Ethics statement

This project was conducted according to the relevant national and international guidelines and did not involve any endangered or protected fish species. All fish specimens were either obtained from local fishermen, caught by the authors using non-invasive fishing gear, or bought at the local market. This study was carried out following the recommendations of, and with approval from, the Universiti Sains Malaysia Animal Ethics Committee.

Sample collection

A total of 441 specimens were sampled between December 2018 and October 2019 at multiple locations along the Merbok Estuary and its vicinity (Fig. 1). Specimens were obtained from local fishermen (who use the barrier-net method locally called ‘pompang’), collected directly by dip-net, or bought at the major fish landing site (Kuala Muda Whispering Market). All specimens were caught within the Merbok River and its adjacent waters. Samples collected at the fish landing site came from fishing vessels that operate within Zone A (from the shoreline up to 5 nautical miles) and Zone B (from 5 to 12 nautical miles)57. Information on the sampling localities (geographical coordinates) is shown in Table S1. Other collection data (dates, taxonomy and details of voucher specimens) can be retrieved from the online project datasheet in BOLD under the project code DBMR.

Sample processing and morphological identification

A fin clip from each fresh specimen was taken and stored in 90% ethanol. Voucher specimens were fixed in 10% formalin for at least one week and then transferred into 70% ethanol for long term storage. All specimens were catalogued and deposited at the Museum of Biodiversity, Universiti Sains Malaysia.

Morphology-based species identifications and nomenclature follow ref. 15, with a few reidentifications: Pseudogobius avicennia (museum number: USMFC (34) 00022; identified as Pseudogobius olorum in ref. 15), Trypauchen vagina (USMFC (34) 00027; Trypauchen pelaeos in ref. 15), Trypauchen pelaeos (USMFC (34) 00013; Trypauchen vagina in ref. 15), Cynoglossus bilineatus (USMFC (84) 00008; Cynoglossus lingua in ref. 15), Cynoglossus monopus (USMFC (84) 00002, 00005; Cynoglossus cynoglossus in ref. 15), Cynoglossus cf. cynoglossus (USMFC (84) 00003, 00006; Cynoglossus puncticeps in ref. 15), Pseudorhombus arsius (USMFC (107) 00001; Pseudorhombus elevatus in ref. 15), Stolephorus baganensis (USMFC (82) 00038, 00049; Stolephorus dubiosus in ref. 15). We were unable to assign a few specimens unequivocally to a valid described species using the available keys. In these cases, we used either “sp.” or “cf.”.

We did not barcode five species listed in ref. 15 (Sardinella gibbosa, Zenarchopterus buffonis, Gerres macracanthus, Drepane longimana, and Johnius belangerii), but we sequenced one specimen of Cryptocentrus sp., which was not listed there. A total of 134 morphological species were considered in this study (Table 1).

Laboratory analyses

Genomic DNA was extracted using the DNeasy Blood & Tissue kit (Qiagen, Germany) following the supplied protocol for animal tissue DNA extraction. The purity and concentration of the isolated DNA were measured using a microvolume UV spectrophotometer (Quawell Q300, Quawell, CA), and the DNA was stored at −20 °C until further use. An approximately 650 bp fragment of the mitochondrial COI gene was amplified using combinations of the following primers, previously designed by ref. 22:

  • FishF1-5’-TCAACCAACCACAAAGACATTGGCAC-3’,

  • FishF2-5’-TCGACTAATCATAAAGATATCGGCAC-3’,

  • FishR1-5’-TAGACTTCTGGGTGGCCAAAGAATCA-3’ and

  • FishR2-5’-ACTTCAGGGTGACCGAAGAATCAGAA-3’.

Each sample was amplified in a final volume of 25 µL containing 5.5 µL of 5x MyTaq™ Reaction Buffer Red (Bioline GmbH, Germany), 0.5 µL of each primer (100 ng/µL), 0.25 µL of 5U Taq polymerase (iNtRON Biotechnology Inc., Korea), 2.5 µL of genomic DNA (50 ng/µL) and nuclease-free water to make up the final reaction volume. Each amplification run included a negative control (no template DNA). Thermal cycling conditions were as follows: initial denaturation at 94 °C for 4 min; followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 48 °C for 50 s, and extension at 72 °C for 1 min; then a final extension at 72 °C for 10 min. The PCR products were then separated by electrophoresis on a 2% gel to check for successful amplification. All positive amplifications were sent for purification and sequencing to Apical Scientific Sdn. Bhd. (Selangor, Malaysia), which operates an ABI PRISM 3730XL automated sequencer with the ABI PRISM BigDye terminator cycle sequencing kit v3.1 (Applied Biosystems, Foster City, CA). Bidirectional sequencing was employed to reduce the probability of sequencing errors.

Data analyses

Each generated chromatogram was manually screened prior to DNA alignment in MEGA X58. The sequences were proofread and independently aligned and then inspected for deletions, insertions and stop codons using the same software.
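The stop-codon inspection was performed in MEGA X; purely as an illustration of what such a check involves, the sketch below scans a sequence for in-frame stop codons under the vertebrate mitochondrial genetic code (in which TAA, TAG, AGA and AGG are stops). The function names and the toy sequence are hypothetical, and because the reading frame of an amplicon depends on where the alignment starts, all three forward frames are examined.

```python
# Stop codons of the vertebrate mitochondrial genetic code
# (NCBI translation table 2).
VERT_MITO_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def stop_codon_positions(seq, frame=0):
    """Return 0-based start positions of stop codons in the given frame."""
    seq = seq.upper()
    return [
        i
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in VERT_MITO_STOPS
    ]


def frames_without_stops(seq):
    """Return the forward reading frames (0, 1, 2) free of stop codons."""
    return [f for f in range(3) if not stop_codon_positions(seq, f)]


# Hypothetical toy sequence (not a real barcode).
toy = "ATGGCCTAAGGC"
print(stop_codon_positions(toy, frame=0))
print(frames_without_stops(toy))
```

A genuine protein-coding COI fragment is expected to have one forward frame with no stop codons; the absence of any such frame may point to a sequencing error or a nuclear pseudogene copy.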

A total of 350 COI sequences were obtained in this study. To assess taxon discrimination among all specimens, pairwise genetic distances were calculated within and between species, genera, and families under the Kimura 2-parameter (K2P) distance model59 using the analytical tools available on the BOLD platform. To represent the genetic relationships among the sequences graphically, Bayesian Inference (BI) and Maximum Likelihood (ML) analyses were run in BEAST 2 (ref. 60) and raxmlGUI 2.0 (ref. 61), respectively. GTR+I+G was identified as the best-fitting substitution model by PartitionFinder 2 (ref. 62), as implemented in the CIPRES portal63. The BI tree was constructed with the GTR+I+G substitution model, empirical base frequencies and four gamma categories, employing a relaxed lognormal clock and the birth-death model. Two independent Markov Chain Monte Carlo (MCMC) chains of 40 million generations were run, sampled every 1000 generations, and the first 20% of samples were discarded as burn-in. Both runs were then assessed for convergence (ESS > 200) using Tracer 1.7.1 and combined using LogCombiner 2.4.8 before the final tree was constructed using TreeAnnotator 2.4.7, within the BEAST 2 package60. The ML tree was also built under the GTR+I+G model, with 1000 nonparametric bootstrap replicates. Both trees were then viewed and edited in FigTree 1.4.4 (ref. 64).
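For readers unfamiliar with the K2P model, the distance between two aligned sequences is d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of sites differing by a transition and by a transversion, respectively. The following minimal sketch implements this formula; it is not the BOLD implementation, the function name and toy sequences are hypothetical, and skipping sites with gaps or ambiguity codes (pairwise deletion) is an assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (pairwise deletion)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term_1 = 1.0 - 2.0 * p - q
    term_2 = 1.0 - 2.0 * q
    if term_1 <= 0 or term_2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term_1 * math.sqrt(term_2))


# Hypothetical 10-bp toy sequences (illustration only).
reference = "AAAAAAAAAA"
d_transition = k2p_distance(reference, "GAAAAAAAAA")    # one A<->G change
d_transversion = k2p_distance(reference, "CAAAAAAAAA")  # one A<->C change
print(f"{100 * d_transition:.2f}% vs {100 * d_transversion:.2f}%")
```

Multiplying the result by 100 gives the percentage scale used for the divergence values reported in this study.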

Three sequence-based methods were used to delimit Molecular Operational Taxonomic Units (MOTUs) from the analysed sequences: (1) Refined Single Linkage (RESL), (2) Automatic Barcode Gap Discovery (ABGD), and (3) Generalized Mixed Yule Coalescent (GMYC). The first analysis was done within the BOLD platform using the RESL algorithm65, which assigns sequences to dedicated Barcode Index Numbers (BINs, which are MOTUs). Next, the ABGD39 analysis was run on the webserver (https://bioinfo.mnhn.fr/abi/public/abgd/abgdweb.html) to survey divergence within the analysed dataset for species delimitation. The ABGD analysis was run with the following settings: relative gap width X = 1.0 and intraspecific divergence (P) values ranging from 0.001 to 0.0059 for all distance metrics, with all other parameters kept at their defaults. Finally, the GMYC method66 was applied to the fully resolved, ultrametric BI tree built from unique haplotypes only (see above for the reconstruction method). The haplotype dataset used in the GMYC analysis was built by collapsing all 350 individual COI sequences into 258 unique haplotype sequences using ALTER67. A single-threshold GMYC analysis was run in RStudio68 with the ‘splits’ package69.
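The haplotype-collapsing step can be illustrated with a short sketch that groups specimens sharing an identical aligned sequence. This is not the ALTER implementation: the function name, specimen identifiers and sequences are hypothetical, and treating sequences as identical only when their strings match exactly (with no special handling of missing data) is an assumption.

```python
def collapse_haplotypes(records):
    """Group specimen IDs by identical aligned sequence.

    `records` maps specimen ID -> aligned sequence. The result maps each
    unique sequence (haplotype) to the list of specimens carrying it.
    """
    haplotypes = {}
    for specimen_id, seq in records.items():
        haplotypes.setdefault(seq.upper(), []).append(specimen_id)
    return haplotypes


# Hypothetical specimens and sequences (illustration only).
toy_records = {
    "specimen_1": "ACGTACGT",
    "specimen_2": "ACGTACGT",
    "specimen_3": "ACGTACGA",
    "specimen_4": "TCGTACGA",
}
toy_haplotypes = collapse_haplotypes(toy_records)
print(f"{len(toy_records)} sequences -> {len(toy_haplotypes)} haplotypes")
```

Retaining one representative per haplotype avoids zero-length terminal branches in the ultrametric tree, which is why GMYC analyses are usually run on unique haplotypes.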