One thousand DNA barcodes of piranhas and pacus reveal geographic structure and unrecognised diversity in the Amazon

Piranhas and pacus (Characiformes: Serrasalmidae) are a charismatic but understudied family of Neotropical fishes. Here, we analyse a DNA barcode dataset comprising 1,122 specimens, 69 species, 16 genera, 208 localities, and 34 major river drainages in order to make an inventory of diversity and to highlight taxa and biogeographic areas worthy of further sampling effort and conservation protection. Using four methods of species discovery—incorporating both tree and distance based techniques—we report between 76 and 99 species-like clusters, i.e. between 20% and 33% of a priori identified taxonomic species were represented by more than one mtDNA lineage. There was a high degree of congruence between clusters, with 60% supported by three or four methods. Pacus of the genus Myloplus exhibited the most intraspecific variation, with six of the 13 species sampled found to have multiple lineages. Conversely, piranhas of the Serrasalmus rhombeus group proved difficult to delimit with these methods due to genetic similarity and polyphyly. Overall, our results recognise substantially underestimated diversity in the serrasalmids, and emphasise the Guiana and Brazilian Shield rivers as biogeographically important areas with multiple cases of across-shield and within-shield diversifications. We additionally highlight the distinctiveness and complex phylogeographic history of rheophilic taxa in particular, and suggest multiple colonisations of these habitats by different serrasalmid lineages.


Methods
Sample collection. Muscle and fin-membrane tissue samples were taken in the field from the right-hand pectoral-fin base or from the right side of the flank, and were stored in 95% ethanol and deposited in the tissue collection of the Laboratόrio de Evolução e Genética Animal (LEGAL) at the Universidade Federal do Amazonas (UFAM). Voucher specimens were fixed in 10% formalin and deposited in the fish collection of the Instituto Nacional de Pesquisas da Amazônia (INPA). Vouchers were identified to species by taxonomic specialists using available comparative material, identification keys, original descriptions, and redescriptions of species 8,[50][51][52][53][54] . Individuals that could not be identified to species level were reported as "Genus sp. " (possible new/unidentified species) or as "Genus aff. species" (closely related species, possibly new). DNA barcode sequence generation. Total DNA was isolated from approximately 50 mg of tissue using standard phenol-chloroform extraction methods 55 . A fragment of 651/657 bp of mitochondrial cytochrome c oxidase subunit I (COI) was amplified using the M13-tailed primer cocktails FishF2/FishR2 and VF2/VR1d respectively 56  The forward and reverse chromatograms were assembled into contigs using Geneious 7.0.6 57 and edited manually where required. The sequences were then aligned using Mafft v7.307 58 , and checked manually for insertions, deletions or stop codons using translated amino acids in Geneious. The alignment was trimmed to 621 bp to reduce missing data and erroneous base calls at the ends of the contigs 59 .
Further sequences from the GenBank database were added to the dataset. Using the rentrez_1.0.4 interface 60 we searched GenBank in July 2017 using the terms "Serrasalmidae" and "COI", "cox1" or "CO1", requesting only sequences between 450 and 1,000 bp in length. Searches for longer sequences (i.e. mitochondrial genomes) did not reveal any species not already sampled. Any sequences generated from specimens collected outside of South America, or those that clearly appeared to be attributed to incorrect species names were removed.
Single locus discovery of species. We used four single-locus species-discovery (SLSD) methods to partition our dataset into putative species-like clusters: (1) GMYC, the general mixed Yule coalescent model 33,43,61 ; (2) bGMYC, a Bayesian implementation of the GMYC 48 ; (3) local minima (locMin), a distance threshold optimising and clustering approach from the spider_1.3-0 software package 62 ; and (4) mPTP, the poisson tree process method 44,45 . Unless otherwise stated, analyses were carried out in R 3.4.1 63 . Beast 1.8.4 64 was used to generate a posterior sample of ultrametric trees for the GMYC analyses. The dataset was first collapsed to unique haplotypes 65 . The Beast analysis was set up as follows: substitution model TN93 + Γ as selected by jModeltest2 66 ; single model partition; strict molecular clock (relaxed clock was tested for a priori); fixed arbitrary substitution rate of 0.01; and coalescent tree prior 33,65 . Three independent chains were run for 20 million generations from random starting topologies, and were sampled every 18,000 generations, resulting in 3,333 trees (of which 333 were discarded as burn-in). The 3,000 post burn-in trees were combined and then subsampled to 1,000 for all downstream analyses. Tracer 67 was used to verify the chains had reached stationarity.
GMYC, bGMYC and mPTP analyses were carried out as: (1) a point estimate based on the maximum clade credibility tree created in TreeAnnotator 1.8.4 (node heights "ca"); and (2) confidence intervals calculated from the posterior sample of 1,000 trees. We used the bGMYC_1.0.2 48 , splits_1.0-19 43 and ape_4.1 68 packages. The bGMYC posterior samples were summarised into putative species with a conservative posterior probability of conspecificity at 0.05. For mPTP (single lambda), the Beast chronograms (ultrametric trees with branch lengths scaled by time) were first transformed into phylograms (branch lengths scaled by substitutions per site) using maximum likelihood optimisation in phangorn_2.2.0 69 under the same substitution model settings as described above. The locMin analyses were again conducted as a point estimate, and also on a set of 1,000 bootstrapped datasets to generate a confidence interval for this method.
Data availability. New sequence data generated here are available from the GenBank nucleotide archive under the accessions MG751915-MG752866, and at the Barcode of Life BOLD database under the project name "PRNHA". Metadata for all sequences used in this study are presented in Supplementary Table S1 as a comma delimited flatfile following Darwin Core standard vocabulary (http://rs.tdwg.org/dwc/terms/index.htm). The datasets and scripts used in this study are available from a public GitHub repository hosted at https://github.com/ legalLab/publications.

Results
Sampling and data description. As part of this study a total of 975 serrasalmid individuals were collected from 168 unique localities in 30 major river drainages (Fig. 1). With the addition of data from GenBank this increased to 1,122 specimens from 208 unique localities in 34 major drainages. Upon morphological assessment a total of 68 species-level taxa were identified and we were able to assign taxonomic names to 60 of these (i.e. valid nominal species), with a further eight being identified to genus level only, i.e. putative new species (six pacus, two piranhas). Data for thirteen species were obtained from GenBank, but only one of these (Serrasalmus marginatus) was not already present in our dataset (bringing the total to 69 species). Overall, the sampling covered 61 (19 piranha species and 42 pacu species) of the 94 (65%) valid serrasalmid species representing all 16 genera (100%). Per species, 63 of the total 69 (91%) were represented by more than one individual, with 56 (81%) represented by five or more individuals; median number of individuals per species was ten, mean was 16.3, and maximum was 95 (Serrasalmus rhombeus). Fifty-one species (74%) were collected from more than one locality; 45 (65%) were collected from three or more localities; median number of localities per species was four, mean was 6.2, and maximum was 35 (Serrasalmus rhombeus). Forty species (58%) were collected in more than one drainage; 33 (48%) were collected from three or more drainages; median number of drainages per species was two, mean was 3.6, and maximum was 15 (Serrasalmus rhombeus). The aligned DNA barcode matrix comprised 1,122 taxa by 621 bp. The unaligned sequences varied in length 417-621 bp; 11 (1%) were less than 500 bp, 97% were greater than 530 bp, 37% were the full 621 bp; median sequence length was 609 bp (mean 595 bp). The dataset comprised a total of 444 unique haplotypes. Per species (Table 1), 59 (86%) were represented by more than one haplotype, with 28 (41%) represented by five or more haplotypes; median number of haplotypes per species was four, mean was six, and maximum was 29 (Serrasalmus rhombeus).
Single locus discovery of species. Point estimates for the SLSD varied between 76 putative species (loc-Min) and 99 species (GMYC), with 118 unique molecular delimitations over all methods (Table 2; Fig. 2); confidence intervals (95%) were largest for locMin at 67-140 species and lowest for mPTP at 75-83 species ( Table 2). Centers of the confidence interval distributions tended to be lower than the point estimates for the GMYC and bGMYC analyses, and higher for the locMin and mPTP analyses. Of the point estimate delimitations 49% were supported by congruence of all four methods, 60% were supported by three or four methods, and 14% by only one method (Fig. 2).
The locMin analysis optimised a divergence threshold of 0.0135 (p-distance) for the dataset. COI lineages delimited by this method varied between 0.023 maximum intraspecific divergence (Myloplus rhomboidalis) and 0.117 (Myloplus schomburgkii), while eight were greater than 0.05 ( Table 1). The overall mean maximum intraspecific divergence was 0.0068 with the exclusion of the 15 species showing intraspecific genetic distances above the threshold (Table 1). Of the 69 a priori identified species, 43 (62%) were monophyletic, 20 (29%) were not monophyletic, and six (9%) were singletons (Table 1). Eighteen species (26%) shared haplotypes with another species, and this most commonly occurred in Myloplus (6 spp.) and Serrasalmus (5 spp.). A neighbour-joining tree showing all 1,122 COI sequences coloured by species is presented as Supplementary Fig. S1.

Discussion
In terms of variation in the pacus, of the 13 morphologically identified species of Myloplus, six had multiple lineages with many of these in the Guiana and Brazilian shields; in M. arnoldi, a lineage from the Guiana Shield (Nhamundá River) was identified as divergent from conspecifics in the Brazilian Shield (Araguaia, Tapajόs and Xingu rivers), separated by 0.086 p-distance ( Fig. 3; Table 1); in M. rhomboidalis there were two Guiana  Shield lineages (Jari and Branco rivers) and one in the Brazilian Shield (Xingu River); in M. schomburgkii (Fig. 3) there was a Guiana Shield lineage (Branco, Negro and Nhamundá rivers) distinct from a Brazilian Shield lineage (Araguaia, Tapajόs and Xingu Rivers), a third lineage found in the Xingu, as well as an intriguing forth lineage from the upper Amazon (Nanay River, Peru) and the Branco River. The species comprising Myloplus asterias/rubripinnis (including M. aff. rubripinnis) was estimated to contain between 11 and 13 lineages, with distinct lineages found in the Araguaia (one), Tocantins (one), Tapajόs (four), Aripuanã (two), Xingu (two), and Jatapu (one) rivers. Examples of within-shield diversification were apparent with two clades of multiple lineages within the Brazilian Shield, and also one lineage showing across-shield conspecificity (Aripuanã, Trombetas, Nhamundá). Extensive non-monophyly of the nominal taxa ascribed to the Myloplus asterias/rubripinnis group indicates problems in current diagnoses of the taxa and/or application of diagnostic characters supporting these taxa, and therefore given this ambiguity and the apparent high levels of within-drainage endemism, a taxonomic revision of this group should be a priority.
Other pacus also displayed large intraspecific divergences. Mylossoma aureum, M. duriventre, and Piaractus brachypomus all revealed lineages in the Orinoco distinct from those of the Amazon basin (see also Escobar et al. 29 and Mateussi et al. 30 ). Mylesinus paraschomburgkii showed evidence of distinct lineages in the Guiana Shield rivers (Uatumã, Trombetas and Jari), while a singleton specimen of Mylesinus aff. paraschomburgkii from the Nhamundá River was nested within Myloplus zorroi from the Aripuanã River. The distribution of lineages within Myloplus lobatus and Myleus setiger also indicated a biogeographic link between the Aripuanã and the Nhamundá, Jatapu, Uatumã and Trombetas rivers, reflecting the historical proximity of the mouths of these southerly flowing Guiana Shield rivers before the capture of the north flowing Brazilian Shield Aripuanã River by the Madeira River. Sharing of species and lineages between the Aripuanã and Guiana Shield rivers is not restricted to the serrasalmids, but has also been observed in cichlids of the genus Symphysodon 70,71 and loricariid catfishes 72 . Aside from the three potential lineages of Metynnis luna, there were few new lineages of silver dollars, likely reflecting the recent taxonomic work 22 and ongoing studies being carried out on the group (Ota, studies in progress).
Patterns among piranhas were less clear than for the pacus, with greater incongruence among methods. With 16 species considered valid, there were between 12 (locMin) and 22 (GMYC) lineages in Serrasalmus and Pygocentrus. While the distinctiveness of Pygocentrus cariba, P. piraya, Serrasalmus brandtii, S. elongatus, S. gouldingi, S. manueli, S. serrulatus and S. spilopleura were well supported by at least three of the four methods, the Serrasalmus rhombeus group was the primary source of incongruence. Here, the mPTP method recognised only one species in an inclusive clade comprising eight nominal taxa, and the GMYC methods recognised six species. This indicates that this group may be at the limits of resolution for a single mitochondrial locus and the   probably the result of misidentification ( Supplementary Fig. S1). Although not supported by all methods, it is possible that given the patterns observed in other taxa, the S. rhombeus individuals from the Xingu River represent a distinct species. Therefore, due to ontogenetic complexities, genetic similarity and the subtle morphological differences among species and lineages of the S. rhombeus clade, we feel it would benefit greatly from a population genomic analyses before any taxonomic treatment of the group is embarked upon. In the genus Pygocentrus, the species P. nattereri was found to comprise up to four lineages. One of these four was represented by GenBank samples nested within Serrasalmus maculatus from the Paraná River, and we believe that these samples were misidentified. Although not supported by all delimitations, the other three lineages of P. nattereri are possibly distinct, with one from the Tocantins/Araguaia/São Bento rivers and another from the Guaporé River, and both distinct from the more widespread Amazonas clade. Serrasalmus maculatus from the upper Paraná River was reported by three of the four methods to form a distinct lineage from S. maculatus of the lower Paraná River. The upper and lower Paraná River were distinct ichthyofaunal provinces separated by the Sete Quedas rapids until the construction of the Itaipú dam, which due to its system of locks, permitted the homogenisation of these faunas 74 . Among the 69 a priori identified taxonomic species of serrasalmid analysed herein, up to 23 are represented by more than one COI lineage (Table 1). Despite recent studies and taxonomic revisions describing new species 5,21,22 , our results show that a number of potential new species may still await a formal morphological diagnosis and description. Many factors contribute to this underestimation of diversity within the family. Only few morphological studies have been published in the last 10 years, reflecting the difficulties of interpreting high levels of ontogenetic variation, allometric growth, sexual dimorphism, and spatial variation in both body shape and colour pattern 5,21,25,75,76 . The number of possibly unrecognised species observed here in serrasalmids support the conclusions of Reis et al. 2 , who estimated that 34-42% of Neotropical freshwater fishes remain undescribed, and are mostly concentrated in the Amazon basin. The main explanations for this unrecognised diversity stem largely from (1) historically poorly sampled areas above geological barriers such as rapids; (2) widespread taxa or heterogeneous taxa with insufficient or overwhelming amounts of museum material; or (3) cryptic or pseudocryptic (morphological differences apparent but overlooked) diversity in widespread species. Genetic data are an important instrument in uncovering cases of the latter 77 . Of particular importance are the rheophilic taxa inhabiting rapids. Geologically the western Amazon basin is characterised by a sedimentary basin, and in the central and eastern portion by the crystalline Guiana and Brazilian Shields separated by the Amazon River 78 . As affluents of the Amazon descend the Guiana and Brazilian Shields, they form riffles, rapids and waterfalls inhabited by a distinctive fauna and flora. Aquatic flora of the rapids habitats is characterised by the Podostemaceae 79 , while perhaps the best known faunal component of these habitats are the loricariid catfishes 80 . Serrasalmids also are a conspicuous component. Our data indicate that rheophilic pacus classified in the genera Tometes, Ossubtus, Utiaritichthys, Mylesinus and Myloplus represent multiple, apparently evolutionarily independent lineages of rheophilic fishes (Fig. 2). Conversely, we also show lineages and haplotypes to be shared between rheophilic habitats of different rivers-e.g. the Tometes camunani complexsupporting the hypothesis of interconnection of these rheophilic habitats during low-water glacial periods of the Pleistocene 81 . Serrasalmids present not only a fascinating window into adaptations to extreme environments and the complexity of diversification patterns in this environment [82][83][84] , but the strictly rheophilic species are also the most threatened of the serrasalmids since hydroelectric projects are developed at sites of rapids and waterfalls, largely destroying these unique habitats and their associated taxa 85 .
The underestimation of the variation and diversity in the Serrasalmidae-and which can be extrapolated to Amazonian aquatic fauna in general-is directly relevant to the conservation of these groups. Due to high proportions of faunal endemism and increasing anthropogenic threats, the Brazilian Shield rivers and their faunas are of particular conservation concern 80,86,87 . Reis 88 summarised the principal anthropogenic threats of the Amazon River basin as: extensive deforestation of Amazon forest; hydroelectric dam building with the associated transformation of lotic environments into lentic environments resulting in the extirpation or significant reduction of populations of rheophilic species (while concomitantly contributing to the proliferation of lentic-adapted species); alluvial gold mining causing mercury contamination; and overexploitation of most commercial species. Therefore, we reiterate the conclusions of Reis 88 , who suggested that one of the primary instruments for conservation of Amazon basin fishes is increasing expertise in fish taxonomy and systematics. Additionally, we advocate DNA barcoding and other genetic tools as powerful complementary methods for uncovering fish diversity and highlighting groups in need of taxonomic revisions. Here we demonstrate the utility of DNA barcoding in providing an independent estimate of species alpha diversity, and additionally in providing preliminary data on population subdivision, gene flow, and relative ages of divergences.
However, important caveats need to be considered when interpreting single locus mtDNA data. Our confidence intervals were generally wide, reflecting the influence of phylogenetic uncertainty on our results and its importance in species delimitation as a whole 48,49 . Furthermore, the failure to congruently discriminate closely related species, such as those in the Serrasalmus rhombeus group, is perhaps a reflection of the limitations of single threshold SLSD methods when faced with situations where species with large effective population sizes have recently diverged in rapid succession (i.e. a young radiation), conditions whereby single locus methods are known to underestimate species diversity 43,89,90 . While multiple threshold models were developed to accommodate variation in coalescent depths between groups 33,45 , we were unable to generate realistic results while experimenting with these settings (as evidenced by excessive splitting). This is largely due to multiple threshold models making possibly spurious delimitations by recognising population structuring as speciation events 91,92 . Therefore, where delimitations are implausible or incongruent we recommend secondary sources of data be (re-)examined. While rates of phenotypic evolution and speciation are correlated over macroevolutionary scale 93 , there will be situations where local adaptation and fine-scale speciation may change phenotypes at a rate significantly faster than can be identified by neutral loci 94 , emphasising the need for additional data from morphology, behaviour, distribution, and ecology [35][36][37][38] when undertaking systematic revisions.
Geographic scale is another important factor in determining the structure of DNA barcode datasets; Bergsten et al. 40 demonstrated the substantial increase in intraspecific diversity and the decrease in interspecific divergence over increasing geographical distances. Fortunately, GMYC methods have been shown to be robust to the presence of singletons and absences of intermediate haplotypes 65,89 , but where we report putatively new lineages based on low numbers of individuals, effort still needs to be made to source more specimens before more conclusive statements can be made about the distinctiveness of those taxa 95 . Overall our sampling generated a geographically broad dataset with three quarters of the species having been collected from more than one locality, and over half being collected from more than one major drainage. Despite the positive bias for samples from the eastern and central Brazilian Amazon, and the paucity of samples from the western Amazon, we are confident in having captured a significant proportion of serrasalmid diversity. Inventories from Peru, Colombia and Ecuador would be extremely valuable additions, however.
Our single locus species delimitation results support a notion that piranha and pacu taxonomic diversity is currently underestimated in the Brazilian Amazon. The four methods achieved a high level of congruence (60% of the lineages were supported by three or more methods), indicating they were recognising a common signal of diversification, with great majority of these lineages also supported as allopatric and biogeographically distinct populations. The results particularly highlight: (1) the Guiana and Brazilian Shields as regions of underestimated but high ichthyofaunal endemism and diversity; (2) the existence of both between-shield (e.g. Myloplus schomburgkii, M. arnoldi), and within-shield (Myleus setiger, Mylesinus paraschomburgkii, Myloplus rubripinnis/asterias) diversification patterns in pacus; (3) very recent biogeographic connection between the the Aripuanã (Brazilian Shield) and Guiana Shield rivers; (4) distinct lineages of species shared between the Amazon and Orinoco basins (Mylossoma aureum, M. duriventre, Piaractus brachypomus); (5) the evolutionary uniqueness, distinctness, and apparent independent evolution of rheophilic lineages; and (6) the taxonomic difficulties associated with piranhas. Thus, characterisation of these faunas by traditional taxonomic methods combined with further effort in sequencing more loci is needed to better understand the implications of these results in an explicit and testable biogeographic framework of Neotropical diversification and community assemblage.