Population genetics is essential for understanding and managing marine ecosystems, but sampling remains challenging. We demonstrate that high-throughput sequencing of seawater environmental DNA can provide useful estimates of genetic diversity in a whale shark (Rhincodon typus) aggregation. We recover similar mitochondrial haplotype frequencies in seawater compared to tissue samples, reliably placing the studied aggregation in a global genetic context and expanding the applications of environmental DNA to encompass population genetics of aquatic organisms.
Population genetic information is essential for the informed management and conservation of endangered species, but for rare oceanic species sampling remains a challenge. The whale shark is an iconic oceanic species that is endangered, mainly due to overexploitation1. Despite its large size, much of the whale shark’s biology remains unknown2. For instance, although studies have documented coastal aggregations of whale sharks around the world2, little is known about offshore aggregations3. Population studies have primarily depended on tissue sampling and tagging, which are expensive and potentially harmful4.
Here, we investigated the use of seawater environmental DNA (eDNA) to obtain genetic information at the population level. Environmental DNA from water samples has been used to detect and quantify aquatic macroorganisms in freshwater5,6 and, more recently, in seawater7,
We compared mitochondrial (mtDNA) control region sequences obtained from PCR amplification and Illumina sequencing (metabarcoding) of eDNA samples (two polymorphic regions; DL1: 412 bp and DL2: 476–493 bp) to sequences from tissue samples collected at the same locality (61 individuals; Supplementary Table 2). Considering only known haplotypes, we found similar relative haplotype frequencies in the seawater eDNA compared to the tissue samples (Fig. 1c,d). This suggests that quantitative relationships between haplotypes present at the time of water sampling are reflected in the sequencing data. A mock sample prepared from known haplotypes indicated a positive correlation between DNA template concentration and read output (Supplementary Fig. 4), supporting a quantitative relationship between the two. We retrieved more haplotypes from eDNA (DL1: 7, DL2: 18) than from tissue samples (DL1: 4, DL2: 12) (Fig. 1c,d; Supplementary Figs 1 and 2), indicating that the tissue database did not represent the complete mitochondrial diversity of the aggregation. The four DL1 haplotypes found in the tissue samples were also found within the eDNA and included one haplotype unique to Qatar. Similarly, all twelve DL2 haplotypes found in the tissue samples were found in the eDNA, including one haplotype unique to Qatar. Globally, 19 DL1 haplotypes and 44 DL2 haplotypes were identified (Supplementary Table 3).
Using principal component analysis (PCA) of the relative haplotype frequencies inferred from the eDNA reads and tissue samples, we placed the studied aggregation in the population genetic context of the world’s whale sharks11,
As a proof of concept, we estimated the effective female population size ($N_f$) on the basis of eDNA from the most polymorphic target region (DL2) (Supplementary Information), using an estimated mutation rate of 0.1% per million years (95% confidence interval (CI): 0.04–0.16%) (Supplementary Fig. 10; Supplementary Information). The resulting $N_f$ estimate was 71,600 (95% CI: 43,618–183,526; nucleotide diversity π = 0.00358), when scaling haplotype frequencies to 100 individuals. The estimated daily number of individuals in the Al Shaheen aggregation10 is approximately 124, and up to ~200 individuals were present during water sampling based on fin counts (Supplementary Table 1). Scaling haplotype frequencies to 20, 50, 100, 150, 200, 250 and 300 individuals, respectively, resulted in an average estimate of 75,543 females (95% CI: 54,714–96,372). On the basis of tissue samples, $N_f$ was estimated at 138,400 (95% CI: 85,087–351,654; π = 0.00692). These estimates are assumed to reflect the entire Indo-Pacific $N_f$, as little genetic subdivision has been reported within this region11,
While the $N_f$ estimate from eDNA was approximately half of that estimated from the tissue samples, we find the overlap between CIs promising for eDNA as a proxy for estimating effective population sizes. Our regional estimates are meaningful compared with global estimates of effective population size that are based on complete control region sequences (119,000–238,000 females, no reported CI)11 and microsatellites (103,572 individuals, standard error range: 27,401–179,794)12. Importantly, these estimates are all based on estimated mutation rates, which are difficult to determine accurately14.
To account for errors generated during amplification and sequencing, which could lead to false positive haplotypes, we cleaned our data using observed (mock sample) and in silico estimated error rates before performing the above analyses (Supplementary Figs 3 and 5; Supplementary Information). Interestingly, in silico error modelling showed that some haplotypes were more likely to arise as false positives than others (Supplementary Figs 5–7). Cleaning removed one DL1 haplotype and five DL2 haplotypes (<5% of reads in both cases) (Supplementary Information).
As extensive knowledge of haplotypes is not always available, we performed a new analysis on our eDNA reads using only reference sequences from six individuals (mock sample) (Supplementary Figs 8 and 9). All of the original DL1 haplotypes and 10 of the 18 DL2 haplotypes were retained. In addition, several unknown putative haplotypes (DL1: 9, DL2: 6) were found. Nevertheless, results were very similar to those found using a reference database for both $N_f$ (63,400, 95% CI: 38,525–162,899, based on 100 individuals) and PCA, demonstrating that eDNA metabarcoding data can be used independently for population genetic inference with little prior knowledge of the studied population.
Owing to its high detection rates, cost-efficiency and non-invasiveness compared with traditional survey methods15, eDNA analysis is increasingly recognized as a valuable tool for ecological inference and management of aquatic biodiversity9,15. However, to realize its full potential, aquatic eDNA needs to advance from species detection to the analysis of populations. Much remains to be investigated regarding the relationship between eDNA data and abundance or biomass, including the influences of abiotic factors. The unknown number of source individuals for an eDNA sample represents another challenge; choosing a number of individuals for scaling is a major assumption and at present requires additional information. Lastly, conservative or non-conservative data cleaning criteria may lead to under- or overestimation of genetic diversity, respectively. Future advances may facilitate the retrieval of longer eDNA fragments and provide higher read coverage, but it is unclear whether the identification of individuals will be possible. Nevertheless, we have demonstrated that reliable estimates of haplotype frequencies, genetic diversity and population subdivision can be retrieved from eDNA—even in the absence of a reference database. The data derived from eDNA required fewer resources and a smaller sampling effort, compared to that derived from tissue samples (Supplementary Information).
The whale sharks in Qatar are reported to aggregate at Al Shaheen to feed on fish spawn from mackerel tuna (Euthynnus affinis)3. To investigate aquatic eDNA as a potential proxy for studying trophic interactions, we quantified eDNA from both species using quantitative PCR (qPCR). The concentration of whale shark eDNA correlated strongly with that of mackerel tuna (P < 0.001, $R^2$ = 0.84) (Fig. 1f). We argue that this result most probably reflects the predator–prey relationship observed between the two species. Alternatively, the tuna may follow the sharks, as reported from the Azores16, but this has never been observed in Al Shaheen.
Sea currents can potentially move genetic material over large distances, leading to detections of non-local eDNA. However, samples collected concurrently with visual observations of whale sharks contained higher concentrations of whale shark and mackerel tuna eDNA (P < 0.001 for both, Wilcoxon test) (Fig. 1f; Supplementary Information), supporting a local origin of the sampled eDNA. This is in line with research indicating the differentiation, at scales of ~60 m, of marine eDNA7 and degradation within days8. We performed an experiment on local seawater, which suggests that whale shark eDNA in the Gulf degrades on a similar timescale (Supplementary Fig. 11; Supplementary Information). Thus, while more work is needed, we find it reasonable to assume that our results reflect local population composition due to limited long-distance movement of eDNA.
Ongoing research on large oceanic species, such as the whale shark, includes tissue sampling, acoustic surveys, satellite tagging, aerial surveys and photo identification17. Most of these methods depend on good weather conditions and visibility, and are restricted to individuals near the surface. Aquatic eDNA sampling overcomes these challenges and offers high sensitivity8,15. To our knowledge, this study is the first to show that aquatic eDNA can be used for population-level inferences and for identifying species co-occurrences that may indicate trophic interactions. This broadens the scope of eDNA research and facilitates the informed management of aquatic biodiversity and resources.
Seawater samples were collected in May 2013 and May–June 2014 at 15 locations near the Al Shaheen oil field (20 samples in total, Supplementary Table 1). Nineteen samples of 3 × 500 ml (1.5 l total) were collected at the surface and filtered through sterile 0.22 μm Sterivex-GP filters (Merck Millipore) using 60 ml syringes (HSW Soft-Ject). An additional sample of 6 × 30 l was collected to measure eDNA degradation. Prior to water sampling, the number of sharks in the aggregation was estimated by counting fins at the surface.
Whale shark tissue samples were taken with a biopsy spear in 2011–2014 and preserved in 96% ethanol. Sharks were photographed and later identified to the individual level (Supplementary Information).
The Qiagen DNeasy Blood and Tissue Kit was used for DNA extraction from both tissue (manufacturer’s protocol) and water (modified protocol) samples. Tissue-extracted DNA was PCR amplified using the primers WSCR1-F and WSCR1-R (ref. 11), which target the complete control region. PCR products were Sanger sequenced at Macrogen Europe and sequences were quality checked in Geneious v. 7.1.7 (Biomatters Ltd) and assigned to haplotypes with DnaSP v. 5.10.1 (ref. 18).
In eight samples, taken where whale sharks were visually observed (Supplementary Table 1), we used PCR to amplify two polymorphic regions of the whale shark mtDNA control region (DL1: 412 bp, DL2: 476–493 bp). The three eDNA extracts from each 3 × 500 ml sample were combined in pools. The DL1 and DL2 regions were PCR amplified using primers tagged with oligonucleotides eight nucleotides in length19. A unique combination of tags on the forward and reverse primers was used for each PCR replicate (six replicates per sample). PCR products from the samples that yielded positive amplification (DL1: 7 samples, DL2: 5 samples) were purified using the Qiagen MinElute kit. Libraries were prepared using the NEBNext DNA Library Prep Master Mix Set for 454 (New England Biolabs Inc.) and sequenced at Macrogen Europe on the Illumina MiSeq platform (DL1: 250 bp paired-end, DL2: 300 bp paired-end). A PhiX spike-in and a mock sample prepared from tissue extracts of six individuals (relative concentrations of 1 to 1:1000) were included in the sequencing runs. Sequences were analysed in OBITools20. As read quality for DL2 dropped after ~200 bp, paired-end reads were joined end-to-end and the low-quality middle sequence was removed using a custom Python script. Only 100% matches to known whale shark haplotypes were considered.
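The end-to-end joining step for DL2 can be illustrated with a minimal sketch. This is not the custom script used in the study, which is not reproduced here; the function names and the choice to keep exactly 200 bases from each read are assumptions made for illustration, based only on the statement that quality dropped after ~200 bp.

```python
# Hypothetical sketch of end-to-end joining of paired-end reads.
# Not the authors' script: names and the keep=200 default are assumptions.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_end_to_end(forward_read, reverse_read, keep=200):
    """Keep the first `keep` bases of each read and concatenate them.

    The reverse read is reverse-complemented so that both halves are on
    the same strand. The low-quality read ends, and with them the middle
    of the amplicon, are discarded.
    """
    head = forward_read[:keep]
    tail = reverse_complement(reverse_read[:keep])
    return head + tail


# Toy amplicon: 10 bp start, 8 bp middle, 10 bp end.
amplicon = "AACCGGTTAC" + "GGGGGGGG" + "TTACGATCAG"
fwd = amplicon[:14]                      # read from the 5' end
rev = reverse_complement(amplicon)[:14]  # read from the 3' end
joined = join_end_to_end(fwd, rev, keep=10)
```

For exact matching against known haplotypes, reference sequences would have to be trimmed in the same way, so that the removed middle section is ignored on both sides of the comparison.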
For the PCA, haplotype frequencies from eDNA were first scaled to a total of 100 individuals, corresponding approximately to the number of individuals observed (between ~20 and ~200 sharks) when the water samples included for sequencing were taken. Frequencies below 1% were rounded up, so each haplotype was represented by at least one individual. Sequences from the National Center for Biotechnology Information (NCBI) and Vignaud et al.13 represented 32 individuals from Mozambique, 16 from Taiwan and the Philippines, 146 from Ningaloo Reef in Australia and 38 from the Gulf of California in Mexico (Indo-Pacific populations), as well as 32 individuals from Isla Holbox in Mexico (Atlantic population).
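The scaling of read-based haplotype frequencies to a number of individuals can be sketched as follows. The function name, the toy frequencies and the half-up rounding of frequencies above the one-individual floor are assumptions for illustration; the text above specifies only that each haplotype is represented by at least one individual.

```python
import math


def scale_to_individuals(freqs, n_individuals=100):
    """Convert relative haplotype frequencies into whole-individual counts.

    freqs: dict mapping haplotype name -> relative read frequency.
    Each haplotype is assigned at least one individual, so the counts
    can sum to slightly more than n_individuals.
    """
    total = sum(freqs.values())
    counts = {}
    for hap, f in freqs.items():
        expected = f / total * n_individuals
        # Round half up (assumption), then enforce the floor of one.
        counts[hap] = max(1, math.floor(expected + 0.5))
    return counts


# Hypothetical frequencies, not data from the study.
example = {"H1": 0.62, "H2": 0.376, "H3": 0.004}
counts_100 = scale_to_individuals(example, 100)
```

Because rare haplotypes are rounded up, the scaled sample slightly over-represents them. This effect grows as the chosen number of individuals shrinks, which is one reason the choice of scaling number matters.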
A sequencing error rate of 0.3% was estimated from the PhiX output. A putative combined PCR and sequencing error rate of 1.3% was calculated on the basis of low-frequency spurious haplotypes retrieved from the mock sample. Haplotypes appearing at a frequency below these rates were removed from the data.
When the analysis was redone without the reference database, cleaning was done on the basis of the error rate observed in the mock sample (1.3%) and assuming that the most abundant sequence from a PCR was authentic. In addition, sequences were required to be present in at least two PCRs.
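A minimal sketch of the two cleaning rules described above is given below. It assumes that frequencies are computed relative to the total read count of each PCR replicate, which is not specified in this section, and the function names and toy counts are hypothetical.

```python
def filter_by_error_rate(read_counts, error_rate=0.013):
    """Drop sequences whose within-PCR frequency is below error_rate.

    read_counts: dict mapping sequence (or haplotype name) -> read count
    for a single PCR replicate.
    """
    total = sum(read_counts.values())
    return {s: n for s, n in read_counts.items() if n / total >= error_rate}


def filter_reference_free(pcr_replicates, error_rate=0.013, min_pcrs=2):
    """Keep sequences that pass the error filter in at least min_pcrs PCRs.

    pcr_replicates: list of read-count dicts, one per PCR replicate.
    """
    seen_in = {}
    for counts in pcr_replicates:
        for seq in filter_by_error_rate(counts, error_rate):
            seen_in[seq] = seen_in.get(seq, 0) + 1
    return {seq for seq, k in seen_in.items() if k >= min_pcrs}


# Hypothetical read counts for three PCR replicates.
replicates = [
    {"A": 900, "B": 95, "C": 5},
    {"A": 800, "B": 190, "C": 10},
    {"A": 990, "D": 10},
]
retained = filter_reference_free(replicates)
```

In this toy case, sequences C and D never exceed 1.3% within a PCR and are discarded, while A and B pass in at least two replicates. A stricter threshold or a higher `min_pcrs` makes the cleaning more conservative, at the cost of possibly discarding rare authentic haplotypes.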
The mutation rate of the DL2 region was estimated in BEAST v. 1.8.2 (ref. 21) using a fossil-calibrated phylogeny, on the basis of the alignment of forty shark species and a relaxed clock model. Nucleotide diversity was determined in DnaSP. Effective female population size was calculated as $N_f = \pi / (2\mu)$, with $\mu$ being the mutation rate22,
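The calculation can be checked with a short worked example. The mutation rate is reported per million years, so it must be converted to a per-generation rate before it is used in the formula. The generation time of 25 years used below is an assumption that is not stated in this section; it is chosen because it reproduces the reported point estimates from the reported nucleotide diversities.

```python
def effective_female_size(pi, mu_per_site_per_myr, generation_time_years):
    """Effective female population size, N_f = pi / (2 * mu).

    pi: nucleotide diversity.
    mu_per_site_per_myr: mutation rate per site per million years
        (0.1% per million years corresponds to 0.001).
    generation_time_years: assumed generation time, used to convert the
        yearly rate into a per-generation rate.
    """
    mu_per_year = mu_per_site_per_myr / 1e6
    mu_per_generation = mu_per_year * generation_time_years
    return pi / (2 * mu_per_generation)


GENERATION_TIME = 25  # years; assumption, not stated in this section

nf_edna = effective_female_size(0.00358, 0.001, GENERATION_TIME)    # approx. 71,600
nf_tissue = effective_female_size(0.00692, 0.001, GENERATION_TIME)  # approx. 138,400
```

Because $N_f$ is inversely proportional to $\mu$, the lower bound of the mutation rate interval yields the upper bound of the $N_f$ interval and vice versa, which explains the asymmetric confidence intervals around the point estimates.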
To estimate decay rates, the 6 × 30 l water sample was divided into two 90 l buckets that were placed in sunshine and shade, respectively. A 500 ml sample was collected from each bucket every morning and evening (more often on the first three days) for eight days, giving a total of 22 samples per bucket. (This corresponds to a removal of ~12% of the starting volume by the end of the experiment.) Whale shark eDNA concentrations were estimated by qPCR as above and an exponential decay model was fitted to the data.
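As an illustration of the model fitting, an exponential decay $C(t) = C_0 e^{-kt}$ can be fitted by ordinary least squares on log-transformed concentrations. This is a sketch of one common approach and not necessarily the procedure used in the study; the function name and the synthetic data are hypothetical.

```python
import math


def fit_exponential_decay(times, concentrations):
    """Fit C(t) = C0 * exp(-k * t) by linear regression on ln(C).

    times: sampling times (for example, days since the start).
    concentrations: eDNA concentrations, all strictly positive.
    Returns (C0, k, half_life).
    """
    logs = [math.log(c) for c in concentrations]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    k = -slope                    # decay constant per unit time
    c0 = math.exp(intercept)      # fitted starting concentration
    half_life = math.log(2) / k
    return c0, k, half_life


# Synthetic, noise-free data (not measurements from the study).
days = [0, 1, 2, 4, 8]
conc = [1000 * math.exp(-0.5 * t) for t in days]
c0, k, half_life = fit_exponential_decay(days, conc)
```

The log-linear fit requires strictly positive concentrations, so samples below the qPCR detection limit would need separate handling, for instance by exclusion or by fitting the untransformed model with a nonlinear method.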
Detailed descriptions of all methods can be found in the Supplementary Information.
Illumina MiSeq raw sequence data are available from the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.kn206). Control region sequences for individual whale sharks generated from tissue samples have been deposited in GenBank (NCBI accession numbers KX944487 to KX944547). Input files for phylogenetic analysis of the DL2 fragment in BEAST are available as Supplementary Data files.
How to cite this article: Sigsgaard, E. E. et al. Population characteristics of a large whale shark aggregation inferred from seawater environmental DNA. Nat. Ecol. Evol. 1, 0004 (2016).
We thank the Qatar Ministry of Environment for their collaboration and invaluable support. In particular, the crew on board R/V Saqt Al Khaleej is thanked for help with logistics for the water sampling. We thank the Maersk Oil Research and Technology Centre (MO-RTC) in Doha, Qatar, for being the main sponsor of the project. Special thanks to A. S. Al-Emadi (Head of MO-RTC) and J. Al-Khori (Technical Manager of MO-RTC) for supporting the project. The Danish National Research Foundation and the Natural History Museum of Denmark are thanked for additional funding. We thank T. B. Brand and the rest of the staff at the Centre for GeoGenetics, University of Copenhagen, as well as K. Magnussen and the Danish National Sequencing Centre for laboratory support. M. Krag is thanked for help with tissue samples. L. Olsen and P. Gravlund, National Aquarium Denmark (Den Blå Planet), provided a Stegostoma fasciatum tissue sample. J. V. Schmidt, University of Illinois at Chicago; R. W. Jabado, UAE University, Abu Dhabi, United Arab Emirates; and N. S. Blom, Danish Technical University, are thanked for scientific input. E. Vissing is thanked for the custom Python script.