Introduction

Obtaining sufficient sequence information is no longer a limiting factor for deciphering the demographic and selective history of a species. Many studies now use population genomic approaches to shed light on evolutionary questions that also have practical applications for species conservation and management. While next-generation sequencing (NGS) techniques have made such approaches possible, studying non-model organisms without a reference genome remains a challenge. Thus far, whole-genome re-sequencing1, transcriptome analysis2 and de novo restriction site associated DNA (RAD) sequencing3 have been used to investigate the population dynamics of non-model organisms. All of these approaches have limitations: re-sequencing whole genomes of many individuals remains expensive and superfluous when reliable results can be obtained using a subset of the genome4. Generating transcriptome data sets suitable for investigating adaptive change can be complicated by the requirement for high quality RNA and sampling equivalent life stages and tissue types for all individuals5. Short sequence lengths, high rates of allelic dropout6 and missing data7, as well as the vagaries of the bioinformatics pipeline that is used8, can influence estimates of genetic diversity obtained with RAD-seq approaches. Target gene capture approaches offer a promising alternative to these methods. They permit sequencing of pre-selected regions of the genome (see the review from ref. 9) that have been determined a priori to be informative for the research question and are generally effective across a range of sample types and qualities. This allows the generation of relatively unbiased, more complete datasets than can be obtained using RAD technologies and at an affordable cost compared with whole-genome re-sequencing.

The population structure of the study organism is another factor that may complicate our ability to make robust demographic inferences from population genomic data. Previous works have shown that ignoring metapopulation structure can sometimes mislead interpretations of the demographic and evolutionary history of a species10,11,12,13. This is a problem because populations are rarely entirely isolated in nature but instead belong to a network of demes that exchange migrants at different rates. Despite these concerns, few studies have attempted to estimate demographic parameters under complex metapopulation models (see ref. 14 and 15 for some exceptions). The predicted coalescent patterns that are associated with metapopulations sampled at different levels are of special interest. The gene genealogy of a metapopulation is characterised by the product of the effective population size N and the migration rate m16,17,18,19. The coalescent history of a sample of lineages can be divided into two phases: the scattering phase within each single deme and the collecting phase when each of the lineages comes from a different deme16,20. This separation of time scales holds true for several metapopulation models, including range expansions19. Contrasting the shape of the genealogy at different sampling levels (e.g. “single deme” vs. “scatter”, where each lineage comes from a different deme) indirectly provides information on the long term Nm18. It is therefore possible to make inferences about the evolutionary history of the metapopulation using relatively simple demographic models that investigate changes in effective population size across specific sampling schemes. A few individuals judiciously sampled from the range distribution of a species can be sufficient to uncover its demography, when enough loci are available for statistical inferences. Such an approach is now tenable with the large datasets that are being generated using NGS and has important applications for the study of endangered animals, or of any organism for which specimens are not easily obtained.

Here, we demonstrate the use of such an approach to study the demographic history of a non-model organism, the blacktip reef shark (Carcharhinus melanopterus). These animals are highly philopatric and are often found on remote coral islands and atolls21,22. Their proclivity for philopatry and strong habitat preference for coral reefs predisposes them to exhibit a disjunct meta-population structure over their widespread distribution throughout the Indo-Pacific. We devised a two-layered sampling scheme based on the above theoretical considerations associated with metapopulations20: we collected individuals from a single deme from Northern Australia and from a scattered sampling from various locations throughout the Indo-Pacific. We then used a newly developed target gene capture approach23 to generate sequence data for ~1000 pre-specified independent orthologous autosomal regions. This approach allowed us to minimise the sampling of individuals, while ensuring a comprehensive sampling of loci for a species for which no closely related reference genome exists. Demographic parameters were estimated in two ways: i) indirectly by contrasting the gene genealogies at different sampling levels in the metapopulation (“single deme” vs. “scatter”); ii) directly by applying a non-equilibrium finite island model to both sampling schemes. We developed an Approximate Bayesian Computation (ABC) framework24,25 with recombination to estimate parameters and to compare demographic models. The site frequency spectrum (SFS) computed on unphased data was used as summary statistic, thus avoiding the level of uncertainty that is introduced by phasing and haplotype reconstruction.

Carcharhinus melanopterus inhabits reef flats and sheltered lagoons21,22 in the Indian and Pacific Oceans (the “Indo-Pacific”). While their biology has been widely studied their population dynamics and dispersal patterns are largely unknown. Nevertheless, insights into their movements and population health status can be gained through assessments of their evolutionary history. The species is considered globally near threatened (NT) by the International Union for Conservation of Nature (IUCN) Red list, with a decreasing population26. Carcharhinus melanopterus appears to be locally abundant in many areas but recent fishing and anthropogenic pressures on reef environments may have affected their distribution and population dynamics26,27,28. A recent bottleneck was detected in Moorea29 suggesting that decreases in effective population size may also have occurred in some parts of its range. In light of this observation, we used an ABC framework incorporating extensive simulation to explore the population history of C. melanopterus, focusing particularly on signals in the data that might reflect recent changes in population size. In doing so, we also use this study system as a test case to explore the extent to which metapopulation structure can hinder the detection of a recent bottleneck and propose an empirical approach and sampling scheme that addresses this issue.

Results

Genetic diversity and data summary

The demographic history of the metapopulation of C. melanopterus was examined using two datasets: the first based on estimates from a “single deme” (SID) in Northern Australia, the second based on a collection of samples taken from various locations in the Indo-Pacific, which we refer to as the “scatter” sample (SCD) (see Fig. 1, Table 1 and Supplementary Table 1). We sequenced 18 individuals in total with an average coverage of 75×. After applying strict filters and removing duplicate reads (see SI), we obtained sequence data for 995 loci for SCD and 998 loci for SID, spanning 606,647 bp and 632,160 bp respectively (Table 1 and Supplementary Table S2). The average length of the regions is ~600 bp and details of the length distribution for SCD and SID are shown in Supplementary Fig. 1. Overall, 2605 and 1946 high quality single nucleotide polymorphisms (SNPs) were called for the scatter and single deme dataset, respectively. As expected, we detected a higher number of SNPs from SCD (considering samples from different locations) than SID and SNP density is higher in introns than in exons for both datasets (no significant difference was found between SNP density in intron 5′ and intron 3′; Supplementary Table S2). We also performed a Principal Component Analysis (PCA) including all 18 samples to assess the level of population structure within our dataset (Supplementary Fig. S2). At least three distinctive clusters are suggested by the first two principal components, explaining ~40% of the variance and separating Australia from the Indo-Pacific and Oman samples. Additional population sub-structure within the Australian and the Indonesian clusters is highlighted by the third principal component (Supplementary Fig. S2, panel b and d). Overall, this confirms that the signal in our dataset is incompatible with panmixia and thus a simple demographic model of a single isolated population is not appropriate to fully describe the demographic history of these samples.

Table 1 Summary of the “scatter” (SCD) and “single deme” (SID) dataset.
Figure 1
figure 1

Sampling locations with sample size.

Samples in blue were included in the scatter dataset (SCD, 9 samples). Samples in orange were considered for the single deme dataset (SID, 11 samples). Map of sampling locations was generated with the library “rworldmap” with R software (v3.0.2, https://cran.r-project.org/)54.

Demographic inferences: indirect approach

First, simple population genetics models were applied to indirectly estimate the demographic parameters of the metapopulation by contrasting the characteristics of the genealogies under the two sampling schemes, SID and SCD. A constant-size model (model COS) and a demographic-change model (model CHG1) were initially compared (Fig. 2, panel a and b). Both models were found to have equal posterior probability (0.5) in SID, suggesting a pattern compatible with either a weak signal of expansion (Nmod was slightly higher than Nanc, Supplementary Table S3) or a constant population size. This is consistent with the ancestral versus modern effective population size ratio (resize) computed under the CHG1 model (median = 0.48, 95% credible interval (CI) ranging from 0.03 to 2.56). In contrast to these results, model CHG1 showed a posterior probability of 0.89 when tested in SCD and a 95% CI of the resize between 0.03 and 0.54, suggesting a strong signal of expansion. This was confirmed when the effective population size was estimated at different time points in the past with an ABC-skyline reconstruction: a sharp signal of expansion was detected around ~90,000 generations before present, while a constant-size population was clearly observed in SID (Fig. 3). Under CHG1 we were able to detect the time of the expansion of the metapopulation but more recent changes in population size cannot be excluded a priori. Therefore, a model with two demographic changes (model CHG2; Fig. 2, panel c) was also tested in order to account for possible more recent demographic events (e.g. post-glacial expansion), as described for other marine species30,31. Posterior probabilities of model CHG1 and model CHG2 are similar for both datasets (Supplementary Table S4). When the effective population size was estimated at different time points in the past, CHG1 and CHG2 showed the same pattern (Fig. 3 and Supplementary Fig. S3). The estimation of Tc1 in CHG2 for both datasets reflects the estimate of Tc in CHG1 suggesting that CHG2 identifies only the change in effective population size corresponding to the demographic expansion that was also identified by CHG1.

Figure 2
figure 2

Demographic models tested for both sampling schemes.

(a) COS, constant-size model; (b) CHG1, demographic-change model (one demographic change); (c) CHG2, demographic-change model (two demographic changes); (d) FIM, non-equilibrium finite island model. Nmod: modern effective population size; Nanc: ancestral effective population size; Tc, Tc1 and Tc2: time of the demographic change (in generations); Ti: time of the onset of the island (in generations).

Figure 3
figure 3

Skyline reconstruction of the effective population size through time.

(a) single deme (SID, in orange); (b) scatter (SCD, in blue). Both skylines were reconstructed up to the mode of the mean TMRCA across loci. Median values are shown (darker lines) with the 95% high posterior density interval (lighter lines).

Demographic inferences: direct approach

Considering the results from the indirect approach, we therefore explored the demographic signals in the data by fitting a non-equilibrium finite island model20 (model FIM; Fig. 2, panel d) with 100 demes to both datasets. The choice of an island model is justified by the absence of panmixia in the PCA and by the near-uniform Gst matrix (Supplementary Table S5), suggesting similar genetic distances across all samples except Oman. Model FIM is defined by three parameters, namely Nm, the time of the onset of the island Ti and the effective size of the ancestral deme, Nanc. Demographic parameters of the metapopulation were estimated directly for the two different sampling schemes, SCD and SID. The FIM model showed the highest posterior probability in the model choice for both SID and SCD (Table 2). Cross-validation of the model choice confirmed that FIM is distinguished from COS and CHG1 with high probability for both datasets (see SI and Supplementary Table S6). Importantly, the two datasets produced similar estimates of both Nm and Ti (Table 3). We further confirmed this result by applying FIM to the complete dataset of 18 samples (SCD + SID without shared samples) (Table 3). With this value of Nm ≈ 40 in the metapopulation, the signature of the ancestral expansion at Ti is lost in SID but not in SCD, consistent with expectation based on results in simulation studies of range expansion (see ref. 19,32,33). Indeed, for similar Nm values the gene genealogy of lineages sampled from a single deme is a mixture of recent coalescent events occurring during the scattering phase and more ancient coalescent events occurring during the collecting phase. This generates a gene genealogy similar to that typical of a constant size population. Conversely, a scatter sample will show a signature of expansion as most of the coalescent events will occur at the time of the foundation of the metapopulation. Cross-validation of Nm suggests that the estimation is reliable and robust, with bias of the mode around −0.02 and −0.01 for SCD and SID respectively (Supplementary Table 7). We estimated Ti to have occurred ~59,000 generations ago (the average between the point estimates for the onset of the island from SCD and SID; Table 3). We observed an overestimation of the onset of the expansion in CHG1 (as Tc) compared to FIM (as Ti) in SCD. Cross-validation suggests a more robust and accurate estimate of Ti than Tc, with an associated bias of 0.48 and 1.154 respectively (Supplementary Table S7).

Table 2 Models posterior probability calculated as in ref. 58 using the closest 25,000 simulations.
Table 3 Parameter estimation under the finite island model (FIM).

Detection of a recent bottleneck

We used ABC to test whether the absence of a bottleneck in our dataset reflects the real demographic history of C. melanopterus, or a lack of power to detect it. To this end, we simulated pseudo observed data sets (pods) under a modified FIM model (FIM-BOTT) characterised by two additional parameters (Supplementary Fig. S4, panel b): Tbott (the time when an instantaneous decrease in Nm occurred) and Ibott (the ratio of the ancestral Nm to the new one). When Ibott = 1 FIM-BOTT reduces to FIM (Fig. 2 panel d). FIM-BOTT is defined by two major events: the onset of the island at Ti and the reduction in connectivity (i.e., a reduction in the effective population size of a deme, a reduction in the migration rate, or both) at Tbott. We simulated 1,000 pods under model FIM-BOTT for each parameter combination and we reanalysed them using the CHG2 model. CHG2 allows for two changes in effective population size through time and it could therefore recover both events of FIM-BOTT. We mimicked both the SID and the SCD sampling schemes and we tested twelve combinations of Ibott and Tbott, namely two intensities of Nm reduction (100× and 1000×) at six time points (10, 50, 100, 200 500, 1000 generations ago). In this way, we were able to examine their respective behaviours when a recent bottleneck (here represented as a decrease in Nm) was simulated in order to obtain insight regarding the optimal sampling strategy to detect it. We plotted for comparison the results obtained by analysing pods simulated under FIM. For Tbott ≤ 50 generations no signal of the bottleneck was detected for either sampling scheme at any Ibott (Fig. 4, Supplementary Figs S5 and S6). With progressively older Tbott, we observed a general reduction of the estimated Ne when compared to the one estimated from pods simulated under FIM (Fig. 4, Supplementary Figs S5 and S6). This reduction was detected first in SID and only later (~500 generations ago) in SCD (Supplementary Figs S5 and S6). However, a change in Ne through time (expected under a bottleneck) was never observed for any combination of Ibott and Tbott in SCD (Fig. 4 and Supplementary Figs 5 and 6). SID was more affected by the change in connectivity, but a signature of a bottleneck appeared only for Ibott = 1000× and Tbott ≥ 200 (Supplementary Fig. S6). The capacity to detect a bottleneck can depend both on the real properties of the gene genealogy and on the inferential method used to analyse the data (the ABC skyline in this case). Therefore, we tested the power of our ABC-skyline approach with pods simulated under an unstructured model with bottleneck intensities that were comparable to those simulated in the metapopulation scenario. Specifically, we performed simulations under model CHG1-BOTT (Supplementary Fig. S4, panel a), an unstructured population with demography estimated under CHG1 (in our SCD sample) and the same combinations of Ibott and Tbott as above. Here, the parameter Ibott refers to the decrease in Ne, while for FIM-BOTT it referred to the decrease in Nm. Ne estimated from pods simulated under CHG1-BOTT drops compared to the values estimated from pods simulated under CHG1 as recently as 10 generations ago (Fig. 4). Moreover, a clear signature of a bottleneck appears as soon as Tbott = 50 (Fig. 4).

Figure 4
figure 4

Average skyline reconstruction under model CHG2 of pseudo-observed data sets (pods) simulated with model FIM and FIM-BOTT in a metapopulation (for SCD and SID) and in an unstructured population (model CHG1 and CHG1-BOTT).

Data were simulated with Ibott = 1000 at: (a) Tbott = 10 generations; (b) Tbott = 50 generations; (c) Tbott = 100 generations before present. An average skyline reconstruction is shown across 1000 simulations. Solid lines: scenario with bottleneck (model FIM-BOTT and CHG1-BOTT); dashed line: scenario without bottleneck (model FIM and CHG1). Median values are shown (darker lines) with the 95% high posterior density interval (lighter lines).

Similar to the reasoning followed when studying the real data, we further analysed the pods simulated under FIM-BOTT for all combinations of Ibott, Tbott and sampling scheme with the FIM model. We plotted the median of the posterior distributions of Nm estimated from each pods (hereafter, Nmest). To quantify the reduction of Nmest due to the simulated decrease in Nm, we show as comparison Nmest obtained from pods simulated under FIM (Fig. 5, Supplementary Fig. S7). First of all, we note that both sampling schemes can correctly recover Nm when pods are simulated with the FIM model, indicating that this is an appropriate inferential procedure (Fig. 5). We then observed that the two sampling scheme respond differently when pods are simulated under FIM-BOTT. Generally, Tbott seems to have a stronger effect on Nmest than does Ibott (Fig. 5, Supplementary Fig. S7). SCD does not show any significant reduction in Nmest until 500 generations ago, while SID starts showing a decrease almost immediately, after just 10 generations. The decline of Nmest in SID is faster when Ibott = 1,000 than for Ibott = 100. In both cases, the distributions of Nmest for SID and SCD up to 500 generations ago show a significant difference. At 1000 generations, both datasets show the same pattern with severe decreases in Nmest (Fig. 5, Supplementary Fig. S7).

Figure 5
figure 5

Distribution of the median of Nm estimated under model FIM from pseudo-observed data sets (pods) generated with FIM-BOTT and FIM.

Data were simulated for both sampling schemes with Ibott = 100 and various Tbott(10, 50, 100, 200, 500 and 1000 generations ago). Dotted lines represent the value of Nm used to simulate pods under FIM model.

Discussion

In this study we present a novel approach to population genomics that is suitable for application to non-model organisms, particularly those for which it is difficult to obtain large sample sizes. The novelty stems from both the technology applied to produce NGS sequence data and the sampling scheme adopted for the population genomics inferences. First, we used a recently developed target gene capture approach paired with NGS23 that, so far, has only been used for phylogenomics inference34. Second, we exploit intrinsic differences in the shape of gene genealogies that result from the collecting and scattering phases of metapopulations16,18,32 to demonstrate that it is possible to make demographic inferences indirectly based on a few carefully chosen individuals when using a sufficiently large number of loci. We chose to test the utility of our framework using the black tip reef shark, C. melanopterus, because we felt that this species was a good example of the practical challenges that are faced by population and conservation geneticists that are working with non-model systems. Firstly, individuals of this species have a small home range and exhibit strong site fidelity, making it a good example of a non-model organism that exhibits metapopulation structure. Indeed, this species has previously been shown to exhibit extensive population structure and restricted movements in French Polynesia35,36,37, the Great Barrier Reef 38 and Western Australia39. Secondly, they mostly occur in remote regions of the world and, as is the case for the majority of free ranging marine animals, it is logistically difficult to sample them in large numbers. Finally, the International Union for the Conservation of Nature has listed C. melanopterus as “near threatened”, meaning that any information that can be gleaned regarding the demography and structure of this species can have implications for management. These attributes make C. melanopterus a well-suited system to test the efficacy of our approach for reconstructing the evolutionary dynamics, historical demography and associated conservation implications in a non-model species that has a high degree of metapopulation sub-structure, but from which widespread sampling is difficult.

Our approach derives its effectiveness from the separation of time scales (i.e., scattering vs collecting phases), first described elegantly by Wakeley (1999), for the gene genealogies in an island model and later shown to be valid for many metapopulation models, i.e., stepping stone18, spatially continuous40 and range expansion19. Our indirect approach based on the comparison of the genealogical history at the two sampling levels appears robust to mis-specification of the metapopulation model, but it cannot provide precise estimates of the demographic parameters. To further characterise the underlying demography, we analysed our data directly using a non-equilibrium finite island model. While other metapopulation models might better fit the data, they come at the risk of overparameterisation (i.e., non-symmetric island model, spatial models as a stepping stone). Herein, we demonstrate statistically that even a relatively simple model such as the FIM provides a better proxy to the data and more realistic description of a species history than models that ignore population structure (Table 2).

The relative length of the scattering vs collecting phase determines the shape of the gene genealogy and is mostly influenced by Nm16,19. This holds true for lineages sampled within a deme (our SID) and lineages scattered throughout the range of the species (our SCD). If an expansion occurred in the species through the colonisation of new habitat, high Nm values will determine a signature of population growth at all sampling levels18. For decreasing Nm, the signature of expansion is first lost when sampling lineages from a single deme and then in a scattered sample. We found a significant signature of expansion in SCD and a constant population size in SID. Ray et al.19 suggested that at Nm <50 the genealogy in a single deme would mostly resemble that of a constant size population (or even of a population reducing in size) in a range expansion model. When we fit the FIM model independently to both SCD and SID, we found a metapopulation characterised by a value of Nm ~40 (95% CI: 20–110), with an onset of formation around ~59,000 (95% CI: 6,000–165,000) generations ago. This level of connectivity is consistent with philopatry in this species, as previously described in Moorea and Tetiaroa (French Polynesia)35. Unfortunately our sampling scheme does not allow us to test males and females separately, but it would be interesting to investigate possible differences in the level of connectivity related to sex-specific migration. We further apply the same FIM model to the whole dataset and we obtained similar values of both Nm and Ti (Table 3). In summary, the indirect approach of comparing gene genealogies under different sampling schemes and directly applying the FIM model (applied to three largely independent datasets) provided consistent results, suggesting that this approach is effective for reconstructing the demography of this species. We stress that even though there is no spatial component in the FIM model, the time of formation of the metapopulation is analogous to the expansion time in a range expansion model. Therefore, we speculate that this time estimate probably represents the colonisation of the Indian Ocean by C. melanopterus, through a range expansion. Vignaud et al.29 found evidence for genetic structure and isolation by distance at a larger geographic scale (i.e., Pacific and Indian Ocean) for this species. Clearly, a FIM model would not be a good approximation in this case and a non-equilibrium stepping-stone or a range expansion model might be more appropriate. However, our individuals (with the exception of Oman) come from a restricted geographic area so that the spatial component of the demographic model can be neglected. Indeed, we simulated from the posterior distribution of the FIM the expected Fst between two demes at 14 independent microsatellites loci with the same mutation rate as in Vignaud et al.29. We found the Fst distribution to be compatible with the value they found between East and West Australia, which are more distant than the samples from our study area. This further suggests that our model is appropriate for our specific dataset, but we are aware that it might not provide an accurate description of the C. melanopterus demography over its whole range.

We made some simplifying assumptions when analysing the C. melanopterus data. We used a single step change of effective population size in our indirect approach and a constant Nm when fitting the FIM model. Obviously, the real evolutionary histories of species are likely to be more complex and it may be challenging to distinguish between local events (such as a bottleneck or expansion in a deme) and changes in connectivity/number of demes in metapopulations16,41. Climate change and anthropogenic activity are reported to have impacted several marine species, including some sharks30,42,43,44. A recent bottleneck in C. melanopterus was identified in Moorea and attributed to human activity29. In an effort to distinguish between patterns that might be caused by metapopulation structure from those caused by localised bottlenecks we subjected the data to analysis using both SID and SCD. We did not find any evidence for a more recent event than the metapopulation expansion of C. melanopterus (Supplementary Fig. S3), ruling out the possibility of a demographic change related to the post-glacial maximum (~20–15 KYA) as has been proposed for other sharks30. The distribution of C. melanopterus is confined to reef flats and sheltered lagoons and the effects of the last glacial maximum appear to have had little impact on this species. However, recent and mild bottlenecks are more challenging to detect using coalescent analyses45,46. Indeed, many marine species do not show clear signatures of a bottleneck despite being threatened by overfishing for many generations47,48. Nevertheless, meta-analysis studies have shown general reduction in diversity in overfished species49. These issues motivated us to investigate how best to detect a recent reduction in effective size in a metapopulation. To this end, we performed a simulation study focusing on a metapopulation with the same demography as that estimated in our C. melanopterus data but experiencing a recent bottleneck. Obviously, a more extensive investigation of parameter space would be interesting but is beyond the scope of this paper.

This simulation study had three goals: (i) to better understand the recent demographic history of C. melanopterus; (ii) to characterise the behaviour of metapopulations experiencing recent bottlenecks; (iii) to evaluate the relative merits of our different sampling schemes. Analyses of the simulated data paralleled those adopted for the real data, i.e., using the direct and indirect approaches. The indirect approach revealed two important features: (i) SCD fails to detect evidence of a bottleneck, even at high intensities, showing a pattern that is similar to the no-bottleneck case (Fig. 4); (ii) SID similarly does not show evidence of bottleneck, but it shows a reduction in genetic diversity compared to the no-bottleneck case at both intensities (100× and 1000×), within just 10 generations. The genetic consequence of the bottleneck on a SID sample is the reduction in the estimated effective size compared to the no-bottleneck scenario (Fig. 4, Supplementary Figs S5 and S6). However, we could not have detected such a decrease of the effective size without some baseline reference values, suggesting that it is difficult to assess whether a species is critically endangered when it is part of a structured metapopulation. This might explain why many overfished species do not show a genetic signature of bottleneck despite being threatened.

The direct approach confirmed that the two sampling schemes behave differently after a bottleneck. We computed Nm using the FIM model and we found, as expected, that a stable metapopulation shows a similar estimate of Nm under both sampling schemes (Fig. 5 and Supplementary Fig. S7). Conversely, a decrease in the Nm estimated from SID, but not from SCD, is observed under a bottleneck scenario (Fig. 5 and Supplementary Fig. S7), suggesting that the comparison of the two datasets may help detecting recent changes in effective population size. These results (both the ABC-skyline models and the Nm estimates) can be interpreted from a coalescent point of view: a recent bottleneck will produce a burst of coalescent events only at the deme level (i.e., during the scattering phase), having no impact at all on the collecting phase even for high intensity of reduction. A scattered sample, which does not have the scattering phase, will not be affected by recent bottleneck events while a local single deme sample will. These findings suggest that metapopulations are more resilient than unstructured populations to recent changes in effective size (or connectivity), implying that taking population structure into account is crucial when inferring the demography of a species. Most species are structured, including those that are endangered. In conservation studies we need to detect changes in effective population size on the order of tens to, at most, hundreds of generations. Our method can be therefore be usefully applied in a conservation setting as it is able to detect such recent changes. We expect that it can be easily applied to both critically endangered (e.g. the daggernose shark) and commercially exploited (e.g. the Atlantic cod) species to better design appropriate conservation management plans. We now envisage exploring the power of our approach under a wider spectrum of parameters in order to provide guidelines for conservation studies, but this is beyond the scope of this work. In the case of the C. melanopterus data examined in the current study, the single deme sampled from Northern Australia has not experienced a recent bottleneck. However, we cannot exclude the possibility that this may have occurred for the other demes of the metapopulation, where conditions may have been more extreme than those simulated here.

In summary, we present a recently developed target gene capture approach used here for the first time to perform inferences at the intra-specific level. By exploiting theoretical and simulation results on the coalescent history of metapopulations, we show how few carefully collected specimens can provide information about the demography of a species when many loci are available. This strategy can minimise sampling effort and associated cost of data production, making population genomic analysis of non-model organisms more feasible. We showed that this approach allows refined estimates of connectivity among demes relative to traditional population genetics approaches and that the sampling scheme adopted is particularly effective when investigating recent changes in effective population size and, potentially, migration rates. This opens new avenues for theoretical developments on the conservation management and recovery of endangered species.

Materials and Methods

Sampling, library preparation and sequencing

A total of 18 samples of C. melanopterus were included from 8 different locations (Fig. 1): Northern Territory and Queensland, Australia (N = 11), Oman (N = 1), Philippines (N = 1), Thailand (N = 1), Malaysia (N = 1), South Kalimantan, Indonesia (N = 1), West Kalimantan, Indonesian Borneo (N = 1) and East Kalimantan, Indonesia (N = 1). Samples were divided in two datasets called “scatter” (SCD) and “single deme” (SID) with 9 and 11 samples respectively (Table 1, Supplementary Table S1). The SCD dataset aims to mimic a “scatter sample”, sensu Wakeley, where all samples come from independent geographical locations. We also included one sample from Oman in order to infer the demographic history of the metapopulation at a larger scale and to cover a more extensive area of the Indian Ocean. Muscle tissue was extracted from dead specimens collected from local fish markets and then stored in 95% ethanol. Genomic DNA was extracted using the E.Z.N.A Tissue DNA Kit (Omega Bio-Tek, Inc Norcross, GA) according to the manufacturer’s instructions. Illumina sequencing libraries (500 bp) were prepared and amplified by PCR prior to two rounds of target gene capture (targeting 1077 autosomal regions) using a ‘touchdown’ DNA hybridization approach23. Resulting enriched libraries were re-amplified to incorporate a sample specific index, pooled and sequenced paired-end (250 bp reads) on an Illumina MiSeq benchtop sequencer (Illumina, Inc, San Diego, CA). Sequence reads associated with each sample were identified and sorted by their respective indices23.

De novo reference sequence assembly

Briefly, sequence read data from several individuals was used to build a haploid reference sequence for the 1077 target exons and associated introns. Adapters and low quality reads were removed from the raw sequence read data and contigs were assembled de novo. On target material was identified by determining the orthology of assembled contigs to a set of core orthologs identified across model vertebrates (see SI for details). For each exon and intron (5′ and 3′) the longest contig across all sequenced individuals was retained and used to assemble a reference sequence that is 1,293,710 bp including exons and introns (5′ and 3′).

Read mapping, variant calling and filtering

A Burrows-Wheeler Aligner (BWA50) was used to map reads to the reference sequence generated by de novo assembly51. Duplicate read marking was performed with Picard v.1.129 (http://broadinstitute.github.io/picard/), local realignment with the Genome Analysis ToolKit (GATK) v3.3-052 and genotype calling with samtools v1.153. Raw variants were filtered using custom Perl (v5.18.2) and R (3.0.2) scripts (Supplementary Table S8). Filtering was based on strand bias, minimum depth 6×, removal of triallelic sites, removal of sites with heterozygous calls in more than 80% of samples and removal of any missing calls in order to have a dataset without any missing genotypes, reducing background noise and uncertainty (see SI for details).

Descriptive analyses

Principal component analysis (PCA) was performed with the function “prcomp” in the R environment54. A Gst distance matrix55 was calculated with Arlequin 3.556 (see SI for details).

Demographic Inference

We developed an ABC24 framework to estimate parameters and compare demographic models. The folded site frequency spectrum (SFS) and total number of SNPs were used as summary statistics, to avoid phasing issues. Mutation and recombination rates were allowed to vary across loci using hyperprior distributions (see SI for details). Generation time was set to seven years36. Four demographic models were tested for both datasets (Fig. 2). Model COS represents a constant-size population model described only by the effective population size (N) (Fig. 2, panel a). Model CHG1 represents a single instantaneous demographic-change model occurring at the time Tc (Fig. 2, panel b). Nanc and Nmod are the ancestral and modern effective population sizes respectively. Values of the ancestral versus modern effective population size ratio (resize, Nanc/Nmod) greater than 1 are compatible with a reduction in effective population size while values lower than 1 suggest a population expansion. If this ratio is equal to 1, the scenario is compatible with a constant-size population. Model CHG2 represents a demographic-change model in which two events, occurring at times Tc1 and Tc2, cause changes in population size described by three effective population sizes (Nanc, Nint and Nmod for ancestral, intermediate and modern respectively) (Fig. 2, panel c). Model FIM represents a non-equilibrium finite island model with 100 demes (N1…N100) described by Nm as the product between the effective population size (N) and the migration rate (m) (Fig. 2, panel d). Nanc is the effective size of the ancestral deme from which the island originated at the time Ti. One hundred demes was chosen in order to approximate the large number of demes that are necessary to describe the coalescent history in metapopulation models in terms of the scattering and collecting phases20.

We generated 100,000 simulations for each demographic model using fastsimcoal2 v2.5.157. Each simulation includes 995 and 998 gene genealogies for SCD and SID respectively, to be consistent with our real data. Prior distributions are displayed in Table 3, Supplementary Tables S3 and S9. The posterior probabilities of each model were calculated by a weighted multinomial logistic regression58 for which we retained the best 25,000 simulations. The parameters of the best-fitting model were estimated from the 5,000 simulations closest to the observed dataset using a local linear regression according to ref. 24. Posterior distributions for model CHG1 and FIM are shown in Supplementary Figs 8 and 9.

Both CHG1 and CHG2 models were used to graphically reconstruct the variation of effective population size through time. To this end, for each combination of parameters retained by the ABC algorithm (5,000 in our case), we recorded the effective size at specific time points. The median value of the posterior distribution of the effective size at each time point was calculated together with the 95% credible interval and plotted against time to obtain an ABC-skyline reconstruction (Figs 3 and 4, Supplementary Figs S3, S5 and S6). Time points were defined by randomly extracting values from an exponential distribution with a rate calibrated with the upper bound (97.5%) of the estimated mean TMRCA across loci. In this way all generated time points were not greater than 97.5% of the estimated mean TMRCA across loci. Moreover, recent time points (0, 25, 50, 100, 200, 300, 400 and 500 generations ago) were manually added to increase the resolution towards recent events. Analyses were performed in the R environment54 with the library abc59.

We performed cross-validation for both model selection and parameter estimation by randomly generating pods from the prior distributions under each model. For each cross-validation experiment we generated 1000 pods and we applied the same inferential procedure as for the observed data (see SI, Supplementary Figs S10 and S11). In addition, a posterior predictive test60 was carried out to test whether the data can be reproduced under a specific demographic model. Bayesian p-values computed from the posterior distribution of the number of polymorphic sites showed that none of the four models in both datasets could be rejected (SI, Supplementary Fig. S12).

Recent Bottleneck

We tested for signatures of a recent population bottleneck in the metapopulation data set for both sampling schemes. We restricted our ABC analyses to a metapopulation with demographic parameters consistent with those estimated from the empirical data examined in this study. We simulated pods under a modified FIM model (FIM-BOTT) characterised by two additional parameters (Supplementary Fig. S4, panel b): Tbott (the time when an instantaneous decrease in Nm occurred) and Ibott (the ratio of the ancestral Nm to the new one). When Ibott = 1 FIM-BOTT reduces to FIM (Fig. 2, panel d). We tested twelve combinations of parameters, namely two Ibott (100× and 1000×) and six Tbott (10, 50, 100, 200, 500 and 1000 generations ago). For each combination of parameters, we simulated 1000 pods for both SCD and SID (Supplementary Fig. S4, panel b). To further understand the effect of metapopulation structure on bottlenecks, we performed additional simulations of an unstructured population under a modified CHG1 model (CHG1-BOTT, Supplementary Fig. S4, panel a). In this case, pods were simulated with demographic parameters estimated in our real SCD data under model CHG1, to which we added a recent bottleneck characterised by Ibott and Tbott. Note that here Ibott represents the ratio between the ancestral effective population size Ne and the new one (while in FIM-BOTT it represents the ratio of the ancestral Nm to the new one). We tested the same twelve parameters combinations of Ibott and Tbott that we previously used when simulating FIM-BOTT. CHG1-BOTT reduces to CHG1 when Ibott = 1. All pods simulated under both FIM-BOTT (for both sampling schemes) and CHG1-BOTT were analysed using the ABC-skyline produced by the CHG2 model (Fig. 2, panel c). We applied the same settings used for the real data and for each combination of parameters we plotted the average of the Ne through time across the 1000 pods (see SI). All bottleneck scenarios were compared to the ABC-skyline reconstructed from pods simulated under FIM or CHG1. Finally, pods simulated under FIM-BOTT were also analysed using FIM to estimate Nm. We compared these values to those estimated from pods generated under FIM.

Data accessibility

A vcf file containing all variants used in this study is being submitted to the Dryad database (http://datadryad.org/).

Additional Information

How to cite this article: Maisano Delser, P. et al. Population genomics of C. melanopterus using target gene capture data: demographic inferences and conservation perspectives. Sci. Rep. 6, 33753; doi: 10.1038/srep33753 (2016).