Genetic footprint of population fragmentation and contemporary collapse in a freshwater cetacean

Understanding demographic trends and patterns of gene flow in an endangered species is crucial for devising conservation strategies. Here, we examined the extent of population structure and recent evolution of the critically endangered Yangtze finless porpoise (Neophocaena asiaeorientalis asiaeorientalis). By analysing genetic variation at the mitochondrial and nuclear microsatellite loci for 148 individuals, we identified three populations along the Yangtze River, each one connected to a group of admixed ancestry. Each population displayed extremely low genetic diversity, consistent with extremely small effective size (≤92 individuals). Habitat degradation and distribution gaps correlated with highly asymmetric gene-flow that was inefficient in maintaining connectivity between populations. Genetic inferences of historical demography revealed that the populations in the Yangtze descended from a small number of founders colonizing the river from the sea during the last Ice Age. The colonization was followed by a rapid population split during the last millennium predating the Chinese Modern Economy Development. However, genetic diversity showed a clear footprint of population contraction over the last 50 years leaving only ~2% of the pre-collapsed size, consistent with the population collapses reported from field studies. This genetic perspective provides background information for devising mitigation strategies to prevent this species from extinction.

| Estimated probability of the data (X) given K group tested in Structure.      Table. S1. Estimated recent migration (upper and lower matrix) and non-migration rates (along the diagonal) per generation between populations identified by STRUCTURE Table S2 | Model specification and prior distributions for demographic parameters. See figure 2 for the demographic parameters of each model tested.      Table S8 | Model selection procedure and performance analysis for the ABC step (bii), figure 4b. Table S9. Model checking for the analysis for the ABC step (bii), figure 4b.
Table S11 | Composite parameter for the scenario SC2 (see figure 4bii). † These authors contributed equally and should be considered as co-first authors.

Appendix S1. Supplementary details on the ABC analyses
Model parameters. The parameters defining each scenario (i.e., effective population sizes Ne, times of population size changes and population splits T, and mutation rates µ) were considered as random variables drawn from prior distributions (figure 4, table S3, S4). For each simulation, DIYABC draw a value for each parameter from its prior distribution and performed coalescent simulations to generate a simulated pseudo-observed dataset or POD with the same number of gene copies and loci per population as in the observed dataset.  Table S4). We accounted for variation in µ mic and P among loci by drawing their individual values from a gamma distribution (Table S4). These settings allowed for large mutation rate variance across loci (i.e. range of 10 −5 to 10 −2 ). We also considered mutations inserting or deleting a single nucleotide in the microsatellite sequence.
We used jModelTest 2.1.7 [2] to identify the best substitution model describing the sequence variation of the mtDNA-CR and estimate its parameters. The best mutation model was a HKY+I+G model [3] with a proportion of constant sites of 16%, and a shape of the gamma distribution of mutations among sites equal to 99.8 (Table S4). We assumed a per-site and per-generation mutation rate ranging uniformly between 1 × 10 −7 and 1 × 10 −5 , as found in the literature [4,5].  [7], shared allele distance (DAS) [8], and (δµ) 2 Goldstein's distance [6,9]. For the mtDNA data, the descriptive statistics within populations include the number of haplotype (NAH), the number of segregating sites (NSS), the mean pairwise differences (MPD) and its variance (VPD), Tajima's D, and the number of private segregating sites (PSS). Statistics computed between groups were the mean of within sample pairwise differences (MP2), mean of between sample pairwise differences (MPB), and H ST between two samples [7,10]. A Euclidean distance was calculated between the statistics obtained for each normalized PODs and observed dataset [8,11].

Model selection procedure using the ABC-Linear
We used the standard ABC-LDA procedure to estimate the Posterior probability (PPr) of each competing scenario using a polychotomous logistic regression [12,13] on the 1% of simulated datasets closest to the observed dataset (lowest Euclidean distance δ), subject to a linear discriminant analysis on the summary statistics as a pre-processing step (to reduce the dimensionality of the data) [14] and avoid the "curse of dimensionality" [15,16]. The best-fitting scenario was selected based on the highest PPr value with a nonoverlapping 95% confidence interval (95% CI).

Model selection procedure using the ABC-Random Forest (RF). Both theoretical arguments and
simulation experiments indicate that approximate PPr estimated from standard ABC analyses for the modeled demographic scenarios can be inaccurate, even though the models being compared can still be ranked appropriately using numerical approximation [17]. To over-come this problem, we complemented the standard ABC-LDA model choice procedure with the newly developed approach based on a machine learning tool named "random forests" (ABC-RF), which selects among the complex introduction models covered by ABC algorithms [15]. The ABC-RF analysis provides a classification vote representing the number of times a scenario is selected as the best one among n trees in the constructed random forest.
The scenario with the highest number of classification vote was selected as the best scenario among a total of 500 classification trees [15]. Posterior probabilities and prior error rates (i.e. the probability of choosing a wrong model when drawing model index and parameter values into the priors of the best scenario) were computed over 10 replicate analyses, as suggested in [18]. We used the abcrf v.1.5.0 R statistical package [15] to conduct the ABC-RF analyses on a reference table that includes 1 x 10 4 PODs from which were calculated a total of 114 summary statistics (SS). They include all the SS used in the ABC-LDA analyses augmented with the Mean index of classification (relationship between two samples) [19,20] and the linear discriminant functions (LDA) as additional synthetic variables, following the recommendation of [15,18].

Performance of the model choice procedure.
We evaluated the ability of the ABC analysis to discriminate between the competing scenarios by estimating the prior error rate under the ABC-RF [15]. We also use the more standard approach by analyzing 300 simulated data sets with the same number of loci and individuals as our real data set under the ABC-LDA. Following the standard ABC-LDA procedure [13], we estimated the Type-I error rate as the proportion of instances in which the selected scenario did not show the highest posterior probability among the competing scenarios, for the 300 simulated datasets generated under the best-supported scenario. Similarly, we estimated the Type-II error rate, by simulating 300 data sets for each of the other competing scenarios and calculating the mean proportion of instances in which the best-supported model was incorrectly selected as the most probable scenario.

Estimation of marginal posterior distribution of the model parameters.
We estimated the posterior distributions of each demographic parameter under the best demographic model, by carrying out local linear regressions on the 1% closest of 10 6 simulated data sets, after a logit transformation to parameter values [11,12].

Goodness-of fit of the fitted model to the data.
Finally we conducted a model checking procedure implemented in DIYABC to evaluate the goodness-of-fit between the posterior parameter distribution and the observed data following [21]. For this analysis, we simulated 1,000 pseudo-observed datasets under each model-posterior combination, with sets of parameter values drawn with replacement from the parameter posterior distribution. This generated a posterior cumulative distribution function for each summary statistic allowing us to estimate how well each fitted model can reproduce the observed summary statistics.

S5
Figure S1 | Estimated probability of the data (X) given K group tested in Structure.   Step 1

• Coalescent simulations under each scenario (SC)
• Simulation of 10 6 pseudo-observed data sets (PODs) for each SC.
• Parameter values (N, t, µ) drawn with replacement from prior parameter distributions • Calculate summary statistics for each POD (SS) and for the observed real data set (SS*) Step 2 •Model selection: • 1-ABC-RF: Estimation of the prior error rate and posterior probability • 2-ABC-LDA: Posterior probability (Pp) of each competing scenario • Polichotomous logistic regression on 1% PODs producing the closest SS value to the observed SS*, after a linear discriminant analysis (LDA) )as a pre-processing step. • Identification of the best SC based on non-overlapping Pp 95% confidence interval Step 3

•Confidence in scenario choice
• ABC-RF: Estimation of the prior error rate • ABC-LDA: Simulate 500 test data sets with each SC.
• For each test dataset, estimate Pp as in STEP2.
• Record error rates produced by each SC to estimate Type-I and Type-II error rates and power to discriminate SC using the ABC-LDA.
Step 4 • Estimation of marginal parameter posterior distributions • Local linear regressions on the closest 1% of 10 6 simulated data sets, after the application of a logit transformation to parameter values.
Step 5 • Model checking (assessment of the goodness-of-fit of a model parameter posterior combination) • Simulating 1,000 pseudo-observed data sets under each studied model-posterior distribution combination.
• Sets of parameter values drawn with replacement from the 1,000 sets of the posterior sample.
• Estimate the p-value of the deviation of the observed value of each statistic from its simulated distribution under the best demographic model.  (Fig. 4a). The location of the observed dataset is indicated by a large black star. (B) Evolution of the ABC-RF prior error rate with respect to the number of trees in the forest. (C) Contributions of the thirty most important statistics to the RF to discriminate among the ten scenarios. The contribution of a statistic is evaluated as the mean decrease in node impurity in the trees of the RF [15]. The meaning of the variable acronyms is provided in Appendix S1.     figure 4 for the demographic parameters of each model tested.
Step a in Fig. 4      The mutation model parameters for the microsatellite loci were the mean mutation rate (µmic), the parameter determining the shape of the gamma distribution of individual loci mutation rate (P), and the Single Insertion Nucleotide rate (SNI). The mtDNA mutation model was is an HKY with two variable parameters, the per-site and generation mutation rate (µseq) and the transition/transversion ratio (K1) parameter, and two fixed parameters, the proportion of constant sites (p-inv.), and the shape of the Gamma distribution of mutations among sites (α). µ mic µ mic P P S17 Table S4. Model selection procedure and performance analysis for the ABC step (a) in figure 4a.    4a).        S10. Parameter estimation for scenario SC2 in natural units (Fig. 4b ii ).