Population genomics of finless porpoises reveal an incipient cetacean species adapted to freshwater

Cetaceans (whales, dolphins, and porpoises) are a group of mammals adapted to various aquatic habitats, from oceans to freshwater rivers. We report the sequencing, de novo assembly and analysis of a finless porpoise genome, and the re-sequencing of an additional 48 finless porpoise individuals. We use these data to reconstruct the demographic history of finless porpoises from their origin to the occupation into the Yangtze River. Analyses of selection between marine and freshwater porpoises identify genes associated with renal water homeostasis and urea cycle, such as urea transporter 2 and angiotensin I-converting enzyme 2, which are likely adaptations associated with the difference in osmotic stress between ocean and rivers. Our results strongly suggest that the critically endangered Yangtze finless porpoises are reproductively isolated from other porpoise populations and harbor unique genetic adaptations, supporting that they should be considered a unique incipient species.


Narrow/Wide-ridge
A similar dorsal surface but less tubercules compare with N. a. sunameri.
A narrow tuberculed area on the dorsal ridge.
A wide area of tubercules and more than 10 rows of denticles on the dorsal surface.

Habitat
Fresh water Sea water Sea water

Samples information
There are two forms, i.e., 'narrow-ridge' and 'wide-ridge' formed finless porpoises, which were classified according to tubercles distribution on their back ridge. The individual with width of the tuberculed area larger than 4 cm and more than 10 rows of tubercules were assigned as 'wide-ridged' form, whereas the rest samples with the Total genomic DNA from muscles or skeleton samples was extracted by using stand Phenol-chloroform method.

Genome sequencing
The whole genome shotgun strategy and the next-generation sequencing technologies on the Illumina HiSeq 2000 platform were used to sequence the genome of a finless porpoise. Multiple insert sizes (250bp, 500bp, 800bp, 2Kb, 5Kb, 10Kb, 20Kb, 40Kb) were designed to build sequencing libraries and a total of 484.88 Gb reads were generated for the de novo assembly. To reduce impact of sequencing errors, there are several correction and filter criteria for the raw data from Using these high-quality reads, genome size of finless porpoise is estimated to be 2.488 Gb using a 17-mer analysis 4 . Particularly, sequencing errors could easily bring up low frequency k-mer. To avoid this, error correction procedure was used to deleted 0.59% reads and 2.16% bases which the frequency of 17 k-mer lower than 10.

Genome assembly
After filtering and correction, SOAPdenovo-2.04 (http://soap.genomics.org.cn) 5 was employed to assembly the finless porpoise genome using qualified reads. For constructing the contig, short reads from fragmented small insert-size libraries were assembled into contig base on overlap information. The total contig size and N50 of finless porpoise were 2.28Gb and 26.7 Kb, respectively. Usable reads were realigned to the contig sequences, then paired-end relationship between pairs of contigs was used to construct scaffolds by linking contigs. We calculated and weighted with the rate of consistent and conflicting paired ends before constructing the scaffolds in a stepwise manner from the short-insert size paired ends to the long-insert size paired ends. This time, the total scaffold size and N50 were 2.30Gb and 6.33Mb, respectively. To fill the intra-scaffold gaps, we used the paired-end information to retrieve read pairs with one end uniquely aligned to a contig and the other end in the gap region. At last, a local assembly was achieved for these collected reads.

Assessment of genome assembly
The scatter graph of the distribution of GC content against sequencing depth shown that the distribution of GC content for finless porpoise was largely above 60× and relatively concentrated (Supplementary Figure 1). For region with lower average depth (30-60×) of the scatter plot, it is likely to be Y chromosome, which has half the sequencing depth of autosomes. And there was no obvious difference with an average GC content of 41.1% among the seven cetaceans that had been fully sequenced.
Then, the transcriptome data was used to measure quality of the finless porpoise genome assembly. The total RNA from blood cells of two finless porpoise were extracted with TRIZOL (Invitrogen) and then reversed transcribed into cDNA using the PrimeScript™ RT reagent Kit (Takara), respectively. After sequencing using HiSeq2000 and assembling using Trinity 6 , we generated 72, 056 transcripts to align back to the genome by using BLAT 7 with default parameters except an identity cutoff of 90%.
More than 98% of assembly could be mapped successfully by the unigenes (Supplementary Table 6). Additionally, RNA-seq reads were aligned to the finless porpoise reference genome using TopHat v2.0.7 8 with default parameters. The best quality blood sample achieved 81.71% mapping rate (Supplementary Table 5).
Finally, we further used protein coding genes of the common bottlenose dolphin and baiji and mapped them to the finless porpoise genome assembly by BLAT 7 to make sure the quality of our assembly (Supplementary Table 7).

Detections of heterozygous SNPs
The heterozygosity rate of finless porpoises was estimated (Supplementary Figure 1).
First, the high-quality reads were realigned to the assembly genomes by the help of were used in de novo prediction ( Supplementary Fig. 2). After GLEAN procedures we successfully constructed a non-redundant gene set, and the final protein coding gene set were totally 22,014 (Supplementary Table 8).
Additionally, function annotation of predicted genes were assigned according to the BLASTP with E-value cutoff of 1 × 10 −5 , this best match of the alignment to the  Table 9).
According to the sequence structure of tRNA, tRNAscan-SE 16 Table 11). The major classification of TEs was also calculated (Supplementary Fig. 3 and Supplementary Table 12).

Supplementary note 2 Gene family cluster and orthology relationship
The Treefam methodology 21 Fig. 2).
We assigned a connection (edge) between any two nodes (genes) if more than 1/3 of the region aligned to both genes. An H-score that ranged from 0 to 100 was used to weigh the similarity (edge). In particular, for two genes, G1 and G2, the H-score was defined as a score (G1G2)/max (score(G1G1), score(G2G2)) (the score here is the raw Blast score). Then, the extraction of gene families (clustering by Hcluster_sg) was used the average distance for the hierarchical clustering algorithm, with requiring the minimum edge weight (H-score) to be larger than 5, and the minimum edge density (total number of edges/theoretical number of edges) to be larger than 1/3. A Venn diagrams has been used to show the distribution of shared and unique gene families in seven sequenced cetaceans ( Supplementary Fig. 2).

Expansion/contraction of gene families
CAFÉ 22 was used to calculate the gene family gain and lose over a phylogenetic tree with divergence time based on the model of random birth and death. One important parameter λ (lambda) which describes both the gene birth (λ) and death (μ= -λ) rate across all branches in tree for all gene families was estimated using maximum likelihood method as implemented in RAxML software 23 . For each gene family, the accelerated rate of gain/loss was set to be with conditional P value less than threshold 0.05.

Positively selected genes
Using the 3,911 single-copy genes shared by the finless porpoise, 6 other cetaceans and 10 terrestrial mammals, positively selected genes (PSGs) were identified in the finless porpoise. The branch-site model 24  SNPs were also annotated by SNPEFF 32 and summarized characteristic of SNPs by a customized Perl script. This annotation for the whole SNPs set was used for subsequent population genomic analyses.

Supplementary note 4 Phylogenetic tree and Population structure
Phylogeny tree of finless porpoises were reconstructed based on neighbor-join method by TreeBeST 33 . The program FRAPPE 34 was utilized to infer population structure and ancestry information. We did not assume any prior information about their ancestry. Additional, used ADMIXTURE 35 run 10,000 iterations and pre-defined the number of cluster, K, from 2 to 5. The cross-validation test 36 were used to find the best K value. We performed a PCA following the procedure as reported. The eigenvector decomposition of the transformed genotype data was performed using the R function Eigen, and the significance of the eigenvectors was determined with a Tracey-Widom test, implemented in the program twstats provided by the EIGENSOFT software3.2 37 .

LD analysis
each population were extracted to perform the analysis. These parameters were '−n −pedfile −info −log −minMAF 0.05 −hwcutoff 0.001 −dprime −memory 2096'. After that, values for the r2 and D′ statistics were obtained.

Demographic history reconstruction
The Pairwise Sequential Markovian Coalescent (PSMC) model could be used to inference ancestral effective population size (Ne) based on information from inter-chromosomal genetic differences within a single individual 39 . We applied this model to infer the Yangtze finless porpoise, ocean narrow-ridge form and ocean wide-ridge form in our data set to investigate their respective demographic histories.
The psmcfa format input files was generated follow the authors instruction, using 100 bp bins and accounting for uncallable sites as required by the software usage specification. PSMC was run with the command 'PSMC -N25 -t15 -r5 -p 4+25*2+4+6'.
Results were scaled using an assumed mutation 1.14e-8 per bp per generation and a generation time of 8 years.
The Multiple sequential Markovian coalescent (MSMC) model 40 is an HMM along multiple phased haplotypes which were used to infer effective population size and population separation over time from now to 50000 years ago. Here only autosomes were used and the haplotype were phased based on all the sequenced samples with SHAPEIT 41 . Scaffolds in finless porpoise assembly that shown sytenic relationships to cow X chromosome (determined by Lastz) has not been included in this analysis and previous PSMC model.

Supplementary note 5 Population selection analysis
Composite likelihood ratio (CLR) 42 estimated for each SNP using SweepFinder2 43 for wide and narrow ridged finless porpoises, respectively. The top 20 genome 'peaks' with the CLR higher than 0.2% of CLR were picked out as candidate selective sweep regions, and genes in these regions (within 1Mb flanking the sweep region) are identified as putative genes under selection.
in both directions and for all SNPs. Outlying XP-EHH scores (top 0.1%) are potentially indicative of selection in a particular population. In order to reduce our false positive rate by choosing to declare a region significant only when a cluster of nearby SNPs has outlying XPEHH scores, we divide the genome into 50kb windows with a step size of 5 kb, and identify candidate regions for selection as those in which more than 0.1 fraction of SNPs within them have an XPEHH score above cutoff value of top 1%.