Modern cartilaginous fishes are divided into elasmobranchs (sharks, rays and skates) and chimaeras, and the lack of established whole-genome sequences for the former has prevented our understanding of early vertebrate evolution and the unique phenotypes of elasmobranchs. Here we present de novo whole-genome assemblies of brownbanded bamboo shark and cloudy catshark and an improved assembly of the whale shark genome. These relatively large genomes (3.8–6.7 Gbp) contain sparse distributions of coding genes and regulatory elements and exhibit reduced molecular evolutionary rates. Our thorough genome annotation revealed Hox C genes previously hypothesized to have been lost, as well as distinct gene repertories of opsins and olfactory receptors that would be associated with adaptation to unique underwater niches. We also show the early establishment of the genetic machinery governing mammalian homoeostasis and reproduction at the jawed vertebrate ancestor. This study, supported by genomic, transcriptomic and epigenomic resources, provides a foundation for the comprehensive, molecular exploration of phenotypes unique to sharks and insights into the evolutionary origins of vertebrates.
Cartilaginous fishes (Chondrichthyes) are divided into two subclasses, elasmobranchs (Elasmobranchii, including sharks, rays and skates) and chimaeras (Holocephali), and their common ancestor diverged from the rest of jawed vertebrates around 450 million years ago. More than a decade ago, the elephant fish, Callorhinchus milii, a member of the Holocephali that comprises approximately 50 species, was chosen for whole-genome sequencing because of its small genome size1. Since then, molecular comparative studies on vertebrates have largely relied on the C. milii genomic sequences as representative of cartilaginous fishes2, but the low fecundity and accessibility of live specimens have been a limitation. C. milii is often referred to as elephant ‘shark’ (or ghost ‘shark’), but true sharks belong to the subclass Elasmobranchii that comprises approximately 1,200 species. For elasmobranchs, however, no reliable genome-wide sequence resource allowing extensive molecular analyses has been established to date, in spite of some attempts3,4. Thus, there is an important need to obtain genomic information of elasmobranchs that will contribute to the elucidation of the molecular mechanisms underlying their unique traits of morphology, reproduction, sensing and longevity5, as well as thorough demographic analyses for conservation6,7. Here we report whole-genome analysis of three elasmobranch species (Fig. 1a–c), assisted by phylogenetics-oriented genome informatics. The utility of genome, transcriptome and epigenome data of prolific egg-laying (‘oviparous’) species provided by this study should expand the capacity for in-depth molecular investigation on elasmobranchs.
Sequencing the large genomes of sharks
We focused on the brownbanded bamboo shark Chiloscyllium punctatum, for which we recently tabled embryonic stages8, and the cloudy catshark Scyliorhinus torazame. Their whole genomes, measured to be approximately 4.7 and 6.7 Gbp, respectively, were sequenced de novo to obtain assemblies including megabase-long scaffolds (Supplementary Note 1.1). We also assembled the genome of the whale shark Rhincodon typus using short sequence reads previously generated3 (Supplementary Note 1.2). Using these genome assemblies, we performed genome-wide gene prediction, assisted by transcript evidence and protein-level homology to other vertebrates. The obtained genome assemblies and gene models exhibit high coverage (Supplementary Fig. 1), and of these, the bamboo shark genome assembly achieved the highest continuity (N50 scaffold length, 1.9 Mbp) and completeness (97% of reference orthologues identified at least partially). Using the novel gene models, we constructed orthologue groups encompassing a diverse array of vertebrate species (see below). Our products outperform existing resources for elasmobranchs and provide the tools for genome-wide characterization of molecular evolution at the origin of jawed vertebrates and later in the chondrichthyan lineages.
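The continuity metric quoted above, the N50 scaffold length, can be illustrated with a minimal sketch. It assumes the common definition (the length of the scaffold at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembly length). The function name and the toy lengths are hypothetical and are not taken from the assemblies reported here.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (common definition)."""
    ordered = sorted(lengths, reverse=True)   # longest scaffold first
    half_total = sum(ordered) / 2             # half of the total assembly length
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # first scaffold reaching the halfway point
            return length
    raise ValueError("no scaffold lengths given")


# Invented toy lengths, for illustration only (not real assembly data).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

With these toy values the total is 300, so the halfway point is reached at the second-longest scaffold.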
Genome-wide trends in molecular evolution
We first examined genome-wide trends of molecular evolution, utilizing one-to-one orthologues in the constructed orthologue groups (Supplementary Note 7). Our comparisons of coding sequences detected a higher similarity in nucleotide and amino acid compositions of sharks to tetrapods and coelacanth than to actinopterygian fishes (Supplementary Fig. 2a,c). We performed a phylogenomic analysis using conserved protein-coding genes, which confirmed the phylogenetic positions of elasmobranchs and the reduced rate of molecular evolution in the entire chondrichthyan lineage (Fig. 1d). The reduced evolutionary rate was further scrutinized by comparing the numbers of synonymous substitutions per site (KS) between chondrichthyan and osteichthyan lineages (Fig. 1e). The result revealed that synonymous substitution rates for the chondrichthyan lineages were significantly smaller than those for almost all the osteichthyan lineages analysed (Supplementary Note 12), suggesting a reduced intrinsic mutation rate in the chondrichthyan lineages. Our cross-species comparison revealed a remarkable increase in the intron lengths of shark genomes and its correlation with genome size (Fig. 1f and Supplementary Note 10). Our analysis on the composition of orthologue groups did not detect massive gene duplications in the chondrichthyan lineage (Fig. 1g), which was supported by the inference of age distribution of paralogues (Supplementary Note 11). Thus, the increase of genome size in sharks is not attributable to additional whole-genome duplication.
Characterizing noncoding landscape
To characterize noncoding regions, we first scanned elasmobranch genomes for repetitive sequences including those unique to the species analysed (Supplementary Note 4). This identified long interspersed nuclear elements as the most abundant class of repetitive elements, exceeding the proportions of long terminal repeats and those unclassified into any existing repetitive element class, both of which were particularly expanded in the catshark (Fig. 1h). Overall, the genomic regions identified as repetitive elements, including simple repeats, amounted to half of the individual elasmobranch genome assemblies, and their abundance contributed to the observed variation in genome size (Fig. 1h).
Next, we surveyed the elasmobranch genomes for homologues of human conserved noncoding elements (CNEs), which yielded a much larger number of matches than in teleost genomes (Fig. 2a and Supplementary Note 13). Our analysis revealed some CNEs retained by elasmobranchs but missing in teleost fish and C. milii, which included a CNE in an intron of the Tbx4 gene (Fig. 2b) previously reported as the core lung mesenchyme-specific enhancer9. Its presence in a cartilaginous fish that lacks a lung homologue prompts a reexamination of its evolutionary significance. This finding also highlights the problem of using only a single holocephalan species as a representative of chondrichthyans, regardless of whether the CNEs missing in this species were lost during evolution or are masked by gaps in the genome assembly.
We also searched for elasmobranch homologues of human long noncoding RNAs (lncRNAs), which again revealed more candidate homologues than in teleost fishes (Supplementary Note 14). These were screened for transcript evidence in bamboo shark RNA-seq data and absence of homology to coding sequences. This screening resulted in the identification of 38 transcript contigs with variable degrees of spatial expression biases (Fig. 2c). These putative lncRNAs included a possible homologue of the Malat1 gene10,11, whose presence in chondrichthyans was recently suggested only by a sequence similarity to a C. milii genomic region12. The inclusion of the putative bamboo shark Malat1 homologue in our result validates our screening procedure and more importantly, ascertains its noncoding transcription in a chondrichthyan species.
Overall, these findings indicate that despite the variable genome sizes and repetitive element compositions, elasmobranch genomes have undergone less modification in noncoding regions involved in gene regulation since the jawed vertebrate ancestor than is inferred by their evolutionary distance.
Evolution of Hox genes and clusters
Hox genes play crucial roles in embryogenesis and are organized into four clusters (Hox A–D) in osteichthyans (bony vertebrates) except for teleost fishes13. In the shark genomes, we found well-conserved Hox A, B and D clusters, which have identical gene repertories to their C. milii counterparts (Fig. 3a and Supplementary Fig. 5a) expressed in a temporally collinear manner (Fig. 3d). For a comparison of conformational regulation of Hox gene expression by CCCTC-binding factor (CTCF)14, we analysed the distribution of its binding sites with ChIP-seq (Methods and Supplementary Note 15). This comparison on the Hox A, B and D clusters revealed a high similarity between elasmobranchs and amniotes (Fig. 3b), which makes the whole gnathostomes including elasmobranchs distinct from the lamprey that has more CTCF binding sites within Hox clusters15 (Supplementary Fig. 4e). It is thus suggested that the jawed vertebrate ancestor already possessed the mechanism of CTCF-dependent conformational regulation of Hox genes documented for mammals16,17. We identified antisense transcripts in the genomic region of elasmobranchs containing Hoxa11 and -a13 as putative homologues of lncRNAs previously known only in tetrapods, Hoxa11-AS and Hottip (Fig. 3c and Supplementary Note 14). Although the acquisition of Hoxa11-AS is proposed to be linked with the fin-to-limb evolution18, our discovery of the elasmobranch counterparts indicates their early origins in the common ancestor of jawed vertebrates.
Although the entire Hox C cluster was reportedly missing from elasmobranch genomes19,20, we identified putative Hox C genes in the genome and transcript sequences of the analysed shark species (Fig. 3a,e and Supplementary Fig. 5a–c). While our phylogenetic analyses supported their affiliations to Hox C, those genes showed extremely elevated evolutionary rates. Remarkably, none of the identified, putative shark Hox C genes comprised such a compact cluster as in other jawed vertebrates, spanning within a 100-kbp-long genomic region13—for example, catshark Hoxc11 was flanked by a 50-kbp-long stretch containing no other Hox gene (Supplementary Fig. 5a). In addition, although typical jawed vertebrate Hox clusters are almost free from repetitive elements21 (at most 2.9% in length for elasmobranch Hox A, B and D clusters; Fig. 3a,j), the elasmobranch genome scaffolds containing the putative Hox C genes have accumulated repetitive elements (at least 36.8%; Fig. 3j).
Our analysis of embryonic expression patterns indicated that the identified elasmobranch Hox C genes are still under the spatiotemporal transcriptional regulation that is typically exerted on clustered Hox genes13,22: Hoxc11 (Fig. 3f–i and Supplementary Fig. 5j–l) as well as Hoxc8 (Supplementary Fig. 5g–i) are expressed in the posterior regions concomitantly with their sister paralogues in the Hox A, B and D clusters. We further surveyed the transcriptome data of other elasmobranch species23,24, which uncovered more Hox C genes of the zebra bullhead shark (Fig. 3k). These findings demonstrate that Hox C genes were not lost in a cluster-wide deletion event in the elasmobranch ancestor as proposed previously19, but have eroded intermittently during elasmobranch evolution. Together, while the Hox A, B and D clusters exhibit the canonical, conservative nature even in elongated elasmobranch genomes, their Hox C cluster underwent remarkable lineage-specific, sequence-level modifications.
Encompassing genes secondarily lost in osteichthyan lineages
The constructed orthologue groups contained 304 genes that seemingly existed in the vertebrate ancestor and are retained by elasmobranchs, but have disappeared in osteichthyan evolution (Supplementary Table 8). They included a member of the Fox gene family (designated as FoxG3) whose orthologues are retained only by non-tetrapod vertebrates (Fig. 4a). One of its close paralogues, FoxG1, functions as a key regulator of forebrain development in diverse animals25,26. The third paralogue, designated as FoxG2, was identified in only non-tetrapod vertebrates and some reptiles (Fig. 4a). Molecular phylogenetic analysis showed the triplication between FoxG1, -G2 and -G3 early in vertebrate evolution and among-lineage differential gene loss (Fig. 4a and Supplementary Note 17).
While the determinant of loss or retention of gene duplicates is often imputed to their functions27, the effect of intrinsic genomic characteristics, independent of gene functions, has also been proposed as a cause28. As the less-derived shark genomes are expected to better reconstruct the differentiation process of ancient gene duplicates, we performed a multi-faceted comparison focusing on the shark FoxG paralogues. A highly conserved nature of FoxG1, retained in all the species analysed to date, was observed in not only its amino acid sequence but also in synonymous nucleotide and flanking noncoding genome sequences (Fig. 4b,c). This coding/noncoding association was detected for the divergent nature of FoxG2, while the level of sequence conservation of FoxG3 was intermediate (Fig. 4b,c). The among-paralogue variation was also observed in the GC-content of fourfold degenerate sites (GC4; Fig. 4b and Supplementary Table 15). More remarkably, the flanking sequences of the most divergent paralogue FoxG2 contain the most abundant repetitive elements and the highest GC-content (Fig. 4b). These local genomic characteristics may have facilitated the secondary loss of the coding genes embedded in the divergent genomic regions.
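The GC4 statistic used above can be sketched as follows. This is a minimal illustration, assuming the standard genetic code, an in-frame coding sequence, and that fourfold degenerate sites are the third positions of the eight codon families whose first two bases fully determine the amino acid. The function name and the toy sequence are hypothetical, and this is not necessarily how the values in Fig. 4b were computed.

```python
# First two bases of the eight fourfold-degenerate codon families in the
# standard genetic code: Leu (CT), Val (GT), Ser (TC), Pro (CC),
# Thr (AC), Ala (GC), Arg (CG) and Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def gc4(cds):
    """Return the GC fraction at fourfold degenerate third positions, or None."""
    cds = cds.upper()
    gc_sites = total_sites = 0
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        third = codon[2]
        if codon[:2] in FOURFOLD_PREFIXES and third in "ACGT":
            total_sites += 1
            if third in "GC":
                gc_sites += 1
    return gc_sites / total_sites if total_sites else None


# Invented toy sequence: ATG GCC GTT CTG TAA has three fourfold sites (C, T, G).
print(gc4("ATGGCCGTTCTGTAA"))
```

Codons containing ambiguous bases at the third position are skipped rather than counted, which is one of several reasonable conventions.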
Taking advantage of the access to embryonic samples, we analysed the spatial distribution of catshark FoxG gene expression during development (Fig. 4d–f and Supplementary Fig. 6c–p). FoxG2 was expressed in the acoustico-facial ganglionic complex (VII+VIII) and the vagal ganglion (X) (Fig. 4e). FoxG3 was expressed in an anterodorsal part of the retina, in addition to the FoxG2-positive domains (Fig. 4f), while FoxG1 expression was observed in the forebrain in addition to the FoxG3-positive domains (Fig. 4d). Together, the more prone a FoxG paralogue is to secondary loss, the more restricted is its expression domain in shark embryos. This among-paralogue comparison, enabled by the genomic resource of an egg-laying shark, confirms the association of the fates of gene duplicates with the variable natures of genomic regions containing those duplicates.
Early invention of homoeostatic machinery for gut–brain axis
To further characterize phenotypic traits refined in jawed vertebrates, we focused on gene repertories encoding endocrine hormones and their receptors that control growth, reproduction and homoeostasis. Our phylogenetic census in shark genomes and transcriptomes revealed potential orthologues of hormone and receptor genes previously unidentified in this taxon (Fig. 5a). These included prolactin (PRL1), orexin, kisspeptin, spexin, motilin and prolactin receptor implicated in fertility, appetite, digestion and sleep in mammals29,30,31, as well as osmolarity and gastrointestinal control in teleost fishes (Supplementary Note 18). For leptin whose putative orthologue was previously identified in a genomic sequence of C. milii32, we confirmed the orthology of the C. milii and elasmobranch genes to osteichthyan leptin genes, by means of molecular phylogeny and conserved synteny (Supplementary Note 18.10). Of these hormone and receptor genes, all but leptin were suggested to have existed in the vertebrate ancestor by the presence of possible cyclostome orthologues or gene duplication that probably occurred in genome expansion before the divergence of all extant vertebrates33 (Fig. 5a). This inference marks leptin, a key metabolic and neuroendocrine regulator in mammals34, and the signalling cascade through its receptor (LepR) as an invention in the jawed vertebrate lineage (Supplementary Note 18.10). In mammals, leptin is mainly expressed in adipose tissues35, and we could not identify any tissue with intensive expression of its orthologue in sharks that generally lack overt adipose tissues36 (Fig. 5b).
Overall, we identified the orthologues of almost all hormones and their receptors involved in the hypothalamo–pituitary and gastrointestinal systems documented mainly in mammals (Fig. 5a and Supplementary Note 18). Among these, the genes encoding oxytocin homologues have undergone a unique gene duplication in the elasmobranch lineage37,38, and our genomic and phylogenetic analysis indicated its intricate evolutionary history through intermittent gene conversions (Supplementary Note 18.3). The similarity in transcript localization of the identified shark hormone and receptor genes to mammalian counterparts, such as PRL1 in the pituitary and motilin in the intestine (Fig. 5b and Supplementary Fig. 7a), suggests the establishment of genetic components of the gut–brain axis39 before the last common ancestor of extant jawed vertebrates.
Sensory and neuronal gene repertories
Visual opsin gene repertories are often altered on adaptation to new habitats with dim light40,41. Previously, two short wavelength-sensitive opsin genes, SWS1 and SWS2, were found to be missing in the C. milii genome42. Our search in the elasmobranch genome assemblies ascertained the absence of not only these two but also the green/blue-sensitive opsin gene Rh2 (Fig. 6 and Supplementary Fig. 8). Moreover, long wavelength-sensitive opsin gene (LWS) is absent from the present cloudy catshark genome assembly, and thus rhodopsin (RHO) is the only visual opsin gene identified in it (Fig. 6 and Supplementary Note 19). Previously, the retention of only RHO was reported in some animals that adapted to fossorial, nocturnal or aquatic life43,44,45. In fact, the cloudy catshark inhabits not only inshore but also the deep sea46 (~300 m) and is a close relative of typical deep-sea dwellers47. Thus, the absence of LWS might be due to an evolutionary gene loss that was permitted in the catshark ancestor by its possible exclusive deep-sea habitat (Fig. 6). Adaptation to the deep sea, into which only blue light penetrates48, was previously corroborated for ray-finned fishes by the blueshifted absorption spectra of RHO pigments49. Our spectroscopic analysis of the RHO pigments revealed blueshifted spectra for not only the cloudy catshark (λmax, 484 nm) but also the whale shark (478 nm) that occasionally migrates down to the bathypelagic zone (~2,000 m) besides daytime surface feeding habits50 (Fig. 6). This study portrays the diversity of visual opsin gene repertories among elasmobranchs and illustrates the potential of in vitro molecular experiments supported by genomic sequence analysis, in understanding underwater ecology of inaccessible species.
Our interest extended to olfactory receptor gene repertories that are often linked to adaptation to new lifestyles51. Although our present study does not include carnivorous epipelagic sharks that might have enhanced olfactory sensing, each of the shark species examined in the present study had only three olfactory receptor family genes (Supplementary Note 21), concordantly with the retention of few olfactory receptor family members by C. milii52. This finding indicates that at least the analysed shark species rely on a distinct molecular mechanism for olfaction from the conventional olfactory receptors.
Neuronal cell identities in mammalian brains are defined by the combinatorial expression of clustered protocadherin (Pcdh) genes53,54. Previously, the C. milii genome was shown to also contain a cluster of Pcdh genes55, but their expression profiles have remained unknown. Our study showed that elasmobranchs contain slightly higher numbers of Pcdh genes in markedly longer clusters than osteichthyans (Supplementary Fig. 9a,b and Supplementary Note 20). The bamboo shark transcriptome data demonstrated that most of the clustered Pcdh genes consist of both variable and constant exons and exhibit a biased expression pattern towards neural tissues, as previously shown for mammalian and teleost counterparts (Supplementary Fig. 9c). These findings, which are expected to be reinforced by single-cell analysis, suggest the early establishment of the mechanism for generating neuronal cell diversity through a Pcdh cluster in the last common ancestor of all extant jawed vertebrates.
Our study has provided an unprecedented set of genomic, transcriptomic and epigenomic data from three elasmobranch species, with the bamboo shark genome assembly achieving the highest continuity. We focused on unthreatened, oviparous species that allow captive breeding for continuous animal experimentation including embryonic operation. This is not feasible with other non-tetrapod vertebrates whose genomes are evolving relatively slowly, such as the coelacanths and the spotted gar. It would be intriguing to further explore the possible relationship of the large genome/gene sizes and low evolutionary rate of elasmobranchs with metabolic rate and/or longevity. Our results also highlighted some genomic elements retained by elasmobranchs but missing in holocephalans possibly because of the genome compaction in the latter lineage. Elasmobranchs have scarce repertories of opsin and olfactory receptor genes, possibly associated with their unique niche. Our study suggested that the jawed vertebrate ancestor was already equipped with the mechanism for generating neuronal cell diversity as well as the hormone gene repertories regulating homoeostasis and reproduction in mammals. Also, we showed that elasmobranchs have retained at least parts of the Hox C cluster, in which relict Hox C genes are under the typical Hox-like regulation in spite of relaxed genomic constraint on them. Our products will fuel diverse life science studies on sharks and evolutionary investigation about early vertebrates.
All samples of the brownbanded bamboo shark Chiloscyllium punctatum were supplied by captive breeding at the Osaka Aquarium Kaiyukan. Samples of the cloudy catshark Scyliorhinus torazame were supplied by captive breeding at the Aquarium Facility of RIKEN Center for Developmental Biology and Atmosphere and Ocean Research Institute of University of Tokyo. The developmental staging was performed according to existing literature for small-spotted catshark S. canicula56 and our original table for C. punctatum8. For the whale shark Rhincodon typus, whose genome sequence reads were publicly available3, only transcriptome sequencing and genome size estimation were performed in the present study, using blood sampled primarily for the purpose of regular health check-ups for captive animals from a male at the Okinawa Churaumi Aquarium (for transcriptome sequencing) and a female at the Osaka Aquarium Kaiyukan (for genome size estimate with flow cytometry), respectively. No wildlife was killed solely for this study. Animal handling and sample collections at the aquaria were conducted by veterinary staff without restraining the individuals57, in accordance with the Husbandry Guidelines approved by the Ethics and Welfare Committee of Japanese Association of Zoos and Aquariums. All other experiments were conducted in accordance with the Guideline of the Institutional Animal Care and Use Committee (IACUC) of RIKEN Kobe Branch (Approval ID: H16-11) or the Guideline for Care and Use of Animals at the University of Tokyo.
Genome sequencing and assembly
Genomic DNA was extracted from the liver of a 20-cm-long male juvenile brownbanded bamboo shark and a 4-cm-long whole cloudy catshark embryo of an unknown sex with phenol/chloroform as previously described58. The extracted genomic DNA was sheared with a S220 Focused-ultrasonicator (Covaris) to retrieve DNA fragments of variable length distributions (see Supplementary Table 1 for detailed amounts of starting DNA and conditions for shearing). The sheared DNA was used for paired-end library preparation with a KAPA LTP Library Preparation Kit (KAPA Biosystems). The optimal numbers of PCR cycles for individual libraries were determined with a Real-Time Library Amplification Kit (KAPA Biosystems) by preliminary qPCR-based quantification using an aliquot of adaptor-ligated DNAs. Small molecules in the prepared libraries were removed by size selection using Agencourt AMPure XP (Beckman Coulter). The numbers of PCR cycles and conditions of size selection for individual libraries are included in Supplementary Table 1. Mate-pair libraries were prepared using a Nextera Mate Pair Sample Prep Kit (Illumina), employing our customized iMate protocol59 (http://www.clst.riken.jp/phylo/imate.html). The detailed conditions of mate-pair library preparation are included in Supplementary Table 2. After size selection, the quantification of the prepared libraries was performed using a KAPA Library Quantification Kit (KAPA Biosystems). They were sequenced on a HiSeq 1500 (Illumina), operated by HiSeq Control Software v126.96.36.199 using a HiSeq SR Rapid Cluster Kit v2 (Illumina) and HiSeq Rapid SBS Kit v2 (Illumina), and MiSeq operated by MiSeq Control Software v188.8.131.52 using MiSeq Reagent Kit v3 (600 Cycles) (Illumina). Read lengths were 101, 127, 151 or 171 nt on HiSeq and 251 or 301 nt on MiSeq. Base calling was performed with RTA v184.108.40.206, and the fastq files were generated by bcl2fastq v1.8.4 (Illumina). 
Low-quality bases were removed from paired-end reads with TrimGalore v0.3.3 using the options ‘--stringency 2 --quality 20 --length 25 --paired --retain_unpaired’. Mate-pair reads were processed with NextClip v1.1 (ref. 60) with default parameters. De novo genome assembly and scaffolding employing the processed short reads were carried out with the program PLATANUS v1.2.1 (ref. 61) with its default parameters. The assembly step employed paired-end reads and single reads whose pairs had been removed, and the scaffolding step employed paired-end and mate-pair reads. The gap closure step employed all of the single, paired-end and mate-pair reads. The resultant genomic scaffold sequences were screened to remove contaminating organismal sequences, PhiX sequences loaded as a control, mitochondrial DNA sequences, and sequences shorter than 500 bp, as performed previously28.
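The final length-based screen (discarding scaffolds shorter than 500 bp) can be sketched as below. This is a minimal illustration, not the pipeline actually used: the helper names and the toy FASTA text are hypothetical, and the contamination, PhiX and mitochondrial screens are not shown.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []       # start a new record
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_short(records, min_length=500):
    """Keep only records whose sequence is at least min_length bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Invented toy input: one scaffold just below and one exactly at the cutoff.
toy_fasta = [">scaf_a", "A" * 499, ">scaf_b", "ACGT" * 100, "C" * 100]
kept = drop_short(read_fasta(toy_fasta))
print([name for name, _ in kept])
```

Here a scaffold of exactly 500 bp is retained, following the wording "shorter than 500 bp" for what is removed.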
Measuring nuclear DNA contents
Nuclear DNA contents of the three species were measured as previously described28,62. We used cells prepared from the liver and blood of a 32-cm-long juvenile female bamboo shark, the blood of a 5-m-long live female whale shark reared in Osaka Aquarium Kaiyukan, and the liver and blood of a 16-cm-long juvenile male cloudy catshark (see above for the detail of sampling). Mouse embryonic fibroblast cells, used as a reference, were prepared from E14.5 embryos and cultured in DMEM media supplemented with 10% FBS, at 37 °C with 5% CO2. We also used human GM12878 cells as a reference, which were cultured in RPMI-1640 media (Thermo Fisher Scientific) supplemented with 15% FBS, 2 mM L-glutamine, and 1× antibiotic-antimycotic solution (Gibco) at 37 °C with 5% CO2. Liver tissues of bamboo shark and catshark were minced using scissors, rinsed once in shark saline solution (222.45 mM NaCl, 1.34 mM KCl, 2.38 mM NaHCO3 and 333 mM urea)63, and incubated in 0.125% trypsin-EDTA solution (1:1 mixture of 0.25% trypsin-EDTA (Thermo Fisher Scientific) and shark saline solution) for 15 min at 37 °C with gentle agitation to dissociate the cells. FBS was added to stop digestion with trypsin. The cell suspension was filtered through a 40 μm cell strainer (BD Bioscience) to remove cell clumps and debris. Blood of bamboo shark and catshark was sampled from the heart using a 1 ml syringe with a 21 G needle and immediately diluted 1:10 in shark saline solution containing 2 mM EDTA. After centrifugation at 500g for 5 min, blood cells were washed once in shark saline solution containing EDTA and counted. A total of 1 × 10⁶ cells were collected by centrifugation at 500g for 5 min and permeabilized in shark saline solution containing 0.05% Triton X-100. DNA staining was performed by adding 1 ml of PI/RNase staining buffer (BD Bioscience).
After a 15 min incubation at room temperature, cells were centrifuged at 1,500g for 5 min and resuspended in 400 μl of fresh propidium iodide (PI)/RNase staining buffer. Fluorescence intensities were measured with the excitation at 488 nm and the bandpass filter of 575/26 nm on a FACSCanto II cell sorter (BD Bioscience). Measurements were carried out with three technical replicates per sample, and the acquired values were averaged before DNA content calculations (Supplementary Table 4).
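The DNA content calculation itself follows the cited procedures and is not spelled out here. As a rough sketch only, the example below assumes that PI fluorescence scales linearly with nuclear DNA content, so that a sample's genome size can be estimated from the ratio of its replicate-averaged intensity to that of a reference cell type of known genome size measured at the same ploidy and cell-cycle stage. The function name, intensities and reference size are invented placeholders, not measured values.

```python
from statistics import mean


def estimate_genome_size(sample_intensities, reference_intensities,
                         reference_size_gbp):
    """Estimate genome size (Gbp) from replicate fluorescence intensities.

    Assumes fluorescence is proportional to DNA content and that sample and
    reference peaks correspond to the same ploidy and cell-cycle stage.
    """
    sample_mean = mean(sample_intensities)        # average technical replicates
    reference_mean = mean(reference_intensities)
    return reference_size_gbp * sample_mean / reference_mean


# Invented placeholder numbers for illustration only.
toy_sample = [200.0, 210.0, 190.0]
toy_reference = [100.0, 100.0, 100.0]
print(estimate_genome_size(toy_sample, toy_reference, reference_size_gbp=3.0))
```

Averaging the three technical replicates before taking the ratio mirrors the order of operations described above.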
Completeness assessment of genome assemblies
Completeness of the genome assemblies was assessed with (1) CEGMA v2.5 (ref. 64), (2) BUSCO v2.0.1 (ref. 65) and (3) a manual curation-based census of Wnt genes (Supplementary Fig. 1). For both CEGMA and BUSCO, we employed not only the reference gene sets provided inherently with these program pipelines, but also the core vertebrate genes introduced particularly for vertebrates, especially species in isolated lineages such as elasmobranchs66. The assessments were executed on the gVolante web server67 (Supplementary Note 2) using a script released by us previously66. The genome-wide census of Wnt genes was performed with TBLASTN 2.2.31+ (ref. 68) searches in the elasmobranch genome assemblies using manually curated amino acid sequences of the C. milii Wnt homologues as queries, followed by a fine-scale exon search using the open reading frame sequences predicted on the individual elasmobranch genomes as queries.
To obtain species-specific repeat libraries, RepeatModeler v1.0.8 (ref. 69) was run on the genome assemblies of the individual species with default parameters. Detection of repeat elements in the genomes was performed by RepeatMasker v4.0.5 (ref. 70), which employs National Center for Biotechnology Information (NCBI) RMBlast v2.2.27, using the custom repeat library obtained above. For gene prediction, the parts of genome sequences detected as repeats were soft-masked with the options ‘-nolow -xsmall’.
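Soft-masking means that repeat regions are written in lowercase while the rest of the sequence stays in uppercase, so that downstream gene predictors can treat the masked bases differently without losing them. The sketch below illustrates the idea on a toy sequence. The function name and the interval convention (0-based, end-exclusive) are assumptions for illustration, and the code does not reproduce RepeatMasker itself.

```python
def soft_mask(sequence, repeat_intervals):
    """Return the sequence in uppercase with repeat intervals in lowercase.

    repeat_intervals holds (start, end) pairs, 0-based and end-exclusive
    (an assumed convention for this sketch).
    """
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        # Clamp each interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


# Invented toy sequence with one hypothetical repeat at positions 2-4.
print(soft_mask("ACGTACGTAC", [(2, 5)]))
```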
Construction of gene models
Construction of gene models on the cloudy catshark, whale shark and bamboo shark genomes was performed in this order, following the procedure previously reported15 (Supplementary Note 5). The gene prediction program Augustus v3.1 was employed with ‘trained’ species-specific parameters and hints based on RNA-seq reads and amino acid sequences of putative homologues from other vertebrates. To build homologue hints for the cloudy catshark, we used a set of 117,246 NCBI RefSeq protein sequences downloaded on 23 November 2015, including ‘known’ human proteins (39,582 sequences), chicken (6,189 sequences) and amniote vertebrates (44,675 sequences), as well as C. milii (NCBI Genome version 6.1.3, 26,800 sequences). For constructing gene models of the whale shark, we used a sequence set combining all predicted cloudy catshark peptide sequences along with the above-mentioned sequence set. Likewise, for the gene prediction of the bamboo shark, we incorporated the predicted whale shark peptide sequences into the sequence set used in gene prediction for the catshark. RNA-seq data used for exon hint construction is indicated in Supplementary Table 6.
RNA-seq and transcriptome data processing
Total RNAs were extracted with Trizol reagent (Thermo Fisher Scientific). Quality control of DNase I-treated RNA was performed with Bioanalyzer 2100 (Agilent Technologies). Libraries were prepared with TruSeq RNA Sample Prep Kit (Illumina) or TruSeq Stranded mRNA LT Sample Prep Kit (Illumina) as previously described66. The amount of starting total RNA and numbers of PCR cycles are included in Supplementary Table 6. The obtained sequence reads were trimmed for removal of adaptor sequences and low-quality bases with TrimGalore v0.3.3 as outlined above, and de novo transcriptome assembly was performed with the program Trinity v2.2.1 (ref. 71) with the parameters ‘--SS_lib_type RF --min_kmer_cov 3’. Alignment of the trimmed RNA-seq reads to the genome assembly employed TopHat2 v2.0.11, followed by gene expression quantification with Cuffdiff v2.1.1, while read alignment to coding sequences employed bowtie2 v2.2.8 and eXpress v1.5.1.
Comparison of conserved noncoding elements (CNE)
A set of previously identified CNEs for the human genome hg19 was downloaded from UCNEbase (ref. 72; http://ccg.vital-it.ch/UCNEbase/data/download/fasta/hg19_UCNEs.fasta.gz). This set included 4,351 genomic segments in the human genome that exhibit >95% nucleotide sequence identity with counterparts in the chicken genome and are longer than 200 bp. Ten of the retrieved CNEs that include regions annotated as protein-coding in the hg38 genome assembly were removed with bedtools v2.25.0 (ref. 73), and the remaining 4,341 sequences were queried with BLASTN 2.5.0+ in two different modes, namely ‘megablast’ and ‘dc-megablast’, against the genome assemblies of the bamboo shark and the cloudy catshark from this study and those of other vertebrate species (coelacanth, LatCha1; spotted gar, LepOcu1; western clawed frog, Xtropicalis_v7; zebrafish, GRCz10; medaka, MEDAKA1; Arctic lamprey, LetJap1; sea lamprey, Pmarinus_7.0). The number of best hits that were longer than 100 bp was counted for each of the two search modes. In the analysis of the enhancer in the Tbx4 locus, the program LAST v752 (ref. 74) was used to detect conserved noncoding elements. Sequence similarity between species was visualized with VISTA (ref. 75) using the global pairwise alignment program Shuffle-LAGAN (ref. 76).
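The counting of best hits can be illustrated with a minimal Python sketch. It assumes that the BLASTN results were written in the standard 12-column tabular format, that the best hit of a query is the one with the highest bit score, and that the 100-bp threshold applies to the alignment length. None of these details is specified in the text, and the function name and sequence identifiers are hypothetical.

```python
import csv
import io


def count_best_hits(tabular_text, min_length=100):
    """Count queries whose best hit has an alignment longer than min_length.

    Assumption: 'best hit' means the hit with the highest bit score
    (column 12) and 'length' means the alignment length (column 4).
    """
    best = {}  # query id -> (bit score, alignment length)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment or malformed lines
        query, length, bitscore = row[0], int(row[3]), float(row[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, length)
    return sum(1 for _, length in best.values() if length > min_length)


def make_row(query, subject, length, bitscore):
    # Columns that the sketch does not use are filled with placeholders.
    return "\t".join([query, subject, "90.0", str(length), "0", "0",
                      "1", str(length), "1", str(length), "1e-20",
                      str(bitscore)])


# Hypothetical hits: cne_A has two hits, of which the longer one scores best.
example = "\n".join([
    make_row("cne_A", "scaffold_1", 180, 250.0),
    make_row("cne_A", "scaffold_9", 60, 80.0),
    make_row("cne_B", "scaffold_2", 90, 120.0),
    make_row("cne_C", "scaffold_3", 101, 150.0),
])
print(count_best_hits(example))  # prints 2 (cne_A and cne_C)
```

Running the function once per search mode (‘megablast’ and ‘dc-megablast’) and per target genome would give counts of the kind compared here.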
Search for long noncoding RNA (lncRNA)
Human lncRNA sequences were downloaded from the GENCODE database (ref. 77; release 25; https://www.gencodegenes.org/), which included 27,692 sequences. We removed sequences of antisense RNAs that overlapped open reading frames in the lncRNA database and repetitive sequences that were masked with RepeatMasker v4.0.5, as described above. By using the program BLASTN 2.5.0+ with the dc-megablast mode, we queried the refined set of lncRNAs against the genome assemblies of other vertebrates used above in CNE detection. Following the BLASTN searches, the best hits whose bit scores exceeded 60 were counted. To identify lncRNA candidates transcribed in the bamboo shark, we first made a transcriptome assembly from the bamboo shark RNA-seq data in Supplementary Table 6, using Trinity v2.4.0 (ref. 71) with the options ‘--SS_lib_type RF --trimmomatic’. Next, human lncRNAs were queried against the transcriptome assembly with BLASTN. Subsequently, we selected the best hits of the BLAST search whose bit scores exceeded 50 and removed those that were aligned with the opposite strands of human lncRNAs. We also queried the lncRNA candidates against the Augustus-predicted coding genes of the bamboo shark genome and removed sequences if the best hits were aligned with the forward strands of the predicted coding genes. To analyse the tissue distribution of the validated lncRNAs using RNA-seq data, we masked repeat elements in the transcriptome assembly with RepeatMasker and the repeat library built above for the bamboo shark genome, which was followed by read mapping performed as described above.
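The selection of transcribed lncRNA candidates can be sketched in Python as follows. The sketch assumes that each hit has already been parsed from BLASTN tabular output into a dictionary, and that strand is inferred from the subject coordinates, which BLASTN reports with the start greater than the end for minus-strand hits. The field names, identifiers and function name are hypothetical, and the later comparison against Augustus-predicted coding genes is not covered.

```python
def select_lncrna_candidates(hits, min_bitscore=50.0):
    """Return (query, subject) pairs for best hits that pass both filters.

    Filters, in order: keep the highest-scoring hit per query, require a
    bit score above min_bitscore, and drop hits on the opposite strand
    (assumption: indicated by subject start > subject end).
    """
    best = {}
    for hit in hits:
        query = hit["query"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return sorted(
        (hit["query"], hit["subject"])
        for hit in best.values()
        if hit["bitscore"] > min_bitscore and hit["sstart"] <= hit["send"]
    )


# Hypothetical parsed hits of human lncRNAs against transcript contigs.
example_hits = [
    {"query": "lnc_1", "subject": "contig_10", "bitscore": 120.0, "sstart": 5, "send": 300},
    {"query": "lnc_1", "subject": "contig_11", "bitscore": 70.0, "sstart": 1, "send": 90},
    {"query": "lnc_2", "subject": "contig_20", "bitscore": 95.0, "sstart": 400, "send": 120},
    {"query": "lnc_3", "subject": "contig_30", "bitscore": 45.0, "sstart": 1, "send": 80},
    {"query": "lnc_4", "subject": "contig_10", "bitscore": 60.0, "sstart": 10, "send": 200},
]
print(select_lncrna_candidates(example_hits))
```

In this example, lnc_2 is dropped because its best hit lies on the opposite strand and lnc_3 because its bit score does not exceed 50.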
Antibody validation for chromatin immunoprecipitation (ChIP) assays
Western blotting for catshark CTCF protein was performed as previously described (ref. 15), using protein extracts from tissues of a juvenile catshark (muscle and liver) and a human GM12878 cell line with antibodies for CTCF (Cell Signaling Technology, #3418 S in 1:2,000 dilution) and histone H3 (Wako, #304-34781 in 1:2,000 dilution). Immunoprecipitation was performed as previously described (ref. 15) using the protein extract from the eye of a juvenile cloudy catshark. Protein identification was performed as described previously (ref. 78), with nanoliquid chromatography tandem mass spectrometry using LTQ Orbitrap Velos Pro (Thermo Fisher Scientific), followed by data analysis with the MASCOT v2.6.1 software (Matrix Science).
ChIP-seq and data processing
Bamboo shark embryos at stage 27, cloudy catshark embryos at stage 27.5 and the stomach of a juvenile cloudy catshark were dissected, snap frozen in liquid nitrogen and kept at −80 °C until use. A whole embryo or a stomach sample of approximately 1 × 10⁷ cells was used for ChIP with the above-mentioned anti-CTCF antibody. ChIP assays, as well as ChIP-seq and downstream data analysis, were performed as previously described (ref. 15). Trimming of the obtained sequence reads, mapping against the genome assemblies, and peak calling were performed with TrimGalore v0.3.7, Bowtie v0.12.8 (ref. 79) and MACS2 v2.0.10 (ref. 80), respectively. The peaks overlapping between replicates were identified with bedtools v2.19.1 (ref. 73) and designated as ‘consensus peaks’. A subset of ‘consensus peaks’ with a fold enrichment value of no less than 10 was assigned as ‘significant peaks’. For the catshark stomach sample without a replicate, peak calling was performed using the embryonic sample as input. Significant peaks for the catshark stomach sample were determined with a fold enrichment value of no less than 10. CTCF core and upstream motifs enriched in the top 2,000 peak regions (peak summit ± 100 bp) were identified with MEME v4.10.0 (ref. 81). FIMO v4.10.1 (ref. 82) was then used to identify motif locations in the entire peak set (peak summit ± 100 bp). When multiple motifs were identified within a peak region, only the motif with the lowest p value was adopted for downstream analyses.
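The peak filtering logic can be illustrated with a simplified Python sketch that stands in for the bedtools-based procedure. It assumes half-open, BED-style peak intervals, reports the replicate 1 peaks that overlap any replicate 2 peak, and applies the fold enrichment threshold to the replicate 1 value; which replicate's value was actually used is not stated in the text. All function names and example records are hypothetical.

```python
def consensus_peaks(rep1, rep2):
    """Return rep1 peaks overlapping at least one rep2 peak on the same scaffold.

    Each peak is (scaffold, start, end, fold_enrichment) with a half-open
    interval [start, end), as in BED files.
    """
    by_scaffold = {}
    for scaffold, start, end, _ in rep2:
        by_scaffold.setdefault(scaffold, []).append((start, end))
    return [
        peak for peak in rep1
        if any(peak[1] < end and start < peak[2]
               for start, end in by_scaffold.get(peak[0], []))
    ]


def significant_peaks(peaks, min_fold=10.0):
    """Keep peaks with a fold enrichment of no less than min_fold."""
    return [peak for peak in peaks if peak[3] >= min_fold]


def best_motif_per_peak(motif_hits):
    """Keep, for each peak, only the motif hit with the lowest p value."""
    best = {}
    for peak_id, motif, p_value in motif_hits:
        if peak_id not in best or p_value < best[peak_id][1]:
            best[peak_id] = (motif, p_value)
    return best


# Hypothetical peaks from two replicates.
rep1 = [("scf1", 100, 300, 12.5), ("scf1", 1000, 1200, 8.0), ("scf2", 50, 250, 15.0)]
rep2 = [("scf1", 250, 400, 11.0), ("scf1", 1100, 1300, 9.0), ("scf3", 50, 250, 20.0)]
consensus = consensus_peaks(rep1, rep2)
significant = significant_peaks(consensus)

# Hypothetical motif hits within peak regions.
motifs = [("peak1", "core", 1e-6), ("peak1", "upstream", 1e-4), ("peak2", "core", 1e-3)]
chosen = best_motif_per_peak(motifs)
```

Here two of the three replicate 1 peaks are consensus peaks, and only the first also passes the tenfold enrichment threshold.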
In situ hybridization
Catshark and bamboo shark embryos were fixed with 4% PFA/PBS, dehydrated with a methanol series and stored in 100% methanol at −30 °C until use. In situ hybridization using whole-mount embryos and paraffin-embedded sections of 8 μm thickness was performed as previously reported (refs. 83,84). Riboprobes were synthesized using complementary DNA amplified with the gene-specific primers in Supplementary Table 14 as templates. The regions for cDNA amplification were selected in untranslated or non-conserved coding parts of exons to avoid cross-hybridization between paralogues.
Orthologue group construction
We employed the OMA platform (ref. 85) for producing orthologue groups composed of diverse vertebrate species including the four cartilaginous fishes (Callorhinchus milii, Chiloscyllium punctatum, R. typus and S. torazame). We first retrieved all-against-all alignment results of the predicted peptides of the 19 osteichthyans included in Fig. 1d and sea lamprey from the OMA database (released in March 2017). OMA standalone v2.1.1 (ref. 85) was run to perform additional all-against-all comparisons by incorporating the peptides of the four cartilaginous fishes and Arctic lamprey, which produced 31,498 hierarchical orthologous groups (HOGs).
Quantifying synonymous substitutions
For computation of the numbers of synonymous substitutions per site (KS; Supplementary Note 12), 1,656 one-to-one orthologues retained by the four cartilaginous fishes and ten osteichthyans were selected from the HOGs as follows (Supplementary Table 8). First, peptide sequences of the retrieved orthologues were aligned with MAFFT v7.299b (ref. 86) with the option ‘-linsi’. The individual alignments were trimmed and back-translated into nucleotides with trimAl v1.4 rev15 (ref. 87) with the options ‘-automated1 -backtrans’, followed by removal of gapped sites using trimAl with the option ‘-nogaps’. Orthologue groups containing fewer than 50 aligned codons or a stop codon were discarded. For the selected orthologue groups, KS values were computed with codeml in PAML v4.9c (ref. 88).
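The final filtering criterion can be illustrated with a short Python sketch. It assumes gap-free, in-frame codon alignments, the standard genetic code (stop codons TAA, TAG and TGA), and that a stop codon in any sequence disqualifies the whole group; the function names and example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def split_codons(sequence):
    """Split an in-frame nucleotide sequence into upper-case codons."""
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    return [sequence[i:i + 3].upper() for i in range(0, usable, 3)]


def keep_group(alignment, min_codons=50):
    """Return True if every sequence has >= min_codons codons and no stop codon.

    `alignment` maps species names to gap-free aligned coding sequences.
    """
    if not alignment:
        return False
    for sequence in alignment.values():
        codons = split_codons(sequence)
        if len(codons) < min_codons:
            return False
        if any(codon in STOP_CODONS for codon in codons):
            return False
    return True


# Hypothetical orthologue groups.
good_group = {"species_1": "ATG" * 60, "species_2": "ATGGCC" * 30}
short_group = {"species_1": "ATG" * 49, "species_2": "ATG" * 49}
stop_group = {"species_1": "ATG" * 60, "species_2": "ATG" * 59 + "TAA"}

print(keep_group(good_group), keep_group(short_group), keep_group(stop_group))
```

Only the groups for which the function returns True would be passed on to the KS computation.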
Phylogenetic tree inference
To reconstruct the species tree in Fig. 1d, we retrieved 935 one-to-one orthologue groups retained by all of the 25 vertebrates used in the HOGs, allowing for none or one missing orthologue for individual groups. The peptides of the individual orthologue groups were aligned with MAFFT v7.299b (ref. 86) with the option ‘-linsi’. Unambiguously aligned sites were selected with trimAl v1.4 rev15 (ref. 87) with the option ‘-strictplus’, followed by concatenation of these alignments into one. Phylogenetic tree inference was performed with the maximum-likelihood method using the program RAxML v8.2.8 (ref. 89) with the options ‘-m PROTCATWAG -f a -# 100’, assuming the partition model for individual orthologue groups (the ‘-q’ option).
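The concatenation step and the partition definitions passed to RAxML can be sketched in Python as follows. The sketch assumes that a missing orthologue is padded with gap characters and that each orthologue group forms one partition under the WAG model, written in the ‘MODEL, name = start-end’ form of RAxML partition files. The function name, group names and sequences are hypothetical, and how missing data were actually encoded is not stated in the text.

```python
def concatenate_alignments(alignments, species, model="WAG"):
    """Concatenate trimmed alignments and build RAxML-style partition lines.

    `alignments` is a list of (group_name, {species: aligned_peptide}).
    Species absent from a group are padded with '-' (an assumption).
    """
    parts = {name: [] for name in species}
    partitions = []
    position = 1  # partition coordinates are 1-based and inclusive
    for group_name, alignment in alignments:
        lengths = {len(sequence) for sequence in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"unequal sequence lengths in {group_name}")
        length = lengths.pop()
        for name in species:
            parts[name].append(alignment.get(name, "-" * length))
        partitions.append(f"{model}, {group_name} = {position}-{position + length - 1}")
        position += length
    concatenated = {name: "".join(chunks) for name, chunks in parts.items()}
    return concatenated, partitions


# Hypothetical alignments; species "b" lacks an orthologue in the second group.
example_alignments = [
    ("og1", {"a": "MKT", "b": "MRT", "c": "MKS"}),
    ("og2", {"a": "LLPQ", "c": "LIPQ"}),
]
matrix, partition_lines = concatenate_alignments(example_alignments, ["a", "b", "c"])
print("\n".join(partition_lines))
```

The partition lines would be written to the file supplied with the ‘-q’ option, alongside the concatenated alignment.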
To infer individual gene family trees, amino acid sequences were retrieved from aLeaves (ref. 90) incorporating Ensembl release 84. Multiple sequence alignment was performed with MAFFT with the option ‘-linsi’. The aligned sequence sets were processed using trimAl v1.4 rev15 (ref. 87) with the option ‘-automated1’. This was followed by another trimAl run with the option ‘-nogaps’ in the tree inference for Figs. 3e and 4a, and Supplementary Figs. 4a, 5b, 6a and 7s. Molecular phylogenetic trees were inferred with RAxML with the ‘-m PROTCATWAG -f a -# 1000’ options unless stated otherwise. Tree inference in the Bayesian framework was performed with the program PhyloBayes v4.1c (ref. 91) with the options ‘-cat -dgam 4 -wag -nchain 2 1000 0.3 50’ unless stated otherwise. This was followed by an execution of bpcomp in the PhyloBayes v4.1c package with the option ‘-x 100’. The support values at the nodes of the molecular phylogenetic trees are, in order, bootstrap values and Bayesian posterior probabilities. The latter are shown only when the relationship at the node in the visualized tree was supported by the Bayesian inference.
Clustered Pcdh gene identification
The genomic scaffold sequences of elasmobranch sharks were first examined via TBLASTN v2.2.29+ using the amino acid sequences of individual clustered Pcdh genes of C. milii (retrieved from http://ensembl.fugu-sg.org, gene IDs: B0YN55-B0YN99, B0YNA0 and B0YNA1) and human clustered and non-clustered Pcdh genes (retrieved from the UCSC Genome Browser) to identify any prospective elasmobranch scaffolds containing clustered Pcdh genes. The regions exhibiting homology to the known C. milii and human clustered Pcdh proteins were utilized for gene prediction, which was accomplished by combining GeneWise v2.2.3-rc7 (ref. 92) and geneid v1.4 (ref. 93). The predicted genes were further refined through manual inspection of exon–intron junctions and transcript evidence from RNA-seq data (Supplementary Table 6). For each species, HISAT2 v2.0.4 (ref. 94) was run on each tissue sample with the options ‘-k 200 --known-splicesite-infile’ by inputting a list of splice sites extracted from the Pcdh gene annotation of the individual species. These alignments were passed to StringTie v1.3.0 (ref. 95) to generate an additional set of tissue-specific gene models. These models were incorporated into the initial Pcdh gene annotation through StringTie, which was also used to produce sets of read coverage tables. The output was utilized by Ballgown v2.6.0 (ref. 96) to confirm transcription and splice site locations for each putative clustered Pcdh gene. The protein domain structures in the predicted clustered Pcdh genes were analysed using HMMer v3.1b2 (ref. 97) and SMART (ref. 98).
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Genome, transcriptome and ChIP sequence reads are deposited in the DNA Data Bank of Japan (DDBJ) under the accession number DRA006338. The genome assemblies of the brownbanded bamboo shark and cloudy catshark were deposited in DDBJ under the accession numbers BEZZ01000001–BEZZ01280241 and BFAA01000001–BFAA01458049, respectively. The whale shark genome assembly and the gene models of the three shark species are available at https://figshare.com/projects/sharkgenome1-phyloinfokobe/28863.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank J. Donald for critical reading of the manuscript, K. Shirato, K. Yamamoto, S. Shibuya and members of the Evolutionary Morphology Laboratory, RIKEN for rearing catsharks, M. Andrabi for assistance in sequence data analysis, K. Tanimoto, H. Kiyonari, T. Yoshikuni, T. Ito, Y. Miyagawa and S. Sodeyama for providing materials. Our gratitude extends to K. Muguruma, Y. Murakami, F. Sugahara, J. Pascual-Anaya, R. Kusakabe, S. Higuchi, Y. Yamaguchi, W. Takagi, H. Kaiya, Y. Ishihama, S. Miyake, T. Kaku, T. Tanaka, D. Sipp, M. Tan, A. D. M. Dove, T. D. Read, D. Lagman, D. Ocampo Daza, D. Larhammar, Y. Uno and S. Mazan for insightful discussion. This study was supported by RIKEN and JSPS KAKENHI Grant Numbers 26650110, 26291065 and 17H03868 to S.Kuraku and S.H. and 17K07426 to S.Kuraku.