Reconstructing the genomes of bilaterian ancestors is central to our understanding of animal evolution, where knowledge from ancient and/or slow-evolving bilaterian lineages is critical. Here we report a high-quality, chromosome-anchored reference genome for the scallop Patinopecten yessoensis, a bivalve mollusc that has a slow-evolving genome with many ancestral features. Chromosome-based macrosynteny analysis reveals a striking correspondence between the 19 scallop chromosomes and the 17 presumed ancestral bilaterian linkage groups at a level of conservation previously unseen, suggesting that the scallop may have a karyotype close to that of the bilaterian ancestor. Scallop Hox gene expression follows a new mode of subcluster temporal co-linearity that is possibly ancestral and may provide great potential in supporting diverse bilaterian body plans. Transcriptome analysis of scallop mantle eyes finds unexpected diversity in phototransduction cascades and a potentially ancient Pax2/5/8-dependent pathway for noncephalic eyes. The outstanding preservation of ancestral karyotype and developmental control makes the scallop genome a valuable resource for understanding early bilaterian evolution and biology.
The nature of Urbilateria, the last common ancestor of all bilaterians, is enigmatic due to the lack of a plausible candidate in the fossil records 1 . The earliest unambiguous fossil of a bilaterian, Kimberella, shows remarkable resemblance to a mollusc, albeit its relationship with Urbilateria remains uncertain 2,3 . In the absence of definitive fossil records, genomic reconstruction by comparing extant bilaterian genomes becomes essential to our understanding of early bilaterian ancestors and their subsequent evolution 4,5 . However, reconstructing the genome of the bilaterian ancestor is challenging due to the paucity of high-order genome assemblies from ancient and/or slow-evolving lineages. Early genome sequencing efforts have mostly focused on two of the three major bilaterian groups, that is, protostome ecdysozoans and deuterostomes. Limited sequencing in the third group of protostome lophotrochozoans, a large superclade that includes molluscs, annelids and brachiopods, has revealed that their genomes are less derived from the ancestral bilaterian state than those of many ecdysozoans 5 . Unfortunately, none of these less-derived lophotrochozoan genomes were assembled to a degree that permits chromosome-level genome comparison.
Mollusca is the most speciose phylum of Lophotrochozoa and among the first bilaterians to appear in fossil records. Many molluscan lineages including bivalves showed little change in shell morphology and life style over several hundred million years, and yet extant molluscs are abundant and thriving in diverse marine, freshwater and terrestrial environments, providing key ecological services and significant economic benefits to humans. Molluscs are highly diverse in form, making them excellent subjects to study body plan evolution and in particular its patterning by Hox genes. Molluscs also have the greatest diversity in eye morphology, ranging from simple cupped to chambered or compound eyes, as well as in the number and placement of their eyes, providing good subjects to study the origin and evolution of the eye, or Darwin’s ‘organ of extreme perfection’. Despite the great evolutionary and biological significance of molluscs, our sampling of their genomes remains limited to a few species.
Here we report a high-quality, chromosome-anchored reference genome of the scallop Patinopecten yessoensis (Jay, 1857), a bivalve mollusc from the large Pectinidae family that contains ~270 living species and thousands of fossil species (dating back to ~320–340 million years ago, Ma 12 ). Scallops are widely distributed in the world's oceans. They are mostly free-living and have multiple eyes scattered along the mantle edge. Many scallops are important fishery and aquaculture species. P. yessoensis is a large scallop living on cold and stable ocean bottoms of the northwestern Pacific. It has a conserved 19-chromosome karyotype that is common to diverse bivalves and may represent the ancient karyotype of bivalves 13 . Analysis of the scallop genome and extensive transcriptomes reveals outstanding preservation of ancestral bilaterian linkage groups, an intact Hox gene cluster under new expression control and diverse phototransduction cascades with a potentially ancient Pax2/5/8-dependent pathway for noncephalic eye formation, providing insights into the evolution of genome organization and developmental control during the emergence of bilaterians.
Results and discussion
Genome sequencing, assembly and characterization
Genomes of marine bivalves are particularly challenging to sequence and assemble with short next-generation sequencing reads due to high polymorphism and repetitive content 9,11 . To alleviate the polymorphism problem, a highly inbred individual derived from selfing of a hermaphrodite (inbreeding coefficient of 0.5; Supplementary Fig. 1; Fig. 1a) was used for whole-genome shotgun (WGS) sequencing (424.3 Gb data in total; Supplementary Table 1), and an efficient, hybrid-specific SOAPdenovo approach 14 was adopted for genome assembly (see Supplementary Text; Supplementary Figs 2–7). The final genome assembly is 988 Mb, with a contig N50 size of 38 kb and a scaffold N50 size of 804 kb (Supplementary Table 2), representing significant improvements over two published bivalve genomes 9,11 . Our assembly is 442 Mb smaller than the estimated genome size (~1.43 Gb; Supplementary Figs 8 and 9), probably due to the collapse of repetitive elements (Supplementary Fig. 10). The quality and integrity of the assembly are demonstrated by the mapping of 94.5% of paired-end reads, 99.8% of Sanger-sequenced fosmids and 96.0–99.8% of various transcriptomic datasets generated in this and a previous study 15 (Supplementary Figs 11 and 12; Supplementary Tables 3–6). With the aid of a high-density linkage map (7,489 markers; Supplementary Table 7) constructed by using the 2b-RAD methodology 16 , 1,419 scaffolds (covering ~81% of the assembly) are assigned to the 19 haploid chromosomes (Fig. 1a; Supplementary Fig. 13), providing the first chromosome-anchored genome assembly in molluscs or less-derived lophotrochozoans.
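For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed from a list of contig or scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not part of the assembly pipeline used in this study; only the 1.43 Gb and 988 Mb figures come from the text.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not scallop data).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)

# Difference between estimated genome size and assembly size in Mb,
# using the two values reported in the text.
unassembled_mb = 1430 - 988
```

With real data, `lengths` would be the contig lengths for a contig N50 or the scaffold lengths for a scaffold N50.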
The scallop genome encodes 26,415 protein-coding genes (Supplementary Figs 14 and 15; Supplementary Table 8), of which 91% are annotated based on known proteins in public databases (Supplementary Table 9). The repeat content accounts for 39% (389 Mb) of the assembly (Supplementary Table 10), dominated by tandem repeats (18.4%). Transposable elements, which are usually considered active modulators of genome evolution, are less abundant (8–18% reduction) and less active in the scallop genome than the Pacific oyster and pearl oyster genomes (Supplementary Table 10; Supplementary Fig. 16). Resequencing of the wild hermaphrodite parent provides a genome-wide single nucleotide polymorphism (SNP) and short insertion/deletion (indel) polymorphism level of 1.04% (Supplementary Table 11), which is lower than the 1.30% found in the Pacific oyster Crassostrea gigas but approximately sevenfold higher than that (0.14%) found in humans 17 . As expected, polymorphism in the inbred scallop is greatly reduced compared to that in its hermaphroditic parent (Supplementary Table 11; Fig. 1a), which may have contributed to our assembly success.
Genome comparison and chromosome evolution
Phylogenetic analysis with 482 highly conserved, single-copy genes shows that the scallop lineage diverged ~425 Ma from the lineage leading to Pacific oyster and pearl oyster (Supplementary Fig. 17). Based on the sister taxon relationship between Bivalvia and Gastropoda 18 , our phylogenetic analysis gives an estimate of 504 Ma for the appearance of the bivalve lineage or its divergence from the gastropod lineage (Supplementary Fig. 17). P. yessoensis shows a relatively slow substitution rate in protein sequences among bilaterians (Supplementary Table 12; Supplementary Fig. 18), supporting the ‘slow-evolving’ feature of the scallop coding repertoire.
Gene family analysis of scallop and two other bivalves identifies a core set of 9,365 gene families (Fig. 1b). Comparison with 24 selected animal species (Supplementary Table 13) identifies 756 bivalve-specific and 567 expanded gene families with notable enrichment of ion channel- and neurotransmitter-related functions (Supplementary Tables 14 and 15) that may help sessile or less mobile bivalves to cope with environmental changes more efficiently as part of bivalve adaptation. Contrary to expectations, the number of shared gene families between scallop and each of the other two bivalves (C. gigas and Pinctada fucata) is higher than that between C. gigas and P. fucata, which are phylogenetically closer (Supplementary Fig. 17), indicating a relatively slower rate of gene divergence or loss in the scallop lineage. This also coincides with the observation of higher polymorphism in the exons of Pacific oyster than in those of scallop as noted above. Among lophotrochozoans, bivalves share considerably more gene families with deuterostomes, ecdysozoans and non-bilaterian animals (Fig. 1c; Supplementary Table 16), with the highest values observed for scallop, followed by the brachiopod Lingula anatina, a lophotrochozoan that is commonly considered a ‘living fossil’ 19 . Gene family analysis also identifies 830 scallop-specific and 349 expanded gene families that participate in diverse biological processes (Supplementary Tables 17 and 18) and are probably important for scallop lineage-specific adaptations.
To enable deep phylogenetic comparisons, we conducted macrosynteny analysis of conserved linkage between orthologous genes, which is independent of intra-chromosomal rearrangements 4,5 . Such analysis has been fruitful in previous studies on lophotrochozoans 5 for understanding long-range macrosynteny conservation, but limited in inferring chromosome-scale evolution, as these studies are all based on highly fragmented genome assemblies with the number of scaffolds usually ranging from thousands to tens of thousands. To understand bilaterian chromosome evolution, we generated chromosome-level assemblies not only for scallop but also for Pacific oyster (C. gigas) and pearl oyster (P. fucata) by using two recently published high-density linkage maps 20,21 , and used them for macrosynteny analysis. Strikingly, our chromosome-based macrosynteny analysis reveals a near-perfect correspondence between the 19 scallop chromosomes and the 17 presumed bilaterian ancestral linkage groups (ALGs; proto-chromosomes reconstructed in ref. 5), a level of chromosome preservation that far exceeds that of other bilaterians with chromosome-level assemblies (Fig. 2; Supplementary Table 19; conservation index: 0.81 for scallop versus 0–0.42 for other bilaterians), suggesting that scallop has a karyotype highly similar to that of the bilaterian ancestor. Such a degree of karyotype preservation is less evident in the two oyster species (Fig. 2), which may be attributable to their presumably derived karyotypes (10 chromosomes in C. gigas and 14 chromosomes in P. fucata) in comparison with the highly conserved 19-chromosome karyotype found in scallops and many other bivalves 13 .
To allow more bilaterian genomes (11 additional representative bilaterians) to be included for comparisons, we also performed the conventional scaffold-based macrosynteny analysis 4,5 , which still shows that scallop has the highest level of macrosynteny conservation, closely followed by amphioxus Branchiostoma floridae (Supplementary Fig. 19). Only two inter-chromosome rearrangements were identified in all three bivalves, including partial translocation of ALG2 and the fusion of ALG5 and ALG16 (Supplementary Table 19) that possibly pre-dates the radiation of bivalves.
Homeobox clusters and subcluster temporal co-linearity
The homeobox genes of Antennapedia (ANTP)-class are key regulators of development in all animals, which presumably originated from a Mega-cluster that formed by tandem duplications of a Proto-ANTP gene 22 . They are more or less dispersed in modern bilaterian genomes, but mostly found in four distinct chromosomes in the amphioxus and in the annelid Platynereis, which has led to the hypothesis that the Mega-cluster, if it did exist, had already been broken up onto four chromosomes by the time of the protostome–deuterostome ancestor (PDA) 23 . Supporting this hypothesis, a similar distribution of ANTP genes on four scallop chromosomes is observed (Supplementary Fig. 20). In particular, it confirms the coexistence of the Hox genes with the NK-linked gene Dlx, providing key support for the ancient linkage of NK-linked and Hox-linked genes in the Mega-cluster hypothesis 23 .
Contrary to frequent cluster alterations in many animal lineages by gene loss, duplication or physical splits 24 (Supplementary Fig. 21), ParaHox and Hox clusters are well-preserved and remain intact in the scallop genome, which enables us to infer the possible ancestral state of these clusters in the lophotrochozoan ancestor or PDA (Fig. 3a). For example, the scallop ParaHox cluster exhibits the same gene order, orientation and relative gene spacing as those found in chordates (Supplementary Fig. 22), strongly supporting the previous speculation of the existence of a typical deuterostome-like cluster in the PDA and lophotrochozoan ancestor 25 . The scallop Hox cluster contains 11 genes (3 anterior, 6 central and 2 posterior) that largely retain the conserved residues of their homeodomains for each Hox paralogous group (Fig. 3a; Supplementary Fig. 23). Comparison of the scallop Hox cluster with those of other lophotrochozoans suggests that the lophotrochozoan ancestor might already have an 11-gene Hox cluster that resembles the intact Hox clusters of scallop and limpet, with all genes except Post1 arranged in the same orientation.
Temporal co-linear activation of homeobox genes for patterning the body plan is well documented in vertebrates and may contribute to the conservation of homeobox clusters
We re-examined published data to determine if subcluster temporal co-linearity (STC) is present in Hox expression of other bilaterians during development. The oyster Hox expression clearly resembles that of scallop, although the oyster has dispersed Hox subclusters 9 (Fig. 4; Supplementary Fig. 25) and STC was not previously recognized. These findings suggest that maintaining STC may depend on the integrity of subclusters but not the whole cluster. We also identified similar/partial STC patterns by analysing published Hox expression data in distantly related bilaterian taxa, including the annelids Nereis virens 29 and Platynereis dumerilii (Lophotrochozoa), the shrimp Litopenaeus vannamei 30 (Ecdysozoa) and the ascidian Ciona intestinalis 31 (Chordata) (Fig. 4; Supplementary Figs 26 and 27), suggesting that STC could be ancestral, although gene regulatory networks underlying these STC patterns may have been substantially modified to support lineage-specific body plans. As genes within each subcluster are preferentially related to each other (Fig. 3b), STC might have been established during the stepwise duplication of primordial Hox genes (represented by three co-activated Hox genes in the basal bilaterian acoels 32,33 ; Fig. 4), and a similar scenario was observed for a newly formed rodent-specific Rhox cluster 34 . It is also possible, but less likely, that a complete Hox cluster with cluster-wide temporal co-linearity already existed in the bilaterian ancestor, and STC is a derived state that independently occurred in several bilaterian lineages. Interestingly, we found that Hox expression in the annelid Capitella teleta follows an unusual mode of whole-cluster temporal co-linearity (WTC) that is subcluster-based 35 (Fig. 4; called S-WTC here), probably representing an intermediate state in evolutionary transition from STC to WTC, or vice versa.
Owing to its increased flexibility in developmental patterning, STC may be central to the bilaterian body plan evolution and, if indeed ancestral, would provide the bilaterian ancestor with great potential in generating diverse body plans found in different bilaterian lineages.
Photoreceptors and the eye regulatory network
Scallops have a large number (~30–100) of noncephalic but complex eyes along the edge of their mantle, which possess double-layered retinas, with the proximal and distal retina comprising rhabdomeric and ciliary photoreceptors, respectively 36 (Fig. 5a). Nine full-length opsin genes including four r-opsins, two Go-opsins, two c-opsins and one peropsin are identified in the scallop genome and show primary expression in scallop eyes (Supplementary Figs 28 and 29). R-opsin and Go-opsin are known to mediate rhabdomeric and ciliary phototransduction in scallop eyes, respectively 37 , and as expected, key genes participating in the two phototransduction cascades show higher expression in scallop eyes than mantle (Fig. 5b). In particular, R-opsin and its associated cascade have the highest expression in scallop eyes, greatly exceeding other opsins (Fig. 5a,b), suggesting that rhabdomeric phototransduction may play a prominent role in scallop eye function. The finding of c-opsin expression in scallop eyes is intriguing (Fig. 5a), as c-opsin has not been identified in scallops before and was once considered a vertebrate-type opsin for ciliary phototransduction 38 . Further investigation of the scallop genome identified key genes participating in vertebrate canonical (Gi/t) and noncanonical (Gs) c-opsin cascades 37 , and the expression profile of these genes supports the involvement of the c-opsin cascade in scallop eye function (Fig. 5b). The coexistence of r-opsin-, Go-opsin- and c-opsin-mediated phototransduction cascades in scallop eyes is unusual. Considering the differential preservation of rhabdomeric and ciliary photoreceptors for vision in extant animal groups (invertebrates and vertebrates, respectively 37,38 ), scallop eyes provide a unique model to study how multiple phototransduction cascades function and coordinate in a single visual system, which may provide insights into distinctive evolutionary routes of these cascades in invertebrates and vertebrates 37,38 .
We identified a collection of 825 genes that are significantly up-regulated in scallop eyes relative to mantle (Supplementary Table 20) and enriched for genes of the G-protein-coupled receptors (GPCRs) signalling pathway (Supplementary Table 21). Surprisingly, Pax6, a presumed master control gene for all bilaterian eyes, is present in the genome but not expressed in the eye and mantle (Fig. 5c). Other genes of the typical invertebrate and vertebrate Pax6 pathways are either not expressed (for example, Six3/6 and Rx) or do not show upregulation in the eye relative to mantle (for example, Six1/2, Eya, Dach) (Fig. 6a). The possibility of transient expression of Pax6 regulatory pathway during early eye development, although not yet investigated, seems unlikely as scallop adult eyes exhibit continuous eye formation and growth (that is, continuous eye morphogenesis) with increasing age. Our finding therefore suggests that the Pax6-dependent pathway may not be involved in scallop eye morphogenesis and function. To understand the gene regulatory network of scallop eyes, we constructed a gene coexpression network using 26 adult transcriptome datasets, and identified M2 as the only eye-related module (Supplementary Figs 30 and 31; Supplementary Table 22). The eye-related transcription factors Pax2/5/8, Brn3, Lmx1b and Six4/5 are members of this module. In particular, Pax2/5/8, Brn3 and Lmx1b are recognized as the most important hub transcription factors in the network (Fig. 6b; Supplementary Table 23), suggesting that they are key regulators of scallop eye development and function. The involvement of Pax2/5/8, Brn3 and Six4/5 in the noncephalic light sensors has been previously reported in Platynereis midventral photoreceptor cells (PRCs) and amphioxus Hesse organs, both of which are also Pax6-independent and have led to the hypothesis that cephalic and noncephalic PRCs may have different evolutionary origins, with the former dependent on Pax6 and the latter on Pax2/5/8. However, previous investigations were all based on simple light sensors, and the possibility that these noncephalic light sensors may represent evolutionary innovations cannot be excluded. Our finding of Pax2/5/8 as a key regulator in the gene network of scallop mantle eyes provides the first complex eye-based evidence supporting the hypothesis of Pax2/5/8-dependent origin of noncephalic eyes (Fig. 6c), and together with previous studies
Reconstructing the genomes of ancient bilaterians that pre-dated the split of protostomes and deuterostomes is critical to our understanding of bilaterian evolution, where studying genomes of poorly sampled lophotrochozoans should be particularly informative. Ancient genomes may be reconstructed in both gene repertoire and genome organization through gene family studies and synteny analysis of high-order genome assemblies. In devoting such efforts to the scallop P. yessoensis, we find remarkable conservation of ancestral features in genome organization and gene repertoire that bring us closer to the bilaterian ancestral genome. These include the closest representation of the ancestral bilaterian karyotype to date, intact ParaHox and Hox gene clusters, diverse phototransduction cascades and an ancient regulatory pathway for eye development. The STC that is shared by other bilaterians may be ancestral to whole-cluster co-linearity and central to the great diversity in body plan found in molluscs and other bilaterians. The exceptional conservation of ancestral features suggests that the scallop genome is slow-evolving, probably as a consequence of life on cold and stable deep-ocean bottoms. Similar studies, particularly of chromosome-anchored genomes from basal bilaterians such as monoplacophoran molluscs, annelids and acoels, may identify other genomes more closely related to that of the bilaterian ancestor and lead to the eventual reconstruction of urbilaterian chromosomes, which may greatly improve our understanding of bilaterian evolution.
Genome sequencing and assembly
A one-year-old male P. yessoensis from a selfing family created with a hermaphroditic individual was used for WGS sequencing and assembly. High-quality genomic DNA was extracted from the adductor muscle of this inbred male using the conventional phenol/chloroform extraction method 51 . Short-insert (180 bp, 300 bp and 500 bp) paired-end libraries and large-insert (2 kb and 5 kb) mate-pair libraries were prepared using Illumina’s DNA library preparation kits following standard protocols. The 10 kb and 16 kb mate-paired libraries were prepared following the Cre–lox recombination-based protocol 52 . The libraries were subjected to the paired-end 100 bp/150 bp sequencing on the Illumina HiSeq2000 platform. A modified version of SOAPdenovo was developed for efficient genome assembly to reduce the problem of high genome heterozygosity (see Supplementary Text for methodological details).
Genome size estimation
The genome size of P. yessoensis was estimated using flow cytometry and k-mer analysis. Gills of P. yessoensis were used for flow cytometry analysis as previously described 53,54 , with Pacific oyster C. gigas (2C = 1.31 pg) 9 as an internal reference standard. Briefly, gills were dissected and dissociated into single cells using a 25-gauge syringe needle. Then the cell suspension was filtered through a 20-μm nylon mesh and stained with 10 mg ml−1 4,6-diamidino-2-phenylindole (DAPI). The stained cell suspension was analysed using a flow cytometer (Partec PAII, Germany). The DNA content was then converted to gigabases based on the formula: 1 pg = 0.978 Gb (ref. 55). For k-mer analysis, the genome size was estimated based on the 19-mer frequency distribution using the formula: genome size = (total number of 19-mer)/(position of peak depth).
Quality assessment of genome assembly
The integrity of the final assembly of the P. yessoensis genome was examined using three 30–35 kb fosmid sequences, ~45× WGS sequences (from the 180 bp library) and three sets of messenger RNA (mRNA) data. Fosmid sequences were aligned to the scallop genome assembly using LASTZ 56 with the parameters of ‘M=254 K=4500 L=3000 Y=15000 --seed=match12 --step=20 --identity=85’. Burrows–Wheeler Aligner (BWA) 57 was used to align the WGS data to the final assembly with parameters of ‘-n 15 -o 1 -e 10’ to account for high polymorphism between haploids 9 . Full-length complementary DNA (cDNA) sequences, the assembled transcriptomes generated from 454 sequencing 15 and Illumina sequencing (assembled by Trinity 58 ) were mapped to the genome assembly using BLAT 59 with default parameters and an identity cutoff of 80%.
Linkage map construction and chromosome anchoring
Three full-sib families each consisting of 38–40 individuals were used for linkage mapping analysis. 2b-RAD libraries were prepared for parents and progenies using the type IIB restriction enzyme BsaXI and following the protocol developed in ref. 16. The adaptors with 5′-NNN-3′ overhangs were used to target all BsaXI fragments in the scallop genome. All libraries were subjected to single-end sequencing (1×50 bp) using the Illumina HiSeq2000 platform. The 2b-RAD reads were preprocessed to remove unreliable ones and then genotyped using the RADtyping program 60 under default parameters. The SNP markers that segregated at a 1:1 ratio in each mapping family were obtained and categorized as lm×ll or nn×np. Markers present in both parents that segregated at a 1:2:1 ratio were also retrieved and were categorized as hk×hk. SNP markers that conformed to the expected Mendelian ratios (chi-squared test, P ≥ 0.01) and could be genotyped in at least 80% of the offspring of each family were used for linkage analysis. Markers were grouped at a logarithm of odds threshold of at least 6.0 and ordered based on the regression mapping algorithm implemented in JoinMap4.0 software 61 . The recombination frequencies were converted into map distances in centiMorgans (cM) through the Kosambi mapping function. The consensus map was generated by integrating the linkage maps of three families using the MergeMap software 62 , with the map weight set to 1.0 for each map.
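Two of the numerical steps in this paragraph, the segregation test and the Kosambi conversion, can be illustrated with a short sketch. The function names and the toy genotype counts are assumptions for illustration; JoinMap performs these calculations internally, and this code is not taken from it. The chi-squared P values use closed forms that are exact for one and two degrees of freedom, which covers the 1:1 and 1:2:1 cases.

```python
import math


def kosambi_cm(r):
    """Convert a recombination frequency r (0 <= r < 0.5) to centiMorgans."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination frequency must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def chi_square_gof(observed, expected_ratio):
    """Goodness-of-fit statistic and P value for 2 or 3 genotype classes."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))  # exact for 1 degree of freedom
    elif df == 2:
        p_value = math.exp(-stat / 2.0)  # exact for 2 degrees of freedom
    else:
        raise ValueError("this sketch only handles 2 or 3 classes")
    return stat, p_value


def conforms_to_mendelian_ratio(observed, expected_ratio, alpha=0.01):
    """Keep a marker when the test does not reject the expected ratio (P >= alpha)."""
    _, p_value = chi_square_gof(observed, expected_ratio)
    return p_value >= alpha


# Toy genotype counts for a 40-offspring family (hypothetical).
keep_balanced = conforms_to_mendelian_ratio([20, 20], (1, 1))
keep_distorted = conforms_to_mendelian_ratio([30, 10], (1, 1))
keep_intercross = conforms_to_mendelian_ratio([10, 20, 10], (1, 2, 1))
```

Under the Kosambi function a recombination frequency of 0.1 corresponds to roughly 10.1 cM, slightly more than the 10 cM implied by a linear conversion, because double crossovers are partly accounted for.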
For chromosome anchoring of scaffolds, marker sequences from the consensus genetic map were aligned back to the genome assembly using BLAST 63 with the parameters of ‘-e 1e-4 -F F -G 5 -E 2 -W 7 -r 2 -q -3 -m 8’. Only markers that were mapped to a unique location in the assembly were used for anchoring and orienting scaffolds to corresponding linkage groups (that is, chromosomes) according to the locations of markers in the genetic linkage map. For cases where scaffolds were in conflict with the genetic map (for example, markers from one scaffold assigned to different linkage groups), we manually checked these scaffolds using the 10 kb mate-paired reads and eight scaffolds were broken at points with low-coverage support by mate-paired reads. A similar approach was applied to anchor the existing genome assemblies of Pacific oyster (C. gigas) 9 and pearl oyster (P. fucata) 11 to linkage groups using two recently published high-density genetic linkage maps 20,21 .
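The anchoring logic described above (keep uniquely mapped markers, assign each scaffold to a linkage group, flag scaffolds whose markers disagree) can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' script: the function name, the input layout and the rule of orienting a scaffold by comparing the map positions of its first and last markers are all hypothetical simplifications.

```python
from collections import Counter, defaultdict


def anchor_scaffolds(hits, genetic_map):
    """Assign scaffolds to linkage groups from marker alignments.

    hits: list of (marker, scaffold, position_on_scaffold) alignment records.
    genetic_map: dict marker -> (linkage_group, map_position_cM).
    Returns (anchored, conflicts), where anchored maps
    scaffold -> (linkage_group, orientation) and conflicts lists scaffolds
    whose markers fall in more than one linkage group.
    """
    hit_counts = Counter(marker for marker, _, _ in hits)
    per_scaffold = defaultdict(list)
    for marker, scaffold, position in hits:
        # Use only markers with a single location in the assembly.
        if hit_counts[marker] == 1 and marker in genetic_map:
            group, cm = genetic_map[marker]
            per_scaffold[scaffold].append((position, group, cm))

    anchored, conflicts = {}, []
    for scaffold, markers in per_scaffold.items():
        groups = {group for _, group, _ in markers}
        if len(groups) > 1:
            conflicts.append(scaffold)  # candidate for manual inspection
            continue
        markers.sort()  # order by position along the scaffold
        first_cm, last_cm = markers[0][2], markers[-1][2]
        if last_cm > first_cm:
            orientation = "+"
        elif last_cm < first_cm:
            orientation = "-"
        else:
            orientation = "?"  # a single marker cannot orient a scaffold
        anchored[scaffold] = (groups.pop(), orientation)
    return anchored, sorted(conflicts)


# Toy data (hypothetical marker and scaffold names).
toy_hits = [
    ("m1", "s1", 100), ("m2", "s1", 900),
    ("m3", "s2", 50), ("m4", "s2", 500),
    ("m5", "s3", 10), ("m5", "s4", 10),   # m5 maps twice and is discarded
    ("m6", "s5", 700), ("m7", "s5", 100),
]
toy_map = {
    "m1": (1, 0.0), "m2": (1, 5.0), "m3": (2, 1.0), "m4": (3, 2.0),
    "m5": (1, 9.0), "m6": (4, 1.0), "m7": (4, 3.0),
}
toy_anchored, toy_conflicts = anchor_scaffolds(toy_hits, toy_map)
```

Scaffolds returned in `toy_conflicts` correspond to the cases that the text describes as being checked manually against mate-paired read coverage.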
Transcriptome sequencing and expression profiling
Embryos (two to eight cells, blastulae and gastrulae), larvae (trochophore larvae, D-stage larvae, pedi-veliger larvae and juvenile) and adults of P. yessoensis were collected from the hatchery of Zhangzidao Group Co., Ltd (Dalian, China) in 2013. To obtain embryonic and larval materials, artificial fertilization and larval culture were performed according to the procedure described in ref. 64. The fertilized eggs and larvae were reared at 13–15 °C and more than 1,000 embryos/larvae were sampled for each developmental stage (sampling time is provided in Supplementary Table 24). Nine adult tissues/organs (eye, mantle, gill, gonad, blood, digestive gland, striated muscle, smooth muscle and foot) were dissected from two to three scallop individuals. All the samples were flash frozen in liquid nitrogen and stored at −80 °C until use.
Total mRNA was extracted from each of the seven developmental samples and nine adult tissues/organs following the protocol described in ref. 65. RNA sequencing (RNA-Seq) libraries were constructed using the NEBNext mRNA Library Prep Master Mix Set for Illumina following the manufacturer’s instructions. The libraries were subjected to paired-end 100 bp sequencing on the Illumina HiSeq 2000 platform. Raw reads were first filtered by removing those containing undetermined bases (‘N’) or excessive numbers of low-quality positions (>10 positions with quality scores <10 ). Then the high-quality reads were mapped to the P. yessoensis genome using Tophat (v2.0.9) 66 with the parameters of ‘-p 10 -N 3 --read-edit-dist 3 -m 1 -r 0 --coverage-search --microexon-search’. The expression level of all genes was normalized using the trimmed mean of M-values (TMM) method (implemented in the edgeR package 67 ) and represented in the form of reads per kilobase of exon model per million mapped reads (RPKM) 68 . The RPKM expression values of all genes for all developmental stages and adult tissues/organs are provided in Supplementary Table 25.
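The read filter and the RPKM unit described above are simple enough to write out. The sketch below is illustrative only: the function names are assumptions, the quality values are taken to be already-decoded Phred scores, and the TMM normalization step performed with edgeR is deliberately not reproduced here.

```python
def keep_read(sequence, qualities, min_quality=10, max_low_positions=10):
    """Apply the read filter described in the text.

    A read is discarded if it contains an undetermined base ('N') or has
    more than max_low_positions positions with quality below min_quality.
    """
    if "N" in sequence.upper():
        return False
    low_positions = sum(1 for q in qualities if q < min_quality)
    return low_positions <= max_low_positions


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Toy values (hypothetical): 500 reads on a 2 kb gene in a 10-million-read library.
toy_rpkm = rpkm(read_count=500, exon_length_bp=2000, total_mapped_reads=10_000_000)
```

Because RPKM divides by both gene length and library size, it allows comparison of expression levels between genes of different lengths and between samples sequenced to different depths.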
To evaluate polymorphism reduction in the inbred progeny, ~50× genome resequencing was performed for its hermaphroditic parent. Paired-end reads from the inbred progeny (~230×) and its parent (~50×) were aligned onto the final genome assembly for SNP and indel identification using BWA 57 with the parameters of ‘-n 15 -o 1 -e 10’. The minimum and maximum read depths for variant calling were set as 0.1-fold and 2-fold of the average sequencing depth, respectively. To reduce false positives, SNPs within 5 bp of a gap were filtered out and adjacent gaps located within a 10 bp window were also removed. The statistical significance of the difference in polymorphism rates between scallop and Pacific oyster (C. gigas) was determined using the two-sided chi-squared test.
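The depth window and the comparison of two polymorphism rates can be sketched as follows. The function names and toy counts are assumptions, and the test shown is the standard Pearson chi-squared test on a 2×2 table without continuity correction; the text does not state whether a correction was applied, so this choice is an assumption as well.

```python
import math


def depth_window(mean_depth, low_factor=0.1, high_factor=2.0):
    """Return the (minimum, maximum) read depth accepted for variant calling."""
    return low_factor * mean_depth, high_factor * mean_depth


def depth_ok(site_depth, mean_depth):
    """True when a site's depth lies within 0.1x to 2x of the mean depth."""
    low, high = depth_window(mean_depth)
    return low <= site_depth <= high


def chi_square_2x2(variant_a, invariant_a, variant_b, invariant_b):
    """Two-sided Pearson chi-squared test comparing two polymorphism rates.

    Each species contributes its counts of variant and invariant sites.
    Returns (statistic, P value) with 1 degree of freedom.
    """
    a, b, c, d = variant_a, invariant_a, variant_b, invariant_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one count")
    stat = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(stat / 2.0))  # exact for 1 degree of freedom
    return stat, p_value


# Toy counts (hypothetical, far smaller than genome-scale data).
toy_stat, toy_p = chi_square_2x2(30, 70, 10, 90)
```

At genome scale the site counts are in the hundreds of millions, so even small differences in rate give very small P values; the effect size (for example 1.04% versus 1.30%) is therefore the more informative quantity.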
Both homology-based and de novo predictions were used to detect transposable elements in the genome. For homology-based detection, RepeatMasker and RepeatProteinMask (both available from http://www.repeatmasker.org) were used to screen the P. yessoensis genome for known transposable elements (for example, DNA transposon, long terminal repeat, long and short interspersed elements) in the RepBase library (v20140131) 69 . De novo transposable elements were identified and modelled by RepeatModeler (v1.0.4, http://www.repeatmasker.org). Tandem repeats were identified by searching for two or more contiguous, approximate copies of a pattern of nucleotides using Tandem Repeats Finder (v4.07b) 70 under default parameters.
Gene prediction and functional annotation were performed primarily following the procedure described in previous studies 71,72 . Briefly, three de novo gene prediction tools, Augustus (v2.7) 73 , GlimmerHMM (v3.02) 74 and SNAP (2006-07-28) 75 , were used to predict genes in the repeat-masked genome sequences. For homology-based gene prediction, protein sequences from C. gigas, Lottia gigantea, Helobdella robusta, Anopheles gambiae, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens and Strongylocentrotus purpuratus were aligned to the P. yessoensis genome using tblastn (v2.2.26; E-value ≤ 1e−5) 76 , then the homologous genome sequences were aligned against the matching proteins using GeneWise (v2.4.1) 77 for accurate spliced alignments. The RNA-Seq reads from different developmental stages and adult tissues/organs were aligned to the P. yessoensis genome using Tophat (2.0.11) 66 , and Cufflinks (2.1.1) 78 was used to produce assembled transcripts and predict transcript structures. Gene predictions from the de novo approach, homology-based approach and RNA-Seq-based evidence were merged to form a comprehensive consensus gene set using the software EVM 79 . To obtain gene functional annotations, the predicted protein sequences of P. yessoensis were aligned to public databases including KEGG, SwissProt and TrEMBL using BLASTP with the E-value threshold of 1e-5. InterProScan (v4.8) 80 was also used to identify motifs and domains by searching the InterPro and Gene Ontology 81 databases.
Gene family analysis
We selected the following 27 representative animal species (Supplementary Table 13) from the sub-kingdom Eumetazoa for gene family analysis: P. yessoensis, C. gigas, P. fucata, L. gigantea, Octopus bimaculoides, L. anatina, C. teleta, H. robusta, Schistosoma mansoni (lophotrochozoan group); C. elegans, D. melanogaster, Tribolium castaneum, A. gambiae, Daphnia pulex, Strigamia maritima, Apis mellifera (ecdysozoan group); H. sapiens, B. floridae, S. purpuratus, Danio rerio, Xenopus tropicalis, Gallus gallus, Mus musculus (deuterostome group); Mnemiopsis leidyi, Nematostella vectensis, Trichoplax adhaerens, Amphimedon queenslandica (non-bilaterian group). We used the OrthoMCL software (version 1.4) 82 to define gene family clusters among different species. An all-against-all BLASTP was first applied to determine the similarities between genes in all genomes at an E-value threshold of 1e−7. Then the Markov clustering (MCL) algorithm implemented in OrthoMCL was used to group orthologues and paralogues from all input species with an inflation value (-I) of 1.5. For comparisons of gene families between phylogenetic groups, a shared gene family was required to be present in at least two species within each compared group. Gene families found only in P. yessoensis and in no other species (including other bilaterian and non-bilaterian species) were considered scallop-specific gene families. Within the lophotrochozoan group, the number of P. yessoensis genes in each gene family was compared with the numbers in other lophotrochozoans to detect gene families that were expanded only in P. yessoensis. To compute the statistical significance, Fisher's exact test was applied based on two backgrounds: the count of all P. yessoensis genes and the count of genes in other lophotrochozoans. A P value threshold of 0.05 was used to retrieve the gene families that were significantly expanded in scallop. A similar approach was also applied to identify bivalve-specific and expanded gene families.
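The expansion test described above can be illustrated with a minimal Python sketch. It assumes a one-sided (enrichment) form of Fisher's exact test on a 2×2 table of family versus non-family gene counts in scallop and in the other lophotrochozoans; the text does not state whether a one- or two-sided test was used, and the function names and input layout here are hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing at least `a` in the top-left cell
    when the row and column totals are held fixed (hypergeometric upper tail).
    """
    row1 = a + b          # e.g. all scallop genes
    col1 = a + c          # e.g. all genes in this family
    n = a + b + c + d     # all genes in both backgrounds
    k_max = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, k_max + 1))
    return tail / comb(n, row1)

def expanded_families(family_counts, scallop_total, others_total, alpha=0.05):
    """Return {family: P value} for families enriched in scallop.

    family_counts maps a family id to (genes in scallop, genes in others).
    The two totals are the background gene counts (assumed inputs).
    """
    hits = {}
    for fam, (in_scallop, in_others) in family_counts.items():
        p = fisher_exact_greater(in_scallop, scallop_total - in_scallop,
                                 in_others, others_total - in_others)
        if p < alpha:
            hits[fam] = p
    return hits
```

A family holding 30 of 1,000 scallop genes but only 10 of 8,000 genes in the other species would be reported, whereas one with the same proportion in both backgrounds would not.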
Phylogeny, divergence time and evolutionary rate estimation
We retrieved protein sequences of all single-copy gene families (that is, only one gene copy for each species in a gene family cluster) from the gene family analysis (see previous section) to constitute a 482-gene dataset for constructing a phylogenetic tree for 14 selected species (P. yessoensis, C. gigas, P. fucata, L. gigantea, O. bimaculoides and C. teleta from the lophotrochozoan group; T. castaneum, D. pulex, S. maritima, A. mellifera and D. melanogaster from the ecdysozoan group; H. sapiens and B. floridae from the deuterostome group; and N. vectensis from the non-bilaterian group). The purpose of our phylogenetic analysis was mainly to infer the phylogenetic relationships and divergence times for the bivalve lineage; a more comprehensive analysis of Lophotrochozoa phylogeny has been recently provided 83 . Multiple alignments were performed using MUSCLE 84 for each gene family, and gaps were trimmed using Gblocks 85 . The alignments were then concatenated into a super alignment matrix. ProtTest 86 was used to select the best-fit model (LG+Γ4 model) for amino acid replacement and RAxML (v8.0.19) 87 was used to reconstruct a maximum likelihood tree. Robustness of the maximum likelihood tree was assessed using the bootstrap method (100 pseudo-replicates). Divergence times between species and clades were estimated using mcmctree in PAML 88 with the parameters of ‘RootAge = <600 model = REV(GTR) alpha = 0.969 clock = 2’, and the calibration points are provided in Supplementary Table 26.
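The selection of single-copy gene families described above can be sketched as follows. This is an illustrative Python example, not the pipeline used in the study: the cluster layout (family id mapped to a list of species and gene pairs) is hypothetical, and the sketch assumes that genes from species outside the selected set are simply ignored.

```python
from collections import Counter

def single_copy_families(clusters, species):
    """Keep families with exactly one gene in every selected species.

    clusters: dict mapping family id -> list of (species, gene_id) tuples.
    species:  iterable of the species that must each contribute one gene.
    Returns a dict mapping family id -> {species: gene_id}.
    """
    wanted = set(species)
    selected = {}
    for fam, members in clusters.items():
        # Count gene copies per selected species in this family.
        counts = Counter(sp for sp, _ in members if sp in wanted)
        all_present = set(counts) == wanted
        if all_present and all(n == 1 for n in counts.values()):
            selected[fam] = {sp: gene for sp, gene in members if sp in wanted}
    return selected
```

Each retained family then contributes one protein per species to the per-family alignments that are concatenated into the super alignment matrix.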
For substitution rate analysis, the above trimmed multiple protein alignments were first converted into the corresponding codon alignments for each gene family. Then the synonymous substitution rate (Ks) and nonsynonymous substitution rate (Ka) were estimated using the free-ratio model in the PAML 88 codeml program for each family and each species; to be stringent, only Ks values less than five were considered.
Based on the phylogenetic positions of the 27 animal species (Supplementary Table 13), a hierarchical clustering method 5 was adopted to identify orthologous gene sets. First, two gene clusters from different sides of a branch were merged when they had mutual best BLASTP hits with each other. Second, clusters of genes within a subtree were further grouped together if these genes had better hits to each other than to any outgroup genes. Based on these two criteria, genes from different species were clustered starting at the leaves and proceeding to the root. A gene family was considered an ancestral bilaterian gene family when it met at least one of the following criteria 5 : (1) the gene family was present in at least two protostome and two deuterostome species (ingroups); or (2) the gene family was present in at least two protostome or two deuterostome species and in two of the non-bilaterian (outgroup) species.
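The two criteria for calling a gene family ancestral can be written as a small predicate. The following Python sketch is illustrative only: the function name and set-based inputs are hypothetical, and "in two of the non-bilaterian species" is read here as "in at least two", which is an assumption.

```python
def is_ancestral_bilaterian(present, protostomes, deuterostomes, outgroups):
    """Apply the two presence criteria to one gene family.

    present:       species in which the family was found.
    protostomes:   protostome (lophotrochozoan + ecdysozoan) species.
    deuterostomes: deuterostome species.
    outgroups:     non-bilaterian species.
    """
    present = set(present)
    n_prot = len(present & set(protostomes))
    n_deut = len(present & set(deuterostomes))
    n_out = len(present & set(outgroups))
    # Criterion (1): at least two protostomes AND two deuterostomes.
    ingroup_rule = n_prot >= 2 and n_deut >= 2
    # Criterion (2): two protostomes OR two deuterostomes, plus two outgroups
    # (assumed to mean "at least two").
    outgroup_rule = (n_prot >= 2 or n_deut >= 2) and n_out >= 2
    return ingroup_rule or outgroup_rule
```

Under this reading, a family seen in one protostome, one deuterostome and two outgroup species would not qualify, because neither ingroup reaches two species.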
The conservation of gene macrosynteny between species with chromosome-level assemblies and the 17 presumed bilaterian ancestral linkage groups (ALGs) was displayed as dot plots. The 17 bilaterian ALGs (represented by the genes of sea anemone N. vectensis) were retrieved from a previous study 5 , where ALGs were reconstructed for early bilaterian ancestors based on the chromosome-history-graph approach. Each dot in a dot plot comparison represents a one-to-one orthologous gene pair derived from the same ancestral gene family. For species without chromosome-level assemblies, a heuristic hierarchical method 4,5 was adopted to cluster the scaffolds from these draft genomes into corresponding homologous ALGs using the cluster program 89 with a tree-cutting threshold of 0.25. For both chromosomal and scaffold-level comparisons, a macrosynteny conservation index 5 was calculated as a measure of the preservation of ALGs in each species. To be conservative, the numerator was the number of one-to-one orthologous gene pairs whose genes are located in scaffolds or chromosome segments assigned to homologous ALGs, and the denominator was the number of one-to-one orthologues for which both genes were on a scaffold or chromosome segment large enough to be assigned to an ALG.
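The macrosynteny conservation index defined above is a simple ratio, sketched below in Python. The input representation is a simplifying assumption made for illustration: each one-to-one orthologous pair is reduced to the ALG label of the reference gene and the ALG label assigned to the scaffold or chromosome segment carrying its orthologue, with `None` marking a gene on a segment too small to be assigned.

```python
def macrosynteny_conservation_index(pairs):
    """Fraction of assignable one-to-one orthologue pairs on homologous ALGs.

    pairs: iterable of (reference_alg, species_alg) labels, one per
           orthologous pair; None means the segment could not be assigned.
    """
    # Denominator: pairs where both genes sit on an assignable segment.
    assignable = [(ref, sp) for ref, sp in pairs
                  if ref is not None and sp is not None]
    if not assignable:
        raise ValueError("no orthologue pairs on assignable segments")
    # Numerator: pairs whose two segments were assigned to the same ALG.
    conserved = sum(1 for ref, sp in assignable if ref == sp)
    return conserved / len(assignable)
```

Pairs with an unassigned member are excluded from both numerator and denominator, so small scaffolds in a draft assembly do not lower the index.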
Homeobox gene analysis
The homeobox genes were identified in the P. yessoensis genome using BLAST with an E-value threshold of 1e−5 against all homeodomain sequences from the HomeoDB database (http://homeodb.zoo.ox.ac.uk/) 90 , and were further confirmed by comparing to the Conserved Domains Database (http://www.ncbi.nlm.nih.gov/cdd). Genes were classified based on BLAST results, molecular phylogeny and manual inspection of conserved residues. The same approach was also used to identify homeobox genes in other bilaterian genomes. Phylogenetic analyses were performed using MEGA5 91 to construct neighbour-joining and maximum likelihood trees. For neighbour-joining analysis, evolutionary distances were computed using the p-distance method. For maximum likelihood analysis, the Poisson correction model was chosen. A discrete gamma distribution was used to model evolutionary rate differences among sites. All positions containing gaps and missing data were eliminated in both analyses, and the robustness of the resulting phylogenies was tested by a reanalysis of 1,000 bootstrap replicates. The heat map of Hox and ParaHox gene expression was drawn using custom R scripts that used the heatmap.2 function of gplots (an R package; http://cran.r-project.org/package=gplots).
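For readers unfamiliar with the p-distance used in the neighbour-joining analysis, the following Python sketch shows the calculation together with complete deletion of gapped positions. It is not the MEGA5 implementation; the function names are hypothetical, and treating "-" and "?" as the gap and missing-data symbols is an assumption.

```python
from itertools import combinations

def complete_deletion(alignment, missing="-?"):
    """Drop every column in which any sequence has a gap or missing symbol.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] not in missing for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}

def p_distances(alignment):
    """Pairwise p-distance: proportion of compared sites that differ."""
    clean = complete_deletion(alignment)
    length = len(next(iter(clean.values())))
    if length == 0:
        raise ValueError("no sites left after complete deletion")
    distances = {}
    for a, b in combinations(clean, 2):
        diffs = sum(x != y for x, y in zip(clean[a], clean[b]))
        distances[(a, b)] = diffs / length
    return distances
```

Because columns are removed across all sequences before any comparison, every pairwise distance is computed over the same set of sites.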
Whole mount in situ hybridization
Scallop gastrulas (28 h post-fertilization at 15 °C) were fixed in 4% paraformaldehyde overnight, transferred to methanol and stored at −20 °C. Fragments of Hox genes were amplified from larval cDNA using specific primers (Supplementary Table 27) containing a 5′ T7 promoter sequence (5′-taatacgactcactataggg-3′). Purified polymerase chain reaction products were used as templates for subsequent in vitro transcription. Digoxigenin-labelled sense and anti-sense probes were synthesized using the DIG RNA Labeling Mix (Roche) and a T7 RNA polymerase (Fermentas). Specimens were serially rehydrated in PBST (PBS plus 0.1% Tween-20). Specimens were rinsed twice for 5 min each in TEA buffer (1% triethanolamine in PBST), transferred to freshly prepared 0.3% acetic anhydride in TEA buffer and incubated for 5 min. Additional acetic anhydride was added to yield a final concentration of 0.6% and specimens were incubated for a further 5 min. After rinsing twice for 5 min each in PBST, specimens were post-fixed in 4% paraformaldehyde for 2 h at room temperature and washed five times in PBST for 5 min each. Specimens were pre-hybridized in hybridization buffer (50% formamide, 5×SSC, 50 μg ml−1 heparin, 500 μg ml−1 yeast tRNA, 0.1% Tween-20, pH 6.0) at 65 °C for 2 h. For hybridization, specimens were incubated in hybridization buffer containing 0.01–0.1 μg ml−1 of denatured RNA probe overnight at 65 °C. Specimens were then washed twice in washing solution (50% formamide, 2×SSC, 0.1% Tween-20; 30 min each), once in 2×SSCT (2×SSC and 0.1% Tween-20; 15 min) and twice in 0.2×SSCT (0.2×SSC and 0.1% Tween-20; 30 min each), all at 65 °C. After washing with PBST for 5 min at room temperature, specimens were incubated in blocking buffer (PBST and 0.5% blocking reagent (Roche)) for 2 h at room temperature and then with 1/5,000-diluted alkaline phosphatase-conjugated Fab fragments of a sheep anti-digoxigenin antibody (Roche) overnight at 4 °C. After extensive washing with PBST, specimens were incubated with nitro blue tetrazolium/5-bromo-4-chloro-3-indolyl phosphate (NBT/BCIP) substrate solution to detect signals.
Phototransduction genes and network analysis
Key proteins involved in Homo and Drosophila phototransduction pathways 37 were downloaded from the National Center for Biotechnology Information (NCBI) protein database, and homologous proteins were searched against the P. yessoensis genome using BLASTP with an E-value threshold of 1e−5. The obtained candidate genes were further checked by their annotations. Putative opsins were also checked by the presence of common motifs for opsins and GPCRs 92 , and only those containing all seven transmembrane domains and the lysine residue (296K) were kept for further analysis. Phylogenetic analysis of opsin genes was performed using the program MrBayes (v3.2.2) 93 based on the LG+G+F amino-acid model. Differentially expressed (P < 0.05) genes were detected according to the procedure described in the edgeR package 67 . As scallop eyes are small and reside on the mantle, eye sampling might be contaminated by a minimal amount of mantle tissue. To be stringent, we considered those differentially expressed genes that were significantly up-regulated in the eye relative to mantle as candidate eye-related genes for further analysis. Gene ontology enrichment analysis of the differentially expressed genes was performed using the EnrichPipeline 94 . A signed coexpression gene network for 26 adult transcriptomic datasets was constructed using the R package WGCNA 95 , with the parameters of ‘sft = 9, minimum module size = 200 and cutting height = 0.99’. Modules with highly similar expression profiles were merged using the mergedColors function in WGCNA. The hubness of a gene in a given module was measured by its connection strength with other genes in the module, and was determined by intramodular connectivity (K within) 95 . To identify the eye-related module, over-representation analysis of the eye-related genes (that is, up-regulated differentially expressed genes in the eye relative to mantle) was performed for each module using a hypergeometric test with P values adjusted by the Benjamini–Hochberg method 96 for multiple-test correction.
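The module over-representation analysis described above combines an upper-tail hypergeometric test per module with Benjamini–Hochberg adjustment across modules. The Python sketch below illustrates both steps under stated assumptions: the function names and the module summary format are hypothetical, and the test is taken to be one-sided for enrichment.

```python
from math import comb

def hypergeom_upper_tail(k, total, eye_genes, module_size):
    """P(X >= k): chance of drawing at least k eye-related genes in a module.

    total:       all genes in the network.
    eye_genes:   eye-related genes among them.
    module_size: genes in the module being tested.
    """
    upper = min(eye_genes, module_size)
    tail = sum(comb(eye_genes, i) * comb(total - eye_genes, module_size - i)
               for i in range(k, upper + 1))
    return tail / comb(total, module_size)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def module_enrichment(modules, total, eye_genes):
    """modules: dict name -> (module size, eye-related genes in module)."""
    names = list(modules)
    raw = [hypergeom_upper_tail(modules[n][1], total, eye_genes, modules[n][0])
           for n in names]
    return dict(zip(names, benjamini_hochberg(raw)))
```

A module would then be called eye-related when its adjusted P value falls below the chosen significance threshold.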
The scallop genome project has been deposited at the NCBI under the BioProject number PRJNA259405. The WGS, parental resequencing and 2b-RAD data were deposited in the Sequence Read Archive (SRA) database under the accession numbers SRS788513, SRX1034910 and SRX1027271, respectively. The short-read data of various developmental and adult transcriptomes were deposited in the SRA database under the accession numbers SRX1026991, SRX2238787 to SRX2238809, SRX2250256 to SRX2250259, SRX2251047, SRX2251049, SRX2251056, SRX2251057 and SRX2279546.
How to cite this article: Wang, S. et al. Scallop genome provides insights into evolution of bilaterian karyotype and development. Nat. Ecol. Evol. 1, 0120 (2017).
We thank P. Holland for helpful comments on the earlier version of the manuscript. We thank X. Zhang, X. Yu, X. Wang, L. Tao, H. Ruan, H. Zhu, J. Wei and J. Lv for assistance with data analysis. We acknowledge grant support from the National Natural Science Foundation of China (31130054, 31322055, 31272656 and 31630081), the National High Technology Research and Development Program of China (863 program; 2012AA92204 and 2012AA10A405), the Taishan Scholar Project Fund of Shandong Province of China, the Natural Science Foundation for Distinguished Young Scholars of Shandong Province of China (JQ201308) and the AoShan Talents Program of Qingdao National Laboratory for Marine Science and Technology (2015ASTP-ES02). X.G. acknowledges support from Taishan Oversea Scholar Program of Shandong and USDA-NIFA/NJAES Project 1004475/NJ32920. We thank Dalian Zhangzidao Group for financial support, as well as providing scallop materials and facilities. We thank G. Jekely for providing access to the Platynereis transcriptome dataset.