Oceans cover approximately 71% of the Earth’s surface and harbour most of the phylum diversity of the animal kingdom. Understanding marine biodiversity and its evolution remains a major challenge. The Pacific oyster C. gigas (Thunberg, 1793) is a marine bivalve belonging to the phylum Mollusca, which contains the largest number of described marine animal species1. Molluscs have vital roles in the functioning of marine, freshwater and terrestrial ecosystems, and have had major effects on humans, primarily as food sources but also as sources of dyes, decorative pearls and shells, vectors of parasites, and biofouling or destructive agents. Many molluscs are important fishery and aquaculture species, as well as models for studying neurobiology, biomineralization, ocean acidification and adaptation to coastal environments under climate change2,3. As the most speciose member of the Lophotrochozoa, phylum Mollusca is central to our understanding of the biology and evolution of this superphylum of protostomes.

As sessile marine animals living in estuarine and intertidal regions, oysters must cope with harsh and dynamically changing environments. Abiotic factors such as temperature and salinity fluctuate wildly, and toxic metals and desiccation also pose serious challenges. Filter-feeding oysters face tremendous exposure to microbial pathogens. Oysters do have a notable physical line of defence against predation and desiccation in the formation of thick calcified shells, a key evolutionary innovation making molluscs a successful group. However, acidification of the world’s oceans by uptake of anthropogenic carbon dioxide poses a potentially serious threat to this ancient adaptation4. Understanding biomineralization and molluscan shell formation is, thus, a major area of interest5. Crassostrea gigas is also an interesting model for developmental biology owing to its mosaic development with typical molluscan stages, including trochophore and veliger larvae and metamorphosis.

A complete genome sequence of C. gigas would enable a more thorough understanding of oyster biology and the evolution of Lophotrochozoa. One of the main challenges, however, is the high levels of polymorphism present in oysters and many marine invertebrates6,7,8. To overcome this, an oyster derived from four generations of full-sibling mating (coefficient of inbreeding, F = 0.59) was used for genome sequencing and assembly (Supplementary Text B1) through fosmid pooling, next-generation sequencing (NGS) and hierarchical assembling. Combining these genomic data with transcriptomes from different organs, different developmental stages and adults challenged with stressors, in addition to mass spectrometric analysis of shell proteins, allowed us to explore characteristics of the oyster genome and key aspects of molluscan biology related to stress response and shell formation.

Sequencing and hierarchical assembly

NGS technology has been successfully applied for de novo genome sequencing and assembly using whole-genome shotgun strategies9,10,11,12,13. We initially generated 155-fold Illumina whole-genome shotgun reads (Supplementary Table 1), but could not adequately assemble them owing to high levels of polymorphism and abundant repetitive sequences (Supplementary Text B2 and Supplementary Fig. 1). As possible alternative sequencing strategies—such as the addition of longer Roche 454 reads12,13 or traditional bacterial artificial chromosome (BAC)-to-BAC sequencing—are expensive, we opted instead for a more cost-effective fosmid-pooling strategy. In brief, a fosmid library was constructed, and 145,170 clones (tenfold genome coverage) were evenly and randomly assigned into 1,613 pools, each of which was sequenced to 60-fold depth and assembled separately (Fig. 1 and Supplementary Table 1). Contigs from each pool were merged into supercontigs, totalling 1,002 megabases (Mb) (Supplementary Text B4.1–3), which was larger than genome-size estimates of 637 Mb from flow cytometry or 545 Mb from k-mer (k-base fragment) analysis (Supplementary Text B1, 2.3), owing to failure of some allelic variants to merge (Supplementary Figs 3 and 4). Self-to-self whole-genome alignment with LASTZ14 and sequencing depth information were used to remove redundancy in the assembly (Supplementary Text B4.4). The resulting 446 Mb of the assembly were retained for further scaffolding using paired-end data (Fig. 1). The final assembly comprised 559 Mb, with a contig N50 size (at which 50% of assembly was covered) of 19.4 kilobases (kb) and a scaffold N50 size of 401 kb (Supplementary Text B4.5 and Supplementary Table 3). Over 90% of the assembly was covered by the longest 1,670 (14%) scaffolds.

Figure 1: Fosmid-pooling strategy for oyster genome assembly.
figure 1

Genomic DNA was randomly sheared into fragments. a, b, A 40-kb-insert fosmid library was constructed (a), and 145,170 fosmid clones were randomly selected and assigned into 1,613 pools, each containing 90 clones covering 0.57% of the diploid genome (b). For each pool, three Illumina short-insert barcoded libraries (two 200 bp and one 500 bp) were constructed and 60-fold coverage of 90-bp reads (20-fold per library) were generated, and assembled using SOAPdenovo with optimizing parameters. Assemblies from each pool were further corrected and reassembled if unexpected connections were detected owing to high similarity sequences from different fosmids, and gaps were filled by local assembly. c, Fosmid scaffolds were split into contigs at unfilled regions, leaving no undetermined bases in the sequences. Each base was assigned a Phred-like quality score determined by its coverage and alignment mismatches, and these sequences were merged into supercontigs using the overlap layout consensus method. Redundancy was removed using self-to-self alignment and sequencing depth information. d, Whole-genome shotgun Illumina libraries (200-bp to 20-kb inserts) from sheared genomic DNA were constructed for mated-pair Illumina sequencing. e, The fosmid supercontigs were linked into scaffolds using (1) the whole-genome shotgun sequences; (2) inferred paired-end information extracted from assembled pool scaffolds with a span size ranging from 50 bp to 37.5 kb; and (3) 225,000 fosmid ends sequenced using Sanger technology.

PowerPoint slide

To assess the completeness of the assembly, 105-fold coverage of short-insert library reads (<2 kb) that participated in assembly (Supplementary Table 1) were aligned against the assembly. Over 99% of these reads were successfully mapped, using a combination of Burrows–Wheeler Aligners15 and the more sensitive LASTZ (Supplementary Fig. 5 and Supplementary Table 4). The integrity of the assembly is further demonstrated by the successful mapping of 99% of the BAC sequenced obtained using the Sanger sequencing technique, and 98% of 68,000 expressed sequence tags from 454 sequencing (Supplementary Text B5, Supplementary Fig. 6 and Supplementary Tables 5 and 6). Fosmid pooling has been used for re-sequencing16,17, and our results show that the combination of fosmid pooling, NGS and hierarchical assembly provides a new, cost-effective alternative for de novo sequencing and assembly of complex genomes.

Polymorphism and repetitive sequences

To understand polymorphism in the oyster genome, we analysed allelic variation in the assembled genome (inbred) and one re-sequenced wild oyster (wild) (Supplementary Text C1). The inbred genome contained 3.1 million single-nucleotide polymorphisms and 258,405 short insertion/deletion (indels, 1–40 base pairs (bp)) yielding a sequence polymorphism rate of 0.73%, whereas the wild genome had 3.8 million single-nucleotide polymorphisms and 238,182 indels, or a polymorphism rate of 1.3% (Supplementary Table 7), comparable to previous estimates18. This 44% reduction in polymorphism in the inbred genome is smaller than the 59.4% predicted from four generations of brother–sister mating, indicating that selection favouring heterozygotes had occurred19. The polymorphism combining inbred and wild (among four haplotypes) was 2.3%, higher than that in most studied animal genomes20,21 but comparable to that in known high-polymorphism species7. In inbred and wild, we found 3,094 short indels located in coding regions inferred to cause frameshift variants in 2,665 genes, providing an important source for recessive lethal mutations.

k-mer-based analysis of the oyster genome showed that 35% of 17-mers had at least two identical copies in the genome, suggesting an abundance of repetitive sequences (Supplementary Fig. 1). Similarly, homology searching and ab initio prediction found 202 Mb (36% of the genome) in repetitive sequences (Supplementary Text C2 and Supplementary Table 8). Over 62% of the detected repeats could not be assigned to known categories, reflecting the paucity of genomic information from molluscs22. Large numbers of transposase (359) and reverse-transcriptase (779) gene fragments were detected; over 96% of these had detectable transcripts (Supplementary Fig. 8). Alignment of the wild sequence against the assembly identified 20,605 deletions (>100 bp), over 80% of which overlapped with detected transposable elements, suggesting that transposable elements may have an active role in shaping genome variation. Using MITE-hunter23, we detected 157,007 copies of miniature inverted-repeat transposable elements (MITEs), accounting for a remarkable 8.82% of the genome (Supplementary Text C2.3 and Supplementary Table 9). Pair-wise comparisons show extremely low sequence divergence in some MITE families (Supplementary Fig. 9), indicating that they may still be active.

Gene annotation and developmental genomics

A total of 28,027 genes were predicted encoding 50 amino acids or more by combining de novo prediction and evidence-based searches using reference genomes, oyster expressed sequence tags and transcriptomes from multiple organs and developmental stages (Supplementary Text D1 and E1 and Supplementary Fig. 11), with 96.1% showing expression (reads per kb per million mapped reads (RPKM) > 1 in at least one transcriptome; Supplementary Text D2). Of the inferred proteins, 21,085 matched entries in the SWISS-PROT, InterPro or TrEMBL databases. These genes plus their transcriptome profile from 7 adult organs and at 38 developmental stages provide valuable resources for comparative genomics analysis (Supplementary Text E2 and 3), functional inference and studies on development and organogenesis (Supplementary Text F2 and Supplementary Fig. 15).

One notable finding of developmental interest is that the oyster Hox gene cluster is broken into four sections (Fig. 2) with flanking non-Hox genes (Supplementary Fig. 16). We did not find a clear Antennapedia gene, which is present in other bivalves such as Pecten and Yoldia24 (Supplementary Fig. 17). Disruption of the Hox cluster, as also observed in tunicates, nematodes and drosophilids, has been attributed to the loss of temporal co-linearity and modified developmental control25. Supporting this model, we find that Hox genes in the oyster are not activated in an order matching their identity or genomic position, with, for example, HOX4 and HOX1 peaking before gastrulation, LOX5 and POST2 during the trochophore stage and HOX5 during the pediveliger stage (Supplementary Fig. 18 and Supplementary Table 15).

Figure 2: Clustering of Hox genes in Pacific oyster Crassostrea gigas , polychaete annelid Capitella teleta , fruitfly Drosophila melanogaster , lancelet Branchiostoma floridae and Homo sapiens.
figure 2

Oblique lines indicate regions of Hox cluster that are non-contiguous or interrupted. Blue denotes anterior Hox genes, yellow denotes paralogy group 3 Hox genes, green and purple denote central Hox genes and red denotes posterior Hox genes.

PowerPoint slide

Adaptation to environmental stress

Comparison with seven other sequenced genomes identified 8,654 oyster-specific genes (Supplementary Text E3.1) that are probably important in the evolution and adaptation of oysters and other molluscs. With oysters being the only representative, these genes could be shared by other molluscs. Among these genes, gene ontology terms related to ‘protein binding’, ‘apoptosis’, ‘cytokine activity’ and ‘inflammatory response’ are highly enriched (P < 0.0001; Supplementary Text E2 and Supplementary Table 17), indicating over-representation of some host-defence genes against biotic and abiotic stress. Manual examination shows that several gene families related to defence pathways, including protein folding, oxidation and anti-oxidation, apoptosis and immune responses, are expanded in C. gigas (Fig. 3a and Supplementary Table 18). The oyster genome contains 88 heat shock protein 70 (HSP70) genes, which have crucial roles in protecting cells against heat and other stresses, compared with 17 in humans and 39 in sea urchins. Phylogenetic analysis finds clustering of 71 oyster HSP70 genes to themselves, suggesting that the expansion is specific to the oyster (Supplementary Fig. 19). Also expanded are cytochrome P450 (Supplementary Fig. 20) and multi-copper oxidase gene families, which are important in the biotransformation of endobiotic and xenobiotic chemicals26, and extracellular superoxide dismutases, which are important in defence against oxidative stress. The oyster genome has 48 genes coding for inhibitor of apoptosis proteins (IAPs), compared with 8 in humans and 7 in sea urchins, indicating a powerful anti-apoptosis system in oysters. Genes encoding lectin-like proteins, including C-type lectin, fibrinogen-related proteins and C1q domain-containing proteins (C1QDCs), are highly over-represented in the oyster genome (P < 0.0001; Supplementary Table 18); these genes have important roles in the innate immune response in invertebrates27,28,29. Interestingly, many immune-related genes, including genes coding for Gram-negative bacteria-binding proteins, peptidoglycan-recognition proteins, defensin, C-type-lectin-domain-containing proteins and C1QDCs, are highly expressed in the digestive gland (Supplementary Fig. 21), indicating that the digestive system of this filter feeder is an important first-line defence organ against pathogens.

Figure 3: Expansion, expression and pathway distribution of defence-related genes in Crassostrea gigas.
figure 3

a, Expansion and expression of key genes in major stress-response pathways in C. gigas. Genes include HSPs and HSF in the heat-shock response; GRP78, CRT, CNX, GRP94, PERK, IRE1 and EIF2a in the endoplasmic reticulum unfolded-protein response (UPRER); IAPs, BCL2 like, BAG, BI1, caspases, FADD and TNFR in apoptotic pathways; CYP450 and MO in oxidation; and SOD, GPX, PRX and CAT in anti-oxidation. Boxes with bold black borders indicate gene families (HSPs, IAPs and SODs) expanded in C. gigas, and the filled colours correspond to their degree of upregulation in RPKMtreatment/RPKMcontrol by stress, found in 61 transcriptomes from oysters challenged with 9 types of stressors (Supplementary Text G2 and Supplementary Table 23). b, Venn diagram of common and unique genes expressed in response to temperature, salinity, air exposure and heavy-metal stress (zinc, cadmium, copper, lead and mercury), showing overlap of responses. c, Number of genes with and without detectable paralogues differentially expressed under stress and normal conditions, showing that genes responding to stress are more likely to have paralogues (P < 1 × 10−10; χ2 test). Green sections of the pie chart represent 1,442, 809, 358, 550 and 7,938 paralogues for air exposure, metal, temperature, salinity and normal conditions, respectively.

PowerPoint slide

To investigate genome-wide responses to stress, we sequenced 61 transcriptomes from C. gigas subjected to nine stressors, including temperature, salinity, air exposure and heavy metals (Supplementary Text G1 and Supplementary Tables 19 and 20). We found that 5,844 genes were differentially expressed under at least one stressor, and genes responding to different stressors showed significant overlap (Fig. 3b and Supplementary Fig. 23a). Air exposure induced a response from the largest number of genes (4,420), indicating that air exposure is a major stressor and that oysters have evolved an extensive gene set in defence. Genes differentially expressed in response to stress are more likely to have paralogues (Fig. 3c), suggesting that expansion and selective retention of duplicated defence-related genes are probably important to oyster adaptation. Under most stressors, genes coding for HSPs, histones, IAPs and protein biogenesis were upregulated, and those for protein degradation downregulated, pointing to concerted responses to maintain cellular homeostasis30 (Supplementary Text G3 and Supplementary Table 21). Genes involved in the unfolded protein response to cellular stress in the endoplasmic reticulum (coding for calreticulin, calnexin, 78- and 94-kDa glucose-regulated proteins) were upregulated, indicating that protein quality control is critical in cellular homeostasis under stress.

Air exposure induced up to 67-fold upregulation of five highly expressed IAPs (Supplementary Fig. 24a). Other inhibitors of apoptosis were also upregulated: BCL2 up to fourfold and BAG up to 12-fold (Supplementary Fig. 24b). These apoptosis inhibitors were also highly upregulated under heat and low salinity stress. These findings, along with the expansion of IAPs, suggest that a powerful anti-apoptosis system exists and may be critical for the amazing endurance of oysters to air exposure and other stresses. The existence of an intrinsic apoptosis pathway in invertebrates has been controversial, and parts of the pathways have only recently been demonstrated for two lophotrochozoans31,32. The finding of key genes belonging to both intrinsic (BAX, BAK, BAG, BCL2, BI1 and procaspase) and extrinsic (TNFR and caspase 8) apoptosis pathways indicates that oysters have advanced apoptosis systems. Powerful inhibition of apoptosis as shown by genomic and transcriptomic analyses may be central to the ability of oysters to tolerate prolonged air exposure and other stresses.

Heat stress induced a 2,000-fold increase in expression of five highly inducible HSP70 genes or a 13.9-fold increase in average expression of all HSP70 genes, amounting to 4.2% of all transcripts (Supplementary Figs 24c and 25). The genomic expansion and massive upregulation of HSP genes help to explain why C. gigas can tolerate temperatures as high as 49 °C when exposed to summer sun at low tide33. HSP genes were also upregulated under other stressors and may be central to the oyster defence against all stresses (Supplementary Fig. 25). HSP genes may also inhibit apoptosis by binding to effector caspases34.

Genes involved in signal transduction, including genes coding for G-protein-coupled receptors and Ras GTPase, were also activated by stressors (Supplementary Fig. 24f) and over-represented in the oyster genome (Supplementary Table 11). These regulators may have a role in orchestrating stress responses, which seem to be well coordinated (Fig. 3a and Supplementary Fig. 25). The expansion of key defence genes and the strong, complex transcriptomic response to stress highlight the sophisticated genomic adaptations of the oyster to sessile life in a highly stressful environment.

Shell formation

Calcified shells provide critical protection against predation and desiccation in sessile marine animals such as oysters. Molluscan shells consist of calcium carbonate (CaCO3) crystals of either aragonite or calcite embedded in an elaborate organic matrix. Two models have been advanced for molluscan shell formation. The matrix model posits that mineralization occurs in a mantle-secreted matrix of chitin, silk fibroin and acidic proteins35,36. Chitin and silk proteins are proposed to provide matrix structure, whereas acidic proteins control the nucleation and growth of CaCO3 crystals. The cellular model suggests that biomineralization is cell-mediated; that is, crystals are formed in haemocytes and then deposited at the mineralization front37.

We searched the oyster genome for genes implicated in shell formation in previous studies and examined their expression in different tissues and at different stages (Supplementary Text H1, 2). We also sequenced peptides from shells, mapped them to the genome and identified 259 shell proteins (Supplementary Text H3 and Supplementary Table 24). Although our search found evidence for the involvement of chitin, we did not find any silk-like proteins encoded in the oyster genome (Supplementary Text H2) but found, instead, many diverse proteins that may have roles in matrix construction and modification. Notably, a gene coding for a fibronectin-like protein was highly expressed at the early developmental stage, when larval shells are formed, in unison with chitin synthase (Fig. 4a) and was mostly expressed in the adult mantle (40× other organ average; Fig. 4b); the fibronectin-like protein was among the most abundant proteins found in oyster shells. Genes coding for laminin and some collagen proteins were also highly expressed in the mantle (Supplementary Fig. 27a) and found in shells. These are typical extracellular matrix (ECM) proteins, and their presence in shells suggests that the shell matrix has similarities to the ECM of animal connective tissues and basal lamina. Unlike silk fibroins that can self assemble38, the formation of fibronectin fibrils in the ECM is cell mediated39. Oyster fibronectin-like proteins have five type-III domains for integrin binding and cell adhesion. Genes coding for integrins were highly expressed in haemocytes (4× other organ average, Supplementary Fig. 27b). Thus, haemocytes may organize fibronectin-like fibril formation in the shell matrix as they do in ECM.

Figure 4: Genes related to shell formation identified from mass spectroscopy analysis of shell proteins and transcriptome data.
figure 4

a, Relative expression (y axis) of genes coding for chitin synthase (gene CGI_10009438) and fibronectin-like (CGI_10016964) in early development corresponds to the formation of shell gland and first larval shells, as seen in scanning electron microscope photos. White arrow denotes the invagination that forms the shell gland. Developmental stages (x axis) and their timeline are defined in Supplementary Table 12. b, In adults, chitin synthase and fibronectin-like proteins (same colour as in a) are almost exclusively expressed in the mantle compared with other organs. Fibronectin-like is also one of the most abundant proteins found in the shell. c, Distribution of shell proteins in diverse Kyoto encyclopedia of genes and genomes (KEGG) pathways indicative of general cellular functions. d, Expression of 26 tyrosinase genes in the mantle edge, mantle pallial and other organs. Tyrosinases are abundant in shells and their higher expression in the non-pigmented mantle pallial indicate that their functions are not limited to melanogenesis but are related to shell formation.

PowerPoint slide

The involvement of cells in shell formation is further supported by the functional diversity of proteins detected in shells. Many housekeeping proteins, such as elongation factor 1α and ribosomal proteins, were found in the shell; indeed, most oyster shell proteins are not structural proteins but are distributed in diverse metabolic pathways (Fig. 4c and Supplementary Table 25). This functional diversity of shell proteins mirrors that of cells, which is unexpected under the matrix model. Furthermore, 84% of the 259 shell proteins identified are not classical secreted proteins (Supplementary Text H3.4 and Supplementary Table 24); they may be part of cells or deposited by exosomes40. Supporting the presence of exosomes, 61 of the 259 shell proteins matched proteins in the exosome database41. Cells and exosome-like vesicles containing calcite crystals have been observed at the mineralization front37,42, although their significance in shell formation is debated. This study provides molecular evidence for their presence inside shells and their probable participation in shell formation.

Many shell proteins are enzymes that may be involved in matrix construction or modification. A homologue of penicillin-binding protein is exclusively expressed in mantle (72× other organ average) and highly abundant in shells (Supplementary Fig. 27d). Penicillin-binding protein is a transpeptidase that crosslinks glycopeptides in bacterial cell walls43 and may have similar functions in the shell matrix. Another notable enzyme found is tyrosinase. The oyster genome has an expanded set of 26 genes coding for tyrosinase, compared with one in Caenorhabditis elegans and two in humans; most genes coding for tyrosinase are mantle specific (Fig. 4d) and highly enriched among shell proteins (P = 8 × 10−6). Although tyrosinase is a key enzyme in melanogenesis44,45, it is most highly expressed in the non-pigmented pallial mantle (Fig. 4d), indicating that it has other functions in the oyster. The mantle secretes tyrosine-rich proteins46, and oxidation of tyrosine may be essential for shell matrix maturation. Several proteinases and proteinase inhibitors are highly mantle specific and abundant in shells, and may be involved in matrix formation, modification and protection (Supplementary Table 24). Together, these results indicate that oyster shell matrix is not formed simply by self-assembling silk-like proteins but by diverse proteins through complex assembly and modification processes that may involve haemocytes and exosomes.

Concluding remarks

We sequenced and assembled the genome of the Pacific oyster using an inbred individual, short-read NGS and a new fosmid-pooling and hierarchical assembly strategy. The draft assembly provided insight into a molluscan genome characterized by high polymorphism, abundant repetitive sequences and active transposable elements. Genomic, transcriptomic and proteomic analyses show unique adaptations of oysters to sessile life in a highly stressful intertidal environment and the complexity of shell formation. The oyster genome sequence and comprehensive transcriptome data provide valuable resources for studying molluscan biology and lophotrochozoan evolution, and for genetic improvement of oysters and other important aquaculture species.

Methods Summary

The sequenced Pacific oyster is an inbred female produced by four generations of brother–sister mating. Genome sequences were produced with Illumina platform using fosmid pooling and assembled with a new hierarchical assembly strategy. Fosmid ends were sequenced by Sanger. Gene models were obtained by integrating results of de novo gene prediction and alignment-based methods based on homology and transcriptomic evidence. Transcriptomes were sequenced with Illumina platform. The proteome of the shell was obtained by mass spectrometry. All methods are described in detail in the Supplementary Information.