Coccolithophores have influenced the global climate for over 200 million years1. These marine phytoplankton can account for 20 per cent of total carbon fixation in some systems2. They form blooms that can occupy hundreds of thousands of square kilometres and are distinguished by their elegantly sculpted calcium carbonate exoskeletons (coccoliths), rendering them visible from space3. Although coccolithophores export carbon in the form of organic matter and calcite to the sea floor, they also release CO2 in the calcification process. Hence, they have a complex influence on the carbon cycle, driving either CO2 production or uptake, sequestration and export to the deep ocean4. Here we report the first haptophyte reference genome, from the coccolithophore Emiliania huxleyi strain CCMP1516, and sequences from 13 additional isolates. Our analyses reveal a pan genome (core genes plus genes distributed variably between strains) probably supported by an atypical complement of repetitive sequence in the genome. Comparisons across strains demonstrate that E. huxleyi, which has long been considered a single species, harbours extensive genome variability reflected in different metabolic repertoires. Genome variability within this species complex seems to underpin its capacity both to thrive in habitats ranging from the equator to the subarctic and to form large-scale episodic blooms under a wide variety of environmental conditions.
Fundamental uncertainties exist regarding the physiology and ecology of E. huxleyi, and the relationships between different morphotypes (Fig. 1a). To investigate its gene repertoire and physiological capacity, we sequenced the diploid genome of CCMP1516 using the Sanger shotgun approach. The haploid genome is estimated to be 141.7 megabases (Mb) and 97% complete on the basis of conserved eukaryotic single-copy genes5,6 (Supplementary Table 1, Supplementary Data 7 and Supplementary Information 1.1–1.4). It is dominated by repetitive elements, constituting >64% of the sequence, much greater than seen for sequenced diatoms (Fig. 2 and Supplementary Information 2.10). Of the 30,569 protein-coding genes predicted—93% of which have transcriptomic support (expressed sequence tag or RNA-seq) (Supplementary Information 1.5–1.7, 2.1–2.2 and Supplementary Data 1–3)—we identified expansions in gene families specific to iron/macromolecular transport, post-translational modification, cytoskeletal development and signal transduction relative to other sequenced eukaryotic algae (Supplementary Information 2.3).
The E. huxleyi genome provides a crucial reference point for evolutionary, cellular and physiological studies because haptophytes represent a distinct branch on the eukaryotic tree of life (Fig. 1b). Consistent with other published analyses7, conserved marker genes show that the haptophytes branch as a sister clade to heterokonts, alveolates and rhizarians. However, as a lineage possessing secondary plastids, the evolutionary history of haptophyte genomes may be more complex8 than that suggested by a single concatenated analysis. Thus, individual gene phylogenies were constructed using clusters of orthologous proteins (1,563) identified by comparative analysis of E. huxleyi and at least 9 of 48 taxa sampled from across eukaryotes (Supplementary Information 2.4). E. huxleyi grouped monophyletically with heterokonts in 28–33% of the resolved trees and with the green lineage (green algae and plants) in 11–14%. Less frequent relationships were also observed, presumably reflecting a mosaic genome8 with contributions from the host lineage, the eukaryotic endosymbiont, and possibly horizontal gene transfer (Supplementary Fig. 1 and Supplementary Data 4).
Coccolithophores produce the anti-stress osmolyte dimethylsulphoniopropionate (DMSP), which can be demethylated to produce methylmercaptopropionate and/or cleaved by some organisms, such as E. huxleyi, to produce the predominant natural source of atmospheric sulphur, dimethylsulphide. Although the gene encoding the DmdA protein, which catalyses the initial demethylation of DMSP, was not detected in the genome, genes that produce sulphur and carbon intermediates and function in later stages of DMSP degradation were identified9. Also present is an intron-containing, but otherwise bacterial dddD-like, gene encoding an acetyl-coenzyme A (acetyl-CoA) transferase proposed to add CoA to DMSP before cleavage9 (Supplementary Table 2). These data will facilitate molecular approaches for probing DMSP biogeochemistry and the environmental importance of sulphur production and biotransformations.
E. huxleyi synthesizes unusual lipids that are used as nutritional/feedstock supplements, polymer precursors and petrochemical replacements. Two functionally redundant pathways for the synthesis of omega-3 polyunsaturated eicosapentaenoic and docosahexaenoic fatty acids were partially characterized10 (Supplementary Table 3). Pathway analysis indicates that E. huxleyi sphingolipids are primarily glucosylceramides, often with an unusual C9 methyl branch (Supplementary Table 3) found only in fungi and some animals11. Genes for two zinc-containing quinone reductases, involved in reduction of alkenone α,β-double bonds used in paleotemperature reconstructions and proposed biofuels, were also identified12,13.
Coccoliths have precise nanoscale architecture and unique light-scattering properties of interest to material and optoelectronic scientists. Carbonic anhydrase is associated with biomineralization in other organisms14 and accelerates bicarbonate formation. The 15 E. huxleyi carbonic anhydrase isozymes and genes involved in calcium and carbon transport, H+ efflux, cytoskeleton organization and polysaccharide modulation (Supplementary Table 4) represent targets for resolving molecular mechanisms governing coccolith formation, and will aid in predicting response patterns to anthropogenic CO2 increases and ocean acidification.
The global distribution of E. huxleyi (for example, Fig. 3a, c) and its capacity for bloom formation under different physicochemical parameters are puzzling. To investigate the potential influence of genome variation in this ecological dynamic, three E. huxleyi isolates (92A, EH2 and Van556) from different oceanic regions were deeply sequenced (265–352-fold coverage) (Fig. 3a, c, Supplementary Tables 5–7 and Supplementary Information 2.6). Two approaches were used to compare genomes. First, sequence reads were assembled and contigs aligned to the CCMP1516 reference genome using Standard Nucleotide BLAST (BLASTn; Supplementary Information 2.6.1). Although these isolates show >98% 18S ribosomal RNA (rRNA) identity, only 54–77% of their contigs showed similarity to CCMP1516. Of the remaining contigs, 71 Mb were shared between at least two deeply sequenced strains, and 8–40 Mb appeared to be isolate specific, as did 27 Mb of CCMP1516. Flow cytometric genome-size estimates also showed heterogeneity across isolates, with haploid genome sizes ranging from 99 to 133 Mb (Supplementary Information 2.5, 2.6.1 and Supplementary Table 5). These findings indicated considerable intraspecific variation.
To examine potential variations in gene content further, sequence reads were directly mapped to the CCMP1516 genome. Of the 30,569 predicted genes in CCMP1516, between 1,373 and 2,012 different genes were not found in 92A, Van556 and EH2 (cumulatively 5,218, or 17% of CCMP1516 genes), and 364 appeared to be missing from all three. These findings cannot be explained by poor coverage or sequencing bias alone. Of 458 highly conserved eukaryotic genes from the CEGMA set5, 95–97% were identified in the isolates, indicating nearly complete genome sequences (Supplementary Data 7). Together, de novo assemblies and direct mapping to CCMP1516 indicate that the pan genome of E. huxleyi represents a rapidly changing repository of genetic information with genomic fluidity estimated to be ≥10%15 (on the basis of CCMP1516 gene content).
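As a rough illustration of the fluidity statistic cited above, the sketch below computes genomic fluidity from per-strain gene sets, using the general pairwise definition (unique genes divided by total genes, averaged over all strain pairs). This is a minimal sketch under stated assumptions: the function name, the toy strain labels and the gene identifiers are hypothetical, and the estimate reported here was derived on the basis of CCMP1516 gene content, so the exact procedure may differ from this general form.

```python
from itertools import combinations


def genomic_fluidity(gene_sets):
    """Average, over all pairs of genomes, of (genes unique to either
    genome) / (total genes in both genomes).

    gene_sets: dict mapping a strain label to a set of gene identifiers.
    """
    names = list(gene_sets)
    if len(names) < 2:
        raise ValueError("need at least two genomes")
    ratios = []
    for a, b in combinations(names, 2):
        ga, gb = gene_sets[a], gene_sets[b]
        unique = len(ga - gb) + len(gb - ga)  # genes found in only one of the pair
        total = len(ga) + len(gb)             # gene count summed over the pair
        ratios.append(unique / total)
    return sum(ratios) / len(ratios)


# Toy data (hypothetical labels and gene IDs, not from the study).
toy = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6", "g7"},
}
toy_fluidity = genomic_fluidity(toy)

# Arithmetic check of a figure quoted in the text: 5,218 of 30,569 genes.
missing_percent = 100 * 5218 / 30569
```

In the toy data the three pairwise ratios are 0.25, 0.5 and 0.5, so the average is about 0.42; `missing_percent` rounds to the 17% quoted in the text.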
E. huxleyi isolate differences were assessed further by Illumina sequencing of ten additional strains. Although sequenced at lower coverage, these strains were estimated to be 91–95% complete (Supplementary Tables 6, 7 and Supplementary Data 7). Direct mapping of reads from the 13 strains to CCMP1516 revealed a ‘core genome’ containing about two-thirds of the genes predicted in the reference genome (Supplementary Information 2.6.2 and Supplementary Data 5), a core independently confirmed by comparative DNA microarrays (Supplementary Information 2.7, Supplementary Data 6 and Supplementary Fig. 2). Nearly 25% of CCMP1516 genes were not found in at least three other strains, indicating that E. huxleyi represents a species complex with a genetic repertoire much greater than that of any one strain (Supplementary Figs 3, 4). Although the most extensive gene-sequence divergence was observed between CCMP1516 and deeply sequenced isolates Van556, 92A and EH2, concatenated phylogenies define three well-supported clades that are not necessarily reflective of geographic distributions (Fig. 3b, c and Supplementary Information 2.6.1, 2.8).
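The core/variable split obtained by direct read mapping can be sketched as a simple threshold rule. The example below is an assumption-laden illustration, not the pipeline used in the study: it treats a reference gene as present in a strain when at least half of its length is covered by mapped reads (a 50% gene-coverage cut-off), and calls a gene core only if it is present in every strain. The function name, strain labels, gene identifiers and coverage values are all hypothetical.

```python
def classify_genes(coverage, min_fraction=0.5):
    """Split reference genes into core and variable sets.

    coverage: dict mapping gene ID -> dict of strain label -> fraction of the
              gene length covered by mapped reads (0.0 to 1.0).
    A gene is 'core' only if every strain reaches min_fraction.
    """
    core, variable = set(), set()
    for gene, per_strain in coverage.items():
        present_everywhere = all(
            fraction >= min_fraction for fraction in per_strain.values()
        )
        if present_everywhere:
            core.add(gene)
        else:
            variable.add(gene)
    return core, variable


# Toy coverage table (hypothetical values, not measured data).
toy_coverage = {
    "gene_1": {"strain_1": 0.98, "strain_2": 0.91, "strain_3": 0.75},
    "gene_2": {"strain_1": 0.95, "strain_2": 0.10, "strain_3": 0.88},
    "gene_3": {"strain_1": 0.00, "strain_2": 0.00, "strain_3": 0.49},
}
core_genes, variable_genes = classify_genes(toy_coverage)
```

Here only `gene_1` clears the threshold in all three toy strains, so it is the sole core gene; the other two fall into the variable set. A stricter or looser `min_fraction` shifts genes between the two sets, which is why the cut-off should be reported alongside any core-genome size.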
We searched the CCMP1516 genome for evidence of molecular mechanisms contributing to genome plasticity. There was limited evidence for horizontal gene transfers (Supplementary Information 2.9 and Supplementary Table 8), and although diverse, the complement of transposable elements was also small (Fig. 2 and Supplementary Information 2.10.2). However, E. huxleyi has a high density of unclassified repeats (∼31%) and tandem repeats/low-complexity regions (∼34%) with tandem-repeat/low-complexity density highest in introns (Fig. 2, Supplementary Information 2.10.1 and Supplementary Table 9). Most protein-coding genes contain multiple introns, often with noncanonical GC donor sites (Supplementary Fig. 5). The preference for 10–11-base-pair repeats in introns and their strong strandedness (meaning that on the sense and antisense strand either the motif or its reverse complement is highly favoured) raise the possibility that intronic tandem repeats have a functional role in exon swapping (Supplementary Information 2.10.3–2.10.5 and Supplementary Table 9).
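Strandedness in the sense described above can be illustrated by counting how often a repeat motif, versus its reverse complement, occurs in intron sequences written on the sense strand. The sketch below is a hypothetical illustration only: the bias score (difference over sum of the two counts), the function names and the toy sequences are assumptions, and the analysis in the Supplementary Information may use a different statistic.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def strand_bias(introns, motif):
    """Return (fwd - rev) / (fwd + rev), where fwd counts the motif and rev
    counts its reverse complement across sense-strand intron sequences.
    +1 means only the motif is seen, -1 only its reverse complement,
    and 0 means no preference (or no occurrences at all)."""
    rc = reverse_complement(motif)
    fwd = sum(count_overlapping(s, motif) for s in introns)
    rev = sum(count_overlapping(s, rc) for s in introns)
    if fwd + rev == 0:
        return 0.0
    return (fwd - rev) / (fwd + rev)


# Toy introns (hypothetical sequences, not from the genome).
toy_introns = ["AAGGAAGGAAGG", "CCTT"]
toy_bias = strand_bias(toy_introns, "AAGG")
```

In the toy input the motif appears three times and its reverse complement once, giving a bias of 0.5. Note that a motif identical to its own reverse complement always scores 0 under this measure, so it cannot carry strand information.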
E. huxleyi blooms under many different oceanographic regimes. We explored how the core genome and variable components in different ecotypes might influence success (Supplementary Information 2.11 and Supplementary Fig. 6). The remarkable capacity of E. huxleyi to withstand photoinhibition16 lies in the core genome, which encodes a variety of photoreceptors; proteins that function in the assembly and repair of photosystem II, such as D1-specific proteases and FtsH enzymes; and proteins that have a role in non-photochemical quenching (NPQ) or synthesis of NPQ compounds (Supplementary Table 10). Genes encoding reactive oxygen species (ROS) scavenging antioxidants, enzymes for synthesis of vitamin B6 constituents used during photo-oxidative stress in plants17 (Supplementary Tables 10, 15) and many light-harvesting complex (LHC) proteins are also in the core. Of the 68 LHCs, 17 belong to LI818 or LHCZ classes with photoprotective capabilities18 (Supplementary Table 11 and Supplementary Information 3.1). The complex repertoire of photoprotectors facilitates tolerance to high light by minimizing ROS accumulation and preventing oxidative damage.
Phosphorus and nitrogen are key determinants of oceanic primary production. A suite of core genes allows E. huxleyi to thrive in low phosphorus conditions. This includes six inorganic phosphate transporters (Fig. 4), a high-efficiency alkaline phosphatase (Fig. 4)19, purple acid phosphatases and other enzymes used to hydrolyse and acquire organic phosphorus compounds20. Genes for the synthesis of betaine and sulpholipids used as replacements for cellular phospholipids21 are also present (Supplementary Table 12). Numbers of phosphate transporters and alkaline phosphatases (Fig. 4), however, vary considerably from strain to strain, supporting previous observations of differences in phosphorus uptake and hydrolysis kinetics22.
Genes for inorganic nitrogen uptake and assimilation (nitrate, nitrite and ammonium) and for acquisition and degradation of nitrogen-rich compounds (for example, urea) (Fig. 4 and Supplementary Table 13) are present in the core genome and may explain the broad range of nitrogen concentrations in which E. huxleyi blooms23. Although present in multiple copies, the number of genes encoding nitrite (4), nitrate (8) and urea (3) transporters was relatively small compared to ammonium transporters (20). This enrichment, and the varied distribution across strains (Fig. 4), may be indicative of strain-specific ammonium preference, or the need for tightly regulated transporters to mediate high-affinity ammonium/ammonia uptake while offering ammonium-toxicity protection. Surprisingly, core iron-containing (nirS) versus clade-restricted copper-containing (nirK) nitrite reductases were identified (Fig. 3), although iron is often more limiting than copper in oceanic environments.
E. huxleyi grows well in surface waters where iron levels are generally low (0.02–1 nM)24. The core genome indicates that iron is acquired using the natural resistance-associated macrophage protein (NRAMP) class of metal transporters, multicopper oxidases, surface-bound ferric reductases, and possibly, membrane-bound siderophores (Supplementary Data 8). Genes involved in mechanisms limiting iron requirements are also in the core, including manganese and copper/zinc superoxide dismutases, both zinc and iron alcohol dehydrogenases and rubredoxins, and copper- and haem- plastocyanins (PetE) and ascorbate oxidases. Selective recruitment of these enzymes as well as flavodoxin, a functional analogue of ferredoxin, may reduce iron demands25. E. huxleyi encodes many iron-binding proteins, 80 in the core and 30 linked to the variable genome (Fig. 4). Iron limitation is linked to reduced calcification and photosynthesis26, and our analysis suggests cellular demands and mechanisms to alleviate iron deprivation differ between strains and are probably important factors shaping E. huxleyi ecological dynamics.
The E. huxleyi pan genome encodes nearly 700 proteins whose structure and function depend on metal binding (Supplementary Data 8). Selenium is essential for growth27 and potentially incorporated into at least 49 proteins (20 gene families) present in nearly all strains (Supplementary Table 14). Zinc affects growth and nitrogen usage26, and is a cofactor of more than 400 proteins, many present in the variable genome (Fig. 4). Heterogeneity in zinc-binding proteins across strains may explain variations in zinc quotas between cultured isolates26,28.
In addition to metals, E. huxleyi relies on a range of vitamins. Genes for de novo synthesis of antioxidants such as pro-vitamin A, vitamins C, E, B6 and B9 and the ultraviolet-light-absorbing vitamin D are uniformly present across strains. E. huxleyi, however, is ostensibly unable to inhabit ocean regions where vitamins B1 and B12 are inaccessible. ThiC, a key B1 biosynthesis enzyme, was not found in the genome, and despite relying exclusively on a vitamin-B12-dependent methionine synthase, genes for a B12 transporter and several enzymes required for B12 synthesis are also absent (Supplementary Table 15).
E. huxleyi is the dominant bloom-forming coccolithophore and can be abundant in oligotrophic oceans, directly influencing global carbon cycling. Distributions in modern oceans and those dating back to the Pleistocene era demonstrate its tremendous capacity for adaptation. Until now, the underlying mechanisms for the physiological and morphological variations between isolates have been elusive. Evidence presented here indicates that this capacity can be explained, in part, by its pan genome, the first of its kind reported for what was thought to be a single microbial eukaryotic algal species. Variations in gene complements (Fig. 4) within this species complex may drive phenotypic variation, ecological dynamics and the physiological heterogeneity observed in past studies. The high level of diversity indicates that a single strain is unlikely to be typical—or representative—of all strains. Future sequencing of phytoplankton isolates will reveal whether this discovery is a unique or more common feature in microalgae. Together, the physiological capacity and genomic plasticity of E. huxleyi make it a powerful model for the study of speciation and adaptations to global climate change.
The diploid genome of CCMP1516 (isolated from the Equatorial Pacific (02.6667S 82.7167W)) was Sanger sequenced and assembled using the Arachne assembler. Gene models were predicted and validated using computational tools, experimental data (including transcriptomics; Sanger and Illumina sequenced) and NimbleGen tiling array experiments. Thirteen additional strains were sequenced using Illumina and mapped to the reference genome. A detailed description of materials and methods is in Supplementary Information.
This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and the online version of this paper is freely available to all readers. Assembly and annotation data for E. huxleyi strain 1516 are available through JGI Genome Portal at http://jgi.doe.gov/Ehux and at DDBJ/EMBL/GenBank under accession number AHAL00000000. The version described in this paper is the first version, AHAL01000000. Sequence information for other strains can be found at the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under the accession number SRA048733.2.
Joint Genome Institute (JGI) contributions were supported by the Office of Science of the US Department of Energy (DOE) under contract no. DE-AC02-05CH11231. We thank A. Gough for assistance with figures, C. Gentemann for Fig. 3 ocean colour analysis and P. Keeling for discussions.
This file contains refined gene models validated by Sanger ESTs.
This file contains refined gene models validated by tiling arrays.
This file contains refined gene models validated by RNAseq.
This file contains a phylogenomic gene list.
This file contains core and variable genes identified by direct mapping of Illumina reads based on 50% gene coverage.
This file contains core genes identified by comparative genomic hybridization.
This file contains the list of conserved eukaryotic genes from the Core Eukaryotic Genes Mapping Approach (CEGMA).
This file contains genes encoding putative metal binding proteins.