Fundamental uncertainties exist regarding the physiology and ecology of E. huxleyi, and the relationships between different morphotypes (Fig. 1a). To investigate its gene repertoire and physiological capacity, we sequenced the diploid genome of CCMP1516 using the Sanger shotgun approach. The haploid genome is estimated to be 141.7 megabases (Mb) and 97% complete on the basis of conserved eukaryotic single-copy genes5,6 (Supplementary Table 1, Supplementary Data 7 and Supplementary Information 1.1–1.4). It is dominated by repetitive elements, constituting >64% of the sequence, much greater than seen for sequenced diatoms (Fig. 2 and Supplementary Information 2.10). Of the 30,569 protein-coding genes predicted—93% of which have transcriptomic support (expressed sequence tag or RNA-seq) (Supplementary Information 1.5–1.7, 2.1–2.2 and Supplementary Data 1–3)—we identified expansions in gene families specific to iron/macromolecular transport, post-translational modification, cytoskeletal development and signal transduction relative to other sequenced eukaryotic algae (Supplementary Information 2.3).

Figure 1: Emiliania huxleyi and its position in the eukaryotic tree of life.
a, E. huxleyi has five well-characterized calcification morphotypes and an overcalcified state1. b, Cladogram showing the distinct branch occupied by the haptophyte lineage on the basis of RAxML analysis of concatenated, nuclear-encoded proteins after addition of homologues from CCMP1516 and a pico-prymnesiophyte-targeted metagenome8. Lineages with algal taxa are indicated (symbol). Filled circles represent nodes with ≥70% bootstrap support. The tree is rooted for display purposes only.

Figure 2: Relative composition of the E. huxleyi genome.
Structural composition of genomes from CCMP1516 and the diatom P. tricornutum. Grey-shaded regions of each class depict proportions of tandem repeats and low-complexity regions. The grey vertical box contains only tandem repeats and low-complexity sequence. Pie charts indicate the proportion of non-repeated (white) and repeated or low-complexity (black) sequences in each haploid genome.

The E. huxleyi genome provides a crucial reference point for evolutionary, cellular and physiological studies because haptophytes represent a distinct branch on the eukaryotic tree of life (Fig. 1b). Consistent with other published analyses7, conserved marker genes show that the haptophytes branch as a sister clade to heterokonts, alveolates and rhizarians. However, as a lineage possessing secondary plastids, the evolutionary history of haptophyte genomes may be more complex8 than that suggested by a single concatenated analysis. Thus, individual gene phylogenies were constructed using clusters of orthologous proteins (1,563) identified by comparative analysis of E. huxleyi and at least 9 of 48 taxa sampled from across eukaryotes (Supplementary Information 2.4). E. huxleyi grouped monophyletically with heterokonts in 28–33% of the resolved trees and with the green lineage (green algae and plants) in 11–14%. Less frequent relationships were also observed, presumably reflecting a mosaic genome8 with contributions from the host lineage, the eukaryotic endosymbiont, and possibly horizontal gene transfer (Supplementary Fig. 1 and Supplementary Data 4).

Coccolithophores produce the anti-stress osmolyte dimethylsulphoniopropionate (DMSP), which can be demethylated to produce methylmercaptopropionate and/or cleaved by some organisms, such as E. huxleyi, to produce the predominant natural source of atmospheric sulphur, dimethylsulphide. Although the gene encoding the DmdA protein, which catalyses the initial demethylation of DMSP, was not detected in the genome, genes that produce sulphur and carbon intermediates and function in later stages of DMSP degradation were identified9. Also present is an intron-containing, but otherwise bacterial dddD-like, gene encoding an acetyl-coenzyme A (acetyl-CoA) transferase proposed to add CoA to DMSP before cleavage9 (Supplementary Table 2). These data will facilitate molecular approaches for probing DMSP biogeochemistry and the environmental importance of sulphur production and biotransformations.

E. huxleyi synthesizes unusual lipids that are used as nutritional/feedstock supplements, polymer precursors and petrochemical replacements. Two functionally redundant pathways for the synthesis of omega-3 polyunsaturated eicosapentaenoic and docosahexaenoic fatty acids were partially characterized10 (Supplementary Table 3). Pathway analysis indicates that E. huxleyi sphingolipids are primarily glucosylceramides, often with an unusual C9 methyl branch (Supplementary Table 3) found only in fungi and some animals11. Genes for two zinc-containing quinone reductases, involved in reduction of alkenone α,β-double bonds used in paleotemperature reconstructions and proposed biofuels, were also identified12,13.

Coccoliths have precise nanoscale architecture and unique light-scattering properties of interest to material and optoelectronic scientists. Carbonic anhydrase is associated with biomineralization in other organisms14 and accelerates bicarbonate formation. The 15 E. huxleyi carbonic anhydrase isozymes and genes involved in calcium and carbon transport, H+ efflux, cytoskeleton organization and polysaccharide modulation (Supplementary Table 4) represent targets for resolving molecular mechanisms governing coccolith formation, and will aid in predicting response patterns to anthropogenic CO2 increases and ocean acidification.

The global distribution of E. huxleyi (for example, Fig. 3a, c) and its capacity for bloom formation under different physicochemical parameters are puzzling. To investigate the potential influence of genome variation in this ecological dynamic, three E. huxleyi isolates (92A, EH2 and Van556) from different oceanic regions were deeply sequenced (265–352-fold coverage) (Fig. 3a, c, Supplementary Tables 5–7 and Supplementary Information 2.6). Two approaches were used to compare genomes. First, sequence reads were assembled and contigs aligned to the CCMP1516 reference genome using Standard Nucleotide BLAST (BLASTn; Supplementary Information 2.6.1). Although these isolates show >98% 18S ribosomal RNA (rRNA) identity, only 54–77% of their contigs showed similarity to CCMP1516. Of the remaining contigs, 71 Mb were shared between at least two of the deeply sequenced strains, whereas 8–40 Mb appeared to be isolate specific, as did 27 Mb of CCMP1516. Flow cytometric genome-size estimates also showed heterogeneity across isolates, with haploid genome sizes ranging from 99 to 133 Mb (Supplementary Information 2.5, 2.6.1 and Supplementary Table 5). These findings indicated considerable intraspecific variation.

Figure 3: Predicted proteome comparisons and concatenated phylogeny of E. huxleyi strains.
a, Isolation locations shown over the averaged Reynolds monthly sea-surface temperature (SST) climatology (1985–2007). b, tBLASTn homology search results using predicted CCMP1516 proteins against assemblies from other strains. Bars are coloured according to the number of gene products and nucleotide per cent identity. c, Best Bayesian topology, where node values indicate posterior probability/maximum-likelihood bootstrap support. Haploid genome sizes (in Mb) are provided in brackets (with ND indicating not determined), and shaded boxes denote robust clades of geographically dispersed strains. The variable distribution of nitrite reductase (NirS) and plastocyanin (PetE) is shown.

To examine potential variations in gene content further, sequence reads were directly mapped to the CCMP1516 genome. Of the 30,569 predicted genes in CCMP1516, between 1,373 and 2,012 different genes were not found in 92A, Van556 and EH2 (cumulatively 5,218, or 17% of CCMP1516 genes), and 364 appeared to be missing from all three. These findings cannot be explained by poor coverage or sequencing bias alone. Of 458 highly conserved eukaryotic genes from the CEGMA set5, 95–97% were identified in the isolates, indicating nearly complete genome sequences (Supplementary Data 7). Together, de novo assemblies and direct mapping to CCMP1516 indicate that the pan genome of E. huxleyi represents a rapidly changing repository of genetic information with genomic fluidity estimated to be ≥10%15 (on the basis of CCMP1516 gene content).
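
The fluidity estimate above can be illustrated with a minimal sketch. It assumes the commonly used pairwise definition of genomic fluidity (for each pair of genomes, genes unique to either genome divided by the summed gene counts, averaged over all pairs); the exact procedure used for E. huxleyi is described in the Supplementary Information and may differ. The function name and the toy gene identifiers are hypothetical and are not E. huxleyi data.

```python
from itertools import combinations


def genomic_fluidity(gene_sets):
    """Average, over all genome pairs, of (genes unique to either genome)
    divided by (sum of the two gene counts). Assumed definition."""
    pairs = list(combinations(gene_sets, 2))
    if not pairs:
        raise ValueError("need at least two genomes")
    total = 0.0
    for a, b in pairs:
        unique = len(a - b) + len(b - a)  # genes found in only one of the pair
        total += unique / (len(a) + len(b))
    return total / len(pairs)


# Toy gene sets with hypothetical identifiers (illustration only).
toy_strains = {
    "ref": {"g1", "g2", "g3", "g4", "g5"},
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3", "g5", "g6"},
}
toy_fluidity = genomic_fluidity(list(toy_strains.values()))

# Arithmetic from the text: 5,218 of 30,569 reference genes were missing
# from at least one of the three deeply sequenced isolates.
percent_missing = 100 * 5218 / 30569
```

With the counts reported above, `percent_missing` rounds to the 17% quoted in the text; `toy_fluidity` only demonstrates the calculation and says nothing about the real strains.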

E. huxleyi isolate differences were assessed further by Illumina sequencing of ten additional strains. Although sequenced at lower coverage, these strains were estimated to be 91–95% complete (Supplementary Tables 6, 7 and Supplementary Data 7). Direct mapping of reads from the 13 strains to CCMP1516 revealed a ‘core genome’ containing about two-thirds of the genes predicted in the reference genome (Supplementary Information 2.6.2 and Supplementary Data 5), a core independently confirmed by comparative DNA microarrays (Supplementary Information 2.7, Supplementary Data 6 and Supplementary Fig. 2). Nearly 25% of CCMP1516 genes were not found in at least three other strains, indicating that E. huxleyi represents a species complex with a genetic repertoire much greater than that of any one strain (Supplementary Figs 3, 4). Although the most extensive gene-sequence divergence was observed between CCMP1516 and deeply sequenced isolates Van556, 92A and EH2, concatenated phylogenies define three well-supported clades that are not necessarily reflective of geographic distributions (Fig. 3b, c and Supplementary Information 2.6.1, 2.8).
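
The core/variable distinction can be sketched as a simple set operation on per-strain gene detection calls. This is a minimal illustration only: it assumes each strain has already been reduced to the set of reference genes detected by read mapping, and the function name, the `min_missing` parameter and the toy identifiers are hypothetical rather than taken from the study's pipeline.

```python
def partition_genes(reference_genes, detected_by_strain, min_missing=3):
    """Return (core, often_missing).

    core: reference genes detected in every strain.
    often_missing: reference genes absent from at least `min_missing` strains.
    """
    core = set(reference_genes)
    for detected in detected_by_strain.values():
        core &= detected  # keep only genes seen in this strain as well
    often_missing = set()
    for gene in reference_genes:
        n_missing = sum(gene not in d for d in detected_by_strain.values())
        if n_missing >= min_missing:
            often_missing.add(gene)
    return core, often_missing


# Toy presence/absence calls with hypothetical identifiers (illustration only).
toy_reference = {"g1", "g2", "g3", "g4", "g5", "g6"}
toy_detected = {
    "s1": {"g1", "g2", "g3", "g4"},
    "s2": {"g1", "g2", "g3", "g5"},
    "s3": {"g1", "g2", "g4"},
    "s4": {"g1", "g2", "g3"},
}
toy_core, toy_often_missing = partition_genes(toy_reference, toy_detected)
```

In the toy data, two genes are found in every strain and two are absent from three or more strains; the real analysis additionally has to decide when a gene counts as detected, which this sketch leaves out.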

We searched the CCMP1516 genome for evidence of molecular mechanisms contributing to genome plasticity. There was limited evidence for horizontal gene transfers (Supplementary Information 2.9 and Supplementary Table 8), and although diverse, the complement of transposable elements was also small (Fig. 2 and Supplementary Information 2.10.2). However, E. huxleyi has a high density of unclassified repeats (31%) and tandem repeats/low-complexity regions (34%) with tandem-repeat/low-complexity density highest in introns (Fig. 2, Supplementary Information 2.10.1 and Supplementary Table 9). Most protein-coding genes contain multiple introns, often with noncanonical GC donor sites (Supplementary Fig. 5). The preference for 10–11-base-pair repeats in introns and their strong strandedness (meaning that on the sense and antisense strand either the motif or its reverse complement is highly favoured) raises the possibility that intronic tandem repeats have a functional role in exon swapping (Supplementary Information 2.10.3–2.10.5 and Supplementary Table 9).
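
Strandedness of a repeat motif, as described above, can be quantified by comparing how often the motif and its reverse complement occur on the sense strand of introns. The following is a minimal sketch under that assumption; the function names, the 10-bp motif and the toy intron sequences are hypothetical, and the analysis in the Supplementary Information may use a different statistic.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def strand_bias(sense_introns, motif):
    """Fraction of motif hits (motif plus its reverse complement) that are
    the motif itself; 0.5 means no strand preference. None if no hits."""
    rc = reverse_complement(motif)
    fwd = sum(count_overlapping(s, motif) for s in sense_introns)
    rev = sum(count_overlapping(s, rc) for s in sense_introns)
    total = fwd + rev
    return None if total == 0 else fwd / total


# Toy sense-strand intron sequences (hypothetical, illustration only).
toy_motif = "ACGGTCAGGT"
toy_introns = [
    "TT" + toy_motif * 3 + "AA",
    "GG" + reverse_complement(toy_motif) + "CC",
]
toy_bias = strand_bias(toy_introns, toy_motif)
```

A value far from 0.5 across many introns would indicate the kind of strong strandedness reported here; for a motif that equals its own reverse complement the statistic is uninformative.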

E. huxleyi blooms under many different oceanographic regimes. We explored how the core genome and variable components in different ecotypes might influence success (Supplementary Information 2.11 and Supplementary Fig. 6). The remarkable capacity of E. huxleyi to withstand photoinhibition16 lies in the core genome, which encodes a variety of photoreceptors; proteins that function in the assembly and repair of photosystem II, such as D1-specific proteases and FtsH enzymes; and proteins that have a role in non-photochemical quenching (NPQ) or synthesis of NPQ compounds (Supplementary Table 10). Genes encoding reactive oxygen species (ROS) scavenging antioxidants, enzymes for synthesis of vitamin B6 constituents used during photo-oxidative stress in plants17 (Supplementary Tables 10, 15) and many light-harvesting complex (LHC) proteins are also in the core. Of the 68 LHCs, 17 belong to LI818 or LHCZ classes with photoprotective capabilities18 (Supplementary Table 11 and Supplementary Information 3.1). The complex repertoire of photoprotectors facilitates tolerance to high light by minimizing ROS accumulation and preventing oxidative damage.

Phosphorus and nitrogen are key determinants of oceanic primary production. A suite of core genes allows E. huxleyi to thrive in low-phosphorus conditions. This includes six inorganic phosphate transporters (Fig. 4), a high-efficiency alkaline phosphatase (Fig. 4)19, purple acid phosphatases and other enzymes used to hydrolyse and acquire organic phosphorus compounds20. Genes for the synthesis of betaine and sulpholipids used as replacements for cellular phospholipids21 are also present (Supplementary Table 12). Numbers of phosphate transporters and alkaline phosphatases (Fig. 4), however, vary considerably from strain to strain, supporting previous observations of differences in phosphorus uptake and hydrolysis kinetics22.

Figure 4: Distribution of genes in the variable genome reflecting niche specificity.
a, Key genes (gene numbers on axes) involved in nutrient acquisition and metabolism, including ammonium transporters (AMT), urea transporters (UT), nitrilase (NIT), phosphate transporters (PTA), alkaline phosphatase (PHOA), ferredoxin (FDX), flavodoxin (FldA) and nitrate reductase (NAR) (Supplementary Information 3.2). b, Genes encoding calcium EF hand (CaEF) proteins and others that bind metals such as copper, zinc and iron (Supplementary Information 3.2).

Genes for inorganic nitrogen uptake and assimilation (nitrate, nitrite and ammonium) and for acquisition and degradation of nitrogen-rich compounds (for example, urea) (Fig. 4 and Supplementary Table 13) are present in the core genome and may explain the broad range of nitrogen concentrations in which E. huxleyi blooms23. Although present in multiple copies, the number of genes encoding nitrite (4), nitrate (8) and urea (3) transporters was relatively small compared to ammonium transporters (20). This enrichment, and the varied distribution across strains (Fig. 4), may be indicative of strain-specific ammonium preference, or the need for tightly regulated transporters to mediate high-affinity ammonium/ammonia uptake while offering ammonium-toxicity protection. Surprisingly, core iron-containing (nirK) versus clade-restricted copper-containing (nirS) nitrite reductases were identified (Fig. 3), although iron is often more limiting than copper in oceanic environments.

E. huxleyi grows well in surface waters where iron levels are generally low (0.02–1 nM)24. The core genome indicates that iron is acquired using the natural resistance-associated macrophage protein (NRAMP) class of metal transporters, multicopper oxidases, surface-bound ferric reductases, and possibly, membrane-bound siderophores (Supplementary Data 8). Genes involved in mechanisms limiting iron requirements are also in the core, including manganese and copper/zinc superoxide dismutases, both zinc and iron alcohol dehydrogenases and rubredoxins, and copper- and haem- plastocyanins (PetE) and ascorbate oxidases. Selective recruitment of these enzymes as well as flavodoxin, a functional analogue of ferredoxin, may reduce iron demands25. E. huxleyi encodes many iron-binding proteins, 80 in the core and 30 linked to the variable genome (Fig. 4). Iron limitation is linked to reduced calcification and photosynthesis26, and our analysis suggests cellular demands and mechanisms to alleviate iron deprivation differ between strains and are probably important factors shaping E. huxleyi ecological dynamics.

The E. huxleyi pan genome encodes nearly 700 proteins whose structure and function depend on metal binding (Supplementary Data 8). Selenium is essential for growth27 and potentially incorporated into at least 49 proteins (20 gene families) present in nearly all strains (Supplementary Table 14). Zinc affects growth and nitrogen usage26, and is a cofactor of more than 400 proteins, many present in the variable genome (Fig. 4). Heterogeneity in zinc-binding proteins across strains may explain variations in zinc quotas between cultured isolates26,28.

In addition to metals, E. huxleyi relies on a range of vitamins. Genes for de novo synthesis of antioxidants such as pro-vitamin A, vitamins C, E, B6 and B9 and the ultraviolet-light-absorbing vitamin D are uniformly present across strains. E. huxleyi, however, is ostensibly unable to inhabit ocean regions where vitamins B1 and B12 are inaccessible. ThiC, a key B1 biosynthesis enzyme, was not found in the genome, and despite relying exclusively on a vitamin-B12-dependent methionine synthase, genes for a B12 transporter and several enzymes required for B12 synthesis are also absent (Supplementary Table 15).

E. huxleyi is the dominant bloom-forming coccolithophore and can be abundant in oligotrophic oceans, directly influencing global carbon cycling. Distributions in modern oceans and those dating back to the Pleistocene epoch demonstrate its tremendous capacity for adaptation. Until now, the underlying mechanisms for the physiological and morphological variations between isolates have been elusive. Evidence presented here indicates that this capacity can be explained, in part, by its pan genome, the first of its kind reported for what was thought to be a single microbial eukaryotic algal species. Variations in gene complements (Fig. 4) within this species complex may drive phenotypic variation, ecological dynamics and the physiological heterogeneity observed in past studies. The high level of diversity indicates that a single strain is unlikely to be typical, or representative, of all strains. Future sequencing of phytoplankton isolates will reveal whether this is a unique or a more common feature of microalgae. Together, the physiological capacity and genomic plasticity of E. huxleyi make it a powerful model for the study of speciation and adaptations to global climate change.

Methods Summary

The diploid genome of CCMP1516 (isolated from the Equatorial Pacific; 02.6667S 82.7167W) was Sanger sequenced and assembled using the Arachne assembler. Gene models were predicted and validated using computational tools, experimental data (including transcriptomics; Sanger and Illumina sequenced) and NimbleGen tiling array experiments. Thirteen additional strains were sequenced using Illumina and mapped to the reference genome. A detailed description of materials and methods is in Supplementary Information.