Phytophthora infestans is the most destructive pathogen of potato and a model organism for the oomycetes, a distinct lineage of fungus-like eukaryotes that are related to organisms such as brown algae and diatoms. As the agent of the Irish potato famine in the mid-nineteenth century, P. infestans has had a tremendous effect on human history, resulting in famine and population displacement1. To this day, it affects world agriculture by causing the most destructive disease of potato, the fourth largest food crop and a critical alternative to the major cereal crops for feeding the world’s population1. Current annual worldwide potato crop losses due to late blight are conservatively estimated at $6.7 billion2. Management of this devastating pathogen is challenged by its remarkable speed of adaptation to control strategies such as genetically resistant cultivars3,4. Here we report the sequence of the P. infestans genome, which at ∼240 megabases (Mb) is by far the largest and most complex genome sequenced so far in the chromalveolates. Its expansion results from a proliferation of repetitive DNA accounting for ∼74% of the genome. Comparison with two other Phytophthora genomes showed rapid turnover and extensive expansion of specific families of secreted disease effector proteins, including many genes that are induced during infection or are predicted to have activities that alter host physiology. These fast-evolving effector genes are localized to highly dynamic and expanded regions of the P. infestans genome. This probably plays a crucial part in the rapid adaptability of the pathogen to host plants and underpins its evolutionary potential.
The size of the P. infestans genome is estimated by optical map and other methods at 240 Mb (Supplementary Information). It is several-fold larger than those of the related Phytophthora species P. sojae (95 Mb) and P. ramorum (65 Mb), which cause soybean root rot and sudden oak death, respectively5,6. We sequenced the genome of P. infestans strain T30-4 using a whole-genome shotgun approach, and generated a ninefold coverage assembly spanning 229 Mb (Table 1 and Supplementary Information). The unassembled fraction of the genome consists of high copy repeat sequences (Supplementary Information). The assembled genome sequence provides near complete coverage of genes, with 98.2% of P. infestans T30-4 complementary DNAs aligning (Supplementary Information). We identified 17,797 protein-coding genes by ab initio gene prediction, protein and expressed sequence tag (EST) homology, and direct genome-to-genome comparative gene modelling with P. sojae and P. ramorum (Supplementary Information). Changes in gene content, number or length do not explain the marked difference in genome size (Table 1 and Supplementary Table 1). No evidence of whole-genome duplication or large-scale dispersed segmental duplication was detected. However, specific disease effector gene families are expanded in P. infestans (see later).
P. infestans, P. sojae and P. ramorum represent three major phylogenetic clades of Phytophthora6. Among the three genomes, we identified a core set of 8,492 orthologue clusters (including 9,583 P. infestans orthologues and close paralogues), of which 7,113 genes show 1:1:1 orthology relationships (Table 1, Supplementary Fig. 1 and Supplementary Table 2). The core proteome is enriched in genes involved in cellular processes including DNA replication, transcription and protein translation, whereas genes with functions involved in cellular defence mechanisms are underrepresented (Supplementary Fig. 2). Differences in gene family expansion, in particular dynamic repertoires of effector genes (see later), are probably responsible for different traits among Phytophthora species, such as altered host specificity.
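The distinction between core clusters and strict 1:1:1 orthologues can be illustrated with a minimal sketch. It assumes the clustering has already been done and is available as a mapping from a cluster identifier to (species, gene) pairs; the function name and data layout are hypothetical and do not reproduce the pipeline described in the Supplementary Information.

```python
from collections import Counter

SPECIES = {"P. infestans", "P. sojae", "P. ramorum"}

def classify_clusters(clusters):
    """Split orthologue clusters into core and strict 1:1:1 sets.

    `clusters` maps a cluster id to a list of (species, gene_id) pairs
    (a hypothetical layout chosen for illustration).
    A cluster is "core" if all three species are represented, and
    "1:1:1" if each species contributes exactly one gene.
    """
    core, one_to_one = [], []
    for cluster_id, members in clusters.items():
        counts = Counter(species for species, _ in members)
        if SPECIES <= set(counts):
            core.append(cluster_id)
            if all(counts[species] == 1 for species in SPECIES):
                one_to_one.append(cluster_id)
    return core, one_to_one
```

Under this definition a core cluster with a close paralogue in one species counts towards the core set but not towards the 1:1:1 set, which is why the two totals differ.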
Comparison of the three Phytophthora genomes reveals an unusual genome organization, comprising blocks of conserved gene order in which gene density is relatively high and repeat content is relatively low, separated by regions in which gene order is not conserved, gene density is low and repeat content is high (Table 1 and Fig. 1). The conserved blocks represent ∼90% of core orthologous groups in all three genomes, including ∼70% (12,440) of all P. infestans protein-coding genes and ∼78% of genes in both P. sojae (13,225) and P. ramorum (11,246). Within conserved blocks, genes are typically tightly spaced in all three genomes (Table 1 and Fig. 1), with median intergenic distances of 633 base pairs (bp) for P. ramorum, 804 bp for P. sojae, and 603 bp for P. infestans. In regions between conserved blocks, intergenic distances are greater and increase with increasing genome size (median 1.5 kb for P. ramorum, 2.2 kb for P. sojae, and 3.7 kb for P. infestans). The differences in spacing between genes among the three genomes, within and outside regions of conserved gene order, are evident in Fig. 2a–f. The expansion of regions between conserved blocks results from increased density of repetitive elements (Supplementary Fig. 3), and overall differences in genome size among the three species are largely explained by proliferation of repeats in regions in which gene order is not conserved. This difference between conserved blocks and non-conserved regions is particularly apparent in the greatly expanded P. infestans genome (Fig. 2d, f). Further, it is evident that rapidly evolving secreted effector genes (see later) lie predominantly in the gene-sparse regions (Fig. 2g, h). This dual pattern of intergenic spacing and repeat content has been suggested for large, unsequenced genomes in the Poaceae such as maize7,8,9, but it is not seen in the genomes of other sequenced eukaryotes (Supplementary Fig. 4).
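As an illustration of how median intergenic distances such as those above can be derived from gene coordinates, the following is a minimal sketch. It assumes genes are given as (scaffold, start, end) tuples with 1-based inclusive coordinates and ignores strand; the function name and input layout are assumptions made for this example, not the method used in the paper.

```python
from statistics import median

def intergenic_distances(genes):
    """Return gaps (in bp) between consecutive genes on each scaffold.

    `genes` is a list of (scaffold, start, end) tuples with 1-based,
    inclusive coordinates (an assumed layout). Overlapping or nested
    genes produce no gap; directly adjacent genes produce a gap of 0.
    """
    by_scaffold = {}
    for scaffold, start, end in genes:
        by_scaffold.setdefault(scaffold, []).append((start, end))

    distances = []
    for intervals in by_scaffold.values():
        intervals.sort()
        prev_end = intervals[0][1]
        for start, end in intervals[1:]:
            gap = start - prev_end - 1
            if gap >= 0:
                distances.append(gap)
            # Track the furthest end seen so nested genes do not shrink it.
            prev_end = max(prev_end, end)
    return distances

def median_intergenic_distance(genes):
    """Median of all intergenic gaps; requires at least one gap."""
    return median(intergenic_distances(genes))
```

Computing this separately for genes inside and outside blocks of conserved gene order would give the two sets of medians compared in the text.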
Recent proliferation of Gypsy elements in P. infestans underlies the genome expansion. Approximately one-third of the genome assembly corresponds to families of Gypsy elements (Supplementary Fig. 5). The two families with the highest relative expansion in P. infestans are Gypsy Pi-1 and a new Gypsy long terminal repeat (LTR) element we named ‘Albatross’, which together account for at least 29% of the genome (Supplementary Table 3). Albatross elements cover ∼32 Mb and are enriched (>2-fold) in the regions in which gene order is not conserved (Supplementary Table 4 and Supplementary Fig. 6), contributing appreciably to relative expansion of gene-sparse regions (Supplementary Fig. 3). Gypsy Pi-1 elements cover ∼22 Mb and, in contrast to Albatross elements, are relatively evenly distributed across the genome.
Overall, the P. infestans genome contains a strikingly rich and diverse population of transposons (Supplementary Table 3). We identified 273 full-length elements belonging to two large classes of autonomous rolling-circle type helitron DNA transposons (7.3-kb and 6.4-kb elements), in much larger numbers than described in any other genome (Supplementary Tables 3 and 5). Most helitron open reading frames (ORFs) are degenerate pseudogenes, but 13 are intact and presumed functional. Some apparently non-autonomous helitrons have intact termini so their transposition may be driven by gene products from the functional classes. In contrast, the P. sojae and P. ramorum genomes contain no intact helitron elements. The P. infestans genome carries increased numbers of mobile elements across diverse families as compared to P. sojae and P. ramorum, with ∼5 times as many LTR retrotransposons and ∼10 times as many helitrons (Supplementary Fig. 7).
Consistent with a model of repeat-driven expansion of the P. infestans genome, the vast majority of repeat elements in the genome are highly similar to their consensus sequences, indicating a high rate of recent transposon activity (Supplementary Fig. 8). In addition, we have observed and experimentally confirmed examples of recently active elements (Supplementary Figs 9–11).
Phytophthora species, like many pathogens, secrete effector proteins that alter host physiology and facilitate colonization. The genome of P. infestans revealed large complex families of effector genes encoding secreted proteins that are implicated in pathogenesis10. These fall into two broad categories: apoplastic effectors that accumulate in the plant intercellular space (apoplast) and cytoplasmic effectors that are translocated directly into the plant cell by a specialized infection structure called the haustorium11. Apoplastic effectors include secreted hydrolytic enzymes such as proteases, lipases and glycosylases that probably degrade plant tissue; enzyme inhibitors to protect against host defence enzymes; and necrotizing toxins such as the Nep1-like proteins (NLPs) and PcF-like small cysteine-rich proteins (SCRs) (Supplementary Table 6).
As in the other Phytophthora species5, candidate effector genes are numerous and typically expanded compared to non-pathogenic relatives (Supplementary Table 6). Most notable among these are the RXLR and Crinkler (CRN) cytoplasmic effectors, described later.
The archetypal oomycete cytoplasmic effectors are the secreted and host-translocated RXLR proteins12. All oomycete avirulence genes (encoding products recognized by plant hosts and resulting in host immunity) discovered so far encode RXLR effectors, modular secreted proteins containing the amino-terminal motif Arg-X-Leu-Arg (in which X represents any amino acid) that defines a domain required for delivery inside plant cells11, followed by diverse, rapidly evolving carboxy-terminal effector domains13,14. Several of these C termini have been shown to exhibit virulence activities as host cell death suppressors15,16. We exploited the known motifs and other conserved sequence features to predict 563 RXLR genes in the P. infestans genome (Supplementary Tables 6, 7 and Supplementary Information). RXLR genes are notably expanded in P. infestans, with ∼60% more predicted than in P. sojae and P. ramorum (Supplementary Tables 6 and 7). We observed that 70 of these are rapidly diversifying (Supplementary Table 8). Approximately half of P. infestans RXLRs are lineage-specific, largely accounting for the expanded repertoire (Supplementary Figs 12 and 13). In contrast to the core proteome, RXLR genes show evidence of high rates of turnover with only 16 of the 563 genes with 1:1:1 orthology relationships (Supplementary Table 2) and many (88) putative RXLR pseudogenes (Supplementary Table 9). This high turnover in Phytophthora is probably driven by arms-race co-evolution with host plants5,13,14,17.
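The motif-based part of such a prediction can be sketched as a scan for Arg-X-Leu-Arg near the N terminus of each predicted protein. This is a simplified illustration only: the window boundaries (residues 30 to 60, 0-based) and the function name are assumptions chosen for the example, signal-peptide prediction is omitted, and the additional conserved sequence features used in the actual analysis (Supplementary Information) are not modelled.

```python
def find_rxlr(protein, window_start=30, window_end=60):
    """Return the 0-based start of the first RXLR motif in the window.

    The motif is Arg-X-Leu-Arg, where X is any residue. Only motifs
    starting at positions window_start <= i < window_end are reported.
    The default window is an illustrative assumption, not the
    criterion used in the paper. Returns None if no motif is found.
    """
    seq = protein.upper()
    last_start = min(window_end, len(seq) - 3)
    for i in range(window_start, last_start):
        if seq[i] == "R" and seq[i + 2] == "L" and seq[i + 3] == "R":
            return i
    return None
```

A position-by-position scan is used rather than a non-overlapping regular-expression search so that overlapping candidate motifs are not skipped.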
RXLR effectors show extensive sequence diversity. Markov clustering (TribeMCL18) yields one large family (P. infestans: 85, P. ramorum: 75, P. sojae: 53) and 150 smaller families (Supplementary Fig. 14). The largest family shares a repetitive C-terminal domain structure (Supplementary Figs 15 and 16). Most families have distinct sequence homologies (Supplementary Fig. 14) and patterns of shared domains (Supplementary Fig. 17) with greater diversity than expected if all RXLR effectors were monophyletic.
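The principle behind Markov clustering can be shown with a small pure-Python sketch that alternates expansion (matrix squaring) and inflation (element-wise powering followed by column normalization) on a protein similarity matrix. This is not the TribeMCL implementation: the function names, the inflation value, the fixed iteration count and the threshold used to read out clusters are all assumptions made for illustration.

```python
def normalise_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]

def multiply(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Cluster items from a symmetric, non-negative similarity matrix.

    Self-loops are added before normalization. After the iterations,
    every row that still carries weight is read as one cluster.
    """
    n = len(similarity)
    matrix = [[float(similarity[i][j]) + (1.0 if i == j else 0.0)
               for j in range(n)] for i in range(n)]
    matrix = normalise_columns(matrix)
    for _ in range(iterations):
        matrix = multiply(matrix, matrix)                      # expansion
        matrix = normalise_columns(
            [[value ** inflation for value in row] for row in matrix]
        )                                                      # inflation
    clusters = {frozenset(j for j, value in enumerate(row) if value > 1e-6)
                for row in matrix}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)
```

In practice the similarity matrix would be derived from pairwise sequence comparisons, and a sparse-matrix implementation would be needed for families of the size discussed here.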
In contrast to the core proteome, RXLR effector genes typically occupy a genomic environment that is gene-sparse and repeat-rich (Fig. 2g and Supplementary Figs 18 and 19). The mobile elements contributing to the dynamic nature of these repetitive regions may enable recombination events resulting in the higher rates of gene gain and gene loss observed for these effectors.
CRN cytoplasmic effectors were originally identified from P. infestans transcripts encoding putative secreted peptides that elicit necrosis in planta, a characteristic of plant innate immunity19. Since their discovery, little had been learned about the CRN effector family. Analysis of the P. infestans genome sequence revealed an enormous family of 196 CRN genes of unexpected complexity and diversity (Supplementary Table 10), which is heavily expanded in P. infestans relative to P. sojae (100 CRNs) and P. ramorum (19 CRNs) (Supplementary Table 6). Like RXLRs, CRNs are modular proteins. CRNs are defined by a highly conserved N-terminal ∼50-amino-acid LFLAK domain (Supplementary Fig. 20) and an adjacent diversified DWL domain (Fig. 3a, b). Most (60%) possess a predicted signal peptide. Those lacking predicted signal peptides are typically found in CRN families containing members with secretion signals (Supplementary Table 10). CRN C-terminal regions exhibit a wide variety of domain structures, with 36 conserved domains and a further eight unique C termini identified among the 315 Phytophthora CRN proteins (Supplementary Table 11). We observed evidence of recombination between different clades as a mechanism driving CRN diversity (Supplementary Figs 21–23).
We explored the ability of diverse CRNs to perturb host cellular processes. In assays for necrosis in planta (Supplementary Information), deletion mutants of the previously described CRN2 secreted protein19 defined a C-terminal 234 amino-acid region (positions 173–407, domain DXZ) that is sufficient to induce cell death when expressed inside plant cells (Supplementary Fig. 24). Assays with representative P. infestans CRN genes identified four other distinct C termini that also trigger cell death inside plant cells (Fig. 3c). These include the newly defined DC domain (P. infestans: 18 genes and 49 pseudogenes (ψ)) and the D2 (14 and 43ψ) and DBF (2 and 1ψ) domains, which have similarity to protein kinases (Supplementary Table 11). These results indicate that the CRN protein domains expressed in planta are retained (lacking signal peptides and hence not secreted) by the plant cell and stimulate cell death by an intracellular mechanism, supporting the view that CRNs, like RXLRs, are cytoplasmic effectors. We propose that the conserved CRN N-terminal LFLAK domain may function similarly to the RXLR motif for delivery of CRN effectors into plant cells, and experiments to test this hypothesis are under way.
A further 255 CRN genes are fragmented or otherwise disrupted and presumably non-functional (Supplementary Table 10). CRN genes and pseudogenes are aggregated in large clusters at several genomic loci, typically clustered by domain type (Supplementary Fig. 25). One extraordinary example is scaffold 1.48 (∼1.2 Mb), containing 21 CRN genes and 31 CRN pseudogenes of the DXZ and D2 necrosis-inducing domain types (Fig. 3d). Many of the pseudogenes show only a few base changes, indicating recent conversion to pseudogenes. This high degree of expansion and pseudogene formation suggests that, like RXLR effector genes, CRN genes have undergone relatively rapid birth and death evolution.
Both CRN and RXLR genes typically occur in repeat-rich, gene-sparse regions of the genome, where conserved gene order with P. sojae and P. ramorum is either absent or disrupted (Fig. 2g, h and Supplementary Fig. 19). Expansion of large RXLR and CRN effector gene families seems to have been driven by non-allelic homologous recombination and tandem gene duplication. Although the genome is heavily populated by mobile elements, no direct evidence of transposition of effector genes was observed. Instead, the repeat-rich regions of effector clusters probably facilitate expansion based on non-allelic homologous recombination. In one intriguing case, nearly identical tandem arrays of CRNs are present on scaffold 1.6 in a perfect head-to-tail arrangement that is similar to that observed for some helitrons (Supplementary Fig. 26). This region of the genome is heavily enriched for helitron elements, implicating helitron-based rolling-circle replication as a possible mechanism for establishing this CRN cluster.
To explore transcriptional responses to plant infection, we constructed a NimbleGen microarray based on the genome annotation. P. infestans gene expression during potato infection was monitored using samples from infected potato at 2–5 days post-inoculation (d.p.i.). In all, 494 genes were induced at least twofold during infection relative to mycelial growth. Days 2–4 of infection correlate with formation of infectious structures called haustoria. Mycelial necrotrophic growth on dead plant material occurs later at 5 d.p.i., and shows a similar expression profile to mycelial growth in plant extract media (Supplementary Fig. 27a and Supplementary Table 12). Seventy-nine RXLR genes exhibited this pattern of expression, including previously studied avirulence genes Avr3a (ref. 20), Avr4 (ref. 21), and Avr-blb1 (also known as ipiO) (ref. 22) (Supplementary Fig. 27b). Apoplastic effector genes, including protease inhibitors, cysteine-rich secreted proteins, and NPP1-family members, were among the most highly upregulated genes during infection of potato. Few CRNs were induced during infection; however, most CRNs were very highly expressed, with ∼50% of CRNs within the top 10% of gene expression intensities (Supplementary Fig. 28). Several genes encoding metabolic enzymes were upregulated in planta (Supplementary Table 12), suggesting considerable metabolic adaptation of the pathogen to the host environment23. A related pattern of downregulation mirrors the induction of effectors, involving ∼115 genes (Supplementary Table 12). Among those repressed were elicitin-like genes and pseudogenes, suggesting that reduced expression during infection or mutation to pseudogene could contribute to evading activation of host innate immunity24.
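The twofold induction criterion can be illustrated with a minimal sketch that compares a normalized infection intensity with a mycelial intensity for each gene. The input layout and function name are hypothetical, and the sketch omits the normalization, replicate handling and statistical testing that a real microarray analysis requires (the methods actually used are described in the Supplementary Information).

```python
import math

def induced_genes(expression, min_fold=2.0):
    """Return genes induced at least `min_fold` during infection.

    `expression` maps a gene id to a pair
    (mycelium_intensity, infection_intensity) of normalized,
    linear-scale values (an assumed layout). Genes with a
    non-positive intensity are skipped because the ratio is undefined.
    The result maps each induced gene to its log2 fold change.
    """
    threshold = math.log2(min_fold)
    induced = {}
    for gene, (mycelium, infection) in expression.items():
        if mycelium <= 0 or infection <= 0:
            continue
        log2_ratio = math.log2(infection / mycelium)
        if log2_ratio >= threshold:
            induced[gene] = log2_ratio
    return induced
```

Applying the same function with the two intensities swapped would select genes repressed at least twofold during infection.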
P. infestans remains a critical threat to world food security, and the genome sequence is a key tool to understanding its pathogenic success. The sequence of the P. infestans genome showed an extremely high repeat content (∼74%) and unusual discontinuous distribution of gene density that correlate intriguingly with its biology. Gene-dense regions with conserved gene order across Phytophthora species are interrupted by repeat-rich expanded regions that are sparsely populated with genes, many of which are fast-evolving pathogenicity effectors such as the RXLR and CRN families. The localization of the effectors to dynamic regions of the genome probably both enables the rapid evolutionary changes and accounts for the considerable expansion in CRN and RXLR effector genes observed in P. infestans. This expansion provides a species-specific repertoire of effector genes, the dynamic nature of which probably provides an advantage in the arms race with host species. We postulate that these dynamic regions promote the evolutionary plasticity of effector genes, generating the enhanced genetic variation required to drive the rapid evasion of plant resistance that is a hallmark of the potato late blight pathogen.
Genomic sequence and gene annotations
The updated P. infestans genome sequence and annotation can be accessed through GenBank accession number AATU01000000, and are available through the Broad Institute website at http://www.broad.mit.edu/annotation/genome/phytophthora_infestans. All genome sequence reads have been deposited in the NCBI trace repository (http://www.ncbi.nlm.nih.gov/Traces/home/). Paired reads of P. infestans cDNAs are available in dbEST with accessions in the range GR284383–GR301386. The NimbleGen microarray data are available in GEO under accession number GSE14480. Full methods description and associated references are provided as Supplementary Information.
We thank L. Gaffney for help with figures and tables, E. Blanco and R. Guigo for training the GeneID gene prediction software, J. Crabtree for providing a Sybil (http://sybil.sf.net) software component used to render genome comparison illustrations, the Broad Institute Genome Sequencing Platform for sequence data generation, and C. Cuomo and D. Neafsey for comments on the manuscript. The project was supported by the National Research Initiative of the USDA Cooperative State Research, Education and Extension Service (grant numbers 2004-35600-15024 and 2006-35600-16623), the National Science Foundation (grants EF-0333274 and EF-0523670) and the Gatsby Charitable Foundation.
Author Contributions
B.J.H., S.K., M.C.Z. and C.N. coordinated genome annotation, data analyses and manuscript preparation. B.J.H. and S.K. made equivalent contributions and should be considered joint first authors (listed by alphabetical order). R.H.Y.J., R.E.H., L.M.C., M.G., C.D.K., S.R., T.T.-A., T.O.B. and K.O. made major contributions to genome sequencing, assembly, analyses and production of complementary data and resources. All other authors are members of the genome sequencing consortium and contributed annotation, analyses or data throughout the project.
This table contains P. infestans reference transposable elements.
This table contains annotation of RXLR effectors in Phytophthora.
This table contains fast evolving RXLR effector families.
This table contains candidate RXLR pseudogenes identified in the P. infestans genome.
This table contains Crinkler family gene annotations for Phytophthora species.
This table contains Crinkler protein domains.
This table contains genes in P. infestans found differentially expressed during plant infection.
This table contains descriptions of the libraries used for sequencing of the Phytophthora infestans genome.
This table contains sequences and phenotypes for Crinkler genes assayed for the necrosis phenotype.