## Introduction

Phytophthora infestans was responsible for the Irish potato famine, which led to the death and emigration of over two million Irish people, and subsequent outbreaks of potato late blight that devastated European harvests throughout the latter half of the 19th century1,2. Although this catastrophe initiated efforts to breed more resistant potatoes3, P. infestans still remains a threat to potato and tomato production worldwide4. Global damage and control costs exceed US$6.2 billion per year, with over$1 billion spent on fungicides alone5.

Migration mediated by human activity has had an important role in the historic and recent spread of potato late blight from the New World6,7. It was initially proposed that the introduction of a single genotype of P. infestans (RFLP genotype US-1, mtDNA haplotype Ib) caused the 19th-century epidemics and that this clonal lineage dominated extra-Mexican populations until recently8. However, genetic analyses exploiting infected potato leaves and tubers from 19th century herbaria specimens indicated that the famine-era disease in the United States, Ireland and Europe was caused by the Ia mitochondrial lineage of P. infestans and suggested that the Ib lineage was introduced later, in the mid-20th century9,10. In the latter half of the 20th century, this Ib lineage was rapidly displaced in both Europe and the United States11,12,13 by more virulent lineages that possessed fitness advantages including fungicide resistance and increased aggressiveness7,12,14,15.

P. infestans is a hallmark example of an organism with rapid evolutionary potential16,17,18. This adaptability is fostered by its arsenal of secreted RXLR-motif effector proteins that facilitate infection by manipulating host physiology after translocation into host cells. Among these RXLR effectors are avirulence (avr) proteins that are recognized by corresponding resistance (R) proteins in the host, initiating the hypersensitive defense response19. R genes derived from wild potato (Solanum demissum) were deployed widely in breeding programs in the early 20th century20, but pathogen populations quickly overcame them21,22,23,24. Consequently, such potatoes are rarely planted today.

To provide a baseline for understanding the nature of temporal changes in the P. infestans genome, we generated genomic reads by shotgun sequencing DNA extracts of blight-infected potato leaves from historic herbarium specimens collected during the initial 1845 outbreaks of disease near Audenarde, Belgium1 and an additional sample from Britain25, and later samples from Germany, Denmark and Sweden (1876–1889). We also sequenced three modern isolates (RFLP genotypes US-8, US-22 and US-23) currently causing disease in North America13. These data are compared with the published reference genome sequence of the European strain T30-4 (ref. 26) and the European ‘blue 13’ strain 06_3928 (13_A2)12. To our knowledge, this work represents the first genome-level analysis of a eukaryote pathogen from a historical collection. We observe distinct nuclear genotypes in historical samples that imply multiple introductions to Europe before the 20th century. We also report a number of genes that were likely absent in the historical period, many of which are effectors related to infection of the modern S. tuberosum host. Specifically, at loci associated with virulence, a number of diploid genotypes present in all modern isolates are entirely absent from the historical gene pool. At the well-known effector gene Avr3a, only the avirulent AVR3aKI allele is detected in historical strains, thereby elucidating at least one mechanism by which the aggressive lineages of the 20th century completely replaced historical genotypes introduced just before the famine.

## Results

### High-throughput sequencing of samples

Three modern US isolates RS2009P1 (US-8), IN2009T1 (US-22) and BL2009P4 (US-23) were each independently sequenced to a high depth of at least 28-fold mean coverage of the ~240 Mbp P. infestans strain T30-4 reference genome (Supplementary Table S1, Supplementary Table S2). In addition, the 1845 sample from Belgium and the 1889 sample from Germany were sequenced, respectively, to 16-fold and 22-fold mean coverage. The special nature of these specimens—a pathogen embedded within poorly preserved tissues of its host—presents unique technical challenges10, not least of which are trace amounts of endogenous target DNA, post-mortem DNA damage and ambiguous mapping of short shotgun sequence reads to repetitive and G/C-rich regions of the genome (Supplementary Figs S1–S5). Consistent with our expectations of extensive fragmentation of the DNA extracted from dried leaves collected 122–167 years ago, sequence lengths of DNA inserts mapped to the P. infestans T30-4 reference genome range between 52–79 bp in historical samples Pi1845A (52 bp), Pi1845B (59 bp), Pi1876 (79 bp), Pi1882 (72 bp) and Pi1889 (75 bp; Supplementary Fig. S2). After normalizing each library by the number of reads mapped to the P. infestans genome, the overall mean insert size for historical samples is 67 bp.

### Phylogenetic analysis of nuclear genomes

Maximum-likelihood phylogenetic analysis of the uniquely mappable components of these genomes indicates that the historical European samples form a highly supported monophyletic clade that is distinct from modern isolates (Fig. 1, Supplementary Methods). The resequenced strain T30-4 sample occupies an intermediate position between the other modern samples and the historical European samples. Strain 06_3928A (13_A2) is most closely related to strain BL2009P4 (US-23) collected in Pennsylvania, USA. Samples from the modern and historical time periods differ by an estimated 120,359 single-nucleotide polymorphisms (SNPs) within nuclear gene-encoding regions (Supplementary Data 1). Of these SNPs, 24,040 are unique to the historical samples Pi1845A or Pi1889, and 12,893 SNP genotypes are not shared between the two. In a multi-locus phylogenetic analysis, the historical samples clustered with diverse isolates from the United States, Mexico, Ecuador and Europe (Supplementary Fig. S6).

### Gene content analysis

An analysis of the presence of modern strain T30-4 reference genes in historical samples based on read mapping shows that 21 of the reference genes absent in Pi1889 are detected with partial coverage in at least one of the four older historical samples (Supplementary Table S3, Supplementary Figs S7–S11). They include six secreted RXLR effectors, an NPP1-like pseudogene and two proteins of unknown function that are among the most induced during infection of potato by strain T30-4.

The analysis of modern strain T30-4 reference genes in resequenced samples also shows that genes tend to be deleted from gene-sparse regions of the P. infestans genome, which are enriched for effectors (Fig. 2, Fig. 3, Supplementary Fig. S12). Indeed, many of the differentially absent RXLR and CRN effectors that we have identified here belong to clusters of homologous genes that are more likely to be similar or redundant in function (Fig. 2, Supplementary Fig. S13, Supplementary Fig. S14).

In historical samples, a considerable number of genes relevant to plant pathogenesis pathways are absent (Supplementary Table S3). Six genes absent in Pi1889 were demonstrated to be differentially expressed during infection of potato by reference strain T30-4 (refs 26, 27). Of the 26 genes absent in Pi1889 that have an annotated function, 23 are implicated in plant pathogenesis26. Results are similar for Pi1845A. RXLR effectors made up a larger portion of absent reference genes in Pi1845A and Pi1889 than in aggressive, introduced strains from present-day Europe and the United States, including the recently sequenced ‘Blue 13’ isolate 06_3928A (13_A2) from Great Britain (Supplementary Fig. S15).

### Assembly and prediction of novel RXLR effectors

To complement our catalogue of absent genes, we searched the resequenced genomic data from our high-coverage samples for novel, non-reference RXLR effector genes. For each sample we performed de novo assembly of reads unmapped to the T30-4 reference genome, and then searched the assembled contig sequences for novel coding sequences containing the RXLR motif (Supplementary Table S4). This search discovered five new RXLR genes among the modern US isolates, but did not reveal any in the historical samples Pi1845A or Pi1889 (Supplementary Table S5, Supplementary Table S6).

### Temporal differences at virulence loci

Among avr genes, deletion, pseudogenization, transcriptional silencing, copy number variation and diversification of amino-acid sequence have all been proposed as mechanisms for evading recognition by the host12,28. We surveyed the amino-acid sequences of genes with avr-related annotations to identify candidates for further investigation into virulence differences between historical and modern lineages. The virulent AVR3aE80/M103 allele of P. infestans RXLR effector Avr3a, which is ubiquitous in diverse global populations29 and is present in all modern isolates in our analysis, is absent from all historical samples with coverage of this locus (Table 1). Only the avirulent AVR3aK80/I103 allele is detected.

Directly parallel cases are also observed at two genes near and closely related to Avr3a (PITG_14368 and PITG_14374), where the historical samples shared homozygous alleles, while all modern strains possessed one or two alleles encoding an alternative amino-acid sequence. Similar phenomena, where all the historical samples shared a single amino-acid genotype that was absent in modern isolates, are observed in PITG_21388 (Avrblb1), PITG_06077 (Avr2 family) and PITG_05121 (Avr2 family)22. These results are summarized (Table 1), and amino-acid sequences are provided (Supplementary Table S7).

## Discussion

Our nuclear genome sequence analysis of famine-era and later historical P. infestans samples from Europe demonstrates that they formed a phylogenetic clade that does not cluster with any modern isolates yet sequenced. In addition, we show that these genotypes were closely related and differed substantially from isolates now present in both the United States and Europe. These results are consistent with the hypothesis4,8 of introduction from a New World source of limited genetic diversity. In contrast to the three modern strains from the United States, contemporary European isolates did not segregate into a single lineage, indicating that introductions into Europe after the first outbreak likely came from multiple genetic sources that were closely related7,8. Historical samples clustered with diverse isolates from Mexico, Ecuador and the United States in our multi-gene phylogenetic analysis of available global isolates, but further investigation into the genomes of historical and globally diverse samples is needed to draw conclusions about the geographic origin of the famine-era P. infestans introductions.

The large number of nuclear-coding region SNPs distinguishing the two high-coverage historical samples from each other suggest substantial evolutionary divergence of P. infestans genomes within Europe in the 19th century. The gene content analyses also provide evidence that genetically distinct late blight lineages were present in Western Europe by 1889, but definitive confirmation would require more complete recovery of the nucleotide sequences of these genes from other historical isolates. Tentatively, however, we argue that if only a single clonal lineage was introduced to Europe in 1845, this would be an illustrative example of rapid evolution, with <50 years of clonal reproduction30 through seasonal population bottlenecks and varied regional selective pressures leading to gene loss. Although mutation rates in oomycetes are unknown, this rapid pace of divergence within a single clonal lineage8,14 indicates that we cannot rule out an alternative hypothesis—that multiple closely related lineages were introduced to 19th-century Europe.

Genes from the modern reference genome assembly tend to be deleted from rapidly evolving, gene-sparse regions of the genomes of the resequenced samples we analysed. These regions are enriched for effectors and evolve more quickly than gene-dense regions26,28. Indeed, many of the differentially absent RXLR and CRN effectors that we have identified here belong to clusters of homologous genes that are more likely to be similar or redundant in function. There is likely to be less pressure to conserve all members of these expanded tribes, thus contributing to the plasticity of these and other effector gene families and their genomic surrounds.

We also show that historical populations had a repertoire of secreted proteins different from that observed now. Reference genes absent from the historical genomes were more likely to be RXLR effectors than genes absent from aggressive, introduced strains from present-day Europe and the United States. Genes and alleles absent in historical samples that are now recognized as important for infection of potato may relate to genomic changes that enabled the successive 20th century replacements of Europe’s historical lineages with new strains better suited to the changing selective conditions of European potato fields.

We detected only the avirulent AVR3aK80/I103 allele in historical samples. Potato gene R3a, which recognizes Avr3a during infection, was not introduced to potato breeding stocks until the 20th century21,24. Thus, our data support the previously suggested hypothesis that the avirulent allele gave rise to the virulent allele under positive selection30. Given our observations of similar phenomena at other putative avr loci, they may illustrate examples of how the P. infestans effector genes present in circulating lineages rapidly changed following the introduction of diverse R genes in efforts to breed more resistant potato cultivars. This would parallel known examples of genome evolution as a response to human actions in other eukaryote pathogens that employ effectors, like the human malaria parasite Plasmodium falciparum, whose proteome shares in common with P. infestans motifs for protein translocation.

This catalogue of genes that were historically absent or later influenced by natural selection provides direction for further research into the functional causes of the episodic replacements that have occurred in the global migrations of P. infestans. Our data indicate at least two distinct lineages of P. infestans in 19th-century Europe and thus emphasize the role of migration in shaping the population history of this plant pathogen. Future efforts to understand the migratory history of this destructive pathogen and to further characterize the number and source of lineages that were present historically in Europe would benefit from focusing on sequencing more early-outbreak samples to high depth, should they become available.

## Methods

### Selection of historical samples

Five historic herbarium specimens from two different collections were chosen for analysis (Supplementary Table S1). These dried potato leaves with lesions produced by P. infestans represent material from early disease outbreaks in the British Isles, Belgium, Denmark, Sweden and Germany, and date from 1845–1889 (Supplementary Fig. S1). The samples included the oldest known specimens from the epidemics of potato late blight that occurred in Europe9. From each individually wrapped potato specimen envelope, a small (~5 mm in diameter) piece of dried tissue from the lesion edge was placed in a sterile microfuge tube for later DNA extraction using a modified CTAB procedure10. Extreme care was taken to avoid physical contact between specimens. Work with the herbarium specimens was performed in the North Carolina State University Phytotron Containment Facility Laboratory, which was equipped with separate supplies, reagents and equipment having no history of research involving P. infestans.

Primers PINF and HERB1 were used to amplify and sequence a portion of the internal transcribed spacer region 2 of ribosomal DNA to confirm that the specimens were infected with P. infestans10. Sequence analysis of the amplified portions of the mtDNA genes using methods previously described identified the mitochondrial haplotype9,10.

All historical samples were confirmed to possess mtDNA haplotype Ia31 by examining the consensus sequence at these loci generated on the Illumina platform.

### Library preparation and sequencing of historical samples

All pre-PCR manipulations were performed in the dedicated ancient DNA laboratories at the Centre for GeoGenetics, University of Copenhagen. Extracted DNA from samples Pi1889, Pi1876 and Pi1882 were built into Illumina libraries with a NEBNext DNA Library Preparation for Illumina kit. Extracted DNA from samples Pi1845A and Pi1845B were built into Illumina libraries with a NEBNext DNA Library Preparation for Roche/454 kit. The manufacturers’ protocol was followed, with the exclusion of the fragmentation step and reduction of adapter concentration in the adapter ligation step.

After library construction, unique index sequences were added through PCR. Initial DNA amplification was carried out through 10 cycles of PCR in a total volume of 50 μl, containing 5 units Taq Gold polymerase, 1 × PCR Gold buffer, 0.5 mM MgCl2, 50 μg BSA, 1 mM each dNTP, 0.5 μM inPE1.0 forward primer, 10 nM inPE2.0 reverse primer, 0.5 μM unique index primer and 5–15 μl DNA template (sample-dependent). DNA was further amplified through an additional 10–15 cycles (sample-dependent) by using 5 μl first-round PCR product as the template in a second round of PCR carried out in a total volume of 50 μl and under same conditions as the first round except that the primer inPE2.0 was not included.

Before sequencing, DNA fragment size selection was carried out with 2% agarose gels for samples Pi1889, Pi1876 and Pi1882, and with a Caliper LabChip XT for samples Pi1845A and Pi1845B. All products were purified with Qiagen QIAquick gel extraction kit before sequencing.

An initial quality assessment was performed on sample Pi1889 through sequencing 75 bp Single Reads on an Illumina Genome Analyzer II. Subsequent sequence data on other samples were generated using an Illumina HiSeq 2000 using both Single Read 100 bp and Paired End 100 bp chemistries (Supplementary Table S2).

### Library preparation and sequencing of modern isolates

Mycelia from three modern US isolates RS2009P1 (US-8), IN2009T1 (US-22) and BL2009P4 (US-23) were grown in pea broth and DNA was similarly extracted with a CTAB procedure. Library construction for these strains was performed at the University of North Carolina High-Throughput Sequencing Facility using a TruSeq DNA library preparation kit (Illumina). Each library was barcoded with a unique index sequence. Library quality assessment was performed on the Experion (Biorad) automated electrophoresis system and fluorometric DNA quantification was estimated by Qubit (Invitrogen). All three libraries were pooled and sequenced in a multiplex Paired End 100 cycles configuration on a single lane of the HiSeq 2000 flowcell. Data demultiplexing was performed by Illumina pipeline version 1.7.0 (Supplementary Table S2).

### Read trimming and mapping

Genomic sequence reads from P. infestans strains 90128 and T30-4 (submissions SRS115105, SRP003617 and SRP000883), and P. mirabilis strain PIC99114 (accession SRP003618) were obtained from the Sequence Read Archive (sra.dnanexus.com). Genomic sequence reads from isolate 06_3928A (13_A2) were obtained directly from the authors12.

Adapter sequence and low quality bases were trimmed from the 3′ ends of DNA reads with the software AdapterRemoval version 3 (ref. 32) using a mismatch rate of 0.333. Paired reads overlapping by at least 11 bp were collapsed into a single read with a re-calibrated base quality. Reads shorter than 25 bp after adapter sequence removal were discarded.

Mapping to the complete P. infestans T30-4 reference genome was performed with Burrows-Wheeler Aligner (BWA) version 0.5.9 (ref. 33) with seeding deactivated and otherwise default settings. PCR duplicates were removed conservatively with the software Picard version 1.66’s MarkDuplicates function, which removes all but one of reads whose 5′ and 3′ ends map to the same coordinates in the reference genome. Read alignments to the reference genome were optimized with the RealignerTargetCreator and IndelRealigner tools included in the software Genome Analysis Toolkit (GATK) version 1.3 (ref. 34).

The reference genome assembly and annotation of P. infestans strain T30-4 was obtained from the Broad Institute’s P. infestans Database ( www.broad.mit.edu).

Reads unmapped to the P. infestans reference genome assembly were extracted from Binary Sequence/Alignment Map (BAM) files and then mapped to the complete S. tuberosum reference genome assembly version 3 DM sequence35, which was obtained from the Potato Genome Sequencing Consortium public data release (solanaceae.plantbiology.msu.edu).

### Analysis of metagenomic composition of unmapped reads

To investigate the metagenomic composition of the historical DNA libraries of samples Pi1845A and Pi1889, we randomly sampled 1 M unique reads from those not mapped to either of the P. infestans or S. tuberosum reference genomes. The software megaBLAST version 2.2.21 (ref. 36) was used to align these sequences against the contents of the NCBI non-redundant nucleotide sequence database with an E-value cutoff of 0.0001. The best hit in the database was used for each input read. The software MEGAN version 4.70.4 (ref. 37) was used to assign the best matches to phylogenetic groups. The portion of the two libraries that could be classified was similar, with ~80% of reads having no hits in the database (Supplementary Fig. S5). Of the reads with close BLAST hits, the majority matched Phytophthora sequences in Pi1845A and bacterial sequences in Pi1889.

### Analysis of post-mortem DNA damage

BAM files containing reads aligned to the P. infestans T30-4 reference genome were analysed with the software mapDamage version 0.3.6 (ref. 38) to confirm the existence of nucleotide misincorporations characteristic of the degradation in ancient DNA39,40,41,42 (Supplementary Fig. S3).

### Report of polymorphism in gene-coding sequences

For samples Pi1845A, Pi1889, RS2009P1 (US-8), IN2009T1 (US-22) and BL2009P4 (US-23), SNPs were tallied at all non-N reference genome positions annotated as coding sequence. SNPs were included in the final tally if they surpassed a Q20 variant quality score and the variant position was covered by at least six uniquely mapped reads with MAPQ score ≥25 after PCR duplicate removal. We report 178,734 non-reference SNP positions (Supplementary Data 1). We detected 82,415 of these SNPs in the two historical samples and 154,694 of these SNPs in the three modern samples.