Archival influenza virus genomes from Europe reveal genomic variability during the 1918 pandemic

The 1918 influenza pandemic was the deadliest respiratory pandemic of the 20th century and determined the genomic make-up of subsequent human influenza A viruses (IAV). Here, we analyze both the first 1918 IAV genomes from Europe and the first from samples prior to the autumn peak. 1918 IAV genomic diversity is consistent with a combination of local transmission and long-distance dispersal events. Comparison of genomes before and during the pandemic peak shows variation at two sites in the nucleoprotein gene associated with resistance to host antiviral response, pointing at a possible adaptation of 1918 IAV to humans. Finally, local molecular clock modeling suggests a pure pandemic descent of seasonal H1N1 IAV as an alternative to the hypothesis of origination through an intrasubtype reassortment.


Supplementary Note 1: Histopathology
We performed histopathological analyses of the MU-162, BE-572 and BE-576 lung specimens.
For conventional histopathology, slides of lung tissue were stained with hematoxylin and eosin.
A gram staining was also performed to identify gram-positive bacteria. Sample MU-162 was characterized by severe purulent, partially confluent bronchopneumonia with small hemorrhages and edema without clear histomorphological evidence of bacterial colonization ( Supplementary Figs. 1a and 1b). Sample BE-572 was characterized by severe purulent, partially hemorrhagic bronchopneumonia with alveolar edema and evidence of bacterial colonization by gram-positive cocci (Supplementary Figs. 1c and 1d). Sample BE-576 was characterized by severe pulmonary alveolar edema with pronounced hemorrhages, emphysema and first stages of acute bronchopneumonia with evidence of bacterial colonization by grampositive cocci (Supplementary Figs. 1e and 1f).

Supplementary Note 2: Evaluation of human RNA preservation
Fully formed RNA virus particles are built to protect the integrity of the viral genome under various conditions for days or even months 1 . This might also affect RNA recovery from formalin-fixed tissue, due to a better RNA protection prior to fixation in comparison to the endogenous RNA. Hence, we assessed the endogenous human RNA preservation by mapping all reads to a full human transcriptome reference (Supplementary Table 1). Strikingly, most of these fragments show a significantly lower average length than the fragments mapping to the IAV reference (Extended Data Fig. 3). The smaller fragments of human endogenous RNA compared to the viral RNA are a first indication of the protective function of the virion in RNA recovery from formalin-fixed wet specimens. The length of fragments mapping only to the transcriptome varied greatly between the samples and within independent extractions from the same sample. This might be due to varying time between patient death and formalin fixation, the exact formalin composition (e.g. buffered or unbuffered formalin and formalin concentration), different penetration of the fixative in different parts of the tissue, or remaining gDNA fragments which leads to RNA fragmentation during the RNase H treatment of the rRNA depletion step.

Supplementary Note 3: Genomic comparison of 1918 influenza viruses
Together with the available BM and CU sequences, we used these influenza genomes from Germany to assess the genomic diversity of: (i) strains simultaneously involved in local transmission (BE-572 and BE-576), (ii) strains circulating in Europe (MU-162, BE-572 and BE-576) and North America (BM and CU), (iii) strains circulating during the pre-pandemic (BE-572 and BE-576) and pandemic peak period (BM and CU).
The two Berlin genomes sampled on June 28 th 1918, for which only a portion of the genome could be compared, differed from each other at only two (non-synonymous) nucleotide positions in the HA gene (Fig. 1b). These positions were however polymorphic in BE-576, with the minor variant being identical to BE-572. At a country/continent scale, BE-572 and MU-162 on one hand and BM and CU on the other differed both by 22 and 15 single nucleotide polymorphisms (SNPs; 12 non-synonymous, all genes but HA, NS and MP and 7 nonsynonymous, all genes but NP and PB2, respectively, Supplementary Table 2  positions, 18 of which coding for amino acid changes (all genes but HA). When comparing the pre-pandemic European strain BE-572 with pandemic peak strains BM and CU, we identified 29 and 22 SNPs (11 non-synonymous, all genes but HA and NS genes, and 8 non-synonymous, all genes but NS genes, respectively).
We acknowledge that we could not fully disentangle these comparisons (the two Berlin genomes are not complete, and the two accurately dated genomes from the pre-pandemic peak period are from European specimens, while the two accurately dated genomes from the pandemic peak are from North America) and our sample size was very small. An aspect that stood out from this comparison is that we identified a 70 nt stretch of the PA of BM comprising 8 SNPs that shows a high degree of identity to IAV strains circulating in 1933 or later (Supplementary Fig. 4). Given the size of this fragment and that of the fragments initially targeted to characterize the BM PA sequence 2 , this apparent recombination may reflect the contamination of one of the PCR reactions that allowed for the reconstruction of the BM genome with a PCR product derived from a later influenza A virus strain. However, we cannot formally exclude that the BM PA fragment is a genuine but rare case of natural recombination 3 or that it represents a sequencing artefact due to intrahost variation.
Resequencing of the BM genome using PCR-independent methods should clarify this issue.
In light of the hypothesis of an avian origin of the pandemic virus, we investigated the presence of amino acid (aa) signatures known or suspected to be associated with avian-to-human adaptation, either because they have been characterized functionally or found to be distributed differentially across bird-and human-infecting influenza viruses. Here, we focused on the high quality (but imprecisely dated) genome derived from MU-162 and identified two such positions. In PB2, we found a M631L aa change within the PB2 627 host range domain.
Although this mutation has recently been described as a main mediator of adaptation and Germany cluster in between pandemic variants sampled in North America ( Fig. 3 and Supplementary Fig. 6). This is in line with extensive transatlantic mixing of pandemic influenza. To more explicitly assess support for a migration scenario, the relative fit of models in which the clustering of the European isolates was not constrained versus constrained to be monophyletic was determined using the Generalised Stepping Stone sampling. In line with the clustering patterns, the unconstrained model clearly better fits the pandemic clade, irrespective of the use of ambiguity characters in the BE-576 sequence (ln BF 10.7 and 8.9). This agrees with the rejection of the constrained topology according to the AU-test 6 as implemented in IQtree 7 (p-value 0.0097 and 0.0087). The AU-test also rejects the topology in which monophyly is enforced for the pandemic peak viruses (p-value 0.0132 and 0.0091).

Similar epidemic dynamics in 1918 and 2009
To

Reconciling clock and non-clock topologies
Non-clock maximum likelihood phylogenetic reconstruction indicates that human seasonal H1N1 and 1918 pandemic viruses cluster together with reasonably high bootstrap support, with the seasonal lineage nested within the 1918 pandemic variants, for both the HA and NA segments (Supplementary Fig. 7), while inference under the standard HSLC model places the human seasonal lineage as a sister clade of the classical swine flu and 1918 pandemic lineages ( Supplementary Fig. 8a).
If the human seasonal and pandemic lineages are indeed monophyletic, the incompatible pattern under the standard HSLC model -which assumes a constant rate of evolution in each of the host-specific lineages -could be induced by a considerably higher rate of evolution in the years following the pandemic. This would result in a considerably higher divergence between pandemic and seasonal viruses for H1 and N1 than expected under a strict clock, and could therefore induce a sister lineage pattern with a relatively deep MRCA in the time-measured reconstructions. To explore this hypothesis, phylogenies including only the human H1 and N1 taxa were estimated using the same substitution and demographic models as under the standard HSLC but specifying a relaxed clock model 8 .
For HA, this indicates an elevated evolutionary rate on the branch ancestral to the human seasonal MRCA, and results in a topology that is compatible with the non-clock ML phylogeny ( Supplementary Fig. 9). For NA, a pattern similar to that of HA emerges ( Supplementary   Fig. 9). These results are in line with the idea that significant divergence that accumulated after 1918-1919 could indeed induce the sister pattern under a constant rate assumption over the entire lineage.

Simulation-based assessment of standard and extended host-specific local clock model inference and non-clock phylogenetic inference.
The simulations under the extended host-specific local clock (HSLCext) pattern ( Supplementary Fig. 10a) Fig. 10b), as estimated for the HA segment, indicate that both the non-clock ML tree reconstruction as well as the HSLCext model, which allows for a different rate on the branch ancestral to human seasonal, are able to recover the right topology with a 0.95 probability. In line with this, the extended HSLC model does not infer a significantly positive effect on the branch ancestral to human seasonal (not shown). This indicates that the extended HSLC model and the non-clock ML inference are unlikely to produce biased results.

HSLC standard versus HSLC extended: impact on clustering of human seasonal H1N1
For each segment, the plausibility that the seasonal lineage directly emerged from the pandemic diversity increases under the extended HSLC model as compared to under the standard HSLC model (Supplementary Fig. 8b) Bayes factor (b -a) 7.67 11.05 Supplementary Table 3. Support for the HSLCext model. The log marginal likelihood estimates for both models are given in rows a and b. Bayes factors are calculated as the ratio of the marginal likelihoods of both models.
Because the likelihoods are expressed as log-likelihoods, this equates to taking the difference of the log-likelihood of competing models.
Finally, we also ran the HSLCext model on the segment data sets without the new 1918 sequences from Germany. As with the complete data set, the inclusion probability for a rate effect on the human seasonal stem branch is 1 for all segments. In addition, the rate effect on the human seasonal stem branch is of the same magnitude as with the complete data set ( Supplementary Fig.11). Although the addition of the new German sequences did not impact the outcome of this specific analysis, generating these genomes led to revisiting, and ultimately improving, the HSLCstd model.