Introduction

Influenza A viruses (IAVs), which belong to the Orthomyxoviridae family, typically replicate in the intestinal tract of wild aquatic birds causing no overt disease and are transmitted through open-water sources via the fecal-oral route1. However, in mammals IAVs cause mainly respiratory infections. The genome of IAVs is a single-stranded, negative-sense RNA that consists of eight segments coding up to 12 proteins (PB2, PB1, PB1-F2, PA, PA-X, HA, NP, NA, M1, M2, NS1 and NEP)2,3,4,5. To date, there are 18 hemagglutinin (HA) and 11 neuraminidase (NA) subtypes of IAVs1,6,7,8,9.

Although almost all combinations of HA and NA subtypes are found in wild aquatic birds, only H1, H2 and H3 subtypes have caused epidemics and pandemics in the human population. H1N1 IAVs that descended from avian sources probably moved into the human and swine populations prior to the 1918 pandemic and have been established in those hosts since then10,11. In recorded human history, three pandemics have been caused by H1N1 viruses, in 1918, 1977 and 2009. In particular, the pandemics in 1918 and 2009 resulted from the direct adaptation of the virus to mammals or the acquisition of several segments from avian H1N1 viruses prior to mammalian adaptation. Recently, we showed that the naturally circulating avian H1N1 viruses of North America could lead to high mortality in mice12. Therefore, it is important to understand the genomic characteristics of these wild-type avian H1N1 viruses that facilitate interspecies transmission and pathogenesis in mammals without prior adaptation in another mammalian host.

To date, several molecular markers or genomic changes associated with mammalian adaptation and increased virulence upon transmission to a new host have been identified in the genomes of IAVs. For instance, the E627K mutation in the PB2 gene13 was detected in recently emerged H7N9 human influenza A isolates14. However, the natural existence of such markers in the circulating IAV population of wild bird origin is not well known because of the lack of surveillance studies and corresponding genomic data. With the increasing number of surveillance studies across the world, it is important to investigate the unique genomic variations that make some avian IAVs virulent and transmissible in mammals, so that we can be prepared for future outbreaks and pandemics originating from wild bird populations.

In light of these concerns, we sequenced the whole genomes of 31 wild-type North American avian H1N1 IAVs that were used in our previous study12. Using this whole-genome information, we applied the Cox proportional hazard model to the survival data that we previously obtained from mice infected with these viruses (n = 5 mice/virus) to evaluate the effects of deduced amino acid (aa) residue variation at each position, the host origin of the viruses (Anseriformes or Charadriiformes) and the interaction between residue and virus host. In many medical studies, time to death is the main event of interest. It is common that at the end of the follow-up time some individuals have not had the event of interest, so their true time to event is unknown. Furthermore, survival data are rarely normally distributed. These features of the data require a special method called survival analysis. In studies designed to determine multiple factors that influence survival, the most commonly used regression model is the Cox proportional hazard model. The Cox proportional hazard model has been used in genome-wide association studies15,16 to identify genetic variants that are associated with patient survival. It is a semi-parametric model: the covariate effects are modeled parametrically, but the baseline hazard rate (common to all samples) is modeled non-parametrically. The model analyzes survival data to identify differences in survival due to an independent variable such as treatment (i.e. virus infection) while adjusting for other influential covariates (i.e. host origin), and provides an estimate of the hazard ratio and its confidence interval. We excluded the least pathogenic virus in DBA/2J mice from this analysis because it caused only 20% overall mortality. We then assessed the potential contribution of a subset of significant polymorphic sites to the virulence of the virus by structurally modeling them and predicting their functional impact.
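For readers unfamiliar with the model, its general form can be sketched as follows. The symbols (covariate vector x, coefficient vector β, event indicator δ and risk set R) are standard textbook notation introduced here for illustration and are not taken from the original analysis; ties in event times are ignored for simplicity.

```latex
% Hazard for an animal with covariate vector x (e.g. residue, host origin):
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)

% Partial likelihood over observed deaths (\delta_i = 1); R(t_i) is the set of
% animals still alive and under observation just before time t_i:
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(\beta^{\top} x_i\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(\beta^{\top} x_j\right)}
```

The baseline hazard h_0(t) cancels in each ratio, which is why it can be left unspecified while β is still estimated; exp(β_k) is the hazard ratio for a one-unit change in covariate k.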

Results

Polymorphic sites observed in 31 wild-type North American avian H1N1 IAVs

We performed whole-genome sequencing on 31 wild-type North American avian H1N1 viruses, deduced the protein sequences encoded by each segment and aligned them to compare variation across the deduced proteome.

In total, we observed 404 (8.7%) variable sites in the full proteome of avian H1N1 IAVs (Table 1). The variable sites across the whole genome for all 31 viruses are shown in a proteotype plot (Fig. 1). NS1 contributed the most (80) to these variable sites, followed by HA (49), PB1-F2 (48), PA (45) and NA (44) (Table 1). Among all proteins, PB1-F2, NS1, PA-X and NEP showed variation at more than 10% of their aa positions (53.3%, 34.8%, 19.7% and 18.2%, respectively, Table 1). The individual plots showing polymorphic sites and the residue variations in each coding region are given in Supplementary Figs. S1–S11.
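As a minimal sketch of how variable sites can be counted from an alignment, the following assumes equal-length aligned protein sequences with "-" as the gap character. The function name and the toy sequences are hypothetical and do not come from the study data.

```python
def polymorphic_sites(aligned):
    """Return 1-based alignment columns at which more than one residue occurs."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    sites = []
    for pos, column in enumerate(zip(*aligned), start=1):
        residues = set(column) - {"-"}  # gaps are not counted as a residue state
        if len(residues) > 1:
            sites.append(pos)
    return sites


# Toy alignment of three hypothetical 4-residue sequences.
toy = ["MKTL", "MRTL", "MKTI"]
sites = polymorphic_sites(toy)
print(sites)                       # columns 2 and 4 vary
print(len(sites) / len(toy[0]))    # fraction of variable positions
```

Dividing the count by the protein length gives the per-protein percentages of the kind reported in Table 1.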

Table 1 Distribution of polymorphic sites in the proteome of the tested North American avian H1N1 IAVs
Figure 1

Distribution of polymorphic residues in the proteome of North American avian H1N1 IAVs.

The proteome was concatenated using the polymorphic sites in each protein in the following order: PB2, PB1, PB1-F2, PA, PA-X, HA, NP, NA, M1, M2, NS1 and NEP. Similar amino acids were grouped by color: aliphatic amino acids, shades of brown and orange; positively charged residues, shades of blue; negatively charged ones, shades of red; polar uncharged ones, shades of green; aromatic ones, shades of purple; and glycine and cysteine, shades of yellow. These positions were based on the sequences used in this study.
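The residue grouping used for coloring can be expressed as a simple lookup. The legend names the classes but not their full membership, so the assignment of individual amino acids below (for example, placing histidine with the positively charged residues and proline and methionine with the aliphatic ones) is an assumption made for this sketch.

```python
# Residue classes mirroring the color groups named in the legend.
# Class membership is assumed for illustration, not taken from the study.
AA_CLASS = {}
for name, residues in {
    "aliphatic": "AVLIMP",
    "positive": "KRH",
    "negative": "DE",
    "polar_uncharged": "STNQ",
    "aromatic": "FWY",
    "glycine_cysteine": "GC",
}.items():
    for aa in residues:
        AA_CLASS[aa] = name


def same_class(substitution):
    """True if a substitution written like 'K43R' stays within one residue class."""
    old, new = substitution[0], substitution[-1]
    return AA_CLASS[old] == AA_CLASS[new]


print(same_class("K43R"))   # lysine and arginine share the positive class
print(same_class("D416N"))  # aspartate (negative) to asparagine (polar uncharged)
```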

Using previously determined survival data in DBA/2J mice upon infection with wild-type North American avian H1N1 IAVs12, we then investigated the effects of the residue variations, host origin of viruses and host-residue interaction in relation to their pathogenicity in mice.

Effect of residue variation alone

After adjusting for the host effect, the survival analysis with the Cox proportional hazard model identified 108 polymorphic residues at 105 sites (26% of the total polymorphic sites) that were statistically associated with the pathogenicity of the viruses in the DBA/2J mice by residue effect (adjusted P (FDR) ≤ 0.01; Supplementary Tables S1, S2). The polymorphic sites with residue effect were contributed by NS1 (60), NEP (18), PB2 (8), PB1-F2 (6), PA (5), PB1 (3), HA (2), NA (2) and M2 (1) (Table 1 and Supplementary Table S2). Three of the 105 polymorphic sites had more than one substitution (R75H or H75L in PB1-F2, S7L or T7L in NS1 and S7L or T7L in NEP) (Supplementary Table S2). None of the variable sites in the PA-X, NP or M1 proteins were associated with the pathogenicity of the virus strains in terms of amino acid substitutions.
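The text reports FDR-adjusted P values but does not name the adjustment procedure. The sketch below shows the Benjamini-Hochberg step-up adjustment as one common choice; treating it as the method used here is an assumption, and the input values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four sites.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q <= 0.01]
print(adj, significant)
```

With one test per polymorphic site, a site is called significant when its adjusted value is at or below the chosen threshold (0.01 in the text).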

Notably, among these 105 sites, 78 (74.3%) were from NS1 and NEP, both of which are encoded by the NS segment through alternative splicing. The proteotype of NS1 and NEP indicates that two major alleles exist in these 31 wild-type H1N1 viruses (Fig. 1, Supplementary Figs. S10, S11). In other words, the residue effect in these two proteins was mainly contributed by the underlying alleles rather than by the individual variable sites. Therefore, we adjusted for the effect of these non-independent polymorphic sites. After removing the 73 sites from NS1 and NEP that do not vary from the underlying allele of NS, the remaining 5 significant sites in these two proteins are 7, 27 and 227 in NS1 and 7 and 70 in NEP. Because the NS1 and NEP ORFs overlap, the nucleotide substitutions in the NS segment leading to residue variations in NS1 and NEP are: c.C20T for S7L in NS1 and NEP; c.A19T, c.C20T and c.C21A for T7L in NS1 and NEP; c.A81G for I27M in NS1; and c.A680G for E227G in NS1 and S70G in NEP. With this adjustment, the total number of significant polymorphic sites associated with pathogenicity of these viruses in mice is reduced to 32 (7.9% of the 404 polymorphic sites).
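The correspondence between the nucleotide substitutions and the affected codons can be checked with a short calculation. The NEP splice coordinates used below (first exon ending at coding position 30, second exon starting at coding position 503) are an assumption based on the commonly described NS segment splice sites; they are not stated in the text, and both function names are hypothetical.

```python
def codon_number(cds_position):
    """Map a 1-based coding-sequence position (c. notation) to its 1-based codon."""
    if cds_position < 1:
        raise ValueError("positions are 1-based")
    return (cds_position - 1) // 3 + 1


def nep_codon_number(cds_position, exon1_end=30, exon2_start=503):
    """Codon in the spliced NEP mRNA for a position given in unspliced coordinates.

    The default splice coordinates are assumed, not taken from the source.
    """
    if cds_position <= exon1_end:
        spliced = cds_position
    elif cds_position >= exon2_start:
        spliced = exon1_end + (cds_position - exon2_start + 1)
    else:
        raise ValueError("position lies in the intron removed from the NEP mRNA")
    return codon_number(spliced)


for pos in (19, 20, 21, 81, 680):
    print(pos, codon_number(pos))          # NS1 codons 7, 7, 7, 27 and 227
print(nep_codon_number(20), nep_codon_number(680))  # NEP codons 7 and 70

remaining = 105 - 73
print(remaining, round(100 * remaining / 404, 1))   # sites left and their share of 404
```

Under these assumed coordinates, c.81 falls in the intron removed from the NEP mRNA, which is consistent with I27M being reported for NS1 only.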

Among the remaining 32 sites, the most significant residue variations associated with increased pathogenicity were K43R and D416N in NA, K53R and R75H in PB1-F2, I292T and E358V in PB2, S203N and I550L in HA, S7L in NS1 and NEP, S70G in NEP, E227G in NS1 and L187I in PA (Fig. 2; adjusted P (FDR) ≤ 4.32E-05). For instance, the viruses with the residue N416 in their NA caused 100% mortality by 6 days post infection (dpi), whereas those with the residue D416 in their NA showed delayed mortality (Fig. 2a).
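Survival differences of this kind are usually displayed as Kaplan-Meier curves. The sketch below is a minimal estimator written for illustration; the times and event flags are hypothetical and are not data from the study. An event flag of 1 means the animal died at that day, and 0 means it was still alive when observation ended (censored).

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all animals that share this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored animals both leave the risk set
    return curve


# Hypothetical group of five mice: four deaths and one survivor censored at day 14.
print(kaplan_meier([4, 5, 5, 6, 14], [1, 1, 1, 1, 0]))
```

Censored animals contribute to the number at risk until they leave observation, which is how the estimator uses incomplete follow-up without discarding it.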

Figure 2

Polymorphic sites associated with pathogenicity in mice by residue effect.

Amino acid substitutions at certain positions affect the pathogenicity of H1N1 IAVs in DBA/2J mice. Sites showing significant residue effect were identified by the Cox proportional hazard model after adjusting for the host effect. Adjusted p-value (FDR) ≤ 0.01 for residue effect was deemed significant. The variants (a) D416N in NA, (b) K53R and (c) R75H in PB1-F2, (d) I292T and (e) E358V in PB2, (f) K43R in NA, (g) S7L in NS1, (h) S203N and (i) I550L in HA, (k) S70G in NEP, (l) E227G in NS1 and (m) L187I in PA are associated with increased pathogenicity in avian H1N1 IAVs. These positions were based on the sequences used in this study.

Effect of host origin of virus

To understand whether the host origin of an avian virus (i.e. Anseriformes or Charadriiformes) is associated with its pathogenicity in mice, we investigated the host effect after adjusting the residue effect. Thus, the host effect was determined between the IAVs of Anseriformes and Charadriiformes origin carrying only the same residue at those sites. Twenty-three polymorphic sites exhibited an effect of the host origin of virus on the pathogenicity (adjusted P (FDR) ≤ 9.96E-03; Supplementary Table S3). The polymorphic sites affected by host origin of viruses were in PB1-F2 (7), PB1 (5), PB2 (4), NP (2), HA (2), PA (1), NS1 (1) and NEP (1) (Table 1 and Supplementary Table S3). For instance, H75 in PB1-F2 and S70 in NEP were present in H1N1 viruses of Anseriformes and Charadriiformes origin but showed increased pathogenicity in mice infected with H1N1 viruses of Anseriformes origin (Fig. 3a, b).

Figure 3

Polymorphic sites associated with pathogenicity in mice by host origin and host-residue interaction.

Amino acid substitutions at certain positions (pos) affect the pathogenicity of H1N1 IAVs of Anseriformes (Ans) or Charadriiformes (Char) origin in DBA/2J mice. Sites showing significant host effect and host-residue interactions were identified by Cox proportional hazard model with host, residue and host-residue interaction term in the model. Adjusted p-value (FDR) ≤ 0.01 was deemed significant. Two of the polymorphic sites with host effect: (a–b) H75 in PB1-F2 and S70 in NEP are individually associated with increased pathogenicity in Ans IAVs. Two of the polymorphic sites with host-residue interactions: (c–d) T46M in PB1-F2 and I89V in NEP are individually associated with increased pathogenicity in Ans IAVs and M46T in PB1-F2 and V89I in NEP have the same effect in Char IAVs. These positions were based on the sequences used in this study.

Effect of host-residue interactions

We then examined the host-residue interaction and identified 14 such polymorphic sites (adjusted P (FDR) ≤ 0.01) based on the residue effect on survival in different hosts. These sites were distributed mainly in polymerase genes (Table 1 and Supplementary Table S4). For example, the substitution M46T in PB1-F2 of H1N1 IAVs of Charadriiformes origin was associated with increased mortality in the infected mice. However, the same substitution in viruses of Anseriformes origin was associated with decreased mortality in infected mice (Fig. 3c). Conversely, the I89V substitution in NEP was associated with increased mortality in mice infected with viruses of Anseriformes origin, but with decreased mortality in mice infected with viruses of Charadriiformes origin harboring the same substitution (Fig. 3d).
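An interaction term is what allows the same substitution to have opposite effects in the two host groups. The following sketch uses indicator coding chosen here for illustration (r = 1 for the substituted residue, g = 1 for Charadriiformes origin); the coding and the coefficient names are assumptions, not the parameterization reported by the authors.

```latex
% Linear predictor with residue, host and host-residue interaction terms:
\eta = \beta_r\, r + \beta_g\, g + \beta_{rg}\, r g,
\qquad h(t \mid r, g) = h_0(t)\, e^{\eta}

% Hazard ratio of the substitution (r = 1 versus r = 0) within each host group:
\mathrm{HR}_{\text{Anseriformes}} = e^{\beta_r},
\qquad
\mathrm{HR}_{\text{Charadriiformes}} = e^{\beta_r + \beta_{rg}}
```

A pattern such as that described for M46T in PB1-F2 corresponds to β_r < 0 together with β_r + β_rg > 0, so that the hazard ratio is below 1 in one host group and above 1 in the other.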

Polymorphic sites associated with pathogenicity in individual proteins

The pathogenesis of IAVs is a complex polygenic trait that is an outcome of various molecular variations in the overall gene constellation. More importantly, some of those variations cause alterations in protein structure and function that might contribute to the efficacy of viral entry, replication and transmission in different host systems. Therefore, we examined the polymorphic sites in individual proteins, mainly the ones statistically associated with pathogenicity, in terms of their structural and functional importance. The presence of known adaptation, virulence or drug resistance markers and the structural importance of all other observed genetic variants, where possible, are given in the Supplementary Information.

Polymerase proteins

Being a component of the viral polymerase complex, the PB2 protein contributes to the efficiency of viral replication in different hosts. Of the 37 polymorphic sites in the PB2 protein (Supplementary Fig. S1), eight (I67V, A152S, A199T, V255I, I292T, E358V, R508Q and V649I) were associated with increased pathogenicity by residue effect (adjusted P (FDR) ≤ 4.77E-03; Supplementary Table S2). Four sites (67, 152, 199 and 508) were associated with pathogenicity of avian H1N1 IAVs by host effect (adjusted P (FDR) = 9.50E-03; Supplementary Table S3). Five of those sites with residue effect (I67V, A152S, A199T, R508Q and V649I) were observed to co-occur in five avian H1N1 IAVs of Charadriiformes origin (A/shorebird/DE/300/2009, A/shorebird/DE/324/2009, A/gull/DE/428/2009, A/shorebird/DE/170/2009 and A/shorebird/DE/318/2009). Two variants associated with increased pathogenicity, E358V and V649I [adjusted P (FDR) values of 5.63E-09 (Fig. 2e) and 4.77E-03, respectively], are located in the cap-binding and host-specific domains of PB2, respectively (Table 2). These variants were observed in some of the most pathogenic viruses (pathogenicity index = 4, ref. 12) and in the five H1N1 IAVs of Charadriiformes origin mentioned above (Supplementary Fig. S1). The variant A199T, also observed only in viruses of Charadriiformes origin, was identified in association with pathogenicity by residue effect (adjusted P (FDR) = 4.77E-03). Interestingly, we also detected host effect at the same site when the viruses carrying A199 were compared (adjusted P (FDR) = 9.50E-03). This site has been previously identified as one of the genomic markers differentiating human and avian IAVs17.

Table 2 Selected statistically significant polymorphic sites and their potential function or structural insights

Of the 34 polymorphic sites in the PB1 protein (Supplementary Fig. S2), we identified three substitutions (K215R, L298I and I667V) associated with increased pathogenicity in mice by residue effect (adjusted P (FDR) ≤ 4.77E-03; Supplementary Table S2), two of which (L298I and I667V) co-occurred in only five avian H1N1 IAVs of Charadriiformes origin mentioned above. Additionally, we observed five polymorphic sites in PB1 (215, 298, 372, 642 and 667) that were associated with pathogenicity by host effect (adjusted P (FDR) = 9.50E-03; Supplementary Table S3). Based on host-residue interactions, the S375N substitution was associated with increased pathogenicity in avian H1N1 IAVs of Anseriformes origin and the N375S was more pathogenic in those of Charadriiformes origin (adjusted P (FDR) = 8.66E-03; Supplementary Table S4).

PB1-F2 is a small polypeptide expressed from the second ORF of the PB1 gene and contributes to the viral fitness by inducing apoptosis in the infected host cell2. We observed that 30 North American avian H1N1 viruses possess a full-length (90-aa) PB1-F2 ORF; but A/mallard/MN/AI07-3100/2007 possesses a slightly shorter (87-aa) PB1-F2 ORF. Of the 48 polymorphic sites in the PB1-F2 protein (Supplementary Fig. S3), seven residues (H15R, N23S, T27I, K53R, L58S, R75H, H75L) were associated with pathogenicity in mice by residue effect (adjusted P (FDR) ≤ 4.77E-03; Supplementary Table S2). Four of the substitutions (N23S, T27I, L58S and H75L) co-occurred in five avian H1N1 IAVs of Charadriiformes origin mentioned above. Positively charged K73 and R75 residues within a 12-aa subdomain (residues 63–75) are the minimal requirement for a full-length peptide to localize to the mitochondria instead of cytoplasm18. Here we observed that K73 was conserved in 31 H1N1 IAVs. Although R75H substitution in PB1-F2 of H1N1 IAVs was associated with pathogenicity in mice by residue effect (Fig. 2c and Table 2), localization of this protein was likely perturbed. Strains with variants in position 75 also have coevolved positive residues in the surrounding subdomain Q69R and K78R, which might compensate for the loss of R75.

For PB1-F2, we also identified seven sites (8, 15, 23, 27, 69, 75, 78) associated with pathogenicity by host effect (adjusted P (FDR) ≤ 9.62E-03; Supplementary Table S3). On the basis of host-residue interactions, we identified four polymorphic sites. The residues R26, T46, S83 and N90 showed increased pathogenicity in avian H1N1 IAVs of Charadriiformes origin, whereas the other four residues, Q26, M46, F83 and S90, showed increased pathogenicity in those of Anseriformes origin (adjusted P (FDR) ≤ 8.66E-03; Supplementary Table S4). For instance, the host-residue interaction at position 46 in PB1-F2 indicated that the M46T substitution was associated with increased pathogenicity of the viruses of Charadriiformes origin and decreased pathogenicity of viruses of Anseriformes origin (Fig. 3c).

We found that five of the 45 polymorphic sites in the PA protein (L187I, R269K, I348L, S388G and I545V) were associated with pathogenicity by residue variations (adjusted P (FDR) ≤ 5.17E-03; Supplementary Table S2). Site 187 was also associated with pathogenicity by host effect (adjusted P (FDR) = 7.39E-03; Table 2, Supplementary Table S3). The L187I substitution was observed in influenza viruses of shorebird origin that were more pathogenic in mice (Supplementary Fig. S4), but the viruses of gull or Anseriformes origin did not carry this substitution. According to host-residue interaction, D272, I323 and P400 were associated with increased pathogenicity of avian H1N1 IAVs of Anseriformes origin; conversely, E272, V323 and Q400 appeared to help retain pathogenicity of avian H1N1 IAVs of Charadriiformes origin (adjusted P (FDR) = 8.83E-03; Supplementary Table S4).

The recently discovered PA-X protein is a product of the X-ORF within the PA-ORF and resulted from +1 ribosomal frameshifting, the 61-aa long C terminus of which fuses with the N terminus of PA3. We detected 12 polymorphic sites in this region, none of which were associated with pathogenicity by residue or host-origin effect (Supplementary Fig. S5). However, we identified four sites with substitutions associated with the pathogenicity by host-residue interactions. The L16, V21, S23 and P59 were associated with increased pathogenicity of viruses of Charadriiformes origin and S16, A21, L23 and Q59 were associated with increased pathogenicity of viruses of Anseriformes origin (Supplementary Table S4; adjusted P (FDR) was 9.02E-03 for position 21 and 8.83E-03 for other sites).

Surface glycoproteins

We detected 49 variants in HA of North American avian H1N1 IAVs (Supplementary Fig. S6). Two variants (S203N and I550L) were associated with pathogenicity by residue effect (Fig. 2h, i; adjusted P (FDR) = 4.32E-05; Supplementary Table S2). The variant S203N was located in the Sb antigenic site (Table 2), possibly playing a role in host immune response (Fig. 4). We identified two polymorphic sites (173 and 295) affecting the pathogenicity of viruses in mice by host effect (adjusted P (FDR) was 9.96E-03 and 9.50E-03, respectively; Supplementary Table S3). Position 173, which is located in the head of the HA protein, is part of the Sa antigenic site (Table 2) and plays a role in host immune response (Fig. 4).

Figure 4

The HA variants associated with pathogenicity of avian H1N1 IAVs by residue and host effect.

(a) Ribbon model of the HA1 (teal) and HA2 (green) dimer in a single HA molecule from A/mallard/Alberta/35/1976 [PDB: 2WRH]. Antigenic sites Sa (peach), Ca (yellow), Sb (dark blue) and Cb (orange) and locations of variants (magenta) are indicated. (b) The variants S203 and T173 in HA of avian H1N1 IAVs are located in the head of HA, as part of the Sb (dark blue) and Sa (peach) antigenic sites, respectively. These variants play a role in host immune response. These positions were based on the sequences used in this study.

We detected 44 variants in the NA of avian H1N1 IAVs (Supplementary Fig. S8). Two variants (K43R and D416N) were associated with pathogenicity in mice by residue effect (Figs. 2a, f; adjusted P (FDR) was 4.24E-09 for site 416 and 9.41E-09 for site 43; Supplementary Table S2). One of the most significant variants, D416N, is located at the interface between the monomers (Supplementary Fig. S17b).

Matrix and nonstructural proteins

We identified two variants in M1 and nine in M2 proteins of North American avian H1N1 IAVs (Supplementary Fig. S9a, b). The only variant associated with pathogenicity in mice by residue effect, R18K, was identified in the M2 protein (adjusted P (FDR) = 4.23E-04; Supplementary Table S2).

Based on their sequence homology, the NS gene sequences are grouped as allele A (swine/human-like) or allele B (avian-like)19. The avian H1N1 IAVs in this study belonged to both allele A (n = 17) and allele B (n = 14). Therefore, the number of polymorphic sites in the NS gene products was higher due to the underlying allele effect.

Of the 80 polymorphic sites observed in NS1, 60 sites (75%) were initially associated with the pathogenicity of viruses in mice by residue effect (adjusted P (FDR) ≤ 4.77E-03; Supplementary Table S2). After adjusting the underlying allele effect, 4 polymorphic residues (S7L, T7L, I27M and E227G) in NS1 remained significant by residue effect (adjusted P (FDR) values are 1.37E-05, 4.77E-03, 1.87E-03 and 4.32E-05, respectively; Supplementary Table S2). The host effect was detected at only one site (227) (adjusted P (FDR) = 7.39E-03; Supplementary Table S3). On the basis of host-residue interactions, we observed that S206 was associated with increased pathogenicity of H1N1 IAVs of Charadriiformes origin and C206 was associated with increased pathogenicity in those of Anseriformes origin (adjusted P (FDR) = 4.19E-03; Supplementary Table S4).

Several of the polymorphic sites identified in association with pathogenicity by residue effect in NS1, but showing underlying allele effect, were found to have structural and functional importance. Modeling of two of those variants, D101E and F103Y, showed that low-pathogenic strains contain residues D101 and F103, of which only the former interacts with the phosphoinositide 3-kinase (PI3K)-inhibitory domain at R587 (Table 2). Pathogenic viruses contain E101 and Y103, making it more plausible that both residues are involved in the interaction (Fig. 5a). D101E would retain a similar salt bridge with R587, and the additional hydroxyl on Y103 would facilitate further interaction. The outcome would be a stronger interaction between NS1 and the PI3K-inhibitory domain.

Figure 5

NS1 and NEP variants associated with pathogenicity by residue effect.

(a) Ribbon model of NS1 (green) in which S103 is replaced with F103 and NS1 is bound to the PI3K inhibitory domain (teal) [Bovine PDB: 3L4Q] are labeled. The predicted interaction (black dots <5 Å) between variant residues E101 and Y103 of NS1 protein with R587 (numbering is based on human) from the PI3K inhibitory domain. Low-pathogenic strains contain residue D101 and F103 in which D101 interacts only with the PI3K inhibitory domain R587. Pathogenic viruses contain E101 and Y103 and both residues are involved in interactions. The outcome is a stronger interaction between NS1 and the PI3K inhibitory domain. (b) NEP variants increase the negative charge of the M1-binding domain in more pathogenic viruses. Electrostatic surface potential representation of the NEP/M1-binding surface [PDB: 1PD3]; native (left image) and with three variant changes (R86I, K88T and K64T) (right image). The surface color indicates negative charge (red), positive charge (blue) and neutral (white). The extent of the charge is indicated below each image in units of kT/e. These positions were based on the sequences used in this study.

Among the 22 polymorphic sites in NEP (Supplementary Fig. S11), we identified 18 residues that were associated with pathogenicity by residue effect (adjusted P (FDR) ≤ 4.77E-03; Supplementary Table S2); host effect on pathogenicity occurred only at site 70 (adjusted P (FDR) = 7.39E-03; Supplementary Table S3). After adjusting for the underlying allele effect, 3 polymorphic residues in NEP remained significant by residue effect: S7L, T7L and S70G (adjusted P (FDR) values are 1.37E-05, 4.77E-03 and 4.32E-05, respectively; Supplementary Table S2). On the basis of host-residue interactions, we observed an increased pathogenicity of avian H1N1 IAVs of Anseriformes origin in mice due to the I89V substitution in NEP (adjusted P (FDR) = 1.35E-03; Fig. 3d and Supplementary Table S4). Several statistically associated variants with underlying allele effect (K64T, R86I and K88T) were also present on the M1-binding surface of more pathogenic strains, increasing the negative charge of the M1-binding domain (wild-type charge at pH 7.00 = –1.8; variant charge at pH 7.00 = –5.1) (Fig. 5b). On the basis of this finding, we speculate that the increased negative charge on the M1-binding surface of the more pathogenic strains provides an additional advantage for early infection due to tighter interaction of NEP and M1. Because M1 is not present in large quantities until later stages of infection, the tighter binding between NEP and M1 may cause early export of genomic ribonucleoproteins from the nucleus20.

Interestingly, the variants E227G in NS1 and S70G in NEP, caused by the same nucleotide substitution c.A680G in the NS segment, showed both host and residue effects (Fig. 2l and Fig. 3d). The E227G/S70G variant was observed in three pathogenic viruses of Charadriiformes origin (A/shorebird/DE/300/2009, A/shorebird/DE/324/2009 and A/gull/DE/428/2009) with 100% mortality by 7 dpi in mice (Supplementary Table S1). In NS1, E227G alters the PDZ-binding motif from ESEV to GSEV (Table 2). Sequences with ESEV contain a di-Arg (RRVESEV) upstream of the PDZ-binding motif and those with GSEV do not (RTIGSEV). We detected 13 viruses with an RTIESEV motif, nine of which belong to the viruses of lower pathogenicity. In contrast, the RTIGSEV motif was found only in H1N1 IAVs of Charadriiformes origin with greater pathogenicity. In NEP, S70 is an avian-like residue and G70 is a mammalian-like residue that is observed in swine, human and the 2009 pandemic H1N1 isolates17. It is located in the M1-binding domain of NEP (residues 54-121)21.
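The motif comparison above amounts to reading the last four residues of NS1 together with the three residues preceding them. A minimal sketch follows; the function name and the returned field names are hypothetical, and the input is simply the seven C-terminal residues quoted in the text.

```python
def c_terminal_context(ns1_tail, motif_len=4, upstream_len=3):
    """Split the C-terminal end of NS1 into the PDZ-binding motif and its upstream residues."""
    if len(ns1_tail) < motif_len + upstream_len:
        raise ValueError("sequence is shorter than motif plus upstream window")
    motif = ns1_tail[-motif_len:]
    upstream = ns1_tail[-(motif_len + upstream_len):-motif_len]
    return {
        "upstream": upstream,
        "motif": motif,
        "di_arg": "RR" in upstream,  # two adjacent arginines before the motif
    }


for tail in ("RRVESEV", "RTIESEV", "RTIGSEV"):
    print(tail, c_terminal_context(tail))
```

Applied to full-length NS1 sequences, the same split could be used to tabulate how often each upstream context accompanies each motif.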

Discussion

The pathogenicity of IAVs is a polygenic trait that results from combinations of genetic variants in different gene segments of the virus. Understanding the genomic dynamics of IAVs is an important step in risk assessment, in terms of determining the virus' virulence, transmissibility and pandemic potential. Previously, we introduced a measurement named “pathogenicity score” to indicate the overall effect of each virus on the infected host12. Here we applied the Cox proportional hazard model to the survival data obtained from individual IAV-infected mice in order to identify crucial natural polymorphisms in the genome of wild-type North American avian H1N1 IAVs in terms of their pathogenicity in a mammalian model. Although the same statistical model has been widely used in human genomics, to our knowledge this is the first time that it has been applied in virus pathogenicity studies. This method enables us to directly estimate the relationship between covariates, such as genetic variants and hosts, and the time of death. The Cox proportional hazard model also allows us to assess the individual contribution of covariates to the model.

Our results show that, at the deduced proteome level, the pathogenicity of the avian H1N1 viruses in DBA/2J mice is associated with 32 polymorphic sites by residue effect (after adjusting for the underlying allele effect of the NS segment), 23 polymorphic sites by host effect and 14 polymorphic sites by host-residue interactions. Overall, pathogenicity of avian H1N1 IAVs in mice was mainly associated with the polymerase complex (PB2, PB1, PB1-F2, PA, PA-X) and NS gene products (NS1 and NEP). The highest number of polymorphic sites (n = 73) in the NS gene sequences was due to the circulation of two alleles (Allele A and B) in North American IAVs of various subtypes19,22. The viruses in our study fell into both allele groups, but the avian-like allele (Allele B) appeared to be more associated with pathogenicity in mice. Thus, the underlying allele of the NS segment played a significant role in the pathogenicity of avian H1N1 IAVs in mice.

Modeling the variants associated with pathogenicity using available crystal structures helped us understand the functional and structural importance of those sites. For example, some significant variants associated with pathogenicity, E358V and V649I, were located in the cap-binding and host-specific domains of PB2, respectively. The substitutions A199S and V667I in PB2 were previously identified in association with enhanced transmission to humans17,23. Here, we detected A199T substitution associated with increased pathogenicity in mice, which was observed in five H1N1 IAVs of Charadriiformes origin (A/shorebird/DE/300/2009, A/shorebird/DE/324/2009, A/gull/DE/428/2009, A/shorebird/DE/170/2009 and A/shorebird/DE/318/2009). Threonine is biochemically more similar to serine than to alanine. Therefore, H1N1 viruses of Charadriiformes origin might use A199T substitution as a step toward adapting to mammals. Given that V667I is observed only in A/pintail/ALB/210/2002, it could not be detected by statistical analysis due to the lack of statistical power.

The PB1 of avian IAVs mostly carries N375, although S375 and T375 can also be found at lower proportions (18% and 13%, respectively)24. The S375 variant was found in the PB1 of the 1918 pandemic H1N1 and of human H1N1, H2N2 and H3N2 IAVs24. We detected a host-residue interaction at position 375 in PB1, which might indicate the importance of the N375S substitution for retaining the pathogenicity of viruses originating from different bird species. Of the 31 North American avian H1N1 IAVs, 30 possessed a full-length (90-aa) PB1-F2 and one possessed a slightly shorter (87-aa) form. The C-terminal region of the protein mediates mitochondrial localization18,25 by interacting with two mitochondrial proteins, ANT3 and VDAC1, to induce apoptosis26. Our analysis showed that one of those sites, 75, which lies in the 12-aa C-terminal subdomain of PB1-F2 shown to be the minimal requirement for mitochondrial localization of the protein18,25, was associated with pathogenicity by residue and host effect.

The effector domain of NS1 (residues 74-230) interacts with many host proteins to enhance viral mRNA translation27, deregulate cellular mRNA processing28, inhibit double-stranded RNA-activated protein kinase29 and activate PI3K signaling30. Most of the polymorphic residues associated with pathogenicity were in the effector domain of NS1 of the avian H1N1 IAVs examined. Some were structurally critical for the interaction between NS1 and the PI3K inhibitory domain. Four residues at the C terminus of NS1 are known to interact with PDZ-binding proteins and to alter the host's response. The avian-like PDZ-binding domain (ESEV or EPEV) increases the virulence of the virus in a mouse model31. Twenty-eight of the 31 avian H1N1 viruses carry an ESEV PDZ-binding domain, which may explain their pathogenicity in mouse models. However, we detected the GSEV PDZ-binding motif in three Charadriiformes isolates (A/shorebird/DE/300/2009, A/shorebird/DE/324/2009 and A/gull/DE/428/2009) that were previously categorized in the most pathogenic or moderately pathogenic virus groups12. A crystal structure of the PDZ domain with a bound ESEV peptide shows that the E227 residue on NS1 forms a salt bridge with R16 on the PDZ peptide32. The R227 (RSEV motif) seen in some human influenza viruses22 would not form this interaction; rather, it would be electrostatically unfavorable. G227 would not form a salt bridge similar to that of E227–R16, nor would it cause electrostatically unfavorable interactions. Therefore, GSEV would most likely bind the same PDZ domains that ESEV binds, but with fewer stabilizing interactions. We speculate that two adjacent arginines at residues 224 and 225 are complemented by the acidic ESEV motif, which also allows interaction with a basic charged PDZ-binding domain. In the GSEV motif, the presence of the two adjacent arginines could shift the C terminus to be more basic and repel the peptide from the PDZ domain. Thus, these two residues provide an advantage for the interactions of ESEV with PDZ-binding domains, while arginine-threonine (RT) provides an advantage to GSEV. This hypothesis is supported by RT occurring upstream of GSEV in more than 90% of the sequences in the publicly available database. The distribution of the GSEV motif in various subtypes, hosts and geographical regions in the full-length publicly available NS1 sequences is given in the Supplementary Information (Supplementary Tables S6, S7). Overall, NS appears to play a key role in the pathogenicity of avian H1N1 IAVs in mammalian models.

We observed host-originated differences (Anseriformes versus Charadriiformes) in the distribution and number of polymorphic sites in the full genome. According to the overall proteotype profiles, the six North American avian H1N1 IAVs isolated from Charadriiformes in 2009 displayed host-specific genetic differences such as shorebird versus gull, but the 25 influenza viruses isolated from Anseriformes (mallards, northern shovelers and pintails) over 24 years (1984–2008) did not. During the surveillance studies conducted in Delaware Bay, the H1N1 subtype was not detected in Charadriiformes until 2009 (personal communication with Scott Krauss, St. Jude). The six H1N1 isolates of Charadriiformes origin in the St. Jude influenza repository differ from other avian H1N1 viruses at 10 positions in their HAs, but they do not differ in their NAs. We detected 11 positions associated with pathogenicity by residue effect, which were found only in the polymerase genes of five H1N1 IAVs of Charadriiformes origin (A/shorebird/DE/300/2009, A/shorebird/DE/324/2009, A/gull/DE/428/2009, A/shorebird/DE/170/2009 and A/shorebird/DE/318/2009). We also detected two human-like residues (D382 and N409) already circulating in the PA gene pool of North American avian H1N1 IAVs of mainly Charadriiformes origin (human-like N409 in A/shorebird/DE/300/2009, A/shorebird/DE/324/2009, A/shorebird/DE/170/2009 and A/shorebird/DE/318/2009; and human-like D382 in A/shorebird/DE/274/2009). Although our sample size for H1N1 IAVs of Charadriiformes origin is too limited to draw a firm conclusion, the unique substitutions may provide an additional advantage to the polymerase complex of avian H1N1 viruses of Charadriiformes origin, allowing them to replicate more efficiently and to transmit in different host systems such as mammals.

Because genetic changes occur in multiple genes, it is difficult to identify the exact residues and/or the combinations of amino acid substitutions that play a role in adaptation, transmission, or virulence of the virus. However, combining our previous findings on the disease-causing potential of avian H1N1 IAVs with those from this study serves as an initial attempt to reveal the pathogenic genomic signatures in a naturally circulating gene pool. With this novel approach, we highlighted the importance of some residues in each protein that might play a role in the pathogenicity of the virus by altering protein-protein interactions. Variations at individual residues appear to be host- and virus strain–specific; however, using this sort of approach, we might be able to pinpoint key locations on the proteins that are altered by selective pressure. Our analyses also emphasized the importance of genetic variants occurring in viruses of different bird populations. Overall, the statistical approach used to assess the importance of amino acid substitutions, the analysis of selective pressure on different sites of the genome and the simulation of the polymorphic sites through structural modeling helped us understand the significance of certain genomic variations for the pathogenicity of avian H1N1 IAVs in mammalian models.

As an initial attempt, this comprehensive analysis advances the field of influenza evolution and host adaptation not only by identifying polymorphic sites associated with pathogenicity, but also by investigating where those residues are located on the proteins and how they contribute structurally and functionally to pathogenicity. In the long run, in conjunction with broader surveillance efforts, the ultimate goal is to deduce the virulence of wild-type avian IAVs and other emerging zoonotic pathogens in mammals based on their genomic information.

Methods

Viruses

We analyzed 31 North American avian H1N1 IAVs (Supplementary Table S1) to investigate the genetic relatedness among the viruses and to understand how the genomic information of wild-type viruses determines pathogenicity. All viruses were obtained from the St. Jude influenza repository, with the exception of two viruses (A/mallard/OH/4809-9/2008 and A/mallard/MO/466554-14/2007) that were kindly provided by Dennis Senne (U.S. Department of Agriculture National Veterinary Services Laboratories, Diagnostic Virology Section, USDA-APHIS, Ames, IA), Tom DeLiberto and Seth Swafford (both of the U.S. Department of Agriculture Animal and Plant Health Inspection Service, Wildlife Services, National Wildlife Disease Program, Fort Collins, CO). Viruses were minimally propagated in embryonated chicken eggs as previously described12. Viral RNA was extracted using the QIAamp® Viral RNA Kit (Qiagen, Gaithersburg, MD) per the manufacturer's instructions.

Whole-genome sequencing

The genomes of 31 avian H1N1 viruses (Supplementary Table S1) were sequenced using high-throughput (Illumina) and traditional (Sanger) platforms. Briefly, cDNA libraries were prepared using a SuperScript III One-Step RT-PCR Platinum Taq HiFi kit (Invitrogen, Grand Island, NY). Then, cDNA was sheared, end-repaired and a poly-A tail was added. Finally, adapters were ligated before the index sequences were inserted. High-throughput sequencing was done on an Illumina Genome Analyzer IIx system (Illumina, San Diego, CA). For Sanger sequencing, each segment was amplified using previously described gene-specific primers33. For efficient amplification of HA gene sequences from H1N1 IAVs, modified versions of previously described H1 primers were used (forward: 5′ AGCAAAAGCAGGGGAAATTCAAATC 3′ and reverse: 5′ AGTAGAAACAAGGGTGTTTTTCCACA 3′). Each gene segment was amplified using a One-Step RT-PCR Kit (Qiagen) and the amplicons were purified with a QIAquick gel extraction kit (Qiagen) per the manufacturer's instructions. Each amplicon was sequenced with Big Dye® Terminator (v3.1) Chemistry on Applied Biosystems 3730XL DNA Analyzers (Life Technologies, Carlsbad, CA). Sequence reads were assembled using DNASTAR Lasergene 9 Core Suite Seqman (DNASTAR, Inc., Madison, WI). The consensus sequences were aligned using MEGA 5.0 and checked for genetic variants.

Proteotyping

The protein-coding sequences of each gene segment were aligned using ClustalW multiple alignment on the BioEdit Sequence Alignment Editor version 7.0.9.0. Conserved positions in each alignment were removed and the amino acids were color coded at the variable sites using a seqplot program (gpcplot and residueplot) written in Perl. Similar amino acids were grouped by color: aliphatic amino acids, shades of brown and orange; positively charged residues, shades of blue; negatively charged ones, shades of red; polar uncharged ones, shades of green; aromatic ones, shades of purple; and glycine and cysteine, shades of yellow. Gaps and unknown residues were indicated in dark gray. The individual color codes used for proteotype profile were assigned as follows: A: medium orange, C: yellow, D: red, E: pink, F: dark violet, G: yellow-green, H: pure violet, I: light orange, K: light blue, L: dark brown, M: medium brown, N: dark lime green, P: dark orange, Q: light green, R: dark blue, S: lime green, T: dark cyan, V: light brown, W: light violet, X: light gray, Y: maroon and gap(-): dark gray. For each strain, all protein sequences were combined (PB2+PB1+PB1-F2+, etc.) and clustered using ClustalW, so that viruses would be ordered according to their pathogenicity (highest to lowest). Then the viruses were clustered based on the concatenated nucleotide sequences of their 12 expressed gene products. Aligned sequences were manually curated for the presence of previously described pathogenicity, mammalian adaptation and drug-resistance markers.
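The column filtering and color coding described above can be illustrated with a minimal Python sketch. This is not the Perl seqplot program used in the study; the function names and the toy alignment are hypothetical, and only the per-residue color assignments are taken from the text.

```python
# Illustrative sketch only; not the seqplot program used in the study.

# Per-residue colors as listed in the text.
RESIDUE_COLORS = {
    "A": "medium orange", "C": "yellow", "D": "red", "E": "pink",
    "F": "dark violet", "G": "yellow-green", "H": "pure violet",
    "I": "light orange", "K": "light blue", "L": "dark brown",
    "M": "medium brown", "N": "dark lime green", "P": "dark orange",
    "Q": "light green", "R": "dark blue", "S": "lime green",
    "T": "dark cyan", "V": "light brown", "W": "light violet",
    "X": "light gray", "Y": "maroon", "-": "dark gray",
}


def variable_positions(alignment):
    """Return 0-based column indices at which the aligned sequences differ."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(length) if len({seq[i] for seq in alignment}) > 1]


def proteotype(alignment):
    """Drop conserved columns and map each remaining residue to its color."""
    cols = variable_positions(alignment)
    colors = [[RESIDUE_COLORS[seq[i]] for i in cols] for seq in alignment]
    return cols, colors


# Toy alignment (hypothetical sequences, not data from the study).
toy_alignment = ["MESEV", "MGSEV", "MESEV"]
cols, colors = proteotype(toy_alignment)
print(cols)
print(colors)
```

In this toy case only the second column varies, so each sequence is reduced to a single colored cell; real proteotype profiles would contain one such cell per variable site and strain.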

Statistical analyses

We investigated the effect of single-residue variations on the proteome of 31 North American H1N1 IAVs in association with their pathogenicity previously identified in DBA/2J mice12. Additionally, we investigated the effect of the host origin of virus strains and the host-residue interactions in relation to pathogenicity. Survival data were used to examine the pathogenicity of each selected virus in the infected mice (n = 150 mice; 5 mice/virus strain). A/green-winged teal/LA/Sg-00090/2007 caused death in only 1 mouse, at 11 dpi; thus, it was defined as an outlier and excluded from the data analysis.

More specifically, the Cox proportional hazard model was used to evaluate simultaneously the association between mouse survival time after exposure to the virus and three explanatory variables: the amino acid residue at each position, the host origin of the virus (Anseriformes or Charadriiformes) and the interaction between amino acid residue and virus host. The coefficient for each explanatory variable in the Cox model is adjusted for confounding by the other explanatory variables, i.e. it represents the effect when all other explanatory variables are held constant. For the amino acid residue, we required at least two groups of residues at each position and at least two viruses in each amino acid group. We then sorted the amino acid residues at each polymorphic site in alphabetical order; the first one was used as the reference. For the host effect, Anseriformes was chosen as the reference.
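The site-inclusion rule and the choice of reference residue can be sketched as follows. This is a hedged illustration rather than the code used in the study: the function names and example residues are hypothetical, and the sketch assumes that a site is excluded whenever any residue group contains fewer than two viruses (the text could also be read as dropping only the small groups).

```python
from collections import Counter


def eligible_site(residues, min_groups=2, min_per_group=2):
    """Return True if a site has at least `min_groups` distinct residues and
    every residue is carried by at least `min_per_group` viruses.

    `residues` holds one amino acid per virus at a single alignment position.
    Assumption: a single under-represented residue group excludes the site.
    """
    counts = Counter(residues)
    return (len(counts) >= min_groups
            and all(n >= min_per_group for n in counts.values()))


def reference_residue(residues):
    """Return the alphabetically first residue, used as the reference level."""
    return sorted(set(residues))[0]


# Hypothetical residues at one position across five viruses.
site = ["S", "N", "N", "S", "N"]
print(eligible_site(site))       # two groups, each with at least two viruses
print(reference_residue(site))   # alphabetically first residue

# A site with a singleton residue group fails the rule under this reading.
print(eligible_site(["N", "N", "N", "S"]))
```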

The Cox proportional hazard model is a semiparametric model that uses maximum-likelihood (ML) estimation to estimate the coefficient for each parameter while controlling for the confounding effects of the other included variables. The likelihood is the product of several likelihoods, one for each death time. Once the ML estimation is complete, the Wald test is used to determine the p-value for the coefficient34. The Wald test uses a z-statistic obtained by dividing the estimated coefficient by its standard error; this quantity is assumed to follow approximately a standard normal distribution. A positive coefficient means that the hazard (risk of death) is higher for a mouse exposed to a virus with a higher value of that parameter. A positive coefficient for a residue showed that the non-reference residue was more pathogenic after adjusting for the host effect, and vice versa. A positive coefficient for a host showed that H1N1 viruses of Charadriiformes origin were more pathogenic after adjusting for the residue effect, and vice versa. The interaction term showed whether the residue effect on survival differed between the two hosts; for instance, a substitution at a site may be associated with increased pathogenicity in viruses of one host origin and decreased pathogenicity in viruses of the other. The p-value for each factor was adjusted using the Benjamini & Hochberg method ("BH" or its alias "FDR")35. An adjusted p-value of 0.01 or less was deemed significant. Significant residues that conferred more pathogenicity but appeared in the least pathogenic strain, A/green-winged teal/LA/Sg-00090/2007, were removed. All analyses were performed with the survival package in R 2.15.1 (http://www.r-project.org/).
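The Wald test and the Benjamini & Hochberg adjustment described above can be sketched in a few lines of Python. The analyses in this study were run in R; the functions below are an independent illustration with hypothetical names and input values, and they assume a two-sided p-value for the Wald z-statistic.

```python
import math


def wald_test(coef, se):
    """Return the z-statistic and two-sided p-value for a coefficient,
    treating coef / se as approximately standard normal."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order.

    Each sorted p-value is scaled by m / rank, and monotonicity is enforced
    by taking a running minimum from the largest p-value downward.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical coefficient and raw p-values, not results from the study.
z, p = wald_test(coef=1.2, se=0.4)
print(round(z, 2), p)

raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [a <= 0.01 for a in adj]  # threshold used in the text
print(adj, significant)
```

With many sites tested per protein, the adjustment can raise a raw p-value well above the 0.01 threshold, which is why the threshold is applied to the adjusted rather than the raw values.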

Structural modeling and interpretation of H1N1 viral variants

Structures that overlapped the following proteins were downloaded from the PDB36: PB2 [PDB: 2VQZ37 and 4ENF38 for cap-binding domain (318-483 aa); and 2VY839 for host-specific domain (538-678 aa)]; PB1 [PDB: 3CM8, 2ZTT40]; PA [PDB: 2ZNL41]; HA [PDB: 2WRH42]; NP [PDB: 3RO543]; NA [PDB: 2HTY44 and 3B7E45]; M1 [PDB: 1EA346]; M2 [PDB: 2H9547]; NS1 [PDB: 3L4Q48, 3D6R49 and 3PDV32]; and NEP [PDB: 1PD321]. NS1/PI3K-interface residues were identified with LigPlot+. Figures were prepared, and mutations and analyses were performed, using PyMOL. Surface potentials were estimated using the APBS plugin in PyMOL. Linear electrostatic charge was estimated using default options in the protcalc webserver (http://protcalc.sourceforge.net/). The location of the variants was visually inspected on available structures for potential functional impact.
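As a rough illustration of comparing the linear charge of the C-terminal NS1 motifs, the sketch below counts formally charged side chains. It is not the protcalc calculation: it assumes neutral pH, counts K and R as +1 and D and E as -1, and ignores histidine, the peptide termini and residue-specific pKa values. The function name is hypothetical.

```python
# Crude side-chain charge count; an assumption-laden stand-in, not protcalc.
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate


def net_side_chain_charge(peptide):
    """Return (#K + #R) - (#D + #E) for a peptide given in one-letter code."""
    return sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in peptide.upper())


# The three C-terminal motifs discussed in the text.
for motif in ("ESEV", "GSEV", "RSEV"):
    print(motif, net_side_chain_charge(motif))
```

Under these simplifications the count becomes less negative from ESEV to GSEV to RSEV, which is in line with the qualitative argument that replacing E227 removes an acidic group from the motif.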

Data Access

The sequences for the whole genome of 31 North American avian H1N1 viruses sequenced in this study can be accessed from the NCBI GenBank under accession numbers KF424015–KF424262 (Supplementary Table S1).