Subtype-specific structural constraints in the evolution of influenza A virus hemagglutinin genes

The influenza A virus genome consists of eight RNA segments. RNA structures within these segments and complementary (cRNA) and protein-coding mRNAs may play a role in virus replication. Here, conserved putative secondary structures that impose significant evolutionary constraints on the gene segment encoding the surface glycoprotein hemagglutinin (HA) were investigated using available sequence data on tens of thousands of virus strains. Structural constraints were identified by analysis of covariations of nucleotides suggested to be paired by structure prediction algorithms. The significance of covariations was estimated by mutual information calculations and tracing multiple covariation events during virus evolution. Covariation patterns demonstrated that structured domains in HA RNAs were mostly subtype-specific, whereas some structures were conserved in several subtypes. The influence of RNA folding on virus replication was studied by plaque assays of mutant viruses with disrupted structures. The results suggest that over the whole length of the HA segment there are local structured domains which contribute to the virus fitness but individually are not essential for the virus. Existence of subtype-specific structured regions in the segments of the influenza A virus genome is apparently an important factor in virus evolution and reassortment of its genes.


Results
Searching for potential conserved RNA structures within HA subtypes. From previous work in the search for structured regions in the influenza virus RNA segments 8,15,17 it was anticipated that the conserved structural elements in the HA segment, if any, could be relatively small domains with moderate thermodynamic stability. The deviations from thermodynamically optimal structures, common in any large RNA molecule, are expected to be even more significant in the native influenza vRNA due to its functioning in the structure-destabilizing complex with multiple NP molecules [18][19][20] . Thus, the predictions obtained by algorithms for detection of conserved structures in related RNA sequences 33,34 , which are heavily based on RNA thermodynamics, are expected to be only partially correct in case of HA vRNA, which apparently lacks a global ordered structure 17 . On the other hand, functional significance of predicted local structural elements, irrespectively of their thermodynamic properties, can be estimated using analysis of nucleotide covariations that indicate significant evolutionary constraints and thus functional importance. Thanks to the availability of a large number of sequences from various influenza virus strains, mutual information calculations can help to identify significant covariations 15 .
Usage of different sequence datasets (see Materials and Methods) gave the opportunity to generate alternative secondary structure models for each of the HA subtypes. The differences between alternative models for a given subtype were variable, depending on intra-subtype similarity levels. For instance, in three predictions of H1 HA structures only 31% of predicted base pairs were common (149 out of 481), and 31-53% of predicted base pairs were the same in pairwise comparisons of the models, while in the predictions for less diverse H15 HA segments more than 99% of predicted base pairs were present in all three models. The overlaps between alternative models were considered as putative conserved functional local structures. In order to distinguish functional structural RNA motifs that exhibited significant constraints on HA sequences from potential artifacts of folding predictions due to sequence similarity of protein coding regions, the analysis of covariations in conserved local structures was performed.
The predictions of putative subtype-specific secondary structures by the RNAalifold and RNAz algorithms in HA segments, followed by mutual information calculations as described in Materials and Methods, yielded a relatively small number of significant covariations that could be attributed to RNA structural constraints. The significances were evaluated by ratios R 1 (xy) = M(xy)/H(x) and R 2 (xy) = M(xy)/H(y), which reflect correlations between substitutions of bases involved in putative base pairs, taking both mutual information M(xy) and variabilities (entropies) at two positions into account. Covariations with at least one correlation value R 1 (xy) or R 2 (xy) higher than 0.8 are listed in Table 1. These values correspond to correlations between substitutions of bases involved in putative base pairs, and such ratios of mutual information and positional entropies take the variabilities at two positions into account. Folding simulations on permuted HA sequence datasets (Supplementary Methods) showed that lower correlation values corresponded to p-values of 0.05 and higher. Covariations with a threshold of 0.5 for correlation values are given in Supplementary Table S1.
Three structures that contained more than one covariation characterized by a score of at least 0.8 were identified: in human H3N2 strains, in H7 viruses and in H15 viruses ( Supplementary Fig. S1). The presence of multiple covariations may be considered as additional support for functional importance of the structures (see Supplementary Methods). Lowering the covariation threshold to 0.5 did not reveal any additional structures with multiple covariations in the RNAalifold and RNAz models.
Multiple covariation transitions in the 794-839 domain of HA segments from human H3N2 viruses. Four covariations, each having a correlation value of at least 0.7, were identified in the stem-loop structure predicted in the region 794-839 of HA segments from human H3N2 strains ( Supplementary Fig. S1,  Fig. 1). The covariation transitions occurred in different periods of the evolution of these viruses (Supplementary Tables S2-S5) while the substitutions of covariations 801-833 and 807-820 together lead to three amino acid changes (Fig. 1). The evolution of covarying nucleotides is punctuated, because a single change in a pair dominating over some period of time is quickly followed by a compensatory change defining a new dominating pair that is not prone to frequent changes (Supplementary Tables S2-S5). These covariation events did not exactly coincide with punctuated antigenic cluster transitions in the encoded HA protein 35,36 : the covariations and/or intermediate mismatch  combinations with single substitutions first appeared in the clusters preceding those where the sequences with  double changes dominate. One of the covariations in the 794-839 domain (807-820) seems to be determined by non-canonical base pairing (Fig. 1). In the early human H3N2 strains (before 1975) the dominating combination is GA (Supplementary Table S4). The strains with the G807U change, resulting in a Watson-Crick UA pair, which circulated in 1974-1981, were in turn substituted by viruses with the secondary mutation A820C, resulting in the UC pair that dominates since its first appearance in 1982. This implies that GA and UC are the preferred combinations at this position, while the Watson-Crick UA pair is less favorable.
In a number of HA segments from human H3N2 strains, an alternative conformation of the 794-839 domain with the same bottom stem and different apical part has lower free energy ( Supplementary Fig. S1). Furthermore, another conformation was predicted by the RNAalifold algorithm in H3 segments of avian strains, where the bottom helix exhibited a covariation at positions 801-833 between GC and AU, similar to human viruses ( Supplementary Fig. S1). Although the correlation values computed for these positions in H3 segments of avian origin are less significant (R 1 (xy) = 0.56; R 2 (xy) = 0.48) than those in human strains (Table 1), this independent covariation event within the avian branch yields more confidence in the pairing of the 794-801/833-839 helix. Consistent with the different base pairs in the domain interior, no correlation between the changes at positions 806-821 and 807-820 were recorded in avian strains. In contrast, positions 809-818 in avian H3 segments did display an independent covariation (M(xy) = 0.26; R 1 (xy) = 0.42; R 2 (xy) = 0.44) between AU and GC, similar to human strains. Thus, in all H3 HA segments six independent covariation events are observed in this domain with its interior part apparently adopting alternative conformations ( Supplementary Fig. S1).
Various RNA structures in the HA cleavage site region are conserved in more than one HA subtype. Two structures with homologs in different HA subtypes and covariations characterized by correlation values of at least 0.5 were detected using RNAalifold (Figs 2 and 3). Both are located in the central HA region, which encodes the amino acid motif required for HA polypeptide cleavage into HA1 and HA2 chains 37 .
One of these structures is a small hairpin in H7 and H10 subtypes, located 12 nt downstream of the nucleotides corresponding to the cleavage site. The H7 and H10 covarying positions are homologous to each other ( Fig. 2) and undergo synonymous changes with the same continent-specific pattern: dominating UR in Eurasia and RU in America, where R is A or G (Supplementary Table S6). Not surprisingly, the hairpin is also possible in H15 segments, which form a monophyletic cluster together with H7 and H10 38,39 . Computed for the whole cluster, the correlation R 1 (xy) is 0.76, which is relatively high. Furthermore, it should be noted that the H7 and H10 covariations are separate events with covariation R 1 (xy) scores of 0.78 and 0.74 (Supplementary Table S1), respectively, lending more reliable support for the hairpin's functional importance (Supplementary Methods).
Another structure, supported by covariations in different subtypes, was detected in H3 and H5 viruses, somewhat unexpectedly bearing in mind their distant phylogenetic relationship 38,39 . This stem-loop structure flanks the nucleotides encoding the HA protein cleavage site motif (Fig. 3a,b). The H3 structure, predicted in human H3N2 viruses, is supported by a covariation at position 1040-1073 (H3 numbering), characterized by one of the highest correlation values in this analysis: 0.90 ( Table 1). The homologous H5 structure is detected in the negative-sense vRNA predictions by the RNAalifold algorithm but it is also thermodynamically stable in positive-sense RNA. It exhibits a weaker covariation with R 2 (x,y) = 0.59 at base pair 1036-1057 in H5 numbering (Supplementary Table S1), located closer to the hairpin loop as compared to the H3N2 covariation ( Fig. 3a,b). All four positions, defining these two covariations in H3N2 human strains and H5 viruses, are wobbles in the HA codons, leading to silent substitutions.
High correlation values of the H3N2 1040-1073 covariation are determined by a fast transition from a GC pair to AU, occurring in 2002/2003 (Supplementary Table S7). Analysis of segment sequences and encoded HA proteins showed that the covariation occurred upon the transition between two antigenic clusters: SY97 and FU02 35 .
The H5 1036-1057 covariation is determined by two transversions converting YR typical for Eurasian strains into RY predominant in American H5 viruses (Supplementary Table S8), where Y is U or C and R is A or G. Due to frequent occurrence of both Watson-Crick combinations and mismatches AC or GU in this pair, its correlation values are not very high (Fig. 3b, Supplementary Table S8). On the other hand, within the American lineage a separate covariation event to the Eurasian-like UA pair was detected in 33 Californian strains of 2013 (Supplementary Table S8).
Free energy minimization structure predictions using the Mfold program 40 showed that the hairpins predicted by RNAalifold in this domain (1036-1079 and 1036-1057 in H3N2 and H5 viruses, respectively) can be further extended into larger conserved stem-loop structures formed by homologous nucleotides in H3N2 and H5 viruses ( Fig. 3a,b). This extended domain of 75 nucleotides is closed by an imperfect helix of 10 base pairs, interrupted by mismatches at different positions. The extension is significantly destabilized or unlikely to form at all in HA segments of many American H5 viruses ( Supplementary Fig. S2). Furthermore, in some American strains alternative stem-loops, also flanking the cleavage site region, are more favourable ( Supplementary Fig. S2). Nevertheless, the base pair between two wobble positions 1009-1081 in the conserved extended stem of H5 strains (Fig. 3b) exhibited a weak covariation (R 1 (xy) = 0.54), with an AU pair predominantly in Eurasia and GC in America (Supplementary Table S9).
Because of the conservation of this domain in such distant subtypes as H3 and H5, the potential for its folding was explored in all other HA subtypes. A similar extended stem-loop structure was found to be conserved in H6 strains, with an even more stable homologous bottom stem, mostly consisting of a perfect helix of 10 base pairs in positive-sense RNA (Fig. 3c). This domain-closing stem is also conserved within the monophyletic Scientific RepoRts | 6:38892 | DOI: 10.1038/srep38892 cluster consisting of H5, H6, H2 and H1 subtypes 38,39 . The domain interior base pairs are not conserved, leading to structural diversity with stem-loop structures in some H1 and H2 strains and a branched Y-shape in others ( Supplementary Fig. S3). In many H1 and H2 viruses alternative structures have considerably lower folding free energies and are probably more favorable. In other subtypes, the folding of this domain is not conserved.
The folding of the domain bottom stem in the strains of the H5/H6/H2/H1 cluster (Fig. 3, Supplementary  Fig. S3) is supported by a covariation in one of its base pairs (Fig. 3). It covaries within H6 subtype viruses (positions 1010-1070, R 1 (x,y) = 0.67) and exhibited even higher correlation values when computed for the whole cluster: R 1 (x,y) = 0.75 (Supplementary Table S10). Comparison of the dominant nucleotide combinations with the phylogeny of these subtypes 38,39 suggests at least two independent covariation events at this base pair. One transition occurs within the H6 subtype, with the dominant UR substituted by RU in a branch of American strains, another separates H5 with its dominant UA from H1 and H2 that have predominantly RY (Supplementary Table S10). In the negative-sense H6 HA sequences, the bottom stem is less stable thermodynamically, and the RNAalifold predictions yield an alternative hairpin with a relatively strong covariation 1070-1094 (Supplementary Table S1).
The conserved structure in the H5 cleavage site region is stabilized in the highly pathogenic A/goose/Guangdong/1/96 (H5N1)-like viruses. Highly pathogenic avian influenza A (HPAI) viruses are characterized by multibasic cleavage motifs in the HA protein, which are created in H5 and H7 strains by inserting one or several codons coding for basic amino acids in the cleavage site region 37 . These insertions occur in the region corresponding to the hairpin loop of the predicted H5 subtype structure, so they are not expected to disrupt this folding (Figs 3 and 4). Remarkably, the structure is significantly stabilized further in the lineage of the most widespread H5 HPAI viruses that originated from the A/goose/Guangdong/96 (H5N1) strain (gs/GD/96) and that continue to circulate to the present day 41 .
The stem-loop structure in the ancestor gs/GD/96 HA segment, with an insertion of 12 nucleotides (Fig. 4b), has a folding free energy Δ G°3 7 of − 18.6 kcal/mol in the positive-sense RNA, still comparable to the values calculated for homologous structures in low pathogenic avian influenza (LPAI) H5 HA genes of the lineages closely related to the gs/GD/96-like segments 42   lowest free energy local conformation for this HA sequence fragment, which is not observed in all H5 segments. The corresponding stem-loop in the gs/GD/96 negative-sense vRNA is also the thermodynamically optimal structure with a free energy of − 16.9 kcal/mol. Analysis of HA RNA sequences of the strains belonging to multiple clades that evolved from the gs/GD/96 lineage during 1996-2014 41 showed that several stably inherited substitutions significantly stabilized the predicted stem-loop structure (Fig. 5). The most stable structures are formed in the segments of the currently most widely spread clade 2.3.4 strains that accumulated 5 substitutions leading to free energies of − 32.9 kcal/mol (Fig. 4c). In particular, these 5 substitutions are observed in subclade 2.3.4.4 segments that are present in reassortant H5N8, H5N2 and H5N6 HPAI viruses isolated recently in Asia, Europe and America 41 . Structure-stabilizing substitutions were also observed in other clades of the gs/GD/96 lineage, the only exception being the clade 2.1.3.2 with several destabilizing mutations (Fig. 5). No extraordinary stabilization of RNA structures in the cleavage site region as compared to LPAI viruses was found in the H5 HA segments of the known HPAI outbreaks other than those caused by the gs/GD/96 lineage 43 . Multibasic cleavage sites of highly pathogenic H7 HA segments are also formed by insertions into hairpin loops. Similar to the insertions in the HA segments of HPAI H5 strains, the insertions in H7 HPAI segments seem to occur in the hairpin loops of LPAI H7 HA RNA structures. In the cleavage site region of H7 HA segments, different alternative structures were predicted (Supplementary Fig. S4). Close values of folding free energies of alternative stem-loop structures in this region suggest a possibility of conformational transitions, with different alternatives favored in various strains. The 1050-1059 hairpin, also conserved in H10 and H15 subtypes (Fig. 2), interferes with the extensions of these stem-loops, apparently stabilizing smaller domains ( Supplementary Fig. S4). An equilibrium between alternative conformations is supported by involvement of nucleotide 1059 in two covariations: 1050-1059 within the 1050-1059 hairpin and 1020-1059 in the bottom stem of one of the stem-loop structures flanking the cleavage site region. The H7 HPAI HA stem-loop domains are mostly stabilized by the base pairs formed by inserted sequences (Supplementary Fig. S4) that may originate from recombination with other RNA molecules 43   In H1 HA segments, these signals include both vRNA termini, containing 3′ and 5′ untranslated regions (UTRs) and extending to the coding part by 45 and 80 nucleotides from the 3′ -and 5′ -ends, respectively 44 . Among the vRNAs of all subtypes, only two hairpins with covariation correlation values of at least 0.8 were predicted in the regions homologous to these sequences: in the 3′ domain of H13 vRNA and in the 5′ domain of H14 vRNA (Supplementary Table S1, Supplementary Fig. S5). It should be noted, however, that the high correlation values of 1.0 obtained for covariations in all structures predicted in H14 HA segments (Supplementary Table S1) might be determined by the small number (N = 16) of available sequences. The structure with two covariations in the H15 3′ packaging signal region ( Supplementary Fig. S1) was only predicted in positive-sense RNA sequences and was not present in the negative-sense vRNA predictions.
Lowering the correlation value threshold to 0.5 adds only one hairpin predicted in the 3′ packaging signal of H1 HA vRNA at positions 78-67 (Supplementary Table S1, Supplementary Fig. S5). Although its covariation at base pair 68/77 has a relatively low correlation value of 0.62 (statistically unreliable, with an estimated p-value of about 0.6), possible functional importance of the hairpin is supported by prediction of a homologous hairpin in the H2 vRNA, which, in turn, exhibits a weak covariation (R 2 (xy) = 0.31) at another base pair ( Supplementary Fig. S5). Furthermore, a hairpin with a similar tetranucleotide loop sequence and a covariation with comparable correlation values was predicted in the H6 vRNA, but the position shifted 3 nucleotides downstream in relation to H1 and H2 hairpins according to the alignment of these sequences. The H13 vRNA hairpin is located 9 nucleotides downstream in the alignment as compared to that of H1 and H2 subtypes, but in distance from the vRNA 3′ end deviating by only one nucleotide from that of the H2 hairpin ( Supplementary Fig. S5).

Impact of conserved structures on virus replication.
In order to study the influence of conserved RNA structures in HA segments on influenza A virus replication, mutant viruses were designed using reverse genetics. For these experiments, the following structures located in different HA regions were selected: domain 794-839 of H3N2 human strains (Fig. 1), the stem-loops flanking the cleavage site regions in H3N2 and highly pathogenic H5N1 viruses (Figs 3a and 4c), and the hairpin in the packaging signal region of H1 HA segments ( Supplementary Fig. S5).
The series of mutations, introduced into the studied structures, were designed to disrupt the predicted base pairs by substitutions that could also complement each other, restoring the predicted structures in double mutants (Fig. 6, Supplementary Figs S6-S8). All mutations introduced in the cleavage site region and packaging signal structures were silent for the encoded HA amino acid sequence, while some non-silent mutants of domain 794-839 reproduced natural mutations in H3N2 viruses. The replication of mutants was studied by plaque assays, which were previously shown to be sufficiently sensitive to reveal effects of structure disruption and reconstruction in the NP segment RNA 15 .
In the experiments with HA mutants, significant effects on virus replication were detected only for some of the substitutions in the 794-839 domain of H3 HA RNA (Fig. 6). In contrast to all other designed HA mutants, it was not possible to rescue mutants D1 and D3, and despite three attempts still no recombinant virus was produced. Thus the combination of mutations G806U and G807U (mutant D1) severely affected the virus, whereas each of these substitutions separately was not detrimental for virus replication (mutants B1 and C1, Fig. 6). The substitution G806U is silent, therefore it can be concluded that the synergistic effect of the two mutations is determined by RNA misfolding rather than by the amino acid change (V244L) due to the G807U substitution. The compensatory substitutions, introduced in the mutant D3, were not sufficient to restore virus viability. However, the F1 mutant which contained, in addition to D1 mutations, the substitutions G801A and A809G, replicated similar to the wild type virus (Fig. 6). Remarkably, these differences are consistent with the history of substitutions recorded in natural human H3N2 strains ( Fig. 1): the combination of G806U and G807U was not observed before G801A and A809G had occurred.
No obvious correlation was found between the observed behavior of mutants and the introduced changes in codon usage according to available data on codon frequencies in human or influenza virus genomes 45 . In mutants replicating similar to the wild type virus, both the changes that substituted more common codons by more rare ones and vice versa were used. As far as D1 mutant is concerned, one of its substitutions (B1) replaced the relatively rarely used (in human genes) GUA codon (Val) by another rare one UUA (Leu), while silent substitution C1 replaced commonly used CUG Leu codon by less frequent CUU. This was also reflected in the estimate of heterologous gene expression in 293 T cells according to the Codon Adaptation Index (CAI), computed using the CAI calculator 46 for the wild type and mutant domains 794-839: the wild type CAI value was 0.65, while it was slightly decreased to 0.63, 0.60 and 0.59 in B1, C1 and D1, respectively. As seen, the CAI decrease in D1 is mostly determined by the C1 mutation, so this value alone cannot explain the phenotype difference between D1 and C1. Predicted attenuating effect of changing two neighbor codons 47,48 suggested the opposite trend in D1: according to the data on ratios between observed and expected codon pair frequencies in human genes 47 , the pair of mutated codons (CUU and UUA) is over-represented with a ratio of 1.402, while the wild type pair (CUG and GUA) is under-represented (0.854). Thus, although we cannot exclude some effects of codon substitutions, they are unlikely to be the reason of severe effects in D1 and D3 mutants, and it should be noted that influenza A viruses, attenuated by non-optimal codon compositions, are still able to replicate despite containing hundreds of de-optimized codons 48,49 .
Other designed HA mutants in this and other HA domains exhibited no significant replication deficiency which could be attributed to RNA structure disruption (Fig. 6, Supplementary Figs S6-S8). Even disruption of multiple base pairs in the structures did not result in noticeable effects. Significantly smaller plaque sizes, caused by substitution U821A in C2 and C3 mutants (p < 0.01 and p < 0.001, respectively) of the 794-839 domain in H3 HA segments (Fig. 6) were apparently caused by the resulting amino acid change N248K in the encoded HA protein, which was never observed in natural strains. The mutant D2 with the U821A substitution being silent in combination with A820C coding for the observed change N248T (Fig. 1), did not deviate from the wild-type (Fig. 6).

Discussion
The nucleotide covariation analysis of conserved RNA structures in the HA segment of the influenza A virus genome, presented here, suggests a number of local structured domains exist, which are specific for certain HA subtypes only and are not conserved in all 16 known subtypes. In contrast to the NS, M and NP segments, where several structures conserved in all influenza A strains have been identified in the coding regions 5,8,15 , no universally conserved structure is found in HA segments except for the well-known panhandle structure at the termini 4 .
A given covariation can serve as reliable evidence for a structural model only if it is likely that the observed bias at the putative base pair is indeed determined by RNA structure rather than by other evolutionary pressures or chance alone. Evaluation of covariation-based evidence for functional coevolution of two sites in a biopolymer is not straightforward [50][51][52] . A number of statistics have been proposed that measure how different the observed pattern of nucleotide pairs is from the expected upon independent changes at each of the positions. It is not the goal of this study to evaluate these measures, because they all have similar flaws in distinguishing functional coevolution from independent changes occurring in particular clades 52 . The metrics M(xy)/H(x) and M(xy)/H(y), which normalize mutual information values with the entropies of two sites, were chosen to evaluate the extent of structural constraint in putative covariations in assumption that the sites were paired indeed. The behavior of these metrics in datasets of well-established tRNA and rRNA structures was studied in detail 53 , allowing us to make comparisons. Folding simulations using datasets of permuted H2 HA sequences showed that single covariations with correlation values of less than 0.8 corresponded to p-values higher than 0.05 in structure predictions and thus did not provide sufficient support for base pairing (Supplementary Methods). As many base pairs in functional structures are not expected to have better correlations 53 , the conserved structures cannot be reliably identified by correlation values of single covariations.
An inadequacy of a covariation metric alone in the discrimination between different mechanisms of nucleotide coevolution can be compensated by taking the phylogenetic history of covariations into account [50][51][52][53][54] . Correlated changes in a putative pair occurring independently in more than one branch of a phylogenetic tree provide significantly more confidence in the base-pairing constraint. The same is true in case of covariations at different base pair positions in a given RNA structure element. The estimates based on permuted HA sequences show that even two independent covariation events with inconclusive individual covariation scores may serve as a strong support for a structural model (Supplementary Methods).
Indeed, several RNA structures supported by more than one covariation event were predicted in HA segments, pointing to important constraints determined by secondary structure. Interestingly, the most reliably supported structures were detected in the central regions of the HA segment, in particular, the region coding for the cleavage site motif in the HA protein. These structures turned out to be conserved in different HA subtypes, with independent covariations occurring at homologous positions.
On the basis of covariation data it is impossible to conclude whether predicted local hairpins are functional in the positive-sense RNA (mRNA or cRNA) or negative-sense vRNA, and folding free energy differences cannot be used in support of a conclusion because of the influence of NP binding on RNA structures within RNPs [18][19][20] . The HA vRNA hairpin loops in all regions of the segment may be involved in loop-loop contacts with other segments required for efficient copackaging in the virions [25][26][27] . Strong conservation of these hairpins in some HA subtypes and their disappearance in others is consistent with frequently observed non-random reassortment of compatible segments 26,[30][31][32] . In positive-sense mRNA, locally stable structures may be involved in modulation of protein folding, with their folding free energies slowing down the ribosome movement and thus facilitating correct cotranslational formation of native domain structures 55 , or in interference with host innate immunity 56,57 .
The insertions coding for multibasic cleavage site motifs in the HA segments of highly pathogenic H5 and H7 viruses occur in the hairpin loops of stem-loop structures flanking the cleavage site region so that the folding topology remains the same (Fig. 4, Supplementary Fig. S4). Furthermore, in the worldwide circulating gs/GD/96 lineage of highly pathogenic H5N1 avian influenza viruses, the destabilizing effect of the increased hairpin loop size is compensated by mutations stabilizing the double-helical part of the structure. Apparently, maintaining a stable folding of this domain is important for virus fitness. The structured character of this HA segment region is also predicted by statistics of folding free energies and codon conservation 8 . Various local structures in this and adjacent regions were previously suggested to regulate sequence duplications by transcriptional slippage on polypurine sequences in highly pathogenic H5 segments 58,59 .
The conserved H5 and H7 structures in the central HA domain (Fig. 4, Supplementary Fig. S4 ), folded either in positive-sense cRNA or negative-sense vRNA, may be an important factor affecting sequence insertion and recombination that lead to a pathogenic phenotype. The location of insertion sites in the hairpin loops is suggestive for a recombination mechanism with hairpin-mediated template switching, resembling the copy choice mechanism in retroviruses 60 . On the other hand, transcriptional slippage of a DNA-dependent RNA polymerase was shown to be stimulated by nascent RNA hairpins located near the slippage site 61 , and a similar topology was suggested to explain RNA editing by the RNA-dependent RNA polymerase of Ebola virus 62 . Stabilization of the H5 HA structure in the gs/GD/96 lineage years after the slippage has occurred and the covariation in the homologous fold in human H3N2 viruses (Fig. 3, Supplementary Table S7) suggest that this stem-loop structure has also some function which is not directly related to evolving multibasic cleavage site motifs.
A small hairpin, predicted to fold in the region essential for vRNA packaging ( Supplementary Fig. S5), exhibited only weak constraints in covariations despite the conservation of its motif in different subtypes. Such a pattern of independent covariation events indicates a functional role of the hairpin which either has a redundant function or can tolerate mismatches in the stem. Plaque assay experiments with mutated hairpins did not reveal any effect of this structure on virus replication (Supplementary Fig. S8). This is similar to the results obtained on packaging of NS segments with randomized packaging signal sequences, showing that sequences without any obvious sequence or structure similarity to the wild-type were sufficient for segment incorporation 63 . However, this is in contrast to the complementary mutagenesis of the structures predicted in the NP segment packaging signal, which showed their importance for efficient virus replication 15 .
A strong effect of the specific combination of mutations in the 794-839 domain of human H3N2 viruses on virus viability (Fig. 6) seems to be determined by a misfolded RNA conformation prohibiting HA segment functioning rather than by just disruption of the conserved structure. Other structure-disrupting combinations of mutations did not impair virus replication. Plaque assays of mutant viruses with disrupted or destabilized structures predicted in the HA cleavage site region also did not reveal any replication defects ( Supplementary Figs S6 and S7). Similar inconsistency between evolutionary conservation of virus structures and their influence on virus replication assays was previously noted for HIV-1 and murine norovirus 56,57 . It has been suggested that RNA folding could have a function that contributes to virus fitness and persistence in nature but is not essential for basic replication under the experimental conditions. For instance, disruption of murine norovirus RNA structures was shown to attenuate the virus in vivo while no effect in cell culture was seen 57 . One of the possible mechanisms is an interference of virus RNA structure with host innate defense responses 56,57 . It is also possible that the role of a single hairpin involved in the network of multiple contacts between genome segments 24-26 is redundant.
Certain correlations between covariation events in the structures of HA segments from human H3N2 viruses (Figs 1 and 3) and HA cluster transitions in antigenic evolution in these viruses 35,36 suggest that punctuated evolutionary trajectories of RNA and encoded protein may influence each other. Thus, in the 807/820 covariation from GA to UC the first G807U change is apparently favorable because of the cluster-difference V244L substitution in the HA protein 35 , while the compensatory A820C change seems to be caused by the need to restore the original RNA structure, leading to another amino acid substitution. The co-occurrence of the 1040/1073 covariation (Fig. 3a) with the SY97/FU02 cluster transition might be a coincidence, because it involves two silent substitutions occurring almost simultaneously on the evolutionary timescale, leading to a punctuated nature of RNA structure evolution resembling that of antigenic HA properties 35,36 .
Presumably, RNA secondary structures displaying nucleotide covariations only partially represent functional RNA folding in the influenza A virus HA segment. The covariation analysis reveals the importance of RNA folding in the HA segment evolution, but the most conserved RNA structures could be hidden from this consideration in genome regions that are less prone to mutations. Predictions of consensus RNA structures by the RNAalifold algorithm contained a number of potential hairpins with (nearly) invariant stems within a given HA subtype. Such hairpins could be formed by sequences containing clusters of conserved HA codons that are thought to determine the segment packaging signals 29 . In the absence of sufficient sequence variation, predictions of these hairpins are not reliable, and different alternative structures are possible. Diverse patterns of highly conserved codons in different HA subtypes 29 suggest that such structures would be subtype-specific, similar to those supported by covariations. Multiple subtype-specific structural motifs in vRNAs could determine specificity of intersegmental RNA-RNA interactions upon vRNA packaging and therefore the selection of specific constellations of compatible segment variants during reassortment 26,[30][31][32] . Elucidation of RNA folding patterns in the influenza A virus genome will significantly contribute to our understanding of the evolution and reassortment of the virus genes.

Methods
RNA structure predictions. Initial predictions of potential structured RNA domains in the influenza A virus HA segments were carried out using two algorithms: RNAalifold 33 and RNAz 34 . The RNAalifold algorithm calculates the secondary structure optimal for a dataset of aligned RNA sequences on the basis of the predicted ensemble of low free energy conformations. RNAz predicts locally optimal structures using a scanning window of the RNA alignment, estimating deviations in these structures from the patterns expected by chance alone. Conserved structures in a given HA subtype were identified by comparisons of alternative predictions yielded by different datasets of representative strains using the method applied previously for the NP segments 15 . Details are provided in the Supplementary Information. Essentially, for every subtype three different datasets of representatives were used for RNAalifold and RNAz predictions. Local structures that were present in at least two out of three RNAz and/or RNAalifold models and exhibited putative covariations in the representative sequences were further considered for evaluations of covariation significance.
Evaluation of covariation significance. Significance of covariations was first evaluated by mutual information calculations using all available sequences of the HA segment of a given subtype, taking positional entropies of sequence alignment into account, as described previously in the study of NP segment structures 15 . Details are provided in the Supplementary Methods. For a putative covariation of nucleotides at positions x and y, the ratios of mutual information M(xy) to positional entropies at the two positions R 1 = M(xy)/H(x) and R 2 = M(xy)/ H(y) were calculated. In the estimates of covariations in RNA structures, these measures are less sensitive to dataset bias than M(xy) scores 15,53 . The confidence levels for single and multiple covariations in a given structure were estimated using structure predictions with permuted HA sequences. Details are provided in Supplementary Methods.
Cells. MDCK and 293 T cells were cultured as described previously 15 . Plasmids. The HA gene segment of A/Anhui/1/05 (H5N1) was cloned into the BsmBI sites of the previously described modified pHW2000 plasmid 64 . The bidirectional plasmids containing the A/PR/8/34 and A/ Indonesia/5/05 gene segments were described previously 64,65 . Mutations were introduced into the HA gene segment of A/PR/8/34 (H1N1), A/Bilthoven/16190/68 (H3N2) and the HPAI A/Anhui/1/05 (H5N1). Primers to introduce the desired mutations were designed using the web based quick change primer design program (Agilent technologies, Amstelveen, the Netherlands). Mutations were introduced using the QuikChange Multi Site-Directed mutagenesis kit (Agilent) according to the manufacturer's instructions or using pfu ultra II fusion as described previously 15 .
Production of recombinant virus. Recombinant viruses were produced using the calcium phosphate method as described previously 15 . Recombinant viruses that contained mutations in the HA gene segment of A/ PR/8/34 (H1) or A/Bilthoven/16190/68 (H3) were produced by transfecting 293 T cells with 5 μ g of each of the 7 A/PR/8/34 gene segments and 5 μ g of the desired mutant or wild-type HA. Viruses containing mutations in the HA segment of A/Anhui/1/05 (H5) were produced with an A/Indonesia/5/05 (H5N1) virus backbone. If no virus was produced after MDCK cells were inoculated with the 293 T supernatant, a second MDCK passage was