Influenza is one of the most important infectious diseases of humans. The annual mortality that is caused by influenza in the United States alone is estimated at over 36,000 deaths (Ref. 1) (Fig. 1), whereas occasional global pandemics can infect 20–40% of the population in a single year (Ref. 2). As a notorious case in point, the pandemic of 1918–1919 caused possibly 20–50 million deaths on a global scale, making it the single most devastating disease outbreak in human history (Ref. 3). The recent uncertainty over whether H5N1 avian influenza virus will adapt to human transmission, and how its spread might be controlled (Refs 4–7), highlights the threat that is posed by influenza and the need to understand its evolutionary dynamics.

Figure 1: The periodicity of pneumonia and influenza mortality and excess mortality rates.

Monthly pneumonia and influenza (P&I) death rates and excess death rates (above the baseline mortality due to other respiratory pathogens) in the United States from 1959 to 2001 are shown (see the web site for the US National Center for Health Statistics). Peaks occur during the winter in northern latitudes at 2–5 year intervals, usually during H3N2-dominant seasons, since the 1968 pandemic. See ref. 81 for more details.

Influenza viruses are single-stranded, negative-sense RNA viruses of the family Orthomyxoviridae that cause regular seasonal epidemics in humans, other mammalian species and birds. Three phylogenetically and antigenically distinct viral types (A, B and C) circulate globally in human populations, although type A viruses exhibit the greatest genetic diversity, infect the widest range of host species and cause the vast majority of severe disease in humans, including the great pandemics. The genome of influenza A virus (total length 13 kb) is composed of eight segments that can be exchanged through reassortment (Fig. 2). Wild waterfowl are the reservoir hosts for type A influenza viruses, harbouring numerous antigenically distinct subtypes (serotypes) of the two main viral antigens, the haemagglutinin (HA) and neuraminidase (NA) surface glycoproteins (16 HA and 9 NA subtypes; Refs 8, 9). These avian viruses occasionally transmit to other species, in which they either cause isolated outbreaks with little or no onward transmission, as is currently the case with avian H5N1 influenza in humans, or, less frequently, become established in new hosts, resulting in (irregular) major human pandemics.

Figure 2: The structure of influenza A virus.

The genome of influenza A virus is composed of eight genomic segments that, by convention, are listed from largest to smallest, although their true arrangement within the spherical virion is unknown. Each segment contains a coding region that encodes one or two proteins, as well as short 5′ and 3′ flanking sequences. Three segments encode proteins that form the virus polymerase complex: basic polymerase 2 (PB2; 2,277 nucleotides in the protein-coding region, in segment 1), which controls the recognition of host-cell RNA; basic polymerase 1 (PB1; 2,271 nucleotides, segment 2), which catalyses nucleotide addition (and which also encodes PB1-F2, a small pro-apoptotic mitochondrial protein that is translated in a different reading frame); and the acidic protein (PA; 2,148 nucleotides, segment 3), which might possess a transcriptase protease activity. Two segments encode surface envelope glycoproteins that function as viral antigens: haemagglutinin (HA; 1,698 nucleotides, segment 4), which is responsible for binding to sialic-acid receptors and entry into host cells, and which is divided into two domains (or subunits), HA1 and HA2; and neuraminidase (NA; 1,407 nucleotides, segment 6), which is involved in budding of new virions from infected cells. A single segment encodes a nucleoprotein (NP; 1,494 nucleotides, segment 5), which binds to the viral RNA. The seventh segment encodes two proteins that share a short overlapping region: the matrix protein M1 (756 nucleotides), which is the main component of the viral capsid, and M2 (291 nucleotides), an integral membrane protein that functions as an ion channel. Segment 8, the smallest segment of the viral genome, encodes a non-structural protein, NS1 (690 nucleotides), which affects cellular RNA transport, splicing and translation. Also encoded on segment 8 in an overlapping reading frame is the NS2 protein (363 nucleotides), a minor component of the virion, the function of which is currently unknown.
A mature virion of influenza A virus is composed of the nucleocapsid, a surrounding layer of M1, and the membrane envelope, which contains the HA, NA and M2 proteins. For more details on the life cycle and replication of influenza virus see refs 88, 89. Figure reproduced with permission from Nature Reviews Microbiology Ref. 11 © (2005) Macmillan Publishers Ltd.

Here we review our current understanding of the evolutionary biology of human influenza A virus, showing how recent advances, particularly in comparative genomics and epidemiology, have shed new light on this important pathogen. We focus on the patterns and processes of influenza virus evolution at the level of recurrent human epidemics, highlighting areas in which future research might prove to be particularly profitable. For details of the biology of avian influenza virus and how it manifests as large-scale human outbreaks, see Refs 10–12.

The determinants of influenza virus evolution

The phylodynamics of antigenic drift. Owing to the large amount of available sequence data, particularly from the HA1 domain of the HA protein, many studies have explored the evolutionary processes that shape the genetic diversity of influenza A virus. Investigating the complex interplay between natural selection, phylogeny and epidemiology is key to understanding influenza A virus evolution13,14. Because the human immune response to viral infection is not completely cross-protective, natural selection favours amino-acid variants of the HA and NA proteins that allow the virus to evade immunity, infect more hosts and proliferate15. This continual change in antigenic structure through time is called antigenic drift16. Although both the HA and NA proteins contain antigenic sites in which immune-driven natural selection can occur, the HA1 domain of the HA protein contains the highest concentration of epitopes and, correspondingly, experiences the most intense positive selection pressure15,17,18,19,20,21,22.

At the phylogenetic scale, the continual selective turnover of amino-acid variants is thought to produce the distinctive 'cactus-like' phylogenetic tree of the HA1 domain from A/H3N2 subtype viruses (Refs 13, 15, 17). A single main trunk lineage depicts the pathway of advantageous mutations that have been fixed by natural selection through time, from past to present, whereas short side branches that stem from this trunk represent those isolates that die out because they were insufficiently antigenically distinct to evade immunity. The apparent regularity of this phylogenetic pattern has generated much interest, because of the potential to predict the future course of viral evolution and, in doing so, aid vaccine strain selection (Ref. 23). Likewise, there is still considerable debate over what aspects of influenza epidemiology so strongly favour the survival of a single HA1 trunk lineage in human A/H3N2 viruses, whereas multiple lineages seem to co-circulate more frequently within populations of equine H3N8 (Ref. 24), human H1N1 (Ref. 13) and influenza virus types B (Refs 25, 26) and C (Ref. 27) (in which the equivalent haemagglutinin-esterase protein is termed HEF).

Although antigenic changes in the haemagglutinin protein are clearly important determinants of viral fitness, the 'progressive' model of influenza A evolution, as typified by the cactus-like phylogeny, was formed on the basis of studies that largely focused on HA1 in isolation, considered relatively few sequences from individual time points and geographical locations, and often targeted strains with unusual antigenic properties in the interests of vaccine design. Indeed, the antigenic evolution of HA1 seems to be more clustered than continuous28. Moreover, the recent explosion of large-scale genome sequence data from H3N2 viruses has shown that the evolutionary pattern that is observed in the HA1 domain does not always apply to the rest of the viral genome29,30. In contrast to the restricted number of lineages that can be observed at any time point in HA1, whole-genome phylogenies show the coexistence of multiple viral lineages, particularly on a limited spatial and temporal scale (Fig. 3). This indicates that the transition among antigenic types does not always proceed in a simple linear manner, that reassortment among coexisting lineages is relatively frequent (see below), and that, for these reasons, predicting the path of influenza virus evolution from sequence data alone will be inherently difficult.

Figure 3: Phylogenetic relationships of concatenated internal proteins.

All segments excluding haemagglutinin (HA) and neuraminidase (NA) of A/H3N2 viruses were sampled from New York State, USA, from 1997 to 2005. Rectangles represent distinct clusters of isolates; the size of the rectangle reflects the number of isolates in the lineage. Roman numerals denote distinct viral lineages that were circulating within each influenza season, with seasons coloured individually. The tree is mid-point rooted for purposes of clarity only, and all horizontal branch lengths are drawn to a scale of substitutions per site (as shown by the scale bar). The phylogeny shows the genetic diversity of influenza A virus in a single locality, including the co-circulation of multiple viral lineages. Adapted from Ref. 30.

Recent analyses also indicate that positive selection on the HA1 domain occurs in a punctuated manner14,30,31. Indeed, the cactus-like structure of the A/H3N2 phylogenetic tree is not in itself conclusive evidence for the action of adaptive evolution, as similar phylogenetic patterns can be generated through a combination of serially sampled (that is, time-structured) data and sequential random population bottlenecks, without strong positive selection. Therefore, the definitive signature of positive selection in the influenza A virus HA protein is not merely the presence of a single trunk lineage, but rather that this trunk is defined by an increased frequency of non-synonymous substitutions, which reflects the continual fixation of (advantageous) amino-acid replacements (Box 1). Similarly, many of the mutations that fall on the side branches of the HA1 tree are likely to be deleterious, and will not achieve fixation even in the absence of immune selection. However, it is important to note that because the computational tools that measure the extent of positive selection are inherently conservative, and quantify the successive fixation of non-synonymous mutations at specific amino-acid sites, adaptive evolution is likely to occur more frequently than is usually detected (Box 1). In the future, methods that account for the rate of amino-acid fixation32 (as opposed to simply considering the total number of fixation events) might offer more analytical power.
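The distinction between synonymous and non-synonymous change that underlies this argument can be illustrated with a minimal Python sketch. It simply counts codons that differ between two aligned, in-frame coding sequences and asks whether the encoded amino acid changes. This is a raw count under stated assumptions (DNA-sense sequences, the standard genetic code, no gaps or ambiguous bases), not a dN/dS estimate: real analyses normalize by the number of synonymous and non-synonymous sites and correct for multiple substitutions. The function name and the example sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(ancestor, descendant):
    """Return (synonymous, non_synonymous) counts of codons that differ."""
    if len(ancestor) != len(descendant) or len(ancestor) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(ancestor), 3):
        old, new = ancestor[i:i + 3], descendant[i:i + 3]
        if old == new:
            continue
        if CODON_TABLE[old] == CODON_TABLE[new]:
            synonymous += 1      # codon changed, amino acid did not
        else:
            non_synonymous += 1  # amino-acid replacement
    return synonymous, non_synonymous


# Hypothetical example: AAA -> AGA replaces Lys with Arg; GGT -> GGC keeps Gly.
print(classify_codon_changes("ATGAAAGGT", "ATGAGAGGC"))  # (1, 1)
```

Applied separately to trunk and side branches of a tree, counts of this kind would give a first, crude impression of where amino-acid replacements accumulate, although the conservative nature of such site-by-site measures, noted above, still applies.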

Although antigenic drift is undoubtedly an important aspect of influenza A virus evolution, as reflected in the changing antigenic profiles33 and the need for continually updated vaccines, recent data indicate that this process does not occur within the time frame of a single epidemic season in a single locality; few amino-acid changes are fixed in HA1 within populations at the seasonal scale34. Consequently, key questions for future research will address the evolutionary and epidemiological processes that drive antigenic drift and the timescale on which this process occurs. To answer these questions will clearly require a far larger sample of influenza virus genomes, with greater resolution in both time and space.

Antigenic maps and cluster jumps. The episodic nature of the antigenic evolution of HA1 has been vividly documented in antigenic maps, one of the most important innovations in studies of viral evolution (Ref. 33). Antigenic mapping involves constructing a matrix of haemagglutination inhibition (HI) assay distances among viral isolates (see Box 2 for more information) and then plotting these to produce a cartographic surface, analogous to a standard geographical map. This approach provides an important insight into evolutionary dynamics because it allows a direct comparison between changes in viral genotype (reflected by the HA amino-acid change) and an inferred phenotype (reflected by the HI distance), although its relationship to measures of overall viral fitness is less clear. Maps of HA from A/H3N2 isolates, which have been sampled since the first appearance of this subtype in 1968, show that major jumps between antigenically distinct clusters of viral sequences occur with a periodicity of roughly 3 years (Ref. 33). Although these antigenic 'cluster jumps' are usually also apparent as long branches on HA1 phylogenetic trees, small genetic changes sometimes have a strong effect on antigenicity. Furthermore, as the cluster jumps tend to correspond to occurrences of vaccine failure (Ref. 35), they evidently represent a better predictor of antigenic novelty than do data from studies of genotypic evolution alone. Although it is clear that our understanding of influenza A virus evolution will greatly benefit from a better understanding of the rules that govern antigenic evolution, as manifest in the path the virus takes across the cartographic surface, determining the epidemiological processes that underlie the periodicity of antigenic evolution will undoubtedly be more complex (Ref. 36).
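As an illustration of the first step of antigenic mapping, the following Python sketch converts a table of inhibition-assay titres into a table of virus-to-antiserum distances. It assumes a convention that is commonly used in antigenic cartography, in which the distance for virus i and antiserum j is the log2 of the highest titre recorded for that antiserum minus the log2 of the observed titre; this convention, the function name and the titre values are assumptions made for illustration and are not taken from the text above. Placing the isolates on a map would then require a multidimensional scaling step, which is not shown.

```python
import math


def titres_to_distances(titres):
    """Convert titres[i][j] (virus i against antiserum j) into distances.

    Assumed convention: distance = log2(max titre for serum j) - log2(titre).
    One unit therefore corresponds to a twofold drop in titre.
    """
    n_sera = len(titres[0])
    # Highest log2 titre observed for each antiserum (each column).
    best = [math.log2(max(row[j] for row in titres)) for j in range(n_sera)]
    return [
        [best[j] - math.log2(row[j]) for j in range(n_sera)]
        for row in titres
    ]


# Hypothetical titres for three viruses (rows) against two antisera (columns).
example_titres = [
    [1280, 40],
    [320, 160],
    [80, 640],
]
for row in titres_to_distances(example_titres):
    print([round(d, 2) for d in row])
```

In this toy table the first and third viruses are each closest to a different antiserum, whereas the second sits roughly midway between them, which is the kind of structure that a map would display as distinct clusters.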

Reassortment in influenza virus evolution. Severe influenza pandemics can occur following a sudden antigenic shift — when a reassortment event generates a novel combination of HA and NA antigens to which the population is immunologically naive. The segmented genome of the influenza virus facilitates reassortment between isolates that co-infect the same host cell. Reassortment among HA and NA subtypes was fundamental in the human pandemics of 1957 (H2N2 subtype) and 1968 (H3N2 subtype), which also acquired a new basic polymerase 1 (PB1) segment37. The origin of the H1N1 strain that caused the severe pandemic of 1918, and whether it jumped to humans directly from an avian reservoir population or first circulated in another mammalian host such as swine, is less clear and the source of much debate38,39,40.

The evolutionary importance of reassortment in recurrent influenza epidemics is also uncertain. Reassortment events can be detected when sequences of different segments from the same isolate occupy incongruent positions on phylogenetic trees. Until recently, it was usually only reassortment involving HA and NA that could be detected in this manner, because the vast majority of publicly available influenza virus sequences comprised just these two proteins. However, the expansion of genome sequence data sets has shown that reassortment can also occur among internal segments, and among human strains of the same subtype29,30,41,42,43,44. As a case in point, a detailed phylogenetic analysis of 413 complete viral genomes from New York State, USA, sampled over a 7-year period, revealed 14 reassortment events that were identified on the basis of incongruent phylogenetic trees of HA, NA and concatenated internal proteins30. Even this result is likely to represent a significant underestimate of the true frequency of reassortment. Some reassortment events are undetectable by phylogenetic analysis because they do not lead to major differences in tree topology, they involve 'parental' isolates that have not been sampled, or they result in unfit progeny. It is therefore imperative that more sophisticated methods are developed to estimate both the rate of reassortment, particularly relative to the rate of mutation for each nucleotide, and the background phylogenetic history of the viral genome in the face of such frequent reassortment45. Such methods will evidently require the ability to more precisely determine which base changes are the result of mutation and which are due to reassortment46.
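The idea of detecting reassortment from incongruent tree positions can be sketched in a few lines of Python. The example below represents two rooted trees for the same isolates as nested tuples, collects the clades (groups of isolates that descend from one internal node) in each, and reports the clades that appear in one tree but not in the other. The tree representation, the function names and the isolate labels are hypothetical. A topological conflict of this kind is only a candidate signal: in practice it would need statistical support, because poorly resolved trees can disagree for reasons that have nothing to do with reassortment.

```python
def clades(tree):
    """Return (leaves, clades) for a rooted tree given as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the group of isolates below this internal node
    return leaves, found


def conflicting_clades(tree_a, tree_b):
    """Return clades found only in tree_a and clades found only in tree_b."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must contain the same isolates")
    return clades_a - clades_b, clades_b - clades_a


# Hypothetical trees for four isolates: the HA tree groups iso1 with iso2,
# whereas the tree of concatenated internal segments groups iso1 with iso3.
ha_tree = ((("iso1", "iso2"), "iso3"), "iso4")
internal_tree = ((("iso1", "iso3"), "iso2"), "iso4")

only_ha, only_internal = conflicting_clades(ha_tree, internal_tree)
print(sorted(sorted(c) for c in only_ha))
print(sorted(sorted(c) for c in only_internal))
```

Here the conflict points to iso1 as a candidate reassortant, because its closest relative differs between the two data sets; events that leave the topology unchanged would, as noted above, go undetected.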

Reassortment might also have an important role in generating evolutionary novelty. For example, reassortment events occurred concurrently with two recent antigenic cluster jumps: the WU95 (Wuhan 1995 strain of H3N2 influenza A virus) to SY97 (Sydney 1997 strain of H3N2 influenza A virus) cluster jump and the SY97 to FU02 (Fujian 2002 strain of H3N2 influenza A virus) cluster jump29,31,41. However, determining cause and effect between a reassortment and a cluster jump is inherently difficult. In the case of the SY97 to FU02 antigenic cluster jump, for which a greater number of samples has allowed better resolution, a reassortment event that involved the HA segment coincided with the emergence of a single dominant lineage with a new HI type from an array of co-circulating lineages29,41. The evolutionary puzzle is why a clearly fit virus — that is, the virus belonging to the FU02 antigenic cluster — did not immediately rise to fixation, but instead circulated at low frequency for several years. One hypothesis is that the intrinsic fitness of the HA segment could not be realized until it was placed in a compatible genetic background, a process that was achieved by reassortment. If subsequent compensatory changes are required to increase compatibility within47 and among segments48, such reassortment events might also entail a burst of adaptive changes across the viral genome (Fig. 4). If correct, this means that the evolution of influenza A virus is more complex than previously realized, and that the evolutionary dynamics of the HA segment must be considered within the context of evolution at the genomic level. Revealing the evolutionary dynamics and fitness contributions of the other viral proteins during epidemic evolution49, and their epistatic interactions14, therefore represents an important direction for future research.

Figure 4: A model for the genome-wide evolution of human influenza A/H3N2 virus.

a | Three lineages of influenza A virus — A, B and C — co-circulate in a population. Lineage A has the highest fitness and is therefore dominant. Each lineage has a unique configuration of genomic segments (represented by small rectangles). Whereas the segments in lineage A are internally compatible, lineage B contains a haemagglutinin (HA) segment of low fitness (shown as a black rectangle), which reduces the overall fitness of this lineage. By contrast, lineage C has a high fitness HA (shown as a red rectangle), which is contained within a lower-fitness genomic background. A reassortment event results in the transfer of the high fitness HA from lineage C into the more compatible genetic background of lineage B. b | Following the reassortment event, lineage B undergoes a burst of (compensatory) adaptive evolution across its genome, increasing its rate of divergence and fitness so that it becomes the new dominant lineage in the population.

The rate of reassortment in influenza A virus also provides insights into the extent of immunological cross-protection, which in turn might have implications for vaccine design. Reassortment among isolates that are assigned to different antigenic types necessarily means that they must co-infect a single cell, implying that protection is not complete at this level of antigenic difference. However, as studies of the intra-host genetic diversity in influenza virus have not been widely undertaken (see below), it remains unknown whether multiple genetically or antigenically distinct lineages co-circulate within individual hosts. Finally, the study of reassortment patterns might also provide important clues to the linkage of genomic segments, as it is expected that closely linked segments will be subject to less frequent reassortment. Analyses of this sort represent a key task for future evolutionary genomics in influenza virus.

A related, although far more controversial, issue is that of intrasegment RNA recombination. Although there is ample evidence that influenza viruses undergo various forms of non-homologous recombination, albeit rarely50,51, the occurrence of homologous recombination within segments is far from proven. Some comparative studies indicate that complex patterns of genetic diversity might be a footprint of past recombination, although the evidence is not conclusive52. The most compelling evolutionary evidence for recombination — the occurrence of incongruent phylogenetic trees — is generally lacking, and previous suggestions of incongruence, for example, in the emergence of 1918 influenza A virus53, are more likely to be due to differences in substitution rates between the HA1 and HA2 domains54. Indeed, low rates of RNA recombination seem to characterize negative-sense RNA viruses in general52.

Rates of evolutionary change in influenza virus

Accurate estimations of evolutionary rates at both nucleotide and amino-acid levels are central to resolving many long-standing questions about the evolution of influenza virus, including the relative roles of natural selection versus genetic drift, the origins of the 1918 H1N1 pandemic virus, and the ecology of the virus in its avian reservoir. The available literature provides inconsistent reports of evolutionary rates, largely because of differences in methodology, as well as the number and the epidemiological significance of the virus samples that have been used for analysis. Improved analytical methods and greater availability of whole-genome sequence data should enable large-scale systematic comparisons of evolutionary rates of entire genomes from multiple subtypes in numerous host species.

These limitations notwithstanding, the main determinant of variation in substitution rates among influenza viruses seems to be the strength of immune selection pressure; background mutation rates are generally similar among RNA viruses, at approximately one mutation per genome replication (Ref. 55), which translates into long-term substitution rates of 10⁻³ to 10⁻⁴ nucleotide substitutions per site per year (Refs 56, 57). Such selection pressures generally reflect the length of time that an influenza virus subtype has been associated with a particular host species. So, 'older' influenza A viruses evolve more slowly (at non-synonymous sites) in the reservoir avian species with which they might have co-adapted, whereas newly emergent viruses, such as A/H3N2 and A/H5N1 in humans and domestic poultry, evolve more rapidly (through positive selection) to evade host immunity and achieve efficient transmission in new host species.

Studies of evolutionary rates of several influenza virus proteins generally support this model: the lowest rates of non-synonymous substitution have been reported in influenza viruses that were sampled from wild aquatic bird species8,58, the highest rates in viruses that caused human H3N2 epidemics and outbreaks in poultry and swine59,60, and intermediate rates in older human subtypes (such as H1N1), in type B viruses61,62 and in internal proteins63. However, the hypothesis that avian influenza A viruses have reached an adaptive equilibrium ('evolutionary stasis') following a long-term co-adaptation with wild aquatic bird species8, could be misleading. Although rates of amino-acid change in influenza A viruses are undoubtedly lower in wild aquatic birds compared with humans, overall rates of nucleotide substitution are not significantly lower than those of most RNA viruses58,64. As waterfowl and shorebirds seem to mount only a weak immune response to influenza A virus, it is likely that most amino-acid changes are deleterious and are purged through purifying selection, as shown by the tendency for non-synonymous mutations to fall on terminal branches of phylogenetic trees65. By contrast, higher rates of non-synonymous substitution in the HA segment have been observed in some domestic poultry species58, despite the fact that unvaccinated poultry do not typically mount a strong immune response59. These observations indicate that the selection pressures to adapt to a new host can lead to rapid evolution, even in the absence of immune selection, and confirm the importance of whole-genome analysis.

To fully understand the factors that generate variation in the rate of evolutionary change in influenza virus and how these relate to disease emergence and severity, a comprehensive survey of evolutionary rates for all eight genome segments of the main viral subtypes in different host species is needed, using updated methods. Previous linear regression analyses that compared genetic distance against the year of isolation were inherently biased, as data points were non-independent, leading to over-sampling of certain phylogenetic branches. By contrast, maximum likelihood66 and more recent Bayesian Markov chain Monte Carlo (MCMC)67,68 approaches explicitly account for phylogenetic structure, time of sampling and rate variation among lineages. Bayesian MCMC methods also provide an indication of statistical uncertainty, because estimates are made on large numbers of sampled trees.
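For reference, the older regression approach that this paragraph criticizes can be sketched in Python as an ordinary least-squares fit of genetic distance against year of isolation, with the slope read as a substitution rate. The function name and the data points are hypothetical and chosen only to make the arithmetic easy to follow. The sketch deliberately treats every isolate as an independent observation, which is exactly the source of bias described above; the maximum likelihood and Bayesian MCMC methods that avoid it are not shown.

```python
def regression_rate(years, distances):
    """Least-squares fit of distance on year; returns (slope, intercept).

    The slope is interpreted as substitutions per site per year.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical isolates: year of isolation and genetic distance from a
# reference sequence (substitutions per site).
years = [1998, 2000, 2002, 2004]
distances = [0.010, 0.018, 0.026, 0.034]

rate, intercept = regression_rate(years, distances)
print(f"estimated rate: {rate:.4f} substitutions per site per year")
print(f"year at which fitted distance is zero: {-intercept / rate:.1f}")
```

Because closely related isolates share most of their history, their distances are correlated, so the apparent precision of a fit like this one overstates what the data support.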

Evolutionary aspects of seasonality

The clock-like consistency of the winter incidence peaks of influenza virus represents one of the strongest examples of seasonality in infectious disease (Fig. 5). However, the reasons that human influenza epidemics arise and then peak at consistent 6-month intervals across temperate regions of the northern and southern hemispheres are unknown. Various theories have been proposed to explain how seasonal change might stimulate influenza activity: transmission rates might increase during school terms and winter crowding, the stability of the virus might be enhanced by cooler temperatures, or host immunity might decline during colder weather (reviewed in Ref. 69). All of these hypotheses remain largely untested.

Figure 5: The seasonality of influenza virus.

Weekly reports of influenza-like illness (ILI) are classed as 'Sporadic', 'Local Outbreak', 'Regional Outbreak' or 'Widespread Outbreak' by the World Health Organization's (WHO) FluNet surveillance system. Influenza virus activity peaks at similar times in countries at similar latitude, during winter and early spring in the northern hemisphere and during late spring and summer in the southern hemisphere (depicted by the heavily outlined boxes). In countries that are closer to the equator, influenza virus activity is more consistent throughout the year (depicted by the dotted-line box), with dampened epidemic fluctuations, indicating that these areas could potentially serve as year-round reservoirs for the virus.

Experiments from half a century ago are still cited as evidence that influenza virus is most stable in cool, dry conditions (Ref. 70). However, later work weakened this correlation by showing an increase in susceptibility in mice during winter, even when temperature and humidity were held constant (Ref. 71). Recent research in tropical regions has also shown a significant burden of disease in areas with warm, humid climates (Refs 72, 73). Therefore, cooler temperature alone does not explain influenza seasonality. Various aspects of human behaviour, such as time spent indoors or in school, have been implicated in changing transmission rates; the effect of seasonal changes in climate on immune function and host susceptibility has also been documented (Refs 74, 75). For example, antibody responses are believed to fluctuate with melatonin secretion during seasonal light and dark cycles, potentially increasing human susceptibility to influenza infection at certain times of the year (Ref. 69). It has also been proposed that a deficiency of vitamin D during the winter months, which in turn might reduce the effectiveness of the innate immune system, could help to shape influenza seasonality (Ref. 76).

Explaining this phenomenon has been impeded by gaps in our basic knowledge of the epidemiology of the influenza virus, particularly in the reservoir avian species, which offer a potential avenue for further investigation of seasonal dynamics. Epidemics in wild aquatic birds peak in late summer and early autumn in the northern hemisphere, when fledging and congregation lead to an increase in population density (Refs 8, 77). The virus emerges 6–8 weeks later in domestic turkeys, although the causative factors of this time lag are still unknown (Ref. 77). Overall, the apparent importance of pre-migratory congregation and the birth of new susceptible hosts in triggering avian epidemics implies that the indirect effects of colder conditions on human social behaviour might in part drive influenza virus activity.

Another poorly understood aspect of seasonality is the spatial link between epidemics in the northern and southern hemispheres, and how tropical regions fit within this cross-hemispheric dynamic. For example, although epidemics in the northern hemisphere occur on a regular timescale, those that occur approximately 6 months later in the southern hemisphere exhibit weaker correlation with the epidemic timing of the northern hemisphere78. Epidemic periodicity is even more variable in tropical regions, where available data indicate that influenza often occurs year-round, although incidence peaks sometimes coincide with rainy seasons79 (Fig. 5). Tropical regions might therefore serve as year-round reservoirs for influenza virus and clearly need to be surveyed more intensively80.

One highly informative, albeit indirect, approach to studying seasonality has been to improve documentation of the overall spatio-temporal dynamics of the influenza virus. Analysis of influenza-related mortality data from the United States over the past 30 years, using a newly developed gravity model, showed that the timing of epidemics is most synchronized between the most populous states and during the most severe disease seasons81. Furthermore, the long-distance spread of influenza between cities and states was better correlated with adult workflow traffic patterns than with simple geographical distance. Nevertheless, the full epidemiology of the virus remains complex, and children are still believed to drive the spread of influenza at more local levels: within schools, households and communities in general.
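The general form of a gravity model can be illustrated with a short Python sketch, in which the coupling between two populations increases with their sizes and decreases with the distance between them. This is only the generic textbook form: the exponents, the scale factor, the function name and the population figures below are hypothetical and are not the fitted values from the analysis cited above. Given that workflow traffic was reported to predict spread better than geographical distance, the distance term could equally be replaced by a commuting-based measure.

```python
def gravity_coupling(pop_i, pop_j, distance, a=1.0, b=1.0, c=2.0, scale=1.0):
    """Generic gravity-model coupling: scale * pop_i**a * pop_j**b / distance**c.

    All exponents and the scale factor are illustrative assumptions.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return scale * (pop_i ** a) * (pop_j ** b) / (distance ** c)


# Hypothetical comparison: two populous states versus two small states
# that are separated by the same distance (in arbitrary units).
large_pair = gravity_coupling(20_000_000, 12_000_000, 1_000)
small_pair = gravity_coupling(1_000_000, 600_000, 1_000)
print(large_pair > small_pair)  # True
```

Under these assumptions the two populous states are far more strongly coupled than the two small ones, which is consistent with the observation that epidemic timing is most synchronized between the most populous states.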

In the future, phylogenetic analysis could help to reveal key aspects of influenza virus seasonality. By inferring the evolutionary relationships that exist between viruses that have been sampled from spatially disjunct regions, particularly the tropics, it might be possible to determine the directions of global viral migration and the location of the virus during non-epidemic periods. Indeed, the recent phylogenetic analysis of viruses from single populations has shown that the virus does not 'over-summer', but dies out at the end of each seasonal epidemic, and that subsequent seasonal viral re-emergence is ignited by imported genetic variation30.

Conclusions


Improvements in methods for bioinformatic (Ref. 67) and epidemiological (Ref. 81) analysis, as well as a greatly expanded GenBank database of influenza virus genome sequences, provide an unprecedented opportunity to investigate long-standing questions in influenza virus epidemiology and evolution (Refs 82, 83) (for a discussion of key research questions in influenza virus evolution, see Box 3).

Recent work indicates that the evolutionary dynamics of influenza virus might be more complex than was previously thought, reflecting an intricate interplay between antigenic variation, natural selection and reassortment. For example, at the local spatial level, migration and reassortment among multiple co-circulating lineages of the same subtype might be more important determinants of the seasonal evolution of influenza virus than antigenic drift30, with periodic, selection-driven cluster jumps that result in major changes in antigenic phenotype14,31,33. Despite these important insights, a comprehensive understanding of influenza virus evolution will require a far broader analysis of whole-genome sequences from a wider range of subtypes, host species and geographical areas, including tropical regions, as well as the development of more realistic epidemiological models.

It is also striking that, despite the huge amount of sequence data that has been generated for influenza A virus, studies of intra-host genetic variation are largely absent. However, the high rates of mutation and replication that are common to most RNA viruses mean that intra-host population diversity is likely to be extensive, even in viruses that cause acute infections84. Furthermore, if the population bottleneck at inter-host transmission is not particularly severe, multiple viral lineages, including reassortants, viruses with new antigenic characteristics or even defective viruses85, are likely to be transmitted among hosts. A crucial task for future studies in influenza virus evolution is therefore to quantify the extent of intra-host genetic variation within single individuals to determine whether this includes isolates that are antigenically distinct, and reveal how much genetic diversity is transmitted among hosts and how this might differ among avian and mammalian influenza viruses.

An important shortcoming in research on influenza virus has been the lack of a unifying framework that integrates genome sequence, phenotypic (including antigenic) and epidemiological data. The recent reconstruction of the 1918 pandemic influenza virus genome sequence (Ref. 86) demonstrates how a whole-genome analysis can provide crucial insights into long-standing questions about the virulence and aetiology of this catastrophic disease event (Ref. 87). Similarly, the ability to simultaneously analyse genetic and phenotypic influenza virus data has had a strong influence on our understanding of the patterns and timing of the antigenic evolution of influenza virus (Ref. 33). The Influenza Virus Resource, which is now available on GenBank, exemplifies the most recent attempt to integrate epidemiological and molecular data by making various influenza virus data publicly available, including whole-genome sequences along with the date of isolation, patient characteristics and geographical locations (Ref. 83). Notably, antigenic data are still excluded from this resource. Other influenza virus data sets are less conducive to research, as epidemiological data have rarely been collected in conjunction with sequence data, and many data sets have not been made publicly accessible.

Although new analytical methods and faster sequencing technology offer the opportunity to address crucial questions about influenza virus evolution through phylogenetic analyses, greater surveillance of viral populations and access to data underpin the advancement of this key field of viral research.