Main

The development of high-throughput sequencing has transformed biology and medicine. It is now possible to analyse thousands of genomes in a single study, and sequencing-derived technologies have had tremendous impact: detecting alleles associated with cancer or genetic disorders, characterizing and detecting antimicrobial resistance (AMR), studying microbial diversity and more1. Sequence data present a high-resolution view of the processes of diversification and adaptation, the origins of phenotypes of interest and the myriad ways that diversity may be acquired, lost or maintained. Phylogenetic tools allow inference of patterns of ancestry from observed diversity, and sampling and sequencing through time reveal how measurably evolving organisms have changed and adapted on observable timescales. When this change has happened in the presence of selection, environmental variation, genetic drift and population bottlenecks, sequencing technology and temporal sampling provide the opportunity to learn about evolution.

Pathogen diversification presents health challenges, with the rising burdens of AMR being a clear example. Since antibiotics were first introduced, clinical resistance has consistently followed the introduction of new antimicrobials within one to two decades2. Viral evolution is rapid, and treatment of fast-evolving infections such as human immunodeficiency virus (HIV) is challenging due to the speed at which some viruses can acquire resistance3. Influenza viruses evolve through patterns of antigenic drift and periodic antigenic shift; seasonal vaccines need to be updated regularly, and pandemic strains can emerge repeatedly4. The issue of pathogen diversification has come to the forefront during the coronavirus disease 2019 (COVID-19) pandemic, with the continued emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of concern driving epidemic waves5. Even in slowly evolving pathogens such as Mycobacterium tuberculosis, extensively resistant variants have been reported worldwide6. In Plasmodium falciparum, a eukaryotic organism that causes human malaria, resistance to antimalarial agents is a key factor driving global malaria increases7.

There are increasing efforts to compile genomic data in publicly accessible databases, with a focus on resistance. With large bacterial sequence datasets, researchers have characterized recombination pathways8, capsule switching and resistance acquisition following human intervention in Streptococcus pneumoniae9, identified resistance determinants in M. tuberculosis and Escherichia coli10,11 and characterized global patterns of cholera dissemination12, to name a few. Recently, over 2,000 whole-genome sequences of Neisseria gonorrhoeae were analysed alongside epidemiological data, revealing a novel resistant clone and transmission among distinct contact networks13. Researchers have identified mutations that confer drug resistance in HIV14 and hepatitis C virus15 and key differences in within- and between-host evolution that affect the development of resistance16. Many COVID-19 studies leveraged large volumes of sequence data: more than six million SARS-CoV-2 genomes were analysed to identify mutations associated with transmissibility of the virus17. Since ‘viral phylodynamics’ was conceived in 2004 (ref. 18), models have been able to estimate underlying parameters using likelihoods for phylogenetic trees, linking mechanistic models of diversity with genomic data. Estimated parameters can then be used in forward-time models to predict the relevant population dynamics. This approach effectively summarizes the information in a set of pathogen sequence data as one or several real-valued parameters.

In contrast, models used for infectious disease forecasting often cannot incorporate pathogen diversity and cannot typically be compared with genomic data. These models include the susceptible–infectious–recovered compartmental model and its myriad extensions (for example, latent periods, age structure and vaccination status). Although recent work links birth–death and coalescent phylogenetic models to compartmental models and projections19,20,21,22, many modelling frameworks used for forecasting disease do not yet lend themselves to modelling diverse pathogens. For example, a report23 estimates that by 2050, AMR will cost up to US$100 trillion and cause 10 million deaths per year, based on an assumption that all bacterial infections will be resistant. But in many of the most prevalent bacteria causing human disease, resistance has remained at stable intermediate frequencies for many years24,25. Modelling even this single fact about bacterial diversity (resistant and sensitive types can coexist for long periods) has proved challenging25, but if models do not correctly describe standing, stable diversity, they have poor prospects for making good forecasts into the future.
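To make the compartmental framing concrete, the following is a minimal sketch of a susceptible–infectious–recovered model integrated with a simple Euler scheme. The function name `simulate_sir`, the parameter values and the step size are illustrative assumptions and are not taken from any of the cited studies. The sketch tracks a single pathogen type, which is exactly the limitation discussed above: nothing in its state can represent resistant and sensitive strains coexisting.

```python
def simulate_sir(beta, gamma, i0=1e-3, days=200.0, dt=0.05):
    """Euler integration of a closed-population SIR model (proportions).

    dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
    Returns a list of (t, S, I, R) tuples. All values are illustrative.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    t = 0.0
    trajectory = [(t, s, i, r)]
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i * dt   # flow S -> I in this step
        new_recoveries = gamma * i * dt      # flow I -> R in this step
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        t += dt
        trajectory.append((t, s, i, r))
    return trajectory


# Hypothetical parameters giving a basic reproduction number beta/gamma = 2.
trajectory = simulate_sir(beta=0.4, gamma=0.2)
final_t, final_s, final_i, final_r = trajectory[-1]
peak_i = max(i for _, _, i, _ in trajectory)
```

Extensions such as latent periods or age structure add compartments to this state, but they do not by themselves add pathogen diversity.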

In this Perspective, we outline the need and opportunity for stronger links between forecasting and genomic data. Sequencing technologies have matured to the point where sampling in a consistent manner over time is feasible, and this gives us the opportunity to observe the evolution of our most important pathogens in response to our interventions. If we could incorporate these data into predictive models—building, testing and refining them against high-resolution data on evolution through time—we would stand a much better chance of assessing risks of immune escape, AMR and other evolutionary changes, and mitigating these risks. We describe several recent efforts in this direction, the availability of relevant data and the remaining challenges. We call on modellers and genomics experts to create the data, models, benchmarking and refinements that will be required to bring genomic data together with forecasting efforts.

Rapid increase in genomic data

Large volumes of pathogen sequence data have been collected and made available online (Table 1). The Genomes Online Database (GOLD)26 links studies and metadata, sourced from the National Center for Biotechnology Information (NCBI), the Department of Energy Joint Genome Institute and others. Other databases host large amounts of viral sequence data (for example, NCBI’s Virus portal27 and the Global Initiative on Sharing All Influenza Data [GISAID]28). The Pathosystems Resource Integration Center (PATRIC) database29 collates genomes of pathogenic bacteria, with available year and country data and some antibiotic resistance information. The quantity of documented sequences has been exponentially increasing over time30.

Table 1: A selection of sequence databases containing pathogen genomes

There are a number of projects providing genomic data for the detection, comparison and study of AMR genes and isolates. Major databases include the Comprehensive Antibiotic Resistance Database31, MEGARes32, DeepARG33 and the broader-purpose UniProt34. Knowledge of the emergence and origins of AMR is essential in preventing and mitigating the harm it causes. However, large collections of AMR genes, with differing sampling strategies, without the organisms’ context and without information about antibiotic-sensitive counterparts, are not directly amenable to forecasting and modelling pathogen ecology and evolution. Continual sampling as part of surveillance programmes may close this gap to some extent, particularly if sampling also includes non-resistant and ‘background’ isolates and organisms. In this vein, the European Antimicrobial Resistance Surveillance Network project (https://www.ecdc.europa.eu/en/about-us/partnerships-and-networks/disease-and-laboratory-networks/ears-net) is focused on AMR surveillance, collecting data from invasive isolates originating from national surveillance programmes and laboratory networks. The US Antibiotic Resistance Laboratory Network (https://www.cdc.gov/drugresistance/ar-lab-networks/domestic.html) spans 50 states and Puerto Rico, and reports AMR to the US Centers for Disease Control and Prevention, which runs the Antibiotic Resistance Solutions Initiative. The World Health Organization’s (WHO’s) Global Antimicrobial Resistance Surveillance System (https://www.who.int/initiatives/glass) is promoting the development of additional national surveillance systems to collect, analyse and share data.

There are ambitious long-term sequencing projects underway, some linked to surveillance programmes. The Comprehensive Resistance Prediction for Tuberculosis: An International Consortium project (http://www.crypticproject.org/) is sequencing 100,000 genomes for tuberculosis from five continents, and both England and the United States use routine whole-genome sequencing for tuberculosis. The Wellcome Trust Sanger Institute’s Parasites and Microbes programme has an ongoing commitment to sequencing a range of organisms and making data available (Table 1). There are many clinical, reference and public health laboratories around the world that have stored isolates over many years; these isolate collections could be sequenced. In all, including existing datasets, upcoming projects and the decreasing cost of sequencing existing collections of isolates, there are rich opportunities to build the data to capture, at a high level of resolution, ongoing pathogen evolution.

Epidemiology and genomic data

The rapid accumulation of genomic data has provided insight into epidemiological and evolutionary processes and stimulated the development of a number of methods18. These have been applied to inferring epidemiological parameters, investigating transmission patterns35, determining the spatial, temporal and zoonotic origin of pathogens36, understanding acquisition and transmission of AMR37, and modelling fitness38.

Pathogen genomic data encode information for inferring epidemiological parameters including the basic and effective reproduction numbers and the effective population size through time. Several inference methods have been developed for large-scale models with relatively sparse sampling, for example, to estimate the basic reproduction number using Bayesian inference with a birth–death model for HIV-1 in Switzerland39, and with a structured coalescent model for SARS-CoV-2 (ref. 40). Recent extensions to these approaches have allowed for differences between lineages, inter-strain interactions and geographic movements41,42,43. Multi-type branching process models allow for rapidly evolving or co-circulating pathogens and host populations with heterogeneous contact structures. Tree comparison approaches estimate parameters by comparing phylogenetic trees from simulations with those from data, via approximate Bayesian computation (ABC)44 or via mathematical representations of phylogenetic trees45,46,47.
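The logic of ABC can be illustrated with a deliberately simplified rejection sampler. This is a toy sketch under stated assumptions, not the tree-based method of the cited work: the scalar 'summary statistic' here is the final epidemic size of a simple SIR model plus bounded noise, standing in for the tree-shape summaries that a genomic analysis would compare. The function names, the uniform prior on the reproduction number, the noise level and the tolerance `epsilon` are all hypothetical choices.

```python
import math
import random


def final_size(r0, tol=1e-10, max_iter=5000):
    """Solve z = 1 - exp(-r0 * z) by fixed-point iteration (SIR final size)."""
    z = 0.5
    for _ in range(max_iter):
        z_next = 1.0 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


def abc_rejection(observed, n_draws=2000, epsilon=0.02, noise=0.01, seed=1):
    """Keep prior draws whose simulated summary lies within epsilon of the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        r0 = rng.uniform(1.0, 4.0)                               # draw from the prior
        simulated = final_size(r0) + rng.uniform(-noise, noise)  # toy noisy simulator
        if abs(simulated - observed) < epsilon:                  # compare summaries
            accepted.append(r0)
    return accepted


# Pretend the 'data' were generated with a reproduction number of 2.
observed_summary = final_size(2.0)
accepted = abc_rejection(observed_summary)
posterior_mean = sum(accepted) / len(accepted)
```

In a phylodynamic setting, the simulator would generate trees and the comparison would use several tree summaries or a tree distance, but the accept-or-reject structure is the same.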

Efforts have also been made to reconstruct outbreak transmission trees from genomic data in outbreak settings with dense sampling48,49. Genomic data are also used to understand contact networks, for example, using ABC for HIV genomic data to estimate structural parameters of contact networks50, and to identify transmission risk factors, for example, through clustering or viral diversification rates51. At much larger scales, genomic data have been used in influenza virus research to predict evolutionary change38, with the potential to inform vaccine design. This is enabled in part by routine collection of influenza sequences, linked to geographic and epidemiological information.

Pathogen genomic data can be more informative for predictive models when linked to metadata. Epidemiological models exploiting linked genomic data have been developed for only a limited number of pathogens because of the limited availability of metadata. Methods that unify classic infectious disease compartmental models and population dynamics from genomic sequences19,52 have gained popularity, as they allow description of phylogenetic clustering patterns in addition to epidemiological parameter estimation. Methods combining epidemiological and genomic data have been used to reconstruct early transmission trees of foot-and-mouth disease outbreaks53 and to infer likely infection times and heterogeneity in infection54. With locations of sampled genomic sequences, phylogeographic methods help characterize the emergence of a pathogen, identify importation and local circulation, and evaluate factors driving transmission. With a structured coalescent model and Bayesian inference, genomic data and their sampling locations have been used to reconstruct transmission histories, migration patterns and outbreak origins43,55. Although not directly predictive, this has policy applications: it has been shown that many regional outbreaks of SARS-CoV-2 (in New York City, Israel and others) were initiated by multiple introductions, highlighting the highly porous nature of borders56,57.

To date, these methods have been mainly retrospective or descriptive. We propose that it is possible, and it is time, to develop models with genomic data that produce results relevant to prediction. Data availability to support this effort is improving. Linkage of genomic data to metadata and other epidemiological information remains limited, although such linkage would render genomic data far more informative and improve forecasting efforts58. Estimation of model parameters and incorporation of some epidemiological structure19,22 has been a clear step towards prediction, and strengthening models with additional epidemiological data would allow genomic methods to be more effectively linked with current forecasting efforts.

The potential of genomic data for forecasting in public health

Non-genomic mathematical models for infectious disease forecasting are widely applied in public health. There is a demand for models that allow public health agencies to prepare for expected demands on health services, vaccine stocks, mobilization of healthcare workers and communication campaigns59. Estimates of disease burden produced by the WHO60 and others are used to compare the relative importance of different diseases and to determine where to allocate limited resources. Although progress was at first mostly limited to retrospective analyses using agent-based, compartmental or time-series models, over the past 20 years the implementation of epidemiological models for forecasting has become more commonplace. Three key factors driving progress have been the collection of high-resolution spatio-temporal data61, the incorporation of more complex model features such as population structure and seasonal forcing, and computational advances in methods such as Markov chain Monte Carlo (MCMC)62 and ABC63.

We argue that the same steps towards forecasting should be taken with analyses incorporating genomic data. This need has previously been recognized64, but with the collection of more longitudinal genomic data, there is increased opportunity. The application of genomic data to real-time analyses is currently limited, despite a marked increase in forecasting and ‘nowcasting’ analyses65. Although the COVID-19 pandemic has seen a plethora of research in forecasting65,66 and genomic epidemiology67,68, these analyses have remained largely separate. In other contexts, real-time sequencing of viruses is facilitating prediction from genomic data69,70, although so far analyses have largely been descriptive.

The importance of forecasting in public health has been widely argued71 and further emphasized during the COVID-19 pandemic with many public health organizations turning to mathematical models for regular jurisdictional forecasts, despite uncertainties. Incorporating genomic data into predictive models will offer new opportunities. For example, existing mathematical models for malaria incorporate population structures and immune selection to design drug resistance control strategies72. Genomics would seem a natural tool to extend this, although existing analyses have been more focused on genome description and identification of vaccine candidate antigens73.

On the other hand, some findings in epidemiological models have been at odds with observations in genomics. Compartmental epidemiological models often predict competitive exclusion by the ‘fittest’ strains59,74. However, genomic studies have observed consistent strain diversity even in competing pathogen populations, such as long-term coexistence of drug-sensitive and drug-resistant strains of S. pneumoniae75 and apparent frequency-dependent selection in S. pneumoniae and E. coli76,77. Further incorporation of genomics into epidemiological modelling could help to reconcile these contradictory perspectives and ensure that models can capture realistic diversity.
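Competitive exclusion in a compartmental model can be seen in a minimal two-strain sketch. The assumptions here are ours and are chosen for simplicity: a susceptible–infectious–susceptible structure, both strains drawing on one shared susceptible pool, equal recovery rates and illustrative parameter values. Under these assumptions the log-ratio of the two strains changes at rate (beta2 - beta1) × S, which is negative whenever strain 1 transmits better, so the weaker strain always declines relative to the stronger one.

```python
def simulate_two_strain_sis(beta1, beta2, gamma, i1=1e-3, i2=1e-3,
                            t_end=1000.0, dt=0.1):
    """Euler integration of two strains competing for one susceptible pool.

    dI_k/dt = (beta_k * S - gamma) * I_k,  with S = 1 - I1 - I2.
    Returns the final prevalence of each strain. Values are illustrative.
    """
    for _ in range(int(round(t_end / dt))):
        s = 1.0 - i1 - i2                 # shared susceptible fraction
        d1 = (beta1 * s - gamma) * i1     # net growth of strain 1
        d2 = (beta2 * s - gamma) * i2     # net growth of strain 2
        i1 += d1 * dt
        i2 += d2 * dt
    return i1, i2


# Strain 1 transmits slightly better; both start at the same prevalence.
winner, loser = simulate_two_strain_sis(beta1=0.5, beta2=0.4, gamma=0.2)
```

With these values strain 1 settles at a prevalence of 1 - gamma/beta1 = 0.6 and strain 2 is driven towards zero. Stable coexistence of the kind seen in genomic data requires additional structure, such as the frequency-dependent selection mentioned above.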

Early in the West African Ebola virus outbreak of 2013–2016, phylogenetic tools were used to trace the outbreak source and to characterize transmission patterns78. However, the majority of collected sequences could not be linked to individual case records79, limiting applicability to modelling or forecasting, and studies that did include such linkage were largely descriptive rather than predictive80. With pre-planned collection of genomic data during outbreaks and the goal of epidemiological analysis in mind, we could more fully incorporate the additional information that genomic data offer. A major difficulty with analyses of outbreaks is that the epidemic process is usually only partially observed: rarely do we know when individuals are infected or who infected whom. Genomic data can give us insight into these unobserved processes. However, this comes with ethical concerns, particularly around source attribution where this may have legal consequences or lead to stigmatization or social harm, as with HIV81. There is a growing base of ethical guidelines specifically concerning genomic research, but phylogenetic reconstruction studies must still make careful decisions around the costs and benefits of their findings81. The use of genomic data to reconstruct outbreaks also brings logistical challenges, for example, in rapid data collection and processing82.

Genomic data also offer opportunities to improve vaccine design through increased understanding of pathogen diversity dynamics, as is underway for influenza38. Epidemiological models have been widely used, for example, in estimations of herd immunity thresholds74 and to formulate vaccine development and deployment strategies83. In HIV research, antiretroviral therapies have been analysed using compartmental epidemiological models84. However, this modelling has focused on the serotype and genotype levels, remaining somewhat separate from the field of phylodynamics despite seeking to answer similar questions. As non-genomic models for intervention strategies are not purely retrospective but also predictive, if we could integrate the rich information contained in now readily collected genomic data, we would be much better placed to make accurate forecasts incorporating evolutionary change.

Outlook

There are key challenges facing both surveillance specialists and modellers for progress in forecasting from genomic data. Longitudinally collected genomic data are critical to study how patterns of evolution and transmission are changing in time. Genomic and epidemiological data can be challenging to link, particularly when these are collected by different groups with different goals. Understandably, sharing genomic data and linking individual-level data (genomic, epidemiological, clinical) raise ethical and privacy questions, among many barriers to data sharing in public health85. All of the above will require dialogues between data collectors, data users and methods developers.

For modellers, finding the right level of abstraction is a challenge (Fig. 1). In most cases we do not wish to predict sequences, but rather the abundance or prevalence of different subgroups or types, or to understand selection and quantify the risk of emergence of new phenotypes and the impact of disease. Aims might include projecting the rate of spread of resistance, the emergence of new resistance or variants of concern, how strongly selection may favour new phenotypes, whether there are existing mutational profiles that could combine to confer advantages and so on. This will require finding an appropriate balance in the trade-off between simple and complex models, as well as useful summary statistics or descriptions of genomic data and the relationships between genomes. Methods that use genomes to infer population structure have been developed in recent years86, but methods that characterize interactions across these structures have not yet really been explored.

Fig. 1: Data, models, predictions and outcomes may cover multiple levels of resolution.

For example, data may comprise sequences, from which we wish to model sequence types or genotypes, to forecast global disease trends and thereby design an efficient resource deployment strategy. Alternatively, the composition of the pathogen population by larger sub-populations (for example, serotype or variant) may be the focal level for genetic diversity. There are myriad combinations, from which the scientist must determine the optimum scale at each step.

Similarly, a challenge of existing phylogenetic and phylodynamic approaches, such as those using birth–death and coalescent models, is that they usually assume that genetic diversity is phenotypically neutral and are therefore not well suited to forecast resistance or antigenic evolution. There are modelling approaches that account for different growth or death rates in different lineages, including multi-strain epidemiological models, genome-scale negative frequency-dependent selection models76,77, multi-type birth–death models41, the binary-state speciation and extinction framework87 and its extensions, and estimates of selection coefficients or fitness using genomic data17,88. These may group sequences into types or variants and proceed with an assumption of phenotypic neutrality within these types or focus on identifying mutations that confer an advantage. In our view, methods that directly incorporate selection strengthen links between genomic data and forecasting, even where their main focus is not making forward-time projections.
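As one simple illustration of how selection can enter a forward projection, consider a two-type model in which a variant has a constant growth advantage s over the background. The logit of its frequency then increases linearly in time with slope s, so s can be estimated by a straight-line fit and the fit extrapolated forwards. The sketch below assumes noise-free frequencies strictly between 0 and 1 and uses hypothetical function names and synthetic data; it is not the estimator used in the cited studies, which handle sampling noise and multiple lineages.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def estimate_selection(times, freqs):
    """Least-squares slope and intercept of logit(frequency) against time."""
    y = [logit(p) for p in freqs]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in zip(times, y))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def project_frequency(slope, intercept, t):
    """Logistic curve implied by a constant selection coefficient."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * t)))


# Synthetic weekly frequencies for a variant starting at 1% with s = 0.1 per day.
times = list(range(0, 70, 7))
freqs = [project_frequency(0.1, logit(0.01), t) for t in times]

s_hat, intercept_hat = estimate_selection(times, freqs)
forecast_day_90 = project_frequency(s_hat, intercept_hat, 90)
```

The fitted slope doubles as a forecast parameter: the same two numbers that describe the past trajectory determine the projected frequency at any later time, provided the advantage stays constant.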

Although machine learning methods have become widely used tools in modern statistical analysis, their application to forecasting given genomic and epidemiological data is not straightforward. In addition to the drawbacks of difficult-to-interpret ‘black box’ methods89, machine learning approaches have been shown to struggle with genome-wide association studies90 despite a larger amount of training data than generally available in forecasting analyses. One key problem is that of hidden population structures in genomic data: complex interacting and evolving populations are likely to have complex dependence structures. These structures, if not accounted for in the mathematical model, can cause confounding91. Selection bias in which isolates are collected and which are included in sequencing studies is also a challenge, which must be accounted for or avoided. For example, prioritizing outbreaks for sequencing could lead to spurious conclusions of higher transmissibility if sampling strategy is not accounted for.
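The effect of outbreak-prioritized sampling can be shown with a small calculation. Suppose, as a hypothetical sampling scheme, that each transmission cluster is chosen for sequencing with probability proportional to its size. The expected size of a sequenced cluster is then the size-biased mean, E[X²]/E[X], which is never smaller than the ordinary mean. The cluster sizes below are made up for illustration.

```python
def mean_cluster_size(sizes):
    """Average size when every cluster is equally likely to be sampled."""
    return sum(sizes) / len(sizes)


def size_biased_mean(sizes):
    """Expected size when a cluster is sampled with probability
    proportional to its size (an assumed, simplified sampling scheme)."""
    return sum(x * x for x in sizes) / sum(sizes)


# Made-up cluster sizes: mostly small clusters plus one large outbreak.
cluster_sizes = [1, 1, 1, 1, 2, 2, 3, 10]

unbiased = mean_cluster_size(cluster_sizes)   # 21 / 8
biased = size_biased_mean(cluster_sizes)      # 121 / 21
```

Here the typical cluster has fewer than three cases, but the sequenced clusters would average almost six. An analysis that ignored the sampling scheme could read that difference as higher transmissibility.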

In both public health and evolution, we will require interpretable models that contain explanations of the predictions they make. This has ethical motivations in medical fields—patient safety, trust, concerns over unintended sociodemographic biases—and has been legally mandated, for example, by the European General Data Protection Regulation, which states that when personal data are used, the decision-making system of a model must be traceable and explainable92. This motivates the use of mechanistic and/or statistical models rather than, or in combination with, machine learning models and ties in with identified challenges in explainable AI93. The dimension of time, not present in genome-wide association studies or the vast majority of existing genomics applications in mathematics and statistics, also introduces complexities. We now require understanding of the interactions between organisms at different scales, including competition, horizontal gene transfer, synergy and niche differentiation. Over long periods where the drifting dynamics of these interactions may not yet be well understood, forecasting may not be feasible. However, mechanistic approaches offer the opportunity to model these behaviours at different levels and allow interpretation of the interactions between them.

The ‘curse of dimensionality’ is particularly challenging with genomic data—for example, the number of potential genotypes increases exponentially with the number of loci considered. It is not possible to sample every combination, resulting in limitations to the feasibility of prediction, and increasing the potential for population stratification confounding and the computational complexity. Factors affecting the success of different pathogen genotypes may depend on complex interactions between large numbers of loci and numerous environmental factors94. The dependence structures may not be known in advance, and the set of possible dependencies is large. Computational techniques will therefore need to infer or otherwise account for unknown dependence structures and overcome the dimensionality problems they introduce.
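The exponential growth of the genotype space is easy to quantify. The sketch below assumes independent loci with a fixed number of alleles each (two by default), which is a simplification, and the function names are ours. It gives the number of possible genotypes and an upper bound on the fraction of them that a sample of a given size could contain.

```python
def genotype_space_size(n_loci, alleles_per_locus=2):
    """Number of distinct genotypes for n_loci loci (simplified model)."""
    return alleles_per_locus ** n_loci


def max_fraction_observed(n_samples, n_loci, alleles_per_locus=2):
    """Upper bound on the share of genotypes seen: every sample is distinct."""
    return min(1.0, n_samples / genotype_space_size(n_loci, alleles_per_locus))


few_loci = genotype_space_size(10)                        # 1,024 genotypes
coverage_20 = max_fraction_observed(1_000_000, 20)        # most of the space
coverage_40 = max_fraction_observed(1_000_000, 40)        # a vanishing share
```

Even a dataset of six million genomes, the scale mentioned earlier for SARS-CoV-2, could cover under 1% of the combinations of just 30 biallelic loci.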

Throughout this Perspective, we have discussed areas of research that would benefit from incorporation of genomic data for forecasting. Existing forecasting approaches in infectious disease modelling often cannot incorporate pathogen diversity, whereas existing approaches in phylogenetics and other genomic research have had limitations when it comes to prediction. Combining these approaches further and developing new methods at their intersection will unlock new possibilities, particularly with the ever-increasing availability of longitudinally sampled pathogen genomic data (Fig. 2). Many research efforts are already moving in this direction, but we take a more speculative view in Table 2 of where future research could focus, for what purpose, what the data and sampling challenges will be and where this may be most applicable to public health.

Fig. 2: Depiction of current research directions and the opportunities highlighted in this work.

We primarily focus on mechanistic models for forecasting, but these approaches can also include statistical or empirical models. Tree reconstruction methods include Bayesian Evolutionary Analysis Sampling Trees (BEAST) and maximum likelihood (ML).

Table 2: Areas for further predictive applications of pathogen genomic data

Conclusions

We propose that there are high potential benefits to developing forecasting methods that can combine genomic data with epidemiological, clinical and surveillance system data. This will require combining existing techniques in novel ways (Fig. 2 and Table 2) and developing new approaches. If we can incorporate pathogen dynamics and evolution into existing forecasting approaches, there is scope to make more robust predictions. Similarly, methods that fit models to genomic data to estimate epidemiological parameters can be extended for forecasting by incorporating knowledge of the underlying generative processes. Although the application of machine learning methods to genomic epidemiological analyses has limitations, there is scope to further integrate them. Linking mechanistic models to machine learning approaches can help to motivate their structure, interpret their outputs or gain intuition about the mechanistic behaviours behind forecasts. All of the above has been made possible by tremendous efforts to collect and compile genomic data into publicly available repositories and would be further facilitated by (1) more longitudinally collected and representative sequences and (2) linkage to epidemiological, demographic and clinical information where feasible. Over the past 20 years, the rich information that genomic data contain has been successfully applied to retrospective epidemiological analyses. The next step is for genomic data to help us understand more about possible futures to come.