Main

Most emerging human infectious diseases originate in wild animals1. Of these zoonoses, viruses account for the majority of severe epidemics and pose the greatest pandemic threat due to their transmissibility, evolvability and lack of therapeutic options. Every year, a growing number of viruses are detected in human hosts for the first time2, both because of better surveillance and because the rate of viral emergence is increasing3. Factors that contribute to emergence risk include weak health systems, globalization, inequality, conflict, increasing human–wildlife contact, agricultural intensification, deforestation and climate change4,5,6. Given the urgency of understanding and containing these threats, substantial research effort has been directed towards modelling and predicting host–virus interactions in recent decades.

With the advent of the COVID-19 (coronavirus disease 2019) pandemic, the global scientific community may finally have a broad public and policy audience willing to tackle emerging zoonotic diseases, with the aim of ‘predicting and preventing the next pandemic’7. ‘Prevention’ ultimately falls on healthcare systems and policy interventions, but ‘prediction’—knowing which possible threats should be countered most urgently—is a task that draws on various fields, including microbiology, virology, ecology, evolutionary biology and statistics. Experts in these fields face a massive problem of triage: today’s public health emergencies are caused by only a small subset of the thousands of animal viruses that have zoonotic potential—the ability to infect a human host8,9. Modelling can accelerate the identification of potential future threats if general rules determine which animals and which viruses will pose a future threat to humans with enough specificity to make these predictions actionable. Statistical models have helped to identify reservoir species of novel human pathogens10, map the geographic distribution of risk11, identify seasonal trends in spillover12, estimate transmissibility and virulence post-emergence in humans13, quantify outbreak detectability and under-detection14, and project onward spread15. All of these objectives are part of a prediction pipeline intended to make basic science actionable in public health, but currently these fundamental endeavours rarely reach their full applied potential.

Here we review the subset of modelling studies that predict the potential for specific viruses to infect host species (hereafter host–virus associations). We aim to highlight the main approaches, hypotheses and innovations in this area. These studies have helped to synthesize the basic biological mechanisms that structure the global virome and have shown increasing potential to identify the highest-risk hosts and viruses. However, risk assessment for known viruses is still carried out using a mix of expert opinion and laboratory work16,17, largely in the absence of predictive methods. As such, most modelling work has translated into limited veterinary or public health benefits, leading to concern that predictive approaches may have limited utility for outbreak prevention, especially when compared with direct investments in aid programmes, capacity building, syndromic surveillance or vaccine development18,19. We assess how models may better deliver on their promise and contribute meaningfully to the prediction of future viral threats.

Using patterns for prediction

When the public or policymakers talk about ‘prediction’ in a One Health or pandemic preparedness setting, this idea often refers to anticipating future events. In this view, knowing an outbreak is coming may conceivably allow it to be circumvented or, at least, substantially lessened in scope and duration. However, in computational biology, ‘prediction’ more often refers to the ability of quantitative tools to recapitulate and explain the world as it exists today and has done through history and, by extension, anticipate both unknown contemporary patterns and potential future ones. Conventionally, these approaches fall on a continuum between mechanistic, hypothesis-driven statistics (often associated with the idea of explanatory prediction, based on iterative confirmation of theory) and mechanism-agnostic, exploratory machine learning (used to make predictions over new data, also called anticipatory prediction)20,21. However, the two approaches are synergistic, and the boundary between the approaches is increasingly blurred owing to both an expanding set of tools for interpretable (non-‘black box’) machine learning and a growing set of opportunities (and expectations) to use model–lab–field feedbacks to challenge and improve predictive models.

Predictive tools can be used to explain and anticipate many aspects of pathogen transmission. Here we review a subset of those tools, which aim to identify and predict why some host species can be infected with some viruses and others cannot. Models can, with increasing accuracy, predict the zoonotic potential, reservoir hosts or host range of a virus species; the viral diversity of a given host species; and viral sharing among host species. These basic model formulations can all be viewed as subsets of one overarching statistical approach: using statistical models trained on host–virus association data to explain, reproduce and infer the structure of the host–virus network (Fig. 1).

Fig. 1: Designing predictive models.

Studies can use data describing known host–virus interactions to predict potential cross-species transmission. Several approaches involve predicting the structure of host–virus association networks (depicted here using real associations known from large datasets in disease ecology), including zoonotic risk prediction (A: can a virus infect humans?) and reservoir identification (B: where does a zoonotic virus come from?); work trying to predict viral host range (C: how many hosts?) and viral diversity (E: how many viruses?); and viral sharing analysis (D: which hosts share viruses?). All of these are subsets of a general problem (F): the prediction of bipartite network links, as a way of representing host–virus associations. Approaching these problems as general link prediction may lead to new insights, especially when considering a more complete network. For comparison, we show the full range of recorded interactions compiled across the HP3 database (G: black lines match the smaller examples, but additional known links are added in grey). For example, a recent link prediction study found a high probability that two of these viruses (bovine viral diarrhoea disease virus 1 and bluetongue virus, red links) may have undiscovered zoonotic potential (red lines in ref. 37). Light and dark blue nodes on the right (G) represent the same viruses and hosts, respectively, depicted on the left (A–F).

These approaches can be identified by the shared data structure they use, which always consists of an edgelist—a set of known host–virus associations, in which most missing or negative records of association represent untested potential interactions—and linked metadata that may include data collection methods, host and virus traits (microbiological, ecological or phylogenetic), and infection characteristics22,23,24,25,26. The quality of these datasets varies in terms of scope, completeness, accuracy and documentation, reflecting the challenges of both wildlife virology and data synthesis. For example, matrix sparsity is a major limitation for computational power: most datasets only record known interactions, with few ‘true negatives’, and many presumed negatives are actually unrecorded associations. Even in long-term ecosystem studies, a third of ‘cryptic’ host–pathogen interactions may go unrecorded27, while at the planetary scale, over 90% of possible mammal–virus associations may never have been observed28. At the same time, reported interactions may also include a mix of data quality (that is, a mix of true and false positives). For example, serological evidence can be confounded by cross-reactivity among closely related viruses, and the process of digitization can also introduce new errors and inconsistencies, particularly in host and virus taxonomy. Every dataset contains unique iterations of these challenges, and discrepancies among them can create significant problems for reproducible hypothesis testing26. Researchers may therefore aim to work from standardized datasets such as The Global Virome in One Network (VIRION)29, which aims to compile every available source of information on vertebrate viruses into one dynamic, open dataset with a reconciled taxonomic backbone and rich metadata on host and virus taxonomy, data provenance and evidence of interactions.
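The shared data structure described above can be made concrete with a minimal sketch. The host and virus names below are hypothetical placeholders, not records from any real dataset, and the function names are illustrative choices. The sketch turns an edgelist into a binary host-by-virus matrix and reports its sparsity.

```python
from typing import List, Tuple

# Hypothetical toy edgelist of (host, virus) pairs; names are placeholders.
EDGES: List[Tuple[str, str]] = [
    ("host_A", "virus_1"),
    ("host_A", "virus_2"),
    ("host_B", "virus_2"),
    ("host_C", "virus_3"),
]


def association_matrix(edges: List[Tuple[str, str]]):
    """Return (hosts, viruses, matrix); matrix[i][j] is 1 for a recorded link."""
    hosts = sorted({h for h, _ in edges})
    viruses = sorted({v for _, v in edges})
    known = set(edges)
    matrix = [[1 if (h, v) in known else 0 for v in viruses] for h in hosts]
    return hosts, viruses, matrix


def sparsity(matrix: List[List[int]]) -> float:
    """Fraction of host-virus pairs with no recorded association."""
    cells = sum(len(row) for row in matrix)
    filled = sum(sum(row) for row in matrix)
    return 1 - filled / cells


hosts, viruses, matrix = association_matrix(EDGES)
print(hosts, viruses, matrix, round(sparsity(matrix), 3))
```

The zeros in this matrix are unrecorded pairs, not confirmed absences, which is why they are better read as 'unknown' than as true negatives.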

Owing to a common data structure, most studies explore the same basic biological patterns, leading to a broad set of similar findings. For example, exposure and susceptibility within host populations have analogues at eco-evolutionary scales in ‘opportunity’ and ‘compatibility’, captured by geographic and phylogenetic data, respectively4. Host species with broader geographic ranges develop more population genetic structure, encounter more habitats and contact more species—facilitating pathogen exchange at an ecological scale, and viral diversification at an evolutionary scale30,31. Consequently, host geographic range size is a common predictor of viral diversity22,32 and reservoir status11,22, and range overlap is a strong predictor of viral sharing between hosts30. Most studies find a similarly strong phylogenetic effect: some animal clades disproportionately host particular viruses22,33, closely related host species share more viruses30, and zoonoses disproportionately originate in non-human primates13,22. In predictive models, host phylogenies can help to identify and recapitulate a combination of intrinsic autocorrelation (closely related hosts share coevolutionary histories with specific viruses) and latent biological mechanisms (closely related hosts share traits such as metabolic pathways, viral receptors or innate immune mechanisms through identity by descent34). The relative contribution of the two is rarely identifiable, particularly because large databases of within-host traits (for example, receptor chemistry or innate immune responses) are mostly unavailable, thus most studies use host phylogeny as a broad, correlative proxy. Together, phylogeographic predictors are often the strongest across modelling approaches.

Broad similarities such as these point to a set of emerging ‘universal laws’—for example, ‘phylogeographic proximity increases the similarity of host viromes’—that have been repeatedly supported across modelling studies, were often suggested in advance by theory and experimental evidence34,35,36 and predict unknown host–virus associations with surprising accuracy30. However, different study designs can produce very different kinds of ‘predictions’ (Fig. 2), and even though many studies use the same data and statistical methods, the lack of a shared modelling framework has made it difficult to synthesize these findings; for example, varying and complex reporting formats prevent researchers from conducting formal meta-analyses. We outline such a framework (Table 1) and the broad patterns that each approach has so far uncovered. To help researchers build on those patterns, we have organized the last decade of this scientific work into our taxonomy in the Host–Virus Model Database (HVMD, available at viralemergence.org/hivemind), an evidence base of predictive studies and their data, methodology and key findings (and one that we hope will, over time, become more comprehensive than the necessarily limited coverage of studies here).

Fig. 2: Four methods of interrogating host–virus networks.

a, A link prediction model inferring the probability of host–parasite interactions. Data are from ref. 27. b, A host inference model predicting the richness of probable Nipah virus reservoirs at specific locations. Data are from ref. 41. c, A zoonotic risk model, demonstrating that vector-borne viruses and those able to replicate in the cytoplasm are more likely to be able to infect humans. Individual data points indicate partial residuals, black lines are the mean partial effect and shaded areas indicate the 95% confidence interval. Data are from ref. 22. d, A viral sharing model showing that closely related sympatric mammals are more likely to share viruses. Black lines are the mean partial effects, and shaded areas indicate 95% confidence intervals. Data are from ref. 30.

Table 1 Six approaches to predicting the host–virus network

Model design shapes insights and applications

Predicting species interactions is a fundamental task in ecology, especially with the emergence of ecological network science as a subfield. Here we discuss six approaches researchers can use to understand host–virus interactions as a network science problem. The most general approach, link prediction models, uses known associations in an ecological network to infer the probable association of any two species. For symbiotic interactions, the model is usually structured on the basis of a bipartite network of hosts and symbionts (for example hosts and viruses, or plants and pollinators), with species traits used to predict binary link values that denote the presence or absence of an interaction27,37. Link prediction is a general case of other specialized models: for example, zoonotic risk models calculate the link probability between all virus nodes and one host node (humans) (Fig. 1a), and link prediction models can similarly be used to identify potential zoonoses as a subset of their predictions37. However, different kinds of link prediction may have subtle conceptual differences. For example, a ‘link’ can be hard to define: is the aim to predict all existing hosts of a virus or all potentially compatible hosts? Generic host–virus association data are often mismatched with the study aims; these sources primarily catalogue viruses in natural hosts, but they also contain a mix of experimental infections that may be unrepresentative of infection dynamics in the wild, and serological detections that indicate exposure but not necessarily competence (and that are confounded by cross-reactivity38). Careful training data selection and analytical design can help to predict ‘links’ with more biological meaning39, and reporting the sensitivity of model results to different evidentiary standards can improve transparency22.
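To illustrate how zoonotic risk prediction can be read as one slice of a general link prediction problem, the sketch below scores unrecorded host–virus pairs with a deliberately simple neighbour-based heuristic: the mean Jaccard similarity between the focal host's recorded viruses and those of each host known to carry the candidate virus. This heuristic is a stand-in chosen for brevity, not the method of any study cited here, and all host and virus names other than "human" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def virus_sets(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group an edgelist into {host: set of recorded viruses}."""
    by_host: Dict[str, Set[str]] = defaultdict(set)
    for host, virus in edges:
        by_host[host].add(virus)
    return dict(by_host)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_score(host: str, virus: str, by_host: Dict[str, Set[str]]) -> float:
    """Mean virome similarity between `host` and the known carriers of `virus`."""
    carriers = [h for h, vs in by_host.items() if virus in vs and h != host]
    if not carriers:
        return 0.0
    target = by_host.get(host, set())
    return sum(jaccard(target, by_host[h]) for h in carriers) / len(carriers)


def rank_candidates(host: str, by_host: Dict[str, Set[str]]) -> List[str]:
    """Rank viruses not yet recorded in `host`, highest score first."""
    all_viruses = set().union(*by_host.values())
    unrecorded = all_viruses - by_host.get(host, set())
    return sorted(unrecorded, key=lambda v: (-link_score(host, v, by_host), v))


# Hypothetical toy network; setting host="human" gives the zoonotic risk case.
EDGES = [
    ("human", "virus_1"), ("human", "virus_2"),
    ("host_B", "virus_1"), ("host_B", "virus_2"), ("host_B", "virus_3"),
    ("host_C", "virus_3"), ("host_C", "virus_4"),
]
by_host = virus_sets(EDGES)
print(rank_candidates("human", by_host))
```

Calling `rank_candidates` with any other host name returns that host's candidate links from the same function, which is the sense in which zoonotic risk is a special case of link prediction.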

Host inference considers a one-sided subset of the link prediction problem, focused on predicting the hosts, reservoirs or sometimes vectors of one specific virus. The approach is most valuable when the definitive reservoir is unknown40, additional reservoirs are suspected41 or intermediate hosts are of interest42. Models can be easily tailored to these circumstances by using training data that reflect host competence (for example viral isolation instead of PCR or serology, or testing tick larvae for pathogens as a proxy of a host’s ability to transmit39) or distinguish reservoirs from incidental hosts43. These approaches have also been widely suggested as a way of triaging viral sampling in wildlife41, but model performance is variable, and this approach has only been tested in limited settings. As genomic data become more integral to the field, these approaches also show tremendous promise to narrow the search for hosts of ‘orphan viruses’ with no known non-human hosts10. Finally, mapping the distribution of predicted hosts can help to reveal the spatial extent of spillover risk and can inform possible futures after climate and land use change4.

Conversely, zoonotic risk models aim to identify which viruses can infect one specific host (humans), a task that is often framed as the most important for public health applications. Most statistical analyses have focused on the factors that predict innate cross-species transmission potential, with the assumption that humans are ‘just another host’ and that high host plasticity predicts zoonotic potential44,45. More often, the risk factors used in zoonotic risk models are quite coarse and describe thousands of candidate viruses, such as RNA viruses46,47 with broad host range44,45, larger genomes48, vector-borne transmission22, replication in the cytoplasm22,33,49 or lipid envelopes22,33. Interestingly, many of these traits also have contradictory impacts on transmissibility or severity, adding a layer of complexity; some approaches extend this modelling framework to explicitly predict these downstream properties of zoonotic viruses13,50. Some cutting-edge approaches focus more on the specific underpinnings of human–virus compatibility, anticipating structural and biochemical interactions between viruses and cell receptors51 and using genomic features and machine learning to identify human viruses from metagenomic samples52 or to predict zoonotic potential across different influenza or bacterial strains53,54.

Statistical analyses of viral sharing reframe bipartite host–virus networks as unipartite host–host networks and predict whether two hosts share any viruses on the basis of host traits alone. These approaches are limited by viewing viral ecology as an emergent property of hosts, but this reframing reduces network sparsity and opens up an underexplored computational toolkit for unipartite network models55. These models readily identify neutral processes in the ecology of pathogen transmission30, predict cross-species transmission with surprising accuracy56 and can be easily interfaced with models of macroecological change4. However, because they treat viruses as interchangeable, these approaches lose potentially important signals in the data. A handful of studies also model one specific aspect of viral sharing: the probability that an animal host shares any viruses at all with humans (that is, whether the host is a zoonotic reservoir). As with viral sharing generally, sympatry and synanthropy determine opportunities for human–animal contact and predict sharing, both through domestication and through geographic overlap between wildlife ranges and human population centres22,57. Some traits may uniquely predict zoonotic reservoirs, such as a fast life history strategy, in which lifespan is traded off in favour of fertility58, and because they are smaller and more numerous, fast-lived species are more likely to thrive in disturbed ecosystems or alongside human settlements and thus may often be sources of zoonotic outbreaks59,60.
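The reframing from a bipartite host–virus network to a unipartite host–host network amounts to a projection, sketched below under the assumption that the data arrive as a simple edgelist. Host and virus names are hypothetical. The response variable in a viral sharing model would typically be derived from these counts (for example, whether a pair shares any virus at all), with host traits as predictors.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def viral_sharing(edges: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Project a host-virus edgelist onto host pairs, counting shared viruses."""
    by_host: Dict[str, Set[str]] = defaultdict(set)
    for host, virus in edges:
        by_host[host].add(virus)
    return {
        (a, b): len(by_host[a] & by_host[b])
        for a, b in combinations(sorted(by_host), 2)
    }


# Hypothetical toy edgelist.
EDGES = [
    ("host_A", "virus_1"), ("host_A", "virus_2"),
    ("host_B", "virus_2"), ("host_B", "virus_3"),
    ("host_C", "virus_4"),
]
shared = viral_sharing(EDGES)
shares_any = {pair: count > 0 for pair, count in shared.items()}
print(shared, shares_any)
```

Virus identity drops out of the result entirely, which is the loss of signal noted above.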

Finally, viral richness models and host range models investigate node degree in the bipartite network, that is, how many hosts a given virus can infect (host range) and how many viruses have infected a given host (viral richness). By collapsing the bipartite network into node-level traits, they provide coarse measures that can be used in species-level analyses (for example, see ref. 22). Identifying viral traits, such as vector transmission, that predict broad host range helps in exploring the evolutionary theory of cross-species transmission events and could inform zoonotic risk models22,45. Conversely, understanding ecological drivers of viral diversity can help to prioritize sampling for viral discovery22,61 and, potentially, to understand the distribution of zoonotic risk if some ‘hyper-reservoirs’ host disproportionately many zoonotic viruses32,49. In this special case, some studies investigate zoonotic viral richness and test whether some animals host a greater number of viruses with observed zoonotic potential and whether this effect differs from overall viral richness62. Increasingly, careful analysis often rejects widely held assumptions (for example, bats or urban-adapted animals host more zoonotic viruses) in favour of the null hypothesis that zoonotic viral richness is often simply a product of higher total viral diversity49,62.
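Node degree in the bipartite network is straightforward to compute from an edgelist. The sketch below uses hypothetical host and virus names and counts viral richness per host, host range per virus, and a simple 'zoonotic richness' defined here (as an assumption for illustration) as the number of a host's recorded viruses that are also recorded in humans.

```python
from collections import Counter
from typing import Iterable, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]):
    """Return (viral richness per host, host range per virus) as Counters."""
    unique = set(edges)  # drop duplicate records of the same association
    viral_richness = Counter(h for h, _ in unique)
    host_range = Counter(v for _, v in unique)
    return viral_richness, host_range


def zoonotic_richness(edges: Iterable[Tuple[str, str]], human: str = "human") -> Counter:
    """Per non-human host, count recorded viruses that are also recorded in humans."""
    unique = set(edges)
    human_viruses = {v for h, v in unique if h == human}
    return Counter(h for h, v in unique if h != human and v in human_viruses)


# Hypothetical toy edgelist, including one duplicated record.
EDGES = [
    ("human", "virus_1"),
    ("host_B", "virus_1"), ("host_B", "virus_2"), ("host_B", "virus_3"),
    ("host_C", "virus_3"),
    ("host_B", "virus_1"),
]
richness, host_range = node_degrees(EDGES)
zoonotic = zoonotic_richness(EDGES)
print(richness, host_range, zoonotic)
```

Comparing `zoonotic[h]` with `richness[h]` across hosts is the kind of check that separates a genuine excess of zoonotic viruses from a simple by-product of higher total recorded diversity.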

The limits of prediction

Each of these modelling approaches shows tremendous promise — but each is limited, first and foremost, by the availability of data on the global vertebrate virome. At most, 1% of mammal viruses have been described to date9 and even fewer are known from other vertebrates. At such an early stage in viral discovery, even the most basic statistics, such as host-level viral richness estimates, may say more about sampling effort than the underlying biological reality63,64. When a new zoonotic virus emerges, researchers are disproportionately likely to sample related host species and viral taxa in the vicinity of the spillover event (bottom-up sampling bias). Surveillance also often targets well-studied cosmopolitan species due to availability, and is therefore more likely to discover more viruses, and more zoonotic viruses, in these species (top-down sampling bias49). Similarly, screening efforts have historically focused on hosts and viruses with known relevance to human or domestic animal health; this impact bias may be especially salient in regions with underfunded veterinary and public health surveillance infrastructure. Although high-throughput sequencing and broad-range serological approaches65 can counteract some of these biases, these approaches are not always cost-effective or practical to implement in resource-limited laboratory settings. As a result, targeted screening remains the primary source of host–virus association data, and biases remain pervasive.

Together, the limitations and priorities of these sampling processes heavily shape the observed structure of the host–virus network and are difficult to correct for in modelling efforts30. At present, this is a notable barrier to the advancement of quantitative viral ecology. Most published disease modelling studies use one of only a few small datasets with substantial overlap and similar biases, test the same hypotheses and, unsurprisingly, have generated largely congruent findings (for example, ‘phylogenetic distance structures viral sharing’), most of which are underinformed by microbiology and use phylogeographic or ecological proxies. While independent verification of results is a critical part of the scientific method, especially if data easily facilitate re-analysis or meta-analysis, re-analysing these few datasets so intensely risks pseudo-replication and could entrench spurious findings that are readily explained by sampling bias. For example, a recent study showed that urban-adapted mammals have a higher recorded diversity of zoonotic viruses, but only because they also have a higher total diversity of recorded pathogens, which is probably a clear-cut example of top-down sampling bias62. Cases such as these have engendered scepticism of modelling approaches as a useful tool for applied risk assessment, particularly given the high diversity of wildlife viruses, significant gaps in both host and virus sampling, the spurious patterns generated by sampling bias and even the pace of viral diversification19,63,66. At present, scientists are unlikely to be able to ‘predict and prevent’ outbreaks using these tools. However, models will become more reliable if viral discovery continues at its current pace and, particularly, if data synthesis is a priority for quantitative research. As these datasets grow, they will open doors for more advanced methodologies that have greater impact.

Emerging directions for powerful inference

As this subfield advances, the microbiology underpinning models is becoming more detailed, leading to insights that better bridge virology, ecology and computational biology. Across the global virome, an intangible but finite set of host–virus associations are possible, while each impossible pair is prevented by at least one (identifiable and, ideally, predictable) incompatibility between viral and host microbiology. In this lock-and-key framework, a virus’s ability to infect a novel host species depends on the features that allow it to enter cells, hijack cellular machinery, replicate its genome, evade both the innate and adaptive immune response, produce infectious virions, optimize transmission and cause disease. While the ‘phylogenetic distance effect’ has been used as a broadly supported and convenient (but black box) proxy for these mechanisms, researchers are increasingly turning to data that explicitly characterize these processes instead. For example, host cell receptors and viral envelope proteins act as one kind of lock-and-key, which determine a virus’s potential for cell entry36; data on the angiotensin converting enzyme 2 (ACE2) receptor of mammalian host cells have been used to predict the broad host range of SARS-CoV (severe acute respiratory syndrome coronavirus) and SARS-CoV-2 (refs. 67,68). Compatibility is further altered by biochemical modifications of host and viral proteins, such as glycans (the sugars on the outside of host and virus proteins)69; viral proteins inherit host glycosylation, and their cross-species transmission potential may be enhanced or hindered by glycosylation by the source host70. The fractal geometry of these molecules could be represented as quantitative features, and glycan similarity may be predictive of viral sharing. 
Eventually, it may also be possible to represent more complicated immunology in this framework. For example, broadly reactive innate antiviral factors, such as TRIM5α, act as barriers to different groups of viruses to varying degrees71; while few models currently capture these pathways, this may be an important research horizon in the coming decade.

Increasingly, modellers have also harnessed the genomic revolution to make better predictions in the absence of detailed information on microbiological mechanisms. Genomes are inherently high-dimensional data that encode both meaningful phenotypes and residual signals of coevolution, and they can be used as features for both host and virus nodes in a network. Usually, genomes are analysed by quantifying the usage of dinucleotides, codons and codon pairs; in more advanced cases, these can be augmented with data on amino acid biochemistry, protein–protein interactions53 or longer k-mers72. In the near-term future, these datasets may increasingly be supported by machine learning tools that predict protein folding structures73,74. A number of studies have begun using these genomic features in various forms of link prediction, including predicting reservoir taxonomic orders10, characterizing the broad host and vector associations of flaviviruses75, and predicting the zoonotic potential of circulating strains of avian influenza53,54 (Box 1) and animal viruses more broadly76. Researchers have particularly advanced these methods while studying host–bacteriophage networks77, integrating genomic data into network-based frameworks with other predictors10,77 and exploring the potential for deep learning to identify genes or genomic features that control host specificity or virulence78. These approaches can even be useful in practical outbreak investigation: one recent study predicted the reservoirs of three dozen ‘orphan viruses’ with murky origins (for example, the Bas-Congo virus is predicted to be a virus of even-toed ungulates10).
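As a rough illustration of how a genome becomes a feature vector, the sketch below computes relative k-mer frequencies, with dinucleotides as the default (k = 2). It assumes sequences are given over the DNA alphabet (for RNA viruses, a cDNA representation) and simply skips windows containing ambiguous bases. The example sequence is arbitrary and not taken from any real genome; published pipelines typically add further features, such as codon and codon-pair usage biases.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int = 2) -> Dict[str, float]:
    """Relative frequency of every possible A/C/G/T k-mer in `sequence`."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Arbitrary toy sequence, for illustration only.
features = kmer_frequencies("ACGTACGT", k=2)
print(len(features), features["AC"], features["TA"])
```

Each genome maps to a fixed-length vector (16 entries for dinucleotides, 64 for k = 3), so these features can be attached to virus or host nodes regardless of genome length.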

In combination with these growing sources of data on host–virus interactions, researchers have increasingly started using network science to make more complex and more powerful predictive models. The structure of the host–virus network is determined by unobserved biological processes with identifiable signals; tools from graph theory and network science can recover this hidden information and leverage it for better prediction. Often, these recommendations rely on pairwise dissimilarity of virus communities among hosts or vice versa27, or on the degree distributions of viruses and hosts37. These approaches can be supplemented with phylogenetic or ecological traits fairly easily27,37, or even with genomic data77. More sophisticated ways of leveraging network structure have been developed in computer science, but they remain largely untested on viral networks; in particular, as network data expand — in both the number of associations and the dimensionality of predictors — the door for deep learning methods, such as collaborative filtering79 and neural networks, will also open80. The surprising strength of these methods for other link prediction tasks — from protein–protein interaction networks to online social network or shopping algorithms — makes this avenue particularly promising. Many of these approaches rely on graph embedding, a set of methods that use matrix algebra to generate a small number of feature vectors, which encode information about relationships between nodes or the graph as a whole81; these features can be used to improve link prediction or to add a network component to other kinds of models. For example, one recent study imputed missing links in the mammal–virus network using machine learning, generated graph embeddings of the derived network and used these features to substantially improve the performance of a genomics-based classifier of viral zoonotic potential28. 
By using these kinds of computational tools to characterize the structure of the global virome, scientists may be able to translate a broader understanding of the rules of cross-species transmission into applied problems such as zoonotic risk prediction.
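One of the simplest graph embeddings is a truncated singular value decomposition of the host-by-virus matrix. The sketch below computes it with power iteration and deflation using only the standard library, purely to show the idea. It is not the procedure used in the study cited above, and in practice a linear algebra library would be the natural choice. The toy matrix is hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(a: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def norm(x: List[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplet(a: Matrix, iters: int = 200) -> Tuple[float, List[float], List[float]]:
    """Leading (sigma, u, v) of `a`, by power iteration on a^T a."""
    a_t = [list(col) for col in zip(*a)]
    v = [1.0 / (j + 1) for j in range(len(a_t))]  # fixed, deterministic start
    zero = (0.0, [0.0] * len(a), [0.0] * len(a_t))
    for _ in range(iters):
        w = matvec(a_t, matvec(a, v))
        n = norm(w)
        if n == 0.0:
            return zero
        v = [x / n for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    if sigma == 0.0:
        return zero
    return sigma, [x / sigma for x in av], v


def host_embedding(a: Matrix, dims: int = 2) -> Matrix:
    """Embed each row (host) of `a` in `dims` dimensions as sigma_k * u_k."""
    residual = [[float(x) for x in row] for row in a]
    coords: Matrix = [[] for _ in a]
    for _ in range(dims):
        sigma, u, v = top_singular_triplet(residual)
        for i, ui in enumerate(u):
            coords[i].append(sigma * ui)
        # Deflate: remove the rank-1 component that was just extracted.
        residual = [[rij - sigma * ui * vj for rij, vj in zip(row, v)]
                    for row, ui in zip(residual, u)]
    return coords


# Hypothetical toy matrix: hosts 0 and 1 have identical viromes, host 2 is distinct.
A = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
emb = host_embedding(A, dims=2)
print(emb)
```

Hosts with identical recorded viromes receive identical coordinates, and the resulting low-dimensional vectors can be passed to another model as additional features.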

From models to actionable science

Opportunities to apply these models to high-impact problems are abundant, albeit mostly unexplored. For example, host inference models can help target fieldwork during the early stages of zoonotic outbreaks, when origins are unclear (for example, the SARS-CoV-2 pandemic56) or when a familiar virus emerges in an unusual location (for example, Nipah virus in Kerala, India41). These models can also be used to target wildlife sampling more efficiently. Viral discovery is still expensive at scale: the USAID PREDICT programme spent over US$200 million to discover roughly 1,000 novel wildlife viruses in 10 years18, and the proposed Global Virome Project would aim to spend US$1 billion over the next decade discovering a million more8. Future programmes such as these present an opportunity to test model-guided approaches as both a cost-saving measure and shortcut to accelerate scientific progress. Once wildlife viruses are discovered and characterized, their zoonotic potential can be predicted as part of the first scientific report describing their existence82, helping virologists triage laboratory characterization; these tools may increasingly be paired with models that aim to predict dimensions of epidemic potential, such as human-to-human transmissibility45,50 and pathogen virulence13, which often use the same core datasets and machine learning approaches that are used to predict zoonotic potential. Once priority risks have been identified, managers can implement longitudinal, multi-site sampling programmes that can inform (and support other models that predict) where and when people are at risk of zoonotic spillover. Similarly, modelling approaches that integrate data on surveillance and health systems can help understand where those spillovers are most likely to go undetected14 and spread quickly15. 
When integrated into one pipeline, these different approaches capture all three components of risk: hazard (what the threat is), exposure (where and when it occurs) and vulnerability (what the potential disease burden is, and for whom).

Building predictive models into this pipeline requires that researchers, practitioners and stakeholders have confidence in these approaches. To refine existing models, formalize best practices and convince sceptics (including both colleagues and stakeholders) of the value of this work, modellers need to measure and report model performance in a way that is open, transparent and accountable. Developing standardized meta-datasets26 and forming collaborative teams (for example, the Verena Consortium; see viralemergence.org) can facilitate multi-model study designs that are commonplace in statistical research, such as ensemble models or ‘bake-offs’ testing predictive accuracy. However, these are only a step in the required direction. Actionable forecasting is an iterative process83, and adding feedback loops to the modelling process would help researchers to measure the accuracy of specific approaches, validate or falsify model-generated hypotheses and, ultimately, make sounder and more actionable inferences about the global virome. A lack of feedback among field, experimental and modelling approaches currently precludes that process of refinement; when predictions have been tested, the testing has mostly been ad hoc. For example, one recent field study84 confirmed model predictions of bat filovirus hosts11, while another found no support85; a recent experimental study86 more definitively refuted another prediction about bat reservoirs of Nipah virus41. These kinds of data are rarely fed back into modelling efforts and are almost never pursued prospectively. In a unique counterexample, we recently generated eight predictive models of undiscovered bat hosts of betacoronaviruses and tracked their performance over more than a year as new viral discoveries were reported56. We found that biology-agnostic network models performed no better than random predictions, while machine learning and network models that also leveraged data on bat biology made strong, accurate predictions. Using measures of model performance, we were able to weight a predictive ensemble to make more accurate predictions, and the updated list of potential undiscovered hosts can now be confidently used to target the screening of samples from field surveys and biological collections. This example highlights several best practices for actionable prediction: making predictions public and interpretable, tracking predictive accuracy over time, and incorporating new data into dynamic predictions that keep pace with changing scientific knowledge.
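The idea of weighting an ensemble by tracked performance can be illustrated with a minimal sketch. This is not the procedure used in ref. 56, whose details are not given here; it assumes a simple scheme in which each model is scored by the fraction of its top-ranked predictions that are later confirmed, and is weighted by its skill above a random baseline. The model names, host names, scores and function names are all hypothetical toy values chosen for illustration.

```python
from typing import Dict, List, Set


def hit_rate(scores: Dict[str, float], confirmed: Set[str], top_k: int) -> float:
    """Fraction of a model's top_k ranked hosts that were later confirmed."""
    ranked: List[str] = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for host in ranked[:top_k] if host in confirmed) / top_k


def performance_weights(perf: Dict[str, float], baseline: float) -> Dict[str, float]:
    """Weight each model by its skill above the random baseline (assumed scheme)."""
    skill = {m: max(p - baseline, 0.0) for m, p in perf.items()}
    total = sum(skill.values())
    if total == 0:  # no model beats random: fall back to equal weights
        return {m: 1.0 / len(perf) for m in perf}
    return {m: s / total for m, s in skill.items()}


def ensemble(preds: Dict[str, Dict[str, float]], w: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of per-host scores across models."""
    hosts = set().union(*(p.keys() for p in preds.values()))
    return {h: sum(w[m] * preds[m].get(h, 0.0) for m in preds) for h in hosts}


# Hypothetical predicted probabilities that each candidate species is a host.
predictions = {
    "network_only": {"bat_A": 0.9, "bat_B": 0.2, "bat_C": 0.5, "bat_D": 0.4},
    "trait_based": {"bat_A": 0.3, "bat_B": 0.8, "bat_C": 0.7, "bat_D": 0.1},
}
# Hypothetical hosts reported after the predictions were made public.
confirmed = {"bat_B", "bat_C"}
candidates = {"bat_A", "bat_B", "bat_C", "bat_D"}

# Expected hit rate of a random ranking: share of candidates that were confirmed.
baseline = len(confirmed) / len(candidates)
perf = {m: hit_rate(p, confirmed, top_k=2) for m, p in predictions.items()}
weights = performance_weights(perf, baseline)
combined = ensemble(predictions, weights)
ranked_hosts = sorted(combined, key=combined.get, reverse=True)
print(weights, ranked_hosts)
```

In this toy case, the model that does no better than the random baseline receives zero weight, so the updated ranking is driven by the model that anticipated the new discoveries. In practice, other performance measures and weighting rules could be substituted, and the weights would be recomputed as new host records accumulate.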

We suggest that future sampling efforts would best complement modelling efforts by following up on actionable (high zoonotic risk) leads for public health priorities, as suggested by both expert knowledge and predictive models. If model-generated hypotheses turn out to be largely incorrect, this can help to identify spurious assumptions about a virus’s ecology or identify modelling approaches unsuited for future use; on the other hand, if accurate and effective, these integrated approaches will save time and resources during outbreaks. This will require researchers to match the scope of predictions to the nature of an intended outcome. For example, host inference models are used to suggest gaps in known reservoirs11,41,56, and sampling these hosts first can reduce the cost of viral discovery. Similarly, models that predict viral zoonotic potential can identify threats to human health before the first case of infection28,76; in the near-term future, these tools could be used to identify which wildlife viruses should be the focus of testing for new therapeutics and candidate universal vaccines. Matching predictions to purpose will also help to identify potential barriers to implementation; these are discussed more extensively elsewhere87.

Conclusions

The promise of host–virus network prediction should be met with cautious enthusiasm, particularly with regard to zoonotic risk. These models still face many challenges in practice, and a well-trained scientist may be able to identify many of the same patterns or risks as the most advanced predictive models would. For example, a betacoronavirus pandemic was almost inevitable, not just because the zoonotic potential of bat viruses had been confirmed experimentally, but also because there had been two previous outbreaks of zoonotic betacoronaviruses (SARS and MERS (Middle East respiratory syndrome)) and insufficiently responsive policy and planning (Box 2 and Fig. 3).

Fig. 3: Two decades of coronavirus research.

Even within the different formulations of predicting animal–virus interactions, models can answer a wide range of questions and can be useful at a wide range of points in the history of an outbreak. We highlight here where, in the lead-up to the COVID-19 pandemic, the modelling frameworks (A–G) available today offered useful insights, or may have been able to make a difference if appropriate data, infrastructure and model technology had been available at the time90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105. Reservoir models from sequence data can help to trace orphan viruses to broad host groups10 (A). Some groups of viruses have higher host range and zoonotic potential than others22 (B). Knowing a subset of virus reservoirs allows predictive models to identify potential undiscovered reservoirs11 (C). Models that predict zoonotic potential from sequence data could help to generalize the work of gain-of-function studies with much lower investment and risk76 (D). Viral sharing models can identify hosts with similar viruses, targeting sampling around a given lead30 (E). Link prediction and reservoir inference models can suggest which wildlife hosts may be able to carry a virus in the future37,67 (F). Models can help to guide sampling to trace the origins of SARS-CoV-2 (ref. 56) (G). Meanwhile, hundreds of known wildlife coronaviruses still require risk assessment for zoonotic potential, transmissibility and pandemic risk. CDC, Centers for Disease Control and Prevention.

Just as ‘virus hunting’ has been insufficient to stem the emergence, re-emergence or global spread of several major viral threats18, there are obstacles to turning model-based predictions into disease prevention. Even with massive efforts to mitigate upstream drivers of disease emergence (and quantitative modelling to target those interventions), spillover risk will never be reduced to zero—especially for unknown threats—and after the first human case, the actual levers of pandemic prevention will always lie in diagnostic and surveillance capacity, healthcare access, social safety nets and health system investment—not the tools we discuss here.

However, as future threats emerge, modelling will be a key tool for rapid scientific inquiry, particularly given how much still remains unknown about the global virome. Although scientists may never be able to ‘predict and prevent the next pandemic’, a renewed vision of this work — ‘prediction’ as the development of quantitative tools that can learn the rules of life underpinning host–virus interactions and apply them to information-limited problems to benefit human health and the environment — could be an invaluable step towards true preparedness.

These approaches will help virologists to explore the ecology and evolution of coronaviruses and to build a data-driven risk assessment infrastructure along the lines of the global influenza monitoring system. But there is still no guarantee that the next SARS-like pandemic could be ‘predicted and prevented’, particularly given that the risk of a pandemic such as COVID-19 was ‘predicted’ for two decades by virologists on the basis of other kinds of scientific evidence88,89. Downstream problems preventing the translation of scientific knowledge to public health responses cannot be entirely solved through actionable science; no amount of viral discovery, laboratory characterization, modelling and risk assessment can solve vulnerability due to weak healthcare infrastructure and insufficient funding continuity and support for pandemic preparedness18. Knowing where SARS-CoV-2 came from may help us to target surveillance and slow the emergence of similar viruses, but another highly transmissible coronavirus will inevitably emerge in humans someday. Developing a universal vaccine that protects against bat coronaviruses with predicted zoonotic potential, building pandemic preparedness frameworks that include international governance of vaccine sharing and production, and developing responsive health systems with better syndromic detection of early outbreaks could be enough to achieve a future that never sees another coronavirus pandemic.