Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations

Wardeh, Maya; Blagrove, Marcus S. C.; Sharkey, Kieran J.; Baylis, Matthew

doi:10.1038/s41467-021-24085-w

Article
Open access
Published: 25 June 2021

Nature Communications volume 12, Article number: 3954 (2021)

Abstract

Our knowledge of viral host ranges remains limited. Completing this picture by identifying unknown hosts of known viruses is an important research aim that can help identify and mitigate zoonotic and animal-disease risks, such as spill-over from animal reservoirs into human populations. To address this knowledge-gap we apply a divide-and-conquer approach which separates viral, mammalian and network features into three unique perspectives, each predicting associations independently to enhance predictive power. Our approach predicts over 20,000 unknown associations between known viruses and susceptible mammalian species, suggesting that current knowledge underestimates the number of associations in wild and semi-domesticated mammals by a factor of 4.3, and the average potential mammalian host-range of viruses by a factor of 3.2. In particular, our results highlight a significant knowledge gap in the wild reservoirs of important zoonotic and domesticated mammals’ viruses: specifically, lyssaviruses, bornaviruses and rotaviruses.

Introduction

Thousands of viruses are known to affect mammals, with recent estimations indicating that less than 1% of mammalian viral diversity has been discovered to date¹. Some of these viruses have a very narrow host range, whereas others such as rabies and West Nile viruses² have very wide host ranges (rabies can theoretically infect any mammal³). Host range is an important predictor of whether a virus is zoonotic⁴ and therefore poses a risk to humans. For example, Severe acute respiratory syndrome-related (SARS-CoV) and Middle East respiratory syndrome-related (MERS-CoV) coronaviruses are both believed to have originated in bats, but through a host range that includes other mammals (e.g. palm civets⁵, camels⁶) they have successfully infected humans. Most recently, SARS-CoV-2 has been found to have a relatively broad host range, including: bats; cats; ferrets; and a proposed intermediate host, Malayan pangolins, which may have facilitated spill-over to humans⁷. Knowing the potential host range of viruses is essential for efforts to mitigate the global burden of viral diseases^4,8.

However, our knowledge of the host range of viruses remains limited^1,4,9 and the information we have is hugely biased towards humans and domesticated mammals. For example, there is a significant gap between the number of known human viruses (274 species¹⁰), and those of wild primates (e.g. only 5 species in the toque macaque - Macaca sinica¹⁰, and an average of ~7 viruses per primate host¹⁰), which is largely a result of differential research effort. Surveillance and research efforts often intensify during and after significant outbreaks, leading to further biases; for instance, recent efforts to identify potential reservoirs of SARS-CoV-2 have led to the identification of two new virus species in wild pangolins (Manis javanica and Manis pentadactyla)¹¹, and a pangolin coronavirus⁷, thereby doubling the number of known viruses of pangolins.

Despite these biases, the knowledge accumulated so far provides a valuable resource which can be exploited to estimate the extent to which we are under-observing associations between known viral agents and mammalian hosts. Networks, linking known viruses with their mammalian hosts, present a global view of sharing of these viruses amongst mammalian hosts. This sharing exhibits certain characteristics (e.g. DNA vs RNA viruses^12,13; bats vs rodents¹⁴) which could only be captured at the global level. Various network topological features have been exploited to provide significant insight into patterns of pathogen sharing¹⁴, disease emergence and spill-over events¹⁵, and as means to predict missing links in a variety of host-pathogen networks^16,17 including helminths¹⁸, and viruses^19,20.

Here, we express the topology of our virus-mammal network in terms of counts of potential motifs²¹. Motifs²² are small subgraphs which constitute the building blocks of larger, more complex networks²³. Motifs express specific functions or topological features of the underlying network, and have been used to capture complex and indirect interactions in a variety of systems including biology^24,25,26, ecology^27,28 and disease emergence²⁹. We integrate this global view of viral sharing into a machine-learning driven framework to predict unknown (i.e. either potential or undocumented/unobserved) associations between known viruses and their mammalian hosts. The novelty of our framework lies in its multi-perspective approach whereby each possible virus-mammal association is predicted three times: 1) from the perspective of each of our mammals (e.g. based on the traits of the viruses known to infect wildcats - Felis silvestris, which other known viruses could also infect them?); 2) from the perspective of each of our viruses (e.g. based on the traits of mammalian species in which West Nile virus has been found to date, which other mammals can carry this virus?); and 3) from the perspective of the network linking known viruses with their mammalian hosts.

Our framework utilises 6,331 associations between 1,896 viruses and 1,436 terrestrial mammals, representing 0.23% of all possible associations between these mammals and viruses. It assesses how much these associations are underestimated by predicting which unknown species-level associations are likely to exist in nature (or do already exist but are as yet undocumented). We aggregate these predictions to enhance estimation of the host-range of known mammalian viruses, and to highlight variation in the degree of underestimation at the level of mammalian order (particularly in wild and semi-domesticated species), and viral group (Baltimore classification), family, and genus. In addition, we highlight knowledge gaps in mammalian species susceptible to known zoonoses and equivalent viruses in important domesticated mammals. By investigating this underestimation from three separate points of view, we enhance the overall predictive performance and capture local (at the level of a single viral or mammalian species), as well as global (aggregated) variations in our knowledge gaps.
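The proportion quoted above follows directly from the counts in the text. A minimal arithmetic check, in which the variable names are illustrative rather than taken from the study's code:

```python
# Counts quoted in the text.
n_viruses = 1896
n_mammals = 1436
n_known = 6331

# Every virus paired with every mammal gives the space of possible associations.
n_possible = n_viruses * n_mammals
known_fraction = n_known / n_possible

print(n_possible)                 # 2722656
print(f"{known_fraction:.2%}")    # 0.23%
```

The same product, 2,722,656, is the number of candidate associations for which the framework generates predictions.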

Results

Our framework to predict unknown associations between known viruses and potential mammalian hosts or susceptible species comprised three distinct perspectives: viral, mammalian and network. Each perspective produced predictions from a unique vantage point (that of each virus, each mammal, and the network connecting them, respectively). Subsequently, their results were consolidated via majority voting. This approach suggested that 20,832 (median, 90% CI = [2,736, 97,062]; hereafter values in square brackets represent 90% CI) unknown associations potentially exist between our mammals and their known viruses (18,920 [2,440, 91,517] in wild or semi-domesticated mammals). The numbers of unknown associations predicted by each perspective individually were as follows: mammalian only = 41,537 [4,275, 238,971], viral only = 21,352 [2,536, 95,630], and network only = 76,081 [27,738, 205,814]. Our results indicated a ~4.29-fold increase ([~1.43, ~16.33]) in virus-mammal associations (~4.89 [~1.5, ~19.81] in wild and semi-domesticated mammals).
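The consolidation step can be illustrated with a small sketch of majority voting over three binary calls. This assumes each perspective has already been reduced to a yes/no prediction per virus-mammal pair; the function name, the candidate pairs and their votes are hypothetical, and the thresholds used in the actual pipeline are described in the Methods rather than here.

```python
def majority_vote(mammalian: bool, viral: bool, network: bool) -> bool:
    """Return True when at least two of the three perspectives predict the association."""
    return (int(mammalian) + int(viral) + int(network)) >= 2


# Hypothetical per-perspective calls for three candidate associations,
# ordered as (mammalian, viral, network).
candidates = {
    ("virus_A", "mammal_1"): (True, True, False),
    ("virus_A", "mammal_2"): (False, False, True),
    ("virus_B", "mammal_1"): (True, True, True),
}

consensus = {pair: majority_vote(*votes) for pair, votes in candidates.items()}
for pair, predicted in consensus.items():
    print(pair, predicted)
```

Under this rule an association supported by a single perspective is dropped, which is consistent with the voted total (20,832) being lower than the totals of the mammalian, viral and network perspectives taken individually.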

Additionally, we trained an independent pipeline using only the 3,534 associations supported by evidence extracted from meta-data accompanying nucleotide sequences, as indexed in EID2 (55.82% of all associations - see Methods section and Supplementary Results 8). Our sequence-evidence pipeline indicated that 15,721 (median, 90% CI = [1,603, 88,553]) unknown associations could potentially exist (13,930 [1,298, 83,043] in wild or semi-domesticated mammals).

In the following subsections we first illustrate the mechanism of our framework via an example, then further explore the predictive power of our approach for viruses and mammals.

Example

Our multi-perspective framework generates predictions for each known or unknown virus-mammal association (2,722,656 possible associations between 1,896 viruses and 1,436 terrestrial mammals). We highlight this functionality using two examples (Fig. 1): West Nile virus (WNV), a flavivirus with a wide host range, and the bat Rousettus leschenaultia (order: Chiroptera). We first consider each of our perspectives separately, and then showcase how these perspectives are consolidated to produce final predictions.

(1) The mammalian perspective: our mammalian perspective models, trained with features expressing viral traits (Table 1), suggested a median of 90 [17, 410] unknown associations between WNV and terrestrial mammals could form when predicting virus-mammal associations based on viral features alone – a ~2.61-fold increase [~1.3, ~8.32]. Similarly, our results indicated that 64 [4, 331] new associations could form between our selected mammal (R. leschenaultia) and our viruses – a ~4.37-fold increase [~1.21, ~18.42] (Supplementary Results 4).

Table 1 Viral traits & features used to build our mammalian models.

(2) The viral perspective: our viral models, trained with features expressing mammalian traits (Table 2), indicated a median of 48 [0, 214] new hosts of WNV (~1.86-fold increase [~1, 4.82]). Results for our example mammal (R. leschenaultia) suggested that 18 [3, 76] existing viruses could be found in this host (~1.95-fold increase [~1.16, ~5.00]; Supplementary Results 5).

Table 2 Mammalian traits & features used to build our viral models.

(3) The network perspective: Our network models indicated a median of 721 [448, 1,317] (~13.88 [9, 24.52] fold increase) unknown associations between WNV and terrestrial mammals, and that 246 [91, 336] existing viruses could be found in our selected host (R. leschenaultia), equivalent to a ~13.95 [~5.79, ~18.68] fold increase (Supplementary Results 6).

Considering that each of the above perspectives approached the problem of predicting virus-mammal associations from a different angle, the agreement between these perspectives varied. In the case of WNV: mammalian and viral perspectives achieved 92.3% agreement [72.6%–98.5%]; mammals and network perspectives had 55.3% agreement [33.4%–69.5%]; and viruses and network had 52.9% agreement [19.8%–68.7%]. In the case of R. leschenaultia these numbers were as follows: 96.15% [82.44%, 99.58%], 87.24% [76.37%, 95.04%], and 87.61% [75.90%, 95.25%], respectively. The agreements between our perspectives across the 2,722,656 possible associations were as follows: 98.04% [90.36%, 99.73%] between mammalian and viral perspectives, 96.71% [88.62%, 98.92%] between mammalian and network perspectives, and 97.11% [91.57%, 98.95%] between viral and network perspectives.
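Pairwise agreement of this kind can be read as the share of candidate associations on which two perspectives make the same call. The sketch below assumes binary predictions aligned over the same list of candidate pairs; the function name and toy vectors are hypothetical, and the exact agreement measure used in the study is not restated in this section.

```python
def agreement(pred_a: list[bool], pred_b: list[bool]) -> float:
    """Percentage of candidate associations on which two perspectives give the same call."""
    if len(pred_a) != len(pred_b) or not pred_a:
        raise ValueError("inputs must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(pred_a, pred_b))
    return 100.0 * same / len(pred_a)


# Hypothetical calls for five candidate hosts of one virus.
mammalian = [True, False, False, True, False]
network = [True, True, False, True, True]

print(agreement(mammalian, network))  # 60.0
```

Because simple agreement counts matching negatives as well as matching positives, it tends to be high when most candidate pairs are predicted absent by both perspectives. This is one possible reading of why the agreement across all 2,722,656 pairs is higher than that for a widely connected virus such as WNV.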

After voting, our framework suggested that a median of 117 [15, 509] new or undetected associations could be missing between WNV and terrestrial mammals (~3.45-fold increase [~1.3, ~12.2]). Similarly, our results indicated that R. leschenaultia could be susceptible to an additional 45 [5, 235] viruses that were not captured in our input (~1.37-fold increase [~1.26, ~13.37]). Figure 1 illustrates top predicted and detected associations for WNV (Supplementary Data 1) and R. leschenaultia (Supplementary Data 2). Supplementary Results 1 illustrate results with research effort into viruses, and mammals included as a predictor in our mammalian and viral perspective models, respectively. Predictions with and without research effort incorporated into models trained in these perspectives broadly agreed.

Relative importance of viral features

Our multi-perspective approach trained a suite of models for each mammalian species with two or more known viruses (n = 699, response variable = 1 if the virus is known to associate with the focal mammalian species, 0 otherwise). This enabled us to assess the relative importance (influence) of viral traits (Table 1) to each of our mammalian models. This in turn showcased variations of how these viral traits contribute to the models at the level of individual species (e.g. humans), and at an aggregated level (e.g. by order or domestication status). The results, highlighted in Fig. 2A, indicate that mean phylogenetic (median = 95.4% [75.6%, 100%]) and mean ecological (90.90% [43.50%, 100%]) distances between potential and known hosts of each virus were the top predictors of associations between the focal host and each of the input viruses. Maximum phylogenetic breadth was also important (74.70% [16.60%, 100%]).

Mammalian host range

Our results suggested that the average mammalian host range of our viruses is 14.33 [4.78, 54.53] (average fold increase of ~3.18 [~1.23, ~9.86] in number of hosts detected per virus). Overall, RNA viruses had an average host range of 21.65 [7.01, 82.96] hosts (~4.00-fold increase [~1.34, ~14.15]). DNA viruses, on the other hand, had 7.85 [2.81, 29.47] hosts on average (~2.43 [~1.14, ~6.89] fold increase). Table 3 lists the results of our framework at Baltimore group level and selected family and transmission routes of our viruses. Figure 2 illustrates predicted mammalian host range of our viruses (Fig. 2B, Supplementary Data 3), and the increase in predicted number of viruses per species in species-rich mammalian orders of interest (Fig. 2C, Supplementary Data 4).
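The fold increases reported throughout the Results can be read as the ratio of known plus newly predicted associations to known associations. This reading is an assumption, but it agrees with the headline figures: 6,331 known and 20,832 predicted associations give (6,331 + 20,832) / 6,331 ≈ 4.29. A minimal sketch, with a hypothetical function name:

```python
def fold_increase(known: int, predicted_new: int) -> float:
    """Ratio of (known + newly predicted) associations to known associations."""
    if known <= 0:
        raise ValueError("fold increase is undefined without known associations")
    return (known + predicted_new) / known


# Headline counts quoted in the text.
overall = fold_increase(known=6331, predicted_new=20832)
print(round(overall, 2))  # 4.29
```

Per-virus and per-mammal medians are summarised across many models, so dividing one reported median by another will not necessarily reproduce the reported fold increases exactly.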

Table 3 Predicted range of susceptible mammalian species of viruses per Baltimore group, family (top 15 families, ranked by fold increase) and transmission route.

Relative importance of mammalian features

We trained a suite of models for each virus species with two or more known mammalian hosts (n = 556, response variable = 1 if the mammal is known to associate with the focal virus species, 0 otherwise). This allowed us to calculate relative importance of mammalian traits (Table 2) to our viral models. We were also able to capture variations in how these features contribute to our viral models at various levels (e.g. Baltimore classification, or transmission route) as highlighted in Fig. 3A. Our results indicated that distances to known hosts of viruses were the top predictor of associations between the focal virus and our terrestrial mammals. The breakdown was: 1) mean phylogenetic distance - all viruses = 98.75% [93.01%, 100%], DNA = 99.48% [96.03%, 100%], RNA = [91.93%, 100%]; 2) mean ecological distance - all viruses = 94.39% [71.86%, 100%], DNA = 96.36% [80.99%, 100%], RNA = [69.48%, 100%]. In addition, life-history traits significantly improved our models, in particular: longevity (all viruses = 60.9% [12.12%, 98.88%], DNA = 68.03% [11.22%, 99.69%], RNA = [13.55%, 96.37%]); body mass (all viruses = 62.92% [5.4%, 97.65%], DNA = 72.75% [18.49%, 100%], RNA = 57.45% [4.32%, 95.5%]); and reproductive traits (all viruses = 53.37% [5.67%, 95.99%], DNA = 59.46% [8.27%, 99.32%], RNA = 50.17% [4.85%, 92.17%]).

Wild and semi-domesticated susceptible mammalian hosts of viruses

Our framework indicated a ~4.28-fold increase [~1.2, ~14.64] in the number of virus species in wild or semi-domesticated mammalian hosts (16.86 [4.95, 68.5] viruses on average per mammalian species). These results indicated an average of 13.45 [1.73, 65.04] unobserved virus species for each wild or semi-domesticated mammalian host (known viruses that are yet to be associated with these mammals). Our framework highlighted differences in the number of viruses predicted per order (Table 4). Figure 3 illustrates the predicted number of viruses in wild or semi-domesticated mammals by mammalian host range (Fig. 3B, Supplementary Data 5), and the top 18 virus genera (per number of host-virus associations) in selected orders (Fig. 3C, Supplementary Data 6). Supplementary Results 1 lists the results with the inclusion of research effort into mammalian species in our viral perspective models.

Table 4 Predicted number of viruses per top 15 orders by fold increase in number of viruses predicted in wild or semi-domesticated mammalian hosts (per species).

Network perspective - Potential motifs

We quantified the topology of the network linking virus and mammal species by means of counts of potential motifs²¹. Figure 4 illustrates how potential motifs are captured in our network. Briefly, for each virus-mammal association for which we want to make predictions (n = 2,722,656, of which 6,331 are supported by our evidence, see Methods section), we “force insert” this focal association into our network (Fig. 4A, B) and enumerate all instances of 3 (n = 2), 4 (n = 6), and 5-node (n = 20) potential motifs in which this association might feature if it actually existed²¹ (Fig. 4C visualises these different motifs). Following this process, a feature set is generated comprising the counts of potential motifs for all included associations. Figure 4D illustrates the count of motifs (logged) grouped by mammalian order and virus Baltimore classification.
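The force-insert step can be sketched for the simplest case, the two 3-node motifs. In a bipartite virus-mammal network every connected 3-node subgraph is a two-edge path, centred either on a virus or on a mammal. The code below counts such paths containing the focal association; the function name, return labels and toy network are hypothetical, and the labelling of individual motifs (e.g. M4.1) and the enumeration of 4- and 5-node motifs are not covered here.

```python
from collections import defaultdict


def three_node_potential_motifs(edges, virus, mammal):
    """Count 3-node paths containing the focal edge after force-inserting it.

    Returns (virus_centred, mammal_centred):
      virus_centred  - paths mammal - virus - other_mammal
      mammal_centred - paths virus - mammal - other_virus
    """
    hosts_of = defaultdict(set)    # virus  -> mammals
    viruses_of = defaultdict(set)  # mammal -> viruses
    for v, m in edges:
        hosts_of[v].add(m)
        viruses_of[m].add(v)

    # "Force insert" the focal association, whether or not it is already known.
    hosts_of[virus].add(mammal)
    viruses_of[mammal].add(virus)

    virus_centred = len(hosts_of[virus] - {mammal})
    mammal_centred = len(viruses_of[mammal] - {virus})
    return virus_centred, mammal_centred


# Toy network (illustrative only, not the study's data).
known = [("WNV", "horse"), ("WNV", "human"), ("rabies", "bat"), ("rabies", "human")]
counts = three_node_potential_motifs(known, "WNV", "bat")
print(counts)  # (2, 1)
```

Here the hypothetical WNV-bat association would sit in two virus-centred paths (via horse and human) and one mammal-centred path (via rabies). Larger motifs require enumerating connected subgraphs around the focal edge, which is considerably more expensive.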

Relative importance of network (motif) features

Figure 4E illustrates that M4.1 was the most important feature in our network models (median = 100% [90.19%, 100%]), followed by M5.1 = 97.84% [89.19%, 99.93%], M5.7 = 97.22% [87.7%, 98.77%] and M4.6 = 96.75% [86.13%, 100%]. Research effort of viruses and mammals had relative importance = 90.26% [82.94%, 95.36%] and 88.42% [78.38%, 94.87%], respectively. Overall, 5-node motif-features had median relative influence = 75.06% [1.21%, 98.14%]; whereas 3 and 4-node motif-features had relative influence = 71.69% [55.76%, 85.34%], and 61.06% [27.14%, 100%], respectively. Supplementary Fig. 29 illustrates the partial dependence of network perspective models on each of our network features.

Validation

We validated our framework in three ways: 1) against a held-out test set; 2) by systematically removing selected known viral-mammalian associations and attempting to predict them; and 3) against an external data source, comprising viral-mammalian associations extracted using an exhaustive literature search targeting wild mammals and their viruses^4,30.

Our held-out test set comprised 15% of all data (randomly selected, n = 407,265; 954 known virus-mammal associations, see methods below). We removed this set from our network, computed network features (motifs), and trained constituent models in each perspective with the remainder data. We then estimated our framework performance metrics against the held-out test set. Our framework achieved overall AUC = 0.938 [0.862–0.959], F1-Score = 0.284 [0.464–0.124], and TSS = 0.876 [0.724–0.918], when trained without including research effort in its mammalian and viral perspectives. When research effort was included in these perspectives, performance metrics were as follows: AUC = 0.920 [0.823, 0.944], F1-Score = 0.272 [0.526, 0.093], and TSS = 0.840 [0.646, 0.888].
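The contrast between a high TSS and a modest F1-score is typical of test sets in which positives are rare (here, 954 known associations among 407,265 pairs). The sketch below uses the standard definitions, F1 as the harmonic mean of precision and recall, and TSS as sensitivity plus specificity minus one. The confusion-matrix counts are hypothetical and are not taken from the study.

```python
def f1_and_tss(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1-score, true skill statistic); assumes non-zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    tss = recall + specificity - 1
    return f1, tss


# Hypothetical rare-positive test set: 1,000 positives among 401,000 pairs.
f1, tss = f1_and_tss(tp=900, fp=4000, fn=100, tn=396000)
print(round(f1, 3), round(tss, 3))  # 0.305 0.89
```

In this toy case only 1% of negatives are misclassified, yet they outnumber the true positives and pull precision, and hence F1, down. In the present setting some apparent false positives may also be genuine but undocumented associations, so F1 measured against known data is likely to be conservative.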

The performance of our voting approach was better than any individual perspective, or combination of perspectives (Supplementary Tables 8–11). The most significant improvement was in F1-score, where individual perspectives scores were as follows: network = 0.104 [0.210–0.051], mammalian = 0.115 [0.009–0.064] (0.131 [0.284–0.035] with research effort), and viral = 0.181 [0.374–0.074] (0.196 [0.373–0.067]).

Additionally, we conducted a systematic test to predict removed virus-mammal associations. In this test, we systematically removed one known virus-mammal association at a time from our framework, recalculated all inputs (including from network) and attempted to predict these removed associations. Our framework succeeded in predicting 90% of removed associations (90.70% for associations removed for viruses, 89.92% for associations removed from mammals, Supplementary Results 3).
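The removal test has the shape of a leave-one-out loop: drop one known association, rebuild the inputs from what remains, and check whether the dropped association is predicted. The sketch below shows only that outer loop. `predict` is a hypothetical callable standing in for the full retrained framework, and `toy_predict` is a deliberately simple rule used purely for illustration.

```python
def leave_one_out_recovery(edges, predict):
    """Fraction of known associations re-predicted after each is removed in turn.

    `predict(training_edges, virus, mammal)` is a hypothetical callable that
    returns True when a model built without the focal edge predicts it.
    """
    edges = list(edges)
    recovered = 0
    for i, (virus, mammal) in enumerate(edges):
        training = edges[:i] + edges[i + 1:]  # drop only the focal association
        if predict(training, virus, mammal):
            recovered += 1
    return recovered / len(edges)


def toy_predict(training, virus, mammal):
    """Toy rule: predict a link when the virus still has another known host."""
    return any(v == virus for v, _ in training)


known = [("WNV", "horse"), ("WNV", "human"), ("rabies", "bat")]
rate = leave_one_out_recovery(known, toy_predict)
print(round(rate, 3))  # 0.667
```

In the study every input, including the network features, is recalculated after each removal, so each iteration is far more expensive than in this toy version.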

Finally, our framework predicted 84.02% [77.69%, 89.60%] of the externally obtained viral-mammalian associations (with detection quality > 0) where both host and virus were included in our pipeline, and 77.82% [68.46%, 86.51%] (any detection quality). When including research effort in our mammalian and viral perspectives, these results were: 84.47% [78.15%, 89.60%], and 78.41% [68.83%, 86.37%], respectively.

Discussion

Overall, we predict a 5.35-fold increase in associations between wild and semi-domesticated mammalian hosts and known zoonotic viruses (found in humans, excluding rabies virus). Similarly, our results indicate a 5.20-fold increase between wild and semi-domesticated mammals and viruses of economically important domestic species (e.g. livestock and pets). Bats and rodents, which have been associated with recent outbreaks of emerging viruses such as coronaviruses³¹ and hantaviruses³², are linked with increased risk of zoonotic viruses^4,13,30,33. Our results could potentially enable targeted surveillance of rodents and bats for known viruses not yet associated with species in these orders: we predict a 5.55-fold (2.69 per species) and 5.45-fold (3.77) increases respectively (Fig. 2C, Supplementary Data 6). The fold increases are higher for zoonotic viruses and viruses observed in economically important domestic species, where for bats we predict a 7.42-fold (2.30 per species) and an 8.29-fold increase (2.42 per species) respectively. Whereas for rodents we predict a 6.43-fold (3.69) and a 7.7-fold increase (2.92), respectively.

The increase in associations indicates a knowledge-gap across mammalian species that are potentially susceptible to these viruses. For bats the largest fold increase was in group III viruses with an 8.72-fold-increase (1.43 per species, group IV had the highest fold increase per species, 2.26), whereas in rodents the highest increase was in group V viruses - a 6.23-fold-increase (3.49 per species).

The largest significant fold increases in included bats were with the group V Lyssaviruses (excluding rabies), a family of viruses causing an array of medically and veterinary important rabies-like diseases in a wide range of mammals^34,35, with a 10.4-fold increase in the number of predicted associations (Fig. 3C, Supplementary Data 6). Group V Bornaviruses, which cause a range of encephalitic diseases in mammals including the fatal Borna disease³⁶ (sad horse disease) common in horses and other domesticated animals, had a 23-fold and a 12-fold increase in associations in bats and rodents, respectively. Finally, group III Rotaviruses had an 8.11-fold increase in bats – rotaviruses are the most common cause of diarrhoeal diseases in children and are of particular concern in developing countries³⁷.

Analogous to bats and rodents being important hosts of zoonotic viruses, wild ruminants are key in the maintenance and circulation of viruses affecting ruminant livestock³⁸. Our framework highlights this knowledge-gap by predicting a 7.77-fold increase in number of associations between wild and semi-domesticated ruminants and known viruses (3.37-fold increase per species, Fig. 2C, Supplementary Table 14); and a 10.11-fold increase in associations between these ruminants and observed zoonotic viruses (2.25-fold per species). Furthermore, our model predicted a significant increase in the mammalian host range of important livestock viruses including: a 7.45-fold increase in range of Venezuelan equine encephalitis virus (Group IV, Togaviridae); a 5.33-fold increase in range of Schmallenberg orthobunyavirus (Group V, Peribunyaviridae); and a 2.96-fold increase in range of bluetongue virus (Group III, Reoviridae).

These results demonstrate that our approach can highlight large numbers of potentially missing associations of medically- and veterinary-important viruses and their potential hosts. For instance, we predicted 13 genera of viruses in three species of lynx (Lynx canadensis, Lynx rufus and Lynx pardinus) which were not associated with the lynx in our input data, including Nipah virus. Such information can be used to better understand the risk to people and livestock from these hosts. There are several reasons for which virus-mammal associations may have been disproportionately under-described, which can be categorised as follows:

1. Public health, food security and economically driven research biases: Most of our current knowledge of infectious agents, including viruses, is centred upon humans. Second to humans (37.1% of captured mammalian research effort), agricultural and companion animals tend to receive significantly more research effort (~15% of captured mammalian research effort). Examples include the well-studied microbiome of domestic cats (Felis catus, 57 known virus species) compared with the understudied microbiome of wild felines (e.g. Felis silvestris, 13 known viruses – these expanded to 51 using our framework). Relatedly, wealthier countries produce a larger volume of research, and hence interactions common within, or of importance to, such countries are more likely to be described.
2. Practical limitations: infectious agents of endangered and rare mammalian species, and mammalian species found predominantly in remote regions, we suspect, are less likely to be characterised due to difficulties in sampling these mammals in their natural habitats. The same likely applies to viruses that are less common in mammals (e.g. avian pathogens). Nevertheless, our approach can capture and expand associations of both rare viruses (found in one or two species), and understudied mammalian species, due to separation of perspectives. If a virus is rare, our approach would capture potentially susceptible mammals via the network and mammalian perspectives. Similarly, if a mammalian species is rarely studied, then we would still capture viruses potentially found in this mammalian species via the network and viral perspectives. Overall, our voting framework was able to expand the host range of rare viruses (known hosts ≤ 2, n = 1,450) from 1,619 to 4,174 (~2.16 average increase per rare virus). Virus range of rare mammals (known viruses ≤ 2, n = 954) was also increased from 1,150 to 4,318 (~3.21 average fold increase per mammal).
3. Biological reasons: virus-mammal associations which produce more visible or marked effects are more likely to have been studied³⁹. For instance, fertility or physically observable interactions are more likely to be over-studied, whilst potentially important asymptomatic interactions, or interactions where a cross-immunity from related viruses masks observable symptoms, may potentially remain unnoticed and hence understudied. Furthermore, co-evolution between virus and primary host often results in a less severe phenotype⁴⁰, whilst the same virus in an incidental host may result in more marked and hence more studied disease. Examples include Ebola viruses presenting minimal symptoms in bats but severe disease with high mortality in humans⁴¹; analogous interactions where the former host may have been unobserved are likely to be plentiful. For example, our framework indicated that 34 species of bats could be susceptible to Ebolaviruses. Recently, advances in metagenomics have enhanced viral discovery in hosts, enabling cheap and rapid identification and sequencing of host viromes. This approach mitigates many historical ‘top-down’ limitations mentioned above, enabling simple identification of e.g. asymptomatic infections^39,42. However, whilst this methodology is likely to be increasingly used in future, it is currently in its infancy and a large proportion of current viral knowledge is still the result of potentially biased top-down approaches.

The novelty of our approach lies in the separation of perspectives - by isolating the viral, mammalian and network perspectives we were able to further our understanding of mammalian hosts of known viruses in a number of ways. Firstly, our framework integrated local (mammalian and viral) and global (network) approaches. Our locally trained mammalian and viral models enabled the exploration of the effect, by means of variable importance, of a comprehensive set of mammalian and viral traits. We were able to measure the relative influence each of our mammalian features had on each multi-host virus; and conversely, the influence each of our viral features had on each mammal (with two or more known viruses). This facilitated the aggregation of variable importance by, for instance, viral or mammalian taxonomy, which in turn illustrated differences in how these features influenced our models. For example, when aggregated at genus level, we found that body mass and a larger proportion of plants in the diet had higher influence on our models for Orbiviruses, which are known to infect ruminants and horses (7 species, median values = 90.97 and 86.83, respectively); whereas longevity, and weaning age were more influential to Ebolavirus models (5 species, 94.82 and 91.42, respectively). Uniquely, we incorporated geospatial features extrapolated from an extensive collection of global data on climate, environmental, agricultural, and mammalian diversity variables. The importance of these varied across our viral models. For instance, in coronaviruses, mean human population was more important for Beta-coronaviruses (83.38) than Alpha-coronaviruses (65.65). From the mammalian perspective, phylogenetic and ecological distances to known hosts were the most influential across all models. The importance of maximum phylogenetic breadth varied across families within the same taxonomic order. For instance, in rodents, it ranged from 89.08 (median) in Sciuridae (14 species) to 48.83 in Muridae (37 species). Local, species-level variable importance further enhances the utilisation of our approach to targeted surveillance, by enabling flexible aggregation of results from individual species to entire groups and orders.

Secondly, we consolidated these viral and mammalian traits with network topological features, expressed in terms of counts of potential motifs. We measured variable importance of our topological features and found that the likelihood of an association increased the more it featured in motifs linking its mammalian host with a virus with a wide host range (M4.1 and M5.1). Similarly, an association was more likely to be predicted by our network perspective models the more it featured in motifs linking a mammal with a wide range of viruses (M4.6 and M5.20), but the influence of these motif-features was not as high as the previous two. More complex motif-features (e.g. M4.4, M5.5, M5.9, M5.12, and M5.19) had a negative influence: the more an association featured in them, the less likely it was to be predicted by our models. This could be because these motifs indicate a separation between the known host range of the focal virus and the focal mammal, and vice versa. For instance, higher counts of M5.19 suggest that, in general, there are no indirect pathways between the focal virus and mammal, despite the mammal featuring in several such pathways. Thus, higher counts of M5.19 might indirectly indicate that the focal virus is known to affect different types of hosts (e.g. different taxa).

Thirdly, our voting approach, despite being more conservative than its components (Supplementary Results 2, 4–6), was able to bridge a significant gap in our knowledge of mammalian hosts susceptible to included viruses (>18,000 associations between wild and semi-domesticated mammalian species and known viruses). Furthermore, our voting approach outperformed each of its constituent perspectives, and any combination of two perspectives, across all included metrics. The estimated improvements in performance metrics are essential, particularly for the application of our approach to targeted surveillance, because they indicate that in addition to its ability to detect documented associations very well, we have more confidence in predicted novel (unknown) associations (better F1-score) compared with results derived from any individual perspective, or by joining any two perspectives. Additionally, the results of our approach align with recent advances in the field of predicting novel hosts of known viruses, which all predict an increase in the host range^2,4,17,20,33,43,44. For instance, we predict 44 novel associations between bats and Filoviruses (total of 60), which is a more conservative estimate than recent studies⁴³. For flaviviruses, we predict 85 species of primates to be hosts to both Zika and yellow fever viruses (20 species when voting with the 90th percentile across our 3 perspectives) compared to 21 predicted in recent work⁴⁴. Despite the large number of predicted, potentially novel, associations, the fact that our predictions can be distilled to the level of an individual virus or mammalian species makes our approach suitable for targeted surveillance per host or virus, or groups therein.

There remain, however, key areas for further improvement. We differentiate between two types of unknown virus-mammal associations: 1) associations between a known virus and a potentially susceptible mammalian host of this virus: known-unknowns; and 2) completely unknown viruses that are associated with a host but are not yet discovered: unknown-unknowns. Our approach aimed at the first type: we included as much information as available on known viruses and their susceptible mammalian species to predict associations between wild and semi-domesticated species and our viruses. In the case of species-rich mammalian orders containing sufficiently studied species (e.g. Primates, Carnivora), a higher proportion of their currently known viruses are likely to have been found. Hence, our approach was able to make predictions for wild and semi-domesticated (medium to under-studied) species in those orders. However, for mammalian orders with fewer species, and where those species are under-studied, there are likely to be more unknown-unknowns; therefore, a larger proportion of their viruses would not be predictable by our approach or other approaches.

Our approach also has limitations with regard to known-unknowns; we acknowledge that it does not entirely ameliorate the impact of research effort (Supplementary Figs. 10–14). Whilst our models did not necessarily over-predict for heavily studied mammalian species, particularly humans and economically important domesticated animals, they predicted more known-unknowns for well-researched mammals (Supplementary Figs. 10–11, 14). The effect of research effort into viruses is more prominent, with our approach predicting significantly more potentially susceptible mammalian species for heavily studied viruses such as Influenza A virus and Rotavirus A (Supplementary Figs. 12–14). In other words, our approach cannot fully distinguish between two possible reasons for a mammal having few virus associations: 1) the viruses have never been observed in the mammal (due to research effort), and 2) the mammal is biologically not susceptible. One potential field-wide solution to this problem would be the inclusion of known-unsusceptible associations. This could potentially mitigate a large part of the ‘research effort’ related issues, as well-studied species would generally also have larger numbers of known-unsusceptible associations, which would tend to balance the effect. However, there are many reasons why this cannot be used at present, including: negative results are less likely to be published, especially for relatively under-studied and wild species; no resource of unsusceptible associations currently exists beyond review articles capturing a small number of either viral or mammalian species; and practical difficulties proving species-wide unsusceptibility to a given virus.

Prediction of unknown and novel viruses and their potential threat to humans, livestock and wildlife is an increasingly important and active research area. Where an established virus is increasing its geographical range (e.g. due to climatic or demographic factors), then our framework provides powerful means to assess potential hosts it has yet to come into contact with. The identification of these hosts is exceedingly important, as viruses continue to move across the globe via complex transmission cycles featuring migratory animals⁴⁵, legal and illegal trade in animals^46,47, unknown hosts (in various taxa, including non-mammalian hosts), bridge vectors², and reverse zoonoses⁴⁸. However, for completely novel or never-studied viruses, our approach cannot predict potential associations due to lack of viral and network traits: an example is SARS-CoV-2; our pipeline could not have predicted its host association when it first emerged, but subsequent study of the virus, its traits and its observed hosts allows for prediction of its unobserved host associations⁴⁹. Future work may be able to enhance the predictive power of our approach by incorporating more diverse viral traits, particularly in terms of detailed genetics⁹ and in terms of geographical distribution and associated features of the virus as highlighted in previous work^50,51. Integration of predictors of host-virus interactions such as the existence of particular viral receptors in host cells would also greatly benefit our models and create a fourth perspective that could be added into our framework.

Finally, a further separation of perspectives could also be achieved by incorporating arthropod vectors or intermediate hosts, or different classes of pathogens and hosts, particularly birds. Future integration of avian species into our network could potentially increase predictive power and explainability of our approach, particularly in relation to the ecology of viruses for which birds are known to be important reservoirs or amplifying hosts (e.g. flaviviruses such as West Nile and Japanese encephalitis, and influenza viruses). The incorporation of birds into our network component will enable quantification of yet-uncaptured important pathways in which birds play central roles. However, such integration will first require establishing a more complete picture of avian viruses and their hosts – the number of associations we were able to capture for avian species was 2,525 between 1,251 bird and 306 virus species (~40% of the total number of mammalian associations in this study). This could be achieved either by deeper mining of existing sources or by developing separate predictive pipelines focusing solely on birds.

In this study we attempted to expand our knowledge of viral host ranges by predicting the unknown hosts of known viruses. We applied a divide-and-conquer approach which separated viral, mammalian and network features into three unique perspectives, each predicting associations independently to enhance predictive power. We predicted over 20,000 unknown associations between known viruses and mammalian hosts, suggesting that current knowledge greatly underestimates the number of associations in wild and semi-domesticated mammals. Completing the picture of virus-host interactions can help identify and mitigate current and future zoonotic and animal-disease risks, including spill-over from animals into humans.

Methods

Virus-host species associations

Species-level virus-mammal associations were extracted from the ENHanCEd Infectious Diseases Database¹⁰ (EID2, version from December 2019). EID2 automatically mines information on pathogens (of any taxa), their hosts and locations from two sources: 1) meta-data accompanying nucleotide sequences (hereafter sequences) published in GenBank^52,53; and 2) titles and abstracts (hereafter TIABs) of publications indexed in PubMed⁵⁴. At the time of extraction, EID2 had collated information from >7 million sequences (and processed 100M+ sequences), and >8 million TIABs. EID2 imports names of organisms (here viruses and mammals), and their taxonomy from the NCBI Taxonomy database⁵⁵. It also extends these names with an exhaustive, expertly curated, collection of alternative and common names. These names are utilised to disambiguate hosts and pathogens in sequence meta-data and TIABs using inclusion and exclusion terms¹⁰. Evidence collated from TIABs is considered likely if it exceeds a given threshold (usually ≥ 4 publications). For the vast majority of stored organisms, EID2 follows the NCBI definitions of ‘species’ and ‘subspecies’, with unclassified and uncultured species being denoted as ‘no rank’. For the purposes of this study, we recursively aggregated virus-mammal associations – a mammal that was found to host a strain or subspecies of virus was considered a host of the corresponding virus species (and vice versa). We further checked each of these species-level associations for accuracy and to eliminate laboratory-produced results. This resulted in 6331 associations between 1896 viruses and 1436 terrestrial mammals. The support of these associations in EID2’s evidence base was as follows: 22.79% had publication and sequence evidence; 33.03% were supported by nucleotide sequence only, and 44.18% were supported only by evidence extracted from TIABs.
The nature of this evidence was as follows: 70.48% of associations were strongly supported by sequence, isolation, or PCR evidence; 29.52% were supported by serology-only evidence. Of the total number of associations inferred from publication-only evidence, 66.82% were supported by serological evidence. We trained our pipelines with associations obtained from both sources; this is because serology is a standard means of determining previous viral infection in an individual. Isolation cannot detect an infection that has since been cleared by the host’s immune system. Hence isolation and serology have different applications, and both should be utilised to get a more complete picture. Both sequencing and serological methodologies vary in their sensitivity and specificity depending on the virus clade, with neither being superior in all cases^56,57,58. Consequently, we chose to present the results using both isolation and serology in the manuscript. However, to account for possible variations in the strength of our evidence base, we trained a separate predictive pipeline including only those associations supported by sequence evidence (55.82% of total); Supplementary Results 8 summarise predictions of this pipeline; full results are included in our data release (see below).

Multi-perspective framework to predict unknown virus-mammal associations

We transformed our species-level virus-mammal associations into a bipartite network in which nodes represent either virus or mammal species, and links indicate associations between mammalian and viral species. Our bipartite virus-mammal network is sparsely connected – roughly 0.23% of potential associations are documented in EID2, despite it being the most comprehensive resource of its kind. This sparsity is more evident in wild and semi-domesticated species where only 0.182% of potential associations are observed. We treated the problem of bridging this gap in our knowledge of virus-mammal associations as a supervised classification problem of links in the bipartite network. In other words, we aimed to predict unknown associations between known viruses and their mammalian hosts based on our knowledge to date of these species. Each possible virus-mammal association is predicted three times as follows.

1 – From the mammalian perspective: For each mammal in our network, given a set of features (predictors) comprising viral traits (e.g. genome, transmission routes) – Table 1, what is the probability of an association forming between this mammal and each of the 1,896 virus species?

2 – From the viral perspective: For each virus species found in our network, given a set of features (predictors) encompassing mammalian phylogeny, ecology, and geographical distribution – Table 2, what is the probability of an association forming between this virus and each of 1,436 terrestrial mammals?

3 – From the network perspective: Given a set of topological features representing the bipartite network expressing most of our knowledge to date of virus-mammal associations, what is the probability of an association forming between any virus and any mammal in our dataset (n = 1,896 × 1,436 = 2,722,656 possible associations)?
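The size of the prediction space and the overall sparsity of the network follow directly from the counts reported in this section. A minimal arithmetic check, using only those counts (the variable names are illustrative):

```python
# Counts reported in the text.
n_viruses = 1896
n_mammals = 1436
n_known = 6331

# Every virus-mammal pair is a potential link in the bipartite network.
n_possible = n_viruses * n_mammals

# Share of potential links that are documented (network density).
density = n_known / n_possible

print(n_possible)               # 2722656
print(f"{100 * density:.2f}%")  # 0.23%
```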

Our framework trained and selected a set of supervised classifiers in each of the above perspectives as discussed below. It then consolidated the results of the best performing classifiers using voting whereby an unknown (potential or unobserved/undocumented) association was selected if it was predicted by at least two of the three perspectives. This is because each of our perspectives focuses on a particular aspect of virus-host associations. From the mammalian perspective, and for every included mammal, the probability of a virus affecting/associating with this focal mammal is quantified based on our knowledge of the viruses found in this mammal to date. Similarly, from the viral perspective, the probability of the virus infecting/associating with included mammalian species is quantified based on our knowledge to date of known hosts of this virus. The final perspective enables generation of predictions based on the topology of the network linking all included mammals with all included viruses. Thus, our three perspectives capture all aspects of viral-mammalian association without biasing toward one aspect.
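The voting rule can be sketched as follows. This is an illustration rather than the released implementation: the function name, the 0.5 probability cut-off, and the treatment of a perspective without a model for a given pair (passed as `None`) are all assumptions made for this example.

```python
def vote(p_mammal, p_virus, p_network, cutoff=0.5, min_votes=2):
    """Return True when at least `min_votes` perspectives predict the link.

    Each argument is the probability from one perspective, or None when that
    perspective has no model for the pair (an assumption of this sketch).
    """
    probabilities = (p_mammal, p_virus, p_network)
    votes = sum(1 for p in probabilities if p is not None and p >= cutoff)
    return votes >= min_votes


# Hypothetical probabilities for three unknown associations.
decisions = [
    vote(0.9, 0.7, 0.2),   # two perspectives agree
    vote(0.9, 0.1, 0.2),   # only one perspective predicts the link
    vote(None, 0.8, 0.6),  # no mammalian model, but the other two agree
]
print(decisions)  # [True, False, True]
```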

Our framework is flexible, in terms of machine-learning algorithms selected, classifiers trained, and features engineered for each perspective. It avoids overfitting as it approaches the problem from various perspectives, and effectively consolidates ensembles of classifiers trained on subsets of the underlying data. In addition, no constituent model of our framework has been trained with all available data at any time. Finally, our framework enables the incorporation of hosts where only one virus has been detected to date (via perspectives 2 and 3), and viruses where only one host has been discovered (via perspectives 1 and 3).

The local approach – the mammalian and viral perspectives

Our mammalian and viral perspectives generate “local” predictions for hosts and viruses, respectively. These local predictions are derived by training a suite of models for each host (with two or more known viruses), and virus species (with two or more known mammalian hosts), as described in subsequent sections. In other words, each mammalian species has its own “local” suite of models, trained using viral traits (Table 1), to predict viruses which could associate with this host. Similarly, each selected virus has its own set of models, trained using mammalian features (Table 2), to predict mammalian hosts which are potentially susceptible to this virus. The reason for predicting locally (per host, or virus) is two-fold: 1) Variations of host susceptibility, viral host range: traits (features) determining, for instance, mammalian species susceptibility to West Nile virus, are potentially different to those affecting these species’ susceptibility to Bovine immunodeficiency virus. Hence, by training these models locally, we are able to ascertain the influence of these traits on each host, and each virus. 2) Class balancing: we synthesised new positive training instances for each of our hosts, based on the traits of their known viruses. Likewise, we synthesised new positive instances for each of our viruses, based on the traits of their known mammalian hosts (as discussed below).

The network perspective - topologically derived network features of virus-mammal associations

In contrast with our mammalian and viral perspectives, the network linking known viruses with their mammalian hosts presents a “global” view of how these viruses are shared amongst their mammalian hosts. Here we capture the topology of this bipartite network by means of counts of potential motifs²¹ (Fig. 4A, C). These motifs capture important indirect pathways between viruses and their mammalian hosts. These pathways vary from simple generalisations capturing whether a virus has a wide range of hosts or not (M3.1, M4.1, and M5.1), or whether the mammal is exposed to many viruses (M3.2, M4.6, M5.20), to more complex pathways (e.g. two host species sharing 80% of their viruses with each other; three viruses sharing 50% of their hosts with each other). These pathways might indicate whether an unknown association is more likely to exist in nature or not, and are only capturable, and most importantly quantifiable (Fig. 4D), at the global level as encapsulated by our network perspective.

Transforming these pathways into features from which supervised machine-learning algorithms could learn enables us to make predictions directly from the network structure. Here, counting of potential motifs is limited to the 3-step ego network of both virus and host – the network comprising nodes which can be reached in 3 steps (links) or fewer from each focal node (the nodes comprising the focal association; Fig. 4B).
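The 3-step ego network can be extracted with a breadth-first search that stops expanding at the step limit. The sketch below uses a toy network with hypothetical node names; taking the ego network of an association as the union of the ego networks of its virus and its mammal is an assumption of this example.

```python
from collections import deque


def ego_network(adjacency, focal, max_steps=3):
    """Nodes reachable from `focal` in at most `max_steps` links (BFS)."""
    distance = {focal: 0}
    queue = deque([focal])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Toy bipartite network: viruses v*, mammals m* (hypothetical names).
links = [("v1", "m1"), ("v1", "m2"), ("v2", "m2"), ("v2", "m3"),
         ("v3", "m3"), ("v3", "m4"), ("v4", "m4"), ("v4", "m5")]
adjacency = {}
for virus, mammal in links:
    adjacency.setdefault(virus, set()).add(mammal)
    adjacency.setdefault(mammal, set()).add(virus)

# Focal (unknown) association v1-m3: union of both 3-step ego networks.
nodes = ego_network(adjacency, "v1") | ego_network(adjacency, "m3")
print(sorted(nodes))  # m5 lies more than 3 steps from both focal nodes
```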

We generated a features-set comprising the counts of potential motifs for all associations (Fig. 4D) and trained several machine-learning algorithms with this dataset (plus research effort) as detailed in following subsections. Motifs are usually associated with specific frequency thresholds²³. However, here we follow previous work²¹ in removing this restriction. We simply counted the number of occurrences of potential motifs of each focal association, and then let the machine-learning algorithms detect which motifs were particularly important to the problem of predicting links in our network (Fig. 4E).

Research effort

We incorporated research effort on mammal and virus species into our network perspective models. This is because it is through this perspective that predictions are made for all hosts and all viruses at the same time, and where the effect of research effort into both the hosts and viruses can be measured and corrected for adequately and simultaneously. We calculated research effort as the total number of sequences and publications of each species as indexed by EID2¹⁰. In addition, we trained a separate pipeline in which the research effort into our hosts was included as a predictive feature in each constituent viral perspective model, and the research effort into our viruses was included in each constituent mammalian perspective model. Agreement between constituent models trained with and without research effort in the mammalian and viral perspectives was 99.7% [99.9%–99.2%] (values in brackets are confidence intervals derived from the prediction CIs), with Cohen’s Kappa = 0.86 [0.85–0.89] (Kappa range: 0–1). Results of this pipeline are listed in Supplementary Results 1. Detailed validation of both pipelines is listed in Supplementary Results 2.
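Raw agreement and Cohen's Kappa between two sets of binary predictions can be computed as in the sketch below. The function name and the toy predictions are hypothetical; the example only illustrates why, with few predicted positives, Kappa is lower than raw agreement.

```python
def agreement_and_kappa(a, b):
    """Observed agreement and Cohen's Kappa for two binary prediction lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each pipeline's rate of positives.
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical predictions of two pipelines for ten associations.
with_effort = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
without_effort = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
observed, kappa = agreement_and_kappa(with_effort, without_effort)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```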

Multi-perspective prediction of virus-mammal associations

As highlighted above, our framework comprised three perspectives: mammalian, viral and network. Each of these perspectives trained a set of models with different features (Tables 1 and 2, and Fig. 4 respectively), and hence required its own pipeline as described below (Supplementary Note 5).

Mammalian and viral perspectives

Class balancing

On average each virus in our dataset affected 3.45 mammals (~0.240% of the 1436 mammals in our models), and each mammalian host was affected by 4.41 viruses (~0.241% of the 1833 viruses in our models). This presented an imbalance in our data, whereby a small percentage of instances are actualised. We dealt with this issue in two ways: first we excluded any virus (n = 1281) which was found in only one mammal species from our virus models pipeline (viral perspective), and we excluded any mammal (n = 758) which is only affected by one virus from our mammal models pipeline (mammalian perspective). Second, we deployed SMOTE - Synthetic Minority Over-sampling Technique^59,60 to rebalance the classes prior to training each of our viral (n = 8 × 556) and mammalian (n = 8 × 699) models. SMOTE synthesises new minority class instances from existing minority instances using a variation of k-nearest neighbour algorithm. The SMOTE algorithm then over-samples from the minority instances (original and synthesised) and under-samples from the majority class to create a balanced training set. All class balancing was achieved using caret R package⁶³ (R version 3.6.2).
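The over-sampling half of SMOTE can be sketched in a few lines. This is a simplified, numeric-features-only illustration and not the implementation used in the study; the function name, `k`, and the toy points are assumptions. It also shows why at least two known positives are needed: a new instance is interpolated between a minority instance and one of its minority neighbours.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Synthesise `n_new` points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` (squared Euclidean).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # The new point lies at a random position on the segment between them.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic


# Three hypothetical positive instances described by two numeric traits.
known_positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(known_positives, n_new=10, k=2)
print(len(new_points))  # 10
```

Under-sampling of the majority class, the second half of the rebalancing described above, is not shown here.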

Classification algorithms

For each mammal and each virus selected above we trained 8 classification algorithms (Supplementary Table 4): Model Averaged Neural Network (avNNet), Stochastic Gradient Boosting (GBM), Random Forest, eXtreme Gradient Boosting (XGBoost), Support Vector Machines with radial basis kernel and class weights (SVM-RW), Linear SVM with Class Weights (SVM-LW), SVM with Polynomial Kernel (SVM-P), and Naive Bayes. These classifiers offer a varied subset of the plethora of classifiers available for experimentation (over 179 classifiers categorised into at least 17 families⁶¹), and were selected due to their robustness, scalability⁶¹, and their potentially good performance with imbalanced data classification⁶². All models were trained and optimised using the caret R package⁶³ (R version 3.6.2) as described below.

Training and tuning

Each of the above models was trained with 10-fold cross validation (10 repeats). This validation method works by splitting the training data into 10 random samples; each sample is held out in turn, and the model is trained on the remaining groups. The model’s predictions for the existence or absence of the mammal-virus associations in the held-out group are used to construct confusion matrices and calculate an optimisation metric (here Area Under the ROC Curve, AUC for short). The optimisation metric is used to select the best model in the validation process.
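The resampling scheme can be sketched as an index generator. The function name and the way folds are formed (shuffling, then dealing indices into `k` groups) are assumptions of this example; model fitting and AUC calculation on each split are left out.

```python
import random


def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (train, held_out) index lists for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]  # k near-equal random groups
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            yield train, held_out


# 50 hypothetical training instances: 10 folds x 10 repeats = 100 splits.
splits = list(repeated_kfold(50))
print(len(splits))  # 100
```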

We adopted an adaptive resampling approach⁶⁴ to tune the hyper-parameters of our models. This approach resamples the tuning parameter grid in a way that concentrates on values that are in the neighbourhood of the optimal settings (adaptive). Due to the large number of classifiers trained in our framework, this adaptive approach allowed us to find optimal (or near optimal) values of the hyper-parameters of each included machine-learning algorithm without relying on the nominal resampling process whereby all the tuning parameter combinations are computed for all the resamples before a choice is made about which parameters are good and which are poor.

Classifier selection strategy

We computed three performance metrics based on the median predicted probability across each set of replicate models: AUC, true skill statistic (TSS) and F1-score (Supplementary Table 5). The best performing classifier for each virus or mammal, across all measures, was included in our multi-perspective final model (Supplementary Results 4 and 5).
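The three metrics can be computed from predicted probabilities as sketched below, using their standard definitions: TSS = sensitivity + specificity − 1, F1 = 2TP / (2TP + FP + FN), and AUC as the probability that a known association is ranked above an unknown one. The 0.5 cut-off, the function names and the toy data are assumptions of this example.

```python
def confusion(y_true, y_prob, cutoff=0.5):
    """Return (tp, fp, tn, fn) at the given probability cut-off."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = p >= cutoff
        if y and predicted:
            tp += 1
        elif y:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def tss(tp, fp, tn, fn):
    return tp / (tp + fn) + tn / (tn + fp) - 1  # sensitivity + specificity - 1


def f1(tp, fp, tn, fn):
    return 2 * tp / (2 * tp + fp + fn)


def auc(y_true, y_prob):
    """Rank-based AUC; ties between a positive and a negative count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y]
    neg = [p for y, p in zip(y_true, y_prob) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = known association) and median probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.1]
counts = confusion(y_true, y_prob)
print(counts, round(tss(*counts), 3), round(f1(*counts), 3),
      round(auc(y_true, y_prob), 3))
```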

Confidence intervals

In order to allow us to incorporate uncertainty arising from variations in SMOTE resampling technique and resulting training sets, and to generate empirical confidence intervals (90%), hyper-parameters of best performing models were carried across to train 50 replicate models for each best performing mammalian or viral model. In other words, we generated a bragging (i.e. median) ensemble for each selected host or virus, and the resulting prediction was carried to our multi-perspective final model.
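A bragging ensemble with an empirical 90% interval can be sketched as the median and the 5th and 95th percentiles of the replicate predictions. The function name, the quantile interpolation rule (`method="inclusive"`) and the replicate values are assumptions of this example.

```python
import statistics


def bragging(replicate_probs):
    """Median prediction and empirical 90% interval across replicate models."""
    # With n=20, the first and last cut points are the 5th and 95th percentiles.
    cuts = statistics.quantiles(replicate_probs, n=20, method="inclusive")
    return statistics.median(replicate_probs), cuts[0], cuts[-1]


# Hypothetical probabilities from 50 replicate models for one association.
replicates = [i / 49 for i in range(50)]
median, lower, upper = bragging(replicates)
print(round(median, 2), round(lower, 2), round(upper, 2))
```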

Network perspective

Class balancing

Our bipartite virus-mammal network is sparsely connected, with 6331 documented associations out of 2,722,656 possible associations (0.23%). We therefore implemented strict under-sampling, whereby balanced samples were drawn at random (without replacement) from the set of all potential virus-mammal associations. Each sample comprised 2000 instances: 1000 positive (known) and 1000 unknown virus-mammal associations.
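Drawing one balanced sample can be sketched as below. The function name and the toy pairs are hypothetical, and "without replacement" is read here as applying within each sample, which is an assumption.

```python
import random


def balanced_sample(known, unknown, per_class=1000, seed=0):
    """Draw `per_class` known and `per_class` unknown pairs, labelled 1 and 0."""
    rng = random.Random(seed)
    positives = rng.sample(known, per_class)   # sampled without replacement
    negatives = rng.sample(unknown, per_class)
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]


# Toy data: 50 known and 5000 unknown (virus, mammal) pairs.
known_pairs = [("v_known", m) for m in range(50)]
unknown_pairs = [("v_unknown", m) for m in range(5000)]
sample = balanced_sample(known_pairs, unknown_pairs, per_class=10)
print(len(sample), sum(label for _, label in sample))  # 20 10
```

Repeating the draw with different seeds gives the replicate training sets used to build an ensemble.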

Training & tuning

We trained the same selection of algorithms as above with balanced sets (2000 instances each) using 10-fold cross validation with adaptive resampling to optimise AUC. We repeated this process 100 times to generate a bragging ensemble of predictions (derived as probabilities) of these replicate models. We calculated empirical confidence intervals (90%) of the ensemble probabilities across the 100 replicate models.

Classifier selection strategy

We selected the bragging ensemble which obtained the best overall performance metrics (AUC, F1-Score and TSS) when applied to all available associations. The predictions of the best overall ensemble were incorporated into our final model (SVM-RW - Supplementary Results 6).

Performance assessment

We trained the constituent models of each perspective with a stratified random training set comprising 85% of all data (n = 2,315,391 with 5377 known virus-mammal associations). The processes described above were repeated with the training set only, and performance was measured against the held-out test set (15% of all data, n = 407,265 with 954 known virus-mammal associations). Performance metrics obtained through this assessment are reported above and in Supplementary Results 2. Additionally, we performed a complementary test to assess the ability of our model to predict systematically removed virus-mammal associations (Supplementary Results 3).
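A stratified split keeps the proportion of known associations roughly equal in the training and test sets by splitting each class separately. A minimal sketch with a hypothetical function name and toy labels:

```python
import random


def stratified_split(labels, train_frac=0.85, seed=0):
    """Split indices into train and test sets, class by class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train += indices[:cut]
        test += indices[cut:]
    return train, test


# Toy labels: 20 known (1) and 180 unknown (0) associations.
labels = [1] * 20 + [0] * 180
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 170 30
```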

Variable importance

We calculated relative importance (influence or contribution) of viral (Table 1), mammalian (Table 2), and network features (Fig. 4C) to each model in our three perspectives. Due to the selection strategy implemented in our viral and mammalian perspectives, whereby models from 8 different algorithms were selected, we computed the importance of these features using a model-independent filter approach via a ROC curve analysis conducted on each predictor (as implemented in the caret package⁶³).
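The idea behind a ROC-based filter is that each predictor is scored on its own, by how well its values alone separate known from unknown associations. The sketch below illustrates that idea and is not caret's implementation; the feature names, the toy data, and the choice to fold AUC values around 0.5 (so that a strongly negative relationship also counts as important) are assumptions.

```python
def auc(labels, scores):
    """Rank-based AUC; ties between a positive and a negative count half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def roc_importance(rows, labels, names):
    """Per-feature AUC, folded so that 0.5 means uninformative."""
    importance = {}
    for j, name in enumerate(names):
        a = auc(labels, [row[j] for row in rows])
        importance[name] = max(a, 1 - a)
    return importance


# Hypothetical features for four associations (labels: 1 = known).
names = ["feature_a", "feature_b", "feature_c"]
rows = [(1, 5, 0), (2, 4, 1), (3, 3, 0), (4, 2, 1)]
labels = [0, 0, 1, 1]
importance = roc_importance(rows, labels, names)
print(importance)
```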

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Virus-mammal species-level associations were obtained from the ENHanCEd Infectious Diseases Database (EID2). Viral, mammalian and geospatial data were obtained from open-access data sources. These sources are listed in detail in Supplementary Notes 1–3 of the Supplementary Information file, and their DOIs are provided in the Supplementary References. Data used can be found here: https://doi.org/10.6084/m9.figshare.13270304, with the exception of mammalian presence shapefiles and raw climate data (due to their large size) - these data can be obtained from the authors or directly from the sources listed in the Supplementary Information file. Final and intermediate (perspective) predictions of our approach, and predictions obtained using only sequence-evidence are also made available (https://doi.org/10.6084/m9.figshare.13270304).

Code availability

All code used in our analyses is made available via figshare (https://doi.org/10.6084/m9.figshare.13270304).

References

Anthony, S. J. et al. A strategy to estimate unknown viral diversity in mammals. MBio 4, e00598–00513 (2013).
Weaver, S. C. & Barrett, A. D. T. Transmission cycles, host range, evolution and emergence of arboviral disease. Nat. Rev. Microbiol. 2, 789–801 (2004).
Mollentze, N., Biek, R. & Streicker, D. G. The role of viral evolution in rabies host shifts and emergence. Curr. Opin. Virol. 8, 68–72 (2014).
Olival, K. J. et al. Host and viral traits predict zoonotic spillover from mammals. Nature 546, 646–650 (2017).
Wang, L. F. & Eaton, B. T. Bats, civets and the emergence of SARS. Curr. Top. Microbiol. Immunol. 315, 325–344 (2007).
El-Kafrawy, S. A. et al. Enzootic patterns of Middle East respiratory syndrome coronavirus in imported African and local Arabian dromedary camels: a prospective genomic study. Lancet Planet. Heal 3, e521–e528 (2019).
Lam, T. T. Y. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature 1–6, https://doi.org/10.1038/s41586-020-2169-0 (2020).
Kreuder Johnson, C. et al. Spillover and pandemic properties of zoonotic viruses with high host plasticity. Sci. Rep. 5, 14830 (2015).
Babayan, S. A., Orton, R. J. & Streicker, D. G. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science 362, 577–580 (2018).
Wardeh, M., Risley, C., Mcintyre, M. K., Setzkorn, C. & Baylis, M. Database of host-pathogen and related species interactions, and their global distribution. Sci. Data 2, 150049, https://doi.org/10.1038/sdata.2015.49 (2015).
Gao, W.-H. et al. Newly identified viral genomes in pangolins with fatal disease. Virus Evol. 6, veaa020 (2020).
Wells, K., Morand, S., Wardeh, M. & Baylis, M. Distinct spread of DNA and RNA viruses among mammals amid prominent role of domestic species. Glob. Ecol. Biogeogr. geb.13045, https://doi.org/10.1111/geb.13045 (2019).
Wardeh, M., Sharkey, K. J. & Baylis, M. Integration of shared-pathogen networks and machine learning reveals the key aspects of zoonoses and predicts mammalian reservoirs. Proc. R. Soc. B Biol. Sci. 287, 20192882 (2020).
Luis, A. D. et al. A comparison of bats and rodents as reservoirs of zoonotic viruses: are bats special? Proc. R. Soc. B Biol. Sci. 280, 20122753–20122753 (2013).
Bogich, T. L. et al. Using network theory to identify the causes of disease outbreaks of unknown origin. J. R. Soc. Interface 10, 20120904 (2013).
Elmasri, M., Farrell, M. J., Davies, T. J. & Stephens, D. A. A hierarchical bayesian model for predicting ecological interactions using scaled evolutionary relationships. Ann. Appl. Stat. 14, 221–240 (2020).
Farrell, M., Elmasri, M., Stephens, D. & Davies, T. J. Predicting missing links in global host-parasite networks. bioRxiv https://doi.org/10.1101/2020.02.25.965046 (2020).
Dallas, T., Park, A. W. & Drake, J. M. Predicting cryptic links in host-parasite networks. PLOS Comput. Biol. 13, e1005557 (2017).
Carlson, C. J., Zipfel, C. M., Garnier, R. & Bansal, S. Global estimates of mammalian viral diversity accounting for host sharing. Nat. Ecol. Evol. 3, 1070–1075 (2019).
Becker, D. et al. Predicting wildlife hosts of betacoronaviruses for SARS-CoV-2 sampling prioritization. bioRxiv https://doi.org/10.1101/2020.05.22.111344 (2020).
Abuoda, G., Morales, G. D. F. & Aboulnaga, A. Link prediction via higher-order motif features. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science. (eds Brefeld, U. et al.) Vol. 11906 (2020).
Milo, R. et al. Network motifs: simple building blocks of complex networks. Science 298, 824–827 (2002).
Milo, R. et al. Superfamilies of evolved and designed networks. Science 303, 1538–1542 (2004).
Stone, L., Simberloff, D. & Artzy-Randrup, Y. Network motifs and their origins. PLoS Comput. Biol. 15, 1–7 (2019).
Prill, R. J., Iglesias, P. A. & Levchenko, A. Dynamic properties of network motifs contribute to biological network organization. PLoS Biol. 3, 1881–1892 (2005).
Wolf, D. M. & Arkin, A. P. Motifs, modules and games in bacteria. Curr. Opin. Microbiol. 6, 125–134 (2003).
Simmons, B. I. et al. Motifs in bipartite ecological networks: uncovering indirect interactions. Oikos 128, 154–170 (2019).
Bascompte, J. & Melián, C. J. Simple trophic modules for complex food webs. Ecology 86, 2868–2873 (2005).
Chadès, I. et al. General rules for managing and surveying networks of pests, diseases, and endangered species. Proc. Natl Acad. Sci. USA 108, 8323–8328 (2011).
Albery, G. F., Eskew, E. A., Ross, N. & Olival, K. J. Predicting the global mammalian viral sharing network using phylogeography. Nat. Commun. 11, 1–9 (2020).
Cui, J. et al. Evolutionary relationships between bat coronaviruses and their hosts. Emerg. Infect. Dis. 13, 1526–1532 (2007).
Klein, S. L. & Calisher, C. H. Emergence and persistence of hantaviruses. Curr. Top. Microbiol. Immunol. 315, 217–252 (2007).
Han, B. A., Schmidt, J. P., Bowden, S. E. & Drake, J. M. Rodent reservoirs of future zoonotic diseases. Proc. Natl Acad. Sci. USA 112, 7039–7044 (2015).
Bourhy, H., Cowley, J. A., Larrous, F., Holmes, E. C. & Walker, P. J. Phylogenetic relationships among rhabdoviruses inferred using the L polymerase gene. J. Gen. Virol. 86, 2849–2858 (2005).
Banyard, A. C., Evans, J. S., Luo, T. R. & Fooks, A. R. Lyssaviruses and bats: emergence and zoonotic threat. Viruses 6, 2974–2990 (2014).
Richt, J. A. et al. Borna disease virus infection in animals and humans. Emerg. Infect. Dis. 3, 343–352 (1997).
Dennehy, P. H. Rotavirus infection: a disease of the past? Infect. Dis. Clin. North Am. 29, 617–635 (2015).
Wiethoelter, A. K., Beltrán-Alcrudo, D., Kock, R. & Mor, S. M. Global trends in infectious diseases at the wildlife-livestock interface. Proc. Natl Acad. Sci. USA 112, 9662–9667 (2015).
Dutilh, B. E., Reyes, A., Hall, R. J. & Whiteson, K. L. Editorial: virus discovery by metagenomics: the (Im)possibilities. Front. Microbiol. 8, 1710 (2017).
Cressler, C. E., McLeod, D. V., Rozins, C., Van Den Hoogen, J. & Day, T. The adaptive evolution of virulence: a review of theoretical predictions and empirical tests. Parasitology 143, 915–930 (2016).
Whitfield, Z. J. et al. Species-specific evolution of Ebola virus during replication in human and bat cells. Cell Rep. 32, 108028 (2020).
Shi, M., Zhang, Y. Z. & Holmes, E. C. Meta-transcriptomics and the evolutionary biology of RNA viruses. Virus Res. 243, 83–90 (2018).
Han, B. A. et al. Undiscovered bat hosts of filoviruses. PLoS Negl. Trop. Dis. 10, e0004815 (2016).
Pandit, P. S. et al. Predicting wildlife reservoirs and global vulnerability to zoonotic Flaviviruses. Nat. Commun. 9, 5425 (2018).
Altizer, S., Bartel, R. & Han, B. A. Animal migration and infectious disease risk. Science 331, 296–302 (2011).
Karesh, W. B., Cook, R. A., Bennett, E. L. & Newcomb, J. Wildlife trade and global disease emergence. Emerg. Infect. Dis. 11, 1000–1002 (2005).
Fèvre, E. M., Bronsvoort, B. M. D. C., Hamilton, K. A. & Cleaveland, S. Animal movements and the spread of infectious diseases. Trends Microbiol. 14, 125–131 (2006).
Olival, K. J. et al. Possibility for reverse zoonotic transmission of SARS-CoV-2 to free-ranging wildlife: a case study of bats. PLoS Pathog. 16, e1008758 (2020).
Wardeh, M., Baylis, M. & Blagrove, M. S. C. Predicting mammalian hosts in which novel coronaviruses can be generated. Nat. Commun. 12, 1–12 (2021).
Allen, T. et al. Global hotspots and correlates of emerging zoonotic diseases. Nat. Commun. 8, 1124 (2017).
Han, B. A., Schmidt, J. P., Bowden, S. E. & Drake, J. M. Rodent reservoirs of future zoonotic diseases. Proc. Natl Acad. Sci. USA 112, 7039–7044 (2015).
Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).
Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information. GenBank. https://www.ncbi.nlm.nih.gov/nucleotide/ (1982).
Bethesda (MD): National Library of Medicine (US). PubMed. https://www.ncbi.nlm.nih.gov/pubmed (1946).
Federhen, S. The NCBI taxonomy database. Nucleic Acids Res. 40, D136–D143 (2012).
Ishida, N. Laboratory diagnosis of virus diseases. Boei. Eisei. 9, 330–333 (1962).
Maggi, R. G. et al. Comparison of serological and molecular panels for diagnosis of vector-borne diseases in dogs. Parasites Vectors 7, 127 (2014).
Smeele, Z. E., Ainley, D. G. & Varsani, A. Viruses associated with Antarctic wildlife: From serology based detection to identification of genomes using high throughput sequencing. Virus Res. 243, 91–105 (2018).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intel. Res. 16 https://arxiv.org/pdf/1106.1813.pdf (2002).
Agrawal, A. & Menzies, T. Is “better data” better than “better data miners”?: on the benefits of tuning SMOTE for defect prediction. 12 https://doi.org/10.1145/3180155.3180197.
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D. & Fernández-Delgado, A. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15, http://www.mathworks.es/products/neural-network (2014).
Tantithamthavorn, C., Hassan, A. E. & Matsumoto, K. The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. IEEE Transactions on Software Engineering 46, 1200–1219 (2020).
Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 28, 1–26 (2008).
Kuhn, M. Futility analysis in the cross-validation of machine learning models. arXiv https://arxiv.org/abs/1405.6974 (2014).
Sanjuán, R. et al. Viral mutation rates. J. Virol. 84, 9733–9748 (2010).
Coffin, J. M. Structure and classification of retroviruses. In The Retroviridae 19–49 (Springer US, 1992). https://doi.org/10.1007/978-1-4615-3372-6_2.
Nisole, S. & Saïb, A. Early steps of retrovirus replicative cycle. Retrovirology 1, 9 (2004).
Wawrzyniak, P., Plucienniczak, G. & Bartosik, D. The different faces of rolling-circle replication and its multifunctional initiator proteins. Front. Microbiol. 8, 2353 (2017).
Lin, X. et al. Order and disorder control the functional rearrangement of influenza hemagglutinin. Proc. Natl Acad. Sci. USA 111, 12049–12054 (2014).
Rey, F. A. & Lok, S. M. Common features of enveloped viruses and implications for immunogen design for next-generation vaccines. Cell 172, 1319–1334 (2018).
Yakovchuk, P., Protozanova, E. & Frank-Kamenetskii, M. D. Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res. 34, 564–574 (2006).
Komarova, N. L. Viral reproductive strategies: how can lytic viruses be evolutionarily competitive? J. Theor. Biol. 249, 766–784 (2007).
Guth, S., Visher, E., Boots, M. & Brook, C. E. Host phylogenetic distance drives trends in virus virulence and transmissibility across the animal–human interface. Philos. Trans. R. Soc. B Biol. Sci. 374, 20190296 (2019).
Longdon, B., Brockhurst, M. A., Russell, C. A., Welch, J. J. & Jiggins, F. M. The evolution and genetics of virus host shifts. PLoS Pathog. 10, e1004395 (2014).
Park, A. W. et al. Characterizing the phylogenetic specialism–generalism spectrum of mammal parasites. Proc. R. Soc. B Biol. Sci. 285, 20172613 (2018).
Davies, T. J. & Pedersen, A. B. Phylogeny and geography predict pathogen community similarity in wild primates and humans. Proc. R. Soc. B Biol. Sci. 275, 1695–1701 (2008).
Gower, J. C. A general coefficient of similarity and some of its properties. Biometrics 27, 857 (1971).
Pavoine, S., Vallet, J., Dufour, A.-B., Gachet, S. & Daniel, H. On the challenge of treating various types of variables: application for improving the measurement of functional diversity. Oikos 118, 391–402 (2009).
Hay, S. I. et al. Global mapping of infectious disease. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 368, 20120250 (2013).
Anyamba, A. et al. Global disease outbreaks associated with the 2015–2016 El Niño Event. Sci. Rep. 9, 1930 (2019).
Hassell, J. M., Begon, M., Ward, M. J. & Fèvre, E. M. Urbanization and disease emergence: dynamics at the wildlife-livestock-human interface. Trends Ecol. Evol. 32, 55–67 (2017).

Acknowledgements

MW acknowledges support from BBSRC and MRC for the National Productivity Investment Fund (NPIF) fellowship (MR/R024898/1). Establishment of the EID2 database was funded by a UK Research Council Grant (NE/G002827/1) to MB, as part of an ERANET Environmental Health award to MB; subsequently, it has been further developed and maintained by BBSRC Tools and Resources Development Fund awards (BB/K003798/1; BB/N02320X/1) to MB, and the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Emerging and Zoonotic Infections at the University of Liverpool in partnership with Public Health England and Liverpool School of Tropical Medicine.

Author information

Authors and Affiliations

Department of Livestock and One Health, Institute of Infection, Veterinary & Ecological Sciences, University of Liverpool, Liverpool, UK
Maya Wardeh & Matthew Baylis
Department of Mathematical Sciences, University of Liverpool, Liverpool, UK
Maya Wardeh & Kieran J. Sharkey
Department of Evolution, Ecology and Behaviour, Institute of Infection, Veterinary & Ecological Sciences, University of Liverpool, Liverpool, UK
Marcus S. C. Blagrove
Health Protection Research Unit in Emerging and Zoonotic Infections, University of Liverpool, Liverpool, UK
Matthew Baylis

Authors

Maya Wardeh
Marcus S. C. Blagrove
Kieran J. Sharkey
Matthew Baylis

Contributions

M.W. compiled the data, and designed and produced the analyses. All authors contributed to the study design. M.W. and M.B. established the EID2 database. M.S.C.B. and K.J.S. provided viral and network expertise. All authors contributed equally to the writing of the manuscript.

Corresponding author

Correspondence to Maya Wardeh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Christine Johnson, Pranav Pandit and the other, anonymous, reviewer for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Supplementary Data 5

Supplementary Data 6

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wardeh, M., Blagrove, M.S.C., Sharkey, K.J. et al. Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nat Commun 12, 3954 (2021). https://doi.org/10.1038/s41467-021-24085-w

Received: 08 June 2020
Accepted: 21 May 2021
Published: 25 June 2021
DOI: https://doi.org/10.1038/s41467-021-24085-w
