Viral outbreaks involve destabilized evolutionary networks: evidence from Ebola, Influenza and Zika

Recent history has provided us with one pandemic (Influenza A/H1N1) and two severe viral outbreaks (Ebola and Zika). In all three cases, post-hoc analyses have given us deep insights into what triggered these outbreaks, their timing, evolutionary dynamics, and phylogeography, but the genomic characteristics of outbreak viruses are still unclear. To address this outstanding question, we searched for a common denominator between these recent outbreaks, positing that the genome of outbreak viruses is in an unstable evolutionary state, while that of non-outbreak viruses is stabilized by a network of correlated substitutions. Here, we show that during regular epidemics, viral genomes are indeed stabilized by a dense network of weakly correlated sites, and that these networks disappear during pandemics and outbreaks when rates of evolution increase transiently. Post-pandemic, these evolutionary networks are progressively re-established. We finally show that destabilization is not caused by substitutions targeting epitopes, but more likely by changes in the environment sensu lato. Our results prompt for a new interpretation of pandemics as being associated with evolutionary destabilized viruses.

be derived by summarizing a network with different statistics, such as the number of connections that a particular node has (its degree), or the shortest distance between each pair of nodes (the average path length).
In order to contrast the evolutionary dynamics of pandemic versus non-pandemic viruses, we here used these statistics to assess the stability of these networks of amino acids that evolve in a correlated manner. We predicted that viral evolutionary dynamics are weakened during a pandemic. As these dynamics often lead to complex networks of interactions 9,11 , we more specifically tested how the structure of these correlation networks is affected during an outbreak. We show that during a pandemic, the evolutionary dynamics of viral genes are severely disrupted, but also that they are progressively restored after the pandemic.

Results
Networks of correlated sites are destabilized during outbreaks. In search for evolutionary differences between regular epidemics and severe outbreaks, we first contrasted the glycoprotein precursor (GP) sequences of the Ebola virus that circulated before and during the 2014/2016 outbreak. For this, we identified with a Bayesian graphical model 16 the pairs of nucleotides that show evidence for correlated evolution in each data set, before and during the outbreak. As in previous work 9, 11 , we found that these pairs of sites form a network. A first inspection of these networks of correlated sites revealed a striking difference between pre-2014 and outbreak sequences: in particular at weak correlations, the pre-2014 interaction networks are very dense and involve most sites of GP, while only a small number of sites are interacting in outbreak viruses (Fig. 1). Furthermore, at increasing correlation strengths, outbreak networks become completely disconnected faster: at posterior probability Pr = 0.80 some sites still interact in pre-2014 proteins, while all interactions have disappeared from Pr = 0.60 in outbreak proteins (Fig. 1 Destabilization affects weakly correlated sites. To further investigate this destabilization hypothesis, we analyzed the structure of these networks with the tools of social network analysis and graph theory 12 . Again, we found a consistent pattern when contrasting regular and outbreak viruses: at weak to moderate interactions (Pr ≤ 0.50), outbreak viruses have networks of smaller diameter, shorter path length, and reduced eccentricity ( Fig. 2a-c, columns 1-5). All these patterns point to fewer connected sites in outbreak viruses. Betweenness is smaller for outbreak viruses (except Ebola), and transitivity tends to be larger (except Zika). These last two measures also suggest that interactions among sites are weakened in outbreak viruses. Other networks statistics failed to show a clear pattern (Fig. S6): in particular, there were no clear differences in terms of degree, centrality or homophily -all properties that are not directly related to network stability.
Post-outbreak re-stabilization. Should these weak interactions play a critical role in the stabilization of viruses outside of pandemics, we would expect to observe the strengthening of all network statistics as years go by after the pandemic. To test this prediction and estimate how long this re-stabilization process can take, we analyzed in a similar way all influenza seasons in the Northern hemisphere following the 2009 pandemic (until 2015-16). Consistent with our prediction, both HA and NA genes show a gradual transition between a typical pandemic state to a regular state in two-to-three seasons (Fig. 2, column 5-6, respectively).

Non-genetic sources of destabilization.
To understand what the potential sources of this destabilization are, we assessed the involvement of viral antigenic determinants/epitopes. Should mutations accumulating in such epitopes be responsible for destabilization, we would expect (i) that weak interactions in non-pandemic viruses involve mostly epitopes, and (ii) that pandemics be associated with the disappearance of these interactions at epitopes first. Figure 3 shows no evidence supporting this hypothesis (X 2 = 0.0663, df = 1, P = 0.7967): non-pandemic viruses show a small number of predicted epitopes in their interaction network, that do not act as central hubs of these networks, while pandemic viruses may actually show an enrichment in interacting epitopes. This suggests that non-genetic factors are likely responsible for the initial destabilization of the genome of pandemic viruses. Changes in their ecology/environment (vector) cannot be ruled out.

Discussion
To understand how evolutionary dynamics are affected during a viral outbreak, we compared non-outbreak and outbreak viruses. Based on the hypothesis that non-outbreak viruses are in a stable evolutionary equilibrium, and that such a stability is mediated by correlated evolution among pairs of sites in viral genes, we reconstructed the coevolution patterns in genes of non-outbreak and outbreak viruses. In line with our prediction, we found that outbreak viruses exhibit fewer coevolving sites than their non-outbreak counterparts, and that these interactions are gradually restored after the outbreak, at least in the case of the Influenza (2009 H1N1) virus for both HA and NA.
Two independent lines of evidence are consistent with our destabilization hypothesis. First, all three viruses showed temporary increases in their rate of molecular evolution during each outbreak 1-3 ; such increases can be expected to disrupt the coevolutionary structure, and hence, destabilize viral genomes. We showed that epitopes were not particular targets of this mutational process. This can be expected, as mutations (i) most likely affect sites randomly, and (ii) are quickly lost from the viral population. Second, a probable cause of the epidemics can be identified in all cases studied here. For Influenza, the 2009 pandemic was caused by a series of reassortment events that affected the two genes studied here, HA (triple-reassortant swine) and NA (Eurasian avian-like swine) 1 . Such exchanges of segments can very well destabilize the evolutionary dynamics, at least of the implicated segments. Similarly, a zoonotic event was implicated in the Ebola outbreak 2 , and a change of continent in the case of Zika 3,17,18 . These corresponding changes of environment (sensu lato) might have triggered the destabilizations observed here. In addition to such environmental changes, it is very likely that destabilization reflects a complex interaction between the genetics of viruses, their demographic fluctuations and environmental changes.
This argument is further supported by recent work in physics, where it was shown that dense networks are more resilient, i.e. resistant to small perturbations, than sparser ones 14 . Moreover, in their simplest example, these authors modeled abundances in a community of mutualistic species, where the mutualistic term describes the pairs of interacting species; perturbations were then applied to the system to assess resilience. They showed that small perturbations did not affect average abundances, which remained high -their 'desirable' state. However, above a particular perturbation threshold, a bifurcation occured and a new 'undesirable' state, at low abundances, was reached. Our results are consistent with a similar system behavior, where the network of correlated amino acids is resilient to perturbations up to a certain point, when a bifurcation to an 'undesirable' state (associated with the pandemic) occurs, and the system returns to its resilient state post-pandemic. One major difference though is that we observed a progressive return to stability in the case of influenza, while the resilience model suggests a second bifurcation, i.e. an instantaneous change, to the 'desirable' state 14 .
One outstanding question is about the importance of weak patterns of coevolution within a gene: how can it be explained that it is essentially weak correlations (around Pr = 0.25) that distinguish non-outbreak from outbreak viruses? In a recent study on mice, four phenotypes were quantitatively analyzed following large intercrosses, and linear regressions on pairs of quantitative trait loci were used to detect non-additive effects, i.e., epistasis; it was then shown that most epistatic interactions were weak and, critically, tended to stabilize phenotypes towards the mean of the population 19 . Viruses are not mice, and not all the correlations that we detected are involved in epistatic interactions, but both this work in mice and the evidence presented here go in the same direction: weak interactions have a stabilizing effect on viral genes and their phenotype (regular epidemics). It is further possible that the intricate nature of these weak correlation networks has higher-order effects 19 , that in turn increase canalization and hence may help viruses weather modest environmental and genotypic fluctuations 20 . The elimination of these many weak interactions has a destabilizing effect that may be caused by or lead to outbreaks. Our findings call for a new interpretation of pandemics that, from an evolutionary point of view, appeared to be associated with unhealthy or diseased viruses. While the evidence shown here does not support the causal nature of this relationship, monitoring correlation networks could help forecast imminent outbreaks.

Methods
Sequence retrieval. Nucleotide sequences were retrieved for three viruses: Influenza A, Ebola, and Zika, for select protein-coding genes, chosen because they represent the most sequenced / studied genes for each of these viruses 11,[21][22][23] . All sequences were downloaded in May 2016 (Table S1).
Full-length Influenza A sequences were retrieved directly from the Influenza Virus Resource 24 . Only H1N1 sequences circulating in humans for the hemagglutinin (HA) and neuraminidase (NA) genes were downloaded. These two genes are also very commonly studied and largely sampled in public databases 22,23 . Two types of data sets were constructed: one containing pandemic and non-pandemic sequences circulating in 2009, the pandemic year, and one containing pandemic sequences circulating from August 1 to July 31 of each season in the Northern temperate region between 2009/2010 and 2015/2016 (seven seasons in total). Only unique sequences were retrieved.
For Ebola, the virion spike glycoprotein precursor, GP, was retrieved because of its key role in the emergence of the 2014 outbreak showing evidence for both correlated and adaptive evolution 11 as follows. A GP sequence (KX121421) was drawn at random from the 2014 strain used previously 11 and was employed as a query for a BLASTn search 25 at the National Center for Biotechnology Information. A conservative E-value threshold of 0 (E < 10 −500 ) was used, which led to 1,181 accession numbers. As most of these accession numbers correspond to full genomes, while only GP is of interest, we (i) retrieved all corresponding GenBank files, (ii) extracted coding sequences with ReadSeq 26 of all genes, (iii) concatenated the corresponding FASTA files into a single file, (iv) which was then used to format a sequence database for local BLASTn searches, and (v) used GP from KX121421 in a second round of BLASTn searches (E < 10 −250 , coverage >75%).
In the case of Zika, sequences of 252 complete genomes were retrieved from the Virus Pathogen Resource (www.viprbrc.org). The RNA-dependent RNA polymerase NS5 was specifically extracted by performing local BLASTn searches as described above. It is one of the most studied Zika genes 21,27 , as it is essential for the replication of the virus 27 .

Phylogenetic analyses.
Sequences were all aligned with Muscle 28 with the fastest options (-maxiters 1 -diags). Alignments were visually inspected with AliView 29 to remove rogue sequences and sequencing errors. Phylogenetic trees were inferred by maximum likelihood under the General Time-Reversible model with among-site rate variation 30 with FastTree 31 . As outbreak sequences (Ebola and Zika viruses) cluster away from non-pandemic sequences, we used the subtreeplot() function in APE 32 to retrieve accession numbers of pandemic sequences and hence separate them from non-pandemic sequences with minimal manual input. FastTree was used a second time to estimate phylogenetic trees of the subset alignments, with the same settings as above.
Network analyses of correlated sites. Amino acid positions ("sites") that evolve in a correlated manner were identified with the Bayesian graphical model (BGM) in SpiderMonkey 16 as implemented in HyPhy 33 . Briefly, ancestral mutational paths were first reconstructed under the MG94 × HKY85 substitution model 34 along each branch of the tree estimated above at non-synonymous sites. These reconstructions were recoded as a binary matrix in which each row corresponds to a branch and each column to a site of the alignment. A BGM was then employed to identify which pairs of sites exhibit correlated patterns of substitutions. Each node of the BGM represents a site and the presence of an edge indicates the conditional dependence between two sites. Such dependence was estimated locally by a posterior probability. Based on the chain rule for Bayesian networks, such local posterior distributions were finally used to estimate the full joint posterior distribution 35 . A maximum of two parents per node was assumed to limit the complexity of the BGM. Posterior distributions were estimated with a Markov chain Monte Carlo sampler that was run for 10 5 steps, with a burn-in period of 10,000 steps sampling every 1,000 steps for inference. Analyses were run in duplicate to test for convergence ( Figures S1-S2).
The estimated BGM can be seen as a weighted network of coevolution among sites, where each posterior probability measures the strength of coevolution. Each probability threshold gives rise to a network whose topology can be analyzed based on a number of measures 12 borrowed from social network analysis and graph theory. We focused in particular on six statistics: average diameter, the length of the longest path between pairs of nodes; average betweenness, measures the importance of each node in their ability to connect to dense subnetworks; assortative degree, measures the extent to which nodes of similar degree are connected to each other (homophily); eccentricity, is the shortest path linking the most distant nodes in the network; average strength, rather than just count the number of connections of each node (degree), strength sums up the weights of all the adjacent nodes; average path length, measures the shortest distance between each pair of nodes. All measures were computed using the igraph R package ver. 1.0.1 36 . Thresholds of posterior probabilities for correlated evolution ranged from 0.01 (weak) to 0.99 (strong). LOESS regressions were then fitted to the results. Epitope analyses. Epitopes were predicted using the NetCTL 1.2 Server 37 . Briefly, Cytotoxic T lymphocyte (CTL) epitopes are predicted based on a neural network algorithm trained on a database of human MHC class I ligands. Epitopes can be predicted for 12 MHC supertypes (A1, A2, A3, A24, A26, B7, B8, B27, B39, B44, B58, B62), that are broad families of very similar peptides for which independent neural network models have been generated. As such, we ran the epitope prediction for each supertype independently, on non-outbreak and outbreak viruses. Circos plots were generated with the circlize R package ver. 0.3.10 38 . Scripts and sequence alignments used are available from github.com/sarisbro.