Introduction

With the recent approval of the first-ever respiratory syncytial virus (RSV) vaccines and the monoclonal antibody (mAb) nirsevimab for the prevention of RSV in all infants1, our understanding of the global transmission dynamics of RSV becomes increasingly important. An important unsolved question is to what extent RSV epidemics are fuelled by local persistence from a previous epidemic versus that of viral seeding from other geographic areas. A better understanding of the global circulation dynamics and local persistence is crucial for RSV surveillance and prevention.

Viral genetic sequence data may offer valuable information to aid in testing predictors of spread and to empirically develop and validate epidemiological models. A challenge for reconstructing viral spread through space and time from genetic data has been the lack of a systematic and comprehensive global sampling of whole genomes from circulating RSV lineages. Current such sampling efforts include the global multiyear multicentre INFORM-RSV study and the Global RSV Surveillance Programme of the World Health Organisation (WHO). The INFORM-RSV study combines large-scale full genome sequencing and a global coverage over multiple RSV seasons to provide a molecular reference of RSV strains and sequence variability2. The best way of mapping genomic evolutionary dynamics of RSV is by analysing nucleotide substitutions of the complete genome. Previously selective pressure analyses with samples from the 2001–2011 time period showed that RSV genes consist predominantly of negatively selected and neutrally evolving sites. Only the G gene encoding for the surface glycoprotein G stood out in terms of detectable positive selection3. The primary role of the G protein is to attach virions to cell surfaces through interaction with host cell attachment factors1,4. The genetic factors that impact the replacement dynamics remain poorly understood and a full-genome perspective on the adaptive evolution of RSV is needed to reveal which other genomic variations affect the fitness of strains.

While sequencing efforts have been implemented on a large scale for SARS-CoV-2, systematic sequencing of RSV is still at an early, small scale stage. For respiratory viruses such as seasonal SARS-CoV-2 and influenza, human air-based travel (flight) has been shown to be an important driver of global circulation4,5,6,7,8. Air travel may also shape seasonal RSV dynamics. RSV molecular epidemiology data from Kenya showed that several new variants are introduced every epidemic season9,10,11,12,13,14,15. The interspersed nature of sequences from Kilifi and other parts of Kenya indicates a degree of mixing of lineages, which in turn suggests that air travel may be an important driver of spread. However, the global circulation patterns of RSV have remained unexplored. Therefore, we integrated human movement patterns with whole genome sequences from RSV samples that were collected in 17 countries worldwide over three RSV seasons (2017–2020) prior to the COVID-19 pandemic. Here, we show that air travel predicts global RSV spread. Travel restrictions due to COVID-19 have not affected the current analysis.

Results

Circulating genotypes

We obtained 1282 complete RSV genome sequences collected over a period of three years from 17 countries worldwide enroled in the INFORM-RSV study. We complemented these sequences with 1180 publicly available sequences from NCBI GenBank sampled within the same time interval. All RSVA and RSVB genomes in the genotyping datasets cluster among strains that were typed as A23 and B6. For this reason, the genotyping alignments were appended with strains of genotype A22 (RSVA) and B5 (RSVB) that served as outgroups for rooting the maximum likelihood (ML) trees. Applying previously established genotyping criteria show that genotypes A23 and B6, from which the currently circulating strains have evolved, can be reclassified into a set of 25 RSVA and 2 RSVB genotypes (Fig. 1). Variants with a duplication in the G gene have emerged16. These variants appear to have a fitness advantage17 and have started to replace previously circulating strains. This observation is reflected in our data, as 100% of the sequenced RSVA and RSVB isolates carry these duplications.

Fig. 1: Maximum likelihood reconstructions of RSVA (1482 genomes/taxa; 2006–2020) and RSVB (1543 genomes/taxa; 1997–2020) complete genome phylogenies and genotypes identification.
figure 1

Lineages that are not assigned to a genotype are shown in light grey. The SH-aLRT and UFB support values for the genotypes are provided in Supplementary Table S1.

Comparable site-specific diversifying selection in RSVA and RSVB

To identify positively selected sites in the coding genes of the RSV genome, we employ three different methods (FUBAR, MEME, and RC, cfr. Methods) that aim to capture different aspects of site-specific selection and report sites that were identified by at least two of these methods. Using this approach, we identify 28 positively selected amino acid sites in RSVA. Of these, 21 are located in the G protein, one in the F protein, and six in the L protein. Eight of the G protein sites and one L protein site are supported by all three methods. We obtain a similar number (n = 26) and distribution of positively selected sites in RSVB, with 18 sites in the G protein, two in the F protein, and six in the L protein. Eight of the G protein sites and one F protein site are supported by all three methods. Three of the positively selected sites are identified at the exact same amino acid position in the G protein of RSVA and RSVB (amino acid positions 136, 274, 310). However, the amino acid position on the linear protein sequence for RSVA may not necessarily be the same as for RSVB in the protein crystal structure. Substitutions in positions under positive selection are found on different branches of phylogeny, which is consistent with the expectation under diversifying selection (Figure S1 and S2).

Both RSVA and RSVB genealogies are shaped by non-neutral population turnover

RSV evolution may be shaped by selection for variants with higher replicative fitness and variants that evade host immune responses18. The latter is indicated by the site-specific selection analyses that identify the G gene as the major target of diversifying selection3,18. However, earlier testing has found that only RSVB tree shapes inferred from complete genome data deviate from what we expect under neutrality3,18. Now that considerably more complete genome data are available, we revisit the genealogical testing using posterior predictive simulation19. We employ the genealogical Fu and Li statistics as well as a trunk length proportion statistic as tree shape statistics (see Methods). We plot bivariate distributions for these statistics based on the genealogies inferred from the genomic data and the equivalent genealogies simulated under neutrality accommodating for potentially complex histories of population size change (Fig.S3). Both RSVA and RSVB show significant deviations from neutrality, with a more pronounced deviation for RSVB as compared to RSVA.

Global RSV circulation patterns are shaped by human air travel

To explore the factors that shape RSV global circulation, we apply a Bayesian phylogeographic approach that models the movement of virus lineages between a set of discrete locations20. This process is generally parameterised in terms of transition rates for all pairs of locations. Here, we use an extension of the discrete phylogeographic model that parameterises these transition rates as a function of a number of potential predictors5. This generalised linear model (GLM) parameterisation allows estimating the contribution of each predictor to the spatial diffusion as a coefficient (on a log scale). In addition, the model includes boolean indicator variables that determine the in- or exclusion of predictors allowing to estimate their inclusion probability. Here, we report the posterior distribution of the product of the log coefficient and inclusion probability for each predictor; positive estimates indicate a positive association between predictors and diffusion intensity while the opposite is true for negative estimates. As predictors, we consider human air travel, population size, geographic distances, and latitude differences (see Methods). Our analyses consistently support human air travel as a strong predictor of RSV global spread at both the country (strongly positive estimates, Fig. 2) and continental level (Fig. S4) for RSVA and RSVB separately, as well as for a model applied to both RSVA and RSVB data sets combined. The support for air travel is robust to the inclusion of sample sizes as predictors. Other candidate predictors occasionally find support, but not consistently so, suggesting that these other predictors could for example be attributed to sampling variability. For instance, the human population size at the origin location is estimated to have a negative log coefficient for its effect size in the RSVB analyses. This may be explained by the fact that the most populous countries, such as China and India, are represented by only a few genomes that are distributed as singletons in the phylogeny, thereby resulting in an underestimation of their potential role as origin locations in the global circulation dynamics. In fact, these two locations specifically have been shown to be important for persistence and global dissemination of seasonal influenza viruses4. Therefore, better global coverage will be needed to characterise the role of undersampled countries in RSV circulation and how they may relate to demographic characteristics.

Fig. 2: Posterior estimates of time-homogeneous predictor contributions to RSV diffusion between countries.
figure 2

The predictors include the number of passengers travelling by air between each pair of countries represented in the data set (air travel, in dark red), population size at the origin and destination location (pop size ori & pop size dest, in blue), geographic distance (geo distance, in light green), absolute differences in latitude (lat diff, in dark orange) and sample sizes at the origin and destination locations (# taxa ori & # taxa dest, in dark green). The Y-axis represents the product of the coefficient (on a log scale) and the inclusion probability for the predictors (coefficient * Inclusion). (A, B: RSVA. C, D: RSVB. The plots on the left and right distinguish between analyses without and with sample size predictors respectively. E and F summarise the estimates for a single GLM-diffusion model applied to the combined RSVA and RSVB data sets at the country level. The grey boxes in the violin plots represent the median and quantile estimates. Violin plots are based on n = 507 (A), n = 535 (B), n = 45002 (C), n = 45002 (D), n = 452 (E) and n = 452 (F) post-burnin samples from the respective MCMC chains. Source data are provided as a Source Data file.

While the phylogeographic data sets include genomes sampled between 2012 and 2020, the INFORM-RSV study contributes to the most recent years (2017–2020) of sampling. To determine how these data contribute to predictor support, we also apply a time-inhomogeneous GLM-diffusion model distinguishing between the five most recent years and the 5-year time period before that (Fig. S5). This illustrates that the support for air travel is consistently found for the recent time period whereas this is less convincing or less consistent across analyses in the earlier time period. This demonstrates how systematic global sampling contributes to the opportunity to identify meaningful patterns of RSV spatial spread.

Phylogeographic reconstructions indicate extensive geographic mixing

RSV spread by air travel offers the opportunity for substantial geographic mixing of viral lineages between locations. To assess geographic mixing, we use recently proposed entropy-based phylogeographic summaries for the genome sampling in the most recent pre-pandemic INFORM-RSV season (2019–2020). Specifically, we summarise normalised entropy measures or the phylogeographic clustering by country, reflecting the degree of phylogenetic interspersion of country-specific lineages (Fig. 3), and the number of unique lineages associated with each country circulating at the start of the most recent RSV season (see Supplementary Files S1 and S2 for the MCC summary trees from the evolutionary reconstructions underlying these inferences). Some countries have different results for RSVA versus RSVB, which could be explained by the fact that whether a lineage grows to be a persistent one is a stochastic event even if particular countries would be more prone to persistent circulation. This normalised entropy ranges between 0, reflecting no intermixing of viruses from different countries, and 1, reflecting a clustering that is randomised with respect to country of sampling.

Fig. 3: Posterior estimates of the normalised entropy for RSVA and RSVB phylogeographic clustering by country during the most recent season (2019–2020) of INFORM-RSV sampling.
figure 3

The normalised entropy ranges between 0 (~no mixing of lineages by country) and 1 (~random mixing with respect to country). Circles and error bars refer to the mean and 95% Highest Posterior Density (HPD) interval of the normalised entropy estimates respectively. The size of the circles is proportional to what fraction of the highest mean estimate each average estimate represents. The same is indicated by the colours of the circles, which range from blue for an average estimate that represents 0% of the highest value to bright red for the highest mean estimate. Entropy estimates are based on n = 901 post-burnin samples from the stationary MCMC chain. Source data are provided as a Source Data file.

For RSVA, we infer relatively high entropy estimates, with 13 out of 15 estimates above 0.8. For the Netherlands for example, we estimate entropy of 0.88 [95% highest posterior density interval (HPD) 0.82,0.94] and 10 [95% HPD 8,12] unique lineages circulating at the start of the most recent season (2019–2020), which together are represented by 23 sampled genomes in the final season. With an entropy estimate of 0.33 [95% HPD 0.28,0.38], South Africa appears to be an exception to the pattern of relatively extensive mixing. While we estimate a substantial number of unique South African lineages at the start of the final season (26 [95% HPD 21,30]), there is also a substantial degree of clustering of the 58 genomes sampled from that season, with 50 out of 58 samples belonging to a large South African cluster including also samples from the previous season (Fig.S6). Similarly high entropies are estimated for RSVB in most countries. While two more mean estimates fall below 0.8, their credible intervals are broad. Although the mean entropy estimate for South Africa is also <0.8 for RSVB, the deviation from countries with high entropy values is far more limited. Overall, these estimates suggest a substantial global geographic mixing of both RSVA and RSVB.

Discussion

Optimised surveillance and prevention of RSV infection at a global scale relies on our understanding of its spread. Here, we combine existing RSV genomic data and new full genomes from a systematic global sampling effort with empirical data on human mobility, demography and a proxy for synchronicity of RSV seasonality to evaluate which factors shape global RSV circulation. We show that air travel predicts global RSV spread, similar to what has been demonstrated for influenza H3N25,8, influenza H1N14, and recently SARS-CoV-26. Additional sampling efforts (including those within the framework of the ongoing INFORM-RSV study) are expected to generate more densely sampled genomic data. This will increase the resolution of phylogeographic reconstructions and it will likely allow testing predictors at other spatial scales where other forms of mobility could also shape RSV circulation. Understanding RSV spread is also important in the light of monitoring for escape mutations to emerging prophylactic approaches to RSV, as our findings show these have the potential to spread rapidly on a global scale.

Human air travel increases the likelihood of infectious diseases spreading rapidly between countries and continents21. We speculate that air traffic could be a mechanism of RSV transmission. It is still unclear how patients acquire viral respiratory disease in the context of air travel, and the prevalence of RSV in airplane passengers has not been studied. Previous research showed that almost one-half of all patients with clinical symptoms upon travel turn were infected with respiratory viruses22,23. Other evidence suggests that SARS-CoV-2 is transmitted during air travel24,25. Global concerns such as the emergence of Ebola Virus Disease in West Africa26 and novel SARS-CoV-2 variants27 have already led to a number of protocols implemented at airports of departure or arrival (e.g. testing, genomic surveillance, quarantines, etc.). As global connectivity has increased, so has the potential for RSV to spread across countries. Before the COVID-19 pandemic, over four billion passengers travelled by airplane annually and this number is likely to double by 2036. We expect the main mechanism of global spread to be spread at the country of arrival, mostly due to travellers infected in the community and bringing the infection from a seeding area where the epidemic is ongoing to the destination country. We show that seasonal RSV epidemics are likely fuelled by many independent introductions. However, the exact source locations cannot be identified with our data.

Our reconstructions provide some evidence of local RSV persistence in South Africa. These data build on earlier evidence of clustering and strong selective pressure for both RSVA and RSVB in South Africa28. RSV clustering in South Africa resembles data on influenza A which persisted in West Africa for almost two years29. Extensive spatial mixing of influenza A by air travel was observed in West Africa, perhaps because of its relatively lower connection within the global air transportation network. The climatic variability may also have contributed to the influenza persistence generating temporal overlap among epidemics29.

Currently, several genotype definitions are used in parallel and there is no universal approach to classify virus genetic diversity30. Therefore, genotyping based on complete genome sequences, instead of genotyping based on nucleotide sequence variability of subgenomic regions (mostly the G gene), can improve the RSV surveillance field by providing a more coherent classification. By focusing on active virus lineages and those spreading to new locations, this universal nomenclature would assist in tracking and understanding the patterns and determinants of the global spread of RSV. For SARS-CoV-2, a similarly proposed nomenclature represents an important asset to the field31. We hope that our study will motivate large-scale implementation of whole genome sequencing for RSV surveillance.

Site-specific selection analyses identified the G gene as the main target of diversifying selection. When compared to influenza with its ladder-like phylogeny and strong turnover, positive selection for RSV is less strong. Our results confirmed that the RSV genome is largely conserved, with the exception of the highly variable G gene. We have identified different positions under selective pressure for RSV A and B reporting on positive selection on the L gene at amino acid position 146, 624, 1725, 1748, 2111, and 2113 for RSVA and 560, 1712, 1718, 1719, 1759 and 2019 for RSVB, which may represent epitopes under pressure of adaptive immunity32. Immunological studies are required to confirm adaptive immune responses are developed during RSV infections against these epitopes on the L gene.

Strengths of this study are the sample size, the use of complete genomes, and a broad geographic coverage over a period of many years. Another strength is that our study only included prepandemic RSV sequences and mobility data, as COVID-19 drastically impacted human air travel. An important limitation of our study is lack of data from most of the African continent, as well as from specific large countries including China and India. Additionally, the sample size within countries was too small to explain short-distance spread of RSV. Broader and denser coverage is likely to reveal additional predictors at different scales of transmission.

RSV research and therapeutics are rapidly advancing with the recent approval of nirsevimab and two vaccine for older adults, which might be shortly followed by the approval of a maternal vaccine1. Surveillance of RSV may be particularly important in the wake of these vaccines, given the potential for increased immunologic pressure on RSV F. The integration of epidemiological and phylogenetic approaches has received great attention for other viruses because of its potential to uncover mechanisms of pathogen emergence, evolution, and spread. By capturing the spatial spread of RSV, our reconstructions of spatial evolutionary history shed light on viral persistence and transmission dynamics. We demonstrate that the use of human air travel data together with viral genetic data provides a powerful model to describe global spread of RSV. This work also provides a baseline of RSVA and RSVB genome evolution before the widespread use of immunisation programmes, and the new genome data will constitute a key resource for further extensive research in the field of RSV epidemiology.

Methods

Clinical samples

The INFORM-RSV study is a prospective, multiyear, multicentre, global clinical study enroling children with medically-attended RSV infection under the age of 5 years. Details about the study design and protocol have been previously described2. In summary, RSV positive nasal samples were collected from November 2017 to March 2020 at 18 hospitals in 17 countries globally. Whole genome sequencing was performed at the UMC Utrecht using the Illumina NextSeq 500 platform (details have been published a separate methodology paper2) and annotated with sampling data and country. Whole genome sequences derived from the first three seasons of the INFORM-RSV study are available at GenBank.

Ethical approval and consent

We declare that the planning, conduct, and reporting from this study was in line with the Declaration of Helsinki, as revised in 2013. Informed consent was obtained from parent(s) or legal representative(s) prior to sample collection in accordance with the International Conference on Harmonization Guideline on Good Clinical Practice E6 (ICH-GCP) and applicable national and international regulatory requirements. The INFORM-RSV study has been approved by the ethics committees of all 18 participating sites: The Netherlands: The Medical Research Ethics Committee of the UMC Utrecht (reference number WAG/mb/17/016170); Italy: Ethics Committee for Clinical Testing of the Province of Padova of the Padova Hospital (no. 345 of 27/10/2016); Russia: The Department for Science, Innovation Development and Management of Health and Biological Risks, Ministry of Health of the Russian Federation; Germany: Ethics Committee of the Medical Faculty of the Philipps University Marburg; France: Ethics Committee Southwest and Overseas of the Créteil Intercommunal Hospital Centre (ID-RCB No.: 2018-A02360–55 (file 1– 18–73); Spain: Ethics Committee for Research Santiago-Lugo of the Hospital Centre University of Santiago (registration code 2017/397); South Korea: Medical Research Committee of the Seoul National University Hospital; Finland: Ethics Committee of the Hospital District of Southwest Finland, Turku; Australia: Human Research Ethics Committee of the Perth Children’s Hospital; Brazil: The Research Ethics Committee of the Centro INFANT at Pontificia Universidade Catolica de Rio Grande do Sul (opinion number 2,569,872); Canada: Hamilton Integrated Research Ethics Board of the McMaster University; Canada: Research Ethics Board of the McGill University Health Centre; South Africa: Human Research Ethics Committee of the University of the Witwatersrand Johannesburg (no. M170966); Japan: Research Ethics Committee of the Fukushima Medical University (no. 29212); The United Kingdom: Health Research Authority of the King’s College Hospital (no. 17/EM/0469); Taiwan: Mackay Memorial Hospital Institutional Review Board (no. 19MMHIS171e); Chile: Ethics Committee for Research on Human Subjects of the Faculty of Medicine, University of Chile; Mexico: Ethics Committee of the University Autónoma De Nuevo León, Faculty of Medicine.

Data set compilation

Sequence data on the F protein of RSVA and RSVB from the INFORM-RSV study have previously been published33. However, the current data represent the first whole genome sequences which were complemented with a selection of publicly available RSV sequences downloaded from NCBI GenBank on April 21st 2021. These were first size-selected (only those of length without N >= 10k bases were kept for further analyses, n = 2865/27417 or 10.4%) and typed as RSVA or RSVB. After alignment with MAFFT v.7.47534 and manual verification using AliView v.1.2635, RDP536 was used to clean the RSV A and RSVB alignments from putative recombinant sequences. Next, only sequences with known country of sampling and sampling date known up to the year or more precise were retained for further analyses. The resulting alignment served to obtain a maximum likelihood tree with branch support estimated with the SH-aLRT test37 as implemented in IQtree v.2.1.238. From this tree, a well-supported subtree containing all INFORM-RSV sequences was selected for downstream analyses (Figs. S7 and S8).

Circulating genotypes

We investigated whether the additional genomic diversity from the INFORM-RSV samples warrants a reclassification. For this we adhered to the RSV type-specific patristic distance thresholds suggested by ref. 30 but assess clade support with the computationally more efficient SH-aLRT and UFB branch support tests, and require minimal support values of 80 (SH-aLRT) and 90 (UFB). The criteria for genotype delineation put forward by ref. 30 involve a patristic distance and a clade support threshold. This definition implies that genotypes form monophyletic clades in which a limited number of genetic differences has accrued. It can therefore be anticipated that, as evolution continues, a clade that was formerly classified as a single genotype can diversify into a set of new genotypes.

TempEst v.1.5.339 was used to identify sequences that represented outliers in a regression of root-to-tip divergence as a function of sampling time. To this end, an operational definition of outliers was used: outliers were defined as sequences for which the residual of the regression of root-to-tip genetic distance against sampling time falls outside the 99% credible interval of residuals, which was derived using the CODA R package40,41. 13 outliers were removed from the RSVA and RSVB data sets. This increased the correlation between the root-to-tip distance and sampling time from 0.94 to 0.95 for RSVA and from 0.79 to 0.83 for RSVB. Likewise, the R2 of the regression increased from 0.89 to 0.91 for RSVA and from 0.63 to 0.70 for RSVB. The resulting data sets, with 1213 taxa for RSVA and 1223 taxa for RSVB, were used for phylogeographic reconstruction and genotype classification30. For the latter, a maximum likelihood tree was estimated using IQtree38 with ModelFinder42 and branch support was evaluated with the SH-aLRT and ultra-fast bootstrapping (UFB) procedures. Genotypes were called using an in-house developed R41 script that capitalises on several packages (treeio, phytools, geiger).

A down-sampled data set was created for site-specific selective pressure analyses. For this, within-country transmission networks were downsized to a randomly chosen taxon according to a two-step procedure. First, within-country transmission networks were identified as clades with perfect SH-aLRT support for which all taxa were from the same country43 based on a midpoint rooted maximum likelihood tree (obtained with IQtree v.2.1.238) from the phylogeo-datasets. Next, this reduced data set was used for estimating time-calibrated evolutionary histories with the Bayesian Evolutionary Analysis by Sampling Trees software (BEAST v1.10)44 along with the high-performance BEAGLE v.3.2.0 library for computational efficiency45. The RSVA and RSVB data sets were equipped with the same evolutionary models. To capture the nucleotide substitution process while allowing for differences between the coding and non-coding genome regions, a General Time Reversible (GTR) model with Γ-distributed among site rate variation46,47 was specified for either region. The estimated rate of evolution was informed by the amount of evolution that accrued over the sampling time differences, and the rate was allowed to vary among lineages through a relaxed clock model with lognormally distributed branch rates48. The demographic history was modelled with the flexible skygrid tree prior49 with changes in the relative genetic diversity over time allowed at 6-month intervals between January 1st 2020 and January 1st 2005. Within country transmission chains were now identified as clades of taxa from the same country with perfect posterior support.

Phylogeographic inference

Time-calibrated evolutionary histories were estimated from the phylogeography data sets using the Bayesian Evolutionary Analysis by Sampling Trees software (BEAST v1.10)44 along with the high-performance BEAGLE v.3.2.0 library for computational efficiency45. The same models as for identifying within-country transmission networks (see above) were specified. Mixing and convergence properties of the Markov Chain Monte Carlo simulation were inspected using Tracer v1.750. Maximum Clade Credibility (MCC) summary trees were obtained with TreeAnnotator (distributed with BEAST v.1.10) and visualised in FigTree v.1.451. Continuous parameter estimates are summarised as means and 95% highest posterior density intervals (95% HPDs).

Generalised linear mixed model

To test for predictors of the global spatial diffusion process, we applied a generalised linear model (GLM) parameterisation of the discrete phylogeographic model5. Briefly, this model parameterises the log transition rates between pairs of locations as a function of potential predictors. Each predictor is associated with an estimable log effect size and inclusion probability. We reported the posterior estimates for the product of these parameters for our analyses. We applied this model both at the country and the continental level and employ a set of 1000 time-scaled trees sampled evenly throughout the post-burning posterior as empirical tree distributions for both RSVA and RSVB.

For the reconstruction at continental level, taxa were assigned to Africa, Asia, Europe, North America, Oceania or South America based on the WHO region classification. Specifically, taxa from the Sub-Saharan Africa and Northern Africa regions were categorised as African. Taxa from the Western, Central, Southern Eastern and South-Eastern Asia regions were categorised as Asian. Taxa from the Caribbean, Central and Northern America regions were categorised as North American. South American countries were categorised as South American. Countries from Melanesia, Micronesia, Polynesia together with Australia and New Zealand were categorised as Oceania. Taxa from Eastern, Western, Northern and Southern Europe were binned as European.

As predictors, we included passenger fluxes (i.e. the number of passengers travelling by air between countries and continents provided by the International Air Transport Association (IATA)52 for the period 2019–2020), population size (for 2019)53 at the origin and destination location, geographic distance and absolute difference in latitude (as proxy for synchronicity in northern or southern hemisphere transmission). For the geographic distances and absolute latitude differences, latitude and longitude coordinates representing the countries’ midpoints were downloaded from the Dataset Publishing Language as provided by Google50. Geographic distances were calculated using the Haversine formula. At the continental level, we used data for the countries from which genome samples are included in the analyses. In additional analyses, we assessed the sensitivity of predictor support with respect to sampling heterogeneity by also including sample size at the origin and destination location as potential predictors. Analyses were performed for both RSVA and RSVB separately, but we also ran the inference applying a single GLM-diffusion model to both data sets to examine the shared signal in both. Finally, for the country-level analyses we also applied a time-inhomogeneous version of the model54 partitioning the evolutionary history in an epoch before and after 5 years since the most recent sampling time. These analyses were performed to examine which time period was informing the predictor support.

Posterior summaries of geographic mixing

To quantify the degree by which RSV clustering is structured by country, we used a normalised entropy measure recently proposed by ref. 6. We focused on the most recent season (2019–2020) because the phylogenetic clustering of these samples and their degree of phylogenetic interspersion is expected to be maximally informed by the INFORM-RSV sampling during the two previous seasons. For each country, we considered a time interval that encompasses the sampling from that recent season and goes back to the end of the previous season for that country. The start and end months of RSV seasons were determined by the relative infection intensities per month for each country. In these time intervals, we summarised the times associated with contiguous partitions of a tree estimated to be in each country. Based on these time estimates we computed a normalised Shannon entropy for each country:

$$-\frac{1}{{{{{\mathrm{ln}}}}}(n)}\mathop{\sum }\limits_{i}^{n}{p}_{i}{ln}({p}_{i})$$

Where pi is the proportion of time associated with that country for partition i of the tree, and n represents the number of partitions for that country in the tree. In case all genomes sampled during the most recent season in a specific country would form a single cluster (partition) in the phylogeographic tree, the entropy measure is expected to be ≈ 0. When none of the genomes from the same country would cluster together, and hence are interspersed with genomes from other countries, the measure is expected to be ≈ 1. We used this measure to summarise the posterior distribution of phylogeographic reconstructions for the analysis with a single time-inhomogeneous GLM-diffusion model shared by both RSVA and RSVB (without sample size predictor). To aid interpretation of the entropy measures, we also summarised the number of unique lineages circulating in each country at the start of the most recent season. Multiple branches associated with the same country sharing a common ancestor with that country state after the end of the previous season are considered to constitute a single unique lineage6. We also attempted to summarise whether these unique lineages represented new introductions or persisting lineages since the end of the previous season for each country6, but this results in uninformative estimates because of an insufficiently dense sampling each season and lack of global coverage. Specifically, lineages from the last season often coalesced with other lineages earlier than the previous season, biasing the estimates towards persistence.

Identification of positively selected sites

Following recommendations by Kosakovsly Pond and Frost29, we identified positively selected sites using different complementary approaches. Specifically, we employed the fast unconstrained Bayesian approximation (FUBAR) and the mixed effects model of evolution (MEME) approach implemented in HyPHy and the renaissance counting (RC)55 approach implemented in BEAST. For FUBAR, we used the variational Bayes approximation and the default threshold of a posterior probability >0.9 for sites to be identified as subject to diversifying positive selection. For MEME, we used the default p-value threshold of 0.1 for testing for selection and we restrict the test to internal branches. For RC, we specified a skygrid coalescent prior, an uncorrelated relaxed clock model, and a GTR model for each codon position. We considered sites to be positively selected if the site-specific empirical Bayes estimate of the nonsynonymous to synonymous rate ratio (dN/dS) results in a lower 95% HPD interval boundary that is larger than 1 and if the mean dN/dS estimate is larger than 1.5. We only reported sites as positively selected if they are identified by at least two of the three approaches used.

Genealogical neutrality tests

To evaluate whether RSV evolution adheres to neutral evolution, we employed a model-based Bayesian procedure that distinguishes between the effects of demography from the effects of selection19. Specifically, we employed the posterior distribution from the genealogical inference produced by BEAST and perform posterior predictive simulation of genealogies under neutral coalescent models accounting for potentially complex demographic histories. For the latter, we adopt the skygrid coalescent model. For posterior predictive simulation under this model, we fit skew normal distributions to the estimates of the interval-specific population sizes and use these in an MCMC simulation procedure. By comparing the genealogical shapes of the inferred tree distribution to that obtained by the posterior predictive simulation using summary statistics, we tested for significant departures from neutral evolution. Here we used two genealogical summary statistics: i) the genealogical Fu and Li statistic (DF), which compares the length of terminal branches to the total length of the coalescent genealogy19, and ii) the ratio of the trunk (or backbone) length over the entire tree length. The concept of a trunk, representing the lineage(s) that persist(s) through time, has frequently been used in characterisation of the viral population turnover dynamics56,57, with viruses like human seasonal influenza that experience strong selective pressure to escape antibody responses showing pronounced trunk and short-lived side branches.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.