Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2

Spatiotemporal bias in genome sampling can severely confound discrete trait phylogeographic inference. This has impeded our ability to accurately track the spread of SARS-CoV-2, the virus responsible for the COVID-19 pandemic, despite the availability of unprecedented numbers of SARS-CoV-2 genomes. Here, we present an approach to integrate individual travel history data in Bayesian phylogeographic inference and apply it to the early spread of SARS-CoV-2. We demonstrate that including travel history data yields i) more realistic hypotheses of virus spread and ii) higher posterior predictive accuracy compared to including only sampling location. We further explore methods to ameliorate the impact of sampling bias by augmenting the phylogeographic analysis with lineages from undersampled locations. Our reconstructions reinforce specific transmission hypotheses suggested by the inclusion of travel history data, but also suggest alternative routes of virus migration that are plausible within the epidemiological context but are not apparent with current sampling efforts.


Supplementary Text 1: Bayesian Evaluation of Temporal Signal (BETS)
As part of the BETS analysis to assess temporal signal in the data, we estimate the performance of the strict clock and the relaxed clock with an underlying lognormal distribution, with and without making use of the sequence sampling times. We make use of generalized stepping-stone samling while accommodating phylogenetic uncertainty 1 to estimate log marginal likelihoods for these four models in BEAST 1.10 2 , running an initial Markov chain of 25 million iterations followed by 100 power posteriors of 1 million iterations each. Figure S1: BETS reveals very strong evidence (log Bayes factors > 215) in favor of the models including sampling times, indicative of strong temporal signal in the data. The relaxed clock offers a marginal improvement of 0.3 log units in fit over the strict clock when using sampling times. Figure S1 summarizes the results of the BETS analysis, which were repeated across multiple independent replicates and with increasing computational settings. When including the sampling times in the analysis of the data, model fit increases significantly and yields very strong evidence (log Bayes factors > 215) in favor of using the sampling time information 3 , with strict and relaxed clock offering near-identical performance (log Bayes factor of 0.3 in favor of the latter).

Supplementary Text 2: Empirical example illustrating the concept of including unsampled diversity.
We used a limited data set of 56 Zika virus (ZIKV) genomes sampled across Southeast Asia, the South Pacific islands and the Americas to demonstrate the concept of incorporating unsampled diversity in phylogeographic reconstructions (Table S1). Several studies have provided evidence that Zika Virus spread from Southeast Asia into French Polynesia (FP) and from there to both continental America as well as to other South Pacific islands [4][5][6][7][8][9] . This resulted in a single lineage comprising viral strains circulating in the South Pacific islands and Latin America, referred to as ZIKV SP-AM . A Bayesian phylogeographic reconstruction based on the 56 genome data set confirms this pattern (Fig. S2A), clearly indicating the basal nature of the FP viruses in the ZIKV SP-AM lineage and a migration event from there to Brazil.
We focus on the key role of FP in the seeding of continental America and replace all FP genomes by unsampled taxa (Fig. S2B). In this analysis, the sampling times for the unsampled FP taxa were integrated over a normal prior distribution parameterized by estimates of the 2013-2014 FP epidemic curve 10 and the between-country diffusion process is parameterized using a GLM that considers air transportation distances as a potential covariate 11,12 . In summarizing the reconstruction, we remove the highly volatile unsampled taxa but we evaluate how they affect the posterior reconstruction of spatiotemporal spread. Our analysis indicates that even without any actual FP genomes, the 12 unsampled taxa ensure that FP is reconstructed as the location of origin for the continental America clade, and other South Pacific island viruses, with moderate support, (Fig. 2B). Including only a single FP genome in a standard reconstruction does not support FP at the origin of the continental America clade; in this case, the highest support is associated with a direct transmission from Asia into Brazil (Fig. 2C). However, including 11 additional unsampled FP taxa with this genome in a GLM analysis reconstitutes the expected pattern of FP representing a stepping stone of ZIKV spread between Asia and Brazil ( Fig. 2D). Figure S2: The impact of incorporating unsampled diversity on reconstructing phylogeographic spread. We compare a standard phylogeographic reconstruction for 56 ZIKV genomes including 12 FP genomes (A) to a GLM-diffusion phylogeographic reconstruction with air transportation with the 12 FP genomes replaced by unsampled taxa (B), a standard phylogeographic reconstruction using only a single FP genome (C), and to a GLM-diffusion analysis where the single FP genome is complemented with 11 unsampled taxa (D). The coloring represents the country of origin for the sampled/unsampled taxa as well as the country estimate for the internal node locations. The internal node circle sizes are proportional to posterior clade support values. The values at the ancestral node for the ZIKV SP-AM lineage, highlighted in grey, represent the posterior location support at this node.

Supplementary Text 3: Phylogeographic predictor estimation including sample sizes
We fit a phylogeographic generalized linear model (GLM) that considers origin & destination sample size predictors in addition to air travel, geographic distance and asymmetry out of Hubei. These estimates indicate a considerably reduced contribution of sample sizes to the transitions rates in travel-aware inferences. The sample size at origin locations reduces from maximum posterior inclusion probability (>.999) to a low posterior inclusion probability (0.222). The destination sample size predictor remains associated with a maximum posterior inclusion probability, but has a lower Log conditional effect size in the travel aware inference. It is interesting to also note that the support for a flux out of Hubei and its log conditional effect size are only moderate when using sampling location only with sample size predictors. Thanks to the many genomes associated with a travel history from Hubei, the flux yields a high support and a very high effect size irrespective of the inclusion of sample sizes in the travel-aware inference. Table 1: Inferred generalized linear model (GLM) of discrete location transitions when using sampling location only and sampling location augment with travel history. We report posterior inclusion probabilities and posterior mean (95% highest probability density intervals) log conditional effect sizes for air travel, geographic distance, asymmetry out of Hubei, and origin & destination sample size.

Predictor
Sampling location only Sampling location and travel history