The greatest public health burden from infectious diseases remains stubbornly endemic pathogens. Once established, pathogens such as Mycobacterium tuberculosis, HIV and, now, SARS-CoV-2 are difficult to control, even when vaccines are available1. Their persistence in the population can be partially explained by the co-circulation of multiple strains of the same pathogen. Endemic pathogens are complicated to study as we rarely understand the mechanisms that drive spread, including the role of human behaviour and why some lineages increase in prevalence over time whereas others disappear. Underlying genetic diversity is particularly extreme in the case of the bacterium S.pneumoniae (the pneumococcus), which is the leading cause of morbidity and mortality worldwide because of lower respiratory infections2,3,4. The pneumococcus comprises >100 known antigenically distinct serotypes and >900 classified lineages (also known as global pneumococcal sequence clusters (GPSCs))5,6,7. Moreover, it is not uncommon for more than 30 antigenically distinct serotypes to co-circulate within a country or region or for a human host to concurrently carry multiple serotypes8. We refer to lineage synonymously with GPSC, whereas a strain references any particular circulating phenotype (including specific serotypes and antimicrobial resistance (AMR)). Here we develop mathematical models using thousands of geolocated genome sequences from South Africa collected over a 15-year period to clarify several key uncertainties in pneumococcal migration. We model the rate and breadth of mobility geographically and how fitness changes linked to vaccine implementation and AMR may affect its spread.

The pneumococcus resides in the human upper respiratory tract. Carriage is a prerequisite for disease, and rates of carriage in children under 5 years old range from 20 to 90%9. Occasionally, asymptomatic carriage goes on to cause local infections such as otitis media or, more severely, invasive pneumonia and meningitis. More than 500,000 deaths per year linked to pneumococcus are estimated to occur globally3,10. Penicillin was first used to treat pneumococcal disease in the 1930s, and it successfully reduced pneumococcal disease until the late 1960s when penicillin non-susceptibility was first noted. Multidrug-resistant strains were described soon after penicillin resistance11,12. By 2019, 19% of deaths associated with AMR had pneumococcal aetiology13. In this context, vaccines are pivotal to disease control. A pneumococcal polysaccharide vaccine that included 23 serotypes was licensed in the USA in 1983. However, the absence of mucosal immunity has seen it replaced by pneumococcal conjugate vaccines (PCVs) except for in older adults and immunocompromised individuals14. PCVs (conjugated with toxin to stimulate mucosal immunity) target a small subset of the polysaccharide capsular serotypes, with the most common formulations including PCV7 and PCV13 (Pfizer)15 and PCV10 (GlaxoSmithKline)16. These target 7, 13 and 10 serotypes, respectively (with all the serotypes included within PCV7 and PCV10 also included within PCV13). In 2021, additional PCV formulations, PCV15 (Merck) and PCV20 (Pfizer), were licensed for use in the USA and Europe17. Pneumococcal vaccination is dynamic, and new vaccine compositions are frequently tested. Vaccine serotypes are often selected because of their high prevalence and AMR among disease isolates from infants and children. PCVs are now included in 76% of national immunization schedules, with different formulations in different countries. For example, in South Africa, PCV7 was implemented in 2009 and replaced by PCV13 in 2011, excluding PCV10 (ref. 18). Despite their success at reducing disease, their use has been linked to serotype replacement by NVTs in both invasive pneumococcal disease (IPD) and carriage8,19,20,21. In South Africa, this has been characterized by increases in NVT serotypes 8 and 15B among IPD, and increases in NVT serotypes 16F, 24, 35B and 11A among carriage isolates (8, 15B and 11A are now included in PCV20)20,22,23. Although there has been success in predicting the fitness of individual isolates based on the overall gene distribution in a population, quantitative measures of fitness linked to the serotype of individual isolates are lacking24,25. This includes quantifying the time it takes after implementation for vaccines to affect the serotype composition in the country. Moreover, quantifying the serotype growth before and after vaccine implementation at different time points is needed. These are crucial knowledge gaps, as serotype distributions ultimately drive vaccine development and deployment strategies. In addition, vaccine implementation has resulted in reductions in AMR among both IPD and carriage isolates20,22,26. However, it remains unclear whether these reductions will persist over time at the population level or whether AMR may rebound.

Mathematical models applied to geolocated pathogen genome sequence data are useful to disentangle the changing prevalence of different lineages. However, most phylogeographical models focus on the rate of pathogen flow between locations, which represents the overall effect of multiple transmission chains. Such models therefore consider a different ecological scale to the specific behaviours of infected people and the surrounding population at each transmission generation27. The relationship between behaviours of individuals at each transmission step and the overall patterns of pathogen flow between locations are complex and nonlinear. Most existing phylogeographical approaches also struggle to account for changing levels of surveillance in both space and time. Here we develop mechanistic models that use the generation time distribution to estimate the number of transmission events that separate the most recent common ancestors (MRCAs) from each pair of tips in a time-resolved phylogeny. Taken together with measures of human mobility probabilities and human population distribution, we can infer mechanisms of pneumococcal migration at each transmission generation. We implemented this model with a focus on South Africa, where approximately 65% of children ≤5 years of age (35% across all age groups) carry the pneumococcus28,29. We incorporate both uncertainty in the phylogenetic reconstructions and sampling uncertainty through a bootstrapping approach30. We explore the robustness of our approach to highly biased observation processes using simulated data with known parameter values. Finally, we quantify the changing fitness of strains in response to vaccine introduction, including those containing AMR.

Quantifying spatial structure

In partnership with the South African National Institute for Communicable Disease and the Wits Vaccines and Infectious Diseases Analytics Research Unit, we sequenced the whole genomes of isolates from each of South Africa’s nine provinces between 2000 and 2014 (n = 6,910, 5,060 from individuals with invasive pneumococcal disease and 1,850 from carriage studies) (Fig. 1a–c). Despite the large number of sequences, this dataset only represents a very small proportion (0.1%) of the circulating pneumococcus over this time period. We identified 184 GPSCs with 69 different serotypes (31.9% NVT) (Fig. 1a,e and Supplementary Table 1). This diversity persisted across provinces, and the distribution of serotypes within GPSCs did not follow a distinct geographical structure (Fig. 1a). In silico predicted AMR was common (penicillin, 48.2%; erythromycin, 17.3%, clindamycin, 11.2%; and co-trimoxazole: 68.4%), with similar distributions for isolates from both carriage and disease (Fig. 1f, Extended Data Fig. 10, Supplementary Fig. 13 and Supplementary Table 3).

Fig. 1: Descriptive summary of S.pneumoniae isolates.
figure 1

a, Phylogenetic tree of 6,910 South African isolates included in this study. Dominant GPSCs (n > 50) are in purple. GPSC1 (top) and GPSC5 (bottom) are highlighted. The columns describe the serotypes and provincial region for each isolate. The branch length legends refer to single nucleotide polymorphisms (SNPs) per site and trees are midpoint rooted. b, Map of the nine provinces of South Africa coloured by province. Scale bar is included in kilometres (km). c, Count of isolates (n = 6,910) per collection year from 2000 to 2014 used in the lineage-level analysis (black) and the 9 dominant GPSCs used in the divergence time analysis (maroon). d, The mean geographical distance for sequence pairs as a function of cumulative evolutionary distance across all GPSCs with 95% CI (blue). The model fit is shown in red. The implied true pattern of spread is shown in purple, after accounting for a biased observation process. e, The proportion of NVT serotypes across the study period. f, The proportion of in silico predicted AMR isolates for four drugs across the study period. The vertical lines denote the introduction of PCV7 in 2009 and PCV13 in 2011. An interactive phylogeny and metadata are available at Microreact (

Taking the 9 most dominant GPSCs in turn (each comprising more than 50 sequences, 2,575 sequences in total), we built recombination-free, time-resolved phylogenetic trees to determine the divergence times between sequence pairs (Extended Data Fig. 1a–i and Supplementary Table 4). We included 1,157 genomes from 14 other countries in Africa and 2,944 from 31 countries outside Africa in our phylogenies. We compared the geographical distance spread per divergence time between pairs in South Africa and found a clear geographical structure. The geographical distance between pairs increased from a mean distance of 142 km (95% confidence interval (CI) = 54–207 km) for those separated by less than 2 years of evolutionary time to 297 km (95% CI = 274–323 km) for those separated by 10–20 years (Fig. 1d). We obtained consistent results when using only sequences that came from individuals with disease (Supplementary Fig. 1A and Supplementary Table 5) and across the different GPSCs (Supplementary Fig. 2). This result is consistent with largely common patterns of spread irrespective of which particular GPSC an individual is infected with. Despite high heterogeneity in GPSC composition within any province, overall, pairs of isolates that are from the same province had 1.27 (95% CI = 1.23–1.31) times the relative risk (RR) of being the same GPSC compared with pairs of isolates from distant provinces (>1,000 km). This RR fell to 1.09 (95% CI = 1.03–1.12) for pairs separated by 500–1,000 km (Fig. 2a and Supplementary Table 5). We obtained consistent results when we limited our analysis to only disease isolates and when we subsampled to have even numbers of sequences by province to mitigate sampling bias (Extended Data Fig. 2a,b and Supplementary Table 5). As there can be hundreds of years of diversity within a single GPSC, we refined the analysis by using different evolutionary windows of separation between pairs of isolates as determined from the phylogenetic trees. Pairs from the same province had 3.87 (95% CI = 3.01–5.1) times the RR of having a recent common ancestor (within 5 years) than distal pairs (>1,000 km apart) (Fig. 2b and Supplementary Table 5). However, as the evolutionary time between isolates increases the relative probability of being from the same province decreases, it is only after around 50 years that the pneumococcus seemed to be well mixed throughout the country (Fig. 2c–f and Supplementary Table 6). Furthermore, comparisons of the spatial location of closely related pairs showed that pneumococcus flow was dominated by within-country movement compared with either movement between South Africa and other African countries or non-African countries (Fig. 2c–e, Supplementary Fig. 3A and Supplementary Table 5). Recognizing that our samples span age groups, we stratified the analysis by the age difference between pairs. Genome pairs from individuals who were greater than 5 years apart in age took slightly longer to become mixed across South Africa (Supplementary Fig. 3B and Supplementary Table 7). These findings are consistent with a highly entrenched pathogen that moves slowly within a country and with slow cross-border transmission.

Fig. 2: RR framework to determine geographical structure.
figure 2

a, RR of being the same GPSC within a province (blue), between different provinces over increasing distance (red) and compared with geographically distant pairs (>1,000 km) (reference). (South Africa; n = 6,910). be, RR of having a time to most recent common ancestor (tMRCA) 0–5 years (b), 5–10 years (c), 10–20 years (d) and 20–200 years (e) ago within South African provinces (blue), across larger distances within South Africa (red), from South Africa to other countries in Africa (n = 1,157) (green), and from South Africa to countries outside of Africa (n = 2,944) (purple). All plots use a reference of pairs that are from distant provinces in South Africa (open triangle). f, RR of similarity over rolling 20-year windows of divergence times for pairs isolated within the same South African province compared with pairs from distant provinces in South Africa (>1,000 km apart). For af, plots are centred at the median and error bars represent 2.5 and 97.5 percentiles across posterior phylogenies.

Inferring migration using human mobility

To understand whether human mobility can explain the slow spread of the pneumococcus, we built a mechanistic model of geographical spread fit to the nine dominant GPSCs and the observed province in which our genome sequences were isolated. We used the generation time distribution, time from one person being infected to infecting the next person (estimated mean of 35 days, standard deviation of 35 days (gamma distribution)) to translate branch lengths to the number of generations between pairs of sequences31,32 (Supplementary Fig. 4). Each transmission generation is an opportunity for pneumococcal mobility. We used directional human mobility probabilities between each of the 234 South African municipalities from Meta Data for Good30 to infer the probable location of a single transmission event, allowing for mobility of both the infected individual and the surrounding susceptible population. As Meta data captures mobility over a single day, we adjusted the duration in any cell to consider total mobility over the infectious period. Furthermore, as Meta users may act differently to those involved in pneumococcal transmission, we incorporated a parameter that allows individuals to have a different probability of staying within their home municipality than Meta users (Fig. 3b). We then calculated the probability of pneumococcal movement between each pair of locations for each transmission generation. This approach integrates over all possible pathways linking two locations. We incorporated the probability of an isolate being sequenced at each geographical location and within each collection year to account for the number of isolates being sequenced differing by year and location. We fit the model in a Bayesian framework using Markov chain Monte Carlo (MCMC). We compared the performance of our models using the Meta data to a gravity model in which the probability of mobility is a function of the distance to a location and its population size (Extended Data Fig. 4). We also separately modelled mobility that depended on distance only.

Fig. 3: Mechanisms of geographical migration.
figure 3

a, The estimated probability of the location (province) of each MRCA (y axis) compared with the population size (x axis) in that province. Points are centred at the median and error bars represent 95% credible intervals. b, The proportion of individual or pathogen mobility as a function of distance from the origin location. We compared the mean distance travelled when we consider the infector only (crossed square), when we consider the mobility of both the infector and the infectee (black filled circle) after a single transmission generation, as well as the overall mobility of the pathogen after ten generations (triangle). As a comparison, we present the expected pathogen spread after a single generation if transmission was completely spatially random (maroon). We also present the difference between Meta users (blue circle) and the movement of those involved in transmission. Points are centred at the median and error bars represent 95% credible intervals. c, The RR of being in each of the 234 municipalities of South Africa after 1 year (10 transmission generations) of sequential person-to-person transmission compared with being in a randomly selected municipality. Black dots denote municipalities with populations of >3 million people. d, The number of unique municipalities visited for 500 unique sequential simulations (grey) and the mean (black-dashed) across years of transmission following an introduction in a randomly selected municipality given the modelled migration probabilities at each transmission generation. e, The relative number of unique municipalities visited by NVT serotypes compared with VT serotypes after 1 and 2 years of transmission. Points are centred at the median and error bars represent 95% credible intervals.

Both the gravity model and the model that relied on Meta data were able to recover the observed spatial spread in pneumococcus (Fig. 1d, red line). However, model fit was better for the Meta data model (difference in deviance information criterion (DIC) of 3.4). Both these models outperformed the distance-only model (Supplementary Table 8). The Meta data model showed that the relative proportion of the population carrying the pneumococcus per province at any time is strongly correlated with the population size in that province (R2 = 0.97, P ≤ 0.01; Fig. 3a). This model enabled us to infer the true underlying spread of pneumococcus (that is, in which we accounted for the biased observation process and mobility in both the infected individual and the surrounding susceptible population) (Fig. 1d, dashed line). We estimated that among individuals involved in pneumococcal transmission, the daily probability of staying in their home municipality was 94.3% (95% CI = 93.8–95.0%) (Fig. 3b, black squares) compared with 99.8% (ranging from 86.6% in Mogale City, Gauteng, to 99.9% in Ba-Phalaborwa, Limpopo) for Meta users (Fig. 3b, blue points). When we incorporated the mobility of both infector and infectee and accounted for the mobility across the infectious period, we estimated that after a single transmission generation (35 days), 53.1% (95% CI = 46.6–58.2) of strains remained in their starting municipality, 22.8% (95% CI = 19.2–27.1%) were in a neighbouring municipality and a small minority were more than 500 km away (Fig. 3b, black points). As the number of transmission generations increased, the probability of reaching distal municipalities also increased (Fig. 3b). The size of the community seemed to be key to determining where lineages travel. After 1 year of sequential transmission, the probability of being in a municipality with a population size of >3 million people was 26.7 (95% CI = 19.8–40.10) times that of being in a randomly selected municipality. This result is consistent with most pathogen movement passing through urban centres (Fig. 3c and Supplementary Table 9).

The municipality in which a strain emerges also seemed important. After 1 year of sequential transmission, a new strain that first occurred in a rural municipality (population density of <50 people per km2) has travelled a median distance of 468.7 km (95% CI = 71.3–1,204.4 km), whereas in the same time window, a variant first occurring in an urban municipality (>500 people per km2) has travelled only 285.6 km (95% CI = 36.3–967.0 km). Furthermore, the variant that emerged in a rural municipality would have travelled to 1.53 times as many municipalities as the urban variant (Extended Data Fig. 3a–c). This result is corroborated by previous research that demonstrated high levels of in and out migration among individuals in rural settings owing to travel for work or education33,34. On average, 1 year after emerging, transmission chains visited 4 (95% CI = 1–8) municipalities, and after 10 years they visited 20 (95% CI = 13–27) municipalities (Fig. 3d). Overall, these results show that the breadth of geographical spread is driven by a small number of long-range transmission events, with most transmissions remaining local. Incorporating our model into a branching epidemic, we found that after 10 years, transmission events are an average 465 km (95% CI = 456–472 km) from where they began (Supplementary Fig. 5A,B).

To test the performance of our model, we simulated transmission within and between districts using the Meta matrix adjusted by a known parameter that determines the probability of staying within one’s home district. We then fit our model using a subset of infections to re-estimate this parameter. Even when only a small minority of infections were sequenced, we recovered the true probability of staying within a district. Our estimates were even robust to an extremely biased observation scenario whereby data from only two locations were available. Using this framework, we also explored the effect of misspecification of the generation time distribution. Using a 50% shorter generation time led to a small overestimate in the probability of staying in one’s home location each day (96.7% versus 95.3%), whereas using a 50% longer generation time had the opposite effect (Extended Data Fig. 5).

Vaccine-induced fitness changes

Implementation of PCV7 in 2009 and PCV13 in 2011 was associated with a substantial disruption in the patterns of circulating serotypes, results that are consistent with what has been previously observed8,20,35,36. However, the introduction of vaccines was also associated with a marked change in fitness. By 2014, serotypes included in PCV13 represented 33.2% of all isolates in our dataset, a reduction from 85.0% in the pre-vaccine era (Fig. 4a). These patterns were consistent across the nine provinces in South Africa (Supplementary Fig. 6). To quantify changes in fitness linked to the vaccines, we fitted models to the annual distribution of serotypes across 184 GPSCs from the full dataset, allowing for differential fitness in serotypes included in PCV7 (serotypes 4, 6B, 9V, 14, 19F, 18C and 23F), PCV13 (which includes additional serotypes 1, 3, 5, 6A, 7F and 19A), and those not included in the vaccine (NVT). This method tracks the proportion of all serotypes at the population level over time and quantifies the relative advantage of each of them following the implementation of vaccines. This simple formulation was able to recover the observed distribution of serotype proportions in each group, by year, across provinces (Fig. 4a–c and Supplementary Fig. 6). We note that the number of NVT and VT isolates we used in our model do not represent the underlying incidence of NVT and VT, as only a small proportion of all infections were detected and sequenced. However, as we focused on the relative abundance of NVT and VT strains per year, our approach is robust to changes in the absolute numbers of isolates sequenced.

Fig. 4: Vaccine-induced fitness dynamics.
figure 4

ac, Data (points) and model fit (lines) for the proportion of serotypes from NVTs (a), PCV7 types (b) and additional PCV13 types not included in PCV7 (c) from the years 2000 to 2014 in this study. The long dashed line indicates the time of PCV7 implementation (2009) and the short, dashed line indicates the time of PCV13 implementation (2011). d, Relative fitness for the three groups of serotypes compared with the NVT fitness estimates before and after PCVs were introduced. e, Relative fitness estimates for all three groups of serotypes comparing the before and after PCV eras. For ae, before PCV refers to before 2009 for NVT serotypes, before 2009 for PCV7 type serotypes and before 2011 for PCV13 type serotypes. f, Proportion of penicillin resistance overall (black line), within NVT strains (maroon points) and within VT strains (turquoise points) with model fits. The dashed line indicates the time of PCV implementation (2009). g, Relative fitness of penicillin resistance among NVTs (pink) and VTs (blue) in before (left) and after (right) PCVs. Data in d, e and g are on a log scale. For af, plots are centred at the median and include error bars representing 95% credible intervals around the posterior parameter distributions (n = 6,798).

Before the implementation of vaccines (2000–2008), NVTs had a relative fitness of 1.08 (95% CI = 1.06–1.09) compared with serotypes included in the vaccine (combining serotypes in PCV7 and PCV13 into VT). Following the implementation of PCV7 and PCV13, the fitness of the serotypes they target declined respectively compared with NVT serotypes (Fig. 4d and Supplementary Tables 10 and 11). When comparing serotype fitness before and after the introduction of PCVs, VTs had a relative fitness of 0.86 (95% CI = 0.78–0.92) and 0.76 (95% CI = 0.67–0.84) for PCV7 and PCV13 serotypes, respectively (Supplementary Table 11). Meanwhile, for the NVTs, vaccines were associated with a 1.25 (95% CI = 1.14–1.35) times increase in relative fitness from 2009 to 2014 compared with before the implementation of vaccines (Fig. 4e and Supplementary Tables 10 and 11). When we directly compared the fitness advantage of NVTs compared with VTs in the PCV era, NVTs had a relative fitness advantage of 1.68 (95% CI = 1.59–1.77), which is equivalent to a 1.05 (95% CI = 1.05–1.06) growth advantage (relative to VTs) at each transmission generation. In a sensitivity analysis, the results remained consistent for carriage and disease isolates, respectively, despite them having been sampled from different cohorts (Extended Data Fig. 6 and Supplementary Fig. 7). We also found consistent results across provinces (Supplementary Fig. 6). We additionally assessed whether there was a delay between vaccine implementation and resulting changes in strain fitness (Extended Data Fig. 7). The best fitting model assumed the change in fitness occurred in the same year as vaccine implementation.

Refining this model to look at the fitness of individual serotypes showed a wide range of fitness across serotypes. Our findings highlight that all strains are fundamentally different in underlying fitness (Extended Data Fig. 8), which means that NVTs will differ in their ability to alter their ecological niche following changes in vaccine formulation. This result needs to be taken into consideration in the development of new vaccine formulations. Among NVTs, serotypes 15A, 35B and 8 had the greatest fitness advantage after PCVs were used (Supplementary Figs. 8 and 9), a result concordant with what has been previously observed20,22,37. Shifts in lineage fitness also resulted in changing patterns of spread. By incorporating our fitness estimates for NVT and VT strains after vaccination into our mobility model, we estimated that the number of affected municipalities from a strain of a NVT was 2.02 (95% CI = 1.81–2.25) times the number of affected municipalities from types included in the vaccine (Fig. 3e).

We next explored whether using the proportion of isolates that were VT versus NVT within each GPSC at the start of the study period and our fitness estimates could explain the subsequent dynamics of individual GPSCs. Simply using VT and NVT fitness estimates could explain 60% of the variance in individual GPSC prevalence at any time. Allowing for serotype-specific differences produced a small improvement, explaining 65% of the variance (Supplementary Fig. 10 and Supplementary Table 13). The unexplained variance reflects GPSC-specific fitness and is probably driven by negative-frequency dependent selection (NFDS)24. This result highlights the predictive nature of the serotype composition of GPSCs in determining PCV-driven GPSC dynamics (Supplementary Figs. 11 and 12).

The serotypes included in the vaccines were prevalent in childhood disease and had high levels of AMR in the USA, where the vaccines were developed38. The high levels of AMR in the VT strains was also present globally26. In South Africa, similar to other countries, reductions in AMR have been noted since vaccine implementation; however, it remains unclear whether this trend will persist or whether AMR eventually rebounds26. In South Africa, before vaccines, 63.6% of VT and 8.8% of NVT strains were resistant to penicillin. We found that there was a clear reduction in overall penicillin resistance following vaccine implementation, which was driven by reductions in the proportions of strains that are VT (Fig. 4f and Supplementary Table 12). The trends, although still present, were less clear in the other investigated antimicrobials (Extended Data Fig. 9). Owing to the relevance of penicillin as a first-line antimicrobial for pneumococcal disease and the high proportion of resistance in this population, we used our same modelling framework and were able to recover the observed proportions of strains that were resistant to penicillin over time. Before vaccines, among both NVT and VT strains, there was limited difference in the fitness between those that were penicillin-resistant and penicillin-susceptible. However, following implementation of vaccines, NVT-resistant strains were 1.30 (95% CI = 1.19–1.43) times as fit as penicillin-susceptible NVT strains (Fig. 4f,g and Supplementary Fig. 19). Conversely, resistance did not seem to have changed among VT penicillin-resistant strains (relative fitness of 0.97 (95% CI = 0.91–1.03)) (Fig. 4f,g, Supplementary Fig. 13A–C and Supplementary Table 12). Expansion of NVTs within typically VT-associated lineages is the most common mechanism for serotype replacement36. As a result, the penicillin resistance associated with these newly expanded NVT lineages is able to persist in the population19,36. Together with our quantification of growing penicillin resistance among NVTs following the use of PCVs, this result suggests that the overall reduction in penicillin resistance seen following vaccine implementation may not persist. Our data also highlight the nuanced effect that vaccines can have on patterns of AMR (Supplementary Fig. 13D–F). It is probable that next-generation higher valency PCVs may also lead to increased AMR prevalence in those serotypes not included in the vaccine. Changing patterns of antimicrobial use in this population may also result in shifts in resistance patterns. The widespread presence of penicillin resistance at the beginning of our data collection period would implicate expansion of resistance among existing lineages; however, we cannot exclude the emergence of some new resistance forms.


We did not have complete carriage and invasive disease data across the entire time period. However, we performed sensitivity analyses to determine whether our results are robust to including carriage and invasive disease together. The human mobility data are Meta baseline data that were released by Meta owing to the SARS-CoV-2 pandemic in 2020. We used aggregated data across 17 months (January 2020 to June 2021) within a single mobility pattern matrix. As mobility was altered during this period and to address the possibility that Meta mobility data are different to the movement of individuals involved in pneumococcal transmission, we included an additional parameter that adjusts the human mobility data to account for more or less time being spent in home municipalities. As we were able to obtain good fits to the observed spread of pneumococcus, and these models outperformed standard gravity models, our findings highlight how imperfect mobility data can nevertheless be useful. Vaccination levels for PCV7 were reported to be 89% by 2022 (ref. 39). We did not have the data to consider changes in coverage in time or across provinces. Other settings with different levels of coverage may observe different fitness effects from the implementation of the vaccine. Within the fitness model, as we were looking at relative proportions of strains with increasing proportions of resistance, there may still be decreased total burden of resistant disease if those strains carrying it remain low in prevalence.


Here we quantified and explained the movement of a persistent human pathogen for which the geographical course has been largely hidden by its diversity and endemicity. The pneumococcus has an affinity for urban centres through which it channels its wider geographical spread. Although it is characterized by slow transmission overall, the use of vaccines can substantially and rapidly change pneumococcal lineage ecology. Although vaccine-associated fitness dynamics have been previously described in the pneumococcus8,36, they have not been directly quantified. Increasing proportions of NVTs in the disease isolates from the PCV era can be largely attributed to the decrease in number of VTs rather than the increasing prevalence of NVTs22. Vaccination has had a secondary effect on penicillin resistance, with a decrease in recent years in South Africa. Given the estimated growth advantage of penicillin-resistant NVT strains, we may see a reversal of this benefit; however, estimating the carrying capacity of this growth is beyond the scope of this model. Furthermore, we quantified pneumococcal geographical spread and the spatial impact of NVT expansion after vaccination. Our findings highlight how directly observed characterizations of human mobility using mobile phone data or Meta data can be used to obtain a mechanistic understanding of how pathogens spread within phylogeographic frameworks. This includes considering mobility of both infected individuals and the susceptible population and how we can adjust these datasets to account for systematic differences in behaviour between mobile phone data and Meta data and those involved in transmission. We note that basic gravity models also performed well, providing a useful alternative in settings in which mobile phone data options are not available. Our description of pneumococcal geographical spread provides new insight into the movement of emergent strains. In South Africa, the population density in the emergence location of a NVT strain may affect its speed of spread across the country and have implications for public health responses to emergent strains. Emergence in a highly mobile, peri-urban area may enable both rapid proximal geographical spread and less frequent distal seeding events, and thus more complete distribution across the country. The fitness model and human mobility model together provide frameworks to quantify and better understand the migratory and fitness dynamics of this globally endemic pathogen. The magnitude of South Africa and its provinces demonstrates that these frameworks may be applied to other large regions.


Data sources and processing

Pneumococcal sequence data and metadata

The genomes included in this study were collected as part of the Global Pneumococcal Sequencing project (GPS), which is a global genomic survey of S.pneumoniae40. The invasive disease isolates included here were collected by the National Institute for Communicable Disease in South Africa from 2000 to 2014 (ref. 22). In the initial phase of genome sequencing, approximately 300 invasive-disease isolates from each year (2005–2014) were selected with a specific target age breakdown (50% from <3 year olds, 25% from 3–5 year olds, 25% from >5 year olds). In the second phase of sequencing, approximately 200 disease isolates from 2000 to 2004 and approximately 100 invasive disease isolates from 2005 to 2010 from children <5 years old were randomly selected for sequencing22 (Supplementary Table 2). The carriage isolates were collected in Soweto, Gauteng (n = 736; collection years 2010, 2012 and 2013) and Agincourt, Mpumalanga (n = 1,114; collection years 2009, 2011 and 2013) by the Wits Vaccines and Infectious Diseases Analytics Research Unit. A random sample of 400 carriage isolates from each year were chosen for genome sequencing (Supplementary Table 2). We included both carriage and invasive-disease isolates and conducted sensitivity analyses throughout to confirm the methods were robust to both carriage and invasive disease individually. Invasive disease is defined as the bacterium being isolated from a typically sterile site. The majority of total isolates were from children aged <5 years (75.0%); 7.6% were from individuals aged 5–20 years and only 17.5% were from adults (>20 years) (Supplementary Table 7). These were distributed across the before and after PCV periods. We additionally included similar GPSCs from the GPS database (for context within the global population) for the RR analysis. We utilized metadata that included collection year and month, residence province of the patient, age of the patient, sampling site and clinical manifestation. The range of sampling sites included nasopharyngeal swabs (for carriage), blood, pleural fluid, cerebrospinal fluid, peritoneal fluid, pus and other joint fluid (for invasive disease).

Population data

We estimated the population for each municipality (n = 234) across South Africa using the population-size estimates from LandScan 2017 (refs. 41,42) (Extended Data Fig. 11).

Mobility data

The mobility data used in this study were collected using Meta Data for Good Disaster maps from South Africa. These are initiated at the onset of a disaster—in this case, the SARS-CoV-2 pandemic—and track the geographical movement of Meta users43. We used the baseline (adjusting to 2 weeks before) human mobility for each month from January 2020 to July 2021 (refs. 43,44) to attain a mobility probability from and to each municipality. For each origin (home) municipality (n = 234), we determined the mean monthly number of Meta users that were in each destination municipality in South Africa. Location pairs with a value of zero were given a value of ten. We divided each cell by the total number of users across all destinations for that origin municipality. Each cell in the resultant original–destination matrix therefore represented the probability of being in each destination municipality given your home municipality. This is the probability of mobility from each municipality to each other municipality after 2019. Because we do not have mobility data from the exact years the genomes were sampled, we adjusted the diagonal of the mobility matrix using an estimated parameter. This allows people to stay more or less at home and mitigates the effect of non-year matched mobility data. We define the radius of gyration (Ri) for each municipality as follows:

$${R}_{i}=\sqrt{{\sum }_{i}{m}_{i}{{d}_{i}}^{2}}$$

Where di is distance to region i, and mi is the probability of mobility to region i. The sum is across all municipalities (n = 234)45 (Extended Data Fig. 11).

Generation time distribution

We used a simulation framework to estimate the overall generation time using the separate contributions of the carriage durations and the incubation period. This approach has previously been used for other pathogens46.

We sampled 1,000 carriage durations from an exponential distribution with means that are inverse to the clearance rates estimated across serotypes in ref. 31 (clearance rate = 0.026 (95% CI = 0.025–0.028) episode per day) and in ref. 32 (clearance rate = 0.032 (95% CI = 0.030–0.034) episodes per day)32.

To sample the day of transmission, we randomly sampled a time point between zero and the time of clearance for each individual. We then separately sampled an incubation period using a uniform distribution of between 1 and 5 days. To account for longer carriages resulting in more opportunities for transmission, we sampled from the distribution of generation, with the probability of sampling each individual weighted by the total carriage duration. The total generation time is then the sum of the duration to transmission and the incubation period.

We repeated these steps 10,000 times and estimated the mean and standard deviation of this final distribution assuming that the generation time follows a gamma distribution with a mean of 35 and a standard deviation of 35 (exponential distribution with a rate of 1/0.096). By comparing the histogram of the distribution with the gamma distribution, this seems a reasonable assumption (Supplementary Fig. 4).

Sample culture and genome sequencing

The pneumococcal isolates were selectively cultured on BD Trypticase soy agar II with 5% sheep blood (Beckton Dickinson) and incubated overnight at 37 °C in 5% CO2. Genomic DNA was then manually extracted using a modified QIAamp DNA Mini kit (Qiagen) protocol. As part of GPS, pneumococcal isolates were whole-genome sequenced on an Illumina HiSeq platform to produce paired-end reads with an average of 100–125 bp in length, and data were deposited into the European Nucleotide Database. Whole-genome sequence data were processed as previously described36,47.


We performed predictive antimicrobial susceptibility profiling using the CDC-AMR pipeline for three classes of antimicrobials: β-lactams (penicillin; encoded by the genes pbp1A, pbp2B and pbp2X)48,49; sulfonamides (co-trimoxazole; folA and folP); and macrolides (erythromycin and clindamycin; ermB and mefA)50,51. This was done for 6,798 randomly selected isolates40.

Constructing time-resolved phylogenetic trees

We selected the GPSCs for which we had genomes from each of South Africa’s nine provinces and for which we had a minimum of 50 sequences in total to build phylogenies, henceforth referred to as ‘dominant GPSCs’. There were nine dominant GPSCs: GPSC1, GPSC2, GPSC5, GPSC10, GPSC13, GPSC14, GPSC17, GPSC68 and GPSC79 (n = 2.575). Assembly was performed using Wellcome Sanger Institute pathogen informatics automated pipelines and is freely available for download from GitHub under an open-source licence, GNU GPL 3 (ref. 52). For each sample, sequence reads were used to create multiple assemblies using VelvetOptimiser (v.2.2.5) and Velvet (v.1.2.10)53. An assembly improvement step was applied to the assembly with the best N50 and contigs scaffolded using SSPACE (v.2.0)54, and sequence gaps were filled using GapFiller (v.1.11)55. Assembly quality control parameters included a minimum average sequencing depth of 20× and an assembly length of 1.9–2.3 Mb. Sequences with more than 15% heterozygous SNP sites were excluded.

We created reference genomes for each GPSC using ABACAS (v.1.3.1) to order the contigs from a representative of each GPSC mapped to S.pneumoniae (strain ATCC 700669/Spain 23F-1) (EMBL accession: FM211187)56. Any contigs that did not align were concatenated to the end. We multiply mapped all genomes from each dominant GPSC against these references, respectively, using a custom mapping, variant calling and local realignment around indels pipeline ( using bwa-MEM (v.0.7.17)58 and samtools mpileup (v.1.6)59. The minimum base quality for a base to be considered was 50. The minimum mapping quality for a SNP to be called was 20, with a minimum of 8 reads matching the SNP. We built trees masking recombination regions using Gubbins (v.2.4.1)60 with the hybrid model that uses FastTree for the first iteration and RAxML subsequently61 and a GTR model. We converted branch length to time using BactDating (v.1.0) with a mixed gamma, relaxed clock model62. We compared concordance between BEAST (v.1.10.4)63 with both strict and relaxed clocks, and a Bayesian skyline prior. As the results were concordant, we used BactDating owing to its shorter runtime (Supplementary Fig. 14).

RR framework

We used a RR framework to investigate the risk of genetic similarity across geographical distance64. We compared the location (loc) and label (G) (that is, GPSC or genetic similarity) of pairs of sequences that were collected around the same time (t). This approach has been shown to be robust to substantial biases in timing and location of isolate collection64. We first constructed pair-wise matrices comparing every isolate to every other isolate (n pairs = 6,910). In the numerator was the ratio of pairs that were the same GPSC, collected within a year of each other, from the same province, over the total number of pairs collected within a year of each other from the same province. The denominator was the ratio of pairs that were the same GPSC, collected within a year of each other, from distant provinces (>1,000 km apart) (Lref), over the total number of pairs collected within a year of each other from distant provinces. Geographical distances were calculated based on the centroid coordinates of each province. To demonstrate the suitability of using centroid distances, we simulated a spatial transmission process for 1,000 separate chains in which at each generation, a daughter point is placed at a randomly located location 350 m in each of the x and y direction. This was repeated over 20 generations. We then identified the centroid of each case based on the closest coordinate rounded to the nearest kilometre. We then calculated the total distance covered for both the true distance and the centroid distances and found that the resulting distances were similar (Supplementary Fig. 15).

To quantify uncertainty, we used a bootstrapping approach whereby in each bootstrap iteration, we randomly sampled with replacement the isolates before recalculating the statistic. We report the 2.5 and 97.5 percentiles from the resulting distribution.

We also repeated the same analysis but used the time-resolved phylogenetic trees to interrogate pairs across increasing divergence times (breaking the GPSCs into higher resolution). For this, rather than matrices designating whether pairs were the same GPSC or different GPSCs, the divergence time between each pair was included. We only included the divergence times between like GPSCs (Fig. 2c–e).

$${{\rm{R}}{\rm{R}}}_{{\rm{l}}{\rm{o}}{\rm{c}}}(\,g1,g2)=\frac{\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{\rm{l}}{\rm{o}}{\rm{c}}}_{i}={{\rm{l}}{\rm{o}}{\rm{c}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}\cap ({G}_{ij} > {g}_{1},{G}_{ij} < {g}_{2}))\,/\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{\rm{l}}{\rm{o}}{\rm{c}}}_{i}={{\rm{l}}{\rm{o}}{\rm{c}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}})}{\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{i}={{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}\cap ({G}_{ij} > {g}_{1},{G}_{ij} < {g}_{2}))\,/\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{i}={{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}})}$$

We utilized the framework to compare a range of geographical distances, keeping the reference distance to pairs that were >1,000 km apart (Fig. 2).

To identify the divergence time at which pairs had an equal risk of being in the same province as distant provinces (time to homogenization across South Africa), we investigated the divergence time at which there was no increased risk of similarity within a province compared against distant provinces (RR = 1). We stratified distances in South Africa into groups of distances, including the 9 within provinces, 14 pairs that were <500 km apart, 16 pairs 500–1,000 km apart and 6 pairs of provinces >1,000 km apart. We repeated the framework across rolling 20-year time windows at 10-year intervals from 0 to 100 years (g1,g2) (Fig. 2f). We repeated this for pairs for which one was from South Africa and the other was from another country in Africa (Supplementary Fig. 3A). We sampled 300 sequences from each province with replacement to compensate for biased sampling (Extended Data Fig. 2a). Furthermore, to incorporate phylogenetic uncertainty into the statistical framework, we sampled 100 individual phylogenies from the BactDating posterior. We report the 2.5 and 97.5 percentiles from the resulting distribution. We also repeated the analyses only including pairs isolated from patients with pneumococcal disease (Extended Data Fig. 2b).

Probabilistic mobility model

Overall strategy

We extended a previously published mechanistic phylogeographic model65 to estimate the mobility of the pneumococcus between pairs of municipalities at each transmission step. To infer the probable path of transmission between sequence pairs, we used the divergence time and the generation time distribution to estimate the number of transmission generations between pairs of sequences. Each generation is a possible transmission event and provides an opportunity for a mobility event.

The approach ultimately aims to estimate an origin destination matrix for a transmission step, whereby each cell represents the probability that the pneumococcus is now in location j after one transmission step given it was previously in location i. As the phylogenetic trees combined with the generation time provide an estimate of how many transmission steps separate pairs of samples, we can use repeated matrix multiplication to integrate over all possible pathways linking two locations (see below for more details). We incorporated the probability of sampling at each geographical location, within each collection year, by GPSC to account for our observation process.


We follow notation per a previous study65. A pair of isolates, Ca and Cb, with sequences, Seqa and Seqb included in a phylogeny are found in locations, La and Lb, and the samples were taken in the years, Ta and Tb. The inferred MRCA between Ca and Cb is time Tm and located in Lm. The number of transmission generation from the MRCA to Ca is Ga and to Cb is Gb.

Model fitting

Single transmission generation

Considering a single transmission generation, the probability that persons i and j come into contact with each other given i lives in location a and j lives in location b can be written as follows:

$$\begin{array}{l}{\rm{P}}({\rm{person}}\,i\,{\rm{and}}\,j\,{\rm{come}}\,{\rm{into}}\,{\rm{contact}}| {L}_{i}=a,{L}_{j}=b)\\ =\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\end{array}$$

Where \(P\left({V}_{i}=k| {L}_{i}=a\right)\) is the probability that individual i, whose home location is in a, visits location k and \(P\left({V}_{j}=k| {L}_{j}=b\right)\) is the probability that individual j whose home location is in b visits location k, and Bk is the location-specific probability of transmission for k.

At time \({\tau }\), one infector in i in location a is expected to transmit to this number of persons in location b:

$$\begin{array}{l}E({\rm{number}}\,{\rm{of}}\,{\rm{persons}}\,{\rm{from}}\,b\,{\rm{infected}}\,{\rm{at}}\,{\rm{time}}\,{\tau }| {L}_{i}=a)\\ =\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\cdot {S}_{b,{\tau },{\rm{gpsc}}}\end{array}$$

Where \({S}_{b,{\tau },{\rm{gpsc}}}\) is the number of susceptible people in location b at time \({\tau }\) with some lineage = gpsc.

The total number of people infected by the infector is:

$$\begin{array}{l}E\,({\rm{number}}\;{\rm{of}}\;{\rm{persons}}\;{\rm{from}}\;{\rm{all}}\;{\rm{locations}}\;{\rm{infected}}\;{\rm{at}}\;{\rm{time}}\,\tau | {L}_{i}=a)\\ \,\,=\mathop{\sum }\limits_{m}^{N}\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot {P}({V}_{j}=k| {L}_{j}=m)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}\end{array}$$

Conditional on transmission occurring, the probability that the infectee’s isolate is taken in location b is:

$$\begin{array}{l}{\delta }_{a,b,\tau ,{\rm{gpsc}}}=P({L}_{j}=b| {L}_{i}=a)\\ \,\,=\frac{{\sum }_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}}{{\sum }_{m}^{N}{\sum }_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=m)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}}\end{array}$$

We then created an NXN transmission matrix,\({\Delta }_{{\tau },{\rm{gpsc}},{\rm{gen}}=1}\), with N being the total number of locations, containing the transmission probabilities, asymmetrically, between all pairs of locations at each point in time; \({{\delta }}_{a,b,{\tau },{\rm{gpsc}}}\) is element [a,b] of the matrix.

Human mobility characterization

We use Meta mobility data (MetMob), as described in the mobility data section above, to characterize mobility between the 234 municipalities of South Africa. We aggregated these to the province level (n = 9) to fit the model. Initially we extracted a 234 × 234 matrix that sets out the probability that an individual from municipality a visits location k; again, where k = any municipality.

MetMob comes from individuals using Meta; however, this may not be representative of the amount of time spent at home by those involved in pneumococcus transmission. The mean of the diagonal of the Meta Mobility matrix (234 × 234) is 0.989, implying, on average, 98.9% of Meta Users stay in their home municipality, Hi. To allow for individuals to spend more or less time at home than represented in the MetMob data, we incorporated a parameter to adjust the probability of being at home (\({\theta }\)). We adjusted the probability of staying home using a standard logistic function and restricted it with bounds of –0.04 and 0.6 to facilitate exploration of a sensible space. The adjustment allowed by the bounds limits the range of movement Hi–0.6 and Hi+0.04.

The probability of a person remaining in location a therefore becomes:

$$\begin{array}{l}P({V}_{i}=a| {H}_{i}=a)\\ =\,{\rm{MetMob}}[a,a]-\left(-0.04+\left(\frac{(\exp ({\theta }))}{(1+\exp ({\theta }))}\right)\cdot (0.6+0.04)\right)\end{array}$$

Where a value of greater than 1.0 is obtained for a specific municipality, this is replaced by a value of 0.999.

The estimates thus far are mobility per day, but we were interested in mobility across the infectious period. Therefore, we adjusted the diagonal to account for mobility each day within the infectious period:


We rescaled the probabilities so that the sum of all mobility is equal to 1:

$$P({V}_{i}=a| {H}_{i}=a,k\ne a)={\rm{MetMob}}[a,k]/(1-P({V}_{i}=a| {H}_{i}=a))$$

Where MetMob[a,k] considers all mobility probabilities from the Meta Mobility data.

The sum of the movements to South African municipalities is equal to 1, which assumes that the analysis contains all possible movements of both the pneumococcus and people and implying no external introductions. The outcome of this is that some mobility may be missed, especially around the country borders.

Probability of the pneumococcus being in each location after G transmission generations

To determine the probability that location k contains the home location after G transmission generations, we used matrix multiplication, which integrates across all possible pathways connecting two locations.

$${\Delta }_{{\tau }={T}_{G},{\rm{gpsc}},{\rm{gen}}=G}=\mathop{\prod }\limits_{r=2}^{G}{\Delta }_{{\tau }={t}_{r-1},{\rm{gpsc}},{\rm{gen}}=r-1}\cdot {\Delta }_{{\tau }={t}_{r},{\rm{gpsc}},{\rm{gen}}=1}$$

Where tr is the time of generation Gr.

Probability of observing a pair of cases in two specific locations

The probability that CA has home location LA and CB has home location LB is conditional on the sequences being observed in locations LA and LB at times TA and TB. We assumed that the location of two cases, Li and Lj, is dependent on the location of their MRCA, Lm, and the number of transmission generations separating them from their MRCA, GA and GB.

The observations processes across locations are independent of each other, and each transmission event is independent of other transmission events. The probability of observing (Obs) a case at Li at time TA is not dependent on the number of generations to, or location of, the MRCA. We considered discretized space of the nine provinces of South Africa, resulting in the following equation:

$$\begin{array}{c}P({L}_{A},{L}_{B}| {{\rm{Obs}}}_{{L}_{B},{T}_{B}},{{\rm{Obs}}}_{{L}_{B},{T}_{B}},{{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})\\ =\frac{{\sum }_{{L}_{m}}{\sum }_{{G}_{A}}{\sum }_{{G}_{B}}P({{\rm{Obs}}}_{{L}_{A},{T}_{A}})P({{\rm{Obs}}}_{{L}_{B}{T}_{B}})P({L}_{A}| {L}_{m},{G}_{A})P({L}_{B}| {L}_{m},{G}_{B})P({L}_{m})P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})}{{\sum }_{{L}_{i}}{\sum }_{{L}_{j}}{\sum }_{{L}_{m}}{\sum }_{{G}_{A}}{\sum }_{{G}_{B}}P({{\rm{Obs}}}_{{L}_{i},{T}_{A}})P({{\rm{Obs}}}_{{L}_{i}{T}_{B}})P({L}_{i}| {L}_{m},{G}_{A})P({L}_{j}| {L}_{m},{G}_{B})P({L}_{m})P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})}\end{array}$$

Probability of G generations between MRCA and a sequenced isolate

Under the previous equation, \(P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})\) represents the generation time distribution.

We can extract the joint probability of CA and CB being separated from the MRCA by GA and GB transmission generations, respectively, using the above-derived generation time distribution and the time-resolved phylogenetic trees.

Assuming the generation time is gamma distributed and all transmission events are independent, the sum of the gamma distribution is also gamma distributed. Additionally, we can extract the evolutionary times EA and EB, separating CA and CB from the MRCA. As previously described65, using equation (19), we can estimate the probability of g transmission events over many trees, allowing us to incorporate phylogenetic including evolutionary parameters from the tree structure.

We determined the probability for the number of generations from MRCA for each isolate for 1–1,000 generations, using the generation time derived above.

Location of the MRCA (P(L m))

We estimated the probability that, on average, an MRCA is in each of the nine provinces in South Africa. We estimated parameters for each of the eight provinces, setting Western Cape aside as a reference, and dividing by the total across all nine to ensure that the sum of the probabilities is 1. This again assumes no external introductions.

Calculation of likelihood

We calculated the likelihood using all pairs of available sequenced S.pneumoniae as previously described65. We accounted for the observation process by incorporating the probability of sampling in each location for isolates belonging to each GPSC annually.

Likelihood equation

We calculated the likelihood using all pairs of sequenced pneumococci as follows:

$$L\propto \mathop{\prod }\limits_{{\rm{province}}=1}^{9}\mathop{\prod }\limits_{i=1}^{{n}_{{\rm{gpsc}}}}\prod _{j\ne i}P({L}_{i},{L}_{j}| {{\rm{Obs}}}_{{L}_{i},{T}_{i}},{{\rm{Obs}}}_{{L}_{j},{T}_{j}},{{\rm{Seq}}}_{i},{{\rm{Seq}}}_{j},{T}_{i},{T}_{j})$$

Where ngpsc are the number of sequences available from GPSC gpsc.

Hamiltonian MCMC

We used an MCMC approach to estimate our parameters using the package fmcmc (v.0.5-1) implemented in R66. We estimated nine parameters: a parameter that adjusts the probability of staying in the home municipality compared with Meta mobility data; and eight parameters capturing the relative probability that the MRCA of a pair of individuals was in each of the other eight provinces compared with the province of Western Cape (the reference).

We only used pairs of sequences that were separated by less than 10 years of evolutionary time between them. After 10 years, there are limited spatial signals remaining, as the bacterium has had many opportunities to move. This approach was used to make the model computationally tractable. To incorporate phylogenetic uncertainty, we repeatedly refit the model using 50 randomly selected phylogenies from the BactDating posterior. We report the 2.5 and 97.5 percentiles from the resulting distribution. In a sensitivity analysis, we showed that increasing the model to 15 years resulted in similar estimates (Extended Data Fig. 12a).

Model performance

To assess the performance of our model, we used a simulation framework. We simulated 50,000 pairs of events, locations and the location of the MRCA between pairs, whereby the probability of mobility between each pair of locations was determined by the Meta mobility matrix adjusted by a parameter of known value of −2. We tested our ability to recapture this input parameter. To incorporate the biased observation process, we downsampled the simulated data based on the by-province sampling probabilities from our true data.

We used the downsampled data to fit our model with 20,000 steps of a MCMC with a jump step of 0.08 in 3 chains. We were able to recapture the downsampled data utilizing the human mobility framework and the estimated parameter for the probability of staying at home, accounting for the sampling probability per province. We then utilized the same human mobility framework and the estimated parameter, excluding the sampling probability, and were able to recapture the complete simulated data, including the input parameter (Extended Data Fig. 5a,b).

We repeated these simulations but downsampled with various biases. We determined how well the model performed when we only sampled two provinces. We also tested how far off our estimates would be if our generation time estimate was 50% higher or 50% lower than we had estimated given the Kenyan and Gambian data31,32 (Extended Data Fig. 5c,d).

Model sensitivity analyses

We estimated our parameters only to include isolates from patients with invasive pneumococcal disease (Supplementary Fig. 1B).

To test the impact of a range of generation times on the parameter estimates, we also re-ran our MCMC with generation times of 15, 35 (as included in the model) and 55 days. To quantify uncertainty, we sampled posterior parameters, reporting the 2.5 and 97% percentiles (Extended Data Fig. 12b).

We confirmed that the chains converged for each of the nine parameters estimated (Supplementary Fig. 16).

Probability guided transmission simulations

Person-to-person transmission chains

We simulated person-to-person transmission chains seeded in a starting municipality weighted by the population size. We determined the RR of being in a specific municipality after 10 transmission generations (approximately 1 year) across 500,000 simulations.

We fixed the starting municipality to be rural (population density <50 km–2) or urban (population density >500 km–2) and repeated the above simulation. For 10,000 sequential simulations, we counted the number of unique municipalities affected and distance travelled at each transmission interval weighting by population size. We then determined the number of municipalities travelled to across all transmission chains (Extended Data Fig. 3).

Branching epidemic

We simulated a branching epidemic in which we drew the number of transmission events seeded by each event from a Poisson distribution around an effective reproductive number (Reff) of 1 and amplitude of 0.15 over 100 iterations. We determined the mean distance from the starting municipality after 60 generations (5.8 years with a 35-day generation time) and the number of municipalities visited over that time. We calculated the uncertainty at the 95% CI of a normal distribution.

Gravity model

We compared the performance of the model that used Meta data with the performance of a simple gravity model whereby the probability of mobility is determined by the distance and human population size of the municipalities. We calculated the probability of mobility between locations i and j (GravMobi,j) to be the log of the destination population size (popsizej) raised to parameter β divided by the distance between locations, loci and locj (disti,j) raised to parameter γ.

$${{\rm{GravMob}}}_{i,j}=\frac{\log ({{{\rm{popsize}}}_{j}}^{{\beta }})}{{{{\rm{dist}}}_{i,j}}^{{\gamma }}}$$

We also tested a model including only distance and estimating an exponent adjustment parameter γ.

We calculated the DIC comparing the three models and found that the Meta mobility model (DIC = 12,290.97) was the best-performing model67 when compared with the gravity model (DIC = 12,294.34). Both of these models performed better than distance alone (DIC = 12,424.32) (Extended Data Fig. 4).

All statistical analysis for the RR framework and the mobility model was performed in R (v.3.6.2)68.

Population-level fitness model

Overall strategy

We developed logistic growth models to fit the changing prevalence of different serotypes and assess the impact of vaccine implementation, utilizing a method that has previously been implemented for the endemic bacterium Bordetella pertussis69.

Vaccine-type model

We first binned serotypes into three groups: those not included in the vaccine (NVTs), serotypes included in PCV7 (4, 6B, 9V, 14, 18C, 19F and 23F), and additional serotypes included in PCV13 (1, 3, 5, 6A, 7F and 19A). We used the full data for this analysis. We computed the relative abundance fi,ref of each group of serotypes, i, compared with a reference type, ref.


We chose to use NVTs as the reference. Varying the reference did not affect the model, as long as the reference samples span all years. We then used a simple logistic model, assuming a constant growth rate to capture the evolution of this abundance, at each time t.

$${f}_{i,{\rm{ref}}}\left(t\right)=\frac{1}{1+\left(\frac{1-{f}_{i,{\rm{ref}},0}}{{f}_{i,{\rm{ref}},0}}\right)\exp \left({-r}_{i,{\rm{ref}}}\cdot t\right)}$$

where ri,ref is the growth rate of that abundance shared across all provinces, and fi,ref,0 is the initial relative abundance of the serotype group i with respect to a chosen ref.

To control for the varying presence of all circulating serotypes through time, we present fitness as the average relative growth rate, \(\bar{{r}_{i}}\), for each group with respect to a randomly selected group in the population:

$$\bar{{r}_{i}}=\mathop{\sum }\limits_{j\ne i}^{n}\bar{{f}_{j}}({r}_{i,{\rm{ref}}}-{r}_{j,{\rm{ref}}})$$

where n is the number of groups, and \(\bar{{f}_{j}}\) is the average absolute frequency of the group in the period of time considered.

This average relative growth rate, \(\bar{{r}_{i}}\), can be identified as the selection rate coefficient of the group in the population considered69. The selection rate coefficient is a direct measure of the fitness advantage of emerging variants and is one of the best indicators as to whether a strain will increase in frequency during an outbreak70,71.

We can further multiply the selection rate by the mean generation time (35 days) to obtain the selection coefficient per generation, which is the relative fitness advantage per transmission generation.

In the Article, we also present estimates of the relative fitness advantage of groups in particular time frames. We used the same computation as for the average relative growth rate, but tailored it to specific references. For example, to compute the relative fitness advantage of NVTs compared with VTs in the PCV era \(\Delta \overline{\,{r}_{{\rm{NVT}},{\rm{VT}},{\rm{after}}{\rm{PCV}}}}\), we computed:

$$\Delta \overline{\,{r}_{{\rm{NVT}},{\rm{VT}},{\rm{after}}{\rm{PCV}}}}=\frac{1}{\overline{\,{f}_{{\rm{PCV}}7}}+\overline{\,{f}_{{\rm{PCV}}13}}}\sum _{i{\epsilon }({\rm{PCV}}7,{\rm{PCV}}13)}\overline{{f}_{i}}\,(\overline{\,{r}_{{\rm{NVT}},{\rm{after}}{\rm{PCV}}}}-\overline{\,{r}_{i,{\rm{after}}{\rm{PCV}}}})$$

To fit the model, we used Hamiltonian Monte Carlo as implemented in R-Stan72, with stan (v.2.26.1). We used a Poisson likelihood to fit the observed proportion of sequences that were of each category and the total number of isolates in that year as an offset. The model estimated the proportion of isolates that were of each type at the start of the dataset and the fitness parameters.

Exploration of vaccine effect

To investigate whether the serotypes fitness changed across after implementation of PCV7 and PCV13, we tested a range of models. We considered a model without any shift of fitness, a model with a single shift in 2009 and a model with two growth rates represented by a shift in fitness in both 2009 and 2011. The 2009 shift pertains only to those serotypes included in PCV7 and the 2011 shift pertains to those additional 6 serotypes in PCV13. Model comparison was done using the Watanabe–Akaike information criterion (WAIC) implemented in the loo package73. Furthermore, we tested for a potential delay between implementation of PCVs and the change in fitness. Model comparison is presented in Extended Data Fig. 7.

The model performed best with an initial fitness switch in 2009 (the initial year of vaccine implementation) and another fitness switch in 2011 (the year of PCV13 implementation). We used this model in the main text.

Estimation of individual serotype fitness

Using the same framework, we also estimated the growth rate per serotype to capture whether the individual dynamics were concordant with, or deviated from, what was expected according to their respective group (NVT, PCV7 or PCV13) (Fig. 4a–c). The reference strain in this analysis was set to serotype 13, which is a NVT and for which samples span all years.

Exploration of predicted GPSC dynamics based on their serotype composition

We explored whether using the proportion of isolates that were VT versus NVT within each GPSC at the start of the study period and our fitness estimates could explain the subsequent dynamics of individual GPSCs.

To predict the dynamics of individual GPSCs based on their serotype composition, we used the same framework as described above, applied to each GPSC serotype group in our dataset. As the number of such groups is large (n = 340 GPSC serotype groups), we restricted the analysis to the GPSCs present at a minimum prevalence of 1%. This led to 26 GPSCs being considered for this analysis (representing 74.7% of the dataset), split in a total of 101 GPSC serotype groups. We then modelled the dynamics of each GPSC serotype group, estimating one starting frequency per group and using the previously estimated fitness parameters, either by VT or serotype. We also considered a model with no fitness parameters (relative fitness = 1). To assess the model performances, we computed the Akaike information criterion (AIC)74 using the average likelihood of each model and the number of parameters used in each model. We also computed the predicted GPSC dynamics over time for each model by summing all the predicted GPSC serotype group dynamics for each GPSC. As a measure of goodness of fit we used the coefficient of determination of the observed versus predicted GPSC proportions each year:

$${R}^{2}=1-\frac{{\sum }_{i}{({\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}-{\rm{p}}{\rm{r}}{\rm{e}}{\rm{d}}{\rm{i}}{\rm{c}}{\rm{t}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}})}^{2}}{{\sum }_{i}{({\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}-{\rm{m}}{\rm{e}}{\rm{a}}{\rm{n}}{\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}})}^{2}}$$

R2 is also the proportion of variation explained by the variables considered in each model.

Our model estimated a constant fitness for each group considered (VTs or individual serotypes), assuming that if a group has the highest fitness, it will eventually replace the other groups. This assumption is meaningful for VTs and serotypes, as some are directly targeted by the vaccines, a strong selective force in the population. However, the fitness of each GPSC cannot be modelled with this simple assumption as it has been shown that their fitness is inherently multifactorial, as described in the NFDS model24.


We then used the model to capture the decreasing proportion of penicillin resistance in the population. To do this, we incorporated the dynamics of VTs (PCV7 and PCV13) and NVTs in the population and the respective proportion of penicillin resistance within them. To keep the model tractable, we did not differentiate here between serotypes included in PCV7 or PCV13, instead we group them into VTs. We use the PCV7 implementation (2009) as the year of fitness shift. This parametrization marginally differed from the best model (next best) (Extended Data Fig. 7) and enabled us to keep the number of categories tractable. We also compared the single switch model to a model with no change in fitness at the time of vaccine implementation. This approach showed that the model with a switch was superior (Supplementary Fig. 17).

We used the same approach described above to model the proportion of VTs, fVT, and NVTs, fNVT. We then modelled the proportion of strains that were and were not penicillin resistant within each group, either VT, fAMR|VT or NVT, fAMR|NVT. We estimated the fitness of resistance in each.

We then fit the model to all four groups, fVX,AMR (resistant NVT, susceptible NVT, resistant VT and susceptible VT).

$${f}_{{\rm{VX}},{\rm{AMR}}}(t)={{f}_{{\rm{VX}}}(t)\cdot f}_{{\rm{AMR| VX}}}(t)$$

Where fVX is the proportion of either VT or NVT in the population and fAMR|VX is the proportion of penicillin resistance within each of those groups. We then derived the proportion of penicillin resistance overall in the population by summing the proportion of VT resistant, fVT,AMR, and NVT resistant, fNVT,AMR, strains.

Estimating effect of fitness on migration

We included the calculated average relative growth rates in the simulated branching epidemic to calculate the mean number of municipalities visited, distance travelled and probability of being in the home municipality over 5 years. We report uncertainty at one standard deviation from the mean. We repeated this incorporating the post-PCV selection coefficients for NVTs and VTs and estimated the relative increase in the number of municipalities visited for NVT serotypes compared with PCV serotypes after 2 years of transmission.

Fitness model carrying capacity sensitivity

Our fitness model assumed constant fitness over time. This specifically means that if a serotype is found to be fitter than the rest of the population, we expect it to replace the whole population after some time. However, the currently best-supported model proposes that NFDS drives pneumococcus population dynamics24, whereby lower frequency genes become more fit.

To test the effect of our assumption on our fitness estimates, we performed a sensitivity analysis. We introduced minimum and maximum carrying capacities in our logistic model. In equation (20), we let the relative frequency fi,ref go to a maximum of Kmax and in equation (21), a minimum of Kmin:

$${f}_{i,{\rm{ref}}}(t)=\frac{{K}_{\max }}{1+\left(\frac{{K}_{\max }-{f}_{i,{\rm{ref}},0}}{{f}_{i,{\rm{ref}},0}}\right)\exp ({-r}_{i,{\rm{ref}}}\cdot t)},\,{\rm{if}}\,{r}_{i,{\rm{ref}}} > 0$$
$${f}_{i,{\rm{ref}}}(t)=1-\frac{{1-K}_{\min }}{1+\left(\frac{({1-K}_{\min })-(1-{f}_{i,{\rm{ref}},0})}{(1-{f}_{i,{\rm{ref}},0})}\right)\exp ({r}_{i,{\rm{ref}}}\cdot t)},\,{\rm{if}}\,{r}_{i,{\rm{ref}}} < 0$$

with \({K}_{\min }\le {K}_{\max }\), and \({K}_{\min }\in \left[0,1\right],\,{K}_{\max }\in \left[0,1\right]\).

We considered a range of values for Kmin (0.1, 0.05 and 0) and Kmax (0.9, 0.95 and 1) (Supplementary Fig. 20). We tested the effect of this carrying capacity has on both the vaccine status model (Supplementary Fig. 18) and the AMR and vaccine status model (Supplementary Fig. 19). In each case, we compared the fitness estimates obtained. We found that the fitness estimates were robust to variable carrying capacities across those tested, which implied that our assumption does not affect this framework. However, it is important to note that this model is robust to set variable carrying capacities while being unable to estimate them, which is in contrast to NFDS. Our model assumes that there is a population replacement up to the carrying capacity and does not allow for fine-scale estimates of an equilibrium. This fitness model can be used as a quantitative descriptor of the growth of distinct groups (that is, penicillin resistance or VT serotypes in the context of a perturbation) but cannot describe the complex underlying fitness effects of changing frequencies of genes under NFDS24.

All statistical analysis for the fitness model was performed in R (v.4.0.5)68.


The study was coordinated by the GPS (, whose activities are approved by respective ethics committee in-country. Isolates sequenced in this study come from the National Institute for Communicable Diseases and the University of Witwatersrand carriage studies as previously published. These isolates were all collected as part of existing public health surveillance approved protocols in each country. No personally identifiable information was used as part of this study.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.