Introduction

Streptococcus pneumoniae (the pneumococcus) is a human-adapted, clinically significant pathogen, which continues to kill ≈400,000 children globally despite widespread use of conjugate vaccines1. Over 90 pneumococcal capsular antigens or serotypes have been characterised globally2, which vary in their ability to colonise3, invade4,5, and evolve6. In some geographic regions with a high incidence of pneumococcal carriage, nasopharyngeal colonisation with S. pneumoniae occurs within days or weeks after birth and lasts from a few days to several months, but everyone is colonised at least once during the first year of life7,8,9. Similar to other bacterial pathogens10, asymptomatic pneumococcal colonisation is an essential precursor for the development of life-threatening invasive pneumococcal diseases (IPD) such as pneumonia, septicaemia and meningitis11. Although asymptomatic pneumococcal colonisation is considered to be beneficial since it decreases the likelihood of recurrent colonisation12, the protective effects of such prior carriage are serotype-dependent and usually marginal6,13,14. As a result, it is unsurprising that extended and recurrent colonisation episodes are common, especially in children12.

Nasopharyngeal colonisation facilitates the evolution and transmission of the pneumococcus and other respiratory tract pathogens; therefore, it is a key determinant of the strain population dynamics6,13,14,15. Despite the frequent occurrence of pneumococcal colonisation, little is known regarding its within-host genomic diversity and evolution during carriage. Within-host evolution may play an important role in prolonged colonisation in addition to other risk factors such as age16, environmental and climatic conditions, population density17,18 and immunity19,20,21. Genetically, the serotype-defining surface capsular polysaccharide biosynthetic locus22 is the major determinant of pneumococcal virulence and colonisation23,24. Beyond capsule variation, there is limited understanding of the genetic diversity and evolution of pneumococcal strains within hosts, and its effect on colonisation dynamics25. Previous studies have used multi-locus sequence typing (MLST) to investigate colonisation dynamics, but this approach does not resolve microevolution patterns of the strains due to its limited discriminatory power26,27. Whole-genome sequencing studies of in vitro pneumococcal isolates have suggested that mutations in rpoE, an RNA polymerase delta subunit encoding gene, could be important for colonisation since they were associated with phenotypic changes relevant for carriage such as reduced capsule expression and increased biofilm formation, but it is unknown whether such substitutions occur during natural colonisation28. Another study has suggested that genetic variation in prophage sequences is associated with decreased colonisation duration25.
Furthermore, isolates recovered from human subjects experimentally challenged with the pneumococcus for 35 days revealed low genetic diversity: three nucleotide substitutions (one parallel) and no recombination29. However, it is unknown whether these patterns are consistent with within-host evolution dynamics during natural colonisation. Clearly, genomic variation is important for pneumococcal colonisation, as seen in other bacterial pathogens30,31,32. Therefore, understanding within-host evolution of the pneumococcus during natural colonisation could reveal genetic clues on variability of carriage between strains, which could be crucial for designing strategies to control carriage.

In this work, we investigate within-host dynamics, genomic diversity, and microevolution of pneumococcal strains during natural colonisation in new-born infants in The Gambia, Sub-Saharan Africa (SSA), a relevant setting with high IPD incidence and colonisation rates of up to ≈97% in infants <1 year old8. We undertook whole-genome sequencing of sequentially sampled isolates collected over a one-year follow-up period. Our data show high within-host strain genetic diversity during the course of colonisation episodes, which varies by host, strain type and timing of the episodes, and is driven by rapid substitution rates, real-time within-host homologous recombination and neutral evolution. Furthermore, we show evidence of parallel evolution in both genic and intergenic regions, particularly in key virulence genes essential for epithelial surface adherence, antibiotic resistance and evasion of immune responses, which suggests within-host adaptation.

Results

Colonisation dynamics of carried pneumococcal strains

We recovered S. pneumoniae from ≈79% (1232/1553) of the swabs obtained from 98 infants recruited into the infant birth-to-one year cohort in The Gambia33 (Fig. 1 and Supplementary Data 1, 2). We detected 80 serotypes and 144 sequence types (STs) among the recovered isolates. The most common serotypes were 19A (11.4%), 6A (8.74%), NT (5.71%), 15B/C (4.90%), 19F (3.85%) and 23B (4.31%) (Fig. 2a). The mean number of S. pneumoniae isolates sampled per infant was 15.85 (range: 6–17). The mean numbers of colonising serotypes and episodes per infant were 8.51 (range: 3–15) and 8.76 (range: 2–15), respectively. A single serotype caused ≈1 episode (range: 1–4) and each episode lasted ≈4.44 weeks (mean: 7.30, range: 1–48).

Fig. 1: Schematic of the study design and analysis workflow.

The newly born babies were recruited into the study at birth, and nasopharyngeal swabs were taken within the first week after birth, every two weeks until six months, and then every month until they were one year old, at which point sampling was stopped. The analysis of these longitudinal data involved fitting multi-state and other models to determine colonisation dynamics in the babies during the first year of life, and whole-genome analysis to assess the within-host genetic diversity, recombination and mutation rate of the isolates. The map of The Gambia was generated by the authors in R software using the ggmap v3.0.0 package (https://cran.r-project.org/web/packages/ggmap/). The images of the infants and adults, and the DNA sequencing machine, were created with BioRender (https://biorender.com/) with permission to publish.

Fig. 2: Characteristics and dynamics of the extended pneumococcal strains.

a Frequency of serotypes; each episode was counted once and serotypes with frequency >0.2% are shown. b An example of a colonisation profile for infant ID: 65 showing different colonisation episodes. The sampling point marked with the cross (×) represents culture-negative pneumococcal samples (uncolonised). Different types of episodes are shown in (b), namely transient colonisation, whereby a serotype was detected at a single time point; extended colonisation, which refers to an episode where the serotype was detected at multiple time points; and multiple colonisation, where overlapping episodes of different serotypes co-occurred at certain time points. c Schematic representation of the three-state multistate model showing colonised and uncolonised carriage states and the estimated transition intensities (rates) between the states. d, e Observed and expected prevalence of each colonisation state. f The inferred sojourn time (duration) in each colonisation state. The error bars represent the 95% confidence interval for the estimated mean values.

We defined transient and extended colonisation episodes as the detection of an isolate of the same serotype at a single sampling point and at consecutive sampling points, respectively (Fig. 2b). We then used multistate modelling to estimate the transition rates, prevalence and duration associated with the uncolonised and colonised carriage states from birth until 12 months. From the inferred state transition matrix, transitions from the uncolonised to the colonised state were sixfold more frequent than in the opposite direction (Fig. 2c). The equilibrium colonisation dynamics were reached ≈14 weeks from birth and showed prevalence of 11 and 89% for the uncolonised and colonised carriage states, respectively (Fig. 2d, e). Accordingly, the sojourn time (duration) in the colonised carriage state was longer (mean: 12.3 weeks, 95% CI: 9.87–15.2) than the duration in the uncolonised state (mean: 2.05 weeks, 95% CI: 1.73–2.43) (Fig. 2f).
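The relationship between the reported sojourn times, the sixfold difference in transition rates and the equilibrium prevalence can be illustrated with a small calculation. The sketch below assumes a simplified two-state (uncolonised/colonised) continuous-time Markov chain, which is not necessarily the model fitted in this study; the function name `two_state_summary` is hypothetical.

```python
def two_state_summary(sojourn_uncolonised, sojourn_colonised):
    """Summarise a two-state continuous-time Markov chain.

    Inputs are mean sojourn times (weeks). This is a simplifying
    assumption, not the fitted multistate model from the study.
    """
    # The exit rate of a state is the reciprocal of its mean sojourn time.
    q_acquire = 1.0 / sojourn_uncolonised  # uncolonised -> colonised
    q_clear = 1.0 / sojourn_colonised      # colonised -> uncolonised
    # Stationary probability of being in the colonised state.
    prev_colonised = q_acquire / (q_acquire + q_clear)
    return {
        "q_acquire": q_acquire,
        "q_clear": q_clear,
        "rate_ratio": q_acquire / q_clear,
        "prev_colonised": prev_colonised,
    }


# Mean sojourn times reported in the text: 2.05 and 12.3 weeks.
summary = two_state_summary(2.05, 12.3)
print(f"acquisition/clearance rate ratio: {summary['rate_ratio']:.2f}")
print(f"equilibrium colonised prevalence: {summary['prev_colonised']:.3f}")
```

Under this simplification the rate ratio is about 6 and the equilibrium colonised prevalence is about 0.86, which is of the same order as the reported sixfold difference and 89% prevalence; some discrepancy is expected because the fitted model is richer than this sketch.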

Within-host genetic diversity during extended episodes

Of the 1553 pneumococcal samples collected from the infants, 1074 isolates had a whole-genome sequence available and were analysed to infer within-host genetic diversity of strains during extended colonisation episodes (Supplementary Data 1, 2 and Supplementary Fig. 2). We defined the amount of genetic diversity as the number of SNPs between a pair of isolates from the same episode, i.e., with an identical serotype and ST within the same individual. The mean genetic diversity varied between serotypes, and between episodes of the same serotype within the same or different infants. Combined analysis of the genetic diversity across the colonisation episodes using the ANOVA test showed statistically significant differences for the covariates serotype (P = 0.001), ST (P < 2.2 × 10−16), and specific episode (P < 2.2 × 10−16), which suggested an interplay of both host and pathogen factors on within-host pneumococcal genetic diversity.
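The diversity measure defined above, the number of SNPs between pairs of isolates from the same episode, can be sketched as follows. This is a minimal illustration that assumes the genomes are already aligned to equal length and that ambiguous (N) and gap characters are skipped; the toy sequences and isolate labels are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count mismatching sites between two aligned sequences.

    Sites where either sequence has an ambiguous base or a gap are
    skipped (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def episode_diversity(aligned):
    """Return SNP distances for every pair of isolates in one episode."""
    return {
        (i, j): snp_distance(aligned[i], aligned[j])
        for i, j in combinations(sorted(aligned), 2)
    }


# Hypothetical toy episode: three isolates sampled at weeks 3, 5 and 7.
toy_episode = {
    "wk03": "ACGTACGTAC",
    "wk05": "ACGTACGTAT",
    "wk07": "ACNTACGAAT",
}
distances = episode_diversity(toy_episode)
mean_snps = sum(distances.values()) / len(distances)
print(distances, mean_snps)
```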

Emergence of highly divergent strain variants

We then conducted an in-depth analysis of the within-host genetic diversity of the strains in each episode. The mean number of SNPs between consecutively sampled isolates from the same episode (two weeks apart) of the same serotype and ST was 14.8 (range: 3–150), but the mean number of SNPs between all the isolates in the episodes ranged from 3 to 27.5 for different serotypes (Fig. 3 and Supplementary Fig. 2). In some episodes, an unusually high number of SNPs was detected between some isolates relative to the other isolates in the episode. For example, a serotype 19F isolate was detected in infant 33 at week 15, which was distinguished from the preceding and subsequent strains in the episode by 1177 and 1181 SNPs respectively. This exemplified the presence of multiple clones of the same strain, which may have been co-transmitted at the onset of the colonisation episode or exogenously acquired during an ongoing episode (Supplementary Table 1). However, we also identified an atypically high number of SNPs in some episodes between isolates of the same serotype and ST, which suggested the effect of evolutionary processes other than random mutation alone. These episodes were associated with serotypes 11A, 16F, 19A, 23F, 6A and 6B, and non-typeable (NT) strains, all of which are known efficient colonisers3. We hypothesised that these divergent strains emerged from their ancestral strains during colonisation via intra-episode homologous recombination, which caused rapid accumulation of genetic variation during the course of the carriage episodes.

Fig. 3: Within-host pneumococcal genetic diversity during colonisation.

The strip charts, box and violin plots show the number of SNPs calculated between isolates of the same serotype and ST within the same episode. The isolates sampled five or fewer weeks apart are coloured in light blue while those sampled more than six weeks apart are shown in darker blue. The genetic diversity of some strains was much higher than that of the rest of the strains in the episode for some serotypes, for example 11A, 16F, 19A, 23F, 6A, 6B and NT, which suggested the occurrence of evolutionary processes other than random substitution, particularly genomic recombination. The Y-axis of each plot is shown in log10 scale for clarity. The number of data points for each group is presented in the format serotype (n = n1; n2), where serotype is the capsular type, and n1 and n2 are the numbers of data points for isolates not sampled within and sampled within six weeks of each other, respectively: 10A (n = 19;39), 11A (n = 17;25), 11B (n = 1;0), 12F (n = 7;2), 13 (n = 17;29), 14 (n = 40;31), 15A (n = 17;17), 15B/C (n = 25;35), 16F (n = 10;12), 17F (n = 4;1), 18A (n = 15;21), 18C (n = 7;3), 19A (n = 78;112), 19F (n = 14;7), 20 (n = 17;38), 21 (n = 26;21), 22A (n = 5;11), 23A (n = 10;2), 23B (n = 43;32), 23F (n = 15;43), 28F (n = 3;0), 34 (n = 26;60), 35B (n = 25;49), 38 (n = 9;0), 39 (n = 12;12), 4 (n = 6;5), 40 (n = 6;6), 48 (n = 6;9), 6A (n = 76;102), 6B/E (n = 63;108), 7F (n = 3;3), 8 (n = 1;0), 9L (n = 14;25), 9V (n = 10;12) and NT (n = 5;0).

Homologous recombination is the major driver of evolution in bacterial pathogens34. To identify or rule out the occurrence of recombination, we aligned the genomes of the isolates from each episode to assess whether we could identify genomic regions with a high density of SNPs, a well-known signature of recombination6. We analysed genomes from 116 extended episodes, which had >3 sequenced isolates of the same serotype and ST, and we found evidence for the occurrence of within-host recombination during 42 (36.2%) episodes. In these episodes, the divergent strain was similar to the oldest sequenced genome in the episode, i.e., the reference isolate, but it contained additional SNPs acquired from external DNA via recombination, which distinguished it from the rest of the isolates in the episode. Genome-wide analysis showed that the recombinant strains typically acquired a single recombination block (range: 1–6) (Table 1). Examples of episodes with evidence of intra-episode recombination were episodes INF57:11A:1 and INF26:23F:1 (Fig. 4 and Supplementary Fig. 3). Episode INF57:11A:1 was caused by serotype 11A (ST11691) carried from week 3 to 19 in infant #57. We detected two recombination blocks during this episode at week 15, which were ≈36.1 Kb (location: 1,487,800–1,523,861 bp) and 25 bp (location: 1,722,073–1,722,097 bp) in size and introduced 169 and 4 SNPs respectively. The episode INF26:23F:1 was due to a serotype 23F (ST2174) strain, which colonised infant #26 from week 7 to week 35 after birth and acquired a single recombination block before week 11. This recombination block was ≈18.2 Kb in size and imported 150 SNPs (location: 1,752,957–1,771,123 bp), and it was detected at week 11 and week 17.
This episode highlighted rare persistence of the strain that underwent recombination whereby the recombinant strain survived and co-existed with the ancestral wild-type strain for at least 4 weeks (week 11–17) but it was later displaced permanently by the wild-type strain from week 19 until clearance of the serotype at week 35. In other episodes, strains that underwent recombination were only detected at a single sampling point, which implied rapid clearance of the recombinant strains, which could reflect intense within-strain competition strongly favouring the ancestral wild-type strain; therefore, limiting opportunities for transmission and spread of the divergent strains in the population.
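The signature used above, a localised spike in SNP density relative to the reference isolate, can be illustrated with a simple windowed count. This is a heuristic sketch only and not the recombination detection method used in the study; the 1000 bp window follows the Fig. 4 caption, whereas the `min_snps` threshold and the toy positions are assumptions.

```python
def snp_density(positions, genome_length, window=1000):
    """Count SNPs in consecutive non-overlapping windows (1-based positions)."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        counts[(pos - 1) // window] += 1
    return counts


def dense_windows(counts, window=1000, min_snps=3):
    """Return (start, end) coordinates of windows with at least min_snps SNPs.

    The threshold is an arbitrary illustrative choice; dedicated tools use
    formal statistical tests instead.
    """
    return [
        (i * window + 1, (i + 1) * window)
        for i, count in enumerate(counts)
        if count >= min_snps
    ]


# Hypothetical 10 kb genome with a cluster of SNPs between 4 and 5 kb.
toy_positions = [120, 4010, 4100, 4350, 4800, 4990, 9000]
counts = snp_density(toy_positions, genome_length=10_000)
candidates = dense_windows(counts)
print(counts)
print(candidates)
```

In this toy example the isolated SNPs at positions 120 and 9000 look like background substitutions, while the five SNPs clustered in one window are flagged as a candidate recombination block.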

Table 1 Episodes with high intra-episode recombination rate during natural colonisation.
Fig. 4: Within-host homologous recombination during colonisation.

a, b Two examples of colonisation episodes, namely INF57:11A:1 and INF26:23F:1 respectively, where recombination blocks were detected. The episode name is shown in the format A:B:C, where A, B and C represent the infant ID, serotype and number of episodes with the serotype respectively. (I) Colonisation episode showing the time points at which the serotype in the episode was detected. Some or all of the detected samples were sequenced. In episode INF57:11A:1, serotype 11A was detected from week 3 to 17. A recombination block was detected at week 13 but the recombinant strain did not persist until the next sampling time at week 17. In episode INF26:23F:1, serotype 23F was detected from week 7 to week 35. A recombination block was first detected at week 11 but it persisted, and the recombinant strain was sampled again at week 17. (II) Distribution of SNPs across the genome of the serotype 11A and 23F strains in episodes INF57:11A:1 and INF26:23F:1 respectively. The coloured line (red) shows the occurrence of a SNP in the strain using the first sequenced genome in the episode as the reference or ancestral strain. The SNPs are enhanced for clarity. (III) A multiple sequence alignment showing the location of the SNPs and visual evidence of the emergence of a recombinant strain within the episode. The value for r/m represents the number of SNPs within recombination blocks relative to SNPs outside the blocks. (IV) The distribution of the SNPs is highlighted by the frequency polygon, generated using a window size of 1000 bp, which shows spikes in the SNP density across the recombinogenic regions.

Multiple isolates of the same serotype but different STs were also detected in some episodes. Such co-existence of highly divergent isolates with the same serotype but different STs occurred during 14 episodes (Supplementary Table 1). The majority of these isolates were distinguishable from the isolates with non-identical STs by >450 SNPs distributed over the entire genome. This clearly suggested that these co-existing strains did not emerge via recombination blocks spanning the housekeeping genes, which could have altered the alleles used to define the STs via MLST35. It is likely that such strains emerged through either co-transmission of both strains in the infecting inoculum at the onset of the colonisation episodes or independent acquisition of some strains during ongoing episodes. However, three episodes contained co-existing strains differing by <29 SNPs, for which it is plausible that they emerged via random mutation or recombination across the ST-defining genes during the episodes.

Frequency, rates and hotspots of intra-episode recombination

We then assessed the overall contribution of recombination to within-host pneumococcal diversity during the episodes with >2 sequenced isolates of the same serotype and ST (Table 1 and Supplementary Data 3). The mean number of recombination blocks per episode was ≈1 (range: 1–6) while the number of SNPs within each recombination block was 32 (range: 4–1063). We then assessed the ratio of SNPs imported via recombination relative to random substitutions (r/m) and of total recombination blocks relative to random substitutions (ρ/θ), which are widely used statistics for quantifying the contribution of recombination to genomic diversification36. The r/m and ρ/θ averaged across all phylogenetic branches where recombination had occurred were 3.49 (range: 0.19–88.58) and 0.17 (range: 0.04–1) respectively. Although the recombination blocks were associated with genes encoding functionally diverse proteins, the majority of the blocks were found in the psrP gene, which encodes a surface protein and is a known hotspot for recombination in the pneumococcus13 (Supplementary Fig. 4 and Supplementary Data 4). Other less frequent hotspots were associated with bacteriocins, phage DNA, zinc metalloprotease (zmpA), autolysin and hypothetical genes.
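As an illustration of the r/m statistic defined above (SNPs inside recombination blocks relative to SNPs outside them), the following sketch computes block lengths and r/m for one episode. The block coordinates are those reported for episode INF57:11A:1; the SNP positions are hypothetical, and the helper names are not taken from any published tool.

```python
def block_length(start, end):
    """Length in bp of a block given inclusive 1-based coordinates."""
    return end - start + 1


def r_over_m(snp_positions, blocks):
    """Ratio of SNPs inside recombination blocks to SNPs outside them."""
    inside = sum(
        any(start <= pos <= end for start, end in blocks)
        for pos in snp_positions
    )
    outside = len(snp_positions) - inside
    if outside == 0:
        raise ValueError("r/m is undefined when no SNPs fall outside blocks")
    return inside / outside


# Block coordinates reported in the text for episode INF57:11A:1.
blocks = [(1_487_800, 1_523_861), (1_722_073, 1_722_097)]
lengths = [block_length(start, end) for start, end in blocks]

# Hypothetical SNP positions: three inside the blocks, two outside.
toy_snps = [10, 500_000, 1_487_900, 1_500_000, 1_722_080]
ratio = r_over_m(toy_snps, blocks)
print(lengths, ratio)
```

The computed lengths (36,062 bp and 25 bp) agree with the ≈36.1 Kb and 25 bp block sizes reported for this episode.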

Within-host substitution rates and population sizes

We then used 60 extended episodes with >4 sequenced genomes to infer within-host substitution rates. We estimated the number of accrued substitutions and the amount of time taken to accumulate the substitutions in each episode using the onset strain of the episode as the baseline. To assess whether the accumulation of substitutions was time-dependent, or consistent with molecular-clock evolution, we fitted a linear regression model of the number of accrued substitutions against the corresponding time (Fig. 5). We detected a strong molecular clock-like pattern in few episodes (9/60), while substitutions did not accumulate linearly in the rest of the episodes, which was indicative of either non-constant appearance and disappearance of substitutions or the presence of a cloud of within-host genetic diversity within each swab, which masked the clock-like signals37. With the exception of two episodes of serotype 19A belonging to ST10542 and ST4029 in infants #55 and #76 respectively, whose within-host substitution rates (μ) were 2.93 × 10−06 SNPs site−1 year−1 and 3.81 × 10−06 SNPs site−1 year−1 respectively, similar to the rate measured over longer timescales (1.57 × 10−6 SNPs site−1 year−1)13, the other eight episodes showed higher within-host μ ranging from 1.00 × 10−05 to 6.46 × 10−05 SNPs site−1 year−1 (Table 2). Such within-host μ resulted in the introduction of up to ≈41-fold more substitutions than would have been introduced via μ estimated over longer timescales in S. pneumoniae13 and other bacterial species38. The within-episode effective population size (Ne) ranged from 1.22 to 72.2, similar to values observed during short-term within-host Neisseria lactamica evolution39.
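The conversion from a regression slope (SNPs per week) to a substitution rate in SNPs site−1 year−1 can be sketched as below. This is a minimal illustration using ordinary least squares on hypothetical data; the genome length of 2,000,000 bp is an assumed round number for the example, not a value taken from the study.

```python
WEEKS_PER_YEAR = 365.25 / 7


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def substitution_rate(weeks, accrued_snps, genome_length):
    """Convert the slope (SNPs per week) to SNPs per site per year."""
    beta = ols_slope(weeks, accrued_snps)
    return beta * WEEKS_PER_YEAR / genome_length


# Hypothetical episode: SNPs accrued relative to the onset isolate.
weeks = [0, 2, 4, 6, 8]
accrued = [0, 1, 2, 3, 4]
beta = ols_slope(weeks, accrued)
mu = substitution_rate(weeks, accrued, genome_length=2_000_000)
print(f"beta = {beta:.2f} SNPs/week, mu = {mu:.2e} SNPs/site/year")

# Fold difference between the highest within-host rate and the
# long-term rate, both as reported in the text.
fold = 6.46e-05 / 1.57e-06
print(f"fold difference = {fold:.1f}")
```

The last calculation reproduces the ≈41-fold difference between the highest within-host rate and the rate estimated over longer timescales.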

Fig. 5: Within-host mutation rates during natural colonisation.

Episodes where a molecular-clock signal was evident were analysed. Serotypes with >4 sequenced genomes per individual were included in the analysis. The episode name is shown in the format A:B:C, where A, B and C represent the infant ID, serotype and number of episodes with the serotype respectively. The linear relationship between the number of accrued SNPs, relative to the reference genome sequenced at the onset of the episode, and time was assessed using linear regression. The nucleotide substitution rate (μ) corresponds to the estimated number of SNPs site−1 year−1 based on the regression coefficient (β). The units of β are SNPs per week. The shaded area surrounding the fitted linear regression line represents the 95% confidence interval based on the standard error of the mean slope of the regression line. The values of the substitution rates expressed as SNPs site−1 year−1 are shown in Table 2.

Table 2 Within-host nucleotide substitution rates during natural colonisation.

Parallel evolution in coding and non-coding regions

The probability of a parallel SNP occurring at any random location in the pneumococcal genome is extremely low (≈2.46 × 10−12 within a year and ≈9.07 × 10−16 within a week), which implies that the occurrence of such mutations reflects adaptive evolution. Since S. pneumoniae is a long-term human-adapted pathogen, we postulated that de novo parallel evolution would be uncommon since the adaptive genomic changes would already exist in the population as standing genetic variation. To test this hypothesis, we investigated the occurrence of de novo SNPs during the course of extended episodes in which >3 sampled isolates were sequenced. We excluded SNP positions with an ambiguous DNA character (N) to avoid including SNPs from genomic regions which were potentially difficult to align properly. We identified 2523 SNP locations during 449 unique extended colonisation episodes that satisfied our analysis criteria. Of these SNPs, 2326 and 197 were non-parallel and parallel respectively. We detected 77 parallel genic and 120 parallel intergenic SNPs (Fig. 6a–b and Supplementary Data 4, 5). Overall, the parallel intergenic SNPs were shared between more episodes than the genic SNPs (P < 2.95 × 10−08, Kruskal–Wallis test) (Fig. 6c and Supplementary Fig. 5). Nineteen intergenic parallel SNPs occurred in at least 10 episodes, six of which appeared in >40 episodes including one in 75 episodes (Fig. 6a, b and Supplementary Data 5, 6). Comparatively, although more SNPs overall were found in coding than in non-coding regions (P < 2.2 × 10−16, Fisher’s Exact test), the proportion of parallel SNPs was lower in coding than in intergenic regions (P < 2.2 × 10−16, Fisher’s Exact test) (Fig. 6d, e).
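The distinction between parallel and non-parallel SNPs can be made concrete with a short sketch. It assumes that a SNP is called parallel when the same genomic position is mutated in at least two distinct episodes, which is a working definition for this illustration; the episode identifiers and positions are hypothetical.

```python
from collections import defaultdict


def classify_parallel(episode_snps, min_episodes=2):
    """Split SNP positions into parallel and non-parallel sets.

    episode_snps maps an episode ID to the set of de novo SNP positions
    observed in that episode. A position is treated as parallel if it
    occurs in at least min_episodes episodes (an assumed definition).
    """
    episodes_by_position = defaultdict(set)
    for episode, positions in episode_snps.items():
        for pos in positions:
            episodes_by_position[pos].add(episode)
    parallel = {
        pos: sorted(eps)
        for pos, eps in episodes_by_position.items()
        if len(eps) >= min_episodes
    }
    non_parallel = sorted(
        pos for pos, eps in episodes_by_position.items()
        if len(eps) < min_episodes
    )
    return parallel, non_parallel


# Hypothetical episodes named in the A:B:C format used in this paper.
toy = {
    "INF01:19A:1": {100, 250},
    "INF02:6A:1": {250, 900},
    "INF03:19A:1": {42, 100, 250},
}
parallel, non_parallel = classify_parallel(toy)
print(parallel)
print(non_parallel)
```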

Fig. 6: Parallel genic and intergenic SNPs identified during colonisation.

a Bar plot showing coding or genic regions containing synonymous (red) and non-synonymous (blue) SNPs in the genome. b Bar plot similar to (a) but showing genomic regions with intergenic SNPs. c The number of episodes containing a genic or intergenic SNP. d Bar plot showing number of episodes containing a genic and intergenic SNP. e Proportion of episodes with parallel SNPs (dark blue) in genic and intergenic SNPs. f Number of episodes with synonymous and non-synonymous amino acid change in coding regions. g Number of colonisation episodes with a change at each codon position. h Carriage duration of episodes with parallel and non-parallel SNPs. The letters N, S and I stand for non-synonymous, synonymous and intergenic SNPs respectively. The number of data points for each group were as follows: N and non-parallel (n = 927), S and non-parallel (n = 1088), I and non-parallel (n = 311), N and parallel (n = 297), S and parallel (n = 228), and I and parallel (n = 790). i Functional classification of genes with parallel SNPs. Only episodes with >3 sequenced genomes were included in the analysis. The statistical significance is shown by the number of asterisks as follows: **P < 0.01, ***P < 0.001.

The most common parallel genic SNPs occurred in genes encoding the penicillin-binding protein pbpX (75 episodes), an iron transporter (32), the LPxTG cell-wall-anchored protein psrP (21) and the lactose-specific phosphotransferase system (PTS) protein lacE2 (Fig. 6a, b and Supplementary Data 6). Other less common parallel genic SNPs were identified in dihydropteroate synthase folP (5 episodes), capsule biosynthesis wzx (6), zinc metalloprotease genes zmpA (4) and zmpD (7), Dps-like peroxide resistance protein dpr (6), bacteriocin blpL (4) and several hypothetical proteins. We assumed a null hypothesis that the frequency of mutations was identical at all codon positions. Statistical analysis showed that SNPs at the second codon position were less frequent (P = 1.01 × 10−11, Proportions Z-test) while those at the third position were more frequent than expected under the null or neutral hypothesis (P = 3.60 × 10−11, Proportions Z-test) (Fig. 6g). No significant deviation was detected at the first codon position. Despite the low frequency of SNPs at the second codon position, non-synonymous SNPs occurred more frequently than synonymous SNPs (P = 0.03, Proportions Z-test) (Fig. 6f). Surprisingly, the carriage duration of the episodes with parallel SNPs was shorter than that of episodes with non-parallel SNPs for intergenic (P < 2.2 × 10−16, Kruskal–Wallis test), synonymous (P < 1.16 × 10−15, Kruskal–Wallis test) and non-synonymous genic mutations (P < 2.2 × 10−16, Kruskal–Wallis test) (Fig. 6h). Comparison of the carriage duration associated with the wild-type (ancestral) and evolved (parallel) alleles at individual SNPs suggested that some parallel SNPs, although few, were more likely to be associated with longer carriage than the wild-type allele, reflecting a beneficial effect on carriage.
These include SNPs at positions 38906, 702153, 225187, 1395631, 1546314-15, 1619615, 1763592, 190783 and 2131768-9 in intergenic regions, and genic SNPs at positions 145748, 1790562, 293764, 265020, 562300, 615248, 813146, 1713629, and 1525760 (Supplementary Figs. 6, 7). Interestingly, functional analysis suggested that the majority of the parallel mutations were in genes associated with surface-exposed, envelope biogenesis and membrane proteins (Fig. 6i). Further analysis comparing the timing of the occurrence of the parallel SNPs in each episode revealed that the parallel SNPs typically occurred early after the onset of the carriage episode and were mostly propagated throughout the episode (Fig. 7a–c).
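The codon-position comparison described in this section tests whether the observed share of SNPs at each codon position departs from the one-third expected if all positions mutate equally often. A minimal sketch of a one-sample proportions Z-test is given below; the exact test configuration used in the study is not stated, so this formulation is an assumption, and the counts are hypothetical.

```python
import math


def one_sample_prop_ztest(successes, total, p0):
    """Two-sided one-sample Z-test for a proportion against p0.

    Uses the standard error under the null hypothesis. This is one common
    formulation and may differ from the test used in the study.
    """
    p_hat = successes / total
    se = math.sqrt(p0 * (1 - p0) / total)
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical SNP counts at codon positions 1, 2 and 3.
codon_counts = {1: 300, 2: 200, 3: 400}
total_snps = sum(codon_counts.values())
results = {
    pos: one_sample_prop_ztest(count, total_snps, 1 / 3)
    for pos, count in codon_counts.items()
}
for pos, (z, p) in results.items():
    print(f"codon position {pos}: z = {z:.2f}, P = {p:.3g}")
```

In this toy example the second position is depleted (negative z) and the third is enriched (positive z), mirroring the direction of the deviations reported above.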

Fig. 7: Timing and duration of parallel mutation during natural colonisation.

Type of parallel SNP is shown by different panels in the figure as follows; a non-synonymous, b synonymous, and c intergenic. The estimates were calculated for each extended colonisation episode with >3 sequenced isolates. The parallel SNPs coloured in orange were propagated throughout the episode after occurrence while those coloured in dark blue did not persist over the entire episode.

Frequently mutated genes and natural selection

We assessed the frequency of SNPs and compared the ratio of non-synonymous to synonymous SNPs in the genes mutated during extended colonisation episodes. The highest numbers of SNPs were found in the infB, blpH, hasC, psrP and SPN23F_18240 genes, which encode translation initiation factor IF-2, serine histidine kinase, UTP-glucose-1-phosphate uridylyltransferase, a cell wall surface anchored protein and a hypothetical protein respectively (Fig. 8a and Supplementary Data 7). To account for variability in the length of genes, we transformed the raw SNP counts to generate a normalised number of SNPs per kilobase pair (Kb). The normalised estimates showed that genes encoding a UTP-glucose-1-phosphate uridylyltransferase (hasC), bacteriocins (blpL, blpH, blpZ and blpR), immunity (pncG) and hypothetical proteins (SPN23F_18220, SPN23F_18240, SPN23F_21180 and SPN23F_04920) had the highest density of SNPs (Fig. 8a and Supplementary Data 8). We then used the ratio of the normalised number of non-synonymous to synonymous SNPs (dN/dS) to investigate natural selection in the genes (Fig. 8a and Supplementary Data 8). The majority of the genes (461/592) evolved neutrally (1/3 < dN/dS < 3) but 131 genes showed some evidence of either positive or negative selection. Of the putatively selected genes, 96 genes showed dN/dS > 3 while 35 genes had dN/dS < 1/3, which implied that positively selected genes were almost threefold more common than those under negative selection.
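The length normalisation and the dN/dS thresholds described above can be sketched as follows. The thresholds (dN/dS > 3 and dN/dS < 1/3) are those stated in the text; the function names, the handling of genes without synonymous SNPs, and the toy counts are assumptions made for illustration.

```python
def snps_per_kb(n_snps, gene_length_bp):
    """Normalise a raw SNP count by gene length in kilobase pairs."""
    return n_snps / (gene_length_bp / 1000)


def classify_selection(dn, ds, upper=3.0, lower=1 / 3):
    """Classify a gene from normalised non-synonymous (dn) and synonymous (ds) counts.

    Genes without synonymous SNPs are reported as 'undefined' here; how
    such genes were handled in the study is not stated.
    """
    if ds == 0:
        return "undefined"
    ratio = dn / ds
    if ratio > upper:
        return "positive"
    if ratio < lower:
        return "negative"
    return "neutral"


# Hypothetical genes: (non-synonymous SNPs, synonymous SNPs, length in bp).
toy_genes = {
    "geneA": (8, 2, 2000),
    "geneB": (1, 5, 1500),
    "geneC": (3, 3, 900),
}
calls = {
    name: classify_selection(snps_per_kb(n, length), snps_per_kb(s, length))
    for name, (n, s, length) in toy_genes.items()
}
print(calls)

# Consistency check on the gene counts reported in the text.
ratio_pos_neg = 96 / 35
print(f"positive:negative = {ratio_pos_neg:.2f}")
```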

Fig. 8: Highly mutated genes during natural colonisation.

a Normalised and unnormalised number of SNPs detected in each gene during colonisation episodes. Normalisation was done by estimating the number of SNPs per kilobase pair (Kb). b Normalised number of synonymous and non-synonymous SNPs per Kb in each gene.

Discussion

Our findings provide compelling evidence that within-host genetic diversification of pneumococcal strains is rapid and adaptive during extended natural colonisation. Since our study was conducted in an African setting, where carriage rates in infants <1 year old, ranging from 72 to 97%, are among the highest globally1,8, our findings provide a better reflection of the genetic diversity of the carried pneumococcal strains in naturally colonised hosts. In these hosts, the infecting inoculum is likely to be more diverse than that seen during experimental human challenge experiments in the UK29, which could contribute to the differences in carriage rates between our study setting (≈89%) and the UK (<10%)40. The observed high within-host diversity appears to be driven by rapid mutation rates and a limited effect of purifying selection; therefore, neutral evolution (drift) is predominant. We also noted that the amount of within-host genetic diversity varied between individuals, serotypes and STs, and episodes, which suggests the collective importance of both the strain and the host, and their interactions, for within-host microevolution of S. pneumoniae41. Furthermore, we show the occurrence of real-time within-host pneumococcal recombination as the main mechanism through which divergent strain variants emerge from their parental strains during colonisation. However, other divergent strains were due to acquisition of multiple strains during the course of an episode or co-transmission at the onset of the episodes. Crucially, we found evidence of parallel evolution, whereby the parallel mutations typically occurred early after the onset of a carriage episode and persisted throughout the episode.
Functional analysis revealed that the parallel mutations were predominantly associated with genes encoding cell wall, envelope biogenesis and membrane-associated proteins, some of which have been previously shown to promote pneumococcal attachment to epithelial surfaces and evasion of the immune responses, and may therefore promote efficient and extended colonisation.

The average pairwise genetic distance between isolates sampled from the same host during extended natural colonisation was higher than would be expected assuming the μ inferred from isolates sampled over long timescales42. This pointed to a high μ and possibly weak purifying selection; purifying selection removes deleterious substitutions, thereby decreasing μ over longer timescales than those considered in our study43. However, the fact that we were only able to detect significant evidence of molecular clock-like evolution in ≈20% of the episodes suggests either non-linear accrual of substitutions or an obscured temporal signal due to the presence of a cloud of diversity within the samples in the majority of the extended episodes. In the episodes with clock-like evolution, where μ could be estimated, the majority of the values (1.00 × 10−05 to 6.46 × 10−05 SNPs site−1 year−1) were higher than those estimated over longer timescales in the pneumococcus (1.57 × 10−6 SNPs site−1 year−1)43. These substitution rates correspond to a within-host μ up to ≈41 times faster than the μ inferred over longer timescales in S. pneumoniae6,13,42 and other bacterial species38. These findings clearly show that pneumococcal evolution is rapid during short-term colonisation, reflecting weak purifying selection and possibly early host adaptation in order to successfully establish extended colonisation. The observed high within-host μ in S. pneumoniae is similar to the estimates inferred during the first 30 days of the acute phase of Helicobacter pylori infection (8.1 × 10−5 SNPs site−1 year−1)44 and experimental human carriage of N. lactamica (1.45 × 10−5 SNPs site−1 year−1)39. Indeed, the within-host mutation burst during acute H. pylori infection44 is triggered by the inflammatory immune response and weak purifying selection43. We found variably low Ne (1–72), which suggests a stronger selective bottleneck following transmission and/or growth limitation due to immune-mediated clearance, which can limit within-host selection45.
These patterns are indicative of weak purifying selection and a predominance of neutral evolution.

Strain interactions are vital for pneumococcal colonisation41. Our results show that extended colonisation is driven by a single dominant strain but <10% of the episodes contained highly divergent strain variants. In-depth analysis of the SNP distribution across the genomes of strains in episodes with the highly divergent strains revealed evidence of rare homologous recombination during ongoing episodes, which is compatible with the genomic plasticity of the pneumococcal genomes13,46. Consistent with the uncommon occurrence of recombination within the episodes described at population level13, on average a single recombination block was detected during the course of some episodes but these typically involved shorter genomic regions, which are less likely to result in major phenotypic changes such as capsule switching. The majority of the recombination blocks were located in psrP, which encodes a surface-exposed serine-rich protein and is a known hotspot for recombination in the pneumococcus13. The overall r/m values averaged across genomic regions where recombination occurred ranged from low (≈1) to high values (≈143), which suggests that recombination blocks rarely occur more than once during a single colonisation episode. With the exception of one episode whereby the recombinant strain outcompeted the ancestral wild-type strain for 4 weeks before being replaced by the wild-type strains, the majority of the divergent recombinant strains were primarily detected at a single time point. Such short survival times of the recombinant strains could imply strong competition with the wild-type strains. Therefore, we hypothesise that such rapid clearance of the recombinant strains could be a mechanism for limiting the spread of novel divergent strains arising due to recombination, which preserves the population structure. 
The observed presence of other divergent strains with no evidence of recombination during the episodes reflects either co-transmission of multiple variants in the infecting inoculum from another host or additional acquisitions during the episode. Our study was not designed to establish whether both scenarios are equiprobable, but this will be addressed in follow-up studies. Nevertheless, the presence of multiple divergent strains and the well-known multi-serotype carriage47 signposts diversifying selection favouring co-existence of strain variants, as observed in Burkholderia dolosa32, Pseudomonas aeruginosa48 and Staphylococcus aureus49. Since we predominantly sequenced single colonies, we may have failed to capture the temporal dynamics of co-colonising strains, especially those present at low frequency. Therefore, follow-up studies sequencing either multiple colonies or, better yet, the entire culture at high read depth will be required to fully unravel within-sample genetic diversity and the temporal dynamics of the wild-type and recombinant strain variants50.

Our results suggest that within-host evolution is adaptive since the occurrence of parallel mutations is unlikely to be due to chance alone39,51,52. We showed that, relative to non-parallel SNPs, parallel SNPs are more common in intergenic than in genic regions, which could suggest that non-coding regions are less evolutionarily constrained than coding regions, where substitutions may be more deleterious and hence more likely to be selected against. Such parallel intergenic variation may promote colonisation by regulating gene expression. The parallel SNPs occurred at high frequency in the pbpX gene, which confers resistance to penicillin53. Considering the lack of strict regulation of antibiotics in African settings, the high occurrence of substitutions in pbpX could reflect high background antibiotic selection pressure. A recent study has shown that another penicillin-binding protein (pbp1b), which does not directly confer penicillin resistance but prolongs the killing time, increases the risk for pneumococcal meningitis54. Therefore, it is plausible that the parallel SNPs in pbpX may also have additional functions in promoting colonisation beyond their role in antibiotic resistance. We also detected other parallel SNPs, at lower frequency than in pbpX, in the psrP gene, which encodes a surface-exposed adhesin that is important for epithelial attachment and biofilm formation55 and has been associated with extended colonisation56. Other parallel SNPs were found in genes encoding iron transporters and the lactose-specific phosphotransferase system protein (lacE2), which collectively play a role in nutrient uptake, while the SNPs associated with capsule biosynthesis proteins (wzx) could affect mucosal adherence by altering capsule expression, leading to exposure of cell-surface adhesins57, and immune evasion by inhibiting complement activity and phagocytosis24,58. 
The other less common parallel SNPs were associated with dihydropteroate synthase (folP), zinc metalloproteases (zmpA and zmpD), and bacteriocin (blpL) genes, which play roles in epithelial adherence and resistance to opsonophagocytic killing59,60,61, resistance to trimethoprim antibiotic62, cleavage of human immunoglobulin A1 (IgA1)63, and modulating competition between bacterial strains and species64 respectively. Although we did not identify parallel SNPs in the DNA-directed RNA polymerase delta subunit protein gene (rpoE) previously identified in in vitro studies, this may reflect differences in evolution between in vitro experiments and natural human carriage28. There is also a possibility that such SNPs already exist in the population as standing variation and, as a result, rarely arise within hosts during carriage episodes. The infrequent occurrence of mutations in the second codon position, at which substitutions change the amino acid and which is the most evolutionarily constrained position65, suggests the impact of purifying selection. However, although non-synonymous mutations were more common, episodes with parallel mutations were, surprisingly, not necessarily longer than those with only non-parallel SNPs. This may suggest that the majority of the parallel mutations did not lead to longer carriage duration; however, some SNPs clearly showed longer duration relative to the ancestral mutations. Furthermore, the frequent occurrence of the parallel mutations early in the episodes and their persistence throughout the episode suggest that the parallel SNPs could be beneficial for carriage. Our approach focused only on detecting core rather than strain-specific accessory genomic changes within hosts; therefore, follow-up studies are needed to characterise genetic variation in the accessory genome. Altogether, our findings provide evidence of continual adaptive within-host evolution of S. pneumoniae during extended carriage, which may promote colonisation through host immune evasion, resistance to antibiotics, efficient nutrient uptake and epithelial surface adherence, and adept competition and coexistence with other strains and nasopharyngeal commensals.

Our findings show rapid within-host microevolution of S. pneumoniae during natural extended colonisation in asymptomatic human hosts, with evidence of adaptation through parallel mutations in intergenic and genic regions associated with immune evasion and epithelial adherence proteins, which may promote efficient and prolonged colonisation. Our findings enhance our understanding of within-host pneumococcal evolution during natural colonisation and provide a framework for discovering novel genomic changes and pathogenicity genes important for extended colonisation, which will be validated in future experiments. Such experiments will inform the design of evidence-based clinical interventions such as anti-adherence and anti-virulence agents, which can attenuate extended colonisation and therefore decrease the likelihood of within-host occurrence of invasive-disease-predisposing mutations66,67. Hence, by impeding pneumococcal progression to disease without completely eradicating asymptomatic carriage, these interventions will avert significant upheaval of the nasopharyngeal niche, thus minimising the risk of overgrowth of as-yet-unknown highly virulent but profoundly suppressed pathogens capable of inhabiting the nasopharyngeal niche previously occupied by the eliminated pneumococcal species.

Methods

Sample collection

One thousand five hundred and fifty-three nasopharyngeal swabs were collected from 98 infants from 21 villages in rural areas via the Sibanor Nasopharyngeal Microbiome study in the Gambia, West Africa, between November 2008 and April 200933 (Supplementary Data 1). Participants were recruited on a roll-in basis starting when a new birth in each village was reported to the study liaison by a community contact. Written informed consent was obtained from the parents and guardians before the infants were enrolled in the study. Nasopharyngeal swabs were taken from the recruited infants bi-weekly from the first week after birth to 6 months (weeks 1, 3, 5 until 27) and then bi-monthly afterward until 12 months (weeks 35, 43 and 52). The nasopharyngeal swab (NPS) specimens were stored in skim milk–tryptone-glucose glycerol medium at −80 °C within 8 h of collection. For the isolation of S. pneumoniae, broth enrichment of the NPS samples using 5 mL of Todd-Hewitt broth (Oxoid, Basingstoke, UK) containing 5% yeast extract with 1 mL of rabbit serum (TCS Biosciences Ltd, Botolph Claydon, UK) was performed as described elsewhere8. Pneumococci were identified by their colony morphology and optochin sensitivity. Sterile saline suspensions of gentamicin blood agar pneumococcal plate sweeps were then used for serotyping by latex agglutination, which can detect multiple serotypes68. Latex agglutination was performed using capsular and factor-typing sera (Statens Serum Institut, Copenhagen, Denmark)69. A single isolate was selected from each NPS sample and prepared for whole-genome sequencing. The Medical Research Council (MRC) Unit, The Gambia Joint Ethics Committee and the Gambian Government approved the study (approval number: SCC1108).

Multistate modelling of colonisation dynamics

To investigate colonisation dynamics of the strains, we defined a multi-state model with two intermittently observed states: colonised and uncolonised. The uncolonised state referred to a swab that yielded no pneumococcal isolates. We defined a colonisation episode as detection of the same serotype from acquisition to clearance of the serotype, similar to Turner et al.7. We considered acquisition of a serotype to occur at either first acquisition or re-acquisition after clearance. For samples collected up to week 27, clearance was defined as the observation of two consecutive cultures negative for the serotype, while for those collected after week 27, clearance was considered to occur when a single culture-negative sample for the serotype was detected (Supplementary Fig. 1 and Supplementary Data 2). Episodes were considered transient when the serotype was detected at a single sampling point and extended when the same serotype was detected at >1 sampling point. Due to the detection of multiple serotypes at some sampling points, some episodes for different serotypes overlapped (Supplementary Fig. 1). The multi-state model was fitted using the msm v1.6.7 package70 with Nelder-Mead optimisation in R v3.5.3 (R Core Team, 2020).
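The episode definition above can be sketched in a few lines of Python. This is an illustrative reading of the stated rules, not the code used in the study: the function names, the input layout (one infant, one serotype, swabs sorted by week) and the choice to close an episode that is still ongoing at the last swab are all assumptions.

```python
from typing import List, Tuple

BIWEEKLY_CUTOFF_WEEK = 27  # from the text: two consecutive negatives needed up to week 27


def split_episodes(samples: List[Tuple[int, bool]]) -> List[List[int]]:
    """Group swabs for one infant and one serotype into colonisation episodes.

    `samples` holds (week, serotype_detected) pairs sorted by week.
    Returns, per episode, the weeks at which the serotype was detected.
    """
    episodes: List[List[int]] = []
    current: List[int] = []   # detection weeks of the ongoing episode
    negatives = 0             # consecutive negative swabs since last detection
    for week, detected in samples:
        if detected:
            current.append(week)
            negatives = 0
            continue
        if not current:
            continue  # negative swab while uncolonised: nothing to clear
        negatives += 1
        # Clearance rule: two consecutive negatives up to week 27, one afterwards.
        needed = 2 if week <= BIWEEKLY_CUTOFF_WEEK else 1
        if negatives >= needed:
            episodes.append(current)
            current, negatives = [], 0
    if current:
        # Assumption: an episode still ongoing at the last swab is closed there.
        episodes.append(current)
    return episodes


def classify(episode_weeks: List[int]) -> str:
    """Transient if detected at one sampling point, extended if at more than one."""
    return "extended" if len(episode_weeks) > 1 else "transient"


# Hypothetical swab series for a single serotype in a single infant.
example = [(1, False), (3, True), (5, True), (7, False), (9, True),
           (11, False), (13, False), (15, False), (17, True), (19, False),
           (21, False), (35, True), (43, False), (52, True)]
for weeks in split_episodes(example):
    print(weeks, classify(weeks))
```

In this hypothetical series the single negative swab at week 7 does not end the first episode, because two consecutive negatives are required before week 27, whereas the single negative at week 43 does end the episode detected at week 35.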

DNA sequencing and genomic analysis

Genomic DNA was extracted from pure pneumococcal colonies33 and whole-genome sequencing (WGS) of the picked single colonies was done at the Wellcome Sanger Institute using paired-end sequencing on the Illumina HiSeq 4000 as part of the Global Pneumococcal Sequencing (GPS) project (www.pneumogen.net). Serotypes were identified in silico based on the genomic data using SeroBA v1.0.071. The sequence types (ST) were identified using MLSTcheck v2.0.151061272 based on the pneumococcal multilocus sequence typing (MLST) scheme35. Whole-genome alignments were created from consensus pseudo-genome sequences generated after mapping the reads against the ATCC700669 pneumococcal reference genome (GenBank accession: NC_011900)73 using SMALT v0.7.4 (minimum insert size: 50, maximum insert size: 1000, minimum quality: 30, minimum depth of coverage: 4, minimum matching reads per strand: 2, minimum base call quality: 50 and minimum mapped reads: 5). Insertions and deletions were realigned using GATK v4.0.3.074. Consensus single nucleotide polymorphisms (SNPs), excluding sites with ambiguous DNA characters (N), were identified from the consensus whole-genome alignments using SNP-sites v2.3.175.

Genetic similarity between isolates and substitution rates

The genetic distance between a pair of isolates was estimated as the number of SNPs distinguishing them based on the whole-genome sequence alignment using snp-dists v0.6.3 (https://github.com/tseemann/snp-dists). We excluded nucleotide sites with ambiguous DNA characters or deletions when estimating the genetic distances. To estimate substitution rates, we identified serotype and ST combinations with >3 sequenced genomes per episode within an individual and then determined the number of nucleotide substitutions accumulated at each subsequent sampling point relative to the index strain, which served as the reference. We then fitted a linear regression model of the number of accrued substitutions against the time elapsed since the first isolate in the episode, i.e., the reference strain, was sampled. A significant linear relationship between the number of substitutions and time provided strong evidence for molecular-clock-like evolution. The serotypes with evidence of clock-like evolution were then used to infer the substitution rate (µ), expressed as nucleotide substitutions per site per year (SNPs site⁻¹ year⁻¹), as follows: µ = βW/G, where β is the regression slope in SNPs per week, W is the number of weeks per year (52) and G is the pneumococcal genome size (2,221,315 bp)73. Data visualisation was done using ggplot2 v3.1.076.
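As a worked illustration of µ = βW/G, the following Python sketch fits an ordinary least-squares slope to SNP counts against weeks since the index isolate and rescales it to SNPs per site per year. The function names and the example data are hypothetical, fitting an intercept is an assumption (the text does not say whether the regression was forced through the origin), and the significance test of the slope used to declare clock-like evolution is not reproduced here.

```python
from typing import Sequence

GENOME_SIZE_BP = 2_221_315  # G: pneumococcal genome size quoted in the text
WEEKS_PER_YEAR = 52         # W: weeks per year as used in the text


def ols_slope(weeks: Sequence[float], snps: Sequence[float]) -> float:
    """Least-squares slope (SNPs per week) of SNP count on elapsed weeks."""
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(snps) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, snps))
    return sxy / sxx


def substitution_rate(weeks: Sequence[float], snps: Sequence[float]) -> float:
    """Convert the regression slope to SNPs per site per year: mu = beta * W / G."""
    beta = ols_slope(weeks, snps)
    return beta * WEEKS_PER_YEAR / GENOME_SIZE_BP


# Hypothetical episode: one extra SNP relative to the index isolate every 2 weeks.
weeks_since_index = [0, 2, 4, 6]
snps_vs_index = [0, 1, 2, 3]
mu = substitution_rate(weeks_since_index, snps_vs_index)
print(f"{mu:.2e} SNPs per site per year")
```

A slope of 0.5 SNPs per week, as in this made-up episode, corresponds to roughly 1.17 × 10⁻⁵ SNPs per site per year, which falls within the range of within-host estimates discussed above.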

Recombination, natural selection and parallel evolution

To detect the occurrence of recombination, natural selection, and parallel evolution within extended colonisation episodes, we selected strains from episodes with >3 sequenced genomes. We assessed the distribution of SNPs in the affected genes using the crude ratio of the number of non-synonymous substitutions per kilobase pair (dN) to synonymous substitutions per kilobase pair (dS), i.e., dN/dS, with pseudocounts of 1 added to both the denominator and numerator to avoid division by zero. Homologous recombination was assessed using Gubbins v2.4.136. The occurrence of parallel substitutions was determined by identifying genomic locations with substitutions in >1 distinct extended episode. The probability of the occurrence of two parallel substitutions in different episodes was estimated as the product of the per-site probabilities of a substitution arising at any location in the genome, each calculated from the substitution rate as 1 − e^(−µt), where µ is the pneumococcal substitution rate (1.57 × 10⁻⁶ SNPs site⁻¹ year⁻¹)13 and t is the time in years. The within-episode effective population size (Ne) was estimated as Ne = θ/(2µgL)39, where θ, µ, g and L represent the strains’ mean pairwise genetic diversity, substitution rate13, generation rate (14/365 cell divisions/year)77 and genome length (2,221,315 bp)73, respectively. Genomic data were processed using BioPython v1.7.678 and multiple sequence alignment diagrams were generated using alignfigR v0.1.1 (https://github.com/sjspielman/alignfigR). We performed functional analyses of the genes using eggNOG-mapper v2.079. Three-dimensional scatter plots were generated using the scatter3D function in the plot3D v1.3 package (https://cran.r-project.org/web/packages/plot3D/). Maps were generated in R using the ggmap v3.0.0 package (https://cran.r-project.org/web/packages/ggmap/). All statistical analyses were done using R v3.5.3 (R Core Team, 2020).
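The three summary quantities in this section can be written as short Python functions. This is a sketch under stated assumptions rather than the analysis code: the function names are hypothetical, the pseudocount is assumed to be added to the raw substitution counts before normalising by gene length, the per-site probability is taken as 1 − e^(−µt), and the constants are the values quoted in the text.

```python
import math

MU = 1.57e-6         # substitution rate quoted in the text (SNPs per site per year)
GENERATION = 14 / 365  # g: generation rate as quoted in the text
GENOME_LENGTH_BP = 2_221_315  # L: genome length quoted in the text


def crude_dn_ds(n_nonsyn: int, n_syn: int, gene_length_bp: int) -> float:
    """Crude dN/dS with a pseudocount of 1 on numerator and denominator.

    Assumption: the pseudocount is added to each raw count before the
    per-kilobase normalisation, so the gene length cancels in the ratio.
    """
    kb = gene_length_bp / 1000
    dn = (n_nonsyn + 1) / kb
    ds = (n_syn + 1) / kb
    return dn / ds


def per_site_probability(t_years: float, mu: float = MU) -> float:
    """Probability of at least one substitution at a site within t years."""
    return -math.expm1(-mu * t_years)  # equals 1 - exp(-mu * t), numerically stable


def parallel_probability(t1_years: float, t2_years: float, mu: float = MU) -> float:
    """Probability that the same site changes in two independent episodes."""
    return per_site_probability(t1_years, mu) * per_site_probability(t2_years, mu)


def effective_population_size(theta: float, mu: float = MU,
                              g: float = GENERATION,
                              length_bp: int = GENOME_LENGTH_BP) -> float:
    """Ne = theta / (2 * mu * g * L), with theta the mean pairwise diversity."""
    return theta / (2 * mu * g * length_bp)


print(crude_dn_ds(n_nonsyn=3, n_syn=1, gene_length_bp=1000))
print(parallel_probability(0.5, 0.5))
print(effective_population_size(theta=1.0))
```

Because the product of two per-site probabilities is very small for episodes lasting months, observing the same site mutated in more than one episode is hard to explain by chance alone, which is the reasoning behind treating parallel substitutions as candidates for adaptation.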

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.