Within-host microevolution of Streptococcus pneumoniae is rapid and adaptive during natural colonisation

Genomic evolution, transmission and pathogenesis of Streptococcus pneumoniae, an opportunistic human-adapted pathogen, are driven principally by nasopharyngeal carriage. However, little is known about genomic changes during natural colonisation. Here, we use whole-genome sequencing to investigate within-host microevolution of naturally carried pneumococci in ninety-eight infants intensively sampled sequentially from birth until twelve months in a high-carriage African setting. We show that neutral evolution and nucleotide substitution rates up to forty-fold faster than those observed over longer timescales in S. pneumoniae and other bacteria drive high within-host pneumococcal genetic diversity. Highly divergent co-existing strain variants emerge during colonisation episodes through real-time intra-host homologous recombination, while the rest are co-transmitted or acquired independently during multiple colonisation episodes. Genic and intergenic parallel evolution occur particularly in antibiotic resistance, immune evasion and epithelial adhesion genes. Our findings suggest that within-host microevolution is rapid and adaptive during natural colonisation.


Software and code
Data collection

No software was used for data collection. All samples used in this study were collected by trained clinicians and processed by an experienced molecular microbiology team. The extracted DNA underwent whole-genome sequencing and the resulting data were analysed using the open-source tools listed below.

Chrispin Chaguza
Jun 18, 2020

October 2018
Data
The whole-genome sequences (reads) were deposited in the European Nucleotide Archive (ENA) and are publicly available under the accession numbers provided in Supplementary Data 1 of this paper. The reference genome sequence used for read mapping (GenBank accession: NC_011900) is available from GenBank (https://www.ncbi.nlm.nih.gov/nuccore/NC_011900). The authors declare that all other data supporting the findings of this study are available within the paper and its supplementary information files.

Life sciences study design
It was determined that a sample size of 90 (30 infants in each of 3 groups) would give 90% power (p = 0.05) to detect a difference of 25% or more in the mean number of bacterial genera carried in the nasopharynx 150 days after birth among infants in the three study groups. Participants were recruited on a roll-in basis until each group had at least 30 infants. A total of 102 infants were recruited and 96 completed the study. The infants were recruited within 7 days of birth and followed up bi-weekly for the first 6 months and bi-monthly thereafter until twelve months. A total of 1595 nasopharyngeal swabs were collected from the infants.
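As an illustrative sketch (not the authors' calculation, whose test and assumed variance are not stated in this summary), the sample-size statement above can be related to a standardised effect size. Assuming a two-sided, two-sample comparison of means under a normal approximation, the snippet computes the smallest standardised difference detectable with 30 infants per group at 90% power and a 0.05 significance level, and the coefficient of variation at which a 25% difference in means would reach that threshold. The function names and the choice of test are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_group: int, power: float = 0.90, alpha: float = 0.05) -> float:
    """Smallest standardised mean difference (in SD units) detectable by a
    two-sided, two-sample z-test with equal group sizes.

    Assumption: normal approximation; the source does not name the test used.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


def implied_cv(relative_difference: float, effect_size: float) -> float:
    """Coefficient of variation (SD / mean) at which a given relative
    difference in means equals the standardised effect size."""
    return relative_difference / effect_size


d = min_detectable_effect(30)   # 30 infants per group, as reported
cv = implied_cv(0.25, d)        # 25% difference in means, as reported
print(f"minimum detectable effect: {d:.2f} SD")
print(f"implied coefficient of variation: {cv:.2f}")
```

Under these assumptions, a 25% difference in the mean number of genera would be detectable if the standard deviation of the genus count were roughly 30% of its mean or smaller. A three-group test such as ANOVA would give somewhat different figures.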
Four infants died or left the study before 2 months and their data are excluded from the study.
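The recruitment and exclusion figures can be reconciled with a few lines of arithmetic. The sketch below uses only the counts reported in this summary; the variable names are illustrative, and dividing the swab total by the number of retained infants assumes that the 1595 swabs all came from those infants, which the text does not state explicitly.

```python
recruited = 102        # infants recruited in total
excluded_early = 4     # died or left the study before 2 months
total_swabs = 1595     # nasopharyngeal swabs collected

retained = recruited - excluded_early
# Assumption: all swabs are attributed to the retained infants.
mean_swabs_per_infant = total_swabs / retained

print(f"infants retained for analysis: {retained}")
print(f"mean swabs per retained infant: {mean_swabs_per_infant:.1f}")
```

The retained count of 98 matches the ninety-eight infants reported in the abstract.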
Replication was not done for this study. Longitudinal studies of this magnitude are costly and time-consuming to conduct, especially in our study setting in sub-Saharan Africa (The Gambia).
Newborns were recruited from 27 villages with estimated birth rates of between three and twenty-six per year. The villages were split into 3 groups of 9 villages, each group having an estimated population of 2000 persons and a birth rate of approximately eighty per year. Group I and II villages had to be at least 1 km from Group III villages where PCV-7 had been trialled. Group III villages were PCV naive. Trained village reporters in each village recorded and reported pregnancies, births, deaths and other serious events to the field team. To avoid recruitment bias, participants were enrolled on a roll-in basis, whereby infants born in any of the participating villages and for whom written informed consent was granted were included in the study.
No blinding was performed in this study. Blinding was not considered necessary for our longitudinal cohort study because the study was observational and no intervention was given to the newborn infants. We followed up all the enrolled infants and collected nasopharyngeal samples at specific time points to study strain dynamics, genomic diversity and evolution during natural colonisation; therefore, there was no risk of bias from not blinding the investigators.