Population genomic datasets describing the post-vaccine evolutionary epidemiology of Streptococcus pneumoniae

Streptococcus pneumoniae is common nasopharyngeal commensal bacterium and important human pathogen. Vaccines against a subset of pneumococcal antigenic diversity have reduced rates of disease, without changing the frequency of asymptomatic carriage, through altering the bacterial population structure. These changes can be studied in detail through using genome sequencing to characterise systematically-sampled collections of carried S. pneumoniae. This dataset consists of 616 annotated draft genomes of isolates collected from children during routine visits to primary care physicians in Massachusetts between 2001, shortly after the seven valent polysaccharide conjugate vaccine was introduced, and 2007. Also made available are a core genome alignment and phylogeny describing the overall population structure, clusters of orthologous protein sequences, software for inferring serotype from Illumina reads, and whole genome alignments for the analysis of closely-related sets of pneumococci. These data can be used to study both bacterial evolution and the epidemiology of a pathogen population under selection from vaccine-induced immunity.

Streptococcus pneumoniae is common nasopharyngeal commensal bacterium and important human pathogen. Vaccines against a subset of pneumococcal antigenic diversity have reduced rates of disease, without changing the frequency of asymptomatic carriage, through altering the bacterial population structure. These changes can be studied in detail through using genome sequencing to characterise systematically-sampled collections of carried S. pneumoniae. This dataset consists of 616 annotated draft genomes of isolates collected from children during routine visits to primary care physicians in Massachusetts between 2001, shortly after the seven valent polysaccharide conjugate vaccine was introduced, and 2007. Also made available are a core genome alignment and phylogeny describing the overall population structure, clusters of orthologous protein sequences, software for inferring serotype from Illumina reads, and whole genome alignments for the analysis of closely-related sets of pneumococci. These data can be used to study both bacterial evolution and the epidemiology of a pathogen population under selection from vaccine-induced immunity.

Background & Summary
Streptococcus pneumoniae (the pneumococcus) is a genetically diverse bacterial species commonly asymptomatically carried in the nasopharynx of infants, an age group not capable of generating a strong adaptive immune response to the polysaccharide capsule that envelopes most pneumococcal cells 1 . This capsule inhibits immune clearance by both complement-and neutrophil-mediated pathways 2 , and is a critical factor in allowing S. pneumoniae to invade other anatomical sites and cause disease such as pneumonia, bacteraemia and meningitis, particularly in both the young and elderly. Consequently, polysaccharide conjugate vaccines (PCVs) based on the pneumococcal capsule have been developed to induce anti-pneumococcal immunity in children 3 . These vaccine formulations are limited in the number of antigens they contain: the first licensed formulation (PCV7) contained seven serotypes. However, over 90 immunologically distinct capsule polysaccharides (serotypes) have been identified in S. pneumoniae 4 , which is the consequence of extensive genetic variation at the capsule polysaccharide synthesis (cps) locus. Hence at the point of vaccine introduction the pneumococcal population consists of a mix of 'vaccine serotypes', susceptible to artificially-induced host immunity, and 'non-vaccine serotypes'. Nevertheless, the widespread use of PCVs has caused a substantial fall in the incidence of invasive pneumococcal disease 5 . This is the consequence of the PCV7 vaccine having at least 50% efficacy in preventing nasopharyngeal carriage of vaccine serotypes 6,7 , and around 98% (ref. 8) efficacy against invasive disease caused by the same types. However, the overall levels of pneumococcal carriage have not changed postvaccination owing to serotype replacement 9,10 : the increase in frequency of non-vaccine serotypes. The reduced incidence of pneumococcal disease has therefore been attributed to the lower rate at which these non-vaccine serotypes cause symptomatic infections relative to vaccine serotypes 11 . Additionally, as many multidrug-resistant pneumococci were of PCV7 serotypes prior to the vaccine's introduction, it was anticipated that PCV7 would decrease the levels of S. pneumoniae antimicrobial resistance; however, any such benefit in this regard observed shortly after vaccine introduction 12 was not sustained over the longer term 13 . Hence understanding how the carried pneumococcal population structure changes following the implementation of partial coverage PCVs is important for relating the intervention to its subsequent clinical outcomes.
To address this question, the carried population of pneumococci in Massachusetts has been followed by the Streptococcus pneumoniae Antimicrobial Resistance in Children (SPARC) project. The instigation of this study coincided with the introduction of PCV7 in the USA in 2000. Samples were obtained through swabbing the nasopharynx of children seven years of age or under during routine visits to primary care physicians 14 [13][14][15] . The detected level of pneumococcal colonisation remained stable over these winters, with 190 (26% prevalence), 232 (23% prevalence) and 294 (30% prevalence) S. pneumoniae isolates recovered in the three successive sampling periods. This collection has been used to analyse the changing antigenic profile of the population in response to vaccine-induced immunity through serological typing [13][14][15] . The same collection was also genotyped by multilocus sequence typing 16 (MLST) to relate these changes to the elimination, emergence and diversification of individual bacterial lineages 17,18 . Subsequently, whole genome sequencing was applied to the collection to investigate the population dynamics in greater detail 19 . This dataset consists of 616 annotated draft S. pneumoniae genomes representing the evolutionary epidemiology of the species as the population structure changed in response to vaccine-induced selection pressures. To aid analysis of the overall set of isolates, the species-wide core genome alignment and phylogeny are also made available, as are the predicted protein sequences, a method for inferring serotype from Illumina reads, and whole genome alignments for fifteen sets of closely-related isolates.

Methods
Culturing and phenotyping of strains Following retrieval from storage, all bacterial samples were colony purified, then grown on 5% sheep's blood agar overnight at 37°C in the presence of 5% CO 2 . Samples were serologically typed using latex agglutination (Statens Serum Institut, Copenhagen) as a check on sample handling. Discrepant results with previous typing were verified using the Quellung reaction (Statens Serum Institut, Copenhagen; Table 1 (available online only)).
Overnight plate growth was harvested through resuspension in phosphate buffered saline, and genomic DNA was extracted using DNeasy columns (Qiagen) following manufacturer's instructions. The concentration of genomic DNA was quantified using the Qubit system (Life Technologies); all samples yielded at least 3 μg of DNA. The integrity of the genomic DNA was checked using agarose gel electrophoresis relative to a λHindIII ladder (New England Biolabs).

Generation of sequence data
Illumina sequencing libraries were constructed as described previously 20,21 . Briefly, genomic DNA was first fragmented using Adaptive Focused Acoustics technology (Covaris). The resulting fragments were www.nature.com/sdata/ SCIENTIFIC DATA | 2:150058 | DOI: 10.1038/sdata.2015.58 then repaired to ensure they had blunt ends, phosphorylated at their 5′ end, A-tailed at the 3′ end, and ligated to adapter molecules. This library of fragments was then separated by agarose gel electrophoresis. DNA constructs of the appropriate size range (generating an insert size of approximately 150-300 bp) were then extracted from the gel and amplified by a polymerase chain reaction using Kapa HiFi polymerase (Kapa Biosystems) that added one of the 96 index tags used in this project. Libraries were then quantified using qPCR, and combined into an equimolar pool of 96 samples prior to denaturation, cluster generation and sequencing Assembly and annotation of sequence data Sequences were assembled de novo using Velvet 22 version 1.2 with parameters selected to be optimal for individual datasets as described previously 23 . Both Glimmer3 (ref. 24) and Prodigal 25 were trained on the reference sequence of S. pneumoniae ATCC 700669 (ref. 26; Data Citation 1), then applied to the complete draft assembly with an 11 nt sequence encoding stop codons in each reading frame added to each end, to facilitate the identification of partial coding sequences (CDSs) broken by the assembly. Putative CDSs were then trimmed at the 3′ end to stop them spanning contig breaks within the assembly. Final CDS predictions were identified as the consensus of Glimmer3 and Prodigal outputs, as described previously 19 (Table 1 (available online only)). Protein sequences were then translated, aligned using BLAT 27 suite 0.34, and 'clusters of orthologous genes' (COGs) identified using COGsoft 28 . Pairs of orthologous sequences were then manually identified as described previously 19 .
To generate functional annotations of genome sequences, all CDSs were labelled with a unique identifier (of the form, 'ERSX_Y', where 'ERSX' is the sample accession code in the European Nucleotide Archive listed in Table 1 (available online only) and Y is an incrementing index) and their COG (of the form, 'SPARC1_CLSZ' or 'SPARC1_CLSTZ', where Z is a number). COGs relating to antibiotic resistance and the newly-characterised variable restriction-modification system loci were annotated as described previously 29 ; the 590 COGs found to be specific to prophage, the three COGs found to be specific to a particular prophage remnant, the 142 COGs found to be specific to phage-related chromosomal islands, and the 355 COGs found to be specific to integrative and conjugative elements were also appropriately identified in these datasets 29 . All COGs not belonging to one of these categories were annotated using a database of pneumococcal CDS information. This was constructed by extracting the protein sequences and annotated functions from publicly-available complete genomes and the annotation of 90 capsule polysaccharide synthesis loci 30 . Where a CDS in one of the draft genomes had a putative protein sequence identical to the translated sequence of a previously annotated locus, the annotation was directly transferred; otherwise, the annotation was transferred on the basis of orthology, if another putative protein in the same COG was identical to the translation of a CDS in an annotated genome sequence. In cases where no such information could be obtained, CDSs were labelled as producing 'hypothetical proteins'. Pneumococcal small interspersed repeats were annotated as described previously 31 , and tRNA and rRNA loci were annotated with tRNAscan-SE 32 version 1.3.1 and rnammer 33 version 1.2, respectively (Table 1 (available online only)).
Generation of core genome alignment and overall phylogeny As described previously 19 , the 1,194 COGs found to have a single representative in each of the 616 genomes were individually aligned at the protein level using MUSCLE 34 , prior to backtranslation to generate a 1.14 Mb codon alignment. The 106,196 polymorphic sites were extracted and used to generate a phylogeny using RAxML 35 version 7.0.4 with the general time reversible substitution model and a four category gamma distribution to account for rate heterogeneity. This tree was midpoint rooted on the longest branch, which separated sequence cluster 12 from the rest of the population. This is consistent with a wider phylogenetic analysis of multiple species that suggested sequence cluster 12 was the earliest lineage to diverge from the other isolates 19 . The same alignment was analysed with BAPS 36 version 5 to identify the sequence clusters. Both the core genome alignment and tree are made available as part of Data Citation 2.

Generation of whole genome alignments
For each of the fifteen monophyletic sequence clusters identified using BAPS and RAxML, a single reference draft assembly was selected for manual curation. The Illumina read data were reassembled with SGA 37 version 0.9, and these contigs merged with those from Velvet using Zorro 38 version 2.2. These were arranged into scaffolds using SSPACE 39 version 2, then manually curated and ordered using ABACAS 40 and ACT 41 . These fifteen assemblies are made available as part of Data Citation 2. Illumina read pairs from isolates of the same sequence cluster were then mapped against this reference using SMALT 42 version 0.5.8. The resulting read alignment was processed with Samtools 43 , VCFtools 44 and Biopython 45 to generate a consensus sequence. Bases were called at positions spanned by at least two reads in each direction, where at least a 75% consensus on the allele was evident; additionally, the base quality at the site had to be at least 50, and the mapping quality had to be at least 30, on the Phred scale 46,47 . These consensus sequences from each representative of the sequence cluster were then combined to generate a reference-based multiple genome alignment, each of which was analysed as described previously using an earlier version of the Gubbins 48 software. These fifteen whole genome alignments, which do not include the reference assemblies themselves, are also made available as part of Data Citation 2.

Code Availability
The algorithm used to predict recombination events has been developed into a software package, named Gubbins 48 , which can be installed on Linux and Max OSX operating systems. It can also be run on Windows operating systems using a virtual machine environment, and is freely available from http://sanger-pathogens.github.io/gubbins/. Code and reference sequences for the serological typing of S. pneumoniae using sequence reads is made available as part of Data Citation 2.

Data Records
The raw sequence data (FASTQ format) and annotated draft genome sequences (EMBL format) for the 616 isolates in the dataset have been deposited in the European Nucleotide Archive (http://www.ebi.ac. uk/ena/) with the project accession code in Data Citation 3. Individual accession codes for both raw read data and annotated draft genome assemblies are listed in Table 1 (available online only) and in the machine-readable ISA-tab metadata files associated with this article.
Further files are made available through the Dryad repository (https://datadryad.org) with the digital object identifier in Data Citation 2. The core genome codon alignment (FASTA format) and maximum likelihood phylogeny (Newick format) describe the overall population structure. For the study of individual lineages, a whole genome alignment (FASTA format) and draft reference assembly (FASTA format) are included for each of the 15 monophyletic sequence clusters 19 . These encompass 491 of the isolates, excluding those in the diverse polyphyletic sequence cluster 16. In addition, the full set of protein coding sequences (FASTA format), and the translated proteins (FASTA format), can be used to study the diversity of individual COGs. To minimise the manipulation of the protein sequences, the initial amino acid of each protein is directly translated, rather than being converted to a methionine.
The epidemiological and phylogenetic data can also be interactively visualised and analysed online using the Microreact website (http://microreact.org/) with the URL in Data Citation 4.

Technical Validation
Integrity of sample handling and quality control An overview of the processing pipeline, including the technical validation steps, is shown in Fig. 1. Of the 716 samples collected as part of the surveillance project between 2001 and 2007, 631 could be revived and cultured. All these isolates were subjected to serological typing using latex agglutination to ensure consistency of sample handling relative to previous studies ( Fig. 1 and Table 1 (available online only)). As multiple colonisation is often observed in children 49 it was not necessarily expected that the original strain would be recovered in all cases; in the earlier studies, only a single isolate per individual was analysed in order to maximise the size of the host population sample, as detecting strains carried at low frequencies is inefficient using standard microbiological techniques 50  previously recorded for 12 of the 631 samples. These new serotype inferences were all independently verified as being correct using a second culture of the same isolate, and subsequently confirmed in all seven cases where the samples were retyped in a separate laboratory using the Quellung reaction 51 . One sample yielded two clearly morphologically distinct strains that proved to be of different serotypes, increasing the number of isolates in the study to 632. All isolates found to be non-typeable serologically were tested for optochin susceptibility using 'P discs'; this identified one isolate as likely to represent a non-pneumococcal streptococcus, the exclusion of which reduced the overall collection size to 631. Sufficient high-quality genomic DNA for analysis was extracted from all isolates. After sequencing, nine samples failed either automated quality control checks implemented by the Sanger Institute, or manual investigation of dataset properties, such as adapter content, insert size and GC content; one of these failures was successfully resequenced. After assembly of the remaining 623 samples, seven further samples were rejected as generating low quality draft genomes. These represented cases where the N50 was below 15 kb and the total assembly length was greater than 2.4 Mb, likely as a consequence of sequence from more than one isolate being mixed in the raw Illumina reads. This resulted in the final dataset of 616 draft genomes.

Integrity of data handling
Serotypes and multilocus sequence types (STs) were inferred from Illumina read data as described previously 23 . Excluding isolates determined as being 'non typeable' by either microbiological or bioinformatic serotyping, across the final set of 616 samples the capsule polysaccharide synthesis (cps) locus was congruent with the serogroup (a set of antigenically similar serotypes) identified through immunological tests in all but two cases. Of the 594 isolates for which an ST had been previously established, 553 (93%) were identical with those inferred from Illumina reads (Fig. 2a). These included both cases where the genome's cps locus did not match the experimentally ascertained serogroup, indicating these discrepancies were not likely to result from sampling handling issues. Of the 41 cases in which the original ST was discrepant with that inferred from the reads, 29 differed at only one of the seven loci (Fig. 2b). All remaining inconsistencies are likely to reflect instances of multiple colonisation, resulting in different strains being originally genotyped before storage and subsequently retrieved for sequencing. This is consistent with the cps locus from the sequence reads in these cases matching the serology of the revived isolates from which the genomic DNA was extracted.

Quality of genome assemblies
The 616 samples in the dataset each yielded between 267 and 1,865 Mb of sequence data (median of 652 Mb). Assuming a typical 2 Mb S. pneumoniae genome 26 , this meant each isolate had a sequencing depth of over 100 fold coverage. Based on a random sample of 10,000 reads aligned to a set of prokaryotic and eukaryotic reference genomes using BWA 52 , a substantial majority of reads matched to the S. pneumoniae representative (strain ATCC 700669) in each dataset, confirming these data were primarily derived from the submitted genomic DNA sample.
All draft assemblies had an N50 greater than 15 kb and a total length between 1.98 and 2.19 Mb, similar to complete S. pneumoniae genomes. Additionally, the number of CDSs in the de novo assemblies was within the range of CDSs found within annotated complete or high-quality draft S. pneumoniae genomes (Fig. 2c,d). Each isolate's annotation included at least 101 of the 102 protein functional domains recently suggested to be essential and ubiquitous across cellular genomes 53 ; the only discrepancy was the short ribosomal protein coding sequence rpsN, which was not consistently identified by the automated gene annotation software even when present within assemblies. Assembly quality was also judged on the basis of non-coding RNA content: using previously-defined criteria 53 , the majority of isolates had a fulllength representative of each of the ribosomal RNAs. Similarly, quantifying tRNA content found the majority of isolates had at least one tRNA for each standard amino acid.
To ascertain the accuracy of the de novo assemblies relative to the original epidemiological data and Illumina sequence reads, the STs were extracted from the contigs through identification of the relevant loci using BLAST 54 (Fig. 2a,b). All seven loci could be recovered from each assembly. Across the 616 samples, the ST extracted from the assembly and Illumina reads was identical in 602 cases (98% accuracy). In four cases, the ST inferred from the assemblies differed from the consensus of the original genotyping and the ST extracted from the reads at a single locus (Fig. 2a). In six cases, the ST inferred from the Illumina reads differed from the consensus of the original genotyping and the assembly at a single locus; the assemblies indicated these all corresponded to a single ST, suggesting a rare systematic error. In four cases, the STs inferred from the assembly and Illumina reads differed at a single locus, and the original genotyping data were missing or inconsistent with both.
The wider 'core' genome was defined as a set of 1,194 COGs, a single representative of which was found in each genome 19 . Independent re-analysis of this dataset with a different method for defining COGs found 1,206 'core' COGs, of which 1,027 were identical to the 1,194 originally identified 55 . Concatenated codon alignments of the 'core' COGs were subject to three independent analyses with BAPS version 5, which converged on identical membership of the sixteen sequence clusters, in two cases, with additional isolates included within SC7 in the third. The fifteen sequence clusters containing similar isolates were correspondingly monophyletic in the core genome phylogeny, confirming them as being closely related sets of bacteria (Fig. 3).

Validation of recombination analyses
Recombination detection was only attempted within the 15 monophyletic sequence clusters, as they consisted of groups of isolates with detectable similarity that was likely to reflect recent common ancestry. Simulations indicated that the type of approach that was used to identify recombinations in whole genome alignments is most accurate when applied to sets of closely-related sequences 48 . In the analyses presented in this work, all alignments in which at least ten recombination events were detected formed an exponential recombination length distribution with a rate constant consistent with other genomic data 23,56-58 and experimental work 47 . The positions of recombination 'hotspots' relative to the reference genome annotations were also consistent with these independent analyses. In the cases where evidence of a molecular clock could be detected, the substitution rate was also found to be consistent with the analyses of other S. pneumoniae datasets 23,56-58 . Additionally, the consistency of the final phylogenies with the epidemiological data allowed a phylogeographic signal to be detected 19 .
Further analysis of the isolates' inferred serology (Table 1 (available online only)) identified cases where closely-related isolates differed at their cps loci, suggesting 'serotype switching' had occurred. In all cases where the pattern of switching could be robustly established, the change at the cps locus could be attributed to an inferred recombination affecting the relevant genes 4 .

Usage Notes
Sequence data may be downloaded from the European Nucleotide Archive using the project accession codes ERP000809 or PRJEB2632. All accession codes for raw sequence data and annotated individual assemblies are listed in Table 1 (available online only). Associated epidemiological data was published previously 19 . Sequences and functional annotation can be displayed using Artemis 41 . Whole genome alignments can be viewed and analysed using standard software. Gubbins 48 can be applied to them for the inference of recombined sequence. The software for serological typing can be run on Linux or Mac OSX as described in the accompanying README file.