D R A F T Signatures of non-neutral processes within the population structure of Streptococcus pneumoniae

,

M any bacterial pathogen populations contain a number of co-circulating lineages bearing unique signatures of alleles at selected housekeeping loci [1] and also at a whole genome level [2][3][4]. The maintenance of these discrete lineages is hard to ascribe to purely neutral processes, given the high rate of genetic exchange in these pathogen populations [5]. We have previously proposed that extensive co-adaptation between loci may give rise to these patterns, as even small fitness differences among different combinations of alleles can lead to the loss of less 'fit' lineages under intense competition for resources [6]. Bacterial populations may also segregate into a set of successful 'metabolic types' which are able to co-circulate by virtue of exploiting separate metabolic niches and thereby avoiding direct resource competition and immune pressures [7]. As an example, specific differences in the ability to absorb particular carbohydrate resources have been observed in functional genomics studies of Streptococcus pneumoniae [8], and these may reflect specialization upon different resources within the same environment as a means of avoiding competition.
Several bacterial pathogens are also antigenically diverse: S. pneumoniae, for example, can exist in over 90 different serologically distinguishable states or 'serotypes' [9]. Many bacterial populations -including S. pneumoniae -exhibit strong associations between antigenic type and lineage, at least at the level of MLST [7,10]. Such associations may have arisen through neutral processes; alternatively, as we have previously demonstrated, they may represent the outcome of a combination of immune selection acting upon antigen genes and direct resource competition acting upon metabolic genes and virulence factors [6,10]. Distinguishing between these two hypotheses is complicated by the high levels of linkage disequilibrium observed across the whole genome [10,11]. However, the alternative hypotheses make very different predictions about how the system would respond to perturbation by vaccination, particularly when only a subset of antigenic types are included in the vaccine, as is the case for S. pneumoniae. Under these circumstances, antigenic types that are not included in the vaccine may be expected to increase in frequency but, under the neutral model, it would be highly unlikely that they would do so in association with the genotypes previously associated with vaccine serotypes. By contrast, if the associations were primarily generated by selection, one would expect non-vaccine serotypes to become associated with genotypes that were previously commonly associated with vaccine serotypes [10]. Evidence for this phenomenon of Vaccine

Significance Statement
Populations of Streptococcus pneumoniae (the pneumococcus) appear to form stable clusters of closely related organisms despite the fact that they frequently exchange genetic material. In this paper we show that, rather than emerging by chance, these clusters have evolved to maintain their differences so that they may avoid competing with each other. Our work suggests that these clusters are fundamentally determined by variation within a set of genes which encode "chaperones" that help other pneumococcal proteins fold correctly under changes in the physical environment they would encounter while trying to infect their vertebrate host. These chaperone proteins are also targets of immunity and therefore may have originally diverged to minimise immunological interference between pneumococci, thereby necessitating changes across the whole genome.
S.G., J.L. and E.W. designed the study; S.G. and J.L. conducted the study. All authors were involved in analysis and interpretation of results and available data and in writing the paper.
We declare we have no competing interests. 1 To whom correspondence should be addressed. E-mail: jose.lourenco@zoo.ox.ac.uk
Establishing the contribution of co-adaptation and metabolic competition in the maintenance of lineage structure is thus important as the outcome of certain interventions, such as vaccination, depends crucially on these underlying determinants of population structure. Here, we assess the potential to achieve this by machine learning techniques, which work by attempting to identify relevant features based on information supplied on a set of potential predictor variables for each individual genetic sample. Random forests (RF) are one of such methods currently witnessing a surge of attention, owing to its unique advantages in dealing with large datasets of both numerical and categorical data, as well as having low computational overhead, a nonparametric nature and a well defined probabilistic output [16]. A random forest algorithm (RFA) is an ensemble method that combines the information of multiple, regression or classification trees built around predictor variables towards a response variable. The output of an RFA is composed both of the classification success rates of the response variable and a ranking of the predictor variables (scores) quantifying their relative role in the classification process. RFA-based methods are widely applied in genome-wide association studies of cancer and chronic disease risk [17], drug resistance [18], species classification [19], and in the analysis of microarray data [20]. In the context of host-pathogen systems, machine learning techniques have been shown to be able to successfully ascertain host tropism, for instance by identifying the key sites that determine host specificity of zoonotic viruses [21], by analyzing the probability of Escherichia coli cattle strains more likely to be virulent to humans [22], and by selecting the clear genetic distinctions in both avian and human proteins of Influenza viruses [23,24].
In this paper, we undertake a feature selection analysis of a dataset containing 616 whole genomes of S. pneumoniae collected in Massachusetts (USA), including 133 samples from 2001, 203 from 2004 and 280 from 2007 [3,25], thus representing the bacterial population at the point of PCV7 introduction in year 2000, and any changes that may have followed. These data have been used in numerous studies, including analysis of post-vaccine epidemiological and genetic changes [3,10,26], maintenance of population structure [2], beta-lactam resistance [27], determinants of colonization [28] and constraints on serotype switching [29]. Each isolate in this dataset contains information on its capsular serotype (determined by serological means), and had also been assigned to one of a number of monophyletic Sequence Clusters (SC) using a phylogenetic and clustering analysis on a core genome built from all putative protein-coding sequences that were present in a single copy in all genomes [3]. Using a machine learning technique and a previous allelic annotation of 2135 genes among these isolates (using ATCC 700669 serotype 23F as reference [10], table S1), we attempt to identify the relative contribution of each gene in maintaining the observed population structure in terms of (i) capsular serotype and (ii) Sequence Cluster (SC). We find a clear distinction between the sets and functions of genes highly informative for serotype versus SC, suggesting that different selective processes have led to the emergence and maintenance of S. pneumoniae's population structure.

Results
Genes which predict serotype do not perform well in predicting Sequence Cluster. We first assessed the success of the combined variation in 2135 genes of known and unknown function in identifying the Sequence Cluster (SC) to which isolates belonged, this being a measure of shared ancestry (as per [3]). Classification of SC by RFA was accurate ( Fig. S1B) with all SC types being predicted with success close to 100%. By contrast, the success rate in identifying the capsular serotypes of the 616 whole genomes, although also very high, was not perfect. None of housekeeping genes included in MLST classification performed better than average in predicting serotype or SC (Fig. 1).
As might be expected, genes within the capsular locus (defined as being within but not including the genes dexB and aliA) achieved high scores in predicting serotype but these did not score above average in predicting SC (Fig. 1). We noted that many of these genes contained what appeared to be a high proportion of deletions but, in fact, had simply eluded allelic notation on account of their high diversity at the level of the population. For certain genes, such as those encoding the polysaccharide polymerase Wzy and the flippase Wzx, the allelic notation process failed at least 50% of the time for over 90% of the isolates, essentially working only for 23F (the reference genome) and the closely related 23A and 23B serotypes. In general, the degree of success in allelic notation of each gene was closely linked to the potential for alignment with its counterpart in the 23F reference genome (Fig. S4). Nonetheless, the same shift towards lower RFA scores of capsule associated genes in predicting SC rather than serotype was observed upon performing these classification exercises after excluding all genes which contain > 50% (Fig.  S2) or > 10% (Fig. S3) of gene mismatches/deletions. When imposing an exclusion criterion of > 10% we retained only the genes wze, wzg and wzh (in addition to two pseudogenes), and these could also clearly be seen to shift from above the upper 97.5% limit into the neutral expectation of RFA scores when predicting SC (Fig. S3).
Finally, we performed the same analysis excluding all genes which showed mismatches or deletions above a threshold of 1%. This eliminated all of the genes considered above as belonging within the capsular locus, although many flanking genes were retained and a number of these achieved the top 2.5% of RFA scores in predicting serotype ( Fig. 2A, Table 1): 38% of the top genes occurred within 10 genes downstream and upstream of the capsular locus, and 66% were situated within 60 genes (a distance amounting to 2.8% of the genome). None of the genes achieving the top 2.5% of RFA scores in predicting serotype (shown in red in Fig. 2) remained in the top 2.5% category when asked to predict SC. Similarly, all genes which achieved top scores in predicting SC ( Table 2) were only of average importance in elucidating serotype (shown in green in Fig. 2). Interestingly, the MLST gene spi gained a place among the top-scoring genes for SC (Table 2) under this stringent cutoff.
Top-scoring genes for serotype classification mediate competitive interactions. Genomic position for each gene in the dataset against their normalised RFA score. The circular genome is presented in a linear form, with the first gene being dnaA and the last gene parB. MLST genes are marked in yellow diamonds (spi, xpt, glkA, aroE, ddlA, tkt) and genes within the capsular locus with blue diamonds (pseudogenes tagged with 'x'). (B) RFA for sequence cluster classification; figure details the same as in A. Blue shared areas mark the capsular locus (genes within aliA and dexB).

D R A F T
were associated with serotype to be involved in metabolic functions linked to resource competition, at least in related streptococcal species. The top-scoring gene trpF, for example, is known to be essential for the biosynthesis of tryptophan for S. pneumoniae [30], and more generally of the biosynthesis of aromatic amino acids in at least 9 species of bacteria [31]. Another top-scoring gene, fabG, encodes the β-ketoacyl-ACP reductase, the only known keto-acid reductase in bacterial fatty acid biosynthesis [32]. The gene lysC codes for an aspartokinase involved in lysine production and aminoethyl cysteine resistance in Corynebacterium glutamicum [33]. We also found two genes, mvaD and mvaK2, of the mevalonate pathway scoring highly for serotype prediction. This pathway, also known as the HMG-CoA reductase pathway, can be found in bacteria, eukaryotes and archaea [34]. One of its main products, the isopentenyl pyrophosphate (IPP), is used to make isoprenoids, a diverse class of over 30,000 biomolecules. In bacteria, the principal products of IPP include the lipid carrier undecaprenol (involved in wall biosynthesis), plus a range of menaquinones and ubiquinones both involved in electron transport, and the latter also in aerobic cellular respiration [34][35][36]. In S. pneumoniae, these two genes are essential for growth and are proposed to be part of a single operon [35].
ATP-binding cassette (ABC) transporter genes, which are critical for intake and metabolism, were found 5 times more frequently in the top genes classifying serotype compared to those determining SC ( Table 1). As part of one such transporter, the SpuA protein is involved in α-glucan metabolism, whose main substrate is glycogen (polysaccharide of glucose), an abundant resource in human lung epithelial cells [37,38]. The gene patB also encodes part of an ABC efflux pump in S. pneumoniae, responsible for resistance to fluoroquinolenes [39][40][41]. Among other ABC transporters, two other genes were located within the pit operon which is involved in iron uptake. In line with our findings, the pit operon has previously been shown to exhibit strain-specific variation [42]. In contrast, our approach did not select 2 other operons involved in iron uptake (piu and pia), which are conserved between S. pneumoniae strains [42] and therefore unlikely to be predictors of serotype. Another important regulator of iron transport among the topscoring genes is gnd [43]. The latter is transcriptionally linked to another top gene, ritR, which is orthologous to the streptococcal global regulator covR, for which there is conclusive evidence from S. pyogenes, S. suis and S. agalactiae of regulatory functions on capsular biosynthesis [44][45][46]. Another ABC transporter known as Ecs is represented in the top list by one of its two genes, ecsA. The substrate of Ecs is so far unknown, but obligatory anaerobes or microaerophilic bacteria do not carry the Ecs transporter, and its function is therefore argued to be related to respiration [47]. Finally, transport can be achieved by a multitude of systems alternative to ABC transporters, such as 'passive' channels like the top-scoring sodium symporter GlyP [48]. Sodium is one of the main electrolytes in human saliva, existing there at a higher concentration than in blood plasma, and differentiation in sodium transport, similarly to iron or glucose transport, could potentially be under selection for niche specialization.
High RFA scores for serotype were also found among a number of genes flanking the capsular locus which are involved in the cell wall peptidoglycan biosynthesis pathway [9].  These include the penicillin-binding protein genes pbpX and pbp1A, the 16S rRNA cytosine-methyltransferase gene mraW and the phospho-N-acetylmuramoyl-pentapeptide-transferase gene mraY. Mutations in these genes can lead to penicillin resistance, and single-nucleotide positions in all three genes have been shown to associate strongly with S. pneumoniae βlactam resistance in genome-wide association studies (GWAS) performed on the dataset used in this study [3], in a Thai study containing 3,085 isolates [49], and in a Canadian study on 11,083 isolates [50]. It is of relevance to note that in S. Pneumoniae, pbp1A is also involved in the formation of the septum during cell division [51] and is associated in a two-gene operon with the top-scoring gene recU, coding for the Holliday junction resolvase, required for homologous DNA recombination, repair and chromosome segregation [52,53]. Finally, resistance to various classes of cell wall-inhibitory antibiotics (ex. methicillin, vancomycin, daptomycin) in S. Aureus is regulated via the vra operon, by up or downregulation of a set of genes commonly designated as the cell wall stimulon [54]. We find this operon represented by two entries, the vraT and vraT genes.

D R A F T
In addition to genes clearly related to critical resource functions, transport and antibiotic resistance, we also found some of the top-scoring entries to be involved in functions associated with direct inter-and intra-species competition, either through factors related to immune escape or warfare. For instance, blpH is part of the BlpABCSRH pathway [55], which regulates production of class II bacteriocins and related immunity proteins [56,57]. In related species, the aminotransferase GlmS is also known to upregulate the production of ammonia thereby increasing acid tolerance and survival [58]. The capsular flanking gene luxS is also a good example, as it is part of a Staphylococcus epidermidis quorum-sensing system in biofilm formation, and linked to pneumolysin expression, a key player in interference with the host immune response [59,60]. Finally, the top-scoring lytC gene encodes a lysozyme (or glycoside hydrolase) which can be found in a number of secretions, such as tears, saliva and mucus, with the potential to damage (interspecies) bacterial cell walls by catalyzing hydrolysis of linkages and residues in peptidoglycans and chitodextrins [61, 62].
Several top-scoring genes for SC classification are also key determinants of phenotype.
A number of top scoring genes (ex. sodA, groEL, groES, lmb) in predicting SC have previously been demonstrated to be powerful discriminators of genealogy in a range of bacterial species. For instance, sodA, encoding for the manganese superoxide dismutase, critical against oxidative stress and linked to both survival and virulence, has been highlighted in numerous studies for its relevance in identification of rare clones of pneumococci [63, 64] and Streptococci at the species level [65,66]. Also, the lmb gene encodes for an extracellular protein with a key role in physiology and pathogenicity [67,68], and homologs of this protein have been documented to be present and discriminatory of at least 25 groups of the Streptococcus genus with possible similar functions [69, 70].
Certain top-scoring genes were strongly associated with phenotype such as cell-shape, virulence or invasiveness. For instance, glycolytic enzymes (GE) such as the one encoded by the top-scoring gene pdhB have long been regarded as virulence factors [71] and are involved in cytosol-located metabolic D R A F T processes. When transported to the surface, the PdhB proteincomplex is known to interact with host factors such as the extracellular matrix and fibrinolysis system [72]. Critically, Mycoplasma pneumoniae's pdhB is involved in the degradation of human fibrinogen and is also able to bind human fibronectin [72,73]. Fibronectin is commonly found in human saliva, presenting a vast set of functions, from prevention to colonization of the oral cavity and pharynx, to involvement in adhesion and wound healing [74]. Another top gene, pclA, encodes for the pneumococcal collagen-like protein A, a top candidate for human collagen mimicry [75], involved in host-cell adherence and invasion [76]. Binding to fibronectin and collagen are common strategies employed by various invading bacterial pathogens to colonize or disseminate within the host [77,78].
In ovococcus bacteria such as S. pneumoniae the function of the top-scoring protein MreD (the Rod shape-determining protein) is unknown. It is therefore down to speculation on why this protein is a good predictor of SC, but since the depletion of MreD protein can cause cells to stop growing, become spherical, form chains and lyse, its selection hints on the possibility that variation in this gene may dictate specific lineage differences in cell-shape phenotype [79]. We also find the genes designated as SPN23F11320 and SPN23F09460 to be relevant for SC classification, which in our dataset represent about 13% of all non-putative GCN5-related, N-acetyltransferases of the (GNAT) family. These are key proteins involved in acetylation, and there is growing evidence in the literature of their role in regulation of central carbon metabolism and phenotype through epigenetics [80,81].
Overall, the characteristics of these top-scoring genes differed significantly from those which were successful in predicting serotype and, contrary to expectations from a population structured mainly by neutral evolution, we found the top-scoring genes for SC (ancestry) not to be uniformly distributed across the genome. Most strikingly, 25% of the top scoring genes for SC were contiguous and contained the groESL operon, which includes the GroEL and GroES chaperon proteins (Table 2). Other studies have reported the power of the groESL operon and its proteins to ascertain phylogeny and classification within the Streptococcus genus [82] and between species of the Viridans and Mutans Streptococci groups [83,84]. We also noted the top-scoring gene recX is in close proximity to the groESL operon, which encodes a regulatory protein that inhibits the RecA recombinase in multiple species of bacteria [85-88].

Discussion
We have presented a novel technique for attempting to distinguish the effects of selection from neutral processes giving rise to population structure by applying a machine learning algorithm to genomic data. Our strategy involves applying a Random Forest Algorithm (RFA) to predict particular features (serotype or Sequence Cluster) of each isolate from information on the allelic composition of all isolates. By comparing the contribution of different genes as reflected in their RFA scores in predicting serotype or Sequence Cluster, inferences can be made concerning the evolutionary processes underlying their formation, relationship and maintenance at the population level. We performed this analysis on a dataset containing 616 whole genomes of S. pneumoniae collected in Massachusetts (USA) [3] , for each of which we had obtained allelic profiles Letters a to h in the second column denote groups of contiguous genes.
Classification success of Sequence Cluster (SC) to which each isolate belonged was achieved almost perfectly by the RFA. This is a reflection of the strong correspondence between taxonomy and classification trees based on genetic information, as explored in recent studies [19], and demonstrated by Austerlitz and colleagues when comparing the success of RFA, neighbour-joining and maximum-likelihood (PhyML) methodologies on simulated and empirical genetic data [89]. Classification of serotype by the RFA was more variable and, most importantly, there was no overlap between the genes which appeared to be most important in determining serotype and those which scored highly in identifying SC. As might be expected, genes of the capsular locus (cps) and many of those flanking it achieved high RFA scores in predicting serotype but did not perform better than average in predicting SC. Interestingly, none of the genes among the MLST loci showed a consistently strong association with SC across sensitivity Lourenço et al .   497  498  499  500  501  502  503  504  505  506  507  508  509  510  511  512  513  514  515  516  517  518  519  520  521  522  523  524  525  526  527  528  529  530  531  532  533  534  535  536  537  538  539  540  541  542  543  544  545  546  547  548  549  550  551  552  553  554  555  556  557  558   559  560  561  562  563  564  565  566  567  568  569  570  571  572  573  574  575  576  577  578  579  580  581  582  583  584  585  586  587  588  589  590  591  592  593  594  595  596  597  598  599  600  601  602  603  604  605  606  607  608  609  610  611  612  613  614  615  616  617  618 619 620 experiments and all performed no better at predicting SC than serotype. We encountered difficulties in using the entire dataset due to the large number of putative deletions recorded. Some of these proved to be a result of the extreme diversity of genes (such as wzx and wzy within the capsular locus) which interfered with their alignment to the reference serotype 23F genome (ATCC 700669). In the entire dataset, around 7% of the genes had over > 80% deletions/mismatches recorded, 10% had > 50% deletions/mismatches recorded, and just over 25% of the data had to be discarded if we rejected all genes with an excess of > 1% of deletions/mismatches. Given these limitations, we repeated the RFA analysis under various cutoffs for percentage of gene deletions/mismatches in a series of sensitivity exercises. While this did not affect the trend of genes within and flanking the cps locus to shift to lower RFA scores when comparing prediction of serotype against prediction of SC, it thwarted Conceptual representation of phylogenetic relationships between serotypes and Sequence Clusters (SC), where the former are defined by variation at the cps locus (arbitrarily designated X, W, Y, Z, M, and L, respectively coloured yellow, purple, green, orange, cyan and pink) and the latter are linked to variation in the groESL operon (arbitrarily designated A and B and respectively coloured red and blue). Circles symbolize genotypes, with size relative to their prevalence. Inner genome arcs represent epistatic links: those with the groESL operon extend across the genome, while links with the cps locus are more local. Within our framework and according to observed patterns [3], most SCs will be dominantly associated with a single serotype. Current vaccine strategies (white area) that target a selection of capsular serotypes can lead to the expansion of non-vaccine serotypes (VISR, [26,29]), potentially within the same sequence cluster (VIMS [10]). Vaccine strategies based on groESL variants (grey area) would target entire SCs instead, including all uncommon serotypes within and thereby preventing their expansion. efforts to ascertain whether any specific associations with serotype existed among other highly variable surface proteins of interest: PspA, choline binding protein CpbA/PspC, the IgA proteases or the histidine triad proteins. Future work of this methodology will rely on the development of more robust methods of allele classification for this category of genes, an area still lacking adequate approaches.

D R A F T
By eliminating all genes with > 1% of deletions/mismatches, we were left with 1581 genes which likely corresponds to the 1500 'core' cluster of orthologous genes (COGs) identified by Croucher et al [2] in their recent analysis of the same dataset. Within this more restricted set, we also observed a clear disjunction between genes that score highly in predicting serotype and the top-scoring genes for predicting SC. Not surprisingly, a significant proportion of genes that were good markers for serotype were found to flank the capsular locus (shown in bold in Table 1), although there were a number which were distal to it. A high proportion of genes scoring highly for serotype prediction were associated with key functions in metabolism and very likely defined unique 'metabolic types', but since most were in proximity to the cps locus, it was not possible to determine whether these had become segregated through resource competition [10] or by physical and/or functional associations with this locus. The presence of several co-functional, co-transcribed or co-localizing sets of genes (eg. the gnd and ritR genes, the pit, mva and vra operons, and the penicillinbinding genes) on this list (Table 1) argues, however, that the evolution of these serotype-associated traits may best be understood within a modular framework in which different serotypes are characterized by particular combinations of these units.  621  622  623  624  625  626  627  628  629  630  631  632  633  634  635  636  637  638  639  640  641  642  643  644  645  646  647  648  649  650  651  652  653  654  655  656  657  658  659  660  661  662  663  664  665  666  667  668  669  670  671  672  673  674  675  676  677  678  679  680  681  682   683  684  685  686  687  688  689  690  691  692  693  694  695  696  697  698  699  700  701  702  703  704  705  706  707  708  709  710  711  712  713  714  715  716  717  718  719  720  721  722  723  724  725  726  727  728  729  730  731  732  733  734  735  736  737  738  739  740  741  742 743 744

D R A F T
Genes that were highly informative for SC classification were also not uniformly distributed across the genome, with around a quarter of them co-localizing within and around the groESL operon (shown in bold in Table 2), encoding a chaperonin system which remains a paradigm of macromolecular machinery for protein folding [90]. Apart from assisting protein folding by preventing inappropriate interactions between non-native polypeptides [90], this system may also buffer deleterious effects of mutations on protein foldability and stability [91], with important consequences for protein evolution. The protein GroEL is also highly immunogenic for different bacterial species and has been shown to provide strain-specific protection in vaccine studies [92][93][94]. This raises the radically alternative possibility that sequence clustering may have arisen from immune selection operating on these genes in conjunction with epistatic interactions between the relevant heat shock proteins and the loci encoding the proteins they are chaperoning. The associations between serotype and SC may thus be primarily driven by immune selection operating on multiple immunogenic loci (in this case, cps and groESL) causing them to be organized into non-overlapping combinations, as predicted by strain theory of host-pathogen systems [95,96]. It has previously been proposed that immune selection acting jointly on capsular and sub-capsular antigens could account for the maintenance of these associations [29]. Immunological selection of unique combinations of cps and groESL, however, has the additional advantage of consolidating the link with a range of other genes across the genome through essential epistatic and highly specific (chaperoning) interactions with GroEL and GroES [90].
Our results are in broad agreement with the framework proposed by Croucher and colleagues [2], based on their analysis of the same dataset, in which lineage structure is maintained by infrequent transfer of modular elements ("macroevolution") and provides a stable backdrop for more frequent, and often transient, "microevolutionary" changes (see Figure 3). The differentiation of the groESL operon is potentially a striking example of "macroevolution", being specific not only to S. pneumoniae sequence clusters but also serving to genealogically distinguish closely related bacterial species [82][83][84]. We propose that this is the evolutionary outcome of a combination of immune selection and epistasis operating on specific modules, such as groESL, rather than neutral processes. Selection would also operate at a "microevolutionary" level in creating (more transient) associations between SC and serotype as means of avoiding immunological and direct resource competition [6,10,29]. We note that genes belonging to the Rec family are positioned in close proximity to both the contiguous clusters of top-scoring genes for SC and serotype (Tables 1 and  2) and would argue that these endorse the role of restrictionmodification systems (RMS) in protecting the modularity of the genome [2], and that population structure arises through selection favouring particular combinations of variants of these modules. Our analyses support the hypothesis that lineage structure in maintained by co-adaptation and competition [6,10] and show, unexpectedly, that these selection pressures converge upon the same locus, namely the groESL operon, and strongly endorse the development of vaccines targeting the associated chaperone protein GroEL to avoid vaccine induced changes in the population structure such as Vaccine Induced Serotype Replacement (VISR, [26,29]) or Vaccine Induced Metabolic Shift (VIMS, [10]) which have the potential of greatly reducing the benefits of capsular serotype targeted interventions.  [3] for collection details). In summary, allelic notation was carried out using the BIGSdb software with an automated BLAST process [97], and the genomes were analysed using the Genome Comparator tool (with ATCC 700669, serotype 23F, accession number FM211187, as the reference). Alleles identical to the reference were classified as '1', with subsequent sequences, differing at least by one base, labelled in increasing order. Genes were further classified as allele 'X' when genetic data present had no match to the genome of interest, or were found to be truncated or non-coding (see S1 Dataset of [10] for a visual representation of allele annotation and diversity). The allelic matrix as obtained by this approach and used in the RFA analysis is herein made available in supplementary Table S1, which also includes the Accession Numbers, gene name, gene product, gene position in reference genome, and year of collection, Sequence Cluster and serotype of each sample.

Random Forest Approach.
We implement a machine learning approach based on a Random Forest Algorithm (RFA) to predict particular features (serotype or Sequence Cluster) of each pneumococci isolate from information on the allelic composition of 2135 genes [16]. In summary, the RFA process takes the following pseudosteps: (I) the response variable and predictor variables are chosen by the user; (II) a predefined number of independent bootstrap samples are drawn from the dataset with replacement, and a classification tree is fit to each sample containing roughly 2/3 of the data, for which predictor variable selection on each node split in the tree is conducted using only a small random subset of predictor variables; (III) the complete set of trees, one for each bootstrap sample, composes the random forest, from which the status (classification) of the response variable is predicted as an average (majority vote) of the predictions of all trees. Compared to single classification tress, RFs increase prediction accuracy, since the ensemble of slight different classification results adjusts for the instability of the individual trees and avoids data overfitting [98].
Here we use randomForest: Breiman and Cutler's Random Forests for Classification and Regression, a software package for the R-statistical environment [99]. Predictor variables are set to be each gene in our genome samples and the response variable is set to the serotype or Sequence Cluster classification of each genome (as per [3]). We use the Mean Decrease Accuracy (MDA), or Breiman-Cutler importance, as a measure of predictor variable importance, for which classification accuracy after data permutation of a predictor variable is subtracted from the accuracy without permutation, and averaged over all trees in the RF to give an importance value [98]. For the results presented in the main text, we assume the predictor variables to be numerical (as opposed to categorical). This assumption is known to introduce RF biases, as classification is effectively made by regression and artificial correlations between allele numbering and the features being selected (serotype and Sequence Cluster) may be present. The assumption is herein necessary since the RFA R-based implementation (version 3.6.12) has an upper limit of 53 categories per predictor variable and we find some genes to present up to 6 times this limit in allele diversity. The categorical constraint is a common feature of RFA implementations, as predictor variables with N categories imply 2 N possible (binary) combinations for an internal node split, making  745  746  747  748  749  750  751  752  753  754  755  756  757  758  759  760  761  762  763  764  765  766  767  768  769  770  771  772  773  774  775  776  777  778  779  780  781  782  783  784  785  786  787  788  789  790  791  792  793  794  795  796  797  798  799  800  801  802  803  804  805  806   807  808  809  810  811  812  813  814  815  816  817  818  819  820  821  822  823  824  825  826  827  828  829  830  831  832  833  834  835  836  837  838  839  840  841  842  843  844  845  846  847  848  849  850  851  852  853  854  855  856  857  858  859  860  861  862  863  864  865  866 867 868 D R A F T the RFA method computationally impractical. Given this inherent RFA limitation, we implemented an input shuffling strategy to minimize potential bias. For this, M random permutations of each gene's allelic numbering in the original dataset is performed, effectively creating M independent input matrices. The RFA is run over the input matrices and in the main results we present each gene's average MDA score. A sensitivity analysis was performed by comparing RFA results between two independent sets of M = 50 input matrices (effectively comparing 100 independent runs) (Fig.  S5). Results suggest that the existing biases in independent runs of the RFA due to the assumption of numerical predictors are virtually mitigated with our shuffling approach, specially for experiments classifying serotype.

ACKNOWLEDGMENTS.
The authors acknowledge the sequence data and constructive comments by Angela Brueggemann and Andries van Tonder.  1055  1056  1057  1058  1059  1060  1061  1062  1063  1064  1065  1066  1067  1068  1069  1070  1071  1072  1073  1074  1075  1076  1077  1078  1079  1080  1081  1082  1083  1084  1085  1086  1087  1088  1089  1090  1091  1092  1093  1094  1095  1096  1097  1098  1099  1100  1101  1102  1103  1104  1105  1106  1107  1108  1109  1110  1111  1112  1113  1114 1115 1116 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q