Lineage structure of Streptococcus pneumoniae may be driven by immune selection on the groEL heat-shock protein

Populations of Streptococcus pneumoniae (SP) are typically structured into groups of closely related organisms or lineages, but it is not clear whether they are maintained by selection or neutral processes. Here, we attempt to address this question by applying a machine learning technique to SP whole genomes. Our results indicate that lineages evolved through immune selection on the groEL chaperone protein. The groEL protein is part of the groESL operon and enables a large range of proteins to fold correctly within the physical environment of the nasopharynx, thereby explaining why lineage structure is so stable within SP despite high levels of genetic transfer. SP is also antigenically diverse, exhibiting a variety of distinct capsular serotypes. Associations exist between lineage and capsular serotype but these can be easily perturbed, such as by vaccination. Overall, our analyses indicate that the evolution of SP can be conceptualized as the rearrangement of modular functional units occurring on several different timescales under different pressures: some patterns have locked in early (such as the epistatic interactions between groESL and a constellation of other genes) and preserve the differentiation of lineages, while others (such as the associations between capsular serotype and lineage) remain in continuous flux.

Genes were aligned across serotype using a highly restrictive routine penalizing the introduction of gaps (MUSCLE software parameters: -gapopen -12.0 -gapextend -1.0). Gaps are shown in white, Adenine in red, Thymine in green, Cytosine in blue, and Guanine in yellow. Each gene is only represented by the serotypes for which sequences were found; and genes for which 3 or less serotypes were represented are not displayed (wchV, wchW, wchX ). The red/blue colours herein used in serotype labels mirror Varvio et al. serotype categories according to statistically, highly supported sequence clusters found in that study. Accordingly, we find genes wzg and rmlABCD to present no population structure, while genes wzh, wzd, wze and wchA present clear divergence patterns between the two groups. As expected, given their sero-specific origins, wzx and wzy present a significant number of alignment gaps. This contrasts the other genes, although some structure can be seen within serogroups for genes wzx and wzy (ex: serogroups 6 and 23). q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  The results show a very high correlation between the scores obtained for the same genes in the two input sets, specially for Serotype classification. This demonstrates that by averaging scores over N input randomized matrices, the existing RFA bias due to allelic numeration is mitigated. In particular, the variation of mean scores for the top 2.5% ranking sites (green shaded area) is shown to be small, specially for Serotype classification, such that the same genes are effectively selected on that upper limit in any set of N input matrices.

SC, 1% cutoff (zoom in)
gene position RFA normalised score Figure 6: RFA ranking versus gene position. The RFA normalised scores for 4 scenarios are presented: Serotype and SC classification, with no cutoff (left column) and 1% cutoff (right column) for gene mismatches (1% cutoff as a rejection of genes with an excess of 99% of mismatches). For simplicity, we represent the circular genome in a linear form, with the first gene at position 186 being dnaA and the last gene at position 2220530 being parB. (left column) The genes with highest scores for serotype classification are seen to cluster around the capsular locus (blue area). The bottom two subplots, presenting a zoom in the capsular region show that core capsular genes (centre of yellow region) score highly, but flanking genes also present high scores with decreasing values with distance. (right column) Core genes from the capsular locus (centre of blue and yellow areas) are not considered, since these present an excess of 99% of mismatches to the reference genome. However, genes flanking the capsular locus still score highly for serotype classification and scores decrease with distance. 6 rand.   Table 1 (main text) that are 100 genes away from dexB. (C) Linkage between the gene groEL and all other genes at a maximum distance of 100 genes. Gree points are the genes of Table 2 (main text) that are 100 genes away from groEL. In B,C subplots, the vertical dashed lines mark the gene of reference (dexB or groEL). In A,B,C subplots, linkage is calculated according to Lewontin's normalization and excluding all genes which showed mismatches or deletions above a threshold of 1% (exclusions are marked with grey areas in subplots B,C ). Figure 8: Comparison of random forest SC classification success rates when using the groESL cluster of genes or random genes. Results from experiments are presented in different colors, in blue when using the 10 genes highly informative for SC classification clustering around the groESL locus (see Figure 2, Table 2), in red when using 10 genes randomly selected from the genome, in green when using using random sets of 10 contiguous genes (presented is the mean classification for 10 RFA runs using 10 independent clusters of such 10 genes). The groESL cluster is seen to be the most informative (100% of the SCs are predicted with accuracy above or equal to 90%) when compared to 10 random genes (for which ≈ 43% of SCs have classification success lower than 90% and ≈ 31% have classification success lower than or equal to 70%), or for the mean classification of 10 contiguous genes (81% of the SCs are predicted with accuracy above or equal to 90%, and only 12% of SCs are better classified when compared to the results of the groESL cluster). Only genes with less than 1% of mismatches are included. Figure 9: Distribution of distances within clusters of 10 genes sampled from the top SC predictor genes. The main results describe the unexpected result of a cluster of 10 genes within and around the groESL operon when classifying SC (genes shown in Figure 2, Table 2). To quantify the significance level of this finding, we here present the distribution of distances found among such top genes. From the 41 genes found as good predictors for SC (Table 2), we sample 10,000 independent sets of 10 genes. From each set, we use the each gene's genome position to calculate the standard-deviation (StDev) of the distance (number of genes) between every 2 genes in the set. Given the circular nature of the genome, we use the minimum possible distance. The resulting distribution is shown in the plot. The 95% boundaries are presented in red (dashed lines). The StDev of the distances found within the cluster of 10 genes including the groESL operon is shown in blue (dotted line). Finding such cluster of genes has a p-value of ≈ 1.52e − 06 and is therefore significant in the background of the observed distribution. The results show a very high correlation between the scores obtained for the same genes in the two input sets. Only genes with less than 1% of mismatches are included. These plots demonstrate that by averaging scores over N input randomized matrices, the existing RFA bias due to allelic numeration is mitigated. In particular, the variation of mean scores for the top 2.5% ranking sites (green shaded area) is shown to be small (and smaller than when considering all genes, Fig S5), such that the same genes are effectively selected on that upper limit in any set of N input matrices. For the cluster of 10 genes around the groESL operon, found to be informative for SC in the results of the main text ( Figure 2B,  [5]. One of the pathway's main products, the isopentenyl pyrophosphate (IPP), is used to make isoprenoids, a diverse class of over 30,000 biomolecules. In bacteria, the principal products of IPP include the lipid carrier undecaprenol (involved in wall biosynthesis), plus a range of menaquinones and ubiquinones both involved in electron transport, and the latter also in aerobic cellular respiration [6][7][8]. In S.
pneumoniae, these two genes are essential for growth and are proposed to be part of a single operon [6].
spuA Glycogen metabolism The SpuA protein is involved in alpha-glucan metabolism, whose main substrate is glycogen (polysaccharide of glucose), an abundant resource in human lung epithelial cells [9][10]. The protein SpuA has also been shown to be highly immunogenic [11]. These genes were located within the pit operon, encoding for an ABC transporter involved in iron uptake. In line with our findings, the pit operon has previously been shown to exhibit strain-specific variation [15].
gnd Iron uptake Regulator of iron transport [16]. Genes flanking the capsular locus involved in the cell wall biosynthesis pathway [22]. Mutations in these genes can lead to penicillin resistance, and single-nucleotide positions associate strongly with S. pneumoniae beta-lactam resistance [23][24][25]. In S. pneumoniae, pbp1A is also involved in the formation of the septum during cell division [26]. pbp1A is associated in a two-gene operon with another top-scoring gene, recU [27][28]. blpH Bacteriocin production Part of the BlpABCSRH pathway [30], which regulates production of class II bacteriocins and related immunity proteins [31][32].
glmS Production of ammonia In related species, the aminotransferase GlmS is known to upregulate the production of ammonia thereby increasing acid tolerance and survival [33]. lytC Lysozyme production Encodes a lysozyme (or glycoside hydrolase) which can be found in a number of secretions, such as tears, saliva and mucus, with the potential to damage (interspecies) bacterial cell walls by catalyzing hydrolysis of linkages and residues in peptidoglycans and chitodextrins [36][37].

Description of top-scoring genes highly informative for SC classification
Gene Function / Type Description / Background sodA Survival Virulence Encodes for the manganese superoxide dismutase, critical against oxidative stress and linked to both survival and virulence; has been highlighted in numerous studies for its relevance in identification of rare clones of pneumococci [38][39] and Streptococci at the species level [40][41].
lmb Virulence

Laminin binding
Encodes for an extracellular protein (Lmb) with a key role in physiology and pathogenicity [42][43], and homologs of this protein have been documented to be present and discriminatory of at least 25 groups of the Streptococcus genus with possible similar functions [44][45]. The Lmb protein also called laminin-binding protein, given its capacity to bind to laminin present in the host's extracellular matrix and is immunogenic [45]. regarded as virulence factors [46] and are involved in cytosol-located metabolic processes. When transported to the surface, the PdhB protein-complex is known to interact with host factors such as the extracellular matrix and fibrinolysis system [47]. Critically, Mycoplasma pneumoniae's pdhB is involved in the degradation of human fibrinogen and is also able to bind human fibronectin [47][48]. Fibronectin is commonly found in human saliva, presenting a vast set of functions, from prevention to colonization of the oral cavity and pharynx, to involvement in adhesion and wound healing [49]. The gene recX is in close proximity to the groESL operon, which encodes a regulatory protein that inhibits the RecA recombinase in multiple species of bacteria [56][57][58][59]. Its immediate function is to regulate recombination. groEL groES Protein folding Encode a chaperonin system for protein folding [60]. Apart from assisting protein folding by preventing inappropriate interactions between non-native polypeptides [60], this system may also buffer deleterious effects of mutations on protein foldability and stability [61]. The protein GroEL is highly immunogenic for different bacterial species and has been shown to provide strain-specific protection in vaccine studies [62][63][64]. Other studies have reported the power of the groESL operon and its proteins to ascertain phylogeny and classification within the {Streptococcus genus [65] and between species of the Viridans and Mutans Streptococci groups [66][67].
carB Production of

O-antigen
Many bacterial species use O-antigen to avoid phagocytosis and to resist the lytic action of the complement system [68]. For Salmonella typhimurium it is known that o-antigen's immunogenicity is species dependent [69]. In E. coli, over 160 different o-antigen structures are known to be produced and are strain dependent [70].
vanZ Virulence Resistance to teicoplanin The protein VanZ is linked to teicoplanin resistance in Enterococcus faecium [71] and is also essential for lung infection in S. pneumoniae [72]. persistence on mucosa by Neisseria species [74] and pathogenesis of invasive disease by Haemophilus influenzae type b [73].