Identifying genes associated with invasive disease in S. pneumoniae by applying a machine learning approach to whole genome sequence typing data


Streptococcus pneumoniae, a normal commensal of the upper respiratory tract, is a major public health concern, responsible for substantial global morbidity and mortality due to pneumonia, meningitis and sepsis. Why some pneumococci invade the bloodstream or CSF (so-called invasive pneumococcal disease; IPD) is uncertain. In this study we identify genes associated with IPD. We transform whole genome sequence (WGS) data into a sequence typing scheme, while avoiding the caveat of using an arbitrary genome as a reference by substituting it with a constructed pangenome. We then employ a random forest machine-learning algorithm on the transformed data, and find 43 genes consistently associated with IPD across three geographically distinct WGS data sets of pneumococcal carriage isolates. Of the genes we identified as associated with IPD, we find 23 genes previously shown to be directly relevant to IPD, as well as 18 uncharacterized genes. We suggest that these uncharacterized genes identified by us are also likely to be relevant for IPD.


Invasive pneumococcal disease (IPD) is defined as an infection in which the bacterial pathogen Streptococcus pneumoniae (pneumococcus) enters a usually sterile site, such as the blood or cerebrospinal fluid1. Although pneumococci are usually carried asymptomatically within the human nasopharynx, IPD is often life-threatening and constitutes a major cause of mortality, disproportionately affecting children, the elderly and immunosuppressed individuals2,3. Genetic changes facilitating the survival of pneumococci during invasion have been previously identified and described through experimental and bioinformatic methods4,5,6,7,8,9,10. The work of Hava and Camilli, for instance, describes a set of 378 genes that are associated with attenuated virulence in mice in the model pneumococcal strain TIGR4 (ref. 4). Several other works have been successful in identifying differential expression patterns of key virulence genes of S. pneumoniae in vitro and in vivo. These works used RT-PCR on previously described virulence factors and high-throughput microarray expression profiling to identify gene expression signatures during invasion of model organisms or growth on epithelial cell lines5,6. DNA microarrays have also been employed to identify a common core genome differentiating between strains isolated from invasive disease or carriage in three pneumococcal serotypes often found in IPD (6A, 6B and 14)8. Although these methods did highlight features involved in the ability of pneumococci to invade a host, they were limited by using a small sample size, focusing on only a fraction of the pneumococcal serotypes, or relying on a single reference genome to identify patterns of differential gene expression and gene presence in strains isolated from IPD.
Recent studies that used whole-genome sequence data failed to identify adaptive differences, in terms of presence and absence of genes or genetic mutations, between strains invading the blood and strains that were able to cross the blood-brain barrier11,12,13. These works highlight the need for future research to determine comprehensively whether adaptation of IPD isolates occurs through genetic variation between carriage and invasion, and suggest that subtle changes may influence the virulence of bacterial isolates.

Here, we searched for genetic differences between pneumococcal carriage and IPD isolates using a whole genome sequence typing approach. In this approach, a reference genome is designated, every gene in a data set is assigned an allelic code based on the reference genome, and new alleles are defined as gene variants containing any change from previously defined alleles (see Methods). This is an extension of the well-known multilocus sequence typing (MLST) method14,15. However, in contrast to MLST, which is based on a small number of conserved genes usually present in all isolates of a bacterial species, sequence typing of an entire bacterial genome must accommodate substantial variation in gene presence.
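The allelic coding described above can be illustrated with a minimal sketch. It is not the BIGSdb implementation: the function name `type_alleles`, the toy sequences, and the convention of coding an absent gene as 0 are assumptions made for illustration, and exact string identity stands in for the matching rules of the actual typing service.

```python
from typing import Dict, Optional


def type_alleles(
    isolates: Dict[str, Dict[str, Optional[str]]]
) -> Dict[str, Dict[str, int]]:
    """Assign an integer allele code per gene to every isolate.

    `isolates` maps isolate id -> {gene name -> sequence, or None if absent}.
    Code 0 marks an absent gene (an assumption of this sketch). Any sequence
    not previously seen for a gene receives the next unused code, so a
    single-base change is enough to define a new allele.
    """
    known: Dict[str, Dict[str, int]] = {}  # gene -> {sequence -> allele code}
    profiles: Dict[str, Dict[str, int]] = {}
    for isolate_id, genes in isolates.items():
        profile: Dict[str, int] = {}
        for gene, seq in genes.items():
            if seq is None:
                profile[gene] = 0  # gene absent from this isolate
                continue
            alleles = known.setdefault(gene, {})
            if seq not in alleles:
                alleles[seq] = len(alleles) + 1  # new allelic category
            profile[gene] = alleles[seq]
        profiles[isolate_id] = profile
    return profiles


# Hypothetical toy data: three isolates typed against two reference genes.
toy = {
    "iso1": {"geneA": "ATGAAA", "geneB": "ATGCCC"},
    "iso2": {"geneA": "ATGAAG", "geneB": None},
    "iso3": {"geneA": "ATGAAA", "geneB": "ATGCCC"},
}
profiles = type_alleles(toy)
```

In this toy input, `iso2` differs from `iso1` by one base in `geneA` and lacks `geneB`, so both variation within a gene and gene absence end up encoded in the same profile.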

Hence, using a particular reference genome for such an analysis might preclude identification of genes not found in a certain serotype. To overcome this caveat, we constructed a pangenome of our IPD samples and used it as a synthetic reference genome. We employed the random forest algorithm (RFA), a machine learning algorithm commonly applied to genomic data16 and previously used by our group to identify genes associated with immune selection and lineage structure in pneumococci17. Furthermore, to reduce the confounding effect of genes associated with bacterial lineages rather than IPD, we performed this analysis on three data sets, each time selecting carriage isolates from a different country (UK, USA and Iceland). These carriage isolates were compared against pneumococcal blood isolates originating from heterogeneous geographical locations. The final result consisted of 43 genes ranked in all three datasets among the most predictive genes for IPD. We characterize these genes, find that many of them are supported in the literature as associated with IPD, and compare our results to presence-absence and tree-based approaches. Finally, we analyze the identified genes’ length and location on the pneumococcal genome relative to capsule-determining loci.


We obtained 378 invasive pneumococcal isolates causing bacteremia, from different countries, as presented in Table S1. The number of invasive isolates was limited by public availability of WGS samples marked as isolated from blood. A pangenome of 9032 genes was generated from this data set, from which all genes in the soft-accessory genome (defined as genes appearing in at least 15% of samples) were used as a reference genome for the sequence typing process. The sequence typing process was applied three times: on the invasive disease isolates (n = 378) joined with a data set of carriage isolates from the UK (n = 520), USA (n = 622) and Iceland (n = 622). The three datasets were not combined into a single data set for two reasons. First, comparing the results from three different countries constitutes a more conservative approach, increasing the probability of finding genes truly associated with invasive disease, rather than associated with lineages more prevalent in certain datasets. Second, the computational complexity of sequence typing increases non-linearly with the number of genomes. Therefore, we used all carriage isolates available from the UK, and as many isolates from Iceland and the USA as possible while maintaining a total of no more than 1000 sequences (the limit in BIGSdb, the web server used for the typing service; see Methods).

Following this rationale, RFA was applied to each of these data sets with invasive/non-invasive disease as the predicted variable. The out-of-bag (OOB) classification success18 was similar for the three datasets, with 94.6% (95% CI 94.6–94.7), 93.2% (95% CI 93.1–93.2), and 94.4% (95% CI 94.4–94.5) success for carriage, and 76.7% (95% CI 76.7–76.8), 86.7% (95% CI 86.7–86.8), and 79.7% (95% CI 79.6–79.8) success for bacteremia in Iceland, the UK, and USA, respectively. For each data set, the top 100 genes with the highest importance score were chosen, using a heuristic method aiming to maximize the number of joint genes (see Methods), and then recorded and compared. Out of these, 43 were joint to all three data sets. The probability of this many, or more, genes being joint to all three data sets in a random selection of 100 genes (i.e. the p-value for the null hypothesis of the RFA choosing genes randomly) was verified via simulations to be <10^-6. Furthermore, RFA was run again using only the 43 genes joint to the three data sets. The OOB classification success was 93.2% (95% CI 93.1–93.2), 96% (95% CI 95.9–96), and 95.9% (95% CI 95.9–96) for carriage, and 73.4% (95% CI 73.3–73.4), 90.1% (95% CI 90–90.2), and 80.6% (95% CI 80.4–80.7) for bacteremia, in Iceland, the UK, and USA, respectively. The comparable accuracy when using only the 43 joint genes indicates that they provide sufficient information to classify invasive versus non-invasive pneumococcal strains. All the identified genes are presented in Table 1. We compared our method to two established analysis methods. First, we repeated the analysis based on genome-wide presence and absence of genes, rather than their alleles, using the software Scoary19 (see Methods for details). No genes were identified as jointly highly predictive in all three data sets using Scoary, even when the top-300 ranked genes were considered.
We then applied a sequence-based maximum likelihood phylogeny of the core genes of each dataset20. This method also could not capture the evolutionary changes between the invasive and carriage isolates, as these isolates remained scattered across different clades (see Methods and Supplementary Material Figs S3–S5).
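The null hypothesis mentioned above (the RFA choosing its top 100 genes at random in each data set) can be simulated in a few lines. This is a sketch under stated assumptions rather than the authors' simulation: it assumes the genes are drawn from the 2649 soft-accessory genes reported in the Methods, and the function name `simulate_joint_counts` and the number of simulations are illustrative.

```python
import random
from typing import List


def simulate_joint_counts(
    n_genes: int, top_k: int, n_datasets: int, n_sims: int, seed: int = 0
) -> List[int]:
    """Count genes shared by all data sets when each picks top_k genes at random."""
    rng = random.Random(seed)
    genes = range(n_genes)
    counts = []
    for _ in range(n_sims):
        joint = set(rng.sample(genes, top_k))
        for _ in range(n_datasets - 1):
            joint &= set(rng.sample(genes, top_k))
        counts.append(len(joint))
    return counts


# Assumed setting: 2649 typed genes, top 100 per data set, three data sets.
counts = simulate_joint_counts(n_genes=2649, top_k=100, n_datasets=3, n_sims=2000)
observed = 43

# Empirical upper bound on the p-value (add-one correction avoids reporting 0).
p_upper = (sum(c >= observed for c in counts) + 1) / (len(counts) + 1)
```

Under this null the expected number of jointly selected genes is about 2649 × (100/2649)³ ≈ 0.0014, so an overlap of 43 is far outside the simulated range. With only 2000 simulations the empirical bound cannot go below 1/2001; establishing a bound as small as the one reported would require on the order of a million or more simulations.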

Table 1 Genes associated with IPD.

Interestingly, 23 of the genes we identified had BLAST matches with genes previously found to be associated with invasive disease or with the immune response to it. A further 18 genes were found to encode hypothetical proteins with unknown functions, and 2 genes were found to encode transposases, which catalyse the rearrangement of mobile genetic elements in the bacterial chromosome21. As a control measure, we performed a similar BLAST analysis on 43 of the jointly lowest-ranked genes (i.e. the worst predictors as determined by the RFA). These genes comprised 8 ribosomal genes, 7 metabolism genes, 3 translation/transcription regulation genes, 4 bacteriocin-related genes, and various conserved hypothetical proteins (SI Table S5). The only gene found to be related to virulence was ilvE, which encodes an aminotransferase also relevant for lung infection4. As our isolates were derived either from the nasopharynx or from patients’ blood, it might be expected that this gene would not be highly ranked. Thus, virulence-related elements were over-represented in our top-ranked genes, as we would expect from genetic elements associated with invasive pneumococcal isolates.

We explored the characteristics of the identified IPD-associated genes by determining their locations on the pneumococcal genome. Figure 1A shows the locations of the identified genes across the genome of a 19A serotype sample and reveals that they are spread across the pneumococcal genome. Pneumococcal serotypes are known to be differentially associated with IPD22,23, and hence genes located around the capsule polysaccharide synthesis locus might be expected to be involved in IPD. Indeed, several identified genes are found near this gene cluster (orange rectangle on Fig. 1A), but many of the other genes identified are spread across the genome, indicating that our findings do not simply rely on differences in serotype composition between the datasets used for our analysis (for serotype distribution in our data, see Fig. S1). We note that since pneumococcal serotypes have substantial genomic variation, driven by recombination, horizontal gene transfer and events of gene loss or addition, the locations of genes within their respective genomes are not constant24. Regardless, qualitatively similar location distributions were obtained when plotting these genes on other serotype samples (SI Figs S5, S6). In addition, we examined the length of the IPD-associated genes (Fig. 1B), since many of them were found to have BLAST matches to short subsets of known genes (see Table 1). The IPD-associated genes were statistically significantly shorter than those in the soft-accessory genome (Wilcoxon rank-sum test, p-value = 0.00027) but not significantly shorter than those of the entire pangenome (Wilcoxon rank-sum test, p-value = 0.079). Furthermore, genes in the entire pangenome were significantly shorter than those in the soft-accessory genome (Wilcoxon rank-sum test, p-value = 10^-16). That is, the genes we identified are comparable in length to those in the entire pangenome, but are shorter than the soft-accessory genes we used as a reference. Finally, we examined the variation in the identified genes. In both the invasive and the carriage isolates, the identified genes had more allelic categories than genes not identified by our method (Wilcoxon rank-sum test, p-value < 10^-15). The presence of the identified genes in the isolates ranged between approximately 30% and 100%, and was similar between the populations, although slightly lower in the invasive isolates (SI Table S2).

Figure 1

Location and length of genes associated with IPD. (A) Location of identified IPD-associated genes (see Table 1) on a 19A streptococcal genome (accession NC_010380.1). Orange rectangle marks the capsular synthesis locus (CPS). Similar plots using other serotype samples can be found in Figs S5, S6. (B) Boxplots and distributions of log10-transformed gene lengths from the IPD-associated genes, the entire pangenome and the soft-accessory genome used in our analysis (see Methods).

Thus, it seems that shorter, more variable genes with varying presence across pneumococci had a higher probability of being associated with IPD.


In this study we identified pneumococcal genes associated with IPD using a novel method that combines several techniques. First, we encoded WGS data by extending multi-locus sequence typing. This approach enables information to be extracted from gene variants, or alleles, as well as from the presence/absence of genes. Consequently, the sequence typing approach outlined is more sensitive to finding variations within genes without losing information due to the absence of genes.

As a reference genome for the typing scheme, we constructed a genome which included any genes existing in more than 15% of invasive samples, namely the soft-accessory genome. We thus avoided relying on genes present in an arbitrary reference genome for our analysis. This is especially important when typing pneumococcal samples, which have highly variable genomes and can yield a core genome shorter than 50% of an average pneumococcal genome25,26. Using a reference genome constructed in such a way has proved beneficial, as all but three of the genes eventually identified as associated with IPD were present in fewer than 95% of isolates, categorizing them in the soft-accessory genome (SI Table S2).

We then used an RFA to score the genes by their marginal contribution to improving classification of invasive disease and carriage. Our method was implemented on three datasets of pneumococcal carriage samples isolated from different countries, and the top-ranked genes were reduced to only those that were jointly top-ranked in all three datasets. Selecting the jointly top-ranked genes imposes a stringent cutoff for the identified genes, and reduces potential bias introduced due to local ancestry or population structure. It resulted in a total of 43 jointly high scoring genes out of 100 top-ranked genes associated with IPD – implying a relatively high replicability of results across datasets. Additionally, we applied a presence/absence method and a sequence-based phylogenetic approach, which yielded no significant results joint to all three data sets.
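Reducing the rankings to jointly top-ranked genes, and choosing how many top-ranked genes to consider, can be sketched as follows. The Methods state that both the fraction of joint genes and the lower bound of a 95% binomial confidence interval were used; the Wilson interval below is an assumption, since the interval type is not specified, and the function names and toy rankings are hypothetical.

```python
import math
from typing import List, Sequence


def joint_fraction(rankings: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of the top-k genes that appear in the top k of every ranking."""
    joint = set(rankings[0][:k])
    for ranking in rankings[1:]:
        joint &= set(ranking[:k])
    return len(joint) / k


def wilson_lower(p: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson 95% interval for a proportion p out of n."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def choose_k(rankings: Sequence[Sequence[str]], candidates: List[int]) -> int:
    """Pick the cutoff whose joint fraction has the highest lower bound."""
    return max(candidates, key=lambda k: wilson_lower(joint_fraction(rankings, k), k))


# Hypothetical rankings of genes (most important first) from three data sets.
rankings = [
    ["a", "b", "c", "d", "e", "f"],
    ["b", "a", "c", "x", "y", "z"],
    ["c", "b", "a", "p", "q", "r"],
]
best_k = choose_k(rankings, candidates=[1, 2, 3, 6])
```

Using the lower bound rather than the raw fraction penalises very small cutoffs, where a high joint fraction could arise from only a handful of genes.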

Reassuringly, many of the genes we identified are parts of known virulence factors, or are associated with invasive pneumococcal disease and especially with bacteremia (see Table 1). For instance, our method identified the gene lytB as associated with IPD. The LytB protein is involved in the attachment of S. pneumoniae to human nasopharyngeal cells in vitro, and its loss was shown to heavily impair pneumococcal virulence in a mouse sepsis model27,28. Additionally, it was shown that such proteins are essential for successful biofilm production and help pneumococci avoid phagocytosis29. Another gene identified here is the lactate oxidase gene lox. In other streptococcal species, particularly S. mutans, S. pyogenes and S. oligofermentans, H2O2-producing lactate oxidase activity was shown to be used in the absence of glucose and for niche competition30,31. S. pneumoniae is also known to use lactate as an energy source in the absence of glucose, converting lactate to pyruvate with consequent production of H2O2 (ref. 32). It was also recently demonstrated that S. pneumoniae produces hydrogen peroxide in order to facilitate DNA damage, cell apoptosis and ultimately pathogenesis33. Interestingly, homologs of the pspC gene appear in four instances amongst the genes we identified (namely hpp7, hpp10, hpp18 and hpp35). This could be explained by the characteristic polymorphism of the pspC gene: it is known to present high copy-number variation as well as numerous alleles in pneumococcal isolates34,35. PspC is a bacterial surface protein (adhesin) essential for colonization of nasal tissue, as well as eliciting protection against pneumococcal carriage and bacteremia in a mouse model36,37. Moreover, it was found to bind to endothelial blood-brain barrier receptors, facilitating bacterial brain invasion38.
Although the use of PspC was proposed in a non-capsular vaccine, which could confer protection against invasive disease, its high variability has limited its use as a vaccine candidate39. The repeated identification of several copies of pspC by our method strengthens the gene’s importance as a factor contributing to IPD.

Furthermore, among the genes identified here were two encoding transposases. It is known that S. pneumoniae is characterized by a high level of genomic plasticity, which allows the bacterium to react quickly to changes in environmental conditions40. As mobile genetic elements are responsible for the dissemination of phenotypic characteristics in the bacterium, such as antimicrobial resistance41, and are overexpressed in conditions related to virulence, such as during biofilm production42, it is possible to speculate that these mobile genetic elements could be associated with the dissemination of virulence factors amongst the S. pneumoniae species.

Most of the other identified genes were hypothetical, with no known function. Based on our method’s classification success, the fact that the highly ranked genes were identified in the analyses of three independent carriage datasets, and the high presence of known virulence factors among the genes, we believe that the hypothetical genes identified are highly likely to be involved in pneumococcal invasive disease. Of particular interest are identified genes which are farther from the capsular locus (see Fig. 1A), which could potentially be serotype-independent IPD-associated genes and therefore relevant across streptococcal strains. The length of the identified genes was also unusually short relative to the synthetic reference genome we used (Fig. 1B), implying that some previously overlooked short gene/protein sequences may also be involved in IPD. Our analysis suggests that further focus should be turned to shorter sequences and gene fragments, which could be factors contributing to IPD. For comparison, we analyzed the 43 jointly lowest-ranked genes, yielding hits in ribosomal, transcription and translation regulation, metabolic and bacteriocin-related genes, together with conserved hypothetical proteins (SI Table S5).

The main limitation of our method is that all alleles are marked as different ‘states’ of a gene and their degree of similarity/difference is not taken into account. Thus, we can identify which genes are associated with differing phenotypes, but subtler methods will be necessary to discern exactly which alleles are responsible for which phenotypic changes. Other methods using WGS data as input may be able to achieve that, but the vast number of variables needed to encode the features of a full sequence does not easily lend itself to classification methods. A feasible future extension of our method could be adding variables encoding more information about the alleles, such as structural properties of their resulting proteins43. However, such an extension will necessitate an efficient way of combining the genetic and protein information, as the interactions between genes and their translated protein characteristics will likely have a substantial effect on the results.

Additionally, our method does not explicitly account for the potential confounding effects of the different population structure of pneumococci sampled from various locations (although we have previously shown that RFA is able to distinguish genes defining lineages from those defining serotypes)17. We aimed to reduce this confounding factor by using carriage isolates from three different countries, and invasive isolates from various countries (SI Table S1). The weak effect of population structure in our data is corroborated by the failure to cluster the isolates into invasive/carriage groups using WGS and sequence type (ST) based trees (SI Figs S2–S4, S10). Furthermore, examining the different sequence types in our data (which are a proxy for pneumococcal lineages)44 shows the mixed distribution of these among the datasets (SI Fig. S10) and a similar shared percentage of STs between the carriage and invasive data (SI Table 11), indicating that population structure cannot account for the differences between invasive and carriage isolates.

However, by restricting the genes we identify to those that are highly ranked in multiple datasets to reduce confounding by population structure, our method trades sensitivity for specificity. Such an approach may miss genes that are less common in certain datasets, but should reduce the probability of identifying genes that are spuriously correlated with IPD due to sampling or population structure. In light of the multiple identified genes with unknown functions, we considered such a conservative approach appropriate and preferred increasing the certainty of our results over identifying more genes with lower confidence.

Finally, using a pangenome based solely on invasive isolates restricts our findings to genes found in at least some of the invasive isolates. Assuming that most of the relevant genes for invasive disease would be present in some invasive isolates is reasonable if the adaptation for invasive disease is more likely to occur by allelic variations in genes present across pneumococcal types, or by pneumococci gaining new genes facilitating adaptation to invasion. It might, however, disrupt identification of genes that are removed from carriage isolates for adaptation to the invasive environment, if such genes exist. Addressing this issue would be possible by creating a larger pangenome, consisting of all available isolates, but would also be more computationally expensive.

The limitations mentioned above can explain why other genes known to be relevant for IPD, such as ply, prtA, lytA, sodA, cbiO and piuA7,27,45,46, were not identified by our method.

We believe the method presented here can be applied to a variety of pathogens to identify genes responsible for virulent phenotypes. We foresee our approach being particularly useful when the examined pathogens share only a small core genome, such as E. coli47 and C. jejuni48. The goal of our method is to discern with high confidence genes associated with IPD, or any other phenotype, so that their function can eventually be examined experimentally. Accordingly, we hope the hypothetical genes identified in this study will be further analyzed and prove useful for our understanding of invasive pneumococcal disease.


Pangenome construction and sequence typing

A total of 378 genome sequences of S. pneumoniae strains isolated from invasive disease were downloaded from BIGSdb49 with geographical origin of isolates and accession numbers available in SI Table S1.

These genomes were used to build an invasive population pangenome using Roary V.3.6.1 (ref. 50). Briefly, each draft genome downloaded from BIGSdb was re-annotated with PROKKA V1.12 (ref. 51) and the annotation output was fed to Roary for the pangenome construction. Roary parameters were set to a minimum blastp identity of 90% and an MCL inflation value of 1.5.

For the purpose of this analysis, we included in the pangenome the genes present in the soft-accessory genome, i.e. present in >15% of isolates, for a total of 2649 genes. This pangenome was used as the reference genome for sequence typing of three new datasets, containing the invasive sequences together with each one of the three carriage data sets, namely Iceland, the UK or USA. Under the BIGSdb typing scheme, all gene variations in a dataset (defined by any difference between a gene and any previously recognized gene variants) result in new allelic categories. BIGSdb parameter values were set to the webserver defaults: 70% minimum identity for partial matching; 50% minimum alignment for partial matching; BLASTN word size of 20.
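The selection of soft-accessory genes from a pangenome can be sketched as a simple frequency filter. The data structure (a mapping from gene to the set of isolates carrying it), the function name and the toy numbers are assumptions for illustration; in practice this information would come from the gene presence/absence output of the pangenome software.

```python
from typing import Dict, List, Set


def soft_accessory_genes(
    presence: Dict[str, Set[str]], n_isolates: int, min_fraction: float = 0.15
) -> List[str]:
    """Return genes present in more than `min_fraction` of isolates."""
    return sorted(
        gene
        for gene, carriers in presence.items()
        if len(carriers) / n_isolates > min_fraction
    )


# Hypothetical toy pangenome over 20 isolates named iso0..iso19.
all_isolates = [f"iso{i}" for i in range(20)]
presence = {
    "geneA": set(all_isolates),      # 100% of isolates
    "geneB": set(all_isolates[:4]),  # 20% of isolates
    "geneC": set(all_isolates[:2]),  # 10% of isolates
    "geneD": set(all_isolates[:1]),  # 5% of isolates
}
reference_genes = soft_accessory_genes(presence, n_isolates=len(all_isolates))
```

Here only `geneA` and `geneB` exceed the 15% threshold and would enter the synthetic reference used for typing.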

Genome sequences were quality controlled before the pangenome construction by making sure that the total length of each assembly was between 2.0 and 2.3 Mb (the common genome length of completed S. pneumoniae genomes). Moreover, the absence of low-level contamination was ascertained using Kraken v0.10.5 (ref. 52). Briefly, if more than 5% of the total genome assembly sequence was identified as belonging to a different bacterial species, that assembly was removed from further analysis. As shown in Supplementary Material Figs S8 and S9, satisfactory pangenome saturation was reached in terms of core genome and number of new genes added per new genome53.
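The two quality-control criteria above amount to a simple filter over assembly summaries. The sketch below assumes that the total assembly length and the fraction of sequence assigned to other species have already been computed (for example from a taxonomic classifier report); the `Assembly` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Assembly:
    name: str
    total_length_bp: int     # summed length of all contigs
    foreign_fraction: float  # fraction of sequence assigned to another species


def passes_qc(
    assembly: Assembly,
    min_len: int = 2_000_000,
    max_len: int = 2_300_000,
    max_foreign: float = 0.05,
) -> bool:
    """Keep assemblies of 2.0-2.3 Mb with at most 5% foreign sequence."""
    length_ok = min_len <= assembly.total_length_bp <= max_len
    clean = assembly.foreign_fraction <= max_foreign
    return length_ok and clean


# Hypothetical assemblies illustrating each outcome.
assemblies: List[Assembly] = [
    Assembly("ok", 2_100_000, 0.01),
    Assembly("too_short", 1_800_000, 0.00),
    Assembly("contaminated", 2_200_000, 0.08),
]
kept = [a.name for a in assemblies if passes_qc(a)]
```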

Scoary, FastTree, sequence type analysis, and genome-location

A pangenome wide association study (Pan-GWAS) was performed using the Scoary V.1.6.16 pipeline19. Three new pan-genomes were built using 3 different datasets, each including the 378 genome sequences from the invasive disease strains and genomes from the carriage strains isolated from either Iceland, the UK or USA.

Each of the three pan-genomes (invasive + carriage strains) was then input to Scoary using the invasive/carriage origin of the strain as classifier for the pan-GWAS pipeline.

The genes representing the core genomes of the three invasive + carriage datasets (present in more than 99% of the analysed isolates) were concatenated and aligned with MAFFT V.7.221 (ref. 54). The alignment of the core genome was used to reconstruct the maximum likelihood phylogeny of each group of isolates using FastTree V.2.1 (ref. 20) under a generalized time-reversible model. Phylogenetic trees for each dataset were then edited and annotated using Evolview V.2 (ref. 55). Genome location plots were produced using BRIG V.0.95 (ref. 56) with the genome sequence of strains 19A (NC_010380.1), D39 (NC_008533.1) or R6 (NC_003098.1) as references (Figs 1, S5 and S6, respectively).

Pneumococcal sequence typing was carried out according to the PubMLST guidelines, assessing the allelic profiles of 7 housekeeping genes49. For each dataset reported in Fig. S10 (USA + Invasive, UK + Invasive and Iceland + Invasive) a neighbour-joining tree was produced using the alignment of the concatenated sequences of the 7 housekeeping genes and the results were visualised using iTOL57. The phylogenetic trees and the alignments were produced using the BIGSdb–iTOL tool49.

Random forest analysis

Random forest was implemented in R using the randomForest package V.4.6-12 (ref. 58). Allele types were turned into numeric variables in the RFA due to computational limitations. To break any biases such enumerations might introduce, we permuted each allele typing and reran the RFA 200 times on each dataset17. The measure used to rank genes was permutation importance (also known as Breiman-Cutler importance). Under this method, variable values are permuted for the OOB data of each tree, and the OOB error without the permutation is subtracted from the resulting classification error18. The average of this difference across all trees is the permutation importance. These importance measures were ranked for all variables and the rankings were averaged across the 200 permutations of RFA applications on each dataset17. The fraction of genes joint to the three datasets was compared as a function of the number of top-ranked genes selected. To reduce noise due to small samples of top-ranked genes, both the fraction of genes and the lower bound of a 95% binomial confidence interval (with n = number of top-ranked genes and p = fraction of joint genes) were used. In both measures, the maximum fraction corresponded to using 100 top-ranked genes (Fig. S7). Although similar peaks occurred when more top-ranked genes were used, we chose 100 as a conservative threshold (i.e. to reduce the number of false positive genes identified).
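The idea behind permutation importance can be shown with a small, self-contained sketch. It is not the randomForest implementation: a single hypothetical `predict` function and one held-out data set stand in for a tree and its OOB samples, whereas a random forest repeats this per tree and averages across trees. All names and the toy data are assumptions.

```python
import random
from typing import Callable, List

Row = List[int]


def error_rate(predict: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(predict(row) != label for row, label in zip(X, y))
    return wrong / len(y)


def permutation_importance(
    predict: Callable[[Row], int],
    X: List[Row],
    y: List[int],
    feature: int,
    n_repeats: int = 20,
    seed: int = 0,
) -> float:
    """Mean increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and the label
        X_perm = [
            row[:feature] + [value] + row[feature + 1:]
            for row, value in zip(X, column)
        ]
        increases.append(error_rate(predict, X_perm, y) - baseline)
    return sum(increases) / len(increases)


# Toy data: feature 0 determines the label, feature 1 is uninformative.
X = [[1, 0], [1, 1], [1, 0], [1, 1], [1, 0],
     [0, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [row[0] for row in X]


def predict(row: Row) -> int:
    # Hypothetical stand-in classifier that only looks at feature 0.
    return row[0]


imp_informative = permutation_importance(predict, X, y, feature=0)
imp_noise = permutation_importance(predict, X, y, feature=1)
```

Shuffling the informative feature raises the error, giving a positive importance, while shuffling the feature the classifier ignores leaves the error unchanged.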

Functional annotation of genes associated with IPD

All gene sequences in Tables 1 and S5 were first functionally annotated using the NCBI conserved domain search engine. Each DNA and translated amino acid sequence was checked for similarity against known genes and proteins using nucleotide and protein BLAST (megablast and blastp algorithms, respectively). The combined results of the conserved-domains search and BLAST are described in Tables 1 and S5.

Data Availability

Accession numbers for pneumococcal sequences used are listed in SI Table S1; the pangenome built from invasive isolates can be found in SI Table S3.


  1. 1.

    Randle, E., Ninis, N. & Inwald, D. Invasive pneumococcal disease. Archives of Disease in Childhood-Education and Practice 96, 183–190 (2011).

    Article  Google Scholar 

  2. 2.

    Bernatoniene, J. & Finn, A. Advances in pneumococcal vaccines. Drugs 65, 229–255 (2005).

    CAS  Article  Google Scholar 

  3. 3.

    Organization, W. H. (2013).

  4. 4.

    Hava, D. L. & Camilli, A. Large‐scale identification of serotype 4 Streptococcus pneumoniae virulence factors. Molecular microbiology 45, 1389–1406 (2002).

    CAS  PubMed  PubMed Central  Google Scholar 

  5. 5.

    LeMessurier, K. S., Ogunniyi, A. D. & Paton, J. C. Differential expression of key pneumococcal virulence genes in vivo. Microbiology 152, 305–311 (2006).

    CAS  Article  Google Scholar 

  6. 6.

    Mahdi, L. K., Ogunniyi, A. D., LeMessurier, K. S. & Paton, J. C. Pneumococcal virulence gene expression and host cytokine profiles during pathogenesis of invasive disease. Infection and immunity 76, 646–657 (2008).

    CAS  Article  Google Scholar 

  7. 7.

    Brown, J., Hammerschmidt, S. & Orihuela, C. Streptococcus Pneumoniae: Molecular Mechanisms of Host-Pathogen Interactions. (Academic Press, 2015).

  8. 8.

    Obert, C. et al. Identification of a candidate Streptococcus pneumoniae core genome and regions of diversity correlated with invasive pneumococcal disease. Infection and immunity 74, 4766–4777 (2006).

    CAS  Article  Google Scholar 

  9. 9.

    de Andrade, A. L. S. S. et al. Genetic relationship between Streptococcus pneumoniae isolates from nasopharyngeal and cerebrospinal fluid of two infants with pneumococcal meningitis. Journal of clinical microbiology 41, 3970–3972 (2003).

    Article  Google Scholar 

  10. 10.

    Goonetilleke, U. R., Scarborough, M., Ward, S. A. & Gordon, S. B. Proteomic analysis of cerebrospinal fluid in pneumococcal meningitis reveals potential biomarkers associated with survival. The Journal of infectious diseases 202, 542–550 (2010).

    CAS  Article  Google Scholar 

  11. 11.

    Lees, J. A. et al. Large scale genomic analysis shows no evidence for pathogen adaptation between the blood and cerebrospinal fluid niches during bacterial meningitis. Microbial genomics 3 (2017).

  12.

    Kulohoma, B. W. et al. Comparative genomic analysis of meningitis- and bacteremia-causing pneumococci identifies a common core genome. Infection and immunity 83, 4165–4173 (2015).

  13.

    Doit, C., Loukil, C., Geslin, P. & Bingen, E. Phenotypic and genetic diversity of invasive pneumococcal isolates recovered from French children. Journal of clinical microbiology 40, 2994–2998 (2002).

  14.

    Maiden, M. C. et al. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences 95, 3140–3145 (1998).

  15.

    Spratt, B. G. Multilocus sequence typing: molecular typing of bacterial pathogens in an era of rapid DNA sequencing and the internet. Current opinion in microbiology 2, 312–316 (1999).

  16.

    Chen, X. & Ishwaran, H. Random forests for genomic data analysis. Genomics 99, 323–329 (2012).

  17.

    Lourenço, J. et al. Lineage structure of Streptococcus pneumoniae may be driven by immune selection on the groEL heat-shock protein. Scientific Reports 7, 9023 (2017).

  18.

    Friedman, J., Hastie, T. & Tibshirani, R. The elements of statistical learning. Vol. 1 (Springer Series in Statistics, Springer, Berlin, 2001).

  19.

    Brynildsrud, O., Bohlin, J., Scheffer, L. & Eldholm, V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome biology 17, 238 (2016).

  20.

    Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PloS one 5, e9490 (2010).

  21.

    Muñoz-López, M. & García-Pérez, J. L. DNA transposons: nature and applications in genomics. Current genomics 11, 115–128 (2010).

  22.

    Brueggemann, A. B. et al. Clonal relationships between invasive and carriage Streptococcus pneumoniae and serotype- and clone-specific differences in invasive disease potential. The Journal of infectious diseases 187, 1424–1432 (2003).

  23.

    Hausdorff, W. P., Bryant, J., Paradiso, P. R. & Siber, G. R. Which pneumococcal serogroups cause the most invasive disease: implications for conjugate vaccine formulation and use, part I. Clinical Infectious Diseases 30, 100–121 (2000).

  24.

    Croucher, N. J. et al. Diversification of bacterial genome content through distinct mechanisms over different timescales. Nature communications 5 (2014).

  25.

    van Tonder, A. J. et al. Heterogeneity Among Estimates Of The Core Genome And Pan-Genome In Different Pneumococcal Populations. bioRxiv, 133991 (2017).

  26.

    Andam, C. P. et al. Genomic Epidemiology of Penicillin-Nonsusceptible Pneumococci with Nonvaccine Serotypes Causing Invasive Disease in the United States. Journal of clinical microbiology 55, 1104–1115 (2017).

  27.

    Ramos-Sevillano, E., Moscoso, M., García, P., García, E. & Yuste, J. Nasopharyngeal colonization and invasive disease are enhanced by the cell wall hydrolases LytB and LytC of Streptococcus pneumoniae. PloS one 6, e23626 (2011).

  28.

    Bai, X.-H. et al. Structure of pneumococcal peptidoglycan hydrolase LytB reveals insights into the bacterial cell wall remodeling and pathogenesis. Journal of Biological Chemistry 289, 23403–23416 (2014).

  29.

    Moscoso, M., García, E. & López, R. Biofilm formation by Streptococcus pneumoniae: role of choline, extracellular DNA, and capsular polysaccharide in microbial accretion. Journal of bacteriology 188, 7785–7795 (2006).

  30.

    Seki, M., Iida, K.-i, Saito, M., Nakayama, H. & Yoshida, S.-i Hydrogen peroxide production in Streptococcus pyogenes: involvement of lactate oxidase and coupling with aerobic utilization of lactate. Journal of bacteriology 186, 2046–2051 (2004).

  31.

    Liu, L., Tong, H. & Dong, X. Function of the pyruvate oxidase-lactate oxidase cascade in interspecies competition between Streptococcus oligofermentans and Streptococcus mutans. Applied and environmental microbiology 78, 2120–2127 (2012).

  32.

    Taniai, H. et al. Concerted action of lactate oxidase and pyruvate oxidase in aerobic growth of Streptococcus pneumoniae: role of lactate as an energy source. Journal of bacteriology 190, 3572–3579 (2008).

  33.

    Rai, P. et al. Streptococcus pneumoniae secretes hydrogen peroxide leading to DNA damage and apoptosis in lung cells. Proceedings of the National Academy of Sciences 112, E3421–E3430 (2015).

  34.

    Iannelli, F., Oggioni, M. R. & Pozzi, G. Allelic variation in the highly polymorphic locus pspC of Streptococcus pneumoniae. Gene 284, 63–71 (2002).

  35.

    Croucher, N. J. et al. Diverse evolutionary patterns of pneumococcal antigens identified by pangenome-wide immunological screening. Proceedings of the National Academy of Sciences 114, E357–E366 (2017).

  36.

    Brooks-Walter, A., Briles, D. E. & Hollingshead, S. K. The pspC gene of Streptococcus pneumoniae encodes a polymorphic protein, PspC, which elicits cross-reactive antibodies to PspA and provides immunity to pneumococcal bacteremia. Infection and immunity 67, 6533–6542 (1999).

  37.

    Balachandran, P., Brooks-Walter, A., Virolainen-Julkunen, A., Hollingshead, S. K. & Briles, D. E. Role of pneumococcal surface protein C in nasopharyngeal carriage and pneumonia and its ability to elicit protection against carriage of Streptococcus pneumoniae. Infection and immunity 70, 2526–2534 (2002).

  38.

    Iovino, F. et al. pIgR and PECAM-1 bind to pneumococcal adhesins RrgA and PspC mediating bacterial brain invasion. Journal of Experimental Medicine, jem. 20161668 (2017).

  39.

    Giefing, C. et al. Discovery of a novel class of highly conserved vaccine antigens using genomic scale antigenic fingerprinting of pneumococcus with human antibodies. Journal of Experimental Medicine 205, 117–131 (2008).

  40.

    Claverys, J. P., Prudhomme, M., Mortier‐Barrière, I. & Martin, B. Adaptation to the environment: Streptococcus pneumoniae, a paradigm for recombination‐mediated genetic plasticity? Molecular microbiology 35, 251–259 (2000).

  41.

    Santagati, M., Iannelli, F., Oggioni, M. R., Stefani, S. & Pozzi, G. Characterization of a genetic element carrying the macrolide efflux gene mef (A) in Streptococcus pneumoniae. Antimicrobial Agents and Chemotherapy 44, 2585–2587 (2000).

  42.

    Sanchez, C. J. et al. Streptococcus pneumoniae in biofilms are unable to cause invasive disease due to altered virulence determinant production. PLoS One 6, e28738 (2011).

  43.

    Eng, C. L., Tong, J. C. & Tan, T. W. Predicting Zoonotic Risk of Influenza A Viruses from Host Tropism Protein Signature Using Random Forest. International journal of molecular sciences 18, 1135 (2017).

  44.

    Hanage, W. P. et al. Using multilocus sequence data to define the pneumococcus. Journal of bacteriology 187, 6223–6230 (2005).

  45.

    Ogunniyi, A. D. et al. Identification of genes that contribute to the pathogenesis of invasive pneumococcal disease by in vivo transcriptomic analysis. Infection and immunity 80, 3268–3278 (2012).

  46.

    Mahdi, L. K. et al. Characterization of pneumococcal genes involved in bloodstream invasion in a mouse model. PloS one 10, e0141816 (2015).

  47.

    Lukjancenko, O., Wassenaar, T. M. & Ussery, D. W. Comparison of 61 sequenced Escherichia coli genomes. Microbial ecology 60, 708–720 (2010).

  48.

    Méric, G. et al. A reference pan-genome approach to comparative bacterial genomics: identification of novel epidemiological markers in pathogenic Campylobacter. PloS one 9, e92798 (2014).

  49.

    Jolley, K. A. & Maiden, M. C. BIGSdb: scalable analysis of bacterial genome variation at the population level. BMC bioinformatics 11, 595 (2010).

  50.

    Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).

  51.

    Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).

  52.

    Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology 15, R46 (2014).

  53.

    Medini, D., Donati, C., Tettelin, H., Masignani, V. & Rappuoli, R. The microbial pan-genome. Current opinion in genetics & development 15, 589–594 (2005).

  54.

    Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution 30, 772–780 (2013).

  55.

    He, Z. et al. Evolview v2: an online visualization and management tool for customized and annotated phylogenetic trees. Nucleic acids research 44, W236–W241 (2016).

  56.

    Alikhan, N.-F., Petty, N. K., Zakour, N. L. B. & Beatson, S. A. BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC genomics 12, 402 (2011).

  57.

    Letunic, I. & Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic acids research 44, W242–W245 (2016).

  58.

    Liaw, A. & Wiener, M. Classification and regression by randomForest. R news 2, 18–22 (2002).

  59.

    Ogunniyi, A. D., Grabowicz, M., Briles, D. E., Cook, J. & Paton, J. C. Development of a vaccine against invasive pneumococcal disease based on combinations of virulence proteins of Streptococcus pneumoniae. Infection and immunity 75, 350–357 (2007).

  60.

    Hamel, J. et al. Prevention of pneumococcal disease in mice immunized with conserved surface-accessible proteins. Infection and immunity 72, 2659–2670 (2004).

  61.

    Yun, K. W., Lee, H., Choi, E. H. & Lee, H. J. Diversity of Pneumolysin and Pneumococcal Histidine Triad Protein D of Streptococcus pneumoniae Isolated from Invasive Diseases in Korean Children. PloS one 10, e0134055 (2015).

  62.

    Navais, R., Méndez, J., Pérez-Pascual, D., Cascales, D. & Guijarro, J. A. The yrpAB operon of Yersinia ruckeri encoding two putative U32 peptidases is involved in virulence and induced under microaerobic conditions. Virulence 5, 619–624 (2014).

  63.

    Gaspar, P., Al-Bayati, F. A., Andrew, P. W., Neves, A. R. & Yesilkaya, H. Lactate dehydrogenase is the key enzyme for pneumococcal pyruvate metabolism and pneumococcal survival in blood. Infection and immunity 82, 5099–5109 (2014).

  64.

    Vanier, G. et al. Disruption of srtA gene in Streptococcus suis results in decreased interactions with endothelial cells and extracellular matrix proteins. Veterinary microbiology 127, 417–424 (2008).

  65.

    Hu, D.-k. et al. Roles of virulence genes (PsaA and CpsA) on the invasion of Streptococcus pneumoniae into blood system. European journal of medical research 18, 14 (2013).

  66.

    Zähner, D. & Hakenbeck, R. The Streptococcus pneumoniae Beta-Galactosidase Is a Surface Protein. Journal of bacteriology 182, 5919–5921 (2000).

  67.

    Dalia, A. B., Standish, A. J. & Weiser, J. N. Three surface exoglycosidases from Streptococcus pneumoniae, NanA, BgaA, and StrH, promote resistance to opsonophagocytic killing by human neutrophils. Infection and immunity 78, 2108–2116 (2010).

  68.

    Morona, J. K., Miller, D. C., Morona, R. & Paton, J. C. The effect that mutations in the conserved capsular polysaccharide biosynthesis genes cpsA, cpsB, and cpsD have on virulence of Streptococcus pneumoniae. Journal of Infectious Diseases 189, 1905–1913 (2004).


This work was supported by an MRC grant jointly funded by the UK Medical Research Council and the UK Department for International Development under the MRC/DFID Concordat agreement and is also part of the EDCTP2 programme supported by the European Union (Gori, Heyderman, Gupta); a Wellcome Trust Recruitment Award (Heyderman); an ERC Advanced (DIVERSITY) grant (Gupta, Lourenco); and an EMBO postdoctoral fellowship (Obolski).

Author information




U.O. drafted the study. U.O. and A.G. performed the analyses. U.O., A.G., J.L., C.T., R.T., N.F., R.H. and S.G. interpreted the results. U.O. and A.G. drafted the first version of the manuscript. All authors revised the manuscript.

Corresponding author

Correspondence to Uri Obolski.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

About this article

Cite this article

Obolski, U., Gori, A., Lourenço, J. et al. Identifying genes associated with invasive disease in S. pneumoniae by applying a machine learning approach to whole genome sequence typing data. Sci Rep 9, 4049 (2019).
