Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis


Streptococcus pyogenes causes 700 million human infections annually worldwide, yet, despite a century of intensive effort, there is no licensed vaccine against this bacterium. Although a number of large-scale genomic studies of bacterial pathogens have been published, the relationships among the genome, transcriptome, and virulence in large bacterial populations remain poorly understood. We sequenced the genomes of 2,101 emm28 S. pyogenes invasive strains, from which we selected 492 phylogenetically diverse strains for transcriptome analysis and 50 strains for virulence assessment. Data integration provided a novel understanding of the virulence mechanisms of this model organism. Genome-wide association study, expression quantitative trait loci analysis, machine learning, and isogenic mutant strains identified and confirmed a one-nucleotide indel in an intergenic region that significantly alters global transcript profiles and ultimately virulence. The integrative strategy that we used is generally applicable to any microbe and may lead to new therapeutics for many human pathogens.

Fig. 1: Population genetic structure for 2,095 S. pyogenes emm28 invasive infection isolates.
Fig. 2: Transcriptome analysis of the subset of 50 strains.
Fig. 3: Singleton strains (n = 442) partition into two major transcriptome clusters according to their genome-wide expression profiles.
Fig. 4: Variation in the numbers of differentially expressed genes between cluster A and B strains.
Fig. 5: Clustering of covR and covS mutant strains, and associated virulence.
Fig. 6: An intergenic single-nucleotide insertion increases Spy1336/R28 expression and strain virulence.
Fig. 7: Mouse virulence data, NADase production, and nga transcript levels.

Data availability

Whole-genome sequencing data for the 2,101 isolates studied have been deposited in the NCBI Sequence Read Archive under BioProject accession number PRJNA434389. The slightly updated complete genome sequence of the emm28 reference strain MGAS6180 (GenBank accession number CP000056) has been deposited in the NCBI GenBank database under the same accession number. Transcriptome data have been deposited in the Gene Expression Omnibus under accession GSE113058. The data that support the findings of this study are available from the corresponding author upon request.

References


  1. Beres, S. B. et al. Transcriptome remodeling contributes to epidemic disease caused by the human pathogen Streptococcus pyogenes. mBio 7, e00403-16 (2016).

  2. Chewapreecha, C. et al. Comprehensive identification of single nucleotide polymorphisms associated with beta-lactam resistance within pneumococcal mosaic genes. PLoS Genet. 10, e1004547 (2014).

  3. Fernandez-Romero, N. et al. Uncoupling between core genome and virulome in extraintestinal pathogenic Escherichia coli. Can. J. Microbiol. 61, 647–652 (2015).

  4. Long, S. W. et al. Population genomic analysis of 1,777 extended-spectrum beta-lactamase-producing Klebsiella pneumoniae isolates, Houston, Texas: unexpected abundance of clonal group 307. mBio 8, e00489-17 (2017).

  5. Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017).

  6. Nasser, W. et al. Evolutionary pathway to increased virulence and epidemic group A Streptococcus disease derived from 3,615 genome sequences. Proc. Natl Acad. Sci. USA 111, E1768–E1776 (2014).

  7. Bruchmann, S. et al. Deep transcriptome profiling of clinical Klebsiella pneumoniae isolates reveals strain and sequence type-specific adaptation. Environ. Microbiol. 17, 4690–4710 (2015).

  8. Dotsch, A. et al. The Pseudomonas aeruginosa transcriptional landscape is shaped by environmental heterogeneity and genetic variation. mBio 6, e00749 (2015).

  9. Sharma-Kuinkel, B. K. et al. Potential influence of Staphylococcus aureus clonal complex 30 genotype and transcriptome on hematogenous infections. Open Forum Infect. Dis. 2, ofv093 (2015).

  10. Felek, S., Tsang, T. M. & Krukonis, E. S. Three Yersinia pestis adhesins facilitate Yop delivery to eukaryotic cells and contribute to plague virulence. Infect. Immun. 78, 4134–4150 (2010).

  11. Swearingen, M. C., Porwollik, S., Desai, P. T., McClelland, M. & Ahmer, B. M. Virulence of 32 Salmonella strains in mice. PLoS One 7, e36043 (2012).

  12. Schreiber, H. L. T. et al. Bacterial virulence phenotypes of Escherichia coli and host susceptibility determine risk for urinary tract infections. Sci. Transl. Med. 9, eaaf1283 (2017).

  13. Carapetis, J. R., Steer, A. C., Mulholland, E. K. & Weber, M. The global burden of group A streptococcal diseases. Lancet Infect. Dis. 5, 685–694 (2005).

  14. Carapetis, J. R. et al. Acute rheumatic fever and rheumatic heart disease. Nat. Rev. Dis. Primers 2, 15084 (2016).

  15. Zhu, L. et al. A molecular trigger for intercontinental epidemics of group A Streptococcus. J. Clin. Invest. 125, 3545–3559 (2015).

  16. Zhu, L., Olsen, R. J., Nasser, W., de la Riva Morales, I. & Musser, J. M. Trading capsule for increased cytotoxin production: contribution to virulence of a newly emerged clade of emm89 Streptococcus pyogenes. mBio 6, e01378-15 (2015).

  17. Colman, G., Tanna, A., Efstratiou, A. & Gaworzewska, E. T. The serotypes of Streptococcus pyogenes present in Britain during 1980–1990 and their association with disease. J. Med. Microbiol. 39, 165–178 (1993).

  18. Gherardi, G., Vitali, L. A. & Creti, R. Prevalent emm types among invasive GAS in Europe and North America since year 2000. Front. Public Health 6, 59 (2018).

  19. Smit, P. W. et al. Epidemiology and emm types of invasive group A streptococcal infections in Finland, 2008–2013. Eur. J. Clin. Microbiol. Infect. Dis. 34, 2131–2136 (2015).

  20. Ikebe, T. et al. Increased prevalence of group A Streptococcus isolates in streptococcal toxic shock syndrome cases in Japan from 2010 to 2012. Epidemiol. Infect. 143, 864–872 (2015).

  21. Naseer, U., Steinbakk, M., Blystad, H. & Caugant, D. A. Epidemiology of invasive group A streptococcal infections in Norway 2010–2014: a retrospective cohort study. Eur. J. Clin. Microbiol. Infect. Dis. 35, 1639–1648 (2016).

  22. Nelson, G. E. et al. Epidemiology of invasive group A streptococcal infections in the United States, 2005–2012. Clin. Infect. Dis. 63, 478–486 (2016).

  23. Plainvert, C. et al. Invasive group A streptococcal infections in adults, France (2006–2010). Clin. Microbiol. Infect. 18, 702–710 (2012).

  24. Al-Shahib, A. et al. Emergence of a novel lineage containing a prophage in emm/M3 group A Streptococcus associated with upsurge in invasive disease in the UK. Microb. Genom. 2, e000059 (2016).

  25. Davies, M. R. et al. Emergence of scarlet fever Streptococcus pyogenes emm12 clones in Hong Kong is associated with toxin acquisition and multidrug resistance. Nat. Genet. 47, 84–87 (2015).

  26. Fittipaldi, N. et al. Full-genome dissection of an epidemic of severe invasive disease caused by a hypervirulent, recently emerged clone of group A Streptococcus. Am. J. Pathol. 180, 1522–1534 (2012).

  27. Hamilton, S. M., Stevens, D. L. & Bryant, A. E. Pregnancy-related group A streptococcal infections: temporal relationships between bacterial acquisition, infection onset, clinical findings, and outcome. Clin. Infect. Dis. 57, 870–876 (2013).

  28. Johnson, D. R., Stevens, D. L. & Kaplan, E. L. Epidemiologic analysis of group A streptococcal serotypes associated with severe systemic infections, rheumatic fever, or uncomplicated pharyngitis. J. Infect. Dis. 166, 374–382 (1992).

  29. Shea, P. R. et al. Group A Streptococcus emm gene types in pharyngeal isolates, Ontario, Canada, 2002–2010. Emerg. Infect. Dis. 17, 2010–2017 (2011).

  30. Smoot, J. C. et al. Genome sequence and comparative microarray analysis of serotype M18 group A Streptococcus strains associated with acute rheumatic fever outbreaks. Proc. Natl Acad. Sci. USA 99, 4668–4673 (2002).

  31. Ben Zakour, N. L., Venturini, C., Beatson, S. A. & Walker, M. J. Analysis of a Streptococcus pyogenes puerperal sepsis cluster by use of whole-genome sequencing. J. Clin. Microbiol. 50, 2224–2228 (2012).

  32. Chuang, I., Van Beneden, C., Beall, B. & Schuchat, A. Population-based surveillance for postpartum invasive group A Streptococcus infections, 1995–2000. Clin. Infect. Dis. 35, 665–670 (2002).

  33. Gaworzewska, E. & Colman, G. Changes in the pattern of infection caused by Streptococcus pyogenes. Epidemiol. Infect. 100, 257–269 (1988).

  34. Raymond, J., Schlegel, L., Garnier, F. & Bouvet, A. Molecular characterization of Streptococcus pyogenes isolates to investigate an outbreak of puerperal sepsis. Infect. Control Hosp. Epidemiol. 26, 455–461 (2005).

  35. Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).

  36. Bricker, A. L., Carey, V. J. & Wessels, M. R. Role of NADase in virulence in experimental invasive group A streptococcal infection. Infect. Immun. 73, 6562–6566 (2005).

  37. Bricker, A. L., Cywes, C., Ashbaugh, C. D. & Wessels, M. R. NAD+-glycohydrolase acts as an intracellular toxin to enhance the extracellular survival of group A streptococci. Mol. Microbiol. 44, 257–269 (2002).

  38. Sumby, P. et al. Evolutionary origin and emergence of a highly successful clone of serotype M1 group A Streptococcus involved multiple horizontal gene transfer events. J. Infect. Dis. 192, 771–782 (2005).

  39. Zhu, L. et al. Contribution of secreted NADase and streptolysin O to the pathogenesis of epidemic serotype M1 Streptococcus pyogenes infections. Am. J. Pathol. 187, 605–613 (2017).

  40. Meehl, M. A., Pinkner, J. S., Anderson, P. J., Hultgren, S. J. & Caparon, M. G. A novel endogenous inhibitor of the secreted streptococcal NAD-glycohydrolase. PLoS Pathog. 1, e35 (2005).

  41. Tatsuno, I. et al. Characterization of the NAD-glycohydrolase in streptococcal strains. Microbiology 153, 4253–4260 (2007).

  42. Shimomura, Y. et al. Complete genome sequencing and analysis of a Lancefield group G Streptococcus dysgalactiae subsp. equisimilis strain causing streptococcal toxic shock syndrome (STSS). BMC Genomics 12, 17 (2011).

  43. Carroll, R. K. et al. Naturally occurring single amino acid replacements in a regulatory protein alter streptococcal gene expression and virulence in mice. J. Clin. Invest. 121, 1956–1968 (2011).

  44. Graham, M. R. et al. Virulence control in group A Streptococcus by a two-component gene regulatory system: global expression profiling and in vivo infection modeling. Proc. Natl Acad. Sci. USA 99, 13855–13860 (2002).

  45. Ribardo, D. A. & McIver, K. S. Defining the Mga regulon: comparative transcriptome analysis reveals both direct and indirect regulation by Mga in the group A Streptococcus. Mol. Microbiol. 62, 491–508 (2006).

  46. Ramalinga, A., Danger, J. L., Makthal, N., Kumaraswami, M. & Sumby, P. Multimerization of the virulence-enhancing group A Streptococcus transcription factor RivR is required for regulatory activity. J. Bacteriol. 199, e00452-16 (2017).

  47. Trevino, J., Liu, Z., Cao, T. N., Ramirez-Pena, E. & Sumby, P. RivR is a negative regulator of virulence factor expression in group A Streptococcus. Infect. Immun. 81, 364–372 (2013).

  48. Nyberg, P., Rasmussen, M. & Bjorck, L. α2-Macroglobulin-proteinase complexes protect Streptococcus pyogenes from killing by the antimicrobial peptide LL-37. J. Biol. Chem. 279, 52820–52823 (2004).

  49. Rasmussen, M., Muller, H. P. & Bjorck, L. Protein GRAB of Streptococcus pyogenes regulates proteolysis at the bacterial surface by binding α2-macroglobulin. J. Biol. Chem. 274, 15336–15344 (1999).

  50. Toppel, A. W., Rasmussen, M., Rohde, M., Medina, E. & Chhatwal, G. S. Contribution of protein G-related α2-macroglobulin-binding protein to bacterial virulence in a mouse skin model of group A streptococcal infection. J. Infect. Dis. 187, 1694–1703 (2003).

  51. Haas, B. J., Chin, M., Nusbaum, C., Birren, B. W. & Livny, J. How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes? BMC Genomics 13, 734 (2012).

  52. Shishkin, A. A. et al. Simultaneous generation of many RNA-Seq libraries in a single reaction. Nat. Methods 12, 323–325 (2015).

  53. Engleberg, N. C., Heath, A., Miller, A., Rivera, C. & DiRita, V. J. Spontaneous mutations in the CsrRS two-component regulatory system of Streptococcus pyogenes result in enhanced virulence in a murine model of skin and soft tissue infection. J. Infect. Dis. 183, 1043–1054 (2001).

  54. Li, J. et al. Neutrophils select hypervirulent CovRS mutants of M1T1 group A Streptococcus during subcutaneous infection of mice. Infect. Immun. 82, 1579–1590 (2014).

  55. Mayfield, J. A. et al. Mutations in the control of virulence sensor gene from Streptococcus pyogenes after infection in mice lead to clonal bacterial variants with altered gene regulatory activity and virulence. PLoS One 9, e100698 (2014).

  56. Sumby, P., Whitney, A. R., Graviss, E. A., DeLeo, F. R. & Musser, J. M. Genome-wide analysis of group A streptococci reveals a mutation that modulates global phenotype and disease specificity. PLoS Pathog. 2, e5 (2006).

  57. Tatsuno, I., Okada, R., Zhang, Y., Isaka, M. & Hasegawa, T. Partial loss of CovS function in Streptococcus pyogenes causes severe invasive disease. BMC Res. Notes 6, 126 (2013).

  58. Trevino, J. et al. CovS simultaneously activates and inhibits the CovR-mediated repression of distinct subsets of group A Streptococcus virulence factor-encoding genes. Infect. Immun. 77, 3141–3149 (2009).

  59. Breiman, L. Random Forests. Mach. Learn. 45, 5–32 (2001).

  60. Stalhammar-Carlemalm, M., Areschoug, T., Larsson, C. & Lindahl, G. The R28 protein of Streptococcus pyogenes is related to several group B streptococcal surface proteins, confers protective immunity and promotes binding to human epithelial cells. Mol. Microbiol. 33, 208–219 (1999).

  61. Stalhammar-Carlemalm, M., Stenberg, L. & Lindahl, G. Protein rib: a novel group B streptococcal cell surface protein that confers protective immunity and is expressed by most strains causing invasive infections. J. Exp. Med. 177, 1593–1603 (1993).

  62. Beres, S. B. & Musser, J. M. Contribution of exogenous genetic elements to the group A Streptococcus metagenome. PLoS One 2, e800 (2007).

  63. Green, N. M. et al. Genome sequence of a serotype M28 strain of group A Streptococcus: potential new insights into puerperal sepsis and bacterial disease specificity. J. Infect. Dis. 192, 760–770 (2005).

  64. Coll, F. et al. Genome-wide analysis of multi- and extensively drug-resistant Mycobacterium tuberculosis. Nat. Genet. 50, 307–316 (2018).

  65. Earle, S. G. et al. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat. Microbiol. 1, 16041 (2016).

  66. Gibson, G., Powell, J. E. & Marigorta, U. M. Expression quantitative trait locus analysis for translational medicine. Genome Med. 7, 60 (2015).

  67. Nicolae, D. L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet. 6, e1000888 (2010).

  68. Olsen, R. J. & Musser, J. M. Molecular pathogenesis of necrotizing fasciitis. Annu. Rev. Pathol. 5, 1–31 (2010).

  69. Rodriguez-Ortega, M. J. et al. Characterization and identification of vaccine candidate proteins through analysis of the group A Streptococcus surface proteome. Nat. Biotechnol. 24, 191–197 (2006).

  70. Zhu, L. et al. Intergenic variable-number tandem-repeat polymorphism upstream of rocA alters toxin production and enhances virulence in Streptococcus pyogenes. Infect. Immun. 84, 2086–2093 (2016).

  71. Hammarlof, D. L. et al. Role of a single noncoding nucleotide in the evolution of an epidemic African clade of Salmonella. Proc. Natl Acad. Sci. USA 115, E2614–E2623 (2018).

  72. Blount, Z. D., Barrick, J. E., Davidson, C. J. & Lenski, R. E. Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature 489, 513–518 (2012).

  73. Zaunbrecher, M. A., Sikes, R. D. Jr, Metchock, B., Shinnick, T. M. & Posey, J. E. Overexpression of the chromosomally encoded aminoglycoside acetyltransferase eis confers kanamycin resistance in Mycobacterium tuberculosis. Proc. Natl Acad. Sci. USA 106, 20004–20009 (2009).

  74. Puopolo, K. M. & Madoff, L. C. Upstream short sequence repeats regulate expression of the alpha C protein of group B Streptococcus. Mol. Microbiol. 50, 977–991 (2003).

  75. Stalhammar-Carlemalm, M., Areschoug, T., Larsson, C. & Lindahl, G. Cross-protection between group A and group B streptococci due to cross-reacting surface proteins. J. Infect. Dis. 182, 142–149 (2000).

  76. Weckel, A. et al. The N-terminal domain of the R28 protein promotes emm28 group A Streptococcus adhesion to host cells via direct binding to three integrins. J. Biol. Chem. 293, 16006–16018 (2018).

  77. Valdes, K. M. et al. The fruRBA operon is necessary for group A streptococcal growth in fructose and for resistance to neutrophil killing during growth in whole human blood. Infect. Immun. 84, 1016–1031 (2016).

  78. Jeukens, J. et al. Genomics of antibiotic-resistance prediction in Pseudomonas aeruginosa. Ann. NY Acad. Sci. 1435, 5–17 (2017).

  79. Nguyen, M. et al. Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. Sci. Rep. 8, 421 (2018).

  80. Pesesky, M. W. et al. Evaluation of machine learning and rules-based approaches for predicting antimicrobial resistance profiles in Gram-negative bacilli from whole genome sequence data. Front. Microbiol. 7, 1887 (2016).

  81. Rishishwar, L., Petit, R. A. 3rd, Kraft, C. S. & Jordan, I. K. Genome sequence-based discriminator for vancomycin-intermediate Staphylococcus aureus. J. Bacteriol. 196, 940–948 (2014).

  82. Li, Y. et al. Validation of beta-lactam minimum inhibitory concentration predictions for pneumococcal isolates with newly encountered penicillin binding protein (PBP) sequences. BMC Genomics 18, 621 (2017).

  83. Li, Y. et al. Penicillin-binding protein transpeptidase signatures for tracking and predicting beta-lactam resistance levels in Streptococcus pneumoniae. mBio 7, e00756-16 (2016).

  84. Hao, K. et al. Lung eQTLs to help reveal the molecular underpinnings of asthma. PLoS Genet. 8, e1003029 (2012).

  85. Naranbhai, V. et al. Genomic modulators of gene expression in human neutrophils. Nat. Commun. 6, 7545 (2015).

  86. Ongen, H. et al. Estimating the causal tissues for complex traits and diseases. Nat. Genet. 49, 1676–1683 (2017).

  87. Tung, J., Zhou, X., Alberts, S. C., Stephens, M. & Gilad, Y. The genetic architecture of gene expression levels in wild baboons. eLife (2015).

  88. Albert, F. W., Treusch, S., Shockley, A. H., Bloom, J. S. & Kruglyak, L. Genetics of single-cell protein abundance variation in large yeast populations. Nature 506, 494–497 (2014).

  89. Parker, C. C. et al. Genome-wide association study of behavioral, physiological and gene expression traits in outbred CFW mice. Nat. Genet. 48, 919–926 (2016).

  90. Francesconi, M. & Lehner, B. The effects of genetic variation on gene expression dynamics during development. Nature 505, 208–211 (2014).

  91. Beres, S. B. et al. Molecular complexity of successive bacterial epidemics deconvoluted by comparative pathogenomics. Proc. Natl Acad. Sci. USA 107, 4371–4376 (2010).

  92. Olsen, R. J. et al. The majority of 9,729 group A Streptococcus strains causing disease secrete SpeB cysteine protease: pathogenesis implications. Infect. Immun. 83, 4750–4758 (2015).

  93. Beres, S. B. et al. Genome sequence analysis of emm89 Streptococcus pyogenes strains causing infections in Scotland, 2010–2016. J. Med. Microbiol. 66, 1765–1773 (2017).

  94. Liu, Y., Schroder, J. & Schmidt, B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 29, 308–315 (2013).

  95. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014).

  96. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).

  97. Inouye, M. et al. SRST2: rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 6, 90 (2014).

  98. Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).

  99. Cheng, L., Connor, T. R., Siren, J., Aanensen, D. M. & Corander, J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Mol. Biol. Evol. 30, 1224–1228 (2013).

  100. Huson, D. H. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73 (1998).

  101. Wick, R. R., Judd, L. M., Gorrie, C. L. & Holt, K. E. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13, e1005595 (2017).

  102. Long, S. W., Kachroo, P., Musser, J. M. & Olsen, R. J. Whole-genome sequencing of a human clinical isolate of emm28 Streptococcus pyogenes causing necrotizing fasciitis acquired contemporaneously with Hurricane Harvey. Genome Announc. 5, e01269-17 (2017).

  103. Lees, J. A. et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat. Commun. 7, 12797 (2016).

  104. Bishop, C. Pattern Recognition and Machine Learning (Springer, New York, 2006).

  105. Eraso, J. M. et al. Genomic landscape of intrahost variation in group A Streptococcus: repeated and abundant mutational inactivation of the fabT gene encoding a regulator of fatty acid synthesis. Infect. Immun. 84, 3268–3281 (2016).

  106. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).

  107. Magoc, T., Wood, D. & Salzberg, S. L. EDGE-pro: estimated degree of gene expression in prokaryotic genomes. Evol. Bioinform. Online 9, 127–136 (2013).

  108. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biol. 15, 550 (2014).

  109. Hoffman, G. E. & Schadt, E. E. variancePartition: interpreting drivers of variation in complex gene expression studies. BMC Bioinformatics 17, 483 (2016).

  110. Tarazona, S. et al. Data quality aware analysis of differential expression in RNA-Seq with NOISeq R/Bioc package. Nucleic Acids Res. 43, e140 (2015).

  111. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).

  112. Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353–1358 (2012).

  113. Committee for the Update of the Guide for the Care and Use of Laboratory Animals, Institute for Laboratory Animal Research & Division on Earth and Life Studies. Guide for the Care and Use of Laboratory Animals 8th edn. (National Academies Press, Washington, DC, 2011).

Acknowledgements


This study was supported in part by the Fondren Foundation, Houston Methodist Hospital and Research Institute (to J.M.M.), the Academy of Finland (grant 255636 to J.V.), a European Research Council grant (number 742158 to J.C.), and a National Institutes of Health grant (1R01AI109096-01A1 to M.K.). This research was also supported in part by the Intramural Research Program of the National Institute of Allergy and Infectious Diseases, National Institutes of Health (to F.R.D.). We thank N. Copeland, N. Jenkins, and D. Ginsburg for critical comments and suggestions to improve the manuscript; K. Stockbauer for critical comments and editorial assistance; E. Graviss, H. Erlendsdottir, W. Hong, and S. Linson for technical assistance; H.-L. Hyyryläinen, J. Jalava, and the Finnish clinical microbiology laboratories; A. A. Shishkin for helpful suggestions regarding the RNAtag-seq protocol; M. Todorovic and J. Jonsdottir Nielsen for banking strains from the Faroe Islands; A. McGeer for Ontario strains; C. Van Beneden, B. Beall, and the Active Bacterial Core Surveillance of the CDC’s Emerging Infections Programs network; A. Ramstad Alme and A. Witsø for technical assistance; and M. Steinbakk (Norwegian Laboratory for Streptococci) for support.

Author information




J.M.M. conceptualized the study. P.K., J.M.E., and J.M.M. designed the study. P.K., J.M.E., S.B.B., R.J.O., L.Z., W.N., P.E.B., C.C.C., M.O.S., M.J.A., B.S., M.P., J.P., J.C., S.L.K., H.A.T.N., S.W.L., and A.R.P. produced the data. P.K., J.M.E., S.B.B., R.J.O., L.Z., H.D., M.K., M.P., J.P., J.C., S.W.L., and F.R.D. analyzed the data. P.K. led the analyses of the transcriptome data. M.P., J.P., E.R.D., A.G.C., and J.C. provided scholarly input on the statistical analysis and presentation strategies. J.V., K.G.-Y.-H., K.G.K., M.G., D.A.C., S.G., and M.D.M. provided strains and metadata. All authors contributed to writing the manuscript. All authors reviewed and approved the final draft. P.K. and J.M.E. contributed equally to this work, as did S.B.B., R.J.O., and L.Z.

Corresponding author

Correspondence to James M. Musser.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Distribution of emm28 isolates by country and state in the United States.

All strains were isolated during a 26-year period, spanning 1991 through 2016. (a) Distribution of strains by country. Vertical black bars indicate the number of isolates per year. The total number of strains isolated in the USA was 952, of which 951 strains were collected as part of the Active Bacterial Core (ABC) surveillance study conducted by the Centers for Disease Control and Prevention (refs. 32, 68, 111, 112; see for a complete description of the study). The one additional strain (from Texas) is strain MGAS6180, which is the genome sequence reference strain. Canadian strains are all from Ontario. The Faroe Islands are a self-governing part of Denmark. Regardless of country, all strains were recovered as part of comprehensive, population-based studies. (b) Distribution of emm28 isolates by state in the USA. All strains were isolated during a period of 18 years, spanning 1995 through 2012. Vertical black bars indicate the number of isolates per year. For the U.S. isolates, the states have been coded (A–J) at the request of the Centers for Disease Control.

Supplementary Figure 2 Flowcharts depicting bacterial genome and transcriptome data analysis.

(a) Next-generation sequencing data analysis pipeline employed for the preprocessing, read mapping, variant discovery and downstream genomic analyses of whole-genome sequencing data. MLST, multilocus sequence type; SNP, single nucleotide polymorphism; HGT, horizontal gene transfer. (b) Bioinformatics pipeline for demultiplexing, quality assessment, adapter trimming, read mapping, data normalization, and differential expression analysis of transcriptome data.

Supplementary Figure 3 Distribution of emm28 isolates by genetic subclade, country and year.

Strains are represented by country and year of isolation. Only strains belonging to subclades 1A (SC1A-red), 1B (SC1B-blue), 2A (SC2A-green), and 2B (SC2B-brown) are shown. (a) Vertical bars indicate the number of isolates per year. The number (n) of strains isolated in each country is shown. Six distant outlier strains in the phylogenetic tree and 7 strains from the Faroe Islands are not shown. Thus, the number of strains does not sum to the total sample of 2,101 strains. No strains belonging to subclade SC2A or SC2B were isolated in Iceland. (b) Total number of strains belonging to each individual subclade per country. US, United States; CA, Canada (Ontario); FI, Finland; NO, Norway; IS, Iceland. Others refers to 6 distant outlier strains in the phylogenetic tree.

Supplementary Figure 4 Correlation among biological replicates for 50 strains analyzed by RNA-seq.

Comparison of biological replicates per strain at mid-exponential (a) and early-stationary phase (b). Mean correlation coefficient (Pearson) and standard deviation of normalized and log-transformed transcript counts for three biological replicates per strain are plotted.
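The replicate comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `log2(count + 1)` transform, the pairwise averaging over the three replicates, the function names, and the example counts are all assumptions made for the sketch.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def replicate_agreement(replicates):
    """Mean and SD of pairwise Pearson r over log-transformed profiles.

    Assumption: counts are transformed as log2(count + 1).
    """
    logged = [[math.log2(c + 1) for c in rep] for rep in replicates]
    rs = [pearson(a, b) for a, b in combinations(logged, 2)]
    return mean(rs), stdev(rs)


# Hypothetical normalized counts for one strain: three replicates x five genes.
example = [
    [120.0, 15.0, 980.0, 44.0, 7.0],
    [131.0, 13.0, 1010.0, 40.0, 9.0],
    [115.0, 17.0, 955.0, 47.0, 6.0],
]
mean_r, sd_r = replicate_agreement(example)
print(f"mean r = {mean_r:.3f}, SD = {sd_r:.3f}")
```

With three replicates there are three pairwise comparisons per strain, so the mean and standard deviation are taken over three correlation coefficients.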

Supplementary Figure 5 Transcriptome alterations and genetic subclades.

(a) Schematic depicts number of differentially expressed (DE) genes obtained by comparing transcriptome data for strains in the three major genetic subclades at mid-exponential (ME) and early-stationary (ES) phases. (b) Fold-increase in nga-ifs-slo transcript levels in SC2A (n = 15) strains compared to SC1A (n = 12) and SC1B (n = 23) strains at ME and ES phase. (c) grab gene transcript levels (normalized counts) were significantly increased in SC1B (n = 23) strains compared to SC1A (n = 12) strains at both growth phases (ME and ES). A significant increase in grab transcript levels in SC2A (n = 15) strains compared to SC1A (n = 12) strains was observed at ES phase. Statistical tests were performed using Mann-Whitney (two-tailed) test. Data are presented as box and whisker plots, where whiskers represent the minimum and maximum values. n represents the number of strains; each strain has three independent biological replicates.

Supplementary Figure 6 Comparison of three replicates versus single replicate and RNA-seq versus RNAtag-seq.

(a) Scatterplots comparing WT-like strains from each of three major subclades (10 SC1A, 22 SC1B, 14 SC2A) using triplicates versus one randomly selected replicate from the 50-strain data. Presence of three biological replicates in the 50-strain data allowed us to simulate comparisons of averaged normalized counts when three versus one replicate were used. Strong correlation (r = 0.99) was observed for each triplicate- versus single-replicate comparison. Pearson correlation coefficient (r) is shown for each comparison. n represents number of samples (number of strains multiplied by number of replicates). (b) Seven strains were processed using the two protocols, that is, RNA-seq (three biological replicates per strain) and RNAtag-seq (singletons, that is, using single replicates). Principal component analysis of the seven strains processed using RNA-seq (three spheres colored cyan in the PCA plot) and RNAtag-seq (single sphere colored red in the PCA plot) displays overlapping spatial clustering. Expression profile of the 7 strains in the PCA plot is circled and numbered 1 through 7. Strains analyzed: 1-MGAS7888, 2-MGAS29284, 3-MGAS29553, 4-MGAS28746, 5-MGAS7914, 6-MGAS28647, and 7-MGAS28686. (c) Scatterplots were generated for the normalized counts (log-transformed) from the aforementioned seven strains processed using the two protocols, that is, RNA-seq and RNAtag-seq. For each strain, normalized transcript counts were averaged over the three biological replicates (RNA-seq protocol) and compared to RNAtag-seq normalized counts (singleton strain samples). Pearson correlation coefficient (r) is shown for each comparison.

Supplementary Figure 7 Strategy used to make pools and superpools and their sequencing read content (millions).

(a) Strategy used to make pools and superpools. Strains (small yellow circles) were grouped to form 58 distinct pools (gray circles) by labeling total RNA extracted from each strain with unique barcoded oligoribonucleotides. RNA from 8 strains was mixed to create one pool, with the exception of pool 58, which contained RNA from only 5 strains. In total there were 58 pools. cDNAs from each pool were individually barcoded with Illumina P7 index oligonucleotides. Four different P7 oligonucleotides were used in this study. Four pools were mixed to form one superpool (large yellow circles). In total there were 15 superpools. Pool 58 contained cDNA from only five strains, and superpool 15 contained only two pools. The original number of strains we performed RNAtag-seq analysis on was 461, and here we present data for 442. Data from 19 strains were not included because of low sequence coverage. (b) Average number of sequence reads per pool for each of the 15 superpools is presented. Each circle represents mean and error bars represent standard deviation (SD). Median was calculated using data for superpools 1–14 (each comprised of four pools). Superpool 15 contained only two pools. (c) Graph depicts the median number of reads per sample per pool in millions. Median reads per sample for the pools 57 and 58 are larger due to the higher sequencing depth of these pools.
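The pool and superpool counts given in panel (a) can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the caption; the constant and variable names are hypothetical.

```python
# Numbers taken from the caption; names are illustrative only.
STRAINS_PER_POOL = 8
N_POOLS = 58
LAST_POOL_STRAINS = 5        # pool 58 held RNA from only 5 strains
POOLS_PER_SUPERPOOL = 4
EXCLUDED_LOW_COVERAGE = 19   # strains dropped for low sequence coverage

# Pools 1-57 hold 8 strains each; pool 58 holds 5.
pool_sizes = [STRAINS_PER_POOL] * (N_POOLS - 1) + [LAST_POOL_STRAINS]
total_strains = sum(pool_sizes)

# Group consecutive pools into superpools of four; the last one is short.
superpools = [
    pool_sizes[i:i + POOLS_PER_SUPERPOOL]
    for i in range(0, N_POOLS, POOLS_PER_SUPERPOOL)
]

retained_strains = total_strains - EXCLUDED_LOW_COVERAGE

print(total_strains)         # 461
print(len(superpools))       # 15
print(len(superpools[-1]))   # 2
print(retained_strains)      # 442
```

The totals agree with the caption: 57 × 8 + 5 = 461 strains processed, 14 × 4 + 2 = 58 pools across 15 superpools, and 461 − 19 = 442 strains retained.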

Supplementary Figure 8 PCA plot of singleton strains and analysis of Cluster A ropB mutant strains.

(a) The two major clusters identified by DBSCAN are shown. (b) No subclade-specific clustering was evident within the two clusters. (c) Twenty strains with ropB mutations are outliers (colored yellow) and group away from the other strains with ropB mutations (colored orange). ropB-non-outlier strains cluster with WT-like strains (colored light blue) and strains with mutations in other major regulator genes (colored blue). (d) Cluster A ropB mutant strains separated into two groups validated by k-means clustering and were designated arbitrarily as Group I and Group II. (e) Group II ropB mutant strains had significantly decreased speB transcript levels compared to Group I strains (Mann-Whitney, two-tailed, P < 0.0001). (f) Mutations were mapped onto the crystal structure of the C-terminal region of the RopB protein. Variant amino acid positions associated with Group I or Group II organisms are labeled in red and pink, respectively. Amino acid residues present in inferred functional domains are demarcated with ovals. Mutations located in RopB functional domains were present at significantly increased frequency (one-tailed test of proportions, P < 0.05) in Group II strains (pink labels within ovals) compared to Group I strains (red labels within ovals). PBD: peptide binding domain, NTD: N-terminal domain. The crystal structure of the NTD has not been solved. (g) Kaplan-Meier curve showing that the Group I (n = 3) and Group II (n = 4) strains differ significantly (log-rank test) in virulence in a mouse necrotizing myositis infection model (40 mice per strain). (h) Gross pathology images of infected mouse hindlimbs (n = 5 mice per strain) reflect the difference in virulence between the Group I (top) and Group II (bottom) strains, and representative images are displayed. Boxed areas demarcated in white illustrate major lesion areas.

Supplementary Figure 9 Lack of significant relationship between extent of transcriptome remodeling (number of DE genes) and genetic distance.

(a) Scatterplot comparing the number of differentially expressed (DE) genes and the genetic distance of the 442 singleton strains. For each of the strains, genes were called differentially expressed compared to reference strain MGAS28737. Genetic distance was measured as the number of core chromosomal SNPs compared to strain MGAS28737. The red line is the regression line. No significant correlation was observed between genetic distance and extent of transcriptome remodeling (number of DE genes), with an R² value of 0.0046. (b) No improvement in correlation (R² = 0.0040) was observed when the analysis was conducted using only data for the 188 strains that have wild-type alleles for all known major regulatory genes. Red, SC1A; blue, SC1B; green, SC2A; yellow, SC2B. R² values were calculated by linear regression analysis.
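For readers unfamiliar with the statistic, the coefficient of determination from a simple linear regression can be computed as sketched below. This is an ordinary least-squares illustration under assumed inputs, not the authors' analysis code: the function name and the example SNP distances and DE gene counts are hypothetical.

```python
def r_squared(x, y):
    """R^2 of the ordinary least-squares line fitted to y as a function of x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual and total sums of squares around the fitted line and the mean.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: core SNP distance to the reference strain (x) and
# number of DE genes relative to the reference strain (y).
snp_distance = [12, 35, 48, 60, 75, 90, 110, 140]
de_genes = [210, 35, 400, 18, 150, 310, 22, 95]
print(f"R^2 = {r_squared(snp_distance, de_genes):.4f}")
```

An R² close to zero, as reported in the caption, means that the fitted line explains almost none of the variance in the number of DE genes.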

Supplementary Figure 10 Genome-wide association analysis and eQTL analysis of 442 strains.

(a) Genome-wide association analysis was performed on 442 strains. Manhattan plot showing statistical significance (y-axis) of each k-mer (red circles) positively associated with high transcript expression of genes Spy1336/R28 and Spy1337, and their position along the 1.9 Mb GAS genome. Significant k-mers mapped to only one region of the chromosome, corresponding to the intergenic region between the Spy1336/R28 and Spy1337 genes. The top part is a schematic of the GAS genome, with vertical blue lines corresponding to open reading frames (ORFs) encoded by each strand of the chromosome. The bottom part shows an enlargement of the genome location corresponding to Spy1336/R28 and Spy1337, and the intergenic region. P values were computed by SEER software (Methods). (b) eQTL analysis identifies significant association between genotype (9T versus 10T) and expression level of genes Spy1336/R28 and Spy1337 in 50 strains at mid-exponential phase (left panel) and in 442 strains at early-stationary phase (right panel). Horizontal black bars represent mean transcript expression and standard deviation. PeQTL refers to q-values (false discovery rate, FDR) as reported by the MatrixEQTL package. The threshold used for genome-wide significance was adjusted P value < 10e-8.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10, Supplementary Note and Supplementary Tables 1, 7, 9, 10, 13 and 17–19

Reporting Summary

Supplementary Table 2

SNPs largely present in SC2A but absent in SC1A post Gubbins

Supplementary Table 3

Inferred MGE content for the 20 most prevalent MGE genotypes in the S. pyogenes emm28 cohort

Supplementary Table 4

MGE genotype based on the presence or absence of 50 phage and ICE encoded genes, 31 integrases and 19 secreted virulence factors, derived from MGEs identified in 60 complete S. pyogenes

Supplementary Table 5

SRST2 MGE-50 absence/presence matrix and genotype

Supplementary Table 6

Supplementary Table 8

List of differentially expressed genes comparing the three major genetic subclades at mid-exponential and stationary phase

Supplementary Table 11

Regulatory gene mutation prediction by machine learning

Supplementary Table 12

List of differentially expressed genes comparing transcriptomic clusters within CovR/CovS mutant strains

Supplementary Table 14

List of differentially expressed genes between group II versus group I ropB mutant strains

Supplementary Table 15

List of differentially expressed genes comparing the isogenic strains with either 9Ts or 10Ts in the intergenic region between the Spy1336/R28 and Spy1337 genes

Supplementary Table 16

Results of eQTL analysis

Supplementary Table 20

Data quality metrics for the 2101 emm28 cohort

Cite this article

Kachroo, P., Eraso, J.M., Beres, S.B. et al. Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis. Nat Genet 51, 548–559 (2019).
