Introduction

Streptococcus pyogenes (group A Streptococcus: GAS) is a globally significant bacterial pathogen of humans that is responsible for over a half a million deaths worldwide each year1. GAS is capable of expressing an arsenal of virulence genes as it survives and thrives in the diverse range of human tissues encountered throughout infection2. Distinct from many other bacteria that engage multiple RNA polymerase sigma factors, the growth-phase gene expression of GAS is modulated globally by transcription response regulators (RRs)3,4,5,6. GAS RRs control factors that mediate metabolism, colonization of tissues, evasion of immunity, stressor response, dissemination, and persistence, by responding to changes in the external and cytosolic microenvironments7.

GAS mga is the autoregulating RR that controls the expression of genes within the mga regulon8. The mga regulon is bounded by the conserved genes mga (mga1 or mga2) and scpA. Between mga and scpA variably lie the highly recombinogenic genes mrp, emm, and enn (encoding the surface-exposed M-related, M-, and M-like proteins)9. The canon describes a mosaic of four divergent, yet conserved, cell wall-spanning domains at the 3′ end of mrp, emm, and enn9,10. Along with five configurations of mrp, emm, and enn, these cell-wall spanning domains form the basis of the emm-pattern typing system9. The pattern types consist of A-C ‘throat specialists’ isolates, D ‘skin specialists’, and E ‘generalists’9. In 2018, DebRoy and coworkers11 observed the chimerisation of the 5′ end of emm4 and the 3′ end of its adjacent enn, resulting in emm4C. This observation defined a novel emm-pattern. They established that in 2018, 80% of circulating emm4 GAS strains in the United States of America and England contained the emm4C variant, highlighting the clinical relevance of this variant.

Phylogenetic delineation by molecular genotyping is central to the understanding of the biology, pathophysiology, epidemiology, and outbreak investigation of bacteria. The gold standard of GAS molecular typing, emm-typing, is based on the nucleotide (nt) sequence variability at the 5′ end of emm, and specifically the first 30 codons of the mature M-protein12. Moreover, emm-subtyping provides a higher resolution than emm-typing and is based on the nucleotide sequence spanning the C-terminus of the signal sequence (10 amino acids), and the N-terminus of mature M-protein (50 amino acids) (https://www.cdc.gov/streplab/groupa-strep/emm-background.html). emm-cluster typing, which is based on four bioinformatic criteria and the phylogeny of the amino acid sequence of the surface-exposed portion of the M-protein, corresponds to human serum binding of the M-protein13.

Although emm-typing is utilised in nearly all contemporary GAS epidemiological studies9, it is well understood that the surface-exposed M-protein is under strong diversifying selection pressure from host immunity14,15. Furthermore, the horizontal transfer of emm between GAS strains has long been known14,15,16, and more recently termed ‘emm-switching’17,18. Other studies have also described deletion-fusion (chimeric) events spanning the emm locus11,19,20,21. This imposes limitations on the use of emm-typing in GAS strain definition. Another popular GAS typing system is the multilocus sequence typing (MLST) system which utilises variation in the nucleotide sequence at the locus of seven genomically-dispersed housekeeping genes, and is often used to augment emm-typing14. While the emm-type/MLST-type pair has higher resolution than either individually, there are many emm-types that are represented in multiple MLST-types, and vice versa, which again imposes typing limitations. All of the typing systems described above are amenable to whole-genome sequencing (WGS) approaches.

In the era of WGS, we are increasingly creating more data than can be meaningfully interrogated, meaning that new approaches that enhance the speed and accuracy of data analysis are required. In this context, machine learning (ML) is becoming both ubiquitous and crucial in the subdisciplines of biology, where the growth and centralization of data is exploding22,23. What is ML? In general, traditional computer software programming applies rules to data to infer an answer. By contrast, ML flips this paradigm by applying algorithms and statistical analytics to the data and the answer (when available) to infer predictive models24. The answer is alternatively referred to as a label. Supervised learning is a ML methodology that can be applied if the label is available, where a model is trained to predict the label (that is, to ‘learn by example’)25. Once validated, the model can be applied to test datasets to predict the ‘label’ for unseen datasets. While the popularity of applying ML in biology is rising rapidly, a great concern on the ‘Blackbox-ness’ of the model remains26, that is, a biologically interpretable model is preferred to contribute to our understanding of biology. The random forest (RF) is a supervised ML algorithm, that is based on an ensemble of decision trees27,28. Each tree is constructed from a random set of input (or predictor) features, and the output of the ensemble is a majority vote of the trees that reduces the risk of inference error introduced by individual trees29,30. The RF has several attributes that make it particularly suitable to this study including: robustness, scalability, and its ability to handle both categorical and continuous data types. More importantly, RF generates ML models with a high interpretability.

We hypothesised that the RF algorithm could be applied to the variation in the DNA sequences of GAS RRs to infer genomic traits. We tested this by inferring the emm-type, emm-subtype, country of origin, preferred tissue of infection, propensity to cause invasive disease, and clinical outcome of GAS isolates using the Ordinary, Regularized, and Guided random forest algorithms.

Results

Application of random forest classifiers to infer group A Streptococcus emm-type from variation in the response regulator allele types

The accuracy with which the emm-type of an isolate was inferred from the 53 selected RR allele types using the three RF classifiers tested ranged from 95.2 to 96.7% (Table 1). The highest and lowest accuracies were observed using the Guided, and Regularized RF classifiers, respectively. The mean accuracy of the three classifiers was 95.9%. The multiclass classification performance metrics including F1, Precision, and Recall are included in Supplementary Table S1.

Table 1 Summary of the highest accuracy with which the emm-type was inferred when the three tested random forest algorithms were applied to the optimal set of response regulator allele types of group A Streptococcus. Predictions were made using tenfold cross validation and 10 replicates.

Figure 1 summarises the normalised non-zero importance scores of the predictor features (RR alleles) selected by each of the RF classifier types in attaining the highest accuracy when inferring the emm-type. The Guided (A), Ordinary (B), and Regularized (C) RF classifiers selected ten, three, and four RR alleles to attain 96.7%, 95.7%, and 95.2%, respectively.

Figure 1
figure 1

Normalised importance scores of group A Streptococcus response regulator (RR) alleles displaying the highest accuracy in inferring the isolate emm-type for the three RF classifiers tested. The Guided (a), Ordinary (b), and Regularized (c) RF classifiers employed ten, three, and four RR alleles to attain 96.7%, 95.7%, and 95.2%, respectively. The SPY locus numbers refer to the SF370 isolate, unless stated otherwise.

The importance score rankings of the minimum feature set of RR alleles required to attain the highest accuracy (optimal feature sets), for inferring the emm-type of the three RF classifiers tested are detailed in Table 2. We discovered that each of the RF classifiers had a different number of features in the optimal set. Notably, mga2 and lrp were rank most important in all three. To test the prediction power of mga2 and lrp, we applied the Ordinary, Regular, and Guided classifiers to an input dataset composed of only these two allele types, and were able to predict the emm-type with accuracies of 86.7%, 93.7%, and 86.7%, respectively. The mean value was 89.0%.

Table 2 Importance value rankings of response regulators alleles (predictor features) in the optimal feature sets inferring GAS emm-type for the random forest algorithms tested.

The susceptibility testing (Fig. 2) shows the relationship between the accuracy of inferring the emm-type and the number of predictor features selected for each of the RF classifiers. The curve of best fit for each of the RF classifiers displayed a clear elbow and a minimum threshold number of features below which there was a decline in accuracy of inference with decreasing number of predictor features. While above this threshold the accuracy of inference displayed a relative insusceptible to number of predictor features.

Figure 2
figure 2

Susceptibility tests. The accuracy of inferring the group A Streptococcus emm-type by applying a different number of predictor features (response regulator alleles) to each of the three tested random forest classifiers [(a) Guided, (b) Ordinary, and (c) Regularized).

Considering the Guided RF of the highest accuracy of inference, Table 3 lists all of the isolates for which the inferred emm-type differed from the observed (published) emm-type, and summarises our attempt to assign a putative biological or bioinformatic explanation for the inaccuracy. We identified explanations for ten isolates, which included the following: Prior to testing, it was known that the variation between the RR alleles tested was not able to discriminate emm79 from emm183, or emm101 from emm205 (non-discriminatory). Moreover, inference of emm-types that only had one representative isolate in the dataset (singletons) was potentially problematic given the methodology used. Similarly inferring the emm-type of an isolate that has undergone emm-switching or a chimeric emm-enn event had potential to give inaccurate inferences.

Table 3 Examples of inaccurately inferred GAS emm-type using the most accurate Guided random forest algorithm and the optimal set of response regulator (RR) allele types.

Novel chimeric cell wall-spanning domain and chimeric emm-enn events

We observed a novel cell wall-spanning domain that is described by the chimerisation of SF3 and SF19,10, that we have labelled SF5 (Fig. 3). The nucleotide sequences at the 3′ end of a gene in the mga regulon were observed to share 100% identity with SF5 in emm39.4 (n = 13 of 13) and emm137.0 (n = 2 of 2) isolates.

Figure 3
figure 3

Novel cell wall-spanning domain of group A Streptococcus (GAS) emm, SF5, described by the chimerisation of SF3 and SF19,10. SF3 and SF1 are typical encoded in the majority of enn and a subset of emm, respectively. SF5 was observed in emm39.4 GAS (31005V6S1) and emm137.0 GAS (33181V1T1_01).

We also observed two novel chimeric emm-enn events in the mga regulon whose evolutionary pathways are depicted in Fig. 4. Note that the Centre for Disease Control and Prevention (CDC) emm-subtyping sequence loci of the parental strains (31005V6S1 and K5797) were deleted and retained in the mutant strain, respectively.

Figure 4
figure 4

Evolutionary pathways of two novel chimeric emm-enn events in the mga regulon of GAS. (a) ‘Contrariwise’ and (b) ‘Likewise’ events are depicted were the mutated isolate changes its emm-subtype, and retains its emm-subtype, respectively. The chimeric emm-enn is represented by a deletion-fusion event that culminates in a new gene containing the 5′ end of emm and the 3′ end of enn.

While searching for other strains that contained the CDC emm137.0 sequence, we noted the following in the draft genome of the Kenya isolate emm39.4 ST236 K13190. The largest scaffold (that is, the genome) encoded a contiguous and intact mga regulon (mga, emm, and scpA), while a much smaller scaffold (1445 bp) encoded the 3′ end of an SF1-containing emm which was adjacent to the 5′ end of enn (which contained the 180nt CDC emm137.0 sequence). All of which warns that for accuracy, the WGS emm-subtype sequences must be read in the context of their position in the mga regulon.

Plasticity in the mga regulon of the E3 emm-cluster type isolates

Frost et al.21 observed five novel chimeric emm-enn genes in emm9, emm44, emm58, emm73, and emm82 isolates. We noted that only emm73 is not of emm-cluster type E3. Furthermore, Frost’s study revealed 20 incomplete emm open reading frames (ORFs), of which the E3 emm-cluster type isolates were of emm103 (n = 2), emm25 (n = 2), emm58 (n = 1), emm82 (n = 1), and emm9 (n = 2) type. Frost and coworkers also observed that pgs, encoding a conserved protein between emm and enn, showed relatively high levels of expression compared to the other mga regulon genes, and was only present in E3 emm-cluster type isolates. With one exception, we have noted that pgs was encoded in all isolates of the monophyletic E3 emm-cluster subclade composed of emm25, emm58, emm79, emm82, emm87, emm103, and emm209 types (Fig. 5). This exception was emm82 NGAS473 whose emm-switch event we have described previously19.

Figure 5
figure 5

Phylogeny of group A Streptococcus E3 emm-cluster types. The tree has been labelled with the corresponding emm-type. The table summarises examples of recombination and mutation observed in the mga regulon of E3-type isolates. The tree is drawn to scale, with branch lengths in the same units (number of amino acid substitutions per site) as those of the evolutionary distances used for the phylogenetic tree. Approximate likelihood-ratio test values > 80% are indicated at the nodes. Adapted from Ref.13. Legend: ANGAS473, an emm82 isolate, inferred to have been the result of an emm-switch event has been previously described18,19, and was observed in this study to be pgs-negative.

We observed that mrp to enn of the Fiji isolate emm58.0 ST176 20059V1I1 shared 100% nucleotide identity with the Fiji isolate ST176 33087V1T1 except for a thymine deletion in the CDC emm-subtyping sequence (209delT), and a 231 nt deletion starting at the 585th nucleotide of 20059V1I1. The 209delT deletion caused a frameshift and subsequent premature stop codon in emm. We therefore propose an evolutionary pathway from strain 20059V1I1 to 33087V1T1 (Fig. 6). Furthermore, it is probable that these deletions have dramatically altered or halted the known function of emm. Additionally, the thymine deletion renders the isolate non-typable by the conventional CDC emm-subtyping sequence (180nt). We contend that 33087V1T1 could be an intermediate strain that demonstrates a mechanism for the time-dependent excision of genes in the mga regulon as seen in chimeric emm-enn deletion fusion events. Taken together, this provides evidence of the extreme plasticity of the E3 mga regulon. In another noteworthy yet non-E3-related observation that highlights the plasticity of the GAS mga regulon, we saw that the genome of the Fiji isolate, ST129 emm65.4 (33124V1T1) of emm-cluster type E6, encoded two fully formed emm genes.

Figure 6
figure 6

Evolutionary pathway explaining the major disruption to emm of group A Streptococcus 33087V1T1. This also represents a mechanism for the time-dependent excision of the genes of the mga regulon seen in chimeric emm-enn events. It is likely that the nucleotide deletions observed in 20059V1I1 cause disruption that drastically diminishes the function of emm, leading to its eventual deletion.

Application of random forest classifiers to infer other group A Streptococcus genomic traits from variation in the response regulator allele types

Figure 7 summarises the accuracy when inferring the other GAS genome traits (emm-subtype, tissue, clinical, country, and invasiveness) from variation in the 53 RR alleles using the three RF classifiers tested. Across the three RF classifier types the mean accuracy of inferring the emm-subtype, country, invasiveness, clinical, and tissue was 89.9%, 88.6%, 84.7%, 56.9%, and 56.4%, respectively. All of these values were less than the equivalent mean value for the prediction of the emm-type (95.9%). Using only mga2 and lrp as the input dataset, we inferred the genomic traits of emm-subtype, invasiveness, country, clinical, and tissue of isolates. The mean accuracies of inference of these genomic traits for the three RF classifiers were 84.2%, 83.4%, 83.1%, 53.5%, and 52.4%, respectively. Again, all of these values were less than the equivalent mean value for the prediction of the emm-type (89.0%). Along with emm-type, these results suggest a potential utility for inferring emm-subtype, country, and invasiveness, but not for tissue and clinical. This last observation is consistent with the complexity of the interaction between the pathogenic GAS isolate and the immune system of the infected host.

Figure 7
figure 7

Accuracy of the random forest classifiers tested in inferring group A Streptococcus genomic traits from a selection of 53 response regulator allele types. The labels tested include emm-subtype, the tissue and country from which the isolate was sampled, clinical outcomes from the infection, and the propensity of the isolate to cause invasive disease.

Discussion

In this study we applied three RF algorithms to the variation in a selection of GAS RR allele types in order to infer the emm-type of the isolate with high accuracy. This analysis enabled us to infer the emm-subtype, country of sample, and invasiveness of the isolate. However, we were not able to accurately infer the tissue sampled or clinical outcomes of the infection. We investigated the causes of inaccuracy when inferring the emm-type using the optimal Guided RF feature set because it was the most accurate configuration for this purpose. From this we identified a novel chimera of the conserved cell wall spanning domains, SF3 and SF19,10, that we have labelled SF5 (Fig. 3). We also identified two novel chimeric emm-enn events in the mga regulon. These events were in emm77.0 and emm39.4 type isolates. We defined the events seen in the emm77.0 and emm137.0 isolates as ‘likewise’ (Fig. 4b) and ‘contrariwise’ (Fig. 4a), respectively. Finally, we proposed an evolutionary pathway describing the disrupted emm of the E3-type Fiji strain 33087V1T1, from which we contend that this isolate represents an intermediate strain that suggests a mechanism for the time-dependent excision of genes in the mga regulon.

Application of the random forest algorithm and the response regulator allele types to infer the emm-type

We have demonstrated that using each of the three types of RF tested we were able to infer the GAS emm-type from the variation in the selected feature sets of 53 RR allele types with high accuracy. This is important because it represents an alternative to the emm-based systems whose accuracy is susceptible to the plasticity of the mga regulon. Additionally, it shows that RR-based typing is backwards compatible with the vast emm-centric GAS knowledge base. Table 4 collates the relative strengths and weaknesses of the emm-based and RR-based typing system.

Table 4 Comparison of properties of the emm-based and response regulator-based typing systems of group A Streptococcus.

We were able to describe the feature sets (of RRs) that attained the highest accuracy of predicting emm-type (that is, optimal feature sets) for each of the RF algorithms (Table 2). The highest overall accuracy was attained using the Guided RF with the following ten RRs: mga2, lrp, spy1934, spy0715 (gntR-like), rivR, M28_spy1337, spy1325, spy1602 (gntR-like), spy1817, and crgR. The optimal feature set for the RF algorithms were of different composition (that is, size and constituents).

It should be noted that mga2 and lrp were ranked most important in all three, suggesting a mathematical importance in inferring emm-type. Of the 53 RR allele types tested, mga2 had the highest ranking importance score when inferring emm-type for all three RF algorithms. The following are offered as reasons why mga2 ranked highest. Biologically, mga is an important response regulator which controls the expression of more than 10% of the GAS genome, and Mga is a large enough protein (62 kDa) to contain multiple functional domains31. Furthermore, mga is encoded proximally to emm, and regulates the transcription of emm. Mathematically, we also have previously measured that mga2 had the highest number of unique allele types of 35 of the 53 RRs tested in this dataset19. Thus, biologically and mathematically, it is predicable that the importance score of mga2 would rank highly when inferring the emm-type.

We also saw a threshold number of feature variables (RRs) below which the accuracy declined, but above which the accuracy showed a relative insusceptibility to the number of predictor variables (RRs) (Fig. 2). This is of different significance for both in silico and laboratory-based analyses. While the impost of testing all 53 RRs in silico is negligible, a reduction to ten variables (RRs) may represent a significant economic saving of resources in the laboratory. Furthermore, reduction of predictor variables (RRs) in this in silico, WGS-based analysis was predicated on an increase in accuracy. Thus, in vitro testing must consider the trade-off of accuracy when decreasing the number of RRs tested on the basis of economics. Regardless, our findings represent a significant reduction in the search space and a logical starting test set for in vitro studies.

We endeavoured to understand when our process inferred the wrong emm-type for the Guided RF. That is, the inferred and published emm-type were different. We were able to propose putative biological explanations for approximately half of examples of inaccurate inferences (Table 3). These included the following scenarios. Firstly, isolates of either emm105 or emm205 type that shared identical RR allele types but have differing emm-types were at risk of incorrect typing. Similarly, the variation in the RR alleles of the optimal feature set of the Guided RF could not discriminate emm79 from emm183. Secondly, as a by-product of the supervised learning methodology employed, singleton emm-types were also at risk of inaccurate inference if that isolate had not been included in the training genome set. Thirdly, the singletons may also include examples of emm-switching and chimeric emm-enn events, although these occur relatively infrequently. Furthermore, inference of emm-type in the examples of emm-switching is problematic because the background RR allele types are likely to reflect the recipient isolate and not the donor of the recombined emm. Similarly, for chimeric emm-enn events, the background RR alleles will likely reflect that of the pre-event parental isolate which may have had a different emm-type. Together these anomalies had the unintended consequence of forming the basis of a preliminary method for identifying emm-switching and chimeric emm-enn events, which is a current unmet need in the GAS community.

Utility of the random forest in identifying chimeric cell wall-spanning domain and chimeric emm-enn events

We observed that the ST268 emm137.0 isolate (33181V1T1_01), that had been sampled from Fiji in 2006, was incorrectly inferred to be of emm-type 39 using the RRs and the Guided RF. Upon closer inspection we identified that 33181V1T1_01 possessed a novel cell wall-spanning domain, SF5, which is a chimerisation of two of the canonical cell wall-spanning domains (SF3 and SF1). SF5 was also observed in the ST268 emm39.4 isolate (31005V6S1), which had also been isolated from Fiji in 2006.

Additionally, emm137.0 of 33181V1T1_01 shared 100% nucleotide identity with the chimeric fusion of the 5′ end of emm and the 3′ end of enn of 31005V6S1. We noted that the CDC emm-subtyping sequence locus of the parental strain (31005V6S1) was deleted from the mutant strain (33181V1T1_01). We propose that this represents a novel chimeric emm-enn event (‘contrariwise’), that is visually depicted in Fig. 4a. We also identified that the ST747 emm77.0 isolate (K21246), which had been sampled from Kenya in 2007, possessed SF3 at the 3′ end of emm, noting that SF3 is canonically encoded in enn. The ST747 emm77.0 isolate (K5797) had been sampled from Kenya in 2000. We observed that emm77.0 of K21246 shares 100% identity with the fusion of the 5′ end of emm and the 3′ end of enn of K5797, noting that the CDC emm-subtyping sequence locus of the parental strain (K5797) was retained in the mutant strain (K21246). We propose that this too represents a novel chimeric emm-enn event (‘likewise’), that is visually depicted in Fig. 4b. The observed configurations of the conserved cell wall-spanning domains provided additional evidence to support our proposed evolutionary pathways.

We have chosen to define the first of the chimeric emm-enn events as ‘contrariwise’ (33181V1T1_01) and the second as ‘likewise’ (K21246) in preference to synonymous and non-synonymous to avoid confusion with established molecular biology nomenclature. Furthermore, typing of these mutants should be emm137C (33181V1T1_01) and emm77C (K21246) for consistency with the prevailing convention for chimeric emm and M-like genes11,19,21. It should be noted that historically contrariwise events may have been labelled as emm-switches. Where the evidence supports, we recommend that the term emm-switch is reserved for recombination events that have involved horizontal gene transfer of DNA containing emm. The recombinant DNA may also include DNA, that is adjacent to emm, other than that encoding emm. Recombination events have been identified in emm82C isolates, from which it can be inferred that the isolates have undergone both an emm-switch and a ‘likewise’ chimeric emm-enn event18,19. Finally, findings in this part of the project confirmed the bioinformatics-related imperative of reading emm-subtyping sequences in the context of their location within the mga regulon, when typing WGS genomes.

Plasticity in the mga regulon of the E3 emm-cluster type isolates

We observed a high degree of disruption to the E3 mga regulon caused by mutation and recombination. Additionally, we proposed an evolutionary pathway describing the disrupted emm of the E3-type Fiji strain 33087V1T1. Given this disruption we would expect to see either one of two eventualities. Firstly (and unlikely), the disrupted emm is retained in the genome because it confers advantage and is selected in the population. Or secondly (and more likely given the pre-eminence of emm in host immune evasion), that further deleterious mutations are acquired and the locus is gradually excised. Therein, we contend that this isolate represents an intermediate strain, suggesting a mechanism for the time-dependent excision of genes in the mga regulon. This represents an important finding in the evolutionary history of E3 GAS. Our findings are also epidemiologically important, because E3-type isolates have recently been identified in outbreaks as the causative agent of severe GAS disease18,32,33,34,35. Furthermore, given that emm is a major focus in GAS vaccine development, we consider that this work increases our understanding of rare and anomalous mga regulons. From a bioinformatics perspective, we contend also that it would not be prudent to assign an emm-subtype to the likes of 33087V1T1 (the non-typable intermediate strain evolved from emm58.0 20059V1I1) using the emm locus when a frameshift has disrupted and likely deleted the function of emm, making it susceptible to future excision. Overall, we assert that the plasticity of the E3 mga regulon represents a ‘snap shot’ of the real-time evolution of GAS and represents a recombination hotspot.

Application of the random forest algorithm and the response regulator allele types to infer other GAS genomic traits

A recent review by Allen et al. highlights the extensive opportunity that exists for the application of machine learning in microbiological genomics36. In this study, we were able to accurately predict the emm-subtype, the country from which the isolate was sampled, and the invasiveness of the isolate. Our work stands as template for predicting other untested GAS genomic traits using RF and RR allele types. Given the accuracy of the inference of invasiveness, we would suggest potential utility in an in vitro assay for predicting invasive GAS isolates.

This process flow was ineffective at inferring the tissue from which the isolate was sampled, and the clinical outcomes. Given the complexity of the GAS-host immunity interaction, this is not an unexpected result. Furthermore, it inspires an exciting question. Can a higher degree of accuracy be achieved if the input data set is augmented with judiciously selected human gene allele types?

Conclusions

In this study we applied the RF algorithm to the variation in a set of GAS RR alleles and were able to infer the emm-type with high accuracy. The highest accuracy, 96.7%, was achieved using the Guided RF. We identified the optimal feature sets (of RRs) for three different RF classifiers, therein describing how many, and which alleles had greatest importance when inferring the emm-type with the greatest accuracy. We observed that each RF classifier had a threshold number of features below which the accuracy of inference dropped, but above which the accuracy was relatively insusceptible to the number of features. By examining the potential sources of inaccuracy, we discovered a novel mga regulon cell wall-spanning domain, SF5. We also proposed two novel evolutionary pathways of chimeric emm-enn events in the mga regulon in which the original emm-type was retained by one, but was changed in the other. We defined these as ‘likewise’ and ‘contrariwise’ events, respectively. We also proposed an evolutionary pathway that describes frameshift mutation-induced disruption to emm, which results in a strain that represents an intermediate step in the time-dependent excision of genes in the mga regulon. We were also able to usefully predict emm-subtype, the country from which the isolate was sampled, and the invasiveness of the isolate. However, we were unable to predict the tissue from which the isolate was sampled, or the clinical outcome from the infection. Thus, ML has allowed us to interpret the biology of GAS and propose new evolutionary models. Noting that the RF has been under-utilised in the GAS community to date, we propose that our process flow serves as a template for the prediction of other untested GAS genomic traits.

Methods

Rationale

This study was designed to test whether the nucleotide sequences of GAS RR genes (predictor features) can predict genomic traits (labels). The following summarises our justification for investigating RRs. Traditional studies of GAS virulence have primarily focused on the biofunction of GAS virulence genes (including exotoxins, DNases, proteases, surface-exposed adherence-related proteins, and other bioactive enzymes) to the widespread exclusion of the co-ordinate regulation of the initiation of transcription of the aforementioned virulence genes. GAS RRs regulate the transcription of the majority of GAS virulence genes (and often themselves because many are autoregulating). Furthermore, GAS RRs are the major regulators of the GAS gene expression profile in the growth phase, in lieu of sigma factors. Therein, we hypothesised that variation in DNA sequences of the RR genes might correlate with GAS genomic traits. Regardless, the workflow developed in this study should represent a viable template for the investigation of untested GAS virulence genes in the future.

The RF ML algorithm was chosen because it is robust, scalable, capable of processing large datasets with high dimensionality and heterogeneous feature (or variable) types, and provides high interpretability. The three RFs tested were the Ordinary, Regularized, and Guided27,37. The regularized RF introduces a penalty to the inclusion of a new feature during decision tree building. Thus, a regularized RF only adds new features if those new features provide substantial new predictive information. The guided RF uses the importance scores from an ordinary random forest to guide the feature selection. Furthermore, we intended to find the maximum accuracy with which we could infer the response variables (genomic traits), and investigate the susceptibility of inference accuracy to the reduction of predictor feature set size.

Input data

The input data was composed of 53 nucleotide-based GAS RR allele types and six genomic traits, extracted from 944 genomes (Fig. 8a). Of these allele types, 35 have been described previously19, and 18 were less well characterised or putative RRs which still displayed the characteristic helix-turn-helix DNA binding domain inferred by SMART domain38. The six genomic traits were the emm-type, emm-subtype, the human tissue (tissue) and country from which the GAS isolate was sampled, clinical outcomes observed from the isolate (clinical), and the propensity for the isolate to cause invasive disease (invasiveness). The genomic traits of tissue and clinical correspond to the metadata fields that are titled ‘tissue/source’ and ‘clinical’ as previously described19,39. ‘Tissue/source’ essentially represents the tissue sampled, ‘clinical’ is an assemblage of presentation and disease outcome. The input data is included in Supplementary Table S2. The SPY locus numbers refer to the SF370 isolate, unless otherwise denoted. To minimise possible overfitting, the genomes were randomly separated into the training (n = 629) and test (n = 315) sets. emm-types that had only one representative in the dataset (singletons) were included in the training set. Furthermore, prior to testing with the RF algorithms, we established that the dataset contained a pair (n = 2 of 125) of different emm-types that shared identical RR allele types (emm105 and emm205).

Figure 8
figure 8

Summary of (a) input data nomenclature and (b) process flow of this study.

Process flow

Step 1 of this study involved training the RF classifier using the input data of the RR allele types as training predictor features and the genomic traits as training labels (Fig. 8b). Output data generated from this step were the selected predictor feature set importance scores of the RF classifier. In RF feature selection, the redundant or irrelevant features were eliminated under the guidance of the feature importance scores40. Simplistically, the feature importance score is a measure of the importance of the attribute in inferring the correct classification41. In step 2 the RF classifier was applied to the RR alleles types (test predictor features) to infer the classification of the six genomic traits (labels). The inferred genomic traits were then compared to the observed genomic traits, and an accuracy of inference was calculated in step 3.

In step 4, susceptibility testing, the histogram of the normalised importance scores of the feature selection set was investigated (Fig. 9). Step 4 was designed to select subsets of predictor features each containing the nth most important features with the intention of maximising the accuracy of inference. A threshold normalised importance score was arbitrarily set at a value close to 1, therein defining a subset of the selected predictor feature set that only contained the features with the highest importance scores. This was repeated for multiple arbitrarily decreasing threshold importance scores, creating subsets with increasingly more features. Steps 2 and 3 were then applied to each of these subsets and plots were generated to assess the susceptibility of the accuracy of inference to the number of predictor features (RR allele types). Steps 1–4 were repeated for the three different RF classifier algorithms.

Figure 9
figure 9

Schematic representation of the normalised importance score plot for the selected predictor feature set. Subsets of features were selected based on arbitrary threshold normalised importance values. Steps 2 and 3 of the process flow were then applied to each of these subsets.

Cross validation and multiclass classification performance evaluation for the optimal feature sets

We defined the optimal feature set as the set of response regulators from which the emm-type was inferred with the highest accuracy for each of the three random forest algorithms. Ten-fold cross validation with ten replicates was performed to measure the stability of these inferences using the ‘caret’42 and ‘RRF’ R-code packages37. As previously described, the random forest is an ensemble learning method, in which the classifier is derived from the voting results of multiple decision trees. Inferring emm-type from this dataset was a multiclass classification. In order to evaluate the performance of these multiclass classifications, we employed the ‘pROC’ R-code package43 to estimate the multiclass area under the receiver operating characteristic curve (AUC)44. The multiclass.roc function of ‘pROC’ was applied to predictions determined using the ‘vote’ parameter in the predict function of the ‘RRF’ package. We also collated the F1, Precision, and Recall metrics. The R-code was implemented in version 1.1.447, and is included as Supplementary File S3.

Development of evolutionary pathway models

In the development of the evolutionary pathways, the chimerisation of genes was confirmed using BLASTn45, as implemented in Geneious V 8.1.946, and visually depicted using Easyfig47.