Application of the random forest algorithm to Streptococcus pyogenes response regulator allele variation: from machine learning to evolutionary models

Group A Streptococcus (GAS) is a globally significant bacterial pathogen. The GAS genotyping gold standard characterises the nucleotide variation of emm, which encodes a surface-exposed protein that is recombinogenic and under immune-based selection pressure. Within a supervised learning methodology, we tested three random forest (RF) algorithms (Guided, Ordinary, and Regularized) and 53 GAS response regulator (RR) allele types to infer six genomic traits (emm-type, emm-subtype, tissue and country of sample, clinical outcomes, and isolate invasiveness). The Guided, Ordinary, and Regularized RF classifiers inferred the emm-type with accuracies of 96.7%, 95.7%, and 95.2%, using ten, three, and four RR alleles in the feature set, respectively. Notably, we inferred the emm-type with 93.7% accuracy using only mga2 and lrp. We demonstrated a utility for inferring emm-subtype (89.9%), country (88.6%), invasiveness (84.7%), but not clinical (56.9%), or tissue (56.4%), which is consistent with the complexity of GAS pathophysiology. We identified a novel cell wall-spanning domain (SF5), and proposed evolutionary pathways depicting the ‘contrariwise’ and ‘likewise’ chimeric deletion-fusion of emm and enn. We identified an intermediate strain, which provides evidence of the time-dependent excision of mga regulon genes. Overall, our workflow advances the understanding of the GAS mga regulon and its plasticity.


Results
Application of random forest classifiers to infer group A Streptococcus emm-type from variation in the response regulator allele types. The accuracy with which the emm-type of an isolate was inferred from the 53 selected RR allele types using the three RF classifiers tested ranged from 95.2 to 96.7% ( Table 1). The highest and lowest accuracies were observed using the Guided, and Regularized RF classifiers, respectively. The mean accuracy of the three classifiers was 95.9%. The multiclass classification performance metrics including F1, Precision, and Recall are included in Supplementary Table S1. Figure 1 summarises the normalised non-zero importance scores of the predictor features (RR alleles) selected by each of the RF classifier types in attaining the highest accuracy when inferring the emm-type. The Guided (A), Ordinary (B), and Regularized (C) RF classifiers selected ten, three, and four RR alleles to attain 96.7%, 95.7%, and 95.2%, respectively.
The importance score rankings of the minimum feature set of RR alleles required to attain the highest accuracy (optimal feature sets), for inferring the emm-type of the three RF classifiers tested are detailed in Table 2. We discovered that each of the RF classifiers had a different number of features in the optimal set. Notably, mga2 and lrp were rank most important in all three. To test the prediction power of mga2 and lrp, we applied the Table 1. Summary of the highest accuracy with which the emm-type was inferred when the three tested random forest algorithms were applied to the optimal set of response regulator allele types of group A Streptococcus. Predictions were made using tenfold cross validation and 10 replicates. a The optimal sets for the Ordinary, Regularized, and Guided random forests were [mga2, lrp, and gntR_spy0715], [mga2, lrp, copY, and crgR], and [mga2, lrp, spy1934, gntR_spy0715, rivR, M28_spy1337, spy1325, gntR_spy1602, spy1817, and crgR], respectively. b AUC = Multiclass classification area under the receiver operating characteristic curve. c Division by zero errors have been excluded from this average. www.nature.com/scientificreports/ Ordinary, Regular, and Guided classifiers to an input dataset composed of only these two allele types, and were able to predict the emm-type with accuracies of 86.7%, 93.7%, and 86.7%, respectively. The mean value was 89.0%. The susceptibility testing (Fig. 2) shows the relationship between the accuracy of inferring the emm-type and the number of predictor features selected for each of the RF classifiers. The curve of best fit for each of the RF classifiers displayed a clear elbow and a minimum threshold number of features below which there was a decline Figure 1. Normalised importance scores of group A Streptococcus response regulator (RR) alleles displaying the highest accuracy in inferring the isolate emm-type for the three RF classifiers tested. The Guided (a), Ordinary (b), and Regularized (c) RF classifiers employed ten, three, and four RR alleles to attain 96.7%, 95.7%, and 95.2%, respectively. The SPY locus numbers refer to the SF370 isolate, unless stated otherwise. Table 2. Importance value rankings of response regulators alleles (predictor features) in the optimal feature sets inferring GAS emm-type for the random forest algorithms tested. The optimal feature set is the set of features (from 53 response regulator alleles) selected in attaining the highest accuracy of inferring the emmtype for a particular random forest algorithm. a The percentage in brackets is the accuracy of inference. www.nature.com/scientificreports/ in accuracy of inference with decreasing number of predictor features. While above this threshold the accuracy of inference displayed a relative insusceptible to number of predictor features.

Random forest algorithm a Accuracy (%) AUC b (%) F1 c (%) Precision c (%) Recall c (%)
Considering the Guided RF of the highest accuracy of inference, Table 3 lists all of the isolates for which the inferred emm-type differed from the observed (published) emm-type, and summarises our attempt to assign a putative biological or bioinformatic explanation for the inaccuracy. We identified explanations for ten isolates, which included the following: Prior to testing, it was known that the variation between the RR alleles tested was not able to discriminate emm79 from emm183, or emm101 from emm205 (non-discriminatory). Moreover, inference of emm-types that only had one representative isolate in the dataset (singletons) was potentially problematic given the methodology used. Similarly inferring the emm-type of an isolate that has undergone emm-switching or a chimeric emm-enn event had potential to give inaccurate inferences.
Novel chimeric cell wall-spanning domain and chimeric emm-enn events. We observed a novel cell wall-spanning domain that is described by the chimerisation of SF3 and SF1 9,10 , that we have labelled SF5 (Fig. 3). The nucleotide sequences at the 3′ end of a gene in the mga regulon were observed to share 100% identity with SF5 in emm39.4 (n = 13 of 13) and emm137.0 (n = 2 of 2) isolates.
We also observed two novel chimeric emm-enn events in the mga regulon whose evolutionary pathways are depicted in Fig. 4. Note that the Centre for Disease Control and Prevention (CDC) emm-subtyping sequence loci of the parental strains (31005V6S1 and K5797) were deleted and retained in the mutant strain, respectively.
While searching for other strains that contained the CDC emm137.0 sequence, we noted the following in the draft genome of the Kenya isolate emm39.4 ST236 K13190. The largest scaffold (that is, the genome) encoded a contiguous and intact mga regulon (mga, emm, and scpA), while a much smaller scaffold (1445 bp) encoded the 3′ end of an SF1-containing emm which was adjacent to the 5′ end of enn (which contained the 180nt CDC emm137.0 sequence). All of which warns that for accuracy, the WGS emm-subtype sequences must be read in the context of their position in the mga regulon.
Plasticity in the mga regulon of the E3 emm-cluster type isolates. Frost et al. 21 observed five novel chimeric emm-enn genes in emm9, emm44, emm58, emm73, and emm82 isolates. We noted that only emm73 is not of emm-cluster type E3. Furthermore, Frost's study revealed 20 incomplete emm open reading frames (ORFs), of which the E3 emm-cluster type isolates were of emm103 (n = 2), emm25 (n = 2), emm58 (n = 1), emm82 (n = 1), and emm9 (n = 2) type. Frost and coworkers also observed that pgs, encoding a conserved protein between emm and enn, showed relatively high levels of expression compared to the other mga regulon genes, and was only present in E3 emm-cluster type isolates. With one exception, we have noted that pgs was encoded in all isolates of the monophyletic E3 emm-cluster subclade composed of emm25, emm58, emm79, emm82, emm87, emm103, and emm209 types (Fig. 5). This exception was emm82 NGAS473 whose emm-switch event we have described previously 19 .
We observed that mrp to enn of the Fiji isolate emm58.0 ST176 20059V1I1 shared 100% nucleotide identity with the Fiji isolate ST176 33087V1T1 except for a thymine deletion in the CDC emm-subtyping sequence www.nature.com/scientificreports/ (209delT), and a 231 nt deletion starting at the 585th nucleotide of 20059V1I1. The 209delT deletion caused a frameshift and subsequent premature stop codon in emm. We therefore propose an evolutionary pathway from strain 20059V1I1 to 33087V1T1 (Fig. 6). Furthermore, it is probable that these deletions have dramatically altered or halted the known function of emm. Additionally, the thymine deletion renders the isolate non-typable by the conventional CDC emm-subtyping sequence (180nt). We contend that 33087V1T1 could be an intermediate strain that demonstrates a mechanism for the time-dependent excision of genes in the mga regulon as seen in chimeric emm-enn deletion fusion events. Taken together, this provides evidence of the extreme plasticity of the E3 mga regulon. In another noteworthy yet non-E3-related observation that highlights the plasticity of the GAS Table 3. Examples of inaccurately inferred GAS emm-type using the most accurate Guided random forest algorithm and the optimal set of response regulator (RR) allele types. a emm-cluster type in brackets. b The observed or published emm-type. c Prior to the random forest testing, it was known that the variation between the RR alleles in the feature set was not able to discriminate emm79 from emm183, or emm101 from emm205. d Singleton denotes where the dataset contained only one representative of this emm-type. e Chimeric emm-enn events have been observed in isolates of this emm-type. f emm-switching has also been inferred in this isolate.  www.nature.com/scientificreports/  Approximate likelihood-ratio test values > 80% are indicated at the nodes. Adapted from Ref. 13 . Legend: ANGAS473, an emm82 isolate, inferred to have been the result of an emm-switch event has been previously described 18,19 , and was observed in this study to be pgs-negative. Application of random forest classifiers to infer other group A Streptococcus genomic traits from variation in the response regulator allele types. Figure 7 summarises the accuracy when inferring the other GAS genome traits (emm-subtype, tissue, clinical, country, and invasiveness) from variation in the 53 RR alleles using the three RF classifiers tested. Across the three RF classifier types the mean accuracy of inferring the emm-subtype, country, invasiveness, clinical, and tissue was 89.9%, 88.6%, 84.7%, 56.9%, and 56.4%, Figure 6. Evolutionary pathway explaining the major disruption to emm of group A Streptococcus 33087V1T1. This also represents a mechanism for the time-dependent excision of the genes of the mga regulon seen in chimeric emm-enn events. It is likely that the nucleotide deletions observed in 20059V1I1 cause disruption that drastically diminishes the function of emm, leading to its eventual deletion. Figure 7. Accuracy of the random forest classifiers tested in inferring group A Streptococcus genomic traits from a selection of 53 response regulator allele types. The labels tested include emm-subtype, the tissue and country from which the isolate was sampled, clinical outcomes from the infection, and the propensity of the isolate to cause invasive disease. www.nature.com/scientificreports/ respectively. All of these values were less than the equivalent mean value for the prediction of the emm-type (95.9%). Using only mga2 and lrp as the input dataset, we inferred the genomic traits of emm-subtype, invasiveness, country, clinical, and tissue of isolates. The mean accuracies of inference of these genomic traits for the three RF classifiers were 84.2%, 83.4%, 83.1%, 53.5%, and 52.4%, respectively. Again, all of these values were less than the equivalent mean value for the prediction of the emm-type (89.0%). Along with emm-type, these results suggest a potential utility for inferring emm-subtype, country, and invasiveness, but not for tissue and clinical. This last observation is consistent with the complexity of the interaction between the pathogenic GAS isolate and the immune system of the infected host.

Discussion
In this study we applied three RF algorithms to the variation in a selection of GAS RR allele types in order to infer the emm-type of the isolate with high accuracy. This analysis enabled us to infer the emm-subtype, country of sample, and invasiveness of the isolate. However, we were not able to accurately infer the tissue sampled or clinical outcomes of the infection. We investigated the causes of inaccuracy when inferring the emm-type using the optimal Guided RF feature set because it was the most accurate configuration for this purpose. From this we identified a novel chimera of the conserved cell wall spanning domains, SF3 and SF1 9,10 , that we have labelled SF5 (Fig. 3). We also identified two novel chimeric emm-enn events in the mga regulon. These events were in emm77.0 and emm39.4 type isolates. We defined the events seen in the emm77.0 and emm137.0 isolates as 'likewise' (Fig. 4b) and 'contrariwise' (Fig. 4a), respectively. Finally, we proposed an evolutionary pathway describing the disrupted emm of the E3-type Fiji strain 33087V1T1, from which we contend that this isolate represents an intermediate strain that suggests a mechanism for the time-dependent excision of genes in the mga regulon.
Application of the random forest algorithm and the response regulator allele types to infer the emm-type. We have demonstrated that using each of the three types of RF tested we were able to infer the GAS emm-type from the variation in the selected feature sets of 53 RR allele types with high accuracy. This is important because it represents an alternative to the emm-based systems whose accuracy is susceptible to the plasticity of the mga regulon. Additionally, it shows that RR-based typing is backwards compatible with the vast emm-centric GAS knowledge base. Table 4 collates the relative strengths and weaknesses of the emm-based and RR-based typing system. We were able to describe the feature sets (of RRs) that attained the highest accuracy of predicting emm-type (that is, optimal feature sets) for each of the RF algorithms ( Table 2). The highest overall accuracy was attained using the Guided RF with the following ten RRs: mga2, lrp, spy1934, spy0715 (gntR-like), rivR, M28_spy1337, spy1325, spy1602 (gntR-like), spy1817, and crgR. The optimal feature set for the RF algorithms were of different composition (that is, size and constituents).
It should be noted that mga2 and lrp were ranked most important in all three, suggesting a mathematical importance in inferring emm-type. Of the 53 RR allele types tested, mga2 had the highest ranking importance score when inferring emm-type for all three RF algorithms. The following are offered as reasons why mga2 ranked highest. Biologically, mga is an important response regulator which controls the expression of more than 10% of the GAS genome, and Mga is a large enough protein (62 kDa) to contain multiple functional domains 31 . Furthermore, mga is encoded proximally to emm, and regulates the transcription of emm. Mathematically, we also have previously measured that mga2 had the highest number of unique allele types of 35 of the 53 RRs tested in this dataset 19 . Thus, biologically and mathematically, it is predicable that the importance score of mga2 would rank highly when inferring the emm-type.
We also saw a threshold number of feature variables (RRs) below which the accuracy declined, but above which the accuracy showed a relative insusceptibility to the number of predictor variables (RRs) (Fig. 2). This is of different significance for both in silico and laboratory-based analyses. While the impost of testing all 53 RRs in silico is negligible, a reduction to ten variables (RRs) may represent a significant economic saving of resources Table 4. Comparison of properties of the emm-based and response regulator-based typing systems of group A Streptococcus. Preferred (bold) and non-preferred (italics) properties of a molecular bacterial typing system.

emm-based typing
Response regulator allele-based typing emm is a surface exposed protein The RRs are a family of cytosolic proteins that share broadly similar functional domains, including control of the expression of traditional GAS typing proteins

Known to be antigenic Not known to be antigenic
Under strong positive selection pressure from host immunity 16 Many proteins that are primarily under negative selection pressure 19

Single locus (or single point of failure) Multiple genomically-dispersed loci (redundancy)
Highly recombinogenic locus Many proteins with a range of recombinogenicity www.nature.com/scientificreports/ in the laboratory. Furthermore, reduction of predictor variables (RRs) in this in silico, WGS-based analysis was predicated on an increase in accuracy. Thus, in vitro testing must consider the trade-off of accuracy when decreasing the number of RRs tested on the basis of economics. Regardless, our findings represent a significant reduction in the search space and a logical starting test set for in vitro studies. We endeavoured to understand when our process inferred the wrong emm-type for the Guided RF. That is, the inferred and published emm-type were different. We were able to propose putative biological explanations for approximately half of examples of inaccurate inferences (Table 3). These included the following scenarios. Firstly, isolates of either emm105 or emm205 type that shared identical RR allele types but have differing emm-types were at risk of incorrect typing. Similarly, the variation in the RR alleles of the optimal feature set of the Guided RF could not discriminate emm79 from emm183. Secondly, as a by-product of the supervised learning methodology employed, singleton emm-types were also at risk of inaccurate inference if that isolate had not been included in the training genome set. Thirdly, the singletons may also include examples of emm-switching and chimeric emmenn events, although these occur relatively infrequently. Furthermore, inference of emm-type in the examples of emm-switching is problematic because the background RR allele types are likely to reflect the recipient isolate and not the donor of the recombined emm. Similarly, for chimeric emm-enn events, the background RR alleles will likely reflect that of the pre-event parental isolate which may have had a different emm-type. Together these anomalies had the unintended consequence of forming the basis of a preliminary method for identifying emmswitching and chimeric emm-enn events, which is a current unmet need in the GAS community.
Utility of the random forest in identifying chimeric cell wall-spanning domain and chimeric emm-enn events. We observed that the ST268 emm137.0 isolate (33181V1T1_01), that had been sampled from Fiji in 2006, was incorrectly inferred to be of emm-type 39 using the RRs and the Guided RF. Upon closer inspection we identified that 33181V1T1_01 possessed a novel cell wall-spanning domain, SF5, which is a chimerisation of two of the canonical cell wall-spanning domains (SF3 and SF1). SF5 was also observed in the ST268 emm39.4 isolate (31005V6S1), which had also been isolated from Fiji in 2006.
Additionally, emm137.0 of 33181V1T1_01 shared 100% nucleotide identity with the chimeric fusion of the 5′ end of emm and the 3′ end of enn of 31005V6S1. We noted that the CDC emm-subtyping sequence locus of the parental strain (31005V6S1) was deleted from the mutant strain (33181V1T1_01). We propose that this represents a novel chimeric emm-enn event ('contrariwise'), that is visually depicted in Fig. 4a. We also identified that the ST747 emm77.0 isolate (K21246), which had been sampled from Kenya in 2007, possessed SF3 at the 3′ end of emm, noting that SF3 is canonically encoded in enn. The ST747 emm77.0 isolate (K5797) had been sampled from Kenya in 2000. We observed that emm77.0 of K21246 shares 100% identity with the fusion of the 5′ end of emm and the 3′ end of enn of K5797, noting that the CDC emm-subtyping sequence locus of the parental strain (K5797) was retained in the mutant strain (K21246). We propose that this too represents a novel chimeric emm-enn event ('likewise'), that is visually depicted in Fig. 4b. The observed configurations of the conserved cell wall-spanning domains provided additional evidence to support our proposed evolutionary pathways.
We have chosen to define the first of the chimeric emm-enn events as 'contrariwise' (33181V1T1_01) and the second as 'likewise' (K21246) in preference to synonymous and non-synonymous to avoid confusion with established molecular biology nomenclature. Furthermore, typing of these mutants should be emm137 C (33181V1T1_01) and emm77 C (K21246) for consistency with the prevailing convention for chimeric emm and M-like genes 11,19,21 . It should be noted that historically contrariwise events may have been labelled as emmswitches. Where the evidence supports, we recommend that the term emm-switch is reserved for recombination events that have involved horizontal gene transfer of DNA containing emm. The recombinant DNA may also include DNA, that is adjacent to emm, other than that encoding emm. Recombination events have been identified in emm82 C isolates, from which it can be inferred that the isolates have undergone both an emm-switch and a 'likewise' chimeric emm-enn event 18,19 . Finally, findings in this part of the project confirmed the bioinformaticsrelated imperative of reading emm-subtyping sequences in the context of their location within the mga regulon, when typing WGS genomes.
Plasticity in the mga regulon of the E3 emm-cluster type isolates. We observed a high degree of disruption to the E3 mga regulon caused by mutation and recombination. Additionally, we proposed an evolutionary pathway describing the disrupted emm of the E3-type Fiji strain 33087V1T1. Given this disruption we would expect to see either one of two eventualities. Firstly (and unlikely), the disrupted emm is retained in the genome because it confers advantage and is selected in the population. Or secondly (and more likely given the pre-eminence of emm in host immune evasion), that further deleterious mutations are acquired and the locus is gradually excised. Therein, we contend that this isolate represents an intermediate strain, suggesting a mechanism for the time-dependent excision of genes in the mga regulon. This represents an important finding in the evolutionary history of E3 GAS. Our findings are also epidemiologically important, because E3-type isolates have recently been identified in outbreaks as the causative agent of severe GAS disease 18,[32][33][34][35] . Furthermore, given that emm is a major focus in GAS vaccine development, we consider that this work increases our understanding of rare and anomalous mga regulons. From a bioinformatics perspective, we contend also that it would not be prudent to assign an emm-subtype to the likes of 33087V1T1 (the non-typable intermediate strain evolved from emm58.0 20059V1I1) using the emm locus when a frameshift has disrupted and likely deleted the function of emm, making it susceptible to future excision. Overall, we assert that the plasticity of the E3 mga regulon represents a 'snap shot' of the real-time evolution of GAS and represents a recombination hotspot.
Application of the random forest algorithm and the response regulator allele types to infer other GAS genomic traits. A recent review by Allen  www.nature.com/scientificreports/ for the application of machine learning in microbiological genomics 36 . In this study, we were able to accurately predict the emm-subtype, the country from which the isolate was sampled, and the invasiveness of the isolate. Our work stands as template for predicting other untested GAS genomic traits using RF and RR allele types. Given the accuracy of the inference of invasiveness, we would suggest potential utility in an in vitro assay for predicting invasive GAS isolates. This process flow was ineffective at inferring the tissue from which the isolate was sampled, and the clinical outcomes. Given the complexity of the GAS-host immunity interaction, this is not an unexpected result. Furthermore, it inspires an exciting question. Can a higher degree of accuracy be achieved if the input data set is augmented with judiciously selected human gene allele types?

Conclusions
In this study we applied the RF algorithm to the variation in a set of GAS RR alleles and were able to infer the emm-type with high accuracy. The highest accuracy, 96.7%, was achieved using the Guided RF. We identified the optimal feature sets (of RRs) for three different RF classifiers, therein describing how many, and which alleles had greatest importance when inferring the emm-type with the greatest accuracy. We observed that each RF classifier had a threshold number of features below which the accuracy of inference dropped, but above which the accuracy was relatively insusceptible to the number of features. By examining the potential sources of inaccuracy, we discovered a novel mga regulon cell wall-spanning domain, SF5. We also proposed two novel evolutionary pathways of chimeric emm-enn events in the mga regulon in which the original emm-type was retained by one, but was changed in the other. We defined these as 'likewise' and 'contrariwise' events, respectively. We also proposed an evolutionary pathway that describes frameshift mutation-induced disruption to emm, which results in a strain that represents an intermediate step in the time-dependent excision of genes in the mga regulon. We were also able to usefully predict emm-subtype, the country from which the isolate was sampled, and the invasiveness of the isolate. However, we were unable to predict the tissue from which the isolate was sampled, or the clinical outcome from the infection. Thus, ML has allowed us to interpret the biology of GAS and propose new evolutionary models. Noting that the RF has been under-utilised in the GAS community to date, we propose that our process flow serves as a template for the prediction of other untested GAS genomic traits.

Methods
Rationale. This study was designed to test whether the nucleotide sequences of GAS RR genes (predictor features) can predict genomic traits (labels). The following summarises our justification for investigating RRs. Traditional studies of GAS virulence have primarily focused on the biofunction of GAS virulence genes (including exotoxins, DNases, proteases, surface-exposed adherence-related proteins, and other bioactive enzymes) to the widespread exclusion of the co-ordinate regulation of the initiation of transcription of the aforementioned virulence genes. GAS RRs regulate the transcription of the majority of GAS virulence genes (and often themselves because many are autoregulating). Furthermore, GAS RRs are the major regulators of the GAS gene expression profile in the growth phase, in lieu of sigma factors. Therein, we hypothesised that variation in DNA sequences of the RR genes might correlate with GAS genomic traits. Regardless, the workflow developed in this study should represent a viable template for the investigation of untested GAS virulence genes in the future.
The RF ML algorithm was chosen because it is robust, scalable, capable of processing large datasets with high dimensionality and heterogeneous feature (or variable) types, and provides high interpretability. The three RFs tested were the Ordinary, Regularized, and Guided 27,37 . The regularized RF introduces a penalty to the inclusion of a new feature during decision tree building. Thus, a regularized RF only adds new features if those new features provide substantial new predictive information. The guided RF uses the importance scores from an ordinary random forest to guide the feature selection. Furthermore, we intended to find the maximum accuracy with which we could infer the response variables (genomic traits), and investigate the susceptibility of inference accuracy to the reduction of predictor feature set size.
Input data. The input data was composed of 53 nucleotide-based GAS RR allele types and six genomic traits, extracted from 944 genomes (Fig. 8a). Of these allele types, 35 have been described previously 19 , and 18 were less well characterised or putative RRs which still displayed the characteristic helix-turn-helix DNA binding domain inferred by SMART domain 38 . The six genomic traits were the emm-type, emm-subtype, the human tissue (tissue) and country from which the GAS isolate was sampled, clinical outcomes observed from the isolate (clinical), and the propensity for the isolate to cause invasive disease (invasiveness). The genomic traits of tissue and clinical correspond to the metadata fields that are titled 'tissue/source' and 'clinical' as previously described 19,39 . 'Tissue/source' essentially represents the tissue sampled, 'clinical' is an assemblage of presentation and disease outcome. The input data is included in Supplementary Table S2. The SPY locus numbers refer to the SF370 isolate, unless otherwise denoted. To minimise possible overfitting, the genomes were randomly separated into the training (n = 629) and test (n = 315) sets. emm-types that had only one representative in the dataset (singletons) were included in the training set. Furthermore, prior to testing with the RF algorithms, we established that the dataset contained a pair (n = 2 of 125) of different emm-types that shared identical RR allele types (emm105 and emm205).

Process flow.
Step 1 of this study involved training the RF classifier using the input data of the RR allele types as training predictor features and the genomic traits as training labels (Fig. 8b). Output data generated from this step were the selected predictor feature set importance scores of the RF classifier. In RF feature selection, the redundant or irrelevant features were eliminated under the guidance of the feature importance scores 40 . Simplistically, the feature importance score is a measure of the importance of the attribute in inferring the cor- www.nature.com/scientificreports/ rect classification 41 . In step 2 the RF classifier was applied to the RR alleles types (test predictor features) to infer the classification of the six genomic traits (labels). The inferred genomic traits were then compared to the observed genomic traits, and an accuracy of inference was calculated in step 3. In step 4, susceptibility testing, the histogram of the normalised importance scores of the feature selection set was investigated (Fig. 9).
Step 4 was designed to select subsets of predictor features each containing the nth most important features with the intention of maximising the accuracy of inference. A threshold normalised importance score was arbitrarily set at a value close to 1, therein defining a subset of the selected predictor feature set that only contained the features with the highest importance scores. This was repeated for multiple arbitrarily decreasing threshold importance scores, creating subsets with increasingly more features. Steps 2 and 3 were then applied to each of these subsets and plots were generated to assess the susceptibility of the accuracy of inference to the number of predictor features (RR allele types). Steps 1-4 were repeated for the three different RF classifier algorithms.
Cross validation and multiclass classification performance evaluation for the optimal feature sets. We defined the optimal feature set as the set of response regulators from which the emm-type was inferred with the highest accuracy for each of the three random forest algorithms. Ten-fold cross validation with ten replicates was performed to measure the stability of these inferences using the 'caret' 42 and 'RRF' R-code packages 37 . As previously described, the random forest is an ensemble learning method, in which the classifier is derived from the voting results of multiple decision trees. Inferring emm-type from this dataset was a multiclass classification. In order to evaluate the performance of these multiclass classifications, we employed the 'pROC' R-code package 43 to estimate the multiclass area under the receiver operating characteristic curve (AUC) 44 . The multiclass.roc function of 'pROC' was applied to predictions determined using the 'vote' parameter in the predict function of the 'RRF' package. We also collated the F1, Precision, and Recall metrics. The R-code was implemented in version 1.1.447, and is included as Supplementary File S3.   Figure 9. Schematic representation of the normalised importance score plot for the selected predictor feature set. Subsets of features were selected based on arbitrary threshold normalised importance values. Steps 2 and 3 of the process flow were then applied to each of these subsets.