Prediction of rifampicin resistance beyond the RRDR using structure-based machine learning approaches

Rifampicin resistance is a major therapeutic challenge, particularly in tuberculosis, leprosy, P. aeruginosa and S. aureus infections, where it develops via missense mutations in gene rpoB. Previously we have highlighted that these mutations reduce protein affinities within the RNA polymerase complex, subsequently reducing nucleic acid affinity. Here, we have used these insights to develop a computational rifampicin resistance predictor capable of identifying resistant mutations even outside the well-defined rifampicin resistance determining region (RRDR), using clinical M. tuberculosis sequencing information. Our tool successfully identified up to 90.9% of M. tuberculosis rpoB variants correctly, with sensitivity of 92.2%, specificity of 83.6% and MCC of 0.69, outperforming the current gold-standard GeneXpert-MTB/RIF. We show our model can be translated to other clinically relevant organisms: M. leprae, P. aeruginosa and S. aureus, despite weak sequence identity. Our method was implemented as an interactive tool, SUSPECT-RIF (StrUctural Susceptibility PrEdiCTion for RIFampicin), freely available at https://biosig.unimelb.edu.au/suspect_rif/.

extra-pulmonary 12 TB. Systematic evaluations of resistance mutations within rpoB suggest, however, that performance is much lower clinically, with estimates that a third of resistant TB infections are missed 14 . In particular, different studies have identified resistance-associated mutations outside the RRDR in the Mycobacterium tuberculosis (Mtb) rpoB gene [15][16][17] , which are consequently misdiagnosed as susceptible by the gold standard test.
In leprosy, resistance to Rif was traditionally identified through the mouse footpad model, which takes months to culture and requires specialised personnel 3,18 . This might account for the lower incidence of Rif resistance reported in leprosy when compared to TB. Advancements in DNA sequencing and PCR have enabled for a prompt identification of resistance biomarkers within rpoB 3 . These molecular techniques have been used by the WHO Global Leprosy Programme, where they also focused on the RRDR in M. leprae to identify resistance mutations 3 . Results from these tests were at 100% sensitivity and 99% specificity in detecting resistance mutations 3 , but again only focused on a small region across the whole gene.
The broad distribution of mutations across the gene in Mtb might indicate alternative Rif resistance mechanisms beyond interference with drug binding. In our earlier work, we investigated the potential mechanisms of rifampicin resistant mutations across the whole Mtb gene and found that disruptions in protein-protein interactions, leading to destabilisation of the RNA polymerase complex and nucleic acid affinity 19 , are important contributing molecular factors. It is well established that mutations within the rpoC gene 20,21 compensate disruptive effects of resistance-causing rpoB mutations, which explains an overall normal transcriptional function despite a loss of intermolecular affinities in resistant bacteria. These intricate mechanisms at the protein complex level cannot be encompassed by the current molecular tools being used, which focus only on the RRDR sequence. Further to this, whole genome sequencing techniques, despite their efficiency, do not explain how resistance arises at the molecular level-limiting them to the characterisation of known mutations, without any predictive capability for novel variations.
To overcome these limitations, we have used a computational approach to further our understanding of the molecular mechanisms leading to resistance and to build a novel, web-based diagnostic tool, SUSPECT-RIF (StrUctural Susceptibility PrEdiCTion for RIFampicin), to accurately and pre-emptively identify resistance mutations. Our structure-based diagnostic tool follows previous, clinically successful approaches in predicting TB drug resistance [22][23][24][25] . Despite only being trained on information readily available for Mtb rpoB mutations, we show that our tool is also effective in identifying resistant mutations in infections caused by M. leprae, S. aureus and P. aeruginosa.
In this work, we have computationally measured effects of missense mutations in Mtb rpoB on protein stability [26][27][28] , dynamics 29,30 , and interactions with Rif 31 , other proteins 26,32 , nucleic acids 26,33 , and metal ions 34 . Machine learning was used on these measurements to train, test and validate a Rif-resistance classifier as a novel diagnostic predictor. Our final tool, SUSPECT-RIF, incorporates sequence-and structure-based features to model how missense mutations lead to resistance.

Results
The general methodology of this project is summarized in Fig. 1 and is divided into four main phases. The initial phase of this project sought to combine the current biological understanding of Rpob function and Rif resistance in the literature, with structural and mutational information. Notably, a thorough quality check of the crystallographic structure 35 and mutational information 15 was carried out to ensure that the final structure-based predictor was built on biologically accurate data (Fig. 1A). Our training set contained 203 resistant and 28 susceptible mutations obtained from the London School of Hygiene and Tropical Medicine 15 . An independent test set was curated from online databases [36][37][38] and contained 67 resistant and 21 susceptible mutations (Suppl. Figure 1). In the feature generation phase (Fig. 1B), different molecular effects of mutations were predicted using available tools. Additionally, features describing conservation [39][40][41][42][43][44] , the mutational local [45][46][47][48] and global environments 49,50 were also calculated. Here, features describing "local environments" accounted for protein properties at the mutation site prior to (e.g. residue depth) and after mutation (e.g. changes in protein stability [26][27][28] ). The "global environment" was calculated through graph-based signatures 49,50 , which capture the overall protein as a series of local "nodes" connected to the mutation site "node" across different distance patterns. Further information on the features used is detailed in Methods. Next, a qualitative analysis was carried out to better understand the underlying resistance mutation mechanisms at the protein level, by using local environment and affinity change measurements (Fig. 1C). During machine learning, all features were used to train and test different classification algorithms for comparison and evaluation. This process included optimisation strategies such as feature selection. The best performing algorithm was selected and validated through independent clinical tests ( Fig. 1D) prior to the final implementation phase. Here, we describe how this methodology was used to develop a predictive classifier for Rif resistance, demonstrating application across resistant mutations in four distinct organisms where Rif is used: M. tuberculosis [36][37][38]51,52 , M. leprae 53,54 , P. aeruginosa 55 and S. aureus 56 . Qualitative structural analysis. A detailed qualitative analysis on RpoB, limited to resistance mutations, has already been published as part of a larger study exploring three different drug targets in TB treatment 19 . Here, we expanded our analysis to include susceptible mutations, a more comprehensive set of Mtb Rif-resistant mutations, and mutations across all species tested. For M. tuberculosis mutations, in silico biophysical calculations were carried out on the experimental crystal structure of Mtb RNA polymerase complex (PDB id: 5UHC 35 ). As no experimental structure of the other three organisms was available, we modelled these structures through comparative homology modelling using the Mtb structure 35 as the template and used them for feature calculation. When considering Rif resistance mutations across all species tested, disruptions in protein-protein interactions 26 and subsequent reduction in nucleic acid affinity 26 were the most predominant mechanisms of resistance, followed by reductions in ligand affinity 31 and protein stability 27 (Fig. 2). This is in line with our previ-  Fig. 2B) follow similar trends, where the major effects were as a result of protein-protein and protein-nucleic acid affinity decrease. This is possibly because of higher sequence identity (SI = 95%) amongst the mycobacterial RpoB proteins. According to our analysis, between the two mycobacteria, effects on protein stability were more pronounced for Mtb (27% of Mtb mutations vs. 14% of M. leprae mutations, p-value = 0.13), while those on ligand affinity were significantly more distinct in M. leprae mutations (17% of Mtb mutations vs. 32% of M. leprae mutations, p-value = 0.05). P. aeruginosa (n = 18; SI = 55%; Fig. 2C) also follows the same trend, although from the non-ligand interactions, disruption of nucleic acid affinity (50% of P. aeruginosa mutations) is stronger than for protein-protein affinity (6% of P. aeruginosa mutations). For S. aureus, however, (n = 51; SI = 61%; Fig. 2D) the main mechanism of resistance is through disruption of ligand affinity, which might shed light on resistance properties associated with multi-drug resistance in MRSA populations. Other possible causes underlying differences between non-mycobacteria and mycobacteria may be explained biologically through horizontal gene transfer.

Feature analysis and engineering
Descriptive features were calculated to represent major biological functions and processes: protein interactions, local environment, graph-based signatures as measures of global environment, pharmacophore modelling, and conservation ( Fig. 1). A total of 298 features were calculated.
Prior to machine learning, we subjected our Mtb dataset to a Welch's t-test, in order to identify features that can distinguish between the two phenotype classes, resistant (n = 270) and susceptible (n = 49; Fig. 3 and Suppl. Figure 2). Among the most distinguishing features (given by a p-value of < 0.05) were changes in nucleic acid binding affinity 26 (p < 0.05), distance to nucleic acid (p < 0.0001) and Rif (p < 0.0001) and protein flexibility,

Scientific Reports
| (2020) 10:18120 | https://doi.org/10.1038/s41598-020-74648-y www.nature.com/scientificreports/ denoted as deformation energy 48 (p < 0.001). These features highlight the importance of considering structural effects when analysing, and consequently predicting, mutational phenotypes. Other highly distinguishing properties were sequence-based features PROVEAN Protein 41 (p < 0.0001) and SIFT 39,40 (p < 0.05), which measure the effect of mutations on protein function based on sequence alignment [39][40][41] and amino acid properties 39,40 , and the conservation feature "Rate of Evolution" given by ConSurf 43 (p < 0.0001). Considering the highly conserved rpoB gene being studied, these results show that resistant mutations are more likely to cluster at highly conserved regions (given by a lower evolutionary rate from ConSurf) than susceptible ones. This clustering at conserved, prominent sites within the gene are thought to be crucial for their gain-of-function survival mechanism in the presence of a drug. As part of model optimisation, features chosen through greedy feature selection were also statistically compared through a correlation matrix (Suppl. Figure 3) in order to remove any redundancies within features. This process removed the feature describing change in vibrational entropy as calculated by ENCoM 30 , as it correlated with the respectively calculated change in protein stability 30 .
Following removal of redundant features, bottom-up greedy feature selection was carried out, based on Matthews correlation coefficient (MCC). Our optimized model contained 41 features (Suppl. Table 1), which included representative features from the different classes considered including: graph-based signatures, stability effects, dynamics and flexibility measurements, pharmacophore changes and the changes in protein interactions with ligand rifampicin, nucleic acids and other proteins. Notably, while graph-based signatures provide a measure of global environment through different local environments at different distances from the mutation site, our feature selection only selected one of these features: Inter-HP:4.50, which accounts for hydrophobic and polar interactions within 4.50 Å of the mutational site. As only this feature from graph-based signatures was used, overall global effects are not represented in the final model, but rather, a physicochemical measure was added to the other local environment features.
The greedy feature selection approach used in this work is a heuristic approach, whereby the optimal combination of features is obtained in a stepwise, incremental manner. The final features obtained corresponded to our initial qualitative analysis results and are biologically relevant when considering overall interactions at the RNA polymerase cleft. To identify the contribution of each property feature to the final model, we trained a model under the same parameters on different subsets of features which likely work in synergy: conservation, intramolecular interactions, protein flexibility and dynamics, graph-based signature, ligand affinity, nucleic acid affinity, protein-protein interactions, physicochemical properties and presence in RRDR. MCC values depicting performance on the blind test for each subset model were compared (Suppl. Table 2

Model performance
Following greedy feature selection, our trained predictive classifier had comparable MCC between training (0.72) and blind test (0.71), with an AUC (Area Under Precision Recall Curve) of 0.99 and 0.89 respectively (Suppl. Figure 4A). When tested on all mutations in our training and blind Mtb datasets (n = 319), our initial model's performance in identifying resistant (sensitivity) and susceptible (specificity) Table 3). Most mutations (n = 29) in this dataset were located within the RRDR, which explains comparable performance between the two methods. The only variant misclassified by SUSPECT-RIF was I491F, which was deemed as a low confidence mutation in the Miotto et al. study, showing weak clinical resistance evidence. Structurally, this mutation lies at the interface close to the RpoC, and RNA binding (Suppl. Figure 5), where the introduction of the phenylalanine side chain may affect these inter-molecular interactions differently. Introduction of this larger side chain, however, doesn't seem to introduce major steric clashes. One possible reason behind this misclassification, which might explain the weak clinical evidence in the Miotto 51 study, is that this mutation is resistant only in specific lineages, which may not be appropriately captured in our model. We then compared the performance of SUSPECT-RIF on Rif resistant Mtb mutations considered in various genome sequencing techniques and databases (Sanger sequencing, CASTB 57 , KvarQ 58 , Mykrobe 59 , PhyResSE 60 , and TBProfiler 61 ) curated by Schleusener et al., 2017 52 . For a small dataset (n = 7), detected by Sanger sequencing, our model successfully identified all mutations as resistant. We also carried out an analysis of a larger curated dataset 52 which combined well-characterized mutations from four listed genomic tools 58

Translation to leprosy.
To test the applicability to other clinically relevant mycobacteria, SUSPECT-RIF was subjected to 42 clinical M. leprae mutations curated from the literature 53 , which included a study focusing on a natural leprosy reservoir-the Prata village in the Brazilian Amazon 54 . All resistant mutations were successfully identified (100% Sensitivity) while the one susceptible mutation within the dataset was misclassified as resistant (0% Specificity). The significantly higher performance of SUSPECT-RIF when compared to RRDRbased molecular tests (similar to GeneXpert-MTB/RIF for Mtb) on our dataset proves that the clinical utility of SUSPECT-RIF, although trained on Mtb data, is not limited to tuberculosis infections (Fig. 4B-D; Suppl. Figure 4B; Suppl. Table 3).
Translation to other infections. Next, we subjected SUSPECT-RIF to resistance mutations identified in two other, clinically diverse organisms where Rif is reserved as a last line treatment: P. aeruginosa (n = 18) 55 which is Gram-negative (SI: 55%) and S. aureus (n = 51) 56 which is Gram-positive (SI: 61%). Despite the rela-

SUSPECT-RIF: Web server
SUSPECT-RIF has been implemented as an interactive web-server, which is freely available to the clinical and research community at: https ://biosi g.unime lb.edu.au/suspe ct_rif/. Our website allows prediction and visualization of missense mutation phenotype, within M. tuberculosis, M. leprae, P. aeruginosa and S. aureus. User input requires the missense mutation, given as an XnY code where X is the one-letter-code wildtype residue, n is the residue position according to organism structure numbering and Y is the one-letter-code mutant residue. We have included the FASTA sequences of the different organisms displayed, as these sequences refer to the numberings used in our protein structures. The user input can either be carried out as a single mutation or as a mutation list, with the different outputs shown in Suppl. Figure 6. Single mutational analysis exhibits the phenotype as classified by our model, mutational information regarding local environment, as well as an interactive molecular viewer built using NGL Viewer 62 . Users can compare changes of interatomic interactions across wild-type and mutant structures in the 3D viewer and download it as a PyMOL session. The results from the mutational list input are exhibited as a summarized list of phenotype and local environment values, with an interactive viewer showing all mutations mapped across the RpoB protein. Links are also available for every mutation within the list to be evaluated and visualized separately.

Discussion
Our earlier work in analysing resistance mutations in TB has shed light towards the complex phenomena underlying resistance to Rif 19 , mainly because of the different interactions occurring at the transcription cleft. Predominantly, Rif resistance mutations in TB disrupt affinities between the target RpoB and other RNA polymerase subunits, as well as nucleic acids within the cleft. When the same methodology was applied across another mycobacterium (M. leprae) and a Gram-negative rod (P. aeruginosa), similar mechanistic effects of resistance were delineated. Within the context of the RNA polymerase complex, predominant disruptions at the interface affecting protein-protein and protein-nucleic acid interactions may lead to a lower steric effect imparted by Rif at the structural level. This effect might be overcome in vivo through compensatory mutations, which have already been reported for Mtb within gene rpoC 20,21 . At the protein level, such mutations can strengthen the interactions between RpoB, RpoC, and nucleic acids to retain the transcription functioning of the RNA polymerase complex and circumvent the steric and consequential nucleic acid affinity-reducing effect of Rif. As for our analysis on the Gram-positive coccus S. aureus, the main mechanistic driver of Rif resistance is a loss of ligand affinity. This is thought to be primarily because most mutations within our dataset occur within 10 Å of Rif binding (with 70.6% occurring within the RRDR). This overall mechanistic pattern may also explain the slightly lower performance of SUSPECT-RIF on S. aureus mutations, where features describing protein-nucleic acid affinity are more highly accounted for than those for ligand affinity. Finally, no primary mechanism could be seen upon analysis of susceptible mutations (available for M. tuberculosis and M. leprae). This gave us a solid understanding to base machine learning principles in building a diagnostic resistance tool.
Here, we introduce our diagnostic tool, SUSPECT-RIF, which does not assume resistance on mutational gene location, but accounts for the differences in underlying molecular mechanisms imparted by resistance mutations compared to susceptible ones. Our tool is built on features describing local mutational environment, interactions, flexibility and conservational effects. We have shown that this combination of features achieves high performance in detecting resistance mutations in an independent clinical M. tuberculosis dataset 51 with a comparable RRDR performance to the current gold standard, but also beyond this region. This is especially important considering that resistant mutations outside this region are currently being missed. Overall, SUSPECT-RIF can correctly classify 90.9% of all Mtb mutations within our dataset. We tested the robustness of SUSPECT-RIF in detecting well-characterized Mtb mutations from whole genome databases 52 where it outperformed all databases 58-61 tested, with an accuracy of 99.4%. When considering protein structural effects, where only coding mutations can be analysed, these metrics are also comparable to a whole genome sequencing-based predictor (accuracy: 95.1%) 63 , which is built on larger datasets. This latter tool possibly also includes variation within non-coding regions (raw data not available for comparison) 63 which cannot be assessed through our method. Apart from comparable accuracy in Mtb, SUSPECT-RIF has proven effective in identifying resistance mutations in RpoB across another mycobacterium (M. leprae), but also in Gram-positive (S. aureus) and Gram-negative (P. aeruginosa) organisms where Rif is clinically used. Notably, its robustness in detecting Rif resistance in leprosy makes it the first, genetic-based test for resistance in M. leprae. This, given the WHO Global Leprosy Strategy 64 is crucial in early detection of Rif resistance and appropriate patient therapy-as it improves rates of survival, as well as minimizes further resistance development. Further to that, its clinical applicability in non-mycobacteria has been achieved in spite of low sequence identity with Mtb, making SUSPECT-RIF the first structure-based diagnostic tool which consistently performs across bacterial species. This highlights the power and broad applicability of this approach for predicting resistance from both clinical 65,66 and drug development perspectives [67][68][69][70] .
SUSPECT-RIF relies on an interplay of different features (Suppl. Table 1) describing changes in interaction affinities of RpoB with other proteins, nucleic acids and ligand Rif, as well as changes in stability, flexibility (deformation and fluctuation) and residue level changes in interactions. When analysed on their own, some features offer negligible contribution to the model. However, these were retained as they are thought to work in synergy with other molecular properties. One feature which was not tested in our model was bacterial lineage associated with each mutation. During data curation, different sources of data were used, where lineage information was not always present. Because of this, it is thought that our misclassified mutations were lineage-specific, where the same mutation in different lineages may not always be resistant. To address this, we segregated our datasets according to source, where we have trained our model on lineage representative LSHTM 15 mutations, and tested with non-redundant, literature-identified datasets. This was our approach in obtaining a general lineage classifier, given the limitation of lineage information. Another feature which was not tested was the Minimum Inhibitory Scientific Reports | (2020) 10:18120 | https://doi.org/10.1038/s41598-020-74648-y www.nature.com/scientificreports/ Concentration (MIC), which quantifies the degree of resistance conferred by each mutation. As with lineage information, MIC values were not consistently available for all the mutations used in model development, which may also explain some misclassifications by our model. Finally, although rpoB mutations have been shown to have epistatic effects both within RNA polymerase (rpoC 20,21 ) and with another TB-drug target gyrA 71,72 , which binds TB second-line drug Ofloxacin, we did not have enough data across our whole mutation list to be able to account for epistasis within our model. When considering overall model performance, our model was consistently more robust in predicting resistance mutations (higher sensitivity) than susceptible mutations (specificity). Although the identification of susceptible mutations in the clinic enables the confirmation of safe Rif use, it is more important for our classifier to identify resistance mutations as efficiently as possible. Clinically, this high sensitivity in identifying resistance mutations implies that patients infected with a resistant strain are not given Rif unnecessarily, reducing the chance of related toxicities to the patient, quicker access to more effective treatment, and unnecessary costs to the healthcare system.
Our tool, SUSPECT-RIF is freely-available on an interactive website (https ://biosi g.unime lb.edu.au/suspe ct_rif), only requiring a list of missense mutations to be analysed by the clinician or researcher. Quick identification of rifampicin resistance, especially in TB and leprosy where it is part of the backbone regimen is crucial for effective treatment, with subsequent avoidance of unnecessary drug toxicities and further resistance development. Resistance-identification using SUSPECT-RIF only depends on the time, cost and availability of genome sequencing techniques, and an internet connection. Further to this, our tool has also been successful in the non-mycobacterial pathogens S. aureus and P. aeruginosa, where Rif is reserved for MDR cases. Our tool shows robust performance in both organisms, making it translatable to both Gram-positive and Gram-negative clinical infections where Rif is used in poly-resistance. This applicability across diverse, clinically relevant organisms demonstrates the potential significance SUSPECT-RIF has in Rif stewardship efforts. Further to this, the robustness with which SUSPECT-RIF classifies Rif resistant and susceptible mutations provides a good basis for the development of a universal rifampicin resistance predictor.

M. tuberculosis mutational dataset. A dataset containing both resistant and susceptible mutations
within gene rpoB was curated from different sources. For the training set, we obtained mutational information from the London School of Hygiene and Tropical Medicine in-house database. This included resistance mutations (n = 203) identified through a genome-wide association study carried out on 6,697 clinical isolates 15 and (n = 28) susceptible mutations. A non-redundant test set was manually curated from online databases TBDreamDB 36 , tbvar 37 and GMTV 38 . Resistant mutations (n = 67) were obtained from all three sources while susceptible mutations were only available from the GMTV database (n = 21). Notably, mutations were clustered at, but not restricted to the RRDR, but also spread throughout the gene and protein structure (Suppl. Figure 1B).
Translation mutational datasets. The ability of our Mtb-trained classifier to correctly predict resistance mutations was also tested in other organisms, including M. leprae, S. aureus and P. aeruginosa, in order to assess its translational capabilities. Resistance mutations for these three organisms were obtained from the literature, where sources ranged from clinical isolates in M. leprae (n = 42) 53,54 , to genome sequencing of resistant colonies in S. aureus (n = 51) 56 and an analysis on background variation epistasis on fitness cost in P. aeruginosa (n = 18) 55 . Notably, all three mutational datasets were comprised of resistance mutations, except for the M. leprae dataset which contained 41 resistant 53,54 and one susceptible mutation 54 . M. tuberculosis Protein structure. The experimental crystallographic structure of the M. tuberculosis RNA Polymerase complex was obtained from the RCSB database under the PDB id: 5UHC 35 . This complex is comprised of 6 chains denoting subunits α1 (rpoA), α2 (rpoA), β (rpoB), β'(rpoC), SigA (rpoD) and ω (rpoZ). Rif is bound at subunit β (rpoB), while the transcribed DNA is present at the cleft between subunits β (rpoB) and β'(rpoC). With a crystal structure resolution of 3.796 Å and R free value of 0.267, the structure was considered of adequate quality as a base for our clinical classifier. Resistance to Rif is brought about by missense mutations localizing on the gene rpoB, which codes for the β-subunit. Prior to feature generation, the structure was checked for missing atoms and residues using the Protein Preparation wizard in Prime (Schrodinger suite). A total of 23 missing atoms were inserted across the full Mtb RNA polymerase structure, 8 of which were present in the β-subunit (rpoB). During the subsequent feature generation stage, all calculations were carried out on the β-subunit to encompass local wild type environment and its respective changes upon introduction of mutations.
Homology modelling of M. leprae, S. aureus and P. aeruginosa RNA Polymerase. The RNA Polymerase complexes of M. leprae, S. aureus and P. aeruginosa had not been experimentally determined at the time of study. To accurately test translation of our structure-based classifier, homology models of the whole complex were built on PDB id: 5UHC 35 as template, using the Advanced Homology Modelling Wizard within Maestro (Schrodinger suite). Due to lower sequence identities with M. tuberculosis subunits within S. aureus and P. aeruginosa, initial sequence alignment was carried out by comparing results from MAFFT-DASH 73 , T-COFFEE 74 and Clustal-W 75 (embedded within Maestro). Alignments were manually curated to optimise sequence identity and gap penalties. Initial models were analysed through MolProbity 76,77 and the embedded protein analysis tools in Maestro. Loop refinement and minimization within Maestro were used as necessary to minimize the number of clashes within different RNA polymerase subunits.

Scientific Reports
| (2020) 10:18120 | https://doi.org/10.1038/s41598-020-74648-y www.nature.com/scientificreports/ Feature generation. All features were calculated on the wild-type RNA polymerase β-subunit of the four different organisms. For specific features where the mutant structure was required (Arpeggio), this was generated using MODELLER 78 . A total of 298 features were tested to account for the major biological changes brought about by missense mutations. These were subdivided into five feature classes: 1. Graph-based signatures 49,50 : were calculated to represent the wildtype protein structure, capturing both topology and physicochemical properties of the protein by modelling mutated sites as nodes which are connected by edges at different interatomic distance patterns 2. Local environment: Descriptors of wild-type protein secondary structure prediction (SST 46  Qualitative and statistical analysis. A qualitative analysis was performed on predictions for protomer stability (DUET), ligand affinity, nucleic acid affinity and RNA polymerase complex stability (mCSM-PPI) on all the mutations within the different datasets, in a manner previously described 19 . Briefly, these measurements were compared for every mutation within the dataset and the predominant mechanisms being affected was assigned to each mutation based on the extent of destabilizing effect. Effects were prioritized in order of size: mCSM-lig, mCSM-NA, mCSM-PPI and mCSM-Stability. This was done in order to adequately account for all types of protein-interactions, irrespective of interacting partner size. To statistically identify features which distinguish between the two phenotypes (resistant and susceptible) we also carried out a two-sided Welch sample t-test on the Mtb dataset mutations, using a cut-off p-value of < 0.05, using the ggsignif package in Rstudio. All remaining comparisons between different proportions, such as the comparison of performances between SUSPECT-RIF and GeneXpert-MTB/RIF, and comparisons of mechanistic effects of resistance between different organisms were carried out using a two-sided ztest with continuity correction, through the prop.test function in Rstudio, with a 0.95 confidence level.

Machine learning.
Machine learning was carried out using the sci-kit learn package on representative classification algorithms: Linear Classifiers (Gaussian, Multinomial and Complement Naïve Bayes, Stochastic Gradient Descent), Decision Tree, Nearest Neighbours (KNN), Support Vector Machines (SVM) and Ensemble Classifiers (Random Forest, Extra Trees, AdaBoost and GradientBoosting). To counteract the imbalance between resistant and susceptible mutations within the training dataset, different levels of oversampling (n OS = 0 to n OS = 6) were tested for each algorithm at each stage of the training and optimisation process. Each trained model was subjected to a non-redundant blind test described above. The different graph-based signatures (based on different distance patterns) and substitution matrices generated were initially tested separately to identify any significant performance in the resulting classifier. The best graph-based signature and matrix were subsequently added to the other features for further iteration. All models were analysed and prioritized according to consistency in Matthew's Correlation Coefficient (MCC) results between training set and test set. The confusion matrix of each model in predicting the phenotype of the independent test set values was also considered, where the number of falsely-predicted values was as low as possible -thereby optimising final model sensitivity, specificity and accuracy at every stage of the process. The nearest neighbour (k = 5) algorithm was consistently the best classifier at all levels of oversampling and was chosen for further parameter optimisation (k = 1, 3, 5, 7, 9, 11, 13, and 15), of which, the model with k = 3 at one level of oversampling consistently gave the best performance and was chosen as the algorithm for SUSPECT-RIF.
Feature selection. As a final optimization, all the features (n = 298) generated using the methods previously described were subjected to a bottom-up greedy feature selection process using sci-kit learn. Prior to this, a manual removal of statistically redundant features was carried out, as this would introduce noise to the prediction of novel mutations. Greedy feature selection initially trains and tests all the features individually. The best feature is retained, and combined with the remaining features individually, to identify the feature which reaches the best performance at each iteration. This process continues until all features are included. Model performance was ranked according to MCC where the best model was again chosen based on MCC consistency between train and blind test, to eliminate the risk of model overfitting leading to biased predictions towards resistance (larger dataset).