Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN

Kuang, Xingyan; Wang, Fan; Hernandez, Kyle M.; Zhang, Zhenyu; Grossman, Robert L.

doi:10.1038/s41598-022-06449-4

Published: 14 February 2022

Xingyan Kuang¹, Fan Wang¹, Kyle M. Hernandez¹,², Zhenyu Zhang¹ & Robert L. Grossman¹,²

Scientific Reports volume 12, Article number: 2427 (2022)

Abstract

Effective and timely antibiotic treatment depends on accurate and rapid in silico antimicrobial resistance (AMR) predictions. Existing statistical rule-based Mycobacterium tuberculosis (MTB) drug resistance prediction methods using bacterial genomic sequencing data often achieve varying results: high accuracy on some antibiotics but relatively low accuracy on others. Traditional machine learning (ML) approaches have been applied to classify drug resistance for MTB and have shown more stable performance. However, no study has applied a deep learning architecture such as a convolutional neural network (CNN) to a large and diverse cohort of MTB samples for AMR prediction. We developed 24 binary classifiers of MTB drug resistance status across eight anti-MTB drugs and three ML algorithms (logistic regression, random forest and 1D CNN) using a training dataset of 10,575 MTB isolates collected from 16 countries across six continents, where an extended pan-genome reference was used for detecting genetic features. Our 1D CNN architecture was designed to integrate both sequential and non-sequential features. In terms of F1-scores, 1D CNN models are our best classifiers and are also more accurate and stable than the state-of-the-art rule-based tool Mykrobe predictor (81.1 to 93.8%, 93.7 to 96.2%, 93.1 to 94.8%, 95.9 to 97.2% and 97.1 to 98.2% for ethambutol, rifampicin, pyrazinamide, isoniazid and ofloxacin, respectively). We applied filter-based feature selection to find AMR-relevant features. All selected variant features are AMR-related variants in the CARD database, and 78.8% of them are also in the catalogue of MTB mutations recently identified as drug resistance-associated by WHO. To facilitate ML model development for AMR prediction, we packaged every step into an automated pipeline and shared the source code at https://github.com/KuangXY3/MTB-AMR-classification-CNN.

Introduction

Antimicrobial resistance (AMR) is recognized as one of the greatest concerns for public health globally¹. Previous work estimated that deaths attributable to antimicrobial resistance might rise from the current estimate of 700,000 per year to ten million annually by 2050². The prevalence of antibiotic-resistant bacterial strains has dramatically reduced the efficacy of antibiotic treatment³, creating an urgent need for antimicrobial susceptibility testing to guide the antibiotic treatment of serious bacterial infections. Conventional culture-based methods have limitations, including extended turnaround time for slow-growing bacteria such as Mycobacterium tuberculosis (MTB) and bias due to potential contamination. Tuberculosis (TB), the disease caused by MTB, remains the world’s most deadly infectious disease, with an estimated 1.5 million deaths in 2019⁴. The currently recommended treatment for drug-susceptible TB disease is a 6-month course of four first-line drugs: isoniazid (INH), rifampicin (RIF), ethambutol (EMB) and pyrazinamide (PZA)⁵. As resistance to first-line drugs has become more prevalent, second-line drugs were developed to treat first-line drug-resistant TB disease, which requires a course of second-line drugs lasting at least nine months and up to 20 months⁴. The emergence of drug-resistant TB continues to threaten global TB control efforts. The World Health Organization reported that in 2019 nearly half a million people worldwide developed rifampicin-resistant TB (RR-TB), of whom 78% had multidrug-resistant TB (MDR-TB)⁴. Because culture-based diagnostic tests are usually time-consuming, there is an urgent need to rapidly identify drug sensitivity profiles of TB.

To overcome these restrictions and identify antibiotic resistance more efficiently, researchers use conventional association rule methods to predict antimicrobial resistance⁶. These methods are based on the identification of variants associated with AMR from whole genome sequencing (WGS) data. The WGS data from clinical strains has been curated in dedicated databases including the Comprehensive Antibiotic Resistance Database (CARD)⁷ and the Pathosystems Resource Integration Center (PATRIC)⁸.

Traditional machine learning (ML) algorithms, e.g., support vector machine (SVM), logistic regression (LR) and random forest (RF), have been compared with variant-based association rules for AMR prediction using WGS data of pathogen isolates in recent years^9,10. Yang et al. developed and compared different traditional ML methods using a cohort of 1839 UK MTB isolates for the prediction of resistance to eight anti-TB drugs. Kouchaki et al. trained their models on a dataset of over 13,402 isolates for more stable prediction on seen and unseen samples¹⁰. Three basic ML classifiers based on the feature space after dimension reduction and three ensemble learning methods were considered on this dataset. Another study, by Zhang et al., investigated a deep learning strategy by applying a 2D Convolutional Neural Network (CNN) to whole-genome sequencing data of 149 MTB isolates for resistance classification on a less studied drug, PZA¹¹. Variants were called by aligning reads to a single reference genome, H37Rv. Although ML, including deep learning, has been applied to the prediction of AMR, most studies used a limited number of isolates collected from a specific area, and all of them used a single-strain reference when detecting variants instead of a pan-genome reference^12,13, which could result in poor mapping and variant calling quality for new strains. The use of a pan-genome reference can decrease errors in the mapping and variant detection process, especially for more diverged strains.

Here, we present our study of MTB drug resistance classification using traditional ML methods (LR and RF) and a 1D CNN deep neural network architecture on a large and diverse dataset of MTB isolates. To compare the performance of our ML classifiers with the state-of-the-art statistical modeling method Mykrobe predictor, we evaluated the accuracy of Mykrobe predictor on the same dataset¹⁴. Mykrobe predictor uses a De Bruijn graph representation of bacterial diversity to identify species and resistance profiles of clinical isolates for Staphylococcus aureus and Mycobacterium tuberculosis. We used a dataset of 10,575 MTB isolates¹⁵, which is imbalanced, with more susceptible isolates than resistant ones for all four first-line drugs mentioned above and four second-line drugs: amikacin (AMK), capreomycin (CM), kanamycin (KM) and ofloxacin (OFX). To reduce computation, we first performed feature selection to reduce the dimensions of the input data and then applied a multi-input 1D CNN. Instead of using a single-strain reference, we used all references from the CARD database¹⁶, including references from other bacteria, to build reference clusters that serve as a pan-genome reference. Sequencing reads were then aligned to these reference clusters for variant detection. The results showed that our best ML classifiers outperformed the state-of-the-art rule-based method Mykrobe predictor, especially for EMB resistance, and showed more stable accuracy across all four first-line drugs. Although our basic 1D CNN architecture did not significantly outperform our traditional ML methods LR and RF, there are potential ways to optimize it in the future, e.g., hyperparameter tuning.

Methods

Data collection

To prepare the training data and labels, we downloaded the whole-genome sequencing (WGS) data for 10,575 MTB isolates from the sequence read archive (SRA) database¹⁷ and obtained the corresponding lineage and phenotypic drug susceptibility test (DST) data from the CRyPTIC Consortium and the 100,000 Genomes Project as an Excel file, which is also available in the supplementary material of their publication¹⁵. The phenotypic DST results for the drugs were used as labels when training and evaluating our ML models. All the data were collected and shared by the CRyPTIC Consortium and the 100,000 Genomes Project¹⁵. Like the datasets used by previous studies, this dataset is imbalanced: most isolates are susceptible and only a minority are resistant for each of the four first-line drugs (Fig. 1) and four second-line drugs. The numbers of isolate samples with phenotypic DST results available are 7138, 7137, 6347 and 7081 for EMB, INH, PZA and RIF, respectively. There are 6291 shared isolates among the four sample sets. In addition, 6820 of the 10,575 isolates have phenotypic DST results available for each of the four second-line drugs.

Genetic feature extraction

To detect the potential genetic features that could contribute to MTB drug resistance classification, we used a command-line tool called ARIBA¹⁸. ARIBA is a very rapid, flexible and accurate AMR genotyping tool that generates detailed and customizable outputs from which we extracted genetic features. First, we downloaded all reference data from CARD, which included not only references from different MTB strains but also from other bacteria (e.g., Staphylococcus aureus). Secondly, we clustered reference sequences based on their similarity. Then we used this collection of reference clusters as our pan-genome reference and aligned read pairs of an isolate to them. For each cluster that had reads mapped, we ran local assemblies, found the closest reference, and identified variants. After running these steps, ARIBA generated files including a summary file for alignment quality, a report file containing information of detected variants and AMR-associated genes, and a read depth file. For each cluster, the read depth file provides counts of the four DNA bases on each locus of the closest reference where reads were mapped.

Next, we filtered out low-quality mappings that did not pass the ‘match’ criteria defined in ARIBA’s GitHub wiki¹⁸. From these high-quality mappings, we collected 263 genetic features: novel variants in coding regions, well-studied resistance-causing variants and AMR-associated gene presences that were detected in at least one of the 10,575 isolates. In addition, we added indicator variables for each of the 19 lineages to our feature vector, resulting in a total of 282 features.
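The assembly of the 282-dimensional feature vector can be illustrated with a minimal sketch. It assumes that each of the 263 genetic features is encoded as a binary presence flag and that the lineage is given as an integer index; the function name `build_feature_vector` and this encoding are illustrative assumptions, not details taken from the pipeline.

```python
import numpy as np

N_GENETIC = 263   # variant and gene-presence features reported in the text
N_LINEAGES = 19   # lineage indicator variables reported in the text


def build_feature_vector(genetic_flags, lineage_index):
    """Concatenate genetic features with a one-hot lineage encoding.

    Assumption: genetic features are 0/1 presence flags and the lineage
    is an integer in [0, N_LINEAGES).
    """
    genetic = np.asarray(genetic_flags, dtype=np.float32)
    if genetic.shape != (N_GENETIC,):
        raise ValueError(f"expected {N_GENETIC} genetic features")
    if not 0 <= lineage_index < N_LINEAGES:
        raise ValueError("lineage index out of range")
    lineage = np.zeros(N_LINEAGES, dtype=np.float32)
    lineage[lineage_index] = 1.0          # one indicator variable per lineage
    return np.concatenate([genetic, lineage])


# Hypothetical isolate with no detected features, belonging to lineage 3.
example = build_feature_vector(np.zeros(N_GENETIC), lineage_index=3)
```

The resulting vector has 263 + 19 = 282 entries, matching the total given above.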

Traditional ML methods

We applied two traditional ML algorithms, RF and LR, on the sample sets labeled with phenotypic DST results (see “Data collection” section) to train MTB AMR classifiers for the eight drugs (first-line and second-line), where the feature vector for each sample consists of the 282 features mentioned in “Genetic feature extraction” section.

RF is an ensemble method made up of tens or hundreds of estimators (decision trees) that are combined to reduce overfitting^19,20. The final prediction is an average or majority vote of the predictions of all trees. RF is often used when there are large training datasets and a large number of input features. Moreover, RF can handle imbalanced data by using class weighting. Here we trained each RF classifier with 1000 estimators.
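A random forest of this size can be sketched with scikit-learn as follows. Only the number of estimators (1000) comes from the text; the use of `class_weight="balanced"`, the random seed and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 300 isolates, 20 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
# Made-up rule: "resistant" only when features 0 and 1 are both present,
# which yields an imbalanced label (roughly one quarter positive).
y = (X[:, 0] * X[:, 1]).astype(int)

rf = RandomForestClassifier(
    n_estimators=1000,        # number of trees stated in the text
    class_weight="balanced",  # assumption: reweight classes by inverse frequency
    random_state=0,
)
rf.fit(X, y)

# Probability of the resistant class, averaged over all trees.
proba_resistant = rf.predict_proba(X)[:, 1]
```

Class weighting is one of several ways to handle imbalance; resampling the minority class would be an alternative under the same setup.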

LR is a popular regression technique for modeling a binary dependent variable²¹. By passing a linear combination of the features through the sigmoid function (the inverse of the logit), LR constrains its output to the range [0, 1] so that it can be read as a probability. The LR model is then fitted using maximum likelihood estimation. During training, we applied L1 regularization to the LR models for feature selection and to prevent overfitting²².
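The following sketch shows L1-regularized logistic regression in scikit-learn and checks that the predicted probability is the sigmoid of the linear score. The regularization strength `C=0.1`, the `liblinear` solver and the synthetic data are assumptions chosen for illustration, not settings reported in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Map a linear score to a probability in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))


# Synthetic stand-in data: only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 30)).astype(float)
y = X[:, 0].astype(int)

# The L1 penalty drives coefficients of uninformative features to exactly zero.
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lr.fit(X, y)

linear_score = lr.decision_function(X)   # w . x + b
proba = lr.predict_proba(X)[:, 1]        # equals sigmoid(linear_score)
n_kept = np.count_nonzero(lr.coef_)      # features surviving the L1 penalty
```

Inspecting which coefficients remain non-zero is what makes the L1 penalty usable as a feature selector.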

Feature selection and 1D CNN models

CNN is a class of deep neural networks that takes multi-dimensional data as input²³. The term CNN generally refers to a 2-dimensional CNN, which is often used for image classification. However, two other types of CNN are used in practice: 1-dimensional and 3-dimensional CNNs. A 1D convolution (Conv1D) is generally used for time-series data: the kernel moves along one dimension, and the input and output data are 2-dimensional. Conv2D and Conv3D kernels move along two and three dimensions, respectively.
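To make the 1D case concrete, the sketch below implements a single Conv1D operation in NumPy (no padding, stride 1, no bias or activation). The kernel slides along the position axis only, while both input and output are 2-dimensional arrays. The function name and the shapes used in the example are illustrative assumptions.

```python
import numpy as np


def conv1d_valid(x, kernels):
    """1D convolution as used in deep learning (cross-correlation).

    x:       (length, in_channels) input, e.g. positions x DNA bases
    kernels: (width, in_channels, n_filters)
    returns: (length - width + 1, n_filters)
    """
    length, in_channels = x.shape
    width, k_channels, n_filters = kernels.shape
    if in_channels != k_channels:
        raise ValueError("channel mismatch between input and kernels")
    out_len = length - width + 1
    out = np.empty((out_len, n_filters))
    for pos in range(out_len):            # the kernel moves along ONE axis
        window = x[pos:pos + width]       # shape (width, in_channels)
        # Sum over window positions and channels for every filter at once.
        out[pos] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out


# Example: a 21-position, 4-channel input and eight filters of width 3.
x = np.ones((21, 4))
kernels = np.ones((3, 4, 8))
y = conv1d_valid(x, kernels)
```

With all-ones inputs, every output entry is the number of elements under the kernel (3 positions times 4 channels).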

Because deep learning algorithms require substantial computational power, we performed feature selection to keep only relevant features as input for deep learning algorithms. First, we randomly selected 80 percent of samples to calculate the importance of each feature by using the scikit-learn RF feature importance function, which averages the impurity decrease from each feature across the trees to determine the final importance of each variable²⁴. Then, we tuned the feature importance cutoff to find the one that maximizes the F1-score of an RF model trained on the remaining 20 percent of samples. For each of the eight drugs, features were selected when their feature importance scores were greater than the optimal cutoff. The tuning processes for first-line drugs are visualized in Fig. 2.
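The selection procedure can be sketched as below. The 80/20 split, the impurity-based RF importance and the F1-maximizing cutoff follow the description above; how the F1-score is estimated on the 20 percent split is not specified, so this sketch assumes 3-fold cross-validated predictions within that split. The stratified split, the tree counts, the candidate cutoffs and the synthetic data are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split


def select_features(X, y, cutoffs, seed=0):
    """Return (selected feature indices, best cutoff, best F1)."""
    # 80% of samples rank the features, the remaining 20% tune the cutoff.
    X_rank, X_tune, y_rank, y_tune = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=seed)
    ranker = RandomForestClassifier(n_estimators=200, random_state=seed)
    importances = ranker.fit(X_rank, y_rank).feature_importances_

    best_f1, best_cutoff, best_keep = -1.0, None, np.array([], dtype=int)
    for cutoff in cutoffs:
        keep = np.flatnonzero(importances > cutoff)
        if keep.size == 0:
            continue  # cutoff too strict, nothing left to train on
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        # Assumption: F1 is estimated by cross-validation inside the 20% split.
        pred = cross_val_predict(model, X_tune[:, keep], y_tune, cv=3)
        score = f1_score(y_tune, pred)
        if score > best_f1:
            best_f1, best_cutoff, best_keep = score, cutoff, keep
    return best_keep, best_cutoff, best_f1


# Synthetic stand-in data: the label depends on features 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 15)).astype(float)
y = (X[:, 0] * X[:, 1]).astype(int)
selected, cutoff, score = select_features(X, y, cutoffs=[0.01, 0.05, 0.1, 0.2])
```

On this toy data the two informative features receive far higher importance than the noise features, so they survive every reasonable cutoff.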

After the relevant features were selected, we designed and built a multi-input CNN architecture with TensorFlow Keras²⁵ that took N inputs of 4 × 21 matrices representing N selected SNP features into the first layer. Each 4 × 21 matrix consists of normalized DNA base counts for each locus within a 21-base reference sequence window centered on the focal SNP (Fig. 3). We generated normalized counts based on the raw base counts extracted from the read depth file mentioned in “Genetic feature extraction” section. Our convolutional architecture starts with two 1D convolutional layers followed by a flattening layer for each SNP input. Then, it concatenates the N flattening layers with the inputs of AMR-associated gene presence and lineage features. Finally, we added three fully connected layers to complete the deep neural network architecture (Fig. 4). This design integrates sequential and non-sequential features in a single model.
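A possible construction of one 4 × 21 input matrix is sketched below. The normalization scheme is not detailed in the text, so the sketch assumes that each locus is normalized by its total read depth, giving per-base fractions; zero-padding at the ends of the reference and the function name `snp_window` are also assumptions.

```python
import numpy as np

WINDOW = 21           # window length stated in the text
HALF = WINDOW // 2    # 10 loci on each side of the focal SNP


def snp_window(base_counts, snp_pos):
    """Build a 4 x 21 matrix of normalized base counts around a SNP.

    base_counts: (4, ref_len) raw counts of A, C, G, T per reference locus
    snp_pos:     0-based index of the focal SNP on the reference
    Assumptions: counts are divided by per-locus depth; loci outside the
    reference, and loci with zero depth, are filled with zeros.
    """
    counts = np.asarray(base_counts, dtype=np.float64)
    ref_len = counts.shape[1]
    window = np.zeros((4, WINDOW))
    start, stop = snp_pos - HALF, snp_pos + HALF + 1
    lo, hi = max(start, 0), min(stop, ref_len)
    window[:, lo - start:hi - start] = counts[:, lo:hi]
    depth = window.sum(axis=0, keepdims=True)
    return np.divide(window, depth, out=np.zeros_like(window), where=depth > 0)


# Made-up counts: every locus has 30 reads of A and 10 reads of G.
counts = np.tile([[30], [0], [10], [0]], (1, 50))
centered = snp_window(counts, snp_pos=25)   # window fully inside the reference
near_edge = snp_window(counts, snp_pos=2)   # window overhangs the 5' end
```

In a Keras model such a matrix would typically be transposed to shape (21, 4) so that the 21 loci form the axis along which the 1D kernels move.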

Results

Isolate identification and DST phenotype

To explore genetic information obtained by running the ARIBA steps listed in “Genetic feature extraction” section, we calculated the numbers of isolates matched on different reference clusters (Fig. 5a) and generated a circular phylogenetic tree with lineage and phenotypic DST data annotations (Fig. 5b).

Figure 5a shows that reads from most isolates were mapped on MTB reference clusters by using the ‘match’ criteria, while only a small portion of isolates were matched to reference clusters of other bacteria (clusters below ‘Mycobacterium_-’ in Fig. 5a, e.g., Staphylococcus aureus). By using a pan-genome reference, we can get more reliable alignments to detect variants more accurately²⁷. The 6291 isolates with phenotypic DST results available for the four first-line drugs were clustered into a phylogenetic tree (Fig. 5b). The inner circle is a phylogenetic tree based on the genetic information (reference cluster matches and known variants) detected by ARIBA, where leaves (isolates) were colored according to their lineage information. Here, isolates of the same lineage clustered together, providing confidence in the quality of isolate identification from genetic information. The outer circles show phenotypic DST of resistance or susceptibility to the four drugs for each isolate. Taken together, there are clear patterns and relationships among lineages, AMR phenotype, and genetic data.

Selected features for 1D CNN

After we performed the feature selection, the top 42 (RIF), 68 (INH), 113 (PZA) and 125 (EMB) drug-specific features were collected. Across these four sets, there were 42 shared features, indicating that the 42 features selected for RIF resistance prediction are also relevant to AMR classification of the other three drugs (Additional file 1: Fig. S1). We also ran the same feature selection procedure on second-line drugs: amikacin (AMK), capreomycin (CM), kanamycin (KM) and ofloxacin (OFX). For each of the eight (first- and second-line) drugs, all selected variant features are known AMR-associated variants from the current version of the CARD database (Nov. 2021). We compared our selected variants with the AMR-associated mutations in MTB that were recently published by WHO²⁸. We list all selected variants and highlight the ones overlapping with WHO’s AMR-associated MTB mutations in Additional file 1: Table S1. Overall, 78.8% of the selected variants are also in WHO’s list.

Training and evaluation

We performed tenfold cross-validation to train and test 24 binary classifiers of AMR status across the eight (first- and second-line) drugs and three different ML algorithms: LR, RF and customized 1D CNN. The four datasets described in “Data collection” section were used to train and test our first-line drug-specific models. In addition, we collected training data from the 6820 of the 10,575 isolates with phenotypic DST results for the second-line drugs, and trained and tested ML AMR classifiers for the second-line drugs by applying the same steps as for the first-line drugs. The second-line drugs are listed in the preceding section, “Selected features for 1D CNN”. To compare our models with a rule-based method, we also tested the state-of-the-art AMR prediction tool Mykrobe predictor on the same sample sets used for each of the eight TB drugs. The precision, sensitivity, specificity, accuracy, F1-score and G-mean were calculated to evaluate the different methods (Table 1).
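A tenfold cross-validation loop for one drug-specific classifier might look like the following sketch. It uses logistic regression as a stand-in model and stratified, shuffled folds, both of which are assumptions; the synthetic data are likewise made up. Each isolate is predicted exactly once, by the model that did not see it during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data with an imbalanced binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 10)).astype(float)
y = (X[:, 0] * X[:, 1]).astype(int)

# Assumption: folds are stratified so each keeps the resistant/susceptible ratio.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Out-of-fold probability of the resistant class for every isolate.
proba = cross_val_predict(model, X, y, cv=folds, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)   # 0.5 probability threshold
cv_f1 = f1_score(y, pred)
```

Pooling the out-of-fold predictions before scoring is one convention; averaging per-fold scores is another and can give slightly different numbers.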

$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\quad \mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},$$

$$\mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}},\quad \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}},$$

$$\mathrm{F1\text{-}score}=\frac{2\times \mathrm{Precision}\times \mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}},\quad \mathrm{G\text{-}mean}=\sqrt{\mathrm{Sensitivity}\times \mathrm{Specificity}},$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. We used the default probability threshold of 0.5 to decide whether an isolate is classified as susceptible or resistant for all our ML models; however, the performance of our models could be improved in the future by tuning this hyperparameter.
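The six metrics defined above can be computed directly from the four confusion-matrix counts, as in this small sketch. The counts in the example are invented for illustration and are not results from this study.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute the six evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical imbalanced test set: 100 resistant and 900 susceptible isolates.
metrics = evaluate(tp=90, tn=880, fp=20, fn=10)
```

In this made-up example the accuracy is 0.97 while the F1-score is about 0.857, which shows how accuracy can look better than F1 when susceptible isolates dominate.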

Table 1 Evaluation of AMR classifiers for first-line and second-line anti-TB drugs (our ML methods vs. the rule-based method Mykrobe predictor).

We calculated different metrics to measure the performances of the four approaches (Table 1). The F1-score is the harmonic mean of precision and sensitivity and balances precision and sensitivity equally. Since the F1-score does not consider True Negatives (TN) we also included the geometric mean of sensitivity and specificity (G-mean) as an additional metric. However, in cases of imbalanced classes, the interpretability of these various metrics starts to break down. Although it is imperfect, we focus on F1-scores because the equal balance between precision and recall is relevant for our interpretation and is important for reducing bias in imbalanced datasets. In terms of F1-score, our three ML methods outperformed the rule-based method Mykrobe predictor for all the four first-line drugs and one of the second-line drugs, while the 1D CNN classifier achieved the highest scores overall.

In their manuscript, the Mykrobe predictor authors stated that the sensitivity of their MTB drug resistance prediction was low, potentially because their graph-based association rule had limited understanding of the underlying genetic mechanisms. We confirmed their observation when testing Mykrobe predictor on our datasets. As shown in Table 1, for EMB, our best model greatly improved the sensitivity from 72.4 to 94.5%, suggesting that our 1D CNN models can detect more complex or subtle genetic mechanisms.

Discussion

According to our tenfold cross-validation, our best ML classifiers showed a substantial increase in F1-score for all four first-line drugs and one second-line drug when compared with the predictions of the state-of-the-art rule-based method Mykrobe predictor. Our 1D CNN architecture only slightly outperformed the traditional ML methods LR and RF, although it requires more computing resources during training. To reduce the computing resource requirements, we performed feature selection to remove irrelevant features before training the 1D CNN models. For each drug, all selected variant features are known variants in the current version of CARD. In this study, a special 1D CNN architecture was built to fit our mixed-type data (sequential and non-sequential). As this is our first-stage study of MTB AMR classification, we did not perform hyperparameter optimization, which is a potential way to improve our models in the future. In addition, we can include novel variants in non-coding regions and larger variants (e.g., indels) as additional features and try computationally expensive wrapper-type feature selection algorithms (e.g., recursive feature elimination²⁹) to compare with the filter-based one used in this study³⁰. Because ARIBA does not focus on detecting low-frequency variants in NGS data, and low-frequency variants are also associated with AMR, in future work we could add low-frequency variants as additional features for training our ML models by using a dedicated SNP detection tool such as binoSNP³¹.

The large and diverse dataset of Mycobacterium isolates used in our study should yield more generally trained models that predict future samples more accurately, presumably because a diverse dataset manages overfitting better than regularization on a less diverse dataset. It is important to note that drug resistance in Mycobacterium tuberculosis is not known to involve plasmids. To extend our model to bacteria in which plasmids play a role in resistance, the reference database used to generate reference clusters with ARIBA would need to contain complete plasmid sequences, as does the CARD database used in this study. In this way, additional plasmid features could be easily integrated into the models presented here.

Although we focused on the F1-score as our metric of performance because it balances precision with recall, it does receive criticism because it ignores True Negatives (TN). In many clinical settings both specificity and sensitivity have critical impacts on patients and the care they receive. We also presented the G-mean, which is simply the geometric mean of sensitivity and specificity; however, interpretation may be biased in cases where there is an imbalance of classes (e.g., number of resistant versus non-resistant isolates). When focusing on this metric, there is more variability in performance outcomes between the rule-based and the ML methods presented here. Regardless, across all these methods substantial gains in specificity are possible and should be a focus of future work in this area.

Finally, we automated the whole process, from data collection to model training and evaluation, into a flexible pipeline that can be easily updated with new strains or used to train AMR prediction models of different antibiotics for other bacteria (see Additional file 1: Fig. S2 for an overview of the pipeline). Given the availability of WGS data and lineage information for MTB, our ML models can classify MTB resistance against the eight anti-TB drugs with relatively high accuracy while requiring only the computational resources of a standard laptop.

Conclusions

AMR infection is one of the major threats to human health. In silico methods are effective for predicting drug resistance and are a reliable alternative to in vitro assays, which are much slower and more expensive. Statistical association rules and ML are the two main types of in silico approaches. We developed ML models for first-line TB drug resistance classification on a large and diverse MTB isolate cohort and compared them with a statistical rule-based method. The results show that our ML models are more accurate and stable for TB drug resistance prediction across the four first-line drugs than the rule-based method Mykrobe predictor. We designed and developed a customized 1D CNN architecture to combine sequential and non-sequential features. Even though our deep CNN models have not taken advantage of any optimization strategies (e.g., hyperparameter tuning), our CNN architecture slightly outperformed the two traditional ML algorithms. In our variant analysis, 78.8% of the variant features selected for CNN model training are also identified as TB drug resistance-associated by WHO.

Data availability

The WGS data of the MTB cohort analyzed in this study are available in SRA database. Code of the ML model development pipeline written for this study is available at https://github.com/KuangXY3/MTB-AMR-classification-CNN.

Abbreviations

MTB: Mycobacterium tuberculosis
AMR: Antimicrobial resistance
ML: Machine learning
LR: Logistic regression
RF: Random forest
CNN: Convolutional Neural Network
INH: Isoniazid
RIF: Rifampicin
EMB: Ethambutol
PZA: Pyrazinamide
MDR-TB: Multidrug-resistant tuberculosis
CARD: Comprehensive Antibiotic Resistance Database
PATRIC: Pathosystems Resource Integration Center
SVM: Support vector machine
WGS: Whole-genome sequencing
SRA: Sequence read archive
DST: Drug susceptibility test
TP: True positive
TN: True negative
FP: False positive
FN: False negative

References

1. Centers for Disease Control and Prevention (U.S.). Antibiotic Resistance Threats in the United States, 2019. (Centers for Disease Control and Prevention (U.S.), 2019). https://doi.org/10.15620/cdc:82532.
2. Brogan, D. M. & Mossialos, E. A critical analysis of the review on antimicrobial resistance report and the infectious disease financing facility. Glob. Health. https://doi.org/10.1186/s12992-016-0147-y (2016).
3. Holmes, A. H. et al. Understanding the mechanisms and drivers of antimicrobial resistance. Lancet 387, 176–187 (2016).
4. World Health Organization. Global tuberculosis report 2020. 2020. https://www.who.int/westernpacific/health-topics/tuberculosis. Accessed 10 May 2021.
5. Treatment for TB Disease|Treatment|TB|CDC. 2019. https://www.cdc.gov/tb/topic/treatment/tbdisease.htm. Accessed 10 May 2021.
6. Boolchandani, M., D’Souza, A. W. & Dantas, G. Sequencing-based methods and resources to study antimicrobial resistance. Nat. Rev. Genet. 20, 356–370 (2019).
7. McArthur, A. G. et al. The comprehensive antibiotic resistance database. Antimicrob. Agents Chemother. 57, 3348–3357 (2013).
8. Wattam, A. R. et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 42(Database issue), D581–D591 (2014).
9. Yang, Y. et al. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics 34, 1666–1671 (2018).
10. Kouchaki, S. et al. Application of machine learning techniques to tuberculosis drug resistance analysis. Bioinformatics 35, 2276–2282 (2019).
11. Zhang, A., Teng, L. & Alterovitz, G. An explainable machine learning platform for pyrazinamide resistance prediction and genetic feature identification of Mycobacterium tuberculosis. J. Am. Med. Inform. Assoc. 28, 533–540 (2021).
12. Iranzadeh, A. & Mulder, N. J. Bacterial pan-genomics. In Microbial Genomics in Sustainable Agroecosystems Vol. 1 (eds Tripathi, V. et al.) 21–38 (Springer, 2019). https://doi.org/10.1007/978-981-13-8739-5_2.
13. Jayakodi, M. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588, 284–289 (2020).
14. Bradley, P. et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat. Commun. 6, 10063 (2015).
15. CRyPTIC Consortium and the 100 000 Genomes Project. Prediction of susceptibility to first-line tuberculosis drugs by DNA sequencing. N. Engl. J. Med. 379, 1403–1415 (2018).
16. Alcock, B. P. et al. CARD 2020: Antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 48, D517–D525 (2020).
17. Leinonen, R., Sugawara, H., Shumway, M., International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 39, D19-21 (2011).
18. Hunt, M. et al. ARIBA: Rapid antimicrobial resistance genotyping directly from sequencing reads. Microb. Genom. 3, e000131 (2017).
19. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
20. Qi, Y. Random forest for bioinformatics. In Ensemble Machine Learning: Methods and Applications (eds Zhang, C. & Ma, Y.) 307–323 (Springer US, 2012). https://doi.org/10.1007/978-1-4419-9326-7_11.
21. Kleinbaum, D. G. & Klein, M. Logistic Regression: A Self-Learning Text 3rd edn. (Springer, 2010). https://doi.org/10.1007/978-1-4419-1742-3.
22. Lee, S.-I., Lee, H., Abbeel, P., Ng, A. Y. Efficient L1 Regularized Logistic Regression, vol. 8 (2006).
23. Kiranyaz, S. et al. 1D convolutional neural networks and applications: A survey. Mech. Syst. Signal Process. 151, 107398 (2021).
24. Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
25. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. TensorFlow: A system for large-scale machine learning. 21 (2016).
26. HOLT LAB. plotTree Plotting trees with data using R and Python. (2016) https://github.com/katholt/plotTree. Accessed 15 March 2021.
27. Jandrasits, C., Kröger, S., Haas, W. & Renard, B. Y. Computational pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters. PLOS Comput. Biol. 15, e1007527 (2019).
28. World Health Organization. Catalogue of mutations in Mycobacterium tuberculosis complex and their association with drug resistance. (2021).
29. Chen, X. & Jeong, J. C. Enhanced recursive feature elimination. In Sixth International Conference on Machine Learning and Applications 429–435 (ICMLA 2007).
30. Chandrashekar, G. & Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 40, 16–28 (2014).
31. Dreyer, V. et al. Detection of low-frequency resistance-mediating SNPs in next-generation sequencing data of Mycobacterium tuberculosis complex strains with binoSNP. Sci. Rep. 10(1), 1 (2020).

Acknowledgements

X.K. thanks the Center for Translational Data Science at the University of Chicago for its support throughout the analysis. We thank B. Winslow, J. Qureshi, and E. Malinowski for configuring and deploying the cloud resources used for this work.

Funding

The work was not supported by any funding.

Author information

Authors and Affiliations

Center for Translational Data Science, The University of Chicago, Chicago, IL, 60615, USA
Xingyan Kuang, Fan Wang, Kyle M. Hernandez, Zhenyu Zhang & Robert L. Grossman
Department of Medicine, The University of Chicago, Chicago, IL, 60637, USA
Kyle M. Hernandez & Robert L. Grossman

Contributions

X.K. designed the study, analyzed data, and implemented the pipeline for model development; F.W. analyzed selected variants; X.K., F.W. and K.M.H. contributed towards writing the manuscript with comments from Z.Z. and R.L.G.; all authors contributed feedback on the manuscript.

Corresponding authors

Correspondence to Xingyan Kuang or Robert L. Grossman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kuang, X., Wang, F., Hernandez, K.M. et al. Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN. Sci Rep 12, 2427 (2022). https://doi.org/10.1038/s41598-022-06449-4

Received: 23 September 2021
Accepted: 31 January 2022
Published: 14 February 2022
DOI: https://doi.org/10.1038/s41598-022-06449-4
