Introduction

Klebsiella pneumoniae infections are a major cause of morbidity and mortality worldwide. Over the past several years, the emergence of antimicrobial resistant (AMR) K. pneumoniae strains has been increasing at an alarming rate, with reports of pan-resistant strains appearing in the literature and lay press1,2,3. Reports of hospital-based outbreaks are particularly concerning, and recent evidence suggests that AMR K. pneumoniae clones are circulating in the community3,4,5. As antimicrobial resistance increases, fewer effective antibiotics are available for physicians to treat these life-threatening infections. In response, the World Health Organization recently listed carbapenem and third generation cephalosporin resistant Enterobacteriaceae (including K. pneumoniae) among the most critical organisms for antimicrobial drug development6.

When a patient is diagnosed with an infection, it is critically important to prescribe appropriate antimicrobial therapy as quickly as possible. Rapid pathogen identification and appropriate antimicrobial therapy administration significantly decreases mortality, improves patient outcomes, reduces health care costs, and decreases the use of ineffective or inappropriate antibiotics7,8,9. For bloodstream infections, mortality increases every hour that appropriate therapy is delayed9. The conventional clinical microbiology laboratory evaluation of a suspected infection requires inoculation of the specimen on primary culture media and incubation until there is sufficient growth to perform taxonomic identification and minimal inhibitory concentration (MIC) determination. In many cases, subcultures are needed to purify mixed cultures containing more than one organism or generate sufficient colonial material for testing. Depending on the growth rate of the organism and the MIC testing procedures used, the multiple culture steps can add one or more days to the laboratory workup10,11.

Compared to conventional culture-based methods, rapid molecular assays may significantly reduce turnaround times by eliminating one or more subculture steps. The most common sequence-based methods for predicting the AMR phenotypes of an organism identify the presence of genes implicated in resistance using PCR, microarrays or whole genome sequencing12,13,14. Well-designed gene-based detection methods are capable of providing an accurate prediction of susceptibility or resistance for the genes tested, but and there are several limitations to this approach. First, it relies on well-curated databases of AMR genes, which can be difficult to maintain15,16,17,18. For example, the commonly used databases of AMR genes are currently excellent at cataloging well-studied AMR genes like β-lactamases19, but often lack data for diverse efflux mechanisms18. Second, similarity-based matching strategies to determine AMR gene content may incorrectly assign AMR functions to paralogous non-AMR genes. Gene based approaches may also fail to identify critical mutations in intergenic regions, including regulatory and promoter sequences, leading to a false-negative susceptibility prediction. Also, PCR based methods use primers for amplification, which may not anneal if mutations are present in the complementary region, again rendering an incorrect result. Finally, since these methods are based on preexisting knowledge of AMR conferring genomic regions, they are not able to predict resistance if the molecular mechanism is unknown or multifactorial.

The public sharing of whole genome sequence data with clinical AMR metadata has enabled the use of machine learning (ML) methods that predict AMR phenotypes without relying on a database of preexisting AMR genes or mutations. Two recent studies have used this approach to obtain accurate predictions of susceptibility or resistance in organisms with no a priori information about the gene content of the organisms20,21. To do this, they used short nucleotide k-mers as features and the laboratory derived AMR phenotypes as labels. Other studies have successfully used AMR genes, SNPs, and whole genome sequence data (or a combination thereof) to build ML classifiers with good accuracies22,23,24,25,26,27. Recent examples of gene-based and whole genome-based classification approaches for Klebsiella were reported by Stoesser et al.27, Long et al.3, and Pesesky et al.24.

To date, most AMR prediction methods have focused on classifying “susceptible” and “resistant” phenotypes. While simple and oftentimes sufficient, this approach can be error prone because it relies on the clinical interpretations of break point values. Also, intermediate phenotypes do not fit within this classification scheme. A small number of studies have attempted to predict MICs based on gene content28,29,30. One notable recent publication used an ML algorithm trained on the SNPs from several key AMR genes to successfully predict MICs for Neisseria gonorrhoeae 29.

In this study, we present an ML approach for predicting MICs for K. pneumoniae. Our strategy requires no a priori knowledge of the underlying gene content. The current model offers MIC prediction for 20 antibiotics. To our knowledge, this is the largest MIC prediction study for a human pathogen to date. We discuss the strengths and limitations of our approach and the necessary steps required to implement in silico MIC prediction using whole genome sequence data for K. pneumoniae in the clinical laboratory.

Results

Approach

For several years, the microbiology laboratory at Houston Methodist Hospital System has been banking clinical isolates of K. pneumoniae. We recently sequenced the genome of AMR K. pneumoniae strains recovered from patients between 2011 and 20153. Our goal is to use whole genome sequencing to detect the emergence of highly virulent clones, monitor the spread of AMR, and guide patient care decisions31,32. We routinely perform whole genome sequencing in our clinical laboratory. Importantly, as the cost and speed of whole genome sequencing continues to decrease, it increasingly becomes a viable option for routine microbial diagnostics.

Using the whole genome sequence data for our K. pneumoniae clinical isolates, we sought to build an ML model that accurately predicts the MIC for 20 antibiotics. We chose a strategy that uses the entire genome as input, rather than individual genes, since this approach requires no a priori knowledge of the underlying gene content, and could potentially use data from uncharacterized AMR genes, intergenic or polymorphic coding regions, or non-AMR genes that may indirectly effect the MIC. To accomplish this, we computed the counts of all overlapping 10-mer oligonucleotide k-mers and combined them with the clinical laboratory generated MIC data for each antibiotic to form one large matrix containing both the k-mers and antibiotics as features. After exploring the problem as both a multiclass classification problem and a regression problem and evaluating common ML algorithms, we chose an extreme gradient boosting regression model through the XGBoost library33 (Fig. 1). We then iteratively evaluated the available parameters of the algorithm to maximize the accuracy of the model (Figure S1, Materials and Methods).

Figure 1
figure 1

The pipeline used to optimize and train the XGBoost model using known data (blue), and to predict the MIC values for a new genome (yellow).

Model Accuracy

A 10-fold cross validation was used to access the overall stability and accuracy of the model. The raw accuracy of the model, defined as the ability to predict the exact laboratory derived MIC for a given genome and antibiotic (Supplemental Table S1), and accuracy within ±1 two-fold dilution factor (or 1-tier) of the actual MIC was measured (Table 1). Bounding the accuracy to within one two-fold dilution factor of the laboratory determined MIC is consistent with current FDA standards for diagnostic tools and conventional clinical microbiology practices34,35. We also evaluated the model based on the very major error (VME) rate, defined as a resistant isolate having a MIC that is predicted to be susceptible, and the major error rate (ME), defined as a susceptible isolate having a MIC that is predicted to be resistant. MIC thresholds for susceptibility and resistance for clinical data and model predictions are based on current CLSI breakpoints36.

Table 1 Accuracies for the entire XGBoost model and for the individual antibiotics.

The average raw accuracy of the entire model, testing on all available MICs for 20 antibiotics was 69% with a 95% confidence interval of [68%; 69%]. The within 1-tier accuracy was much higher, at 92% with a 95% confidence interval of [92%; 92%] (Table 1, Supplemental Table S1). The large difference between the raw accuracy and the within 1-tier accuracy is probably the result of a variety of factors including the inherent error in the laboratory MIC testing procedure37, variations in growth and testing conditions, MICs with > and < values (which may actually represent a range of values), and a possible lack of discriminating genetic features between adjacent MIC dilutions. The raw accuracies and within 1-tier accuracies for the individual antibiotics track similarly, with the raw accuracies being lower and the within 1-tier accuracies being markedly higher. Overall, 15 of the 20 antibiotics have within 1-tier accuracies ≥90%. Three antibiotics have within 1-tier accuracies = 89%, while piperacillin/tazobactam and cefepime had lower within 1-tier accuracies of 78% and 61%, respectively (Table 1).

The accuracy of the model varies across the MICs for each antibiotic, in part, due to nonuniform representation of genomes for every possible value; however, the included genomes are representative of K. pneumoniae strains causing human infections. Figure 2 depicts the number of genomes and the within 1-tier accuracy for each MIC and antibiotic. Overall, MICs represented by many genomes tend to have high accuracies and MICs represented by fewer genomes tend to have lower accuracies. For example, the model has higher accuracies for β-lactam resistant MICs than susceptible MICs because there are fewer genomes for susceptible strains in our data set. In some cases, the model appears to be able to successfully interpolate over bins with a small number of samples. For example tobramycin MIC = 4 μg/ml, which is in between bins with deep sampling and high accuracy had 23 samples and an accuracy equal to 90%. Accuracies and confidence intervals for all bins are shown in Supplemental Table S2. The actual and predicted MICs for each genome are displayed in Supplemental Table S3.

Figure 2
figure 2

The accuracy of the XGBoost model for individual MICs. The X-axis of the heatmap shows the actual MIC (μ g/ml) for a bin and the Y-axis lists the antibiotics. The within ±1-tier accuracy of a particular antibiotic-MIC bin is denoted by color, with red and orange being least accurate and bright yellow and green being most accurate. The number within each cell represents the number of samples (genomes with the MIC) within the bin.

The average VME rate for the entire model was 3.1% with a 95% confidence interval of [2.8%;3.4%](Table 2). The average ME rate was 3.7% with a confidence interval of [3.3%;4.1%] (Table 2). The VME ranged from 0 for ampicillin and ceftriaxone to 30% for amikacin. Poorer prediction of amikacin MICs is likely due to the lower representation of amikacin resistant genomes in the dataset. Eleven of the antibiotics have VME rates with confidence intervals between 1.5 and 7.5% for the lower and upper bounds respectively. Likewise, 12 of the antibiotics have ME rates ≤3%. These results align with FDA diagnostic device standards34, suggesting that our model may be suitable for clinical use. However, it must be noted that susceptible and resistant MICs are not balanced in the data set, so the lowest VME rates tend to track with antibiotics that have the largest number of resistant genomes. We plan to test additional K. pneumoniae strains with these AMR phenotypes to fill this gap.

Table 2 Error rates for the entire XGBoost model for the individual antibiotics.

Correlation with AMR Genes

The XGBoost algorithm generates a regression model that is based on an ensemble of decision trees. In our model, these decision trees split the data based on k-mers that distinguish the different antibiotics and MICs. Interpreting the features that are being selected by the model and subsequently understanding their relationship to a given antibiotic and resistance mechanism can be challenging for several reasons. Primarily, understanding the feature importance is an open graph theory problem. Furthermore, if an important k-mer maps to an uncharacterized gene or genomic region, it becomes difficult to determine if it is a hallmark of resistance or susceptibility. Finally, unambiguously associating a k-mer to a given antibiotic may inappropriately confer resistance to several related antibiotics, and we know that this dataset contains multiple resistance gene types such as the Bla-KPC and TEM β-lactamases3.

In general, the AMR that is detected in K. pneumoniae strains isolated from patients is usually conferred by the acquisition of antibiotic resistance genes rather than an accumulation of SNPs3,38. We reasoned that in many cases, the MICs for an antibiotic should correlate with the occurrence of genes known to confer resistance to that antibiotic. By comparing the correlation of MICs and functions with the correlation that is based on the predicted MICs from the model, we gain an understanding of the relationship between MICs and AMR genes. We can also observe how closely the model is replicating these relationships. To do this, we computed the Pearson correlation coefficient (PCC) between the MICs for each genome and the presence or absence of each unique function in each genome (Table 3, Supplemental Table S4).

Table 3 The function that is most highly correlated with the MICs for each antibiotic.

In the case of 12 antibiotics, we observe high PCCs between the MICs and the functions for well-known AMR genes. For example, the β-lactam antibiotics correlate with the CTX, KPC and TEM type β-lactamase genes and the aminoglycoside antibiotics correlate with aminoglycoside acetyltransferase genes. For 8 of the 20 antibiotics, the association between the functions and AMR is not obvious. In some of these cases, the functions appear to be related to horizontal gene transfer, and this may have resulted in their high PCCs. In other cases, the genes may not be sufficiently characterized by the raw 10-mer counts. When the analysis is repeated using MICs that are predicted by the model, the PCCs track very closely with those generated from the actual MICs, indicating that the model is not only learning that these genes exist (the model is only fed 10-mer counts, not whole genes), but it is also placing importance on these genes. Similarly, there is considerable overlap between the top ten most highly correlated functions based on the actual and predicted MICs (Table 3, last column).

In previous work, we built 16 AdaBoost-based classifiers for predicting susceptibility and resistance and installed them in the RAST annotation system3. Thirteen of the antibiotics that were used to build these classifiers are represented in our MIC prediction model. The top genomic regions predicted by the K. pneumoniae AdaBoost classifiers correspond to the most most highly correlated functions from the MIC prediction model in for four of the thirteen antibiotics. These include gentamicin, imipenem, meropenem and tetracycline. The top AdaBoost feature can be found found among the top five most highly correlated functions in three more cases including aztreonam, cefoxitin, and trimethoprim/sulfamethoxazole. Since the AdaBoost classifiers match SNPs in gyrase and topoisomerase genes conferring resistance to ciprofloxacin and levofloxacin, assessing the correlation to the presence of functions will not work for these two antibiotics. It is not immediately clear why the top features for the other four antibiotics (amikacin, cefepime, piperacillin/tazobactam and tobramicin) are not highly correlated, but no characterized AMR functions appear in the top 10 most highly correlated features.

Building a Model Based on AMR Genes

Although we chose to build a model that was based on data from the entire genome, previous studies have built MIC prediction models using the known AMR genes28,29,30. We wanted to know whether a model based on whole genome data, or a model based on only AMR genes would be more accurate. On one hand, the extra k-mers used by the whole genome model could be a source of noise, but on the other hand, they may represent useful data that could making the model more accurate. To make this comparison, we built a model that used only the AMR genes as the source of k-mers, keeping all other parameters identical. Both the PATRIC annotation service18,39 and the CARD database15 were used as sources of potential AMR genes. The overall accuracies for the whole genome and AMR gene-based models are both 92%. Likewise, the individual accuracies for each antibiotic are also nearly identical, differing by no more than 2% for any antibiotic (Supplemental Table S5, Supplemental Figure S2). This indicates that the extra k-mers used by the whole genome are not a source of noise for the XGBoost model. Since the AMR genes model is much smaller than the whole-genome model (20GB vs. 148GB), it is more efficient to compute, and it is therefore tempting to conclude that a model built from AMR genes is sufficient for performing MIC prediction for K. pneumoniae. However, due to the smaller number of susceptible and intermediate samples in this data set, more genome sequencing is necessary to determine if a model built from AMR genes is sufficient for accurately predicting these lower MIC values.

When we repeat this analysis building an identical model for the leftover non-AMR genes, we also observe an overall accuracy of 92%, and accuracies for individual antibiotics that track closely with the models built from full contigs and AMR genes (Supplemental Table S5, Supplemental Figure S2). This indicates that there is enough residual data in the non-AMR genes to accurately predict MIC values as well. This could be due to the presence of uncharacterized AMR genes in the data set, the existence of relevant information such as non-AMR genes co-occurring with AMR genes, or variants in non-AMR genes that have an impact on MICs. In the case of all three models, it is unlikely that the accuracy is due to the model mapping to a phylogenetic pattern between strains, since nearly clonal strains of the same MLST type can have a variety of different MIC values per antibiotic, and closely related strains can be quite diverse in their AMR gene content (Supplemental Table S6, Supplemental Figure S3).

Discussion

Using the clinical laboratory determined MICs and whole genome sequence data for 1668 K. pneumoniae strain recovered from infected patients, we built an XGBoost regression model that can predict the MICs for 20 antibiotics with an average within ±1-two fold dilution factor accuracy of 92%. Individually, 15 of the 20 antibiotics have within 1-tier accuracies >90%. These results demonstrate that accurate genome-based MIC prediction is possible for K. pneumoniae isolates. Herein, we provide the necessary groundwork for building a complete in silico panel.

To date, our whole genome sequencing efforts have focused on the most highly antibiotic resistant pathogens, including extended spectrum β-lactamase-(ESBL) producing K. pneumoniae 3, so the model currently lacks sufficient inclusion of organisms with MICs in the lower range. We also currently lack sufficient data to predict MICs for some less-commonly tested antibiotics, including amoxicillin/clavulanate, ertapenem, fosfomycin, moxifloxacin ticarcillin/clavulanate and tigecycline. Ideally, selecting a balanced number of organisms at each possible MIC could improve the overall accuracy of the model. Our future efforts will seek to close these gaps. Inclusion of additional strains will also improve the ME and VME rates. The K. pneumoniae strains used in our model were collected as part of a large, comprehensive, population-based study in Houston. Although our sampling capacity is extensive and we treat a diverse population of patients from Houston and around the world, the model may be further improved by including samples from other geographic locations.

In previous work, we built binary classifiers that can predict susceptibility or resistance for a given species and antibiotic18,20. Although somewhat simplistic in approach, the method provides a straightforward way to extract the genomic features relating to resistance. In this study, in order to achieve high accuracies for predicting each MIC, we combined the antibiotics and k-mers into a large matrix and used XGBoost as the ML method. While the approach is clearly advantageous because it provides a more refined phenotype prediction, feature extraction from this kind of model remains quite challenging. Although we have shown strong correlations between the actual MICs and predicted MICs with known AMR genes, we will continue to explore ways to extract the key gene data that contributes to each MIC. Importantly, these studies may provide new insight to the molecular mechanisms conferring intermediate phenotypes.

We chose to present a model that was built using k-mers from the entire genome because it requires no underlying information about the gene content of the organisms. Furthermore, since it is likely that genomic features that are not annotated as being directly involved in conferring AMR may be causing the distinction between lower MICs, a whole-genome approach may eventually lead to higher accuracies as we acquire more data. Using a whole-genome-based approach may also lead to an improved understanding the genomic features that are causing the various MIC phenotypes. However at present, the current data set and MIC data for K. pneumoniae published by other groups40,41,42,43 lacks sufficient sampling of genomes with susceptible and intermediate phenotypes to determine if either approach will ultimately differ in accuracy. If the XGBoost model remains robust on the smaller datasets like the AMR and non-AMR gene sets, this implies that it may be possible to build models that can predict AMR phenotypes from incomplete genomic data, which may help to enable the characterization of metagenomic samples.

In this study, we found that deeper trees, with a depths of 3–4, were optimal for the XGBoost model. A logical next step will be to train a deeper model, like a neural network, on this data set to determine if the accuracy can be further improved. A deep learning approach may also provide more efficient memory usage and reduced computational times. However, this strategy would not improve the extraction and interpretation of AMR-related genomic features, since feature extraction from deep learning methods is more challenging compared to ensemble methods such as XGBoost.

The genomes used in this study were sequenced using Illumina sequencing technology. In order to generate genome sequence data cost effectively, we collected samples and performed highly multiplexed runs in batches. However, this is not a clinically time efficient strategy. Newer devices such as the PacBio Sequel and Oxford MinION could potentially be used for point of care sequencing, and thus, may become a model for whole genome sequence-based diagnostic strategies44. However, at present, the cost of sequencing individual pathogens using these technologies is higher, and their error rates may be too high for effective MIC prediction with our ML method45. In order to couple the MIC prediction model outlined in this study with these sequencing platforms, we may need to either incorporate an error correction model for processing MinION or PacBio reads, or regenerate the model using genomes sequenced with these technologies. Further analyses to sequence based off of blood enrichment cultures, or from the actual wound source, rather than the pure culture, would also provide more rapid results, but require algorithms for identifying pathogens and eliminating host DNA and other contaminants. Regardless, sequence-based MIC prediction appears to be feasible as a diagnostic strategy.

Methods

Strain collection

Klebsiella pneumoniae isolates were cultured from patient specimens in the Houston Methodist Hospital System between September of 2011 and March of 2017. Strains were tested for minimum inhibitory concentrations to 20 antibiotics using the BD-Phoenix system (BD Diagnostics, Sparks, MD, USA). All of the strains collected before 2015 were originally part of two studies by Long and colleagues designed to track extended spectrum β-lactamase- (ESBL) producing strains3,46. A total of 1497 strains from the Long et al. study with BD-Phoenix MIC data were used in this analysis (Supplemental Table S3). An additional 171 isolates, 93 ESBL-producing and 78 non-ESBL-producing (Phoenix ESBL test; BD), were also panel tested, whole genome sequenced, and used in this study. In total, we analyzed 1668 Klebsiella pneumoniae genomes. Strains were collected with informed consent under human subjects protocol IRB1010-0199 as approved by the Institutional Review Board, Houston Methodist Research Institute.

DNA extraction and whole-genome sequencing

Genomic DNA was extracted using the manufacturer’s Gram-negative protocol for the DNeasy Blood and Tissue kit (Qiagen) and quantified using a Qubit 3.0 instrument (Invitrogen). Whole-genome sequencing libraries were prepared using NexteraXT reagents (Illumina) and sequenced on a MiSeq or NextSeq. 500 instrument (Illumina).

Data preparation

When the MICs produced from the BD-Phoenix tests exceeded the testing thresholds, they were cleaned to remove the >, <, ≥, and ≤ symbols. If the MIC was >x, the MIC label was changed to 2x, if it was <x, the MIC was changed to \(\frac{x}{2}\), and if the MIC was ≥x or ≤x, the symbol was removed and the MIC remained unchanged. The Log 2 value of these cleaned MICs were used for all ML tasks. For dual antibiotics with two MICs, such as trimethoprim/sulfamethoxazole, the first value was used in all cases, since the second value is either constant or dependent on the first.

Experimental design

Genomes were assembled with SPAdes47 using the PATRIC assembly service39. Contigs with ≤500 bp and ≤5-fold coverage were removed. The contigs were divided into overlapping 10-mer oligonucleotide k-mers, sorted in lexicographical order and counted using the software package KMC248. K-mer counts and antibiotic names were used as input features into the model. Antibiotic names were one-hot encoded. In this study, we chose to use 10-mers as features, rather than a longer k-mer length, in order to reduce the size of the matrix. This enabled us to load the matrix into memory and perform cross validation on a machine with 1.5TB RAM. Shorter k-mers of size 4 and 8 were also evaluated, but were not chosen because they are more likely to be redundant within a genome and because they had lower initial accuracies. The dataset was then split into equal training, validation and testing sets. Each subset was created so that it contained the same number of MICs for a given antibiotic. The final distribution antibiotics used in the model is uniform, but the number of samples for a given MIC per antibiotic was not uniformly distributed.

The prediction of MICs can be cast as a regression problem or a multi-class classification problem with each MIC representing an individual class. We explored several popular ML algorithms including AdaBoost (Adaptive Boosting)49, bagging50, random forest51, extremely randomized trees52, and support vector machines53 from the scikit-learn python package54, as well as XGBoost (Extreme Gradient Boosting)33. Using default parameter choices, we attempted to cast the problem as both a classification and regression problem depending on the capabilities of the algorithm. Algorithms were compared based on accuracy and computational resources required. An XGBoost regression model was ultimately chosen for this study because it produced the highest default accuracies and had reasonable computation costs for the current data set.

To assess the sensitivity of the model with regard to the training data, we performed a ten-fold cross validation by taking all samples and randomly splitting them into 10 mutually exclusive sets. Each set was split such that all sets had an equal (or nearly equal) number of antibiotic-MIC combinations. Ten models were then generated using 8 of the sets for training, one for validation, and one for testing. In total, 10 models were created. The accuracy within ±1 two-fold dilution factor was computed along with the 95% confidence intervals for each model. This aligns with clinical practice and FDA device guidelines34.

Hyperparameter tuning

Important hyperparameters from XGBoost were then selected and tuned using a 2k factorial design55 and a grid search, respectively. The model was tested for stability using a 10-fold cross validation. The 2k factorial design was used to tune 3 XGBoost parameters: maximum tree depth, column subsampling and row subsampling. The maximum tree depth parameter limits the maximum height of a decision tree while creating the ensemble. A value that is too high tends to overfit data whereas a value that is too low tends to underfit data33. A low value of 1 and a high value of 6 (6 is the default) were used to evaluate the impact of tree depth on the accuracy of the model. The column subsampling parameter limits the number of features that are chosen for training each tree in the ensemble. For example, if 0.5 is chosen, 50% of the features (k-mers) will be randomly chosen to train the first tree, then a different 50% for the second, and so on. A low value of 0.25 and a high value of 1 (1 is the default and maximum value allowed for XGBoost) were chosen. The row subsampling parameter limits the number of samples that are chosen for training for each tree in the ensemble. For example, if 0.5 is chosen, 50% of the samples (MIC tests) will be randomly chosen to train the first tree, then a different 50% for the second, etc. A low value of 0.25 and a high value of 1 (1 is the default and maximum value allowed for XGBoost) were chosen for evaluation.

The accuracy of the XGBoost model was evaluated in two ways. First, the accuracy of each individual MIC prediction over the test set was assessed within ±1 two-fold dilution factor. Secondly the coefficient of determination, or R2, was also used as a metric during the hyperparameter tuning.

A 2k factorial design assumes that as a parameter is increased or decreased, the metric that is being tested will either increase or decrease. Additionally, it only gives an idea of the parameters that are important and not the optimal value for a given parameter. Since the accuracy is not known to always increase or decrease as the maximum tree depth value goes up33, this parameter needed to be tested systematically. A grid search56 was designed based on the results of the 2k factorial design to deal with both issues. Since learning rate is known to have a relationship with maximum tree depth, we systematically varied the maximum tree depth and the learning rate, in tandem, in a grid search experiment with respect to accuracy33.

The results of the 2k factorial design showed that when tuning the model, a higher maximum tree depth was preferable. Different values of row and column subsampling did not cause variance in the accuracy of the model, though a larger value was deemed to produce a more accurate model. Both factors had already been maximized at 1 during the 2k factorial. Supplemental Figure S1a shows this in better detail.

The applied grid search showed that a lower value for the learning rate always returned a better solution. However, as the learning rate decreased, the training time increased. The gain in accuracy was deemed too small for the price (time) with a learning rate of 2−4. During testing we also found that a maximum depth between 3 and 4 was optimal for the Klebsiella data used to train the algorithm. Supplemental Figure S1b shows this in better detail.

Genome Annotation and AMR gene analyses

All genomes were annotated using the PATRIC annotation service in August of 201739. MLST (Multi Locus Sequencing Typing) was performed by the PATRIC annotation service and was based on Diancourt et al.57. AMR genes are defined as those having AMR-related functions (annotations) defined by PATRIC18 as well as all genes encoding proteins matching the CARD database15 with BLASTP identity scores >= 80%58. Non-AMR genes are defined as those that do not match the AMR set. A MIC prediction model built only from AMR genes was computed the same as the final whole genome-based model described above.

Correlation analyses were performed by first gathering the set of uniquely occurring functions from every K. pneumoniae genome. For each genome, the presence (+1) or absence (−1) of a function was compared to the MIC for an antibiotic. The Pearson correlation coefficient was computed for every function and antibiotic combination.

The phylogenetic tree in Supplemental Figure S3 was built by generating alignments for the seven MLST housekeeping genes, rpoB, gapA, mdh, pgi, phoE, infB, and tonB57 using MAFFT59, and then concatenating the them. The tree was generated using FastTree60. Genomes with identical sets of AMR genes are depicted by a single genome. The tree was rendered using iTOL61.

Data Availability

The model and software for predicting MICs in for K. pneumoniae genomes can be found at the PATRIC github page: https://github.com/PATRIC3/mic_prediction. All genomes were submitted to the SRA under bioprojects (PRJNA376414, PRJNA386693 and PRJNA396774). SRA run accession numbers for individual genomes can be found in Supplemental Table S3.

Accession codes

All genomes were submitted to the SRA under bioprojects (PRJNA376414, PRJNA386693 and PRJNA396774).