Rapid heuristic inference of antibiotic resistance and susceptibility by genomic neighbor typing

22 23 Surveillance of drug-resistant bacteria is essential for healthcare providers to deliver effective 24 empiric antibiotic therapy. However, traditional molecular epidemiology does not typically occur 25 on a timescale that could impact patient treatment and outcomes. Here we present a method 26 called ‘genomic neighbor typing’ for inferring the phenotype of a bacterial sample by identifying 27 its closest relatives in a database of genomes with metadata. We show that this technique can 28 infer antibiotic susceptibility and resistance for both S. pneumoniae and N. gonorrhoeae. We 29 implemented this with rapid k-mer matching, which, when used on Oxford Nanopore MinION 30 data, can run in real time. This resulted in determination of resistance within ten minutes 31 (sens/spec 91%/100% for S. pneumoniae and 81%/100% N. gonorrhoeae from isolates with a 32 representative database) of sequencing starting, and for clinical metagenomic sputum samples 33 (75%/100% for S. pneumoniae), within four hours of sample collection. This flexible approach has 34 wide application to pathogen surveillance and may be used to greatly accelerate appropriate 35 empirical antibiotic treatment. 36 2 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018;


Introduction 37 38
Infections pose multiple challenges to healthcare systems, contributing to higher mortality, 39 morbidity, and escalating cost. Clinicians must regularly make rapid decisions on empiric 40 antibiotic treatment of infectious syndromes without knowing the causative pathogen(s) or 41 whether they are drug-susceptible or drug-resistant. In some cases, this is directly linked to poor 42 outcomes; in the case of septic shock, the risk of death increases by an estimated 10% with every 43 60 minutes delay in initiating effective treatment 1 . 44

45
The molecular epidemiology of infectious disease allows us to identify high-risk pathogens and 46 determine their patterns of spread, on the basis of their genetics or (increasingly) genomics. 47  In the first, loading step, the precomputed RASE database is loaded into memory. As reads are generated, they are matched against the database using ProPhyle to calculate similarity to individual strains. The weights for the most similar strains (D and E in the figure) are increased proportionate to the number of matching k-mers. Finally, resistance is predicted from the obtained weights and the resistance profiles of the database strains as follows: First, the best lineage is identified as the lineage of the best match (having the highest weight, E in the figure) and its score is calculated (lineage score, LS). Second, for every antibiotic, a score quantifying the chance of susceptibility (susceptibility score, SS) is calculated, based on the most similar susceptible and resistant strains inside the identified lineage (B and E in the figure, respectively). The susceptibility or resistance to each of the antibiotics is predicted from their susceptibility scores by a comparison with a threshold (0.5 in the default setting), and reported together with the lineage, the best matching strain and that strain's known properties (e.g., the original antibiograms, MLST sequence type, or serotype). quality control. As the run progresses, the scores fluctuate and eventually stabilize (an example shown in Figure 2). 138 RASE databases for hundreds of S. pneumoniae and N. gonorrhoeae strains 140 141 We constructed RASE databases for S. pneumoniae and N. gonorrhoeae from the same data as 142 described above (Methods). We assigned each pneumococcal and gonococcal strains to an 143 antibiotic-specific resistance categories using the European Committee on Antimicrobial 144 Susceptibility Testing (EUCAST) breakpoints 33 and the CDC Gonococcal Isolate Surveillance 145 Project (GISP) breakpoints 34 , respectively (Methods). Where MIC data were unavailable or 146 insufficiently specific, we estimated the likely resistance phenotype using ancestral state 147 reconstruction (Methods, Supplementary Note 1). To verify the results, we tested eight 148 pneumococcal isolates for which resistance phenotypes were not originally available (Methods), 149 and the measured MICs by microdilution matched the expected phenotypes (shown in bold in 150 Table 1). We constructed the ProPhyle k-mer indexes with a k-mer length optimized to minimize 151 prediction delays (k=18, Methods). The obtained pneumococcal and gonococcal RASE databases   The bars correspond to 70 best matching strains in the database and display the normalized weights, which serve as a proxy to inverted genetic distance. They are arranged by rank and colored according to the presence in the predicted, alternative or another lineage. The bottom panels display the resistance profiles of the strains.

9
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  Table 1: Predicted phenotypes of S. pneumoniae for a) database isolates, b) non-database isolates, and c) metagenomes. The table displays actual and predicted resistance phenotypes (S = susceptible, R = nonsusceptible) for individual experiments, as well as information on match of the predicted MLST sequence type and clonal complex. Resistance categories in bold were inferred using ancestral reconstruction and were also confirmed using phenotypic testing (see Methods and Supplementary Table 3). Metagenomic samples are sorted by the estimated fraction of S. pneumoniae reads.

10
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

167
We then performed a similar evaluation with five gonococcal isolates from the database 168 (Table 2a, sens/spec 57%/100%, n=20); however, here we selected more complicated cases. 169 First, we tested a susceptible isolate (GC01), for which RASE identified the correct strain and 170 antibiogram within 3 minutes of sequencing. We then sequenced an isolate with a novel and 171 uncommon mechanism of cephalosporin resistance that has emerged recently (GC02) 35 . Under 172 such circumstances, the resistant strain and its susceptible neighbors tend to be genetically very 173 similar, which could confound our analysis. However, RASE was still able to identify the correct 174 resistance phenotypes in 9 minutes, with the delay being due to difficulty distinguishing between 175 the close relatives, reflected also by a susceptibility score in the low-confidence range (Methods). 176 This was repeated in further experiments with the same isolate (GC03) which consistently 177 reported low confidence in resistance phenotype (Methods), which is a feature of our approach 178 intended to draw operators' attention and indicate that further testing is necessary. In this 179 experiment, RASE also resolved sample mislabeling (Supplementary Note 3). For a multidrug-180 resistant isolate (GC04) RASE predictions stabilized within 2 minutes but incorrectly predicted 181 susceptibility to ceftriaxone. A subsequent analysis revealed that the ceftriaxone MIC of the 182 sample was equal to the CDC GISP breakpoint (0.125 μg/mL), whereas the best match in the 183 database had an MIC of 0.062 μg/mL, within a single doubling dilution. We further found that 184 RASE performed well even with extremely poor data and low-quality reads (GC05, 185 which the serotype and limited antibiogram and lineage data were known. We compared three 191 characteristics of the sample to assess our performance: the serotype, the MLST sequence type, 192 11 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  12 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; erythromycin, and tetracycline resistance according to the EUCAST breakpoints 33 ). 194 195 In all cases, the closest relative was identified within 5 minutes, even if the correct MLST 196 sequence type was absent from the RASE database (an example shown in Figure 2 (susceptibility to azithromycin in GC15) was marked as being low-confidence call on the basis of a 209 poor susceptibility score. It should be noted that the ranges for what is considered low-210 confidence could vary among settings and pathogens but can be empirically determined and 211 modified by users. In this case our results suggest that informative results can be obtained even 212 using a database from one region (the US) to predict phenotype in another (Europe). However, 213 this may not be the case for all pathogens. 214 215

Phenotyping is still informative but lower quality on highly divergent lineages 216
As noted above, an important precondition of genomic neighbor typing is a comprehensive and 217 relevant reference database. To evaluate RASE performance in a setting with an incomplete 218 database, we used the gonococcal WHO 2016 reference strain collection 39 . This includes a global 219 collection of 14 diverse isolates from Europe, Asia, North America, and Australia, collected over 220 and their serotypes were identified to be 15A and 3, respectively. The susceptibility scores of the 249 best matches were fully consistent with the resistance profiles found in the samples, with the 250 exception of tetracycline resistance in SP12 due to an incomplete database (Supplementary 251 Note 7). The last remaining samples, SP07-SP09, contained less than 5% unambiguously 252 pneumococcal reads. Despite the low proportions, all predicted phenotypes were concordant 253 with phenotypic tests, with the exception of SP07, which matched the same strain as SP12 254 (discussed above). 255 Discussion 256   257 This paper presents a method that we term genomic neighbor typing to pinpoint the closest 258 relatives of a query genome within a suitable database and then to infer the phenotypic 259 properties of the query strain on the basis of the reported properties of its relatives. At present, 260 the precise lineage of a bacterial pathogen is often determined after most important clinical 261 decisions have been made. However, incorporating genomic neighbor typing at an earlier stage 262 offers a way of leveraging bacterial population structure to gain information on resistance and 263 susceptibility, and inform antimicrobial therapy. The results from the metagenomic samples 264 suggest that it is possible to apply this approach directly to clinical samples, and the success with 265 both S. pneumoniae and N. gonorrhoeae indicates that it may have wide application. 266

267
The two pathogens studied here present contrasting features; the gonococcus is Gram-negative, 268 harbors plasmids, and has a strikingly uniform core genome, while the pneumococcus is Gram-269 positive, does not contain plasmids and is diverse in both its core and accessory genome. Both 270 exhibit high rates of homologous recombination, which is expected to both spread 271 chromosomally encoded resistance elements and to scramble the phylogenetic signal that we 272 use to identify the lineages. Despite these differences and the large degree of recombination, our 273 approach performs well with both pathogens, with some differences that indicate opportunities 274 and limitations for the application. 275

276
The initial identification of the closest relative is consistently more robust in the pneumococcus 277 than the gonococcus, as a result of the former having more k-mers that are specific to an 278 individual lineage, reflecting greater sequence diversity. As a consequence of the much lower 279 diversity in gonococcus, when multiple closely related genomes are present in the database, 280 RASE fluctuates between them, even though it correctly identifies the region of the phylogeny. If 281 these genomes vary in their resistance profile, this is properly reflected in an uncertain 282 susceptibility score indicating that caution and further investigation are merited (e.g., GC03). 283 As in all inference, the principle limitation of genomic neighbor typing is the representativeness 285 of the database. While we have made use of relatively small samples from limited geographic 286 areas to demonstrate proof of principle, in practice there are multiple examples of large genome 287 databases generated by public health agencies, which could be combined with metadata on 288 resistance for genomic neighbor typing. Such databases could, if necessary, be supplemented 289 with local sampling. The relevant question for our approach therefore becomes whether the 290 database contains a sufficiently high proportion of strains that will be encountered in the clinic 291 and whether the resistance data are correct. Further work is required to determine the optimal 292 structure and contents of databases for each application, but we emphasize the range of 293 pathogens which appear to show promise for this approach. These include E. coli, in which data 294 on MLST type supplemented with epidemiologic information can consistently produce AUCs in 295 excess of 0.90 for multiple antibiotics 42 , suggesting great potential for neighbor typing to offer 296 excellent resolution superior to MLST. However, genomic neighbor typing may be less suitable in 297 the case where there is little genomic variation (e.g., Mycobacterium tuberculosis) or when 298 resistance emerges rapidly on independent and diverse genomic backgrounds (e.g., 299 Pseudomonas aeruginosa or resistance elements on highly promiscuous plasmids). 300

301
In the case where the infectious agent is unknown this problem is significantly more challenging. 302 K-mers from one pathogen can match others and produce false predictions, and so choice of the 303 correct database for prediction is key. Doing this will likely require a two-step solution in which 304 the reads are first passed through a metagenomic classifier such as Centrifuge 43 or MetaMaps 44 , 305 which would be used to select the correct RASE database on which to make a resistance call. 306 307 Another limitation is the time required for sample preparation, which currently includes human 308 DNA depletion, DNA isolation, and library preparation, taking a total of 4 hours. This is a rapidly 309 evolving area of technology and automated rapid library preparation kits are already in 310 17 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; samples, will be required to bring the method closer to the bedside. 312 313 We have demonstrated that effectively predicting resistance and susceptibility from sequencing 314 data does not require knowledge of causal resistance determinants. In fact, neighbor typing only 315 requires that the phenotype be sufficiently strongly associated with the population structure to 316 make reliable predictions. 317

318
A key advantage of this approach is that it requires very little genomic data, thus it is not limited 319 by high error rates or low coverage. In particular, it is not attempting to define the exact genome 320 sequence of the sample being tested, but merely which lineage it comes from. As a result, even 321 when a small fraction of k-mers in the read are informative in matching to the RASE database, 322 this is sufficient to call the lineage. This has the benefit of being faster than gene detection by 323 virtue of the informative k-mers being distributed throughout the genome, and so more likely to 324 appear in the first few reads sequenced by the nanopore. Therefore, the approach we present 325 here can be seen as an application of compressed sensing: by measuring a sparse signal 326 distributed broadly across our data we can identify it with comparatively few error-tolerant 327 measurements. 328 329 Genomic neighbor typing can also be used to detect other phenotypes that are sufficiently tightly 330 linked to a phylogeny, such as virulence. Further applications may include rapid outbreak 331 investigations, as the closely related isolates involved in the outbreak would all be predicted to 332 match to the same strain in the RASE database. The approach also lends itself to enhanced 333 surveillance, including in the field; the 2014-2016 Ebola outbreak in West Africa, for example, 334 saw MinION devices used in remote locations without advanced healthcare facilities 2 . Finally, at 335 present empiric treatment decisions are made within successive 'windows' 46 , in which increasing 336 information becomes available, from initial Gram stain to full phenotypic characterization. The 337 information from genomic neighbor typing is a natural complement to this process with the 338 potential to improve therapy long before it would become clinically apparent that the patient is 339 not responding or before phenotypic susceptibility data were available. The combination of high-340 quality RASE databases with genomic neighbor typing offers an alternative forward-looking 341 model for diagnostics and surveillance, with wide applications for the improved clinical 342 management of infectious disease. 343 19 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; Overview 346 347 RASE uses rapid approximate k-mer-based matching of long sequencing reads against a database 348 of strains to predict resistance via neighbor typing. The database contains a highly compressed 349 exact k-mer index, a representation of the tree population structure, and metadata such as compared to the antibiotic-specific breakpoints (see below; Supplementary Figures 3 and 4). If a 361 given breakpoint is above or below the interval, susceptibility or non-susceptibility is reported, 362 respectively. However, no category can be assigned at this step if the breakpoint lies within the 363 extracted interval, an antibiogram is entirely missing, it is insufficiently specific, or its parsing 364 failed. Finally, missing categories are inferred using ancestral state reconstruction on the 365 associated phylogenetic tree while maximizing parsimony (i.e., minimizing the number of nodes 366 switching its resistance category; Supplementary Figures 5 and 6). When the solution for a node 367 is not unique, non-susceptibility is assigned. 368 369 20 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; All reference strains in the database are associated with similarity weights that are set to zero at 372 the start of the run. Each time a new read is read from the stream, k-mer-based matching is 373 applied to identify the strains with the maximum number of matching k-mers (see below). Such which is the proportion of matched k-mers in all reads. KS helps to assess whether a sample is 382 truly matching the database and predicting resistance for the database species makes sense. 383

384
The obtained weights serve as a proxy to inverted genetic distance and are used as a basis for the 385 subsequent predictions of the lineage, and antibiotic resistance and susceptibility. 386

Predicting lineage 388 389
A lineage is predicted as the lineage of the best matching reference strain, i.e., the one with the 390 largest weight. The quality of lineage prediction is further quantified using a lineage score (LS), 391 calculated as LS=2f/(f+t)-1, where f and t denote the weights of the best matches in the first 392 ('predicted') and in the second best ('alternative') lineage, respectively. The values of LS can 393 range from 0.0 to 1.0 with the following special cases: LS=1.0 means that all reads were perfectly 394 matching the predicted lineage, whereas LS=0.0 means that the predicted and alternative 395 lineages were matched equally well. 396 397 21 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; specified threshold (0.6 in default settings), the call is considered successful. If the score is lower 399 than this, the sample cannot be securely assigned to a lineage, and this should draw operators' 400 attention. Note that custom RASE databases may require a re-calibration of the threshold. from the score definition, if SS is greater than 0.5, then the best matching strain is susceptible, 415 otherwise it is non-susceptible. 416 417 SS is used for predicting resistance or susceptibility as well as for evaluating the prediction's 418 confidence. If SS is greater than 0.5, susceptibility to the antibiotic is reported, non-susceptibility 419 otherwise. Hence resistance is predicted as the resistance of the best match. However, when SS 420 is within the [0.4, 0.6] range, it is considered a low-confidence call, and as such it should draw 421 operators' attention; this usually indicates that resistance or susceptibility emerged recently in 422 the evolutionary history and genomic neighbor typing may not be able to confidently distinguish 423 between these similar, but phenotypically distinct, strains. Note that the thresholds above might 424 22 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  The draft assemblies were downloaded from the SRA FTP server using the accession codes 438 provided in Table 1  CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; function 53 of pneumococcus was calculated using JellyFish 54 (version 2.2.10) (Supplementary 482 Figure 7). Then, based on the characteristics of the function and the k-mer range supported by 483 ProPhyle, the possible range of k was determined as in [17,32]. For these k-mer lengths, RASE 484 indexes were constructed and their performance evaluated using the RASE prediction pipeline 485 and selected experiments. While RASE showed robustness to k-mer length in terms of final 486 predictions, prediction delays differed (Supplementary Figure 8) To determine how RASE works with nanopore data generated in real time, the timestamps of 505 individual reads extracted were using regular expressions from the read names. These were then 506 used for sorting the base-called nanopore reads by time. When the RASE pipeline was applied, 507 the timestamps were used for expressing the predictions as a function of time. The times of 508 25 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; prediction pipeline was not slower than sequencing. 510 511 When timestamps of sequencing reads were not available (i.e., the gonococcal WHO and clinical 512 samples), RASE estimated the progress in time from the number of processed base pairs. This 513 was done by dividing the cumulative base-pair count by the typical nanopore flow, which we had 514 previously estimated from SP01 as 1.43Mbps per second. However, such an estimated progress 515 is indicative only, as it does not follow the true order of reads in the course of sequencing. As the 516 nanopore signal quality tends to decrease over time (see the decrease of KS in Figure 2 after 517 t=15mins), the randomized read order provides results of lower quality than true real-time 518 sequencing. 519 520

Lower time estimates on resistance gene detection 521 522
A complete genome of the multidrug-resistant SP02 isolate was assembled from the nanopore 523 reads using the CANU 55 (version 1.5, with default parameters). Prior to the assembly step, reads 524 were filtered using SAMsift 56 based on the matching quality with the pneumococcal RASE 525 database: only reads at least 1000bp long with at least 10% 18-mers shared with some of the 526 reference draft assemblies were used. The obtained assembly was further corrected by Pilon 57 527 (version 1.2, default parameters) using Illumina reads from the same isolate (taxid '1QJAP' in the 528 SPARC dataset 24 ) mapped to the nanopore assembly using BWA-MEM 58 (version 0.7.17, with the 529 default parameters) and sorted using SAMtools 52 . 530

531
The obtained assembly was searched for resistance-causing genes using the online CARD tool 8 (as 532 of 2018/08/01). All of the original nanopore reads were then mapped using Minimap2 59 (version 533 2.11, with '-x map-ont') to the corrected assembly and resistance genes in the reads identified 534 using BEDtools-intersect 60 (version 2.27.1, with '-F 95'). Timestamps of the resistance-535 26 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; were used in the analysis. The closest relatives identified by RASE were verified using the obtained tree. For every WHO 553 isolate, the obtained RASE prediction was compared to the closest GISP isolate on the tree. 554 555 Library preparation 556 557 For isolates SP01-SP06, cultures were grown in Todd-Hewitt medium with 0.5% yeast extract 558 (THY; Becton Dickinson and Company, Sparks, MD) at 37°C in 5% CO2 for 24 hrs. High-molecular-559 weight (>1 μg) genomic DNA was extracted and purified from cultures using DNeasy Blood and 560 Tissue kit (QIAGEN, Valencia CA). DNA concentration was measured using Qubit fluorometer 561 (Invitrogen, Grand Island NY). Library preparation was performed using the Oxford Nanopore 562 Technologies 1D ligation sequencing kit SQK LSK108. 563 27 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; Additional retesting of SPARC isolates was done using microdilution. Organism suspensions were 593 prepared from overnight growth on blood agar plates to the density of a 0.5 McFarland standard. 594 This organism suspension was then diluted to provide a final inoculum of 105 to 106 CFU/mL. 595 Microdilution trays were prepared according to the NCCLS methodology with cation-adjusted 596 Mueller-Hinton broth (Sigma-Aldrich) supplemented with 5% lysed horse blood (Hemostat 597 CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  Joshua Metlay for providing the test isolates for experiments SP03-SP06, which were collected as 644 part of a population-wide surveillance study done in the Philadelphia region, supported by NIH 645 30 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; simulator. Bioinformatics 28, 593-594 (2012). CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. Supplementary Note 2. We evaluated how long it took for resistance genes to be reliably 849 detected in nanopore reads. For SP02 we observed that at least 25 minutes were needed to 850 detect resistance (i.e., to observe all resistance genes at least once), assuming that the genes in 851 question can be unambiguously identified in nanopore data despite the high per-base error rate, 852 and that the presence of the loci is directly linked to the resistance phenotype (Supplementary 853 Figure 2). If this is not the case (for example if resistance is conferred by a single SNP, requiring 854 coverage with multiple reads), further delays would be expected. Thus, genomic neighbor typing 855 can offer a time advantage compared to methods based on identifying the presence of resistance 856 genes even in a sample of DNA from a purified isolate as opposed to a metagenome, potentially 857 allowing for more rapid changes to antimicrobial therapy. 858 859 Supplementary Note 3. We originally attempted to evaluate a multidrug-resistant isolate 860 (GCGS0938 in the GISP collection); however, RASE placed it onto a distant part of the phylogeny 861 and identified it as GCGS0324 or GCGS1095. A subsequent analysis revealed that the sample was 862 mislabeled and that it was indeed GCGS1095, i.e., the same strain as in GC02, although from a 863 different stock. 864 39 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/403204 doi: bioRxiv preprint first posted online Aug. 29, 2018; Supplementary Note 4. We evaluated how RASE performs in extremely unfavorable sequencing 866 conditions; we sequenced an isolate (GC05) from the GISP collection with the use of an expired 867 flow cell (purchased in October 2017, expired in December 2017, and the sequencing done in 868 April 2018). In consequence, we obtained only 3.5 Mbps of low-quality reads (only 7% of 869 matching k-mers compared to 20% obtained in the other isolates) (GC05 in Table 2a). An 870 experiment with such a low yield would normally be discarded; despite that RASE provided 871 correct and stabilized predictions (once the first long read was obtained from the sequencer at 872 t=21mins). 873 874 Supplementary Note 5. We evaluated how genomic neighbor typing would perform if RASE used 875 Kraken 31 instead of ProPhyle 28 for the read-to-strain comparison (the matching step in Figure 1). 876 Both tools use k-mer-based matching to assign sequencing reads to a phylogenetic tree, but with 877 several key differences. Whereas Kraken stores for each k-mer the lowest common ancestor 878 (LCA) only, assigns reads to the LCA of the best hits and ignores low-complexity k-mers, ProPhyle 879 indexes all k-mers using an exact index and can thus resolve ambiguities both on the level of 880 individual k-mers and read assignments. 881

882
To compare both tools, we implemented a RASE wrapper for Kraken (Methods) and applied that 883 to the same read and database data. We then compared the final inference results obtained with 884 Kraken (with k=18 and k=31) with the results obtained from the standard RASE pipeline 885 (Supplementary File 2). 886 887 For S. pneumoniae and N. gonorrhoeae, the number of inference errors increased more than 1.5x 888 and 1.7x, respectively (in case of both k-mer sizes). In the case of N. gonorrhoeae, RASE-Kraken 889 showed large systematic biases in neighbor typing, assigning 16 (k=18) and 18 (k=31) out of the 890 gonococcal 33 samples to a single strain (GCGS1028), whereas RASE-ProPhyle identified this 891 strain only once. While in the WHO dataset the numbers of RASE-ProPhyle and RASE-Kraken 892 40 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. We next sought to evaluate the resistance predictions. In 7 cases (F, K, N, O, P, U, W), the 909 antibiograms were identified fully correctly; in 4 (G, V, X, Z) and 3 cases (L, M, Y) one and two 910 mistakes were made, respectively. To explain these discrepancies, we inferred a recombination-911 corrected phylogenetic tree comprising the GISP database isolates as well as the WHO samples 912 (Supplementary File 1). With the exception of G and Y, the WHO isolates and their respective 913 RASE-predicted best matches were the closest GISP isolates, indicative of accurate matching by 914 RASE. While branch lengths of L, M and V on the tree reveal that the corresponding parts of the 915 phylogeny are not well sampled in the database, the X, Y, and Z samples emerged from lineages 916 that are well-represented but have acquired an atypically high level of cephalosporin resistance. 917 Whereas X and Z acquired a novel resistance-conferring mosaic penA allele 73 , Y acquired a novel 918 active site mutation in the context of a pre-existing mosaic penA allele 74 . While both of these 919 adaptations resulted in high-level resistance, these mutations also appear to incur fitness costs in 920 41 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. either to the sequence type ST180 or ST3798. This is consistent with identification as serotype 3, 929 because this clonal complex contains the great majority of strains with this capsule type, which 930 historically has not been associated with resistance 77 . However, improved sampling and study of 931 this lineage has recently found highly divergent subclades that are associated with resistance. 932 These lineages were previously rare, and thus were less likely to be included in our database, but 933 now are increasing in frequency 78 . In this case, ST3798 is found to be in clade 1B, which is 934 notable for exhibiting sporadic tetracycline resistance. Again, the failure to match to this is a 935 result of the original database not containing a suitable example for comparison. 936 937 938 42 . CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

43
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

44
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

47
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

49
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

50
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

51
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Supplementary Table 4: Metadata for all strains included in the a) S. pneumoniae and b) N. gonorrhoeae
RASE database. Each record contains the strain's taxid, lineage, serotype (for S. pneumoniae only), MLST sequence type, order in the phylogenetic tree, and three fields related to resistance for every antibiotics: the '_mic', '_int', '_cat' fields contain the original published MIC information (possibly corrected after retesting), the extracted MIC interval, and the resulting category after ancestral state reconstruction (S = susceptible, R = non-susceptible, s = unknown but reconstructed susceptible, r = unknown but reconstructed non-susceptible), respectively.

52
. CC-BY-NC 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.