Main

Infections pose multiple challenges to healthcare systems and contribute to higher mortality, morbidity and escalating cost. Clinicians must regularly make rapid decisions on empirical antibiotic treatment of infectious syndromes without knowing the causative pathogen (or pathogens) or whether they are drug-susceptible or drug-resistant. In some cases, this is directly linked to poor outcomes; in the case of septic shock, the risk of death increases by an estimated 10% with every 60 min of delay in initiating effective treatment1.

The molecular epidemiology of infectious disease allows us to identify high-risk pathogens and to determine their patterns of spread on the basis of their genetics or (increasingly) genomics. Conventionally, such studies, including outbreak investigations and characterization of previously untested resistant strains, have been conducted in retrospect, but this has been changing with the availability of increasingly inexpensive sequencing technologies2,3. The wealth of data generated by genomics are promising, but introduces a challenge because while many features of a sequence are correlated with the phenotype of interest, few are causative.

Prescription, however, has long been informed by correlative features when causative ones are difficult to measure; for example, whether the same syndrome or pathogen occurring in other patients from the same clinical environment have responded to a particular antibiotic. This has also been observed at the genetic level as a result of genetic linkage between resistance elements and the rest of the genome. An example is given by the pneumococcus S. pneumoniae. The Centers for Disease Control and Prevention (CDC) has rated the threat level of drug-resistant pneumococcus as “serious”4. While resistance arises in pneumococci through a variety of mechanisms, approximately 90% of the variance in the minimal inhibitory concentration (MIC) for antibiotics of different classes can be explained by the loci determining the strain type5, even though none of these loci themselves causes resistance. Thus, in the overwhelming majority of cases, resistance and susceptibility can be inferred from coarse strain typing based on the population structure. This population structure could be leveraged to offer an alternative approach to detecting resistance whereby rather than detecting high-risk genes, we identify high-risk strains. While many approaches have been developed to identify whether a pathogen carries mutations or genes known to confer resistance6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21 (see ref. 22 for a comprehensive review), this is not equivalent to the clinical question of whether the pathogen is susceptible.

We present a method called genomic neighbour typing that can bring molecular epidemiology closer to the bedside and provide information relevant to treatment at a much earlier stage. Our method takes sequences generated from a sample in real time and matches them to a database of genomes to identify the closest relatives. Because closely related isolates usually have similar properties, this yields an informed heuristic regarding the phenotype of the pathogen. We demonstrate this by identifying drug-resistant and drug-susceptible clones for both S. pneumoniae (the pneumococcus) and N. gonorrhoeae (the gonococcus) within minutes after the start of sequencing using Oxford Nanopore Technology (ONT). The method has many potential applications depending on the specific pathogen and the quality of the databases available for matching, which we discuss together with its limitations.

Results

Resistance is associated with clones in S. pneumoniae and N. gonorrhoeae

To quantify the association of clones with antibiotic resistance of the pathogens S. pneumoniae and N. gonorrhoeae, we constructed optimal predictors of resistance from bacterial lineages and measured the associated area under the receiver operating characteristic curve (AUC) (Supplementary Fig. 1). First, we applied the method to 616 pneumococcal genomes from a carriage study of children in Massachusetts, USA23,24. Second, we used 1,102 clinical gonococcal isolates collected from 2000 to 2013 by the CDC’s Gonococcal Isolate Surveillance Project25. In both cases, the datasets comprised draft genome assemblies from Illumina HiSeq reads, resistance data and lineages inferred from sequence clusters computed using Bayesian analysis of population structure26. Lineages of S. pneumoniae were predictive for benzylpenicillin, ceftriaxone, trimethoprim–sulfamethoxazole, erythromycin and tetracycline resistance, with AUC values ranging from 0.90 to 0.97 (Supplementary Fig. 2), which is consistent with previous works5. For N. gonorrhoeae, ciprofloxacin, ceftriaxone and cefixime attained comparably large AUC values (from 0.93 to 0.98), whereas azithromycin demonstrated a lower association (AUC value of 0.80) (Supplementary Fig. 3), which is as previously observed25.

Rapid identification of nearest known relatives from sequencing reads

Based on the observed associations, we developed an approach that we term genomic neighbour typing to predict the phenotype from sequencing data. Genomic neighbour typing is a two-step algorithm that first compares a provided sample to a database of reference genomes with a known phylogeny and phenotype, and then predicts the probable phenotype of the sample based on the best hits (nearest neighbours) and their matching quality. We apply this here to the detection of drug resistance.

To implement genomic neighbour typing, we developed software called resistance-associated sequence elements (RASE) (Fig. 1). RASE takes a stream of nanopore reads and compares their k-mer content to references using a modified version of ProPhyle27,28, which is a metagenomic classifier that implements a fast and memory-efficient exact coloured de Bruijn graph data structure29 using a Burrows–Wheeler transform (BWT) index30 (Methods). Using ProPhyle RASE identifies which references are the most similar to the read and increases their similarity weights (this approach was inspired by but differs from other similar approaches such as Kraken31 and Kallisto32). These weights are cumulative scores that capture sample-to-reference similarity; they are set to zero at the beginning and are increased on the fly as sequencing proceeds according to the ‘information content’ of each read (Methods). Generally speaking, longer reads, such as those covering multiple accessory genes, tend to be specific and have high scores, whereas short reads or reads from the core genome are found in many lineages, tend to be nonspecific and have low scores. Weights serve as a proxy to the inverted genetic distance between the sample and the references.

Fig. 1: Overview of the RASE approach.
figure 1

In the first loading step, the precomputed RASE database is loaded into memory. As reads are generated, they are matched against the database using ProPhyle to calculate similarity to individual strains. The weights for the most similar strains (D and E in the figure) are increased proportionately to the number of matching k-mers. Finally, resistance is predicted from the obtained weights and from the resistance profiles of the database strains in the following manner. First, the best lineage is identified as the lineage of the best match (having the highest weight, E in the figure) and its score is calculated (lineage score (LS)). Second, for every antibiotic, a score quantifying the chance of susceptibility (susceptibility score (SS)) is calculated based on the most similar susceptible and resistant strains inside the identified lineage (B and E in the figure, respectively). The susceptibility or resistance to each of the antibiotics is predicted from their susceptibility scores by a comparison with a threshold (0.5 in the default setting) and reported together with the lineage, the best matching strain and the known properties of that strain (for example, the original antibiograms and the MLST-identified sequence type or serotype).

Resistance or susceptibility is predicted in two steps based on the computed weights, the population structure and the reference phenotypes. First, RASE identifies the lineage of the best-matching reference genome and estimates the confidence of lineage assignment by comparing the two best-matching lineages to compute a ‘lineage score’ (Methods). Subsequently, RASE identifies the best match within that lineage and predicts resistance from the nearest resistant and susceptible neighbours. A comparison of their weights provides a ‘susceptibility score’, which quantifies the risk of resistance (Methods). When the weights are too similar, the confidence of the call is considered low; this happens when resistant and susceptible strains are insufficiently genetically distinct, which is often the case for resistance that has recently emerged in evolutionary history (Methods). The ability to pinpoint the closest relatives in the database offers further resolution, even in the case where the resistance phenotype varies within a lineage.

Results of RASE are reported in real time as the best match in the database, together with susceptibility scores to the antibiotics being tested and a proportion of matching k-mers for quality control. As the run progresses, the scores fluctuate and eventually stabilize (an example is shown in Fig. 2).

Fig. 2: RASE obtains stable predictions of antibiotic resistance or susceptibility and lineage within minutes for an isolate of a pneumococcal 23F clone (SP06).
figure 2

a, The number of reads, LS, KS and SS for individual antibiotics as a function of time from the start of sequencing. In the top left plot, the times of stabilization are shown for the predicted lineage and susceptibility or resistance to all antibiotics (ceftriaxone (CRO), erythromycin (ERY), benzylpenicillin (PEN), trimethoprim–sulfamethoxazole (SXT) and tetracycline (TET)). Blue and red colours correspond to susceptibility (Susc.) and non-susceptibility (Non-susc.) calls, respectively. The dashed lines mark selected time points (1 min, 5 min and the end of sequencing (2,700 min)). bd, Similarity rank plots for selected time points: 1 min (b), 5 min (c) and the end of sequencing (2,700 min; d). The bars correspond to the 70 best-matching strains in the database and display the normalized weights, which serve as a proxy to the inverted genetic distance. They are arranged by rank and coloured according to the presence in the predicted, alternative or another lineage. The panels underneath each chart display the resistance profiles of the strains.

RASE databases for hundreds of S. pneumoniae and N. gonorrhoeae strains

We constructed RASE databases for S. pneumoniae and N. gonorrhoeae from the same data as described above (Methods). We assigned each pneumococcal and gonococcal strains to antibiotic-specific resistance categories using the European Committee on Antimicrobial Susceptibility Testing (EUCAST) breakpoints33 and the CDC Gonococcal Isolate Surveillance Project (GISP) breakpoints34, respectively (Methods). Where MIC data were unavailable or insufficiently specific, we estimated the probable resistance phenotype using ancestral state reconstruction (Methods; Supplementary Note 1; Extended Data Figs. 3 and 4). To verify the results, we tested eight pneumococcal isolates for which resistance phenotypes were not originally available (Methods), and the measured MICs by microdilution matched the expected phenotypes (shown in bold in Table 1). We constructed the ProPhyle k-mer indexes using a k-mer length that was optimized to minimize prediction delays (k = 18; Methods). The obtained pneumococcal and gonococcal RASE databases occupy 321 MB and 242 MB of RAM (×4.3 and ×12, respectively, compression rate) and can be further compressed for transmission to 47 MB and 32 MB (×29 and ×90, respectively, compression rate), respectively (Extended Data Fig. 1). This would allow RASE to be used on portable devices and its databases to be easily transmitted to the point of care over links with a limited bandwidth.

Table 1 Predicted phenotypes of S. pneumoniae for database isolates (a), non-database isolates (b) and metagenomes (c)

RASE identifies strains in the database within minutes

We first examined two pneumococcal isolates that were used to build the RASE database (100% sensitivity and 100% specificity, n = 10; Table 1a) to test whether RASE can function in ideal circumstances. In the case of a fully susceptible isolate (SP01), the correct lineage and sequenced strain were identified within 1 min and 7 min, respectively. A multidrug-resistant isolate (SP02) was predicted even faster, with both lineage and the sequenced strain correctly detected and stabilized within 1 min. To compare our method to gene-based approaches for detecting resistance22, we evaluated how long it took for resistance genes to be sequenced on the device. We observed that at least 25 min would be needed for single copies to be detected (Supplementary Note 2; Extended Data Fig. 2).

We then performed a similar evaluation with five gonococcal isolates from the database (57% sensitivity and 100% specificity, n = 20; Table 2a); however, here, we selected more complicated cases. First, we tested a susceptible isolate (GC01), for which RASE identified the correct strain and antibiogram within 3 min of sequencing. We then sequenced an isolate (GC02) with an uncommon mechanism of cephalosporin resistance that has recently emerged35. Under such circumstances, the resistant strain and its susceptible neighbours tend to be genetically very similar, which could confound our analysis. However, RASE was still able to identify the correct resistance phenotypes in 9 min, with the delay due to the difficulty in distinguishing between the close relatives, which was reflected by a susceptibility score in the low-confidence range (Methods). This was repeated in further experiments with the same isolate (GC03), which consistently reported low confidence in the resistance phenotype (Methods). This feature of our approach intends to alert the operators to indicate that further testing is necessary. In this experiment, RASE also resolved sample mislabelling (Supplementary Note 3). For a multidrug-resistant isolate (GC04), RASE predictions stabilized within 2 min, but incorrectly predicted susceptibility to ceftriaxone. A subsequent analysis revealed that the ceftriaxone MIC of the sample was equal to the CDC GISP breakpoint (0.125 μg ml–1), whereas the best match in the database had an MIC of 0.062 μg ml–1, within a single doubling dilution. We further found that RASE performed well even with extremely poor data and low-quality reads (for example, GC05; Supplementary Note 4). We also evaluated how genomic neighbour typing would perform if RASE used Kraken31 instead of ProPhyle27,28, and results are presented in Supplementary Note 5.

Table 2 Predicted phenotypes of N. gonorrhoeae for database isolates (a) and non-database isolates (b)

RASE identifies the closest relative of previously untested isolates

We next examined four pneumococcal isolates (89% sensitivity and 100% specificity, n = 20; Table 1b) for which the serotype and limited antibiogram and lineage data were known. We compared the following three characteristics of the sample to assess our performance: the serotype, the sequence type as determined by multilocus sequence typing (MLST) and the antibiograms (benzylpenicillin, ceftriaxone, trimethoprim–sulfamethoxazole, erythromycin and tetracycline resistance according to the EUCAST breakpoints33).

In all cases, the closest relative was identified within 5 min, even if the correct MLST sequence type was absent from the RASE database (an example is shown in Fig. 2). The two samples from the 23F clone (SP03 and SP06) were correctly called as being closely related to the Tennessee 23F-4 clone identified by the Pneumococcal Molecular Epidemiology Network, a clone strongly associated with macrolide resistance36. Consistent with this, the two samples were indeed resistant to erythromycin. However, the Tennessee 23F-4 clone was absent from the Massachusetts sample, with the best match being a comparatively distantly related strain that was resistant to penicillin, but susceptible to erythromycin. This illustrates the importance of a relevant database.

We evaluated RASE with 14 clinical gonococcal isolates from the RaDAR-Go project37 (Switzerland, 2015–2016) (93% sensitivity and 100% specificity, n = 56; Table 2b). These isolates were previously sequenced using nanopore and have full antibiograms available38. The 55 out of 56 correct calls indicate the strength of the genomic neighbour typing in a clinical setting. The only incorrect call (susceptibility to azithromycin for GC15) was marked as being low confidence on the basis of a poor susceptibility score. It should be noted that the ranges for what is considered low confidence could vary among settings and pathogens, but can be empirically determined and modified by users. In this case, our results suggest that informative results can be obtained even when using a database from one region (the United States) to predict phenotypes in another region (Europe). However, this may not be the case for all pathogens.

Inference is still informative but lower quality on highly divergent lineages

As noted above, an important precondition of genomic neighbour typing is a comprehensive and relevant reference database. To evaluate the performance of RASE in a setting with an incomplete database, we used the gonococcal WHO (World Health Organization) 2016 reference strain collection39. This includes a global collection of 14 diverse isolates from Europe, Asia, North America and Australia, collected over two decades and exhibiting phenotypes ranging from pan-susceptibility to multidrug resistance, and as such, the GISP database is expected to be non-representative in this study. The WHO strains are available from the National Collection of Type Cultures, and were previously sequenced using nanopore38 and genetically and phenotypically characterized39. Surprisingly, RASE correctly identified all MLST-identified sequence types represented in the database, and in seven cases it provided fully correct resistance phenotypes (67% sensitivity and 91% specificity, n = 56; Supplementary Table 1). In six out of seven cases where the complete resistance profile was not recovered, the closest relatives were correctly identified, but were genetically divergent from the query isolates (Supplementary Note 6). In one case, the errors were due to a misidentification of the closest relatives by ProPhyle. Therefore, most prediction errors could be addressed with a more comprehensive database.

RASE can identify resistance in pneumococcus from sputum metagenomic samples

Because bacterial culture and phenotyping via agar-dilution, Etest or disk diffusion introduces significant delays in resistance profiling, direct metagenomic sequencing of clinical samples would be preferable for point-of-care use. We therefore analysed metagenomic nanopore data from sputum samples obtained from patients suffering from lower respiratory tract infections40 (UK, 2017), selecting six samples from the study that were already known to contain S. pneumoniae (75% sensitivity and 100% specificity, n = 16; Table 1c).

One sample (SP10) contained DNA from multiple bacterial species. However, within 5 min, the sequence was identified to belong to the Swedish 15A-25 clone (ST63), which is also known to be associated with resistance phenotypes including macrolides and tetracyclines41. This sample was confirmed to be resistant to erythromycin and to clindamycin, tetracycline and oxacillin according to the EUCAST breakpoints33. The original report of the Swedish 15A-25 clone did not report resistance to penicillin antibiotics41, which has subsequently emerged in this lineage. However, our database correctly identified the risk of penicillin resistance in this sample. The metagenomes SP11 and SP12 contained an estimated >20% reads that matched S. pneumoniae, and their serotypes were identified to be 15A and 153, respectively. The susceptibility scores of the best matches were fully consistent with the resistance profiles found in the samples, with the exception of tetracycline resistance in SP12 due to an incomplete database (Supplementary Note 7). The last remaining samples, SP07–SP09, contained <5% unambiguously pneumococcal reads. Despite the low proportions, all predicted phenotypes were concordant with phenotypic tests, with the exception of SP07, which matched the same strain as SP12 (discussed above).

Discussion

This paper presents a method, which we termed genomic neighbour typing, to pinpoint the closest relatives of a query genome within a suitable database and then to infer the phenotypic properties of the query strain on the basis of the reported properties of its relatives. At present, the precise lineage of a bacterial pathogen is often determined after most important clinical decisions have been made. However, incorporating genomic neighbour typing at an earlier stage offers a way of leveraging bacterial population structure to gain information on resistance and susceptibility, and to guide antimicrobial therapy. The results from the metagenomic samples suggest that it is possible to apply this approach directly to clinical samples, and the success with both S. pneumoniae and N. gonorrhoeae indicates that it may have wide application.

The two pathogens studied here present contrasting features; the gonococcus is Gram-negative, harbours plasmids and has a strikingly uniform core genome, while the pneumococcus is Gram-positive, does not contain plasmids and is diverse in both its core and accessory genome. Both exhibit high rates of homologous recombination, which is expected to both spread chromosomally encoded resistance elements and to scramble the phylogenetic signal that we use to identify the lineages. Despite these differences and the large degree of recombination, our approach performs well with both pathogens, with some differences that indicate opportunities and limitations of the application.

The initial identification of the closest relative was consistently more robust in the pneumococcus than the gonococcus, which is a result of the former having more k-mers that are specific to an individual lineage, thus reflecting greater sequence diversity. As a consequence of the much lower diversity in gonococcus, when multiple closely related genomes are present in the database, RASE fluctuated between them, even though it correctly identified the region of the phylogeny. If these genomes vary in their resistance profile, this is properly reflected in an uncertain susceptibility score that indicates caution and further investigation are merited (for example, GC03).

As in all inference, the principle limitation of genomic neighbour typing is the representativeness of the database. While we have made use of relatively small samples from limited geographical areas to demonstrate proof of principle, in practice, there are multiple examples of large genomic databases generated by public health agencies, which could be combined with metadata on resistance for genomic neighbour typing. Such databases could, if necessary, be supplemented with local sampling. The relevant question for our approach therefore becomes whether the database contains a sufficiently high proportion of strains that will be encountered in the clinic and whether the resistance data are correct. Further work is required to determine the optimal structure and contents of databases for each application, but we emphasize the range of pathogens that appear to show promise for this approach. These include Escherichia coli, in which data on MLST-identified sequence types supplemented with epidemiological information consistently produced AUC values in excess of 0.90 for multiple antibiotics42, which suggests that there is great potential for neighbour typing to offer excellent resolution that is superior to MLST. However, we anticipate that genomic neighbour typing may be less suitable in cases when there is little within-species genomic variation (meaning it is hard to identify the nearest neighbour) or when resistance rapidly emerges on independent and diverse genomic backgrounds (meaning resistance is poorly correlated with those backgrounds) (Supplementary Note 8).

In cases where the infectious agent is unknown, this problem is significantly more challenging. k-mers from one pathogen can match others and produce false predictions, and so the choice of the correct database for prediction is key. Doing this will probably require a two-step solution in which the reads are first passed through a metagenomic classifier such as Centrifuge43 or MetaMaps44, which would be used to select the correct RASE database on which to make a resistance call.

Another limitation is the time required for sample preparation, which currently includes human DNA depletion, DNA isolation and library preparation, thereby taking a total of 4 h. However, this is a rapidly evolving area of technology and automated rapid library preparation kits are already in development45. Further advances in this space, in particular for the preparation of metagenomic samples, will be required to bring the method closer to the bedside.

We have demonstrated that effectively predicting resistance and susceptibility from sequencing data does not require knowledge of causal resistance determinants. In fact, neighbour typing only requires that the phenotype be sufficiently strongly associated with the population structure to make reliable predictions.

A key advantage of this approach is that it requires very little genomic data, thus it is not limited by high error rates or low coverage. In particular, it is not attempting to define the exact genome sequence of the sample being tested, but merely which lineage it comes from. As a result, even when a small fraction of k-mers in the read are informative in matching to the RASE database, this is sufficient to call the lineage. This has the benefit of being faster than gene detection by virtue of the informative k-mers being distributed throughout the genome, and so more likely to appear in the first few reads sequenced by the nanopore. Therefore, the approach we present here can be seen as an application of compressed sensing; that is, by measuring a sparse signal distributed broadly across our data, we can identify it with comparatively few error-tolerant measurements.

Genomic neighbour typing can also be used to detect other phenotypes that are sufficiently tightly linked to a phylogeny, such as virulence. Further applications may include rapid outbreak investigations, as the closely related isolates involved in the outbreak would all be predicted to match to the same strain in the RASE database. The approach also lends itself to enhanced surveillance, including in the field; the 2014–2016 Ebola outbreak in West Africa, for example, saw MinION devices used in remote locations without advanced healthcare facilities2. Finally, at present, empirical treatment decisions are made within successive ‘windows’46, in which increasing information becomes available, from initial Gram stain to full phenotypic characterization. The information from genomic neighbour typing is a natural complement to this process and has the potential to improve therapy long before it would become clinically apparent that the patient is not responding or before phenotypic susceptibility data were available. The combination of high-quality RASE databases with genomic neighbour typing offers an alternative forward-looking model for diagnostics and surveillance, with wide applications for the improved clinical management of infectious disease.

Methods

Overview

RASE uses rapid approximate k-mer-based matching of long sequencing reads against a database of strains to predict resistance via neighbour typing. The database contains a highly compressed exact k-mer index, a representation of the tree population structure and metadata such as lineage, resistance profiles, MLST-based sequence type and serotype. The RASE prediction pipeline iterates over reads from the nanopore sequencer and provides real-time predictions of lineage and resistance or susceptibility (Fig. 1).

Resistance profiles

For all antibiotics, RASE associates individual strains with a resistance category: ‘susceptible’ (S) or ‘non-susceptible’ (R). First, intervals of possible MIC values are extracted using regular expressions from the available textual antibiograms. For instance, ‘≥4’, ‘2’ and ‘NA’ would be translated to the intervals [4, + ∞), [2, 2] and [0, + ∞), respectively. Then the acquired intervals are compared to the antibiotic-specific breakpoints. If a given breakpoint is above or below the interval, susceptibility or non-susceptibility is reported, respectively. However, no category can be assigned at this step if the breakpoint lies within the extracted interval, an antibiogram is entirely missing, it is insufficiently specific or its parsing failed. Finally, missing categories are inferred using ancestral state reconstruction on the associated phylogenetic tree while maximizing parsimony (that is, minimizing the number of nodes switching its resistance category; Extended Data Figs. 3 and 4). When the solution for a node is not unique, non-susceptibility is assigned.

Genomic neighbour typing using nanopore sequencing

All reference strains in the database are associated with similarity weights that are set to zero at the start of the run. Each time a new read is read from the stream, k-mer-based matching is applied to identify the strains with the maximum number of matching k-mers (see below). Such strains are nearest neighbours to the read in the database according to the 1/(‘number of matched k-mers’) pseudodistance.

The weights of the nearest neighbours are then increased according to the information content of the read, calculated as the number of matched k-mers divided by the number of nearest neighbours. Reads that do not match (that is, 0 matching k-mers in the database) are not used in subsequent analysis. The computed matches are also used for updating the k-mer score (KS), which is the proportion of matched k-mers in all reads. The KS helps to assess whether a sample is truly matching the database and whether predicting resistance for the database species makes sense.

The obtained weights serve as a proxy to the inverted genetic distance and are used as a basis for the subsequent predictions of the lineage and the antibiotic resistance and susceptibility.

Predicting lineage

A lineage is predicted as the lineage of the best-matching reference strain; that is, the one with the largest weight. The quality of lineage prediction is further quantified using a lineage score (LS), calculated as LS = 2 f/(f + t) – 1, where f and t denote the weights of the best matches in the first (‘predicted’) and in the second best (‘alternative’) lineage, respectively. The values of the LS can range from 0.0 to 1.0 with the following special cases: LS = 1.0 means that all reads were perfectly matching the predicted lineage, whereas LS = 0.0 means that the predicted and alternative lineages were matched equally well.

The LS is used to measure how well a sample matches the identified lineage. If the LS is higher than a specified threshold (0.6 in default settings), the call is considered successful. If the score is lower than this, the sample cannot be securely assigned to a lineage, and this should draw the attention of the operators. Note that custom RASE databases may require a recalibration of the threshold.

Predicting resistance and susceptibility

Resistance or susceptibility is independently predicted for individual antibiotics based on the weights of the strains that belong to the predicted lineage. These are used to calculate a susceptibility score, which is further interpreted by comparing to predefined thresholds.

The susceptibility score (SS) is calculated as SS = s/(s + r), where s and r denote the weights of the best-matching susceptible and best-matching non-susceptible strains within the lineage, respectively. The values of the SS can range from 0.0 to 1.0 with the following special cases: SS = 0.0 and SS = 1.0 mean that all reads match only resistant or susceptible strains in the lineage, respectively. In practice, this happens only if the lineage is entirely associated with resistance or susceptibility. SS = 0.5 means that the best matching resistant and susceptible strains are matched equally well. As follows from the score definition, if SS is greater than 0.5, then the best-matching strain is susceptible, otherwise it is non-susceptible.

The SS is used for predicting resistance or susceptibility and for evaluating the confidence of the prediction. If the SS is greater than 0.5, susceptibility to the antibiotic is reported, otherwise non-susceptibility is reported. Hence, resistance is predicted as the resistance of the best match. However, when the SS is within the [0.4, 0.6] range, it is considered a low-confidence call, and as such it should draw the attention of the operators; this usually indicates that resistance or susceptibility has recently emerged in the evolutionary history, and genomic neighbour typing may not be able to confidently distinguish between these similar, but phenotypically distinct, strains. Note that the thresholds above might require a further recalibration based on the specific database, antibiotics and application of RASE.

S. pneumoniae RASE database

The S. pneumoniae RASE database was constructed using the EUCAST breakpoints33 for the following antibiotics (mg l–1): ceftriaxone (0.25), erythromycin (0.25), benzylpenicillin (0.06), trimethoprim-sulfamethoxazole (1.00) and tetracycline (1.00). While we used the above values in the present work, others may be readily defined and the database rapidly updated. This is especially useful in cases where breakpoints may vary depending on the site of infection (as is the case with pneumococcal meningitis and otitis media, for which lower MICs are considered to be resistant33).

The draft assemblies were downloaded from the SRA FTP server using the accession codes provided in table 1 in ref. 24. The phylogenetic tree was downloaded from DataDryad (accession no.: https://doi.org/10.5061/dryad.t55gq). The pneumococcal ProPhyle index was constructed with the k-mer size of k = 18.

The obtained S. pneumoniae RASE database, including the source code and data, is available at https://github.com/c2-d2/rase-db-spneumoniae-sparc.

N. gonorrhoeae RASE database

The N. gonorrhoeae RASE database was constructed with the CDC GISP breakpoints34 for the following antibiotics (mg l–1): azithromycin (2.0), cefixime (0.25), ciprofloxacin (1.0) and ceftriaxone (0.125). Before applying the breakpoints, azithromycin MICs for strains collected before 2005 were doubled to correct for the known inconsistencies of the phenotyping protocol, which were due to a change in formulation of the commercial media47.

The draft assemblies and the phylogenetic tree were downloaded from Zenodo (accession no.: https://doi.org/10.5281/zenodo.2618836). The following three prevalent types of plasmids48 were downloaded from GenBank, localized in the GISP database using BLAST49 and removed from the dataset: the cryptic plasmid pJD1 (GenBank accession no.: NC_001377.1), the beta-lactamase plasmid pJD4 (GenBank accession no.: NC_002098.1) and the conjugative plasmid pEP5289 (GenBank accession no.: GU479466.1). The gonococcal ProPhyle index was constructed with the k-mer size of k = 18.

The obtained N. gonorrhoeae RASE database, including the source code and data, is available at https://github.com/c2-d2/rase-db-ngonorrhoeae-gisp.

k-mer-based matching

Reads were matched against the RASE databases using the ProPhyle classifier27,28 (commit b55e026) and its ProPhex component50,51. The ProPhyle index stores k-mers of all strains in a highly compressed form, reducing the required memory footprint. In the database construction phase, the k-mers of the strains are first propagated along the phylogenetic tree and then greedily assembled to simplitigs52. The obtained simplitigs are then placed into a single text file, for which a BWT index is constructed30.

In the course of sequencing, each read is decomposed into overlapping k-mers. The k-mers are then searched in the BWT index by ProPhex using BWT search using a sliding window50. For every k-mer, the obtained matches are translated back on the tree. This provides a list of nodes whose descending leaves are the strains containing that k-mer. Finally, strains with maximum number of matched k-mers are identified for each read and reported in the SAM/BAM format53.

Optimizing the k-mer length

The k-mer length is the main parameter of the classification. First, the subword complexity function54 of pneumococcus was calculated using JellyFish55 (v.2.2.10) (Extended Data Fig. 5). Then, based on the characteristics of the function and the k-mer range supported by ProPhyle, the possible range of k was determined as previously described17,32. For these k-mer lengths, RASE indexes were constructed and their performance evaluated using the RASE prediction pipeline and selected experiments. While RASE showed robustness to k-mer length in terms of final predictions, prediction delays differed (Extended Data Fig. 6). Based on the obtained timing data, we set the k-mer length to k= 18.

Comparison to Kraken

For each RASE database, a fake NCBI taxonomy was generated from the database tree. Then, a library was built using Kraken31 (v.1.1.1, default parameters) from the same FASTA files as used for building the RASE database. Finally, Kraken databases were constructed for both k = 18 and k = 31.

The obtained Kraken databases were used to classify reads from individual experiments. The obtained Kraken assignments were subsequently converted using an ad hoc Python script to RASE-BAM (a subset of the BAM format53 used by RASE). Finally, RASE prediction was applied on the BAM files, with the use of the RASE database metadata, and the results compared with the results of the standard RASE with ProPhyle.

Measuring time

To determine how RASE works with nanopore data generated in real time, the timestamps of individual reads were extracted using regular expressions from the read names. These were then used for sorting the base-called nanopore reads by time. When the RASE pipeline was applied, the timestamps were used for expressing the predictions as a function of time. The times of ProPhyle assignments were also compared to the original timestamps to ensure that the prediction pipeline was not slower than sequencing.

When timestamps of sequencing reads were not available (that is, the gonococcal WHO data and clinical samples), RASE estimated the progress in time from the number of processed base pairs. This was done by dividing the cumulative base-pair count by the typical nanopore flow, which we had previously estimated from SP01 as 1.43 Mbps per s. However, such an estimated progress is indicative only, as it does not follow the true order of reads in the course of sequencing. As the nanopore signal quality tends to decrease over time (see the decrease of KS in Fig. 2 after t = 15 mins), the randomized read order provides results of lower quality than true real-time sequencing.

Lower time estimates on resistance gene detection

A complete genome of the multidrug-resistant SP02 isolate was assembled from the nanopore reads using CANU56 (v.1.5, default parameters). Before the assembly step, reads were filtered using SAMsift57 based on the matching quality with the pneumococcal RASE database: only reads at least 1,000-bp long with at least 10% 18-mers shared with some of the reference draft assemblies were used. The obtained assembly was further corrected using Pilon58 (v.1.2, default parameters) and Illumina reads from the same isolate (taxid 1QJAP in the SPARC dataset24) mapped to the nanopore assembly using BWA-MEM59 (v.0.7.17, default parameters) and sorted using SAMtools53.

The obtained assembly was searched for resistance-causing genes using the online CARD tool16 (as of 1 Aug 2018). All of the original nanopore reads were then mapped using Minimap2 (v.2.11, with ‘-x map-ont’)60 to the corrected assembly, and resistance genes in the reads were identified using BEDtools–intersect61 (v.2.27.1, with ‘-F 95’). Timestamps of the resistance-informative reads were extracted and associated with the genes. Only reads longer than 2 kbp were used in the analysis.

Evaluation of the N. gonorrhoeae WHO samples

To evaluate the predictions of the WHO samples, we inferred a phylogenetic tree from a dataset comprising both the GISP isolates and the WHO isolates. First, reads were downloaded for the GISP isolates (NCBI BioProject no.: PRJEB2999 and PRJEB7904) and for the WHO isolates F–P (NCBI BioProject no.: PRJEB4024). For the WHO isolates U–Z, read data were simulated from the finished de novo assemblies (NCBI BioProject no.: PRJEB14020) using Art-Illumina62 (v.2.5.1). Reads were mapped to the NCCP11945 reference genome (GenBank accession no.: CP001050.1) using BWA-MEM59 (v.0.7.17) and deduplicated using Picard63 (v.2.8.0). Pilon58 (v.1.16, with ‘–mindepth 10–minmq 20’) was used to call variants and further filtered to include only ‘pass’ sites, and sites where the alternative allele was supported with AF > 0.9. Gubbins64 (v.2.3.4) with RAxML65 (v.8.2.10) were run on the aligned pseudogenomes to generate the final recombination-corrected phylogeny (Supplementary Data 1).

The closest relatives identified by RASE were verified using the obtained tree. For every WHO isolate, the obtained RASE prediction was compared to the closest GISP isolate on the tree.

Library preparation

For isolates SP01–SP06, cultures were grown in Todd–Hewitt medium with 0.5% yeast extract (THY; Becton Dickinson) at 37 °C in 5% CO2 for 24 h. High-molecular-weight (>1 μg) genomic DNA was extracted and purified from cultures using a DNeasy Blood and Tissue kit (Qiagen). DNA concentration was measured using a Qubit fluorometer (Invitrogen). Library preparation was performed using an ONT 1D ligation sequencing kit (SQK LSK108).

For isolates SP07–SP12, library preparation was performed using an ONT Rapid Low-Input Barcoding kit (SQK-RLB001), with saponin-based host DNA depletion used for reducing the proportion of human reads. More details can be found in the original manuscript40.

For isolates GC01–GC05, cultures were grown on Chocolate-Agar media (that is, Difco GC base media containing 1% IsoVitaleX (Becton Dickinson) and 1% Remel Hemoglobin (Thermo Fisher Scientific)) at 37 °C in 5% CO2 for 20 h. For GC01–GC04, genomic DNA was extracted and purified from cultures using a PureLink Genomic DNA mini kit (Thermo Fisher Scientific). For GC05, DNA was extracted using the phenol–chloroform method66. Genomic DNA was extracted and purified from cultures using the PureLink Genomic DNA mini kit (Thermo Fisher Scientific). DNA concentration was measured using a Qubit fluorometer (Invitrogen). Library preparation was performed using the ONT 1D ligation sequencing kit (SQK-LSK109).

MinION sequencing

Sequencing was performed on a MinION MK1 device using R9.4/FLO-MIN106 flow cells according to the manufacturer’s instructions. For experiments SP01–SP06, base calling was performed using ONT Metrichor (v.1.6.11 (SP01), v.1.7.3 (SP02), v.1.7.14 (SP03–SP06)) simultaneously with sequencing, and all reads passing Metrichor quality check were used in the further analysis. For experiments SP07–SP12, the ONT MinKNOW software (v.1.4-1.13.1) was used to collect raw sequencing data, and ONT Albacore (v.1.2.2-2.1.10) was used for local base-calling of the raw data after sequencing runs were completed. For experiments GC01–GC05, ONT MinKNOW software was used to collect raw sequencing data, and ONT Albacore (v.2.3.4) was used for local base-calling.

Testing resistance phenotype

Additional retesting of SPARC isolates was done using microdilution. Organism suspensions were prepared from overnight growths on blood agar plates to the density of a 0.5 McFarland standard. This organism suspension was then diluted to provide a final inoculum of 105 to 106 colony-forming units per ml. Microdilution trays were prepared according to the NCCLS methodology with cation-adjusted Mueller–Hinton broth (Sigma-Aldrich) supplemented with 5% lysed horse blood (Hemostat Laboratories)67,68. Penicillin (TRC Canada) and chloramphenicol (USB) concentrations ranged from 0.016 to 16 μg ml–1. Erythromycin (Enzo Life Sciences), tetracycline (Sigma-Aldrich) and trimethoprim–sulfamethoxazole (MP Biomedicals) concentrations ranged from 0.0625 to 64 μg ml–1. Ceftriaxone (Sigma-Aldrich) concentrations ranged from 0.007 to 8 μg ml–1. The microdilution trays were incubated in ambient air at 35 °C for 24 h. The MICs were then visually read and breakpoints applied. A list of individual microdilution measurements and the obtained resistance categories is provided in Supplementary Table 2.

Resistance of streptococcus in the metagenomic samples (SP07–SP12) was determined by agar diffusion using the EUCAST methodology and breakpoints33. First, the inoculated agar plates were incubated at 37 °C overnight and then examined for growth, with the potential for reincubation up to 48 h. Then, the samples were screened to oxacillin: if the zone diameter r was >20 mm, the isolate was considered to be sensitive to benzylpenicillin, otherwise a full MIC measurement to benzylpenicillin was done. Finally, the isolate was screened for resistance to tetracycline (r ≥ 25 mm for sensitive, r < 22 mm for resistant) and erythromycin (r ≥ 22 mm for sensitive, r < 19 mm for resistant); when the isolate showed intermediate resistance, a full MIC measurement was done.

Results for all tested samples, isolates and metagenomes are summarized in Supplementary Table 3.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.