Conventional targeted sequencing methods eliminate many of the benefits of nanopore sequencing, such as the ability to accurately detect structural variants or epigenetic modifications. The ReadUntil method allows nanopore devices to selectively eject reads from pores in real time, which could enable purely computational targeted sequencing. However, this requires rapid identification of on-target reads, whereas most mapping methods require computationally intensive basecalling. We present UNCALLED (https://github.com/skovaka/UNCALLED), an open source mapper that rapidly matches streaming nanopore current signals to a reference sequence. UNCALLED probabilistically considers k-mers that could be represented by the signal and then prunes the candidates based on the reference encoded within a Ferragina–Manzini index. We used UNCALLED to deplete sequencing of known bacterial genomes within a metagenomics community, enriching the remaining species 4.46-fold. UNCALLED also enriched 148 human genes associated with hereditary cancers to 29.6× coverage using one MinION flowcell, enabling accurate detection of single-nucleotide polymorphisms, insertions and deletions, structural variants and methylation in these genes.
All sequencing runs are available as an NCBI BioProject under accession no. PRJNA604456.
The source code for UNCALLED is available on GitHub at https://github.com/skovaka/UNCALLED.
We thank T. Mun for his contributions on an early prototype of UNCALLED and T. Gilpatrick for providing extracted GM12878 DNA used in the cancer gene enrichment experiments. This work was funded, in part, by the US National Science Foundation (grant no. DBI-1350041 to M.C.S.) and US National Institutes of Health (grant no. R01HG009190 to W.T.).
W.T. holds two patents currently licensed by Oxford Nanopore Technologies Limited. M.C.S. and W.T. have received travel funding from Oxford Nanopore Technologies Limited.
(top) FM index alignment of a standard DNA sequence, where the size of each box represents the number of possible locations. (middle) FM alignment of a sequence where every position could be one of two bases. Base ambiguity is analogous to the k-mers we consider for every event. (bottom) Same as middle but alignments starting from all positions are found by filling in the gaps between ranges from previous alignments.
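The idea in this caption can be illustrated with a small sketch of FM index backward search in which each pattern position allows a set of bases. This is a toy illustration written for this caption, not UNCALLED's implementation: the function names (`build_fm_index`, `extend`, `search_ambiguous`) and the example reference are hypothetical, and the index is built naively by sorting suffixes. The length of each returned range corresponds to the number of possible locations (the box size in the figure).

```python
def build_fm_index(text):
    """Build a toy FM index: suffix array, C table and occurrence counts."""
    text += "$"  # sentinel that sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in text that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, b in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return sa, C, occ


def extend(rng, base, C, occ):
    """One backward-search step: prepend base to the string matched by rng."""
    lo, hi = rng
    if base not in C:
        return None
    new_lo = C[base] + occ[base][lo]
    new_hi = C[base] + occ[base][hi]
    return (new_lo, new_hi) if new_lo < new_hi else None


def search_ambiguous(candidates, sa, C, occ):
    """candidates[i] is the set of bases allowed at pattern position i.

    Returns {matched string: sorted reference positions}. Branches whose
    range becomes empty are pruned as soon as they fail to extend.
    """
    frontier = {"": (0, len(sa))}  # half-open range covering every suffix
    for allowed in reversed(candidates):  # backward search: last base first
        next_frontier = {}
        for suffix, rng in frontier.items():
            for base in sorted(allowed):
                new_rng = extend(rng, base, C, occ)
                if new_rng is not None:
                    next_frontier[base + suffix] = new_rng
        frontier = next_frontier
    return {s: sorted(sa[i] for i in range(lo, hi))
            for s, (lo, hi) in frontier.items()}


# Hypothetical reference, chosen only for illustration.
sa, C, occ = build_fm_index("ACGTACGA")
exact = search_ambiguous([{"A"}, {"C"}, {"G"}], sa, C, occ)
ambiguous = search_ambiguous([{"G"}, {"A", "T"}], sa, C, occ)
```

Here `exact` contains the two locations of `ACG`, while `ambiguous` keeps one range per surviving candidate string (`GA` and `GT`), which is the branching behavior the middle panel describes.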
a, Relationship between natural log probability thresholds (x-axis), the mean number of k-mers that match above each threshold per event (blue) and the fraction of events that match their correct k-mer above each threshold (red). The values for r9.4 chemistry are shown here. b, The FM index range lengths assigned to different probability thresholds for the E. coli reference. This function varies depending on the reference used.
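The trade-off in panel a can be sketched as follows, assuming each k-mer's expected current is modeled as a normal distribution (an assumption made for this sketch, not something stated in the caption). The pore model values, the 2-mer alphabet and the function names are invented for illustration and do not come from any real chemistry.

```python
import math

# Hypothetical toy pore model: k-mer -> (mean current in pA, standard deviation).
TOY_MODEL = {
    "AA": (80.0, 2.0),
    "AC": (85.0, 2.0),
    "CA": (95.0, 2.5),
    "CC": (110.0, 3.0),
}


def log_prob(event_mean, kmer_mean, kmer_stdv):
    """Natural log of the normal density of event_mean under one k-mer."""
    z = (event_mean - kmer_mean) / kmer_stdv
    return -0.5 * z * z - math.log(kmer_stdv * math.sqrt(2.0 * math.pi))


def matching_kmers(event_mean, threshold, model=TOY_MODEL):
    """Return the k-mers whose log probability is at or above the threshold."""
    return sorted(
        kmer
        for kmer, (mu, sd) in model.items()
        if log_prob(event_mean, mu, sd) >= threshold
    )


# A looser (more negative) threshold admits more candidate k-mers per event.
loose = matching_kmers(82.0, -3.0)
strict = matching_kmers(82.0, -2.5)
```

With these toy values the looser threshold keeps two candidates and the stricter one keeps a single candidate, mirroring how the blue curve falls as the threshold rises.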
Pore activity during Zymo ‘full flowcell 1’ sequencing runs. a, Percent of channels that are labeled active throughout the Zymo bacterial depletion UNCALLED and control runs, based on the percent of signal labeled ‘pore’ or ‘strand’ in the MinKNOW duty times. Curves are smoothed by taking the mean of 92-minute windows, which smooths over mux scans. b, Number of channels that are ‘alive’ throughout the run, meaning they have the capacity to sequence reads, based on when the last read was produced. This is distinct from the duty time plots in that a channel may not produce a read for several hours but still be considered ‘alive’.
GM12878 gene enrichment run duty times in the a, unsheared run and b, sheared run. Nuclease flushes were carried out at 24 and 48 hours in both runs. Curves plotted as in Extended Data Fig. 3. Note: we observed that a large patch of channels was marked as inactive after the second flush in the unsheared UNCALLED run, which can occur because of bubbles introduced when loading.
Extended Data Fig. 5 SVs confirmed by applying sensitive parameters in Sniffles and SURVIVOR or which required manual inspection to correct.
a, Insertion detected by UNCALLED but not by ONT WGS because most reads represented it as < 50 bp. b, Insertion detected by ONT WGS but not by UNCALLED because of low-complexity sequence. The overlapping deletion on the other haplotype also likely made the insertion difficult to resolve. c, Insertions detected by UNCALLED but not by PacBio because of low-complexity sequence. d, Deletion detected by PacBio but not by UNCALLED. e, Deletion detected by UNCALLED (and all other long-read datasets) but not by Illumina reads, likely because of surrounding repetitive elements. Note that white read alignments indicate low mapping quality. f, Sniffles called two SVs in this locus in both UNCALLED and ONT WGS, while it appears to represent a single duplication. SURVIVOR merged the ONT WGS SVs but not the UNCALLED SVs, causing a falsely unmatched SV. This is a known issue with SURVIVOR and this case was manually corrected.
Durations of gaps between reads on channel 109 of the Zymo Full Flowcell 1 UNCALLED run. The x-axis indicates when a read ended, and the y-axis indicates how long until the next read begins (log scale). Dashed vertical lines indicate mux scans, which often correspond to when gap characteristics change due to pore transitions. The horizontal red line is at one standard deviation over the median gap length for the entire run (including other channels), which is the threshold the simulator uses to define active and inactive periods, as represented by the top blue and red bars, respectively.
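The threshold described in this caption can be sketched in a few lines. This is a minimal illustration rather than the simulator's code: the function names and gap values are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the caption does not say which estimator is used.

```python
import statistics


def gap_threshold(gaps):
    """Median gap length plus one standard deviation, over all gaps in a run."""
    return statistics.median(gaps) + statistics.pstdev(gaps)


def split_gaps(gaps, threshold):
    """Separate gaps into short (at or below threshold) and long (above it)."""
    short = [g for g in gaps if g <= threshold]
    long_ = [g for g in gaps if g > threshold]
    return short, long_


# Hypothetical gap durations in seconds, pooled across channels.
gaps = [1.0, 2.0, 1.5, 300.0, 1.2, 0.8]
threshold = gap_threshold(gaps)
short_gaps, long_gaps = split_gaps(gaps, threshold)
```

In this toy run only the 300-second gap exceeds the threshold, so it alone would mark a boundary between broadly active and inactive periods.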
a, Outline of the ReadUntil simulator. Inputs are sequencing summaries of an UNCALLED run and a control run, in addition to the corresponding UNCALLED PAF file and the raw reads from the control run. The overall ‘pattern’ of the simulation is generated from the UNCALLED run: for each channel, gaps between the end of a read and the start of the next are separated into ‘short’ and ‘long’, where the long gaps are used to define broadly active and inactive periods of the channel (see Extended Data Fig. 6) and the short gaps are stored in a series of queues. The read chunks and durations are loaded from the control run. Each channel’s reads are stored in a queue and are output in the same order in which they were sequenced, but the exact time that they are output may vary between channels because of ejections. When all reads are output from a channel, the queue ‘repeats’ and outputs the same reads again. Short gaps are stored in similarly operating queues, each associated with a channel and scan interval. Scan intervals are periods between two mux scans, which are synchronized across all channels. b, Illustration of how simulations can be shortened by scaling down the active/inactive periods and scan intervals, but leaving the read and short gap duration unchanged.
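The repeating queues described in panel a can be sketched as below. The class name `RepeatingQueue`, the read identifiers and the keying of short-gap queues by `(channel, scan_interval)` are hypothetical choices made for this sketch; the simulator's actual data structures may differ.

```python
from collections import deque


class RepeatingQueue:
    """Output items in their original order, starting over once exhausted."""

    def __init__(self, items):
        if not items:
            raise ValueError("a repeating queue needs at least one item")
        self._items = deque(items)

    def pop(self):
        """Return the next item and move it to the back of the queue."""
        item = self._items[0]
        self._items.rotate(-1)  # the front item becomes the last item
        return item


# One read queue per channel (reads come from the control run) ...
read_queues = {109: RepeatingQueue(["read_a", "read_b"])}
# ... and one short-gap queue per (channel, scan interval) pair.
gap_queues = {(109, 0): RepeatingQueue([0.8, 1.2, 1.0])}

reads_out = [read_queues[109].pop() for _ in range(5)]
gaps_out = [gap_queues[(109, 0)].pop() for _ in range(4)]
```

Rotating a `deque` keeps each pop constant-time while preserving the original sequencing order across repeats.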
Simulated results of targeting sets of human genes: a, absolute enrichment with respect to gene count, b, absolute enrichment with respect to reference size, c, true positive rate with respect to gene count, d, true positive rate with respect to reference size. True positive rates were computed based on reads where the first 1,350 bp of each read fully aligns to the target reference according to minimap2. Note that reference size includes the 5 kbp surrounding each gene/exon, while the level of enrichment is calculated based on coverage of the target sequence only (see Supplementary Table 8).
Representation of alignments in path buffers. The ‘Virtual Alignment Forest’ is a more detailed version of the one in Fig. 1a. Pink edges mark paths that were pruned out due to lower probability in order to maintain the tree structure. Shaded backgrounds mark paths that have not been pruned out and are therefore represented in path buffers, and darker shading indicates that part of the path is represented in multiple buffers. ‘Path Buffers’ store cumulative log probabilities that can be used to compute a rolling mean log probability as mapping progresses, as well as ‘stay’ versus ‘move’ events represented by dotted versus solid lines. Seed mappings are inferred from the FM index coordinates, which are also stored in the buffers.
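Storing cumulative log probabilities makes a rolling mean cheap to compute, because the sum over any window is the difference of two stored values. The sketch below shows only that arithmetic. The function names, the window length and the log probability values are hypothetical, and it does not model the pruning or the stay/move bookkeeping described in the caption.

```python
def cumulative_sums(log_probs):
    """Return [0, lp0, lp0 + lp1, ...] so any window sum is one subtraction."""
    cumulative = [0.0]
    for lp in log_probs:
        cumulative.append(cumulative[-1] + lp)
    return cumulative


def rolling_means(cumulative, window):
    """Mean log probability of each full window of consecutive events."""
    return [
        (cumulative[end] - cumulative[end - window]) / window
        for end in range(window, len(cumulative))
    ]


# Hypothetical per-event log probabilities along one path.
path_log_probs = [-2.0, -3.0, -1.0, -4.0]
cumulative = cumulative_sums(path_log_probs)
means = rolling_means(cumulative, window=2)
```

Each new event appends one cumulative value, so the mean over the most recent window can be updated without revisiting earlier events.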
Cite this article
Kovaka, S., Fan, Y., Ni, B. et al. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol 39, 431–441 (2021). https://doi.org/10.1038/s41587-020-0731-9