Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED

Abstract

Conventional targeted sequencing methods eliminate many of the benefits of nanopore sequencing, such as the ability to accurately detect structural variants or epigenetic modifications. The ReadUntil method allows nanopore devices to selectively eject reads from pores in real time, which could enable purely computational targeted sequencing. However, this requires rapid identification of on-target reads, whereas most mapping methods require computationally intensive basecalling. We present UNCALLED (https://github.com/skovaka/UNCALLED), an open source mapper that rapidly matches streaming nanopore current signals to a reference sequence. UNCALLED probabilistically considers k-mers that could be represented by the signal and then prunes the candidates based on the reference encoded within a Ferragina–Manzini index. We used UNCALLED to deplete sequencing of known bacterial genomes within a metagenomics community, enriching the remaining species 4.46-fold. UNCALLED also enriched 148 human genes associated with hereditary cancers to 29.6× coverage using one MinION flowcell, enabling accurate detection of single-nucleotide polymorphisms, insertions and deletions, structural variants and methylation in these genes.

Fig. 1: UNCALLED algorithm and performance on E. coli data.
Fig. 2: UNCALLED results for the Zymo mock microbial community.
Fig. 3: Human cancer gene enrichment using UNCALLED.
Fig. 4: Integrated genome browser (IGV) visualization of a heterozygous Alu insertion in an exon of the MUTYH gene detected by UNCALLED, ONT WGS and PacBio HiFi reads.
Fig. 5: GM12878 promoter methylation.

Data availability

All sequencing runs are available as an NCBI BioProject under accession no. PRJNA604456.

Code availability

The source code for UNCALLED is available on GitHub at https://github.com/skovaka/UNCALLED.

Acknowledgements

We thank T. Mun for his contributions on an early prototype of UNCALLED and T. Gilpatrick for providing extracted GM12878 DNA used in the cancer gene enrichment experiments. This work was funded, in part, by the US National Science Foundation (grant no. DBI-1350041 to M.C.S.) and US National Institutes of Health (grant no. R01HG009190 to W.T.).

Author information

Contributions

S.K. and M.C.S. designed UNCALLED. S.K. implemented UNCALLED. B.N. and S.K. benchmarked UNCALLED. Y.F. performed all sequencing library preparation. S.K. computed enrichment levels for all experiments and performed small variant and structural variant detection and analysis. Y.F. performed methylation detection and analysis. W.T. supervised sequencing runs and advised on the experimental design. M.C.S. supervised the entire project. All authors contributed to writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sam Kovaka.

Ethics declarations

Competing interests

W.T. holds two patents currently licensed by Oxford Nanopore Technologies Limited. M.C.S. and W.T. have received travel funding from Oxford Nanopore Technologies Limited.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 FM Index Mapping.

(top) FM index alignment of a standard DNA sequence, where the size of each box represents the number of possible locations. (middle) FM alignment of a sequence where every position could be one of two bases. Base ambiguity is analogous to the k-mers we consider for every event. (bottom) Same as middle but alignments starting from all positions are found by filling in the gaps between ranges from previous alignments.
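
The ambiguous-base search described for the middle panel can be illustrated with a toy FM index. This is a minimal sketch, not UNCALLED's implementation: the function names (`build_fm_index`, `extend`, `ambiguous_search`) are hypothetical, the index is built by naive suffix sorting, and each query position is given as a set of allowed bases standing in for the candidate k-mers of an event.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM index by naive suffix sorting (short strings only)."""
    text += "$"  # sentinel; sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the sentinel
    counts = Counter(text)
    first, total = {}, 0  # first[ch] = number of characters smaller than ch
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, first, occ


def extend(first, occ, rng, ch):
    """One backward-search step: prepend ch to the suffixes in rng."""
    if ch not in first:
        return None
    lo, hi = rng  # half-open range of suffix array rows
    lo2 = first[ch] + occ[ch][lo]
    hi2 = first[ch] + occ[ch][hi]
    return (lo2, hi2) if lo2 < hi2 else None


def ambiguous_search(text, choices):
    """Return start positions matching a query whose positions are base sets."""
    sa, first, occ = build_fm_index(text)
    ranges = {(0, len(sa))}
    for options in reversed(choices):  # backward search: last position first
        next_ranges = set()
        for rng in ranges:
            for ch in options:
                new_rng = extend(first, occ, rng, ch)
                if new_rng is not None:  # prune candidates absent from the text
                    next_ranges.add(new_rng)
        ranges = next_ranges
    return sorted(sa[i] for lo, hi in ranges for i in range(lo, hi))


hits = ambiguous_search("GATTACAGATTA", [{"A", "T"}, {"T"}, {"A", "T"}])
print(hits)  # [1, 2, 8, 9]
```

Each step keeps only the ranges that still correspond to substrings of the reference, so the number of live candidates shrinks as more positions are consumed, which is the pruning effect the box sizes in the figure depict.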

Extended Data Fig. 2 Event/K-mer Match Probability Thresholds.

a, Relationship between natural log probability thresholds (x-axis), the mean number of k-mers that match above each threshold per event (blue) and the fraction of events that match their correct k-mer above each threshold (red). The values for R9.4 chemistry are shown here. b, The FM index range lengths assigned to different probability thresholds for the E. coli reference. This function varies depending on the reference used.
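
The trade-off in panel a can be sketched as follows, assuming each k-mer is modelled by a normal distribution over event current means. The pore model below is made up for illustration (2-mers with invented means and standard deviations), and `log_normal_pdf` and `matching_kmers` are hypothetical names rather than part of UNCALLED.

```python
import math


def log_normal_pdf(x, mean, stdv):
    """Natural log of the normal density at x."""
    z = (x - mean) / stdv
    return -0.5 * z * z - math.log(stdv * math.sqrt(2.0 * math.pi))


def matching_kmers(event_mean, model, threshold):
    """Return {k-mer: log probability} for k-mers at or above the threshold."""
    matches = {}
    for kmer, (mean, stdv) in model.items():
        log_prob = log_normal_pdf(event_mean, mean, stdv)
        if log_prob >= threshold:
            matches[kmer] = log_prob
    return matches


# Invented toy model: k-mer -> (mean current, standard deviation).
TOY_MODEL = {
    "AA": (80.0, 2.0),
    "AC": (85.0, 2.0),
    "CA": (95.0, 2.0),
    "CC": (110.0, 2.0),
}

strict = matching_kmers(84.0, TOY_MODEL, threshold=-3.0)
loose = matching_kmers(84.0, TOY_MODEL, threshold=-4.0)
print(sorted(strict), sorted(loose))
```

With the stricter threshold only one candidate survives, while the looser threshold admits a second. A lower threshold makes it more likely that the correct k-mer is retained, at the cost of more candidates per event.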

Extended Data Fig. 3 Zymo Full Flowcell 1 Pore Activity.

Pore activity during Zymo ‘full flowcell 1’ sequencing runs. a, Percentage of channels labeled active throughout the Zymo bacterial depletion UNCALLED and control runs, based on the percentage of signal labeled ‘pore’ or ‘strand’ in the MinKNOW duty times. Curves are smoothed by taking the mean of 92-minute windows, which smooths over mux scans. b, Number of channels that are ‘alive’ throughout the run, meaning they have the capacity to sequence reads, based on when the last read was produced. This is distinct from the duty time plots in that a channel may not produce a read for several hours but still be considered ‘alive’.

Extended Data Fig. 4 GM12878 Duty Times.

GM12878 gene enrichment run duty times in the a, unsheared run and b, sheared run. Nuclease flushes were carried out at 24 and 48 hours in both runs. Curves are plotted as in Extended Data Fig. 3. Note: we observed that a large patch of channels was marked as inactive after the second flush in the unsheared UNCALLED run, which can occur because of bubbles introduced when loading.

Extended Data Fig. 5 SVs confirmed by applying sensitive parameters in Sniffles and SURVIVOR or which required manual inspection to correct.

a, Insertion detected by UNCALLED but not by ONT WGS because most reads represented it as < 50 bp. b, Insertion detected by ONT WGS but not by UNCALLED because of low-complexity sequence. The overlapping deletion on the other haplotype also likely made the insertion difficult to resolve. c, Insertions detected by UNCALLED but not by PacBio because of low-complexity sequence. d, Deletion detected by PacBio but not by UNCALLED. e, Deletion detected by UNCALLED (and all other long-read datasets) but not by Illumina reads, likely because of surrounding repetitive elements. Note that white read alignments indicate low mapping quality. f, Sniffles called two SVs in this locus in both UNCALLED and ONT WGS, although the locus appears to represent a single duplication. SURVIVOR merged the ONT WGS SVs but not the UNCALLED SVs, causing a falsely unmatched SV. This is a known issue with SURVIVOR and this case was manually corrected.

Extended Data Fig. 6 Zymo Full Flowcell 1 Gap Durations.

Durations of gaps between reads on channel 109 of the Zymo Full Flowcell 1 UNCALLED run. X-axis indicates when a read ended, Y-axis indicates how long until the next read begins (log scale). Dashed vertical lines indicate mux scans, which often correspond to when gap characteristics change due to pore transitions. The horizontal red line is at one standard deviation over the median gap length for the entire run (including other channels), which is the threshold the simulator uses to define active and inactive periods as represented by the top blue and red bars respectively.
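
The red threshold line can be reproduced with a few lines of standard-library Python. This is a sketch under assumptions: the caption does not say whether the standard deviation is the population or sample form, so the population form (`statistics.pstdev`) is used here, and `gap_threshold` and `split_gaps` are hypothetical names.

```python
import statistics


def gap_threshold(all_gaps):
    """Median gap length plus one standard deviation, over the whole run."""
    # Assumption: population standard deviation; the caption does not specify.
    return statistics.median(all_gaps) + statistics.pstdev(all_gaps)


def split_gaps(channel_gaps, threshold):
    """Separate one channel's gaps into short (<= threshold) and long."""
    short = [g for g in channel_gaps if g <= threshold]
    long_gaps = [g for g in channel_gaps if g > threshold]
    return short, long_gaps


# Invented gap durations in seconds, pooled across all channels of a run.
run_gaps = [1, 1, 1, 1, 1, 1, 1, 1, 1, 101]
threshold = gap_threshold(run_gaps)
short, long_gaps = split_gaps([1, 1, 101, 1], threshold)
print(threshold, short, long_gaps)
```

Long gaps would then delimit the broadly inactive periods of a channel, while short gaps are treated as ordinary pauses between consecutive reads.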

Extended Data Fig. 7 Outline of ReadUntil Simulator.

a, Outline of the ReadUntil simulator. Inputs are sequencing summaries of an UNCALLED run and a control run, in addition to the corresponding UNCALLED PAF file and the raw reads from the control run. The overall ‘pattern’ of the simulation is generated from the UNCALLED run: for each channel, gaps between the end of a read and the start of the next are separated into ‘short’ and ‘long’, where the long gaps are used to define broadly active and inactive periods of the channel (see Extended Data Fig. 6) and the short gaps are stored in a series of queues. The read chunks and durations are loaded from the control run. Each channel’s reads are stored in a queue and are output in the same order in which they were sequenced, but the exact time at which they are output may vary between channels because of ejections. When all reads have been output from a channel, the queue ‘repeats’ and outputs the same reads again. Short gaps are stored in similarly operating queues, each associated with a channel and scan interval. Scan intervals are periods between two mux scans, which are synchronized across all channels. b, Illustration of how simulations can be shortened by scaling down the active/inactive periods and scan intervals, but leaving the read and short gap durations unchanged.
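
The repeating per-channel queue can be sketched as a small class. This is only an illustration of the behaviour described in the caption; `RepeatingQueue` and the read identifiers are hypothetical and do not come from the UNCALLED source code.

```python
class RepeatingQueue:
    """Yield items in their original order, wrapping around when exhausted."""

    def __init__(self, items):
        self._items = list(items)
        if not self._items:
            raise ValueError("a repeating queue needs at least one item")
        self._next = 0  # index of the item to output next

    def pop(self):
        item = self._items[self._next]
        # Wrap to the start once every item has been output.
        self._next = (self._next + 1) % len(self._items)
        return item


# Hypothetical read identifiers for one channel, in sequencing order.
channel_reads = RepeatingQueue(["read_a", "read_b", "read_c"])
served = [channel_reads.pop() for _ in range(5)]
print(served)  # ['read_a', 'read_b', 'read_c', 'read_a', 'read_b']
```

Because ejected reads finish sooner than fully sequenced ones, a simulated channel can consume its queue faster than the control run did, which is why wrapping around is needed.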

Extended Data Fig. 8 Simulation Results.

Simulated results of targeting sets of human genes: a, absolute enrichment with respect to gene count, b, absolute enrichment with respect to reference size, c, true positive rate with respect to gene count, d, true positive rate with respect to reference size. True positive rates were computed based on reads where the first 1,350 bp of each read fully aligns to the target reference according to minimap2. Note that reference size includes the 5 kbp surrounding each gene/exon, while the level of enrichment is calculated based on coverage of the target sequence only (see Supplementary Table 8).

Extended Data Fig. 9 Path Buffer Illustration.

Representation of alignments in path buffers. The ‘Virtual Alignment Forest’ is a more detailed version of the one in Fig. 1a. Pink edges mark paths that were pruned out due to lower probability in order to maintain the tree structure. Shaded backgrounds mark paths that have not been pruned out and are therefore represented in path buffers, and darker shading indicates that part of the path is represented in multiple buffers. ‘Path Buffers’ store cumulative log probabilities that can be used to compute a rolling mean log probability as mapping progresses, as well as ‘stay’ versus ‘move’ events represented by dotted versus solid lines. Seed mappings are inferred from the FM index coordinates, which are also stored in the buffers.
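
Storing cumulative log probabilities makes the rolling mean a constant-time subtraction. The sketch below shows that idea only; the `PathBuffer` class, its fixed `window` parameter and its layout are assumptions for illustration and not UNCALLED's actual data structure.

```python
class PathBuffer:
    """Keep prefix sums of log probabilities for a fixed-size rolling mean."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.cumulative = [0.0]  # running prefix sums; first entry is the base

    def append(self, log_prob):
        self.cumulative.append(self.cumulative[-1] + log_prob)
        # Keep at most window + 1 prefix sums, dropping the oldest.
        if len(self.cumulative) > self.window + 1:
            self.cumulative.pop(0)

    def rolling_mean(self):
        n = len(self.cumulative) - 1  # number of events currently covered
        if n == 0:
            raise ValueError("no events in buffer")
        # Difference of prefix sums = sum of the last n log probabilities.
        return (self.cumulative[-1] - self.cumulative[0]) / n


buf = PathBuffer(window=3)
for lp in (-1.0, -2.0, -3.0, -6.0):
    buf.append(lp)
print(buf.rolling_mean())  # mean of the last three values: -2, -3 and -6
```

A path whose rolling mean falls too low can then be discarded without rescanning its individual events.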

Supplementary information

Reporting Summary

Supplementary Tables

Supplementary Tables 1–8.

Cite this article

Kovaka, S., Fan, Y., Ni, B. et al. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol 39, 431–441 (2021). https://doi.org/10.1038/s41587-020-0731-9
