Main

Nanopore sequencing technology and its applications in basic and applied research have undergone substantial growth since Oxford Nanopore Technologies (ONT) provided the first nanopore sequencer, MinION, in 2014 (refs. 1,2). The technology relies on a nanoscale protein pore, or ‘nanopore’, that serves as a biosensor and is embedded in an electrically resistant polymer membrane1,3 (Fig. 1). In an electrolytic solution, a constant voltage is applied to produce an ionic current through the nanopore such that negatively charged single-stranded DNA or RNA molecules are driven through the nanopore from the negatively charged ‘cis’ side to the positively charged ‘trans’ side. Translocation speed is controlled by a motor protein that ratchets the nucleic acid molecule through the nanopore in a step-wise manner. Changes in the ionic current during translocation correspond to the nucleotide sequence present in the sensing region and are decoded using computational algorithms, allowing real-time sequencing of single molecules. In addition to controlling translocation speed, the motor protein has helicase activity, enabling double-stranded DNA or RNA–DNA duplexes to be unwound into single-stranded molecules that pass through the nanopore.

Fig. 1: Principle of nanopore sequencing.

A MinION flow cell contains 512 channels with 4 nanopores in each channel, for a total of 2,048 nanopores used to sequence DNA or RNA. The wells are inserted into an electrically resistant polymer membrane supported by an array of microscaffolds connected to a sensor chip. Each channel associates with a separate electrode in the sensor chip and is controlled and measured individually by the application-specific integrated circuit (ASIC). Ionic current passes through the nanopore because a constant voltage is applied across the membrane, where the trans side is positively charged. Under the control of a motor protein, a double-stranded DNA (dsDNA) molecule (or an RNA–DNA hybrid duplex) is first unwound, then single-stranded DNA or RNA with negative charge is ratcheted through the nanopore, driven by the voltage. As nucleotides pass through the nanopore, a characteristic current change is measured and is used to determine the corresponding nucleotide type at ~450 bases per s (R9.4 nanopore).

In this review, we first present an introduction to the technology development of nanopore sequencing and discuss improvements in the accuracy, read length and throughput of ONT data. Next, we describe the main bioinformatics methods applied to ONT data. We then review the major applications of nanopore sequencing in basic research, clinical studies and field research. We conclude by considering the limitations of the existing technologies and algorithms and directions for overcoming these limitations.

Technology development

Nanopore design

The concept of nanopore sequencing emerged in the 1980s and was realized through a series of technical advances in both the nanopore and the associated motor protein1,4,5,6,7,8. α-Hemolysin, a membrane channel protein from Staphylococcus aureus with an internal diameter of ~1.4 nm to ~2.4 nm (refs. 1,9), was the first nanopore shown to detect recognizable ionic current blockades by both RNA and DNA homopolymers10,11,12. In a crucial step toward single-nucleotide-resolution nanopore sequencing, engineering of the wild-type α-hemolysin protein allowed the four DNA bases on oligonucleotide molecules to be distinguished, although complex sequences were not examined in these reports13,14,15. Similar results were achieved using another engineered nanopore, Mycobacterium smegmatis porin A (MspA)16,17, that has a similar channel diameter (~1.2 nm)18,19.

A key advance in improving the signal-to-noise ratio was the incorporation of processive enzymes to slow DNA translocation through the nanopore20,21,22. In particular, phi29 DNA polymerase was found to have superior performance in ratcheting DNA through the nanopore23,24. Indeed, this motor protein provided the last piece of the puzzle; in February 2012, two groups demonstrated processive recordings of ionic currents for single-stranded DNA molecules that could be resolved into signals from individual nucleotides by combining phi29 DNA polymerase and a nanopore (α-hemolysin24 and MspA25). In contrast to the previous DNA translocation tests that were poorly controlled13,14,15,16,17, the addition of the motor protein reduced the fluctuations in translocation kinetics, thus improving data quality. In the same month, ONT announced the first nanopore sequencing device, MinION26. ONT released the MinION to early users in 2014 and commercialized it in 2015 (ref. 2) (Fig. 2a). There have been several other nanopore-based sequencing ventures, such as Genia Technologies’s nanotag-based real-time sequencing by synthesis (Nano-SBS) technology, NobleGen Biosciences’s optipore system and Quantum Biosystems’s sequencing by electronic tunneling (SBET) technology27,28. However, this review focuses on ONT technology as it has been used in most peer-reviewed studies of nanopore sequencing, data, analyses and applications.

Fig. 2: ONT sequencing data improvement over time.

a, Timeline of the major chemistry and platform releases by ONT. b, Accuracy of 1D, 2D and 1D2 reads. c, Average and maximum read lengths. Special efforts have been made in some studies to achieve ultralong read lengths. For example, as of late 2019, the highest average read length achieved was 23.8 kilobases (kb), obtained using a specific DNA extraction protocol51. The longest individual read is 2,273 kb, rescued by correcting an error in the software MinKNOW49. The DNA extraction and purification methods used in these independent studies are summarized in Supplementary Table 1. Read lengths are reported for 1D reads. d, Yield per flow cell (in log10 scale for y axis). Yields are reported for 1D reads. Data points shown in b (accuracy), c (read length) and d (yield) are from independent studies. Details for these data points are summarized in Supplementary Table 1.

ONT has continually refined the nanopore and the motor protein, releasing eight versions of the system to date, including R6 (June 2014), R7 (July 2014), R7.3 (October 2014), R9 (May 2016), R9.4 (October 2016), R9.5 (May 2017), R10 (March 2019) and R10.3 (January 2020) (Fig. 2a). The original or engineered proteins used in the R6, R7, R7.3, R10 and R10.3 nanopores have not been disclosed by the company to date. R9 achieved a notable increase in sequencing yield per unit of time and in sequencing accuracy (~87% (ref. 29) versus ~64% for R7 (ref. 30)) by using the nanopore Curlin sigma S-dependent growth subunit G (CsgG) from Escherichia coli (Fig. 2b and Supplementary Table 1). This nanopore has a translocation rate of ~250 bases per s compared to ~70 bases per s for R7 (ref. 31). Subsequently, a mutant CsgG and a new motor enzyme (whose origin was not disclosed) were integrated into R9.4 to achieve higher sequencing accuracy (~85–94% as reported in refs. 32,33,34,35,36) and faster sequencing speeds (up to 450 bases per s). R9.5 was introduced to be compatible with the 1D2 sequencing strategy, which measures a single DNA molecule twice (see below). However, the R9.4 and R9.5 have difficulty sequencing very long homopolymer runs because the current signal of CsgG is determined by approximately five consecutive nucleotides. The R10 and R10.3 nanopores have two sensing regions (also called reader heads) to aim for higher accuracy with homopolymers37,38, although independent studies are needed to assess this claim.
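The homopolymer limitation follows from the k-mer nature of the signal and can be illustrated with a toy model. The sketch below assumes an idealized sensing region of exactly five bases in which the current level depends only on the 5-mer occupying it; the function names and example sequences are hypothetical and serve only to show why steps inside a long homopolymer produce no change in level.

```python
from typing import List


def kmers(seq: str, k: int = 5) -> List[str]:
    """Return the successive k-mers that would occupy a k-base sensing region."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def distinct_level_changes(seq: str, k: int = 5) -> int:
    """Count translocation steps at which the k-mer in the pore changes.

    Idealized assumption: identical consecutive k-mers give identical
    current levels, so such steps are invisible in the signal.
    """
    windows = kmers(seq, k)
    return sum(1 for prev, cur in zip(windows, windows[1:]) if prev != cur)


mixed = "ACGTACGTAC"          # no homopolymer run
homopolymer = "ACAAAAAAAAGT"  # contains a run of eight A's

print(kmers(homopolymer))
print(distinct_level_changes(mixed), distinct_level_changes(homopolymer))
```

In this toy model the run of eight A's yields four consecutive AAAAA windows, so three of the seven translocation steps leave the level unchanged and the run length has to be inferred from dwell time alone.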

Additional strategies to improve accuracy

Beyond optimizing the nanopore and motor protein, several strategies have been developed to improve accuracy. Data quality can be improved by sequencing each dsDNA multiple times to generate a consensus sequence, similar to the ‘circular consensus sequencing’ strategy used in the other single-molecule long-read sequencing method from Pacific Biosciences (PacBio)39. Early versions of ONT sequencing used a 2D library preparation method to sequence each dsDNA molecule twice; the two strands of a dsDNA molecule are ligated together by a hairpin adapter, and a motor protein guides one strand (the ‘template’) through the nanopore, followed by the hairpin adapter and the second strand (the ‘complement’)40,41,42 (Fig. 3d, left). After removing the hairpin sequence, the template and complement reads, called the 1D reads, are used to generate a consensus sequence, called the 2D read, of higher accuracy. Using the R9.4 nanopore as an example, the average accuracy of 2D reads is 94% versus 86% for 1D reads33 (Fig. 2b). In May 2017, ONT released the 1D2 method together with the R9.5 nanopore; in this method, instead of being physically connected by a hairpin adapter, each strand is ligated separately to a special adapter (Fig. 3d, right). This special adapter provides a high probability (>60%) that the complement strand will immediately be captured by the same nanopore after the template strand, offering similar consensus sequence generation for dsDNA as the 2D library. The average accuracy of 1D2 reads is up to 95% (R9.5 nanopore)43 (Fig. 2b). Unlike the 2D library, the complement strand in the 1D2 library is not guaranteed to follow the template, resulting in imperfect consensus sequence generation. However, ONT no longer offers or supports the 2D and 1D2 libraries. Currently, for DNA sequencing, ONT only supports the 1D method in which each strand of a dsDNA is ligated with an adapter and sequenced independently (Fig. 3d, middle).

Fig. 3: Library preparation workflow for ONT sequencing.

a, Special experimental techniques for ultralong genomic DNA sequencing, including HMW DNA extraction, fragmentation and size selection. b, Full-length cDNA synthesis for direct cDNA sequencing (without a PCR amplification step) and PCR-cDNA sequencing (with a PCR amplification step). c, Direct RNA-sequencing library preparation with or without a reverse transcription step, where only the RNA strand is ligated with an adapter and thus only the RNA strand is sequenced. d, Different library preparation strategies for DNA/cDNA sequencing, including 2D (where the template strand is sequenced, followed by a hairpin adapter and the complement strand), 1D (where each strand is ligated with an adapter and sequenced independently) and 1D2 (where each strand is ligated with a special adapter such that there is a high probability that one strand will immediately be captured by the same nanopore following sequencing of the other strand of dsDNA); SRE, short read eliminator kit (Circulomics).

In parallel, accuracy has been improved through new base-calling algorithms, including many developed through independent research32,44 (see below). Taking the R7.3 nanopore as an example, the 1D read accuracy was improved from 65% by hidden Markov model (HMM)45 to 70% by Nanocall46 and to 78% by DeepNano47.

Extending read length

Although the accuracy of ONT sequencing is relatively low, the read length provided by electrical detection has a very high upper bound because the method relies on the physical process of nucleic acid translocation48. Reads of up to 2.273 megabases (Mb) were demonstrated in 2018 (ref. 49). Thus, ONT read lengths depend crucially on the sizes of molecules in the sequencing library. Various approaches for extracting and purifying high-molecular-weight (HMW) DNA have been reported or applied to ONT sequencing, including spin columns (for example, Monarch Genomic DNA Purification kit, New England Biolabs), gravity-flow columns (for example, NucleoBond HMW DNA kit, Takara Bio), magnetic beads (for example, MagAttract HMW DNA kit, QIAGEN), phenol–chloroform, dialysis and plug extraction50 (Fig. 3a). HMW DNA can also be sheared to the desired size by sonication, needle extrusion or transposase cleavage (Fig. 3a). However, overrepresented small fragments outside the desired size distribution may decrease sequencing yield because small fragments are ligated to adapters and translocated through nanopores more efficiently than long fragments. To remove overrepresented small DNA fragments, various size selection methods (for example, the gel-based BluePippin system of Sage Science, magnetic beads and the Short Read Eliminator kit of Circulomics) have been used to obtain the desired data distribution and/or improve sequencing yield (Fig. 3a).

With improvements in nanopore technology and library preparation protocols (Figs. 2a and 3a), the maximum read length has increased from <800 kb in early 2017 to 2.273 Mb in 2018 (ref. 49) (Fig. 2c). The average read length has increased from a few thousand bases at the initial release of MinION in 2014 to ~23 kb (ref. 51) in 2018 (Fig. 2c), primarily due to improvements in HMW DNA extraction methods and size selection strategies. However, there is a trade-off between read length and yield; for example, the sequencing yield of the HMW genomic DNA library is relatively low.

Sequencing RNA

ONT devices have been adapted to directly sequence native RNA molecules52. The method requires special library preparation in which the primer is ligated to the 3′ end of native RNA, followed by direct ligation of the adapter without conventional reverse transcription (Fig. 3c). Alternatively, a cDNA strand can be synthesized to obtain an RNA–cDNA hybrid duplex, followed by ligation of the adapter. The former strategy requires less sample manipulation and is quicker and thus is good for on-site applications, whereas the latter produces a more stable library for longer sequencing courses and therefore produces higher yields. In both cases, only the RNA strand passes through the nanopore, and therefore direct sequencing of RNA molecules does not generate a consensus sequence (for example, 2D or 1D2). Compared to DNA sequencing, direct RNA sequencing is typically of lower average accuracy, around 83–86%, as reported by independent research53,54.

Like conventional RNA sequencing, ONT can be used to perform cDNA sequencing by utilizing existing full-length cDNA synthesis methods (for example, the SMARTer PCR cDNA Synthesis kit of Takara Bio and the TeloPrime Full-Length cDNA Amplification kit of Lexogen) followed by PCR amplification42,55 (Fig. 3b). ONT also offers a direct cDNA sequencing protocol without PCR amplification, in contrast to many existing cDNA sequencing methods. This approach avoids PCR amplification bias, but it requires a relatively large amount of input material and longer library preparation time, making it unsuitable for many clinical applications. A recent benchmarking study demonstrated that ONT sequencing of RNA, cDNA or PCR-cDNA for the identification and quantification of gene isoforms provides similar results56.

Increasing throughput

In addition to sequencing length and accuracy, throughput is another important consideration for ONT sequencing applications. To meet the needs of different project scales, ONT released several platforms (Box 1). The expected data output of a flow cell mainly depends on (1) the number of active nanopores, (2) DNA/RNA translocation speed through the nanopore and (3) running time.

Early MinION users reported typical yields of hundreds of megabases per flow cell, while current throughput has increased to ~10–15 gigabases (Gb) (Fig. 2d, solid line) for DNA sequencing through faster chemistry (increasing from ~30 bases per s by R6 nanopore to ~450 bases per s by R9.4 nanopore) and longer run times with the introduction of the Rev D ASIC chip. Subsequent devices, such as PromethION, run more flow cells with more nanopores per flow cell. An independent study reported a yield of 153 Gb from a single PromethION flow cell with an average sequencing speed of ~430 bases per s (ref. 57) (Fig. 2d, dashed line). By contrast, direct RNA sequencing currently produces about 1,000,000 reads (1–3 Gb) per MinION flow cell due in part to its relatively low sequencing speed (~70 bases per s).
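As a rough illustration of how the three factors listed above combine, the sketch below multiplies them to obtain an upper bound on yield. The function names, the 48 h run time, the assumption that one nanopore per channel is read at a time (512 concurrently on a MinION flow cell) and the occupancy parameter are illustrative assumptions, not ONT specifications.

```python
def theoretical_yield_gb(active_pores: int, bases_per_s: float,
                         run_hours: float, occupancy: float = 1.0) -> float:
    """Upper-bound yield in gigabases.

    occupancy is the assumed fraction of time a pore is actually
    translocating a molecule (1.0 means never idle).
    """
    seconds = run_hours * 3600
    return active_pores * bases_per_s * seconds * occupancy / 1e9


def implied_occupancy(observed_gb: float, active_pores: int,
                      bases_per_s: float, run_hours: float) -> float:
    """Fraction of the theoretical ceiling that an observed yield represents."""
    return observed_gb / theoretical_yield_gb(active_pores, bases_per_s, run_hours)


# Assumed scenario: 512 concurrently read pores, R9.4 speed, 48 h run.
ceiling = theoretical_yield_gb(512, 450, 48)   # about 39.8 Gb
low = implied_occupancy(10, 512, 450, 48)      # about 0.25
high = implied_occupancy(15, 512, 450, 48)     # about 0.38
print(round(ceiling, 1), round(low, 2), round(high, 2))
```

Under these assumptions the reported ~10–15 Gb corresponds to roughly a quarter to a third of the theoretical ceiling, which is consistent with pores being idle, blocked or lost during a run.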

Data analysis

Bioinformatics analysis of ONT data has undergone continued improvement (Fig. 4). In addition to in-house data collection and specific data formats, many ONT-specific analyses focus on better utilizing the ionic current signal for purposes such as base calling, base modification detection and postassembly polishing. Other tools exploit the long read length while accounting for the high error rate. Many of these, such as tools for error correction, assembly and alignment, were developed for PacBio data but are also applicable to ONT data (Table 1).

Fig. 4: Analyses of ONT sequencing data.

Typical bioinformatics analyses of ONT sequencing data, including the raw current data-specific approaches (for example, quality control, base calling and DNA/RNA modification detection), and error-prone long read-specific approaches (in dashed boxes; for example, error correction, de novo genome assembly, haplotyping/phasing, structural variation (SV) detection, repetitive region analyses and transcriptome analyses).

Table 1 Computational tools and experimental assays for ONT data analysis and applications

Because ONT devices do not require high-end computing resources or advanced skills for basic data processing, many laboratories can run data collection themselves. MinKNOW is the operating software used to control ONT devices by setting sequencing parameters and tracking samples (Fig. 4, top left). MinKNOW also manages data acquisition and real-time analysis, performs local base calling and outputs binary files in fast5 format that store both metadata and read information (for example, current measurements and, if base calling is performed, the read sequence). The fast5 format organizes the multidimensional data in a nested manner, allowing piece-wise access to and extraction of information of interest without navigating through the whole dataset. Earlier versions of MinKNOW output one fast5 file per read (named single-fast5), but later versions output one fast5 file for multiple reads (named multi-fast5) to keep pace with increasing throughput. Both fast5 and fastq files are output if the base-calling mode is applied during the sequencing experiment. In addition to official ONT tools (for example, ont_fast5_api software for format conversion between single-fast5 and multi-fast5 and data compression/decompression), several third-party software packages40,58,59,60,61,62 have been developed for quality control, format conversion (for example, NanoR63 for generating fastq files from fast5 files containing sequence information), data exploration and visualization of the raw ONT data (for example, Poretools64, NanoPack65 and PyPore66) and for analyses after base calling (for example, AlignQC42 and BulkVis49) (Fig. 4, top right).

Base calling

Base calling, which decodes the current signal to the nucleotide sequence, is critical for data accuracy and detection of base modifications (Fig. 4, top center). Overall, method development for base calling went through four stages32,44,58,67,68: (1) base calling from the segmented current data by HMM at the early stage and by recurrent neural network in late 2016, (2) base calling from raw current data in 2017, (3) using a flip–flop model for identifying individual nucleotides in 2018 and (4) training customized base-calling models in 2019. ONT developed new base callers as ‘technology demonstrator’ software (for example, Nanonet, Scrappie and Flappie), which were subsequently implemented into the officially available software packages (for example, Albacore and Guppy). Albacore development is now discontinued in favor of Guppy, which can also run on graphics processing units in addition to central processing units to accelerate base calling.

ONT devices take thousands of current measurements per second. Processive translocation of a DNA or RNA molecule leads to a characteristic current shift that is determined by multiple consecutive nucleotides (that is, k-mer) defined by the length of the nanopore sensing region1. The raw current measurement can be segmented based on current shift to capture individual signals from each k-mer. Each current segment contains multiple measurements, and the corresponding mean, variance and duration of the current measurements together make up the ‘event’ data. The dependence of event data on neighboring nucleotides is Markov chain-like, making HMM-based methods a natural match to decode current shifts to nucleotide sequence, such as early base callers (for example, cloud-based Metrichor by ONT and Nanocall46). The subsequent Nanonet by ONT (implemented into Albacore) and DeepNano47 implemented a recurrent neural network algorithm to improve base-calling accuracy by training a deep neural network to infer k-mers from the event data. In particular, Nanonet used a bidirectional method to include information from both upstream and downstream states on base calling.
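The conversion of raw current into event data can be sketched as follows. This is a minimal illustration under stated assumptions, not the segmentation used by any ONT base caller: a new segment is opened whenever a sample deviates from the running mean of the current segment by more than a fixed threshold, and the names segment_events, jump_pa and the 4 kHz sampling rate are hypothetical choices.

```python
from statistics import fmean, pvariance
from typing import List, NamedTuple, Sequence


class Event(NamedTuple):
    mean: float      # mean current of the segment (pA)
    variance: float  # population variance of the segment
    duration: float  # segment length in seconds


def _summarize(segment: Sequence[float], sample_rate_hz: float) -> Event:
    return Event(fmean(segment), pvariance(segment), len(segment) / sample_rate_hz)


def segment_events(samples: Sequence[float],
                   sample_rate_hz: float = 4000.0,  # assumed sampling rate
                   jump_pa: float = 5.0) -> List[Event]:
    """Split a current trace into events at abrupt level shifts.

    Toy rule (an assumption): a sample starts a new event when it differs
    from the running mean of the current segment by more than jump_pa.
    """
    events: List[Event] = []
    current: List[float] = []
    for x in samples:
        if current and abs(x - fmean(current)) > jump_pa:
            events.append(_summarize(current, sample_rate_hz))
            current = []
        current.append(x)
    if current:
        events.append(_summarize(current, sample_rate_hz))
    return events


trace = [80.1, 79.9, 80.0, 95.2, 94.8, 95.0, 95.0, 70.0, 70.2]
for event in segment_events(trace):
    print(event)
```

Each resulting event carries the mean, variance and duration described above and would be the input to an HMM or recurrent neural network that infers the underlying k-mer.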

However, information may be lost when converting raw current measurement into event data, potentially diminishing base-calling accuracy. Raw current data were first used for classifying ONT reads into specific species69. Later, ONT’s open-source base caller Scrappie (implemented into both Albacore and Guppy) and the third-party software Chiron70 adopted neural networks to directly translate the raw current data into DNA sequence. Subsequently, ONT released the base caller Flappie, which uses a flip–flop model with a connectionist temporal classification decoding architecture and identifies individual bases instead of k-mers from raw current data. Furthermore, the software Causalcall uses a modified temporal convolutional network combined with a connectionist temporal classification decoder to model long-range sequence features35. In contrast to generalized base-calling models, ONT introduced Taiyaki (implemented into Guppy) to train customized (for example, application/species-specific) base-calling models by using language processing techniques to handle the high complexity and long-range dependencies of raw current data. Additionally, Taiyaki can train models for identifying modified bases (for example, 5-methylcytosine (5mC) or N6-methyladenine (6mA)) by adding a fifth output dimension. The R10 and R10.3 nanopores with two sensing regions may result in different signal features compared to previous raw current data, which will likely drive another wave of method development to improve data accuracy and base modification detection. To date, Guppy is the most widely used base caller because of its superiority in accuracy and speed32 (Table 1).
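To make the role of the connectionist temporal classification (CTC) decoder concrete, the sketch below implements standard greedy CTC decoding: take the best label per signal frame, collapse consecutive repeats and drop blanks. It is a generic illustration and not the flip–flop scheme of Flappie, which uses a different label set; the alphabet, the blank symbol and the one_hot helper are assumptions made for the example.

```python
from typing import List, Sequence

ALPHABET = "ACGT-"  # toy convention: the last symbol is the CTC blank
BLANK = "-"


def best_path(frame_scores: Sequence[Sequence[float]]) -> List[str]:
    """Pick the highest-scoring label for every signal frame."""
    return [ALPHABET[max(range(len(scores)), key=scores.__getitem__)]
            for scores in frame_scores]


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]]) -> str:
    """Collapse repeated labels, then remove blanks."""
    decoded: List[str] = []
    previous = None
    for label in best_path(frame_scores):
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


def one_hot(label: str) -> List[float]:
    """Hypothetical stand-in for per-frame network output."""
    return [1.0 if symbol == label else 0.0 for symbol in ALPHABET]


frames = [one_hot(c) for c in "AA-AC-GG"]
print(ctc_greedy_decode(frames))  # AACG
```

The blank between the two runs of A frames lets the decoder emit "AA", whereas the uninterrupted "GG" frames collapse into a single G; this is how a CTC decoder separates a base that dwells for several frames from a genuine repeat of that base.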

Detecting DNA and RNA modifications

ONT enables the direct detection of some DNA and RNA modifications by distinguishing their current shifts from those of unmodified bases52,71,72,73,74 (Fig. 4, middle center), although the resolution varies from the bulk level to the single-molecule level. A handful of DNA and RNA modification detection tools have been developed over the years (Table 1). Nanoraw (integrated into the Tombo software package) was the first tool to identify the DNA modifications 5mC, 6mA and N4-methylcytosine (4mC) from ONT data74. Several other DNA modification detection tools followed, including Nanopolish (5mC)75, signalAlign (5mC, 5-hydroxymethylcytosine (5hmC) and 6mA)71, mCaller (5mC and 6mA)76, DeepMod (5mC and 6mA)76, DeepSignal (5mC and 6mA)77 and NanoMod (5mC and 6mA)78. Nanopolish, Megalodon and DeepSignal were recently benchmarked and confirmed to have high accuracy for 5mC detection with single-nucleotide resolution at the single-molecule level79,80. Compared to PacBio, ONT performs better in detecting 5mC but has lower accuracy in detecting 6mA68,75,81.

The possibility of directly detecting N6-methyladenosine (m6A) modifications in RNA molecules was demonstrated using PacBio in 2012 (ref. 82), although few follow-up applications were published. Recently, ONT direct RNA sequencing has yielded robust data of reasonable quality, and several pilot studies have detected bulk-level RNA modifications by examining either error distribution profiles (for example, EpiNano (m6A)73 and ELIGOS (m6A and 5-methoxyuridine (5moU))83) or current signals (for example, Tombo extension (m6A and m5C)74 and MINES (m6A)84). However, detection of RNA modifications with single-nucleotide resolution at the single-molecule level has yet to be demonstrated.

Error correction

Although the average accuracy of ONT sequencing is improving, certain subsets of reads or read fragments have very low accuracy, and the error rates of both 1D reads and 2D/1D2 reads are still much higher than those of short reads generated by next-generation sequencing technologies. Thus, error correction is widely applied before many downstream analyses (for example, genome assembly and gene isoform identification), which can rescue reads for higher sensitivity (for example, mappability85) and improve the quality of the results (for example, break point determination at single-nucleotide resolution86). Two types of error correction algorithms are used85,87 (Fig. 4, middle right, and Table 1): ‘self-correction’ uses graph-based approaches to produce consensus sequences among different molecules from the same origins (for example, Canu88 and LoRMA89) in contrast to 2D and 1D2 reads generated from the same molecules, and ‘hybrid correction’ uses high-accuracy short reads to correct long reads by alignment-based (for example, LSC90 and Nanocorr45), graph-based (for example, LorDEC91) and dual alignment/graph-based algorithms (for example, HALC92). Recently, two benchmark studies demonstrated that the existing hybrid error correction tools (for example, FMLRC93, LSC and LorDEC) together with sufficient short-read coverage can reduce the long-read error rate to a level (~1–4%) similar to that of short reads85,87, whereas self-correction reduces the error rate to ~3–6% (ref. 87), which may be due to non-random systematic errors in ONT data.
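The idea behind consensus-based self-correction can be shown with a deliberately simplified example. The sketch below takes reads that are assumed to be already aligned to one another (equal length, gaps written as "-") and calls the majority base in each column; real tools such as Canu build the alignments themselves and use graph-based consensus, so the function name and input convention here are illustrative assumptions only.

```python
from collections import Counter
from typing import Sequence


def majority_consensus(aligned_reads: Sequence[str], gap: str = "-") -> str:
    """Column-wise majority vote over pre-aligned reads.

    Assumes all reads have the same aligned length; columns whose
    majority symbol is a gap are dropped from the consensus.
    """
    if len({len(read) for read in aligned_reads}) > 1:
        raise ValueError("aligned reads must have equal length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


reads = [
    "ACGT-ACGA",  # error-free with respect to the others
    "ACGTTACGA",  # one insertion
    "ACCT-ACGA",  # one substitution
    "ACGT-ATGA",  # one substitution
]
print(majority_consensus(reads))  # ACGTACGA
```

Random errors at different positions are outvoted, but an error shared by most reads survives the vote, which is one way to picture why non-random systematic errors limit what self-correction can achieve.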

Aligners for error-prone long reads

Alignment tools have been developed to tackle the specific characteristics of error-prone long reads (Table 1). Very early aligners (for example, BLAST94) were developed for small numbers of long reads (for example, Sanger sequencing data). More recently, there has been considerable growth in alignment methods for high-throughput accurate short reads (for example, Illumina sequencing data) in response to the growth in next-generation sequencing. Development of several error-prone long-read aligners was initially motivated by PacBio data, and they were also tested on ONT data. In 2016, the first aligner specifically for ONT reads, GraphMap, was developed95. GraphMap progressively refines candidate alignments to handle high error rates and uses fast graph traversal to align long reads with high speed and precision. Using a seed–chain–align procedure, minimap2 was developed to match increases in ONT read length beyond 100 kb (ref. 96). A recent benchmark paper revealed that minimap2 ran much faster than other long-read aligners (that is, LAST97, NGMLR98 and GraphMap) without sacrificing the accuracy99. In addition, minimap2 can perform splice-aware alignment for ONT cDNA or direct RNA-sequencing reads.
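The first two steps of a seed–chain–align procedure can be sketched in a few lines. The example below is a simplified illustration rather than the minimap2 implementation: it uses every exact k-mer match as a seed (minimap2 uses minimizers), chains seeds with a quadratic dynamic program that finds the longest colinear subset, and omits the final base-level alignment. The function names and the k = 4 toy setting are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Anchor = Tuple[int, int]  # (position in read, position in reference)


def find_seeds(read: str, ref: str, k: int = 4) -> List[Anchor]:
    """Seed step: report every exact k-mer shared by read and reference."""
    index: Dict[str, List[int]] = defaultdict(list)
    for j in range(len(ref) - k + 1):
        index[ref[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(read) - k + 1)
            for j in index.get(read[i:i + k], [])]


def chain_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Chain step: longest subset of anchors increasing in both coordinates."""
    anchors = sorted(anchors)
    if not anchors:
        return []
    best_len = [1] * len(anchors)
    back = [-1] * len(anchors)
    for b, (read_b, ref_b) in enumerate(anchors):
        for a, (read_a, ref_a) in enumerate(anchors[:b]):
            if read_a < read_b and ref_a < ref_b and best_len[a] + 1 > best_len[b]:
                best_len[b] = best_len[a] + 1
                back[b] = a
    end = max(range(len(anchors)), key=best_len.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = back[end]
    return chain[::-1]


ref = "TTGACCGTAGGCTAACGT"
read = "ACCGTTGGCTAA"  # ref[3:15] with one substitution (A to T)
chain = chain_anchors(find_seeds(read, ref))
print(chain)
```

Although the substitution destroys the seeds that overlap it, the anchors on either side remain on the same diagonal (reference position minus read position equals 3), so the chain still locates the read; the skipped region is what the final alignment step would resolve.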

In addition to minimap2, GMAP, published in 2005 (ref. 100), and a new mode of STAR, which was originally developed for short reads101, have been widely used in splice-aware alignment of error-prone transcriptome long reads to genomes. Other aligners have also been developed, such as Graphmap2 (ref. 102) and deSALT103, for ONT transcriptome data. Especially for ONT direct RNA-sequencing reads with dense base modifications, Graphmap2 has a higher alignment rate than minimap2 (ref. 104).

Hybrid sequencing

Many applications combine long reads and short reads in the bioinformatics analyses, termed hybrid sequencing. In contrast to hybrid correction of long reads for general purposes, many hybrid sequencing-based methods integrate long reads and short reads into the algorithms and pipeline designs to harness the strengths of both types of reads to address specific biological problems. The long read length is well suited to identifying large-range genomic complexity with unambiguous alignments, whereas the high accuracy and high throughput of short reads are useful for characterizing local details (for example, splice site detection with single-nucleotide resolution) and improving quantitative analyses. For example, genome105, transcriptome42 and metagenome106 assemblies have shown superior performance with hybrid sequencing data compared to either error-prone long reads alone or high-accuracy short reads alone.

De novo genome assembly

Error-prone long reads have been used for de novo genome assembly. Assemblers (Table 1) such as Canu88 and Miniasm107 are based on the overlap–layout–consensus algorithm, which builds a graph by overlapping similar sequences and is robust to sequencing error58,67,108 (Fig. 4, middle center). To further remove errors, error correction of long reads and polishing of assembled draft genomes (that is, improving accuracy of consensus sequences using raw current data) are often performed before and after assembly, respectively. In addition to the genome-polishing software Nanopolish109, ONT released Medaka, a neural network-based method, aiming for improved accuracy and speed compared to Nanopolish (Table 1).
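The overlap and layout steps of the overlap–layout–consensus approach can be illustrated with a toy greedy assembler. The sketch below assumes error-free reads and exact suffix–prefix overlaps, so the consensus step reduces to simple concatenation; assemblers such as Canu and Miniasm instead tolerate sequencing errors and use far more efficient overlap detection. The function names and the minimum overlap of three bases are hypothetical.

```python
from typing import List, Sequence


def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: Sequence[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(toy_reads))  # ['ATGCGTACGTTAGC']
```

Because every pair of reads is compared, this toy version scales quadratically with the number of reads, which hints at why overlap detection dominates the computational cost of long-read assembly.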

These approaches take into account not only general assembly performance but also certain specific aspects, such as complex genomic regions and computational intensity. For example, Flye improves genome assembly at long and highly repetitive regions by constructing an assembly graph from concatenated disjoint genomic segments110; Miniasm uses all-versus-all read self-mapping for ultrafast assembly107, although postassembly polishing is necessary for higher accuracy. The recently developed assembler wtdbg2 runs much faster than other tools without sacrificing contiguity and accuracy111.

SVs and repetitive regions

When a reference genome is available, ONT data can be used to study sample-specific genomic details, including SVs and haplotypes, with much higher precision than other techniques. A few SV detection tools have been developed (for example, NanoSV112, Sniffles98, Picky33 and NanoVar113) (Fig. 4, bottom center, and Table 1). Picky, in addition to detecting regular SVs, also reveals enriched short-span SVs (~300 bp) in repetitive regions, as long reads cover the entire region including the variations. Given that single long reads can encompass multiple variants, including both SNVs and SVs, it is possible to perform phasing of multiploid genomes as well as other haplotype-resolved analyses112,114,115 with appropriate bioinformatics software, such as LongShot116 for SNV detection and WhatsHap117 for haplotyping/phasing.

Several tools have also been developed to investigate highly repetitive genomic regions by ONT sequencing, such as TLDR for identifying non-reference transposable elements118 and TRiCoLOR for characterizing tandem repeats119 (Table 1).

Transcriptome complexity

When used in transcriptome analyses, ONT reads can be clustered and assembled to reconstruct full-length gene isoforms or aligned to a reference genome to characterize complex transcriptional events42,120,121,122,123 (Fig. 4, bottom right). In particular, several transcript assemblers have been developed specifically for error-prone long reads, such as Traphlor124, FLAIR123, StringTie2 (ref. 125) and TALON126 as well as several based on hybrid sequencing data (for example, IDP127). Notably, IDP-denovo128 and RATTLE129 can perform de novo transcript assembly from long reads without a reference genome. More recently, ONT direct RNA sequencing has made transcriptome-wide investigation of native RNA molecules feasible52,130,131. However, development of corresponding bioinformatics tools, especially for quantitative analyses, remains inadequate.

Applications of nanopore sequencing

The long read length, portability and direct RNA sequencing capability of ONT devices have supported a diverse range of applications (Fig. 5). We review 11 applications that are the subject of the most publications since 2015.

Fig. 5: Applications of ONT sequencing.

ONT sequencing applications are classified into three major groups (basic research, clinical usage and on-site applications) and are shown as a pie chart. The classifications are further categorized by specific topics, and the slice area is proportional to the number of publications (in log2 scale). Some applications span two categories, such as SV detection and rapid pathogen detection. The applications are also organized by the corresponding strengths of ONT sequencing as three layers of the pie chart: (1) long read length, (2) native single molecule and (3) portable, affordable and real time. The width of each layer is proportional to the number of publications (in log2 scale). Some applications that use all three strengths span all three layers (for example, antimicrobial resistance profiling). ‘Fungus’ includes Candida auris, ‘bacterium’ includes Salmonella, Neisseria meningitidis and Klebsiella pneumoniae and ‘virus’ includes severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Ebola, Zika, Venezuelan equine encephalitis, yellow fever, Lassa fever and dengue; HLA, human leukocyte antigens.

Closing gaps in reference genomes

Genome assembly is one of the main uses of ONT sequencing (~30% of published ONT applications; Fig. 5). For species with available reference genomes, ONT long reads are useful for closing genome gaps, especially in the human genome. For example, ONT reads have been used to close 12 gaps (>50 kb for each gap) in the human reference genome and to measure the length of telomeric repeats132 and also to assemble the centromeric region of the human Y chromosome133. Moreover, ONT enabled the first gapless telomere-to-telomere assembly of the human X chromosome, including reconstruction of a ~2.8 Mb centromeric satellite DNA array and closing of all remaining 29 gaps (totaling 1.1 Mb)134. The Telomere-to-Telomere Consortium reported the first complete human genome (T2T-CHM13) of the size 3.055 Gb (ref. 135).

The Caenorhabditis elegans reference genome has also been expanded by >2 Mb through accurate identification of repetitive regions using ONT long reads136. Similar progress has been achieved in other model organisms and closely related species (for example, Escherichia coli109, Saccharomyces cerevisiae137, Arabidopsis thaliana138 and 15 Drosophila species139) as well as in non-model organisms, including characterizing large tandem repeats in the bread wheat genome140 and improving the continuity and completeness of the genome of Trypanosoma cruzi (the parasite causing Chagas disease)141.

Building new reference genomes

ONT long reads have been used extensively to assemble the initial reference genomes of many non-model organisms. For instance, ONT data alone were used to assemble the first genome of Rhizoctonia solani (a pathogenic fungal species that causes damping-off diseases in a wide range of crops)142, and hybrid sequencing data (ONT plus Illumina) were used to assemble the first draft genomes of Maccullochella peelii (Australia’s largest freshwater fish)143 and Amphiprion ocellaris (the common clown fish)144. In more complicated cases, ONT long reads have been integrated with one or more other techniques (for example, Illumina short reads, PacBio long reads, 10x Genomics linked reads, optical mapping by Bionano Genomics and spatial distance by Hi-C) to assemble the initial reference genomes of many species, such as Maniola jurtina (the meadow brown butterfly, a model for ecological genetics)145, Varanus komodoensis (the largest extant monitor lizard)146, Pavo cristatus (the national bird of India)147, Panthera leo (the lion)148 and Eumeta variegata (a bagworm moth that produces silk with potential in biomaterial design)149. In addition, ONT direct RNA sequencing has been used to construct RNA viral genomes while eliminating the need for the conventional reverse transcription step, including Mayaro virus150, Venezuelan equine encephalitis virus150, chikungunya virus150, Zika virus150, vesicular stomatitis Indiana virus150, Oropouche virus150, influenza A53 and human coronavirus86. For small DNA/RNA viral genomes (for example, the 27-kb human coronavirus genome86), the assembly process is not required given the long read length.

In the SARS-CoV-2 pandemic151, ONT sequencing was used to reconstruct full-length SARS-CoV-2 genome sequences via cDNA and direct RNA sequencing152,153,154,155, providing valuable information regarding the biology, evolution and pathogenicity of the virus.

The increasing yield, read length and accuracy of ONT data enable much more time- and cost-efficient assembly of genomes of all sizes. These range from bacterial genomes of several megabases109, through genomes in the hundreds of megabases (fruit fly139,156, fish143,144,157, blood clam158, banana159, cabbage159 and walnut160,161) and genomes of a few gigabases (Komodo dragon146, Steller sea lion162, lettuce (https://nanoporetech.com/resource-centre/tip-iceberg-sequencing-lettuce-genome) and giant sequoia163), to the 27–34-Gb genomes of coast redwood (https://www.savetheredwoods.org/project/redwood-genome-project/) and tulip (https://nanoporetech.com/resource-centre/beauty-and-beast). Only three PromethION flow cells were required to sequence the human genome, and the computational assembly took <6 h164.

Identifying large SVs

A powerful application of ONT long reads is to identify large SVs (especially in humans) in biomedical contexts. Examples include SV detection in the breast cancer cell line HCC1187 (ref. 33) and in individuals with acute myeloid leukemia113, the construction of the first haplotype-resolved SV spectra for two individuals with congenital abnormalities112 and the identification of 29,436 SVs in the Yoruban individual NA19240 (ref. 165).

Characterizing full-length transcriptomes and complex transcriptional events

A comprehensive examination of the feasibility of ONT cDNA sequencing (with R7 and R9 nanopores) in transcriptome analyses demonstrated that its performance in gene isoform identification is similar to that of PacBio long reads, both of which are superior to Illumina short reads42. With ONT data alone, there remain drawbacks in estimating gene/isoform abundance, detecting splice sites and mapping alternative polyadenylation sites, although recent improvements in accuracy and throughput have advanced these analyses. Nevertheless, ONT cDNA sequencing has been tested in individual B cells from mice120 and humans122,166. Furthermore, ONT direct RNA sequencing has been used to measure the poly(A) tail length of native RNA molecules in humans131, C. elegans167, A. thaliana168 and Locusta migratoria169, corroborating a negative correlation between poly(A) tail length and gene expression167,168. In addition, the full-length isoforms of human circular RNAs have been characterized by ONT sequencing following rolling circle amplification170,171.

Characterizing epigenetic marks

As early as 2013, independent reports demonstrated that methylated cytosines (5mC and 5hmC) in DNA could be distinguished from native cytosine by the characteristic current signals measured using the MspA nanopore172,173. Later, bioinformatics tools were developed to identify three kinds of DNA modifications (6mA, 5mC and 5hmC) from ONT data71,75. Recently, ONT was applied to characterize the methylomes from different biological samples, such as 6mA in a microbial reference community174 as well as 5mC and 6mA in E. coli, Chlamydomonas reinhardtii and human genomes76.

Mapping DNA modifications using ONT sequencing in combination with exogenous methyltransferase treatment (inducing 5mC at GpC sites) led to the development of an experimental and bioinformatics approach, MeSMLR-seq, that maps nucleosome occupancy and chromatin accessibility at the single-molecule level and at long-range scale in S. cerevisiae72 (Table 1). Later, another method, SMAC-seq, adopted the same strategy with the additional exogenous modification 6mA to improve the resolution of mapping nucleosome occupancy and chromatin accessibility175. Similarly, multiple epigenetic features, including the endogenous 5mC methylome (at CpG sites), nucleosome occupancy and chromatin accessibility, can be simultaneously characterized on single long human DNA molecules by MeSMLR-seq (K.F.A., unpublished data, and ref. 176). Such epigenome analyses can be performed in a haplotype-resolved manner and thus will be informative for discovering allele-specific methylation linked to imprinted genes as well as for phasing genomic variants and chromatin states, even in heterogeneous cancer samples.
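The footprinting logic behind such methyltransferase-based assays can be sketched in a few lines. The exogenous enzyme methylates accessible GpC sites, so a long run of unmethylated GpC sites on a single read suggests DNA that was protected, for example, by a nucleosome. The Python sketch below scans per-site methylation calls on one read and reports unmethylated runs that span at least a minimum length. The function name, the input layout and the 100-bp threshold are assumptions chosen for illustration and do not reproduce the published MeSMLR-seq or SMAC-seq pipelines.

```python
from typing import List, Tuple


def protected_footprints(
    calls: List[Tuple[int, bool]], min_span: int = 100
) -> List[Tuple[int, int]]:
    """Report runs of unmethylated GpC sites on a single read.

    `calls` is a position-sorted list of (position, is_methylated) pairs.
    A run is reported as (first_site, last_site) when it spans at least
    `min_span` bp (an illustrative threshold, not a published value).
    """
    footprints = []
    run_start = None
    run_end = None
    for pos, methylated in calls:
        if not methylated:
            # Extend (or open) the current unmethylated run
            if run_start is None:
                run_start = pos
            run_end = pos
        else:
            # A methylated (accessible) site closes the current run
            if run_start is not None and run_end - run_start >= min_span:
                footprints.append((run_start, run_end))
            run_start = run_end = None
    # Close a run that extends to the end of the read
    if run_start is not None and run_end - run_start >= min_span:
        footprints.append((run_start, run_end))
    return footprints


if __name__ == "__main__":
    toy_calls = [
        (10, True), (50, False), (120, False), (190, False), (220, True),
        (260, False), (300, True), (340, False), (500, False),
    ]
    print(protected_footprints(toy_calls))
```

Because every call comes from the same molecule, neighboring footprints on one read can be compared directly, which is what gives these assays their single-molecule, long-range view.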

Similarly, several other methods have combined various biochemical techniques with ONT sequencing (Table 1). For example, the movement of DNA replication forks on single DNA molecules has been measured by detection of nucleotide analogs (for example, 5-bromodeoxyuridine (5-BrdU)) using ONT sequencing177,178,179, and the 3D chromatin organization in human cells has been analyzed by integrating a chromatin conformation capture technique and ONT sequencing to capture multiple loci in close spatial proximity by single reads180. Two other experimental assays, DiMeLo-seq181 and BIND&MODIFY182, use ONT sequencing to map histone modifications (H3K9me3 and H3K27me3), a histone variant (CENP-A) and other specific protein–DNA interactions (for example, CTCF binding profile). They both construct a fusion protein of the adenosine methyltransferase and protein A to convert specific protein–DNA interactions to an artificial 6mA profile, which is subsequently detected by ONT sequencing.

Detecting RNA modifications

Compared to existing antibody-based approaches (which are usually followed by short-read sequencing), ONT direct RNA sequencing opens opportunities to directly identify RNA modifications (for example, m6A) and RNA editing (for example, inosine), which have critical biological functions. In 2018, distinct ionic current signals for unmodified and modified bases (for example, m6A and m5C) in ONT direct RNA-sequencing data were reported52. Since then, epitranscriptome analyses using ONT sequencing have progressed rapidly, including detection of 7-methylguanosine (m7G) and pseudouridine in 16S rRNAs of E. coli183, m6A in mRNAs of S. cerevisiae73 and A. thaliana168 and m6A130 and pseudouridine104 in human RNAs. Recent independent research (K.F.A., unpublished data, and refs. 184,185) has revealed that it is possible to probe RNA secondary structure using a combination of ONT direct RNA sequencing and artificial chemical modifications (Table 1). The dynamics of RNA metabolism were also analyzed by labeling nascent RNAs with base analogs (for example, 5-ethynyluridine186 and 4-thiouridine187) followed by ONT direct RNA sequencing (Table 1).

Cancer

ONT sequencing has been applied to many cancer types, including leukemia188,189,190,191,192, breast33,176,193, brain193, colorectal194, pancreatic195 and lung196 cancers, to identify genomic variants of interest, especially large and complex ones. For example, ONT amplicon sequencing was used to identify TP53 mutations in 12 individuals with chronic lymphocytic leukemia188. Likewise, MinION sequencing data revealed BCR-ABL1 kinase domain mutations in 19 individuals with chronic myeloid leukemia and 5 individuals with acute lymphoblastic leukemia with superior sensitivity and time efficiency compared to Sanger sequencing189. Additionally, ONT whole-genome sequencing was used to rapidly detect chromosomal translocations and precisely determine the breakpoints in an individual with acute myeloid leukemia192.

A combination of Cas9-assisted target enrichment and ONT sequencing has characterized a 200-kb region spanning the breast cancer susceptibility gene BRCA1 and its flanking regions despite a high repetitive sequence fraction (>50%) and large gene size (~80 kb)197. This study provided a template for the analysis of full variant profiles of disease-related genes.

The ability to directly detect DNA modifications using ONT data has enabled the simultaneous capture of genomic (that is, copy number variation) and epigenomic (that is, 5mC) alterations using only ONT data from brain tumor samples193. The whole workflow (from sample collection to bioinformatics results) was completed in a single day, delivering a multimodal and rapid molecular diagnostic for cancers. In addition, same-day detection of fusion genes in clinical specimens has also been demonstrated by MinION cDNA sequencing198.

Infectious disease

Because of its fast real-time sequencing capabilities and small size, MinION has been used for rapid pathogen detection, including diagnosis of bacterial meningitis199, bacterial lower respiratory tract infection200, infective endocarditis201, pneumonia202 and infection in prosthetic joints203. In the example of bacterial meningitis, 16S amplicon sequencing took only 10 min using MinION to identify pathogenic bacteria in all six retrospective cases, making MinION particularly useful for the early administration of antibiotics through timely detection of bacterial infections199. Likewise, clinical diagnosis of bacterial lower respiratory tract infection using MinION was faster (6 h versus >2 d) and had higher sensitivity than existing culture-based ‘gold standard’ methods200.

In addition to pathogen detection, ONT sequencing can accelerate the profiling of antibiotic/antimicrobial resistance in bacteria and other microbes. For example, MinION sequencing of clinical urine samples (without culture) identified 51 of the 55 acquired resistance genes that were detected in cultivated bacteria using Illumina sequencing204, and a recent survey of resistance to colistin in 12,053 Salmonella strains used a combination of ONT, PacBio and Illumina data205. Indeed, ONT sequencing is useful for detecting specific species and strains (for example, virulent ones) from microbiome samples given the unambiguous mappability of longer reads, which provides more accurate estimates of microbiome composition than conventional studies relying on 16S rRNA and DNA amplicons57,206.

Genetic disease

ONT long reads have been applied to characterize complex genomic rearrangements in individuals with genetic disorders. For example, ONT sequencing of human genomes revealed that an expansion of tandem repeats in the ABCA7 gene was associated with an increased risk of Alzheimer’s disease207. ONT sequencing was also used to discover a new 3.8-Mb duplication in the intronic region of the F8 gene in an individual with hemophilia A208. Other examples cover a large range of diseases and conditions, including autism spectrum disorder209, Temple syndrome210, congenital abnormalities112, glycogen storage disease type Ia (ref. 211), intellectual disability and seizures212, epilepsy213,214, Parkinson’s disease215, Gaucher disease215, ataxia-pancytopenia syndrome and severe immune dysregulation114.

In another clinical application, human leukocyte antigen genotyping benefited from the improved accuracy of the R9.5 nanopore216,217,218. MinION enabled the detection of aneuploidy in prenatal and miscarriage samples in 4 h compared to 1–3 weeks with conventional techniques219.

Outbreak surveillance

The portable MinION device allows in-field and real-time genomic surveillance of emerging infectious diseases, aiding in phylogenetic and epidemiological investigations such as characterization of evolution rate, diagnostic targets, response to treatment and transmission rate. In April 2015, MinION devices were shipped to Guinea for real-time genomic surveillance of the ongoing Ebola outbreak. Only 15–60 min of sequencing per sample was required220. Likewise, a hospital outbreak of Salmonella was monitored with MinION, with positive cases identified within 2 h (ref. 221). MinION was also used to conduct genomic surveillance for Zika virus222, yellow fever virus223 and dengue virus224 outbreaks in Brazil.

With the increasing throughput of ONT sequencing, real-time surveillance has been applied to pathogens with larger genomes over the years, ranging from viruses of a few kilobases (for example, Ebola virus220, 18–19 kb; Zika virus222, 11 kb; Venezuelan equine encephalitis virus225, 11.4 kb; Lassa fever virus226, 10.4 kb and SARS-CoV-2 coronavirus151, 29.8 kb) to bacteria of several megabases (for example, Salmonella221, 5 Mb; N. meningitidis227, 2 Mb and K. pneumoniae228, 5.4 Mb) and to human fungal pathogens with genomes of >10 Mb (for example, Candida auris229, 12 Mb).

Other on-site applications

Portable ONT devices have also been used for on-site metagenomics research. MinION characterized pathogenic microbes, virulence genes and antimicrobial resistance markers in the polluted Little Bighorn River, Montana, United States230. MinION and MinIT devices were brought to farms in sub-Saharan Africa for early and rapid diagnosis (<3 h) of plant viruses and pests in cassava231. In forensic research, a portable strategy known as ‘MinION sketching’ was developed to identify human DNA with only 3 min of sequencing232, offering a rapid solution to cell authentication or contamination identification during cell or tissue culture.

The portability of the MinION system, which consists of the palm-sized MinION, mobile DNA extraction devices (for example, VolTRAX and Bento Lab) and real-time onboard base calling with Guppy and other offline bioinformatics tools, enables field research in scenarios where samples are hard to culture or store or where rapid genomic information is needed233. Examples include the International Space Station, future exploration of Mars and the Moon involving microgravity and high levels of ionizing radiation69,234,235, ships236, Greenland glaciers at subzero temperatures237, conservation work in the Madagascar forest238 and educational outreach238.

Outlook

Nanopore sequencing has enabled many biomedical studies by providing ultralong reads from single DNA/RNA molecules in real time. Nonetheless, current ONT sequencing techniques have several limitations, including relatively high error rates and the requirement for relatively high amounts of nucleic acid material. Overcoming these challenges will require further breakthroughs in nanopore technology, molecular experiments and bioinformatics software.

The principal concern in many applications is the error rate, which, at 6–15% for the R9.4 nanopore, is still much higher than that of Illumina short-read sequencing (0.1–1%). Despite substantial improvements in data accuracy over the past 7 years, there may be an intrinsic limit to 1D read accuracy. The sequencing of single molecules has a low signal-to-noise ratio, in contrast to bulk sequencing of molecules as in Illumina sequencing. Indeed, the same issue arises in the other single-molecule measurement techniques, such as Helicos, PacBio and BioNano Genomics. There is currently no theoretical estimation of this limit, but for reference, Helicos managed to reduce error rates to 4% (ref. 239). Future improvements in accuracy can be expected through optimization of molecule translocation ratcheting and, in particular, through engineering existing nanopores or discovering new ones. Indeed, many studies have been exploring new biological or non-biological nanopores with shorter sensing regions to achieve context-independent and high-quality raw signals. For example, graphene-based nanopores are capable of DNA sensing and have high durability and insulating capability in high ionic strength solutions240,241,242, where their thickness (~0.35 nm) is ideal for capturing single nucleotides243. Because such context-independent signals minimize the complex signal interference between adjacent modified bases, they could also make it possible to detect base modifications at single-molecule and single-nucleotide resolutions. Another approach for improving 1D read accuracy is to develop base-calling methods based on advanced computational techniques, such as deep learning.
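For comparison across platforms, per-base error rates are often expressed on the Phred quality scale, Q = -10 × log10(p), where p is the error probability. The short Python sketch below converts the error rates quoted above to Phred values. The function name is illustrative, and the conversion is the standard Phred definition rather than anything specific to ONT software.

```python
import math


def phred(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred quality value."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Error rates as quoted in the text above
    quoted = [
        ("R9.4 nanopore, lower bound", 0.06),
        ("R9.4 nanopore, upper bound", 0.15),
        ("Illumina, lower bound", 0.001),
        ("Illumina, upper bound", 0.01),
        ("Helicos", 0.04),
    ]
    for label, rate in quoted:
        print(f"{label}: {rate:.1%} error is about Q{phred(rate):.1f}")
```

On this scale, an error rate of 6–15% corresponds to roughly Q12 to Q8, whereas 0.1–1% corresponds to Q30 to Q20, so each tenfold reduction in error adds 10 to the quality value.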

Repetitive sequencing of the same molecule, for example, using 2D and 1D² reads, was helpful in improving accuracy. However, both of these approaches were limited in that each molecule could only be measured twice. By contrast, the R2C2 protocol involves the generation and sequencing of multiple copies of target molecules122. It may also be possible to increase data accuracy by recapturing DNA molecules into the same nanopore244 or by using multilayer nanopores for multiple sequencing of each molecule.
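The benefit of reading the same molecule more than twice can be illustrated with a simple probability model. The Python sketch below computes the chance that a per-position majority vote over n copies is wrong when each copy is wrong independently with probability p. The independence assumption is a strong simplification made only for illustration: errors with a systematic component would not average out in this way, and actual consensus methods are more sophisticated than a majority vote. The function name and the example error rate of 10% are hypothetical.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n independent copies are wrong.

    Assumes each copy is wrong at a given position independently with
    probability p, and pessimistically treats all wrong copies as agreeing.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("use an odd number of copies to avoid tied votes")
    need = n // 2 + 1  # smallest number of wrong copies that wins the vote
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(need, n + 1)
    )


if __name__ == "__main__":
    for copies in (1, 3, 5, 9):
        print(f"{copies} copies: consensus error {majority_error(0.10, copies):.5f}")
```

Under these assumptions, three copies reduce a 10% per-read error to 2.8% and five copies to below 1%, whereas two copies cannot resolve a disagreement by voting at all. This is consistent with the advantage of protocols that sequence many copies of each molecule.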

Improved data accuracy would advance single-molecule omics studies. Haplotype-resolved genome assembly has been demonstrated for PacBio data245 and could likely also be achieved using ONT sequencing. Methods are being developed to characterize epigenomic and epitranscriptomic events beyond base modifications at the single-molecule level, such as nucleosome occupancy and chromatin accessibility72,175,176 and RNA secondary structure184,185. These approaches would allow investigation of the heterogeneity and dynamics of the epigenome and epitranscriptome as well as analysis of allele-specific and/or strand-specific epigenomic and epitranscriptomic phenomena. They would require specific experimental protocols (for example, identifying chromatin accessibility by detecting artificial 5mC footprints72,175,176) rather than the simple generation of long reads.

Although the ultralong read length of ONT data remains its principal strength, further increases in read length would be beneficial, facilitating genome assembly and the sequencing of difficult-to-analyze genomic regions (for example, eukaryotic centromeres and telomeres). Once read lengths reach a certain range, or even cover entire chromosomes, genome assembly would become trivial, requiring little computation and having superior completeness and accuracy. Personalized genome assembly would become widely available, and it would be possible to assemble the genomes of millions of species across Earth’s many ecosystems. Obtaining megabase-scale or longer reads will require the development of HMW DNA extraction and size selection methods as well as protocols to maintain ultralong DNA fragments intact.

The other key experimental barrier to be addressed is the large amount of input DNA and RNA required for ONT sequencing, which is up to a few micrograms of DNA and hundreds of nanograms of RNA. PCR amplification of DNA is impractical for very long reads and impermissible for native DNA/RNA sequencing. Reducing the sample size requirement would make ONT sequencing useful for the many biomedical studies in which genetic material is limited. In parallel, ONT sequencing will benefit from the development of an end-to-end system. For example, the integration and automation of DNA/RNA extraction systems, sequencing library preparation and loading systems would allow users without specific training to generate ONT sequencing data. More robust and user-friendly bioinformatics software, together with cloud storage and computing and real-time analysis, will provide a further boost to ONT sequencing applications, ultimately moving the technology beyond the lab and into daily life.