A broad range of chemical modifications adorns both DNA and RNA molecules, expanding their lexicon, modulating gene expression patterns and governing key biological processes. In DNA, all four bases are subject to modifications1 (Fig. 1) that are vital in regulating development through actions such as genomic imprinting or X chromosome inactivation2. Similarly, RNA molecules are decorated with over 170 different chemical modifications3 (Fig. 1), with central roles in cellular fate, neuronal function and sex determination4. In the past few years, work has shown that RNA modifications can be environmentally driven, affecting the health status not only of the exposed organisms but also across generations5. Notably, dysregulation of the DNA and RNA modification machinery has been associated with several human diseases, highlighting their importance in correct cellular function6,7.

Fig. 1: List of DNA and RNA modifications, and sequencing-based methods used to detect them.

Known DNA (left) and RNA (right) modifications classified by their reference nucleotide; sequencing methods used to detect them are indicated. Adapted from ref. 6.

Short-read next-generation-sequencing technologies have revolutionized our understanding of DNA and RNA modifications. Two main strategies have typically been used to map modifications in a genome- or transcriptome-wide fashion: (1) chemical-based detection, in which chemical compounds selectively react with modified bases and induce a quantifiable signature, and (2) antibody-based detection, in which modified nucleic acids are selectively captured via immunoprecipitation. A well-known example of chemical-based modification detection is bisulfite sequencing to detect 5-methylcytosine both in DNA (5mC)8 and RNA (m5C)9. Antibody-based detection methods are commonly used to map RNA modifications such as N6-methyladenosine (m6A)10,11 or N1-methyladenosine (m1A)12,13.

Despite compelling discoveries made using these approaches, short-read sequencing technologies often fail to capture the diversity and plasticity of DNA and RNA modifications. Indeed, the vast majority of modifications cannot be mapped transcriptome-wide using short-read technologies (Fig. 1), mainly owing to the modest repertoire of commercial antibodies and/or chemicals that selectively recognize them, which hinders our understanding of their biological function and dynamics14. Moreover, even when these reagents are available, short-read methods suffer from severe caveats: they are often not quantitative, have varied false-positive rates, are inconsistent when using distinct antibodies or batches of the same antibody, do not provide isoform-specific information, and require multiple ligation steps and extensive PCR amplification, introducing biases to the data15. Finally, methods based on next-generation sequencing can detect only one modification at a time, leading to predetermined choices of which modifications might be important for a given biological process. As a consequence, information regarding the interplay, codependencies and synergies between various modifications along a transcript is lost.

In recent years, long-read third-generation-sequencing technologies (LRS) have emerged as promising alternatives to approaches based on short-read next-generation sequencing. These platforms can potentially capture information about assorted modification types simultaneously, in full-length reads, at single-molecule resolution, and at a genome- and transcriptome-wide scale16. Consequently, these platforms can help to reveal how distinct modification types synergize in cellular contexts; which targets of modification enzymes remain unidentified; why dysregulation of these enzymes leads to distinct human diseases; and whether altered RNA and/or DNA modification ‘profiles’ in disease could be used as diagnostic and prognostic markers.

Exploring the modification landscape using LRS

Two major technologies currently dominate the long-read-sequencing scene: (1) single-molecule real-time (SMRT) sequencing technologies, offered by Pacific Biosciences, and (2) nanopore-based sequencing technologies, typically commercialized by Oxford Nanopore Technologies.

SMRT sequencing, which was commercially released in 2011, enables the real-time detection of individual fluorescently labeled nucleotides that are enzymatically incorporated during the elongation of the replicate strand from a non-amplified template17 (Fig. 2). Incorporated nucleotides are detected on the basis of the associated fluorophore that is released and dissipated upon cleavage of the phosphate chain. Notably, base modifications affect the speed at which the polymerase progresses, allowing their presence to be inferred from the delay between these fluorescence events (known as interpulse duration). Detection requires comparison to an unmodified reference, such as amplified DNA. SMRT sequencing has been successfully applied for the detection of several DNA modifications, including 6-methyladenosine (6mA), 4-methylcytosine (4mC), 5mC, and 5-hydroxymethylcytosine (5hmC) (Fig. 1). However, the detection of RNA modifications with SMRT sequencing has lagged behind. A proof-of-principle method for detecting RNA modifications was published in 2013, in which reverse transcriptases from HIV-1 and avian myeloblastosis virus (AMV) were loaded onto a zero-mode waveguide chip, and reverse transcriptase dynamics were used to identify the presence of m6A RNA modifications18. However, this avenue of research has not been further pursued, and the chips containing the HIV-1 and AMV reverse transcriptases are not commercially available, which limits the applicability of SMRT sequencing for the detection of RNA modifications.
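The comparison of interpulse durations against an unmodified reference can be illustrated with a minimal Python sketch. It is not the algorithm used by any SMRT analysis software: the function names, the fixed ratio threshold and the toy interpulse durations are all assumptions made for illustration, and real pipelines rely on statistical tests rather than a simple cutoff.

```python
from statistics import mean

def ipd_ratios(native_ipds, control_ipds):
    """Return, per position, mean native IPD divided by mean control IPD.

    Both arguments map a template position to a list of interpulse
    durations (seconds) observed across reads (hypothetical input format).
    """
    ratios = {}
    for pos, native in native_ipds.items():
        control = control_ipds.get(pos)
        if not native or not control:
            continue  # skip positions lacking observations in either sample
        ratios[pos] = mean(native) / mean(control)
    return ratios

def flag_candidates(ratios, threshold=2.0):
    """Positions whose IPD ratio meets an arbitrary, illustrative threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)

# Toy values invented for illustration: position 2 shows a slowed polymerase.
native = {1: [0.20, 0.22, 0.18], 2: [0.90, 1.10, 1.00], 3: [0.25, 0.25]}
control = {1: [0.20, 0.20, 0.20], 2: [0.25, 0.25, 0.25], 3: [0.25, 0.25]}

ratios = ipd_ratios(native, control)
candidates = flag_candidates(ratios)
print(candidates)
```

Because each ratio averages many polymerase passes, modifications with a subtle kinetic effect need more observations per position before the ratio separates from noise.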

Fig. 2: Overview of methods for detecting DNA and RNA modifications using long-read sequencing.

Left, in SMRT sequencing, modified bases are detected by an increased time between fluorescent pulses (known as interpulse duration), resulting from the effect that modifications have on polymerase processing speed. Right, in nanopore sequencing, the signal is read from the ‘sensing region’ (corresponding to approximately five nucleotides). Modified DNA or RNA bases cause distinct current changes compared to unmodified bases, and therefore can be identified by comparing changes in raw current intensity, base-calling errors or using pretrained base-calling algorithms. Images adapted with permission from ref. 32, Springer Nature Limited (left), ref. 33, Springer Nature Limited (right).

Nanopore sequencing technologies use nanoscale protein pores (‘nanopores’) as biosensors embedded in an electrically resistant polymer membrane. A constant voltage is applied to produce an ionic current, so that negatively charged nucleic acids translocate through the nanopores. In the case of Oxford Nanopore Technologies (which was first commercialized in 2014), translocation speed is controlled by a helicase protein that is loaded onto the nucleic acids during the library preparation and that ratchets the template through the nanopore in a stepwise manner (typically at an average speed of 70 bases per second in RNA and 450 bases per second in DNA, in the current versions of the library preparation kits). Different nucleotides confer different resistances within the pore: the sequence of bases being translocated through the nanopore can therefore be inferred from current intensity patterns via machine-learning algorithms, in a process that is commonly referred to as ‘base-calling’. Notably, modified DNA or RNA bases cause characteristic current changes compared to unmodified bases, thus making it possible to identify DNA and RNA modifications in individual molecules without any chemical pretreatments19 (Fig. 2). Detection is not limited to naturally occurring modifications but also encompasses artificially induced modifications, which can be used to characterize RNA structure20 and to detect nascent RNAs21 and genome replication22.

Recent work has shown that DNA and RNA modifications can be identified in nanopore sequencing datasets: (1) by comparison to an in silico reference or control unmodified sample (for example, amplified DNA, in vitro-transcribed RNA or a knockout model)23 or (2) ‘de novo’, by using a modification-aware base-caller that can call the modified base24 in addition to the unmodified ones25. The latter option requires a pretrained base-calling model, which must be trained with both modified and unmodified data for which the ground truth is known.
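The first strategy, comparison against an unmodified control, can be sketched in a few lines of Python using base-calling errors as the signal. This is a simplified illustration under stated assumptions, not the method of any published tool: the input format (per-position match and mismatch counts), the function names and the thresholds `min_delta` and `min_cov` are hypothetical.

```python
def mismatch_frequency(counts):
    """Fraction of reads whose base call disagrees with the reference."""
    total = counts["match"] + counts["mismatch"]
    return counts["mismatch"] / total if total else 0.0

def differential_error_sites(sample, control, min_delta=0.2, min_cov=10):
    """Positions where the sample mismatch frequency exceeds the control's.

    `sample` and `control` map position -> {"match": int, "mismatch": int}.
    Positions below `min_cov` reads in either dataset are ignored.
    """
    sites = []
    for pos in sorted(sample.keys() & control.keys()):
        s, c = sample[pos], control[pos]
        coverage = min(s["match"] + s["mismatch"], c["match"] + c["mismatch"])
        if coverage < min_cov:
            continue  # too few reads to compare reliably
        delta = mismatch_frequency(s) - mismatch_frequency(c)
        if delta >= min_delta:
            sites.append((pos, delta))
    return sites

# Toy counts invented for illustration.
sample = {
    100: {"match": 60, "mismatch": 40},
    101: {"match": 95, "mismatch": 5},
    102: {"match": 3, "mismatch": 2},
}
control = {
    100: {"match": 95, "mismatch": 5},
    101: {"match": 94, "mismatch": 6},
    102: {"match": 5, "mismatch": 0},
}

sites = differential_error_sites(sample, control)
print(sites)
```

Subtracting the control frequency removes errors that arise from sequence context alone, which is why an unmodified sample of the same sequence is needed for this approach.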

Challenges in capturing modification diversity

Third-generation sequencing technologies can improve the quality and complexity of DNA and RNA modification maps that are obtained. And yet long-read sequencing has still not been adopted as a mainstream sequencing technology. There are several reasons why this is the case. One is skepticism from the community, which is partly caused by the poor accuracy that these technologies displayed in their earlier days. Even nowadays, frequently asked questions are whether these technologies are sufficiently accurate and whether their error rate is too high for them to be used. Although it is true that short-read sequencing technologies win the sequencing accuracy race, the scientific community needs to move on from this perception: with the latest base-calling algorithms, the accuracy of DNA sequencing is as high as 99.9% for SMRT sequencing and 99.6% for nanopore sequencing.

That being said, both SMRT and nanopore sequencing face important challenges and limitations that need to be addressed before they can be adopted as mainstream sequencing technologies. In the case of SMRT sequencing, a major limitation is that many modifications do not sufficiently affect the polymerase dynamics to be detected at a useful sensitivity. Although it is feasible to detect 6mA, 4mC, 5mC and 5hmC DNA modifications, other modifications have a more subtle effect on polymerase kinetics and thus require greater coverage to obtain a similar confidence level. For example, with current chemistries 6mA and 4mC modifications require a minimum of 25× coverage per strand, whereas 250× coverage is required for 5mC and 5hmC DNA modifications, which is difficult to obtain for large genomes. Moreover, modification detection is not based on direct detection at single-molecule resolution but rather takes the form of aggregation of the subtle effect of base modifications on polymerase dynamics during DNA synthesis. Software improvements are challenging; therefore, improvements in sequencing chemistries are probably required to address this limitation.
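To illustrate why these coverage requirements become limiting for large genomes, the following Python sketch converts a per-strand coverage into the total number of template bases that must be sequenced. It assumes that both the 25× and the 250× figures apply per strand and uses approximate genome sizes (about 4.6 Mb for Escherichia coli and about 3.1 Gb for human); these choices are illustrative assumptions rather than values stated in the text.

```python
def required_bases(genome_size_bp, coverage_per_strand, strands=2):
    """Total bases to sequence for a given per-strand coverage."""
    return genome_size_bp * coverage_per_strand * strands

# Approximate genome sizes, assumed for illustration only.
E_COLI_BP = 4_600_000
HUMAN_BP = 3_100_000_000

ecoli_6ma = required_bases(E_COLI_BP, 25)    # 6mA or 4mC at 25x per strand
ecoli_5mc = required_bases(E_COLI_BP, 250)   # 5mC or 5hmC at 250x per strand
human_5mc = required_bases(HUMAN_BP, 250)

for label, bases in [
    ("E. coli, 25x per strand", ecoli_6ma),
    ("E. coli, 250x per strand", ecoli_5mc),
    ("human, 250x per strand", human_5mc),
]:
    print(f"{label}: {bases / 1e9:.2f} Gb")
```

Under these assumptions, a bacterial genome needs a few gigabases at most, whereas 5mC detection in a human genome would need on the order of 1,500 Gb.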

In the case of nanopore sequencing, applicability for studying DNA and RNA modifications is limited by the lack of modification-aware base-calling algorithms that will work for any given sequence context. To date, efforts to detect modifications de novo via base-calling models have primarily been focused on DNA modifications: separate species- and/or context-specific modification-aware base-calling models are available for 4mC, 5mC, 5hmC and 6mA. Some ‘all-context’ models have been released but, in our experience, these models have very high false-positive rates. Modification-aware RNA base-calling models are currently unavailable, with the exception of a few models released by the community that were trained on a small subset of synthetic sequences24,26 (limiting their applicability transcriptome-wide). Recent efforts have transitioned toward separating the process of base calling from modification detection: in the first step, the nucleic acid sequence is predicted and in the second step, each nucleotide is classified as modified or unmodified. However, this two-step model also requires modified and unmodified training datasets similar to those needed to train a modification-aware base-calling model. Obtaining high-quality and diverse ‘training sets’ for multiple modifications, and across diverse sequence contexts, is possibly one of the major challenges that needs to be solved in the years to come27,28.

Moreover, multiple modification types coexist in the same template and are sometimes separated by very small distances, as in the case of tRNAs (which typically harbor ten different types of RNA modification per molecule). In nanopore sequencing, the signal from neighboring modifications can spill over and affect adjacent bases, confounding detection attempts. For the concurrent detection of multiple modifications in single RNA molecules, we foresee that the combination of improved base-calling algorithms, the availability of ground-truth datasets on which to train, and more-sensitive nanopores with increased signal-to-noise ratios will probably improve our ability to detect nearby modifications within the same molecule in the near future.

Long-read methodologies also suffer from requiring high input amounts to start the libraries, as compared to methods based on next-generation sequencing. In direct RNA sequencing, the high input amounts required (50 to 500 ng of polyA-enriched RNA material, depending on the protocol) and the low sequencing yields that are obtained (typically 1–2 million reads per MinION flowcell) are problematic. Sample multiplexing has been proposed as a solution to overcome the high input requirements of direct RNA sequencing29; however, this approach comes at the expense of decreased coverage per sample. We envision that more extensive use of PromethION flowcells — which produce about 10× more coverage than a MinION flowcell, at the same input amounts — will partially alleviate these limitations. In addition, the release of new direct RNA sequencing kits with increased helicase speed (120 bases per second, versus 70 bases per second currently), as announced by Oxford Nanopore Technologies, will probably lead to increased sequencing yields.
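The trade-offs described above can be made concrete with a back-of-the-envelope Python sketch. It takes the midpoint of the quoted 1–2 million reads per MinION flowcell, treats the roughly 10× PromethION gain as a gain in read count, and naively assumes that yield scales linearly with helicase speed. All three are simplifying assumptions, and the function names are hypothetical.

```python
MINION_READS = 1_500_000   # assumed midpoint of the 1-2 million reads quoted
PROMETHION_FACTOR = 10     # "about 10x" figure, treated here as read count

def reads_per_sample(reads_per_flowcell, n_samples):
    """Reads available to each sample when a flowcell is split evenly."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return reads_per_flowcell // n_samples

def speed_scaled_reads(reads, old_speed=70, new_speed=120):
    """Naive upper-bound estimate: yield proportional to bases per second."""
    return reads * new_speed / old_speed

minion_4plex = reads_per_sample(MINION_READS, 4)
promethion_4plex = reads_per_sample(MINION_READS * PROMETHION_FACTOR, 4)
faster_kit = speed_scaled_reads(MINION_READS)

print(minion_4plex, promethion_4plex, round(faster_kit))
```

Multiplexing four samples on one MinION flowcell leaves each sample with a quarter of the reads, which is the coverage cost noted above; the linear speed scaling is an optimistic bound because pore occupancy and read length also affect yield.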

Opportunities for LRS

Long-read sequencing technologies are becoming more mainstream in academic and commercial settings. They can map DNA and RNA modifications without highly specialized protocols, and they also offer transcript isoform identification, improved mappability and detection of structural variants. Together with continuing progress in detection methodologies, accuracy, throughput and cost reduction, these strengths make long-read sequencing an up-and-coming alternative to well-established technologies based on short-read next-generation sequencing.

In the past few years, interest in using RNA-based therapies to treat and prevent human diseases has grown exponentially, largely owing to the success of mRNA vaccines (such as those from Pfizer and Moderna) against COVID-19. The success of these vaccines relies on several key milestones, one of them being the use of modified RNA nucleosides (specifically, N1-methylpseudouridine (m1Ψ)) that are incorporated into the mRNA molecules during the in vitro transcription reaction. Notably, m1Ψ modifications facilitate evasion of the immune response30. Modified nucleosides will probably routinely be present in future mRNA vaccines, but their use might also enhance other RNA-based therapies such as antisense oligonucleotides. Therefore, we will need a strategy to uniquely identify where modified nucleosides are incorporated in individual molecules before using them for future treatments. Quality control must ensure mRNA purity and yield, but also the incorporation of modified nucleotides. However, there are currently no public analytical tests to measure the quality of vaccine mRNA manufacturing. In this regard, long-read sequencing technologies offer a unique platform for rigorous quality control that is key to the performance of mRNA vaccines31. We expect that long-read technologies will have a key role in the development and approval of future RNA-based therapies for diverse human diseases, such as novel mRNA vaccines against a broad range of viral diseases for which we currently lack vaccination schemes as well as against COVID-19 variants that may arise.

The rich nature of reads that are captured using long-read sequencing technologies further makes them promising platforms for disease diagnosis and prognosis. Indeed, the information content that can be obtained from reads generated by these technologies may in fact compensate for the limitations posed by their low sequencing throughput: long-read technologies can capture read lengths, polyA tail lengths and tail composition heterogeneity, as well as DNA and RNA modification information — all from a single read. Consequently, algorithms may be able to classify samples into their respective populations more accurately than short-read technologies, despite sequencing lower numbers of molecules. Moreover, in the case of nanopore sequencing, portability and the small size of the sequencing devices simplify bringing this technology into clinical and field settings.

Both SMRT and nanopore platforms are promising long-read sequencing solutions for detecting DNA modifications, and nanopore sequencing is so far unparalleled in the realm of RNA-modification quantification. We expect these technologies will continue on their rapid upward trajectories, empowering the fields of epigenomics and epitranscriptomics in the years to come.