Abstract
The lack of benchmark data sets with built-in ground truth makes it challenging to compare the performance of existing long-read isoform detection and differential expression analysis workflows. Here, we present a benchmark experiment using two human lung adenocarcinoma cell lines that were each profiled in triplicate together with synthetic, spliced, spike-in RNAs (sequins). Samples were deeply sequenced on both Illumina short-read and Oxford Nanopore Technologies long-read platforms. Alongside the ground truth available via the sequins, we created in silico mixture samples to allow performance assessment in the absence of true positives or true negatives. Our results show that StringTie2 and bambu outperformed the other tools among the six isoform detection tools tested; DESeq2, edgeR and limma-voom performed best among the five differential transcript expression tools tested; and no clear front-runner emerged among the five tools compared for differential transcript usage analysis, which suggests that further methods development is needed for this application.
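The idea of an in silico mixture can be illustrated with a minimal sketch. It assumes a mixture is built by subsampling reads from two pure samples and pooling them at a chosen proportion; the abstract does not state the exact procedure. The function name `in_silico_mixture`, the read identifiers, the proportions and the read totals are all hypothetical.

```python
import random
from typing import List, Sequence


def in_silico_mixture(
    reads_a: Sequence[str],
    reads_b: Sequence[str],
    prop_a: float,
    total: int,
    seed: int = 0,
) -> List[str]:
    """Draw `total` reads: a fraction `prop_a` from sample A, the rest from B."""
    if not 0.0 <= prop_a <= 1.0:
        raise ValueError("prop_a must lie in [0, 1]")
    n_a = round(total * prop_a)
    n_b = total - n_a
    if n_a > len(reads_a) or n_b > len(reads_b):
        raise ValueError("not enough reads to subsample without replacement")
    rng = random.Random(seed)
    # Sample without replacement so no read is duplicated within a mixture.
    return rng.sample(list(reads_a), n_a) + rng.sample(list(reads_b), n_b)


# Hypothetical read identifiers standing in for two pure samples.
sample_a = [f"A_read{i}" for i in range(1000)]
sample_b = [f"B_read{i}" for i in range(1000)]

# Illustrative mixing proportions (assumed, not taken from the text).
mixtures = {
    p: in_silico_mixture(sample_a, sample_b, prop_a=p, total=400)
    for p in (0.75, 0.5, 0.25)
}
```

Because each mixture's composition is fixed by construction, the expected direction of change between any two mixtures is known in advance, which gives a reference for assessing tools on endogenous transcripts.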
Data availability
RNA-seq data are available from Gene Expression Omnibus (GEO) under accession numbers GSE172421 (main benchmarking data set) and GSE227000 (laboratory-based mixture of replicate 1). RNA-seq data from the pure and laboratory-based mixture samples (short-read) of Holik et al. (ref. 17) are available from GEO under accession number GSE64098. The GENCODE Human Release 33 genome, transcriptome and gene annotation files are available from https://www.gencodegenes.org/human/release_33.html. The RNA sequin decoy chromosome, gene annotation file and transcript abundance information are available from https://www.sequinstandards.com/resources/ (registration is required to access).
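For scripted access, the GEO accessions listed above can be turned into landing-page links. The sketch below assumes the standard GEO query URL pattern (`acc.cgi?acc=...`). The helper name `geo_url` is hypothetical, and nothing is downloaded; the code only builds the links.

```python
from urllib.parse import urlencode

# Standard GEO accession viewer endpoint (assumed URL pattern).
GEO_BASE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"

# Accessions and descriptions as given in the data availability statement.
ACCESSIONS = {
    "GSE172421": "main benchmarking data set",
    "GSE227000": "laboratory-based mixture of replicate 1",
    "GSE64098": "short-read pure and mixture samples of Holik et al.",
}


def geo_url(accession: str) -> str:
    """Return the GEO landing-page URL for a series accession such as GSE172421."""
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return f"{GEO_BASE}?{urlencode({'acc': accession})}"


for acc, description in ACCESSIONS.items():
    print(f"{acc}\t{description}\t{geo_url(acc)}")
```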
Code availability
Code used to perform these analyses and generate the figures is available from https://github.com/XueyiDong/LongReadBenchmark. All analyses for DTE, DTU and comparisons of method performance were run in R v.4.1.0 (ref. 58). Results were visualized using ggplot2 v.3.3.5 (ref. 59) and UpSetR v.1.4.0 (ref. 60).
References
Byrne, A. et al. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun. 8, 16027 (2017).
Depledge, D. P. et al. Direct RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat. Commun. 10, 754 (2019).
Cole, C., Byrne, A., Adams, M., Volden, R. & Vollmers, C. Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing. Genome Res. 30, 589–601 (2020).
Vollmers, A. C., Mekonen, H. E., Campos, S., Carpenter, S. & Vollmers, C. Generation of an isoform-level transcriptome atlas of macrophage activation. J. Biol. Chem. 296, 100784 (2021).
Robinson, E. K. et al. Inflammation drives alternative first exon usage to regulate immune genes including a novel iron-regulated isoform of Aim2. eLife 10, e69431 (2021).
Chang, J. J.-Y. et al. Long-read RNA sequencing identifies polyadenylation elongation and differential transcript usage of host transcripts during SARS-CoV-2 in vitro infection. Front. Immunol. 13, 1501 (2022).
Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
Weber, L. M. et al. Essential guidelines for computational method benchmarking. Genome Biol. 20, 125 (2019).
Soneson, C. et al. A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nat. Commun. 10, 3359 (2019).
Wongsurawat, T., Jenjaroenpun, P., Wanchai, V. & Nookaew, I. Native RNA or cDNA sequencing for transcriptomic analysis: a case study on Saccharomyces cerevisiae. Front. Bioengin. Biotechnol. 10, 401 (2022).
Sessegolo, C. et al. Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules. Sci. Rep. 9, 14908 (2019).
Chen, Y. et al. A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines. Preprint at bioRxiv https://doi.org/10.1101/2021.04.21.440736 (2021).
Hardwick, S. A. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods 13, 792–798 (2016).
Dong, X. et al. The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools. NAR Genom. Bioinform. 3, lqab028 (2021).
Pardo-Palacios, F. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-777702/v1 (2021).
Paul, L. et al. SIRVs: spike-in RNA variants as external isoform controls in RNA-sequencing. Preprint at bioRxiv https://doi.org/10.1101/080747 (2016).
Holik, A. Z. et al. RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic Acids Res. 45, e30 (2017).
Piovesan, A. et al. Human protein-coding genes and gene feature statistics in 2019. BMC Res. Notes 12, 315 (2019).
Huang, K. K. et al. Long-read transcriptome sequencing reveals abundant promoter diversity in distinct molecular subtypes of gastric cancer. Genome Biol. 22, 1–24 (2021).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Chen, Y. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat. Methods 20, 1187–1195 (2023).
Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).
Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 22, 310 (2021).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Preprint at bioRxiv https://doi.org/10.1101/672931 (2020).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730 (2022).
Jenjaroenpun, P. et al. Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D. Nucleic Acids Res. 46, e38 (2018).
Gleeson, J. et al. Accurate expression quantification from nanopore direct RNA sequencing with NanoCount. Nucleic Acids Res. 50, e19 (2022).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Leng, N. et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29, 1035–1043 (2013).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Tarazona, S. et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res. 43, e140 (2015).
Love, M. I., Soneson, C. & Patro, R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Res. 7, 952 (2018).
Nowicka, M. & Robinson, M. D. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res. 5, 1356 (2016).
Gilis, J., Vitting-Seerup, K., den Berge, K. V. & Clement, L. satuRn: scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications. F1000Res. 10, 374 (2021).
Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017 (2012).
Wyman, D. & Mortazavi, A. TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts. Bioinformatics 35, 340–342 (2019).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Statham, A. L. et al. Repitools: an R package for the analysis of enrichment-based epigenomic data. Bioinformatics 26, 1662–1663 (2010).
Robinson, M. D. et al. Copy-number-aware differential analysis of quantitative DNA sequencing data. Genome Res. 22, 2489–2496 (2012).
Riebler, A. et al. BayMeth: improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach. Genome Biol. 15, R35 (2014).
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2018).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Wang, L., Wang, S. & Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184–2185 (2012).
Baldoni, P. L. et al. Dividing out quantification uncertainty allows efficient assessment of differential transcript expression. Preprint at bioRxiv https://doi.org/10.1101/2023.04.02.535231 (2023).
Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 4, 1521 (2016).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
Law, C. W. et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 5, 1408 (2018).
Chen, Y., Lun, A. T. L. & Smyth, G. K. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res. 5, 1438 (2016).
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2021); https://www.R-project.org/
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
Conway, J. R., Lex, A. & Gehlenborg, N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33, 2938–2940 (2017).
Acknowledgements
We thank C. Weeden and M.-L. Asselin-Labat (Personalized Oncology Division, The Walter and Eliza Hall Institute of Medical Research) for providing the cell lines used in this study. P.L.B., G.K.S. and C.W.L. were supported by the Chan Zuckerberg Initiative Essential Open Source Software for Science Program (grant nos. 2019-207283 and 2021-237445) and M.E.R. was supported by an Australian National Health and Medical Research Council Investigator grant (no. 2017257). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
X.D. designed the study, conducted data analysis, generated the figures and wrote the manuscript with input from all authors. M.R.M.D. conducted data analysis, generated figures and wrote the manuscript. Q.G. and M.E.R. designed the study. Q.G., L.T., J.S.J. and R.B. generated benchmarking data. P.L.B. devised analysis methods. Y.C., G.K.S., S.L.A., C.W.L. and M.E.R. supervised the research. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–17.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dong, X., Du, M.R.M., Gouil, Q. et al. Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures. Nat Methods 20, 1810–1821 (2023). https://doi.org/10.1038/s41592-023-02026-3
DOI: https://doi.org/10.1038/s41592-023-02026-3