Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

Dong, Xueyi; Du, Mei R. M.; Gouil, Quentin; Tian, Luyi; Jabbari, Jafar S.; Bowden, Rory; Baldoni, Pedro L.; Chen, Yunshun; Smyth, Gordon K.; Amarasinghe, Shanika L.; Law, Charity W.; Ritchie, Matthew E.

doi:10.1038/s41592-023-02026-3

Analysis
Published: 02 October 2023

Nature Methods volume 20, pages 1810–1821 (2023)

Abstract

The lack of benchmark data sets with inbuilt ground truth makes it challenging to compare the performance of existing long-read isoform detection and differential expression analysis workflows. Here, we present a benchmark experiment using two human lung adenocarcinoma cell lines that were each profiled in triplicate together with synthetic, spliced, spike-in RNAs (sequins). Samples were deeply sequenced on both Illumina short-read and Oxford Nanopore Technologies long-read platforms. Alongside the ground truth available via the sequins, we created in silico mixture samples to allow performance assessment in the absence of true positives or true negatives. Our results show that StringTie2 and bambu outperformed the other four of the six isoform detection tools tested, and that DESeq2, edgeR and limma-voom were the best of the five differential transcript expression tools tested. There was no clear front-runner among the five tools compared for differential transcript usage analysis, which suggests that further methods development is needed for this application.
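The abstract does not describe how the in silico mixture samples were constructed. As a rough illustration of the general idea only, the sketch below assumes that a mixture is built by randomly subsampling reads from two pure samples in a chosen proportion. The function name `in_silico_mixture`, its parameters and the 75:25 proportion are hypothetical and are not taken from the article.

```python
import random


def in_silico_mixture(reads_a, reads_b, prop_a, total, seed=0):
    """Draw `total` reads: a fraction `prop_a` from sample A, the rest from sample B.

    Assumption (not from the article): mixing is done by subsampling reads
    without replacement from each pure sample and pooling the results.
    """
    if not 0.0 <= prop_a <= 1.0:
        raise ValueError("prop_a must be between 0 and 1")
    n_a = round(total * prop_a)   # number of reads taken from sample A
    n_b = total - n_a             # remaining reads come from sample B
    if n_a > len(reads_a) or n_b > len(reads_b):
        raise ValueError("not enough reads to subsample without replacement")
    rng = random.Random(seed)     # fixed seed keeps the mixture reproducible
    mixture = rng.sample(reads_a, n_a) + rng.sample(reads_b, n_b)
    rng.shuffle(mixture)          # interleave reads from the two sources
    return mixture


# Toy read identifiers standing in for real sequencing reads.
reads_a = [f"A_read{i}" for i in range(1000)]
reads_b = [f"B_read{i}" for i in range(1000)]
mix = in_silico_mixture(reads_a, reads_b, prop_a=0.75, total=400)
```

Because the mixing proportion is known by construction, the expected abundance of each transcript in such a mixture sits between its abundances in the two pure samples, which is what makes a mixture series usable as a reference when no list of true positives exists.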

**Fig. 1: Overview of the experimental design and benchmark analysis.**

**Fig. 2: Overview of data quality for the benchmarking data set.**

**Fig. 3: Comparison of isoform identification and quantification methods using pure RNA samples.**

**Fig. 4: Comparisons of DTE methods using in silico mixtures.**

**Fig. 5: Transcript-level comparisons of DTU methods using in silico mixtures.**

Data availability

RNA-seq data are available from Gene Expression Omnibus (GEO) under accession numbers GSE172421 (main benchmarking data set) and GSE227000 (laboratory-based mixture of replicate 1). RNA-seq data from the pure and laboratory-based mixture samples (short-read) of Holik et al.¹⁷ are available from GEO under accession number GSE64098. The GENCODE Human Release 33 genome, transcriptome and gene annotation files are available from https://www.gencodegenes.org/human/release_33.html. The RNA sequin decoy chromosome, gene annotation file and transcript abundance information are available from https://www.sequinstandards.com/resources/ (registration is required to access).

Code availability

Code used to perform these analyses and generate the figures is available from https://github.com/XueyiDong/LongReadBenchmark. All analyses for DTE, DTU and comparisons of method performance were run in R v.4.1.0 (ref. ⁵⁸). Results were visualized using ggplot2 v.3.3.5 (ref. ⁵⁹) and UpSetR v.1.4.0 (ref. ⁶⁰).

References

Byrne, A. et al. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun. 8, 16027 (2017).
Depledge, D. P. et al. Direct RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat. Commun. 10, 754 (2019).
Cole, C., Byrne, A., Adams, M., Volden, R. & Vollmers, C. Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing. Genome Res. 30, 589–601 (2020).
Vollmers, A. C., Mekonen, H. E., Campos, S., Carpenter, S. & Vollmers, C. Generation of an isoform-level transcriptome atlas of macrophage activation. J. Biol. Chem. 296, 100784 (2021).
Robinson, E. K. et al. Inflammation drives alternative first exon usage to regulate immune genes including a novel iron-regulated isoform of Aim2. eLife 10, e69431 (2021).
Chang, J. J.-Y. et al. Long-read RNA sequencing identifies polyadenylation elongation and differential transcript usage of host transcripts during SARS-CoV-2 in vitro infection. Front. Immunol. 13, 1501 (2022).
Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
Weber, L. M. et al. Essential guidelines for computational method benchmarking. Genome Biol. 20, 125 (2019).
Soneson, C. et al. A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nat. Commun. 10, 3359 (2019).
Wongsurawat, T., Jenjaroenpun, P., Wanchai, V. & Nookaew, I. Native RNA or cDNA sequencing for transcriptomic analysis: a case study on Saccharomyces cerevisiae. Front. Bioengin. Biotechnol. 10, 401 (2022).
Sessegolo, C. et al. Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules. Sci. Rep. 9, 14908 (2019).
Chen, Y. et al. A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines. Preprint at bioRxiv https://doi.org/10.1101/2021.04.21.440736 (2021).
Hardwick, S. A. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods 13, 792–798 (2016).
Dong, X. et al. The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools. NAR Genom. Bioinform. 3, lqab028 (2021).
Pardo-Palacios, F. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-777702/v1 (2021).
Paul, L. et al. SIRVs: spike-in RNA variants as external isoform controls in RNA-sequencing. Preprint at bioRxiv https://doi.org/10.1101/080747 (2016).
Holik, A. Z. et al. RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic Acids Res. 45, e30 (2017).
Piovesan, A. et al. Human protein-coding genes and gene feature statistics in 2019. BMC Res. Notes 12, 315 (2019).
Huang, K. K. et al. Long-read transcriptome sequencing reveals abundant promoter diversity in distinct molecular subtypes of gastric cancer. Genome Biol. 22, 1–24 (2021).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Chen, Y. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat. Methods 20, 1187–1195 (2023).
Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).
Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 22, 310 (2021).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Preprint at bioRxiv https://doi.org/10.1101/672931 (2020).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730 (2022).
Jenjaroenpun, P. et al. Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D. Nucleic Acids Res. 46, e38 (2018).
Gleeson, J. et al. Accurate expression quantification from nanopore direct RNA sequencing with NanoCount. Nucleic Acids Res. 50, e19 (2022).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Leng, N. et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29, 1035–1043 (2013).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Tarazona, S. et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res. 43, e140 (2015).
Love, M. I., Soneson, C. & Patro, R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Res. 7, 952 (2018).
Nowicka, M. & Robinson, M. D. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res. 5, 1356 (2016).
Gilis, J., Vitting-Seerup, K., den Berge, K. V. & Clement, L. satuRn: scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications. F1000Res. 10, 374 (2021).
Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017 (2012).
Wyman, D. & Mortazavi, A. TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts. Bioinformatics 35, 340–342 (2019).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Statham, A. L. et al. Repitools: an R package for the analysis of enrichment-based epigenomic data. Bioinformatics 26, 1662–1663 (2010).
Robinson, M. D. et al. Copy-number-aware differential analysis of quantitative DNA sequencing data. Genome Res. 22, 2489–2496 (2012).
Riebler, A. et al. BayMeth: improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach. Genome Biol. 15, R35 (2014).
Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).
Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2018).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Wang, L., Wang, S. & Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184–2185 (2012).
Baldoni, P. L. et al. Dividing out quantification uncertainty allows efficient assessment of differential transcript expression. Preprint at bioRxiv https://doi.org/10.1101/2023.04.02.535231 (2023).
Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 4, 1521 (2016).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
Law, C. W. et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 5, 1408 (2018).
Chen, Y., Lun, A. T. L. & Smyth, G. K. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res. 5, 1438 (2016).
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2021); https://www.R-project.org/
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
Conway, J. R., Lex, A. & Gehlenborg, N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33, 2938–2940 (2017).

Acknowledgements

We thank C. Weeden and M.-L. Asselin-Labat (Personalized Oncology Division, The Walter and Eliza Hall Institute of Medical Research) for providing the cell lines used in this study. P.L.B., G.K.S. and C.W.L. were supported by the Chan Zuckerberg Initiative Essential Open Source Software for Science Program (grant nos. 2019-207283 and 2021-237445) and M.E.R. was supported by an Australian National Health and Medical Research Council Investigator grant (no. 2017257). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Luyi Tian
Present address: Guangzhou National Laboratory, Guangzhou, China
Shanika L. Amarasinghe
Present address: The Australian Regenerative Medicine Institute, Monash University, Clayton, Victoria, Australia
These authors contributed equally: Mei R. M. Du, Quentin Gouil.

Authors and Affiliations

The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
Xueyi Dong, Mei R. M. Du, Quentin Gouil, Luyi Tian, Jafar S. Jabbari, Rory Bowden, Pedro L. Baldoni, Yunshun Chen, Gordon K. Smyth, Shanika L. Amarasinghe, Charity W. Law & Matthew E. Ritchie
Department of Medical Biology, The University of Melbourne, Parkville, Victoria, Australia
Xueyi Dong, Quentin Gouil, Luyi Tian, Jafar S. Jabbari, Rory Bowden, Pedro L. Baldoni, Yunshun Chen, Shanika L. Amarasinghe, Charity W. Law & Matthew E. Ritchie
School of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria, Australia
Gordon K. Smyth

Contributions

X.D. designed the study, conducted data analysis, generated the figures and wrote the manuscript with input from all authors. M.R.M.D. conducted data analysis, generated figures and wrote the manuscript. Q.G. and M.E.R. designed the study. Q.G., L.T., J.S.J. and R.B. generated benchmarking data. P.L.B. devised analysis methods. Y.C., G.K.S., S.L.A., C.W.L. and M.E.R. supervised the research. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Xueyi Dong or Matthew E. Ritchie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–17.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dong, X., Du, M.R.M., Gouil, Q. et al. Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures. Nat Methods 20, 1810–1821 (2023). https://doi.org/10.1038/s41592-023-02026-3

Received: 22 July 2022
Accepted: 25 August 2023
Published: 02 October 2023
Issue Date: November 2023
DOI: https://doi.org/10.1038/s41592-023-02026-3
