  • Analysis

Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

Abstract

The lack of benchmark data sets with inbuilt ground-truth makes it challenging to compare the performance of existing long-read isoform detection and differential expression analysis workflows. Here, we present a benchmark experiment using two human lung adenocarcinoma cell lines that were each profiled in triplicate together with synthetic, spliced, spike-in RNAs (sequins). Samples were deeply sequenced on both Illumina short-read and Oxford Nanopore Technologies long-read platforms. Alongside the ground-truth available via the sequins, we created in silico mixture samples to allow performance assessment in the absence of true positives or true negatives. Our results show three things. First, StringTie2 and bambu outperformed the other isoform detection tools among the six tested. Second, DESeq2, edgeR and limma-voom performed best among the five differential transcript expression tools tested. Third, no clear front-runner emerged among the five tools compared for differential transcript usage analysis, which suggests that further methods development is needed for this application.
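The abstract does not spell out how the in silico mixture samples were constructed. As a minimal sketch of the general idea only, assuming a mixture is formed by subsampling reads from two pure samples in a fixed proportion, the following Python example draws reads without replacement and computes the abundance a transcript would be expected to show in such a mixture. The function names, the proportion and the seed are hypothetical and are not taken from the article.

```python
import random


def in_silico_mixture(reads_a, reads_b, prop_a, n_reads, seed=0):
    """Return n_reads reads, a fraction prop_a drawn from sample A and the rest from B.

    Assumption: reads are sampled without replacement from each pure sample.
    """
    if not 0.0 <= prop_a <= 1.0:
        raise ValueError("prop_a must be between 0 and 1")
    n_a = round(n_reads * prop_a)
    n_b = n_reads - n_a
    if n_a > len(reads_a) or n_b > len(reads_b):
        raise ValueError("not enough reads to sample without replacement")
    rng = random.Random(seed)  # fixed seed so the mixture is reproducible
    return rng.sample(reads_a, n_a) + rng.sample(reads_b, n_b)


def expected_mixture_abundance(abund_a, abund_b, prop_a):
    """Expected abundance of a transcript in the mixture (linear mixing assumption)."""
    return prop_a * abund_a + (1.0 - prop_a) * abund_b


# Hypothetical read identifiers standing in for two pure samples.
sample_a = [f"a{i}" for i in range(100)]
sample_b = [f"b{i}" for i in range(100)]
mixture = in_silico_mixture(sample_a, sample_b, prop_a=0.75, n_reads=40)
```

Under the linear mixing assumption, a known mixing proportion fixes the expected abundance of every transcript in the mixture, which is what makes such samples usable as a reference when no independent ground truth exists.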

Fig. 1: Overview of the experimental design and benchmark analysis.
Fig. 2: Overview of data quality for the benchmarking data set.
Fig. 3: Comparison of isoform identification and quantification methods using pure RNA samples.
Fig. 4: Comparisons of DTE methods using in silico mixtures.
Fig. 5: Transcript-level comparisons of DTU methods using in silico mixtures.

Data availability

RNA-seq data are available from Gene Expression Omnibus (GEO) under accession numbers GSE172421 (main benchmarking data set) and GSE227000 (laboratory-based mixture of replicate 1). RNA-seq data from the pure and laboratory-based mixture samples (short-read) of Holik et al.17 are available from GEO under accession number GSE64098. The GENCODE Human Release 33 genome, transcriptome and gene annotation files are available from https://www.gencodegenes.org/human/release_33.html. The RNA sequin decoy chromosome, gene annotation file and transcript abundance information are available from https://www.sequinstandards.com/resources/ (registration is required to access).
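For convenience, the GEO accessions listed above can be turned into landing-page links. The sketch below assumes the standard public GEO accession query page (`acc.cgi` with an `acc` parameter); the helper name `geo_url` and the dictionary are illustrative and not part of the article.

```python
from urllib.parse import urlencode

# Public GEO accession viewer; the accession is passed as the "acc" parameter.
GEO_QUERY_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"

# Accessions and descriptions as stated in the data availability statement.
ACCESSIONS = {
    "GSE172421": "main benchmarking data set",
    "GSE227000": "laboratory-based mixture of replicate 1",
    "GSE64098": "short-read pure and laboratory-based mixture samples of Holik et al.",
}


def geo_url(accession):
    """Build the landing-page URL for a GEO series accession (hypothetical helper)."""
    if not accession.startswith("GSE") or not accession[3:].isdigit():
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return f"{GEO_QUERY_URL}?{urlencode({'acc': accession})}"


for acc, description in ACCESSIONS.items():
    print(f"{acc}\t{description}\t{geo_url(acc)}")
```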

Code availability

Code used to perform these analyses and generate the figures is available from https://github.com/XueyiDong/LongReadBenchmark. All analyses for DTE, DTU and comparisons of method performance were run in R v.4.1.0 (ref. 58). Results were visualized using ggplot2 v.3.3.5 (ref. 59) and UpSetR v.1.4.0 (ref. 60).

References

  1. Byrne, A. et al. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun. 8, 16027 (2017).

  2. Depledge, D. P. et al. Direct RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat. Commun. 10, 754 (2019).

  3. Cole, C., Byrne, A., Adams, M., Volden, R. & Vollmers, C. Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing. Genome Res. 30, 589–601 (2020).

  4. Vollmers, A. C., Mekonen, H. E., Campos, S., Carpenter, S. & Vollmers, C. Generation of an isoform-level transcriptome atlas of macrophage activation. J. Biol. Chem. 296, 100784 (2021).

  5. Robinson, E. K. et al. Inflammation drives alternative first exon usage to regulate immune genes including a novel iron-regulated isoform of Aim2. eLife 10, e69431 (2021).

  6. Chang, J. J.-Y. et al. Long-read RNA sequencing identifies polyadenylation elongation and differential transcript usage of host transcripts during SARS-CoV-2 in vitro infection. Front. Immunol. 13, 1501 (2022).

  7. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).

  8. Weber, L. M. et al. Essential guidelines for computational method benchmarking. Genome Biol. 20, 125 (2019).

  9. Soneson, C. et al. A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nat. Commun. 10, 3359 (2019).

  10. Wongsurawat, T., Jenjaroenpun, P., Wanchai, V. & Nookaew, I. Native RNA or cDNA sequencing for transcriptomic analysis: a case study on Saccharomyces cerevisiae. Front. Bioengin. Biotechnol. 10, 401 (2022).

  11. Sessegolo, C. et al. Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules. Sci. Rep. 9, 14908 (2019).

  12. Chen, Y. et al. A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines. Preprint at bioRxiv https://doi.org/10.1101/2021.04.21.440736 (2021).

  13. Hardwick, S. A. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods 13, 792–798 (2016).

  14. Dong, X. et al. The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools. NAR Genom. Bioinform. 3, lqab028 (2021).

  15. Pardo-Palacios, F. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-777702/v1 (2021).

  16. Paul, L. et al. SIRVs: spike-in RNA variants as external isoform controls in RNA-sequencing. Preprint at bioRxiv https://doi.org/10.1101/080747 (2016).

  17. Holik, A. Z. et al. RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic Acids Res. 45, e30 (2017).

  18. Piovesan, A. et al. Human protein-coding genes and gene feature statistics in 2019. BMC Res. Notes 12, 315 (2019).

  19. Huang, K. K. et al. Long-read transcriptome sequencing reveals abundant promoter diversity in distinct molecular subtypes of gastric cancer. Genome Biol. 22, 1–24 (2021).

  20. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).

  21. Chen, Y. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat. Methods 20, 1187–1195 (2023).

  22. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).

  23. Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 22, 310 (2021).

  24. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).

  25. Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Preprint at bioRxiv https://doi.org/10.1101/672931 (2020).

  26. Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730 (2022).

  27. Jenjaroenpun, P. et al. Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D. Nucleic Acids Res. 46, e38 (2018).

  28. Gleeson, J. et al. Accurate expression quantification from nanopore direct RNA sequencing with NanoCount. Nucleic Acids Res. 50, e19 (2022).

  29. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  30. Leng, N. et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29, 1035–1043 (2013).

  31. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

  32. Tarazona, S. et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res. 43, e140 (2015).

  33. Love, M. I., Soneson, C. & Patro, R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Res. 7, 952 (2018).

  34. Nowicka, M. & Robinson, M. D. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res. 5, 1356 (2016).

  35. Gilis, J., Vitting-Seerup, K., Van den Berge, K. & Clement, L. satuRn: scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications. F1000Res. 10, 374 (2021).

  36. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017 (2012).

  37. Wyman, D. & Mortazavi, A. TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts. Bioinformatics 35, 340–342 (2019).

  38. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  39. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).

  40. Statham, A. L. et al. Repitools: an R package for the analysis of enrichment-based epigenomic data. Bioinformatics 26, 1662–1663 (2010).

  41. Robinson, M. D. et al. Copy-number-aware differential analysis of quantitative DNA sequencing data. Genome Res. 22, 2489–2498 (2012).

  42. Riebler, A. et al. BayMeth: improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach. Genome Biol. 15, R35 (2014).

  43. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).

  44. Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2018).

  45. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).

  46. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

  47. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  48. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  49. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  50. Wang, L., Wang, S. & Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184–2185 (2012).

  51. Baldoni, P. L. et al. Dividing out quantification uncertainty allows efficient assessment of differential transcript expression. Preprint at bioRxiv https://doi.org/10.1101/2023.04.02.535231 (2023).

  52. Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 4, 1521 (2016).

  53. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).

  54. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

  55. Law, C. W. et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 5, 1408 (2018).

  56. Chen, Y., Lun, A. T. L. & Smyth, G. K. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Res. 5, 1438 (2016).

  57. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).

  58. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2021); https://www.R-project.org/

  59. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).

  60. Conway, J. R., Lex, A. & Gehlenborg, N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33, 2938–2940 (2017).

Acknowledgements

We thank C. Weeden and M.-L. Asselin-Labat (Personalized Oncology Division, The Walter and Eliza Hall Institute of Medical Research) for providing the cell lines used in this study. P.L.B., G.K.S. and C.W.L. were supported by the Chan Zuckerberg Initiative Essential Open Source Software for Science Program (grant nos. 2019-207283 and 2021-237445) and M.E.R. was supported by Australian National Health and Medical Research Council Investigator grant (no. 2017257). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

X.D. designed the study, conducted data analysis, generated the figures and wrote the manuscript with input from all authors. M.R.M.D. conducted data analysis, generated figures and wrote the manuscript. Q.G. and M.E.R. designed the study. Q.G., L.T., J.S.J. and R.B. generated benchmarking data. P.L.B. devised analysis methods. Y.C., G.K.S., S.L.A., C.W.L. and M.E.R. supervised the research. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Xueyi Dong or Matthew E. Ritchie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dong, X., Du, M.R.M., Gouil, Q. et al. Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures. Nat Methods 20, 1810–1821 (2023). https://doi.org/10.1038/s41592-023-02026-3

  • DOI: https://doi.org/10.1038/s41592-023-02026-3
