  • Article

Context-aware transcript quantification from long-read RNA-seq data with Bambu

Abstract

Most approaches to transcript quantification rely on fixed reference annotations; however, the transcriptome is dynamic and, depending on the context, such static annotations contain inactive isoforms for some genes, whereas they are incomplete for others. Here we present Bambu, a method that performs machine-learning-based transcript discovery to enable quantification specific to the context of interest using long-read RNA-sequencing. To identify novel transcripts, Bambu estimates the novel discovery rate, which replaces arbitrary per-sample thresholds with a single, interpretable, precision-calibrated parameter. Bambu retains the full-length and unique read counts, enabling accurate quantification in the presence of inactive isoforms. Compared to existing methods for transcript discovery, Bambu achieves greater precision without sacrificing sensitivity. We show that context-aware annotations improve quantification for both novel and known transcripts. We apply Bambu to quantify isoforms from repetitive HERVH-LTR7 retrotransposons in human embryonic stem cells, demonstrating its suitability for context-specific transcript expression analysis.
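
The idea of a single novel discovery rate threshold can be illustrated with a minimal sketch. It assumes that each candidate transcript carries a classifier score and a flag indicating whether it matches the reference annotation, and that the novel discovery rate at a score cutoff is the fraction of retained candidates that are not annotated. The function names, the data layout and the toy scores are hypothetical; this is not Bambu's implementation and omits its calibration step.

```python
from typing import List, Optional, Tuple

# Each candidate is (classifier_score, is_annotated). Both fields are
# assumptions made for this illustration, not Bambu's data model.
Candidate = Tuple[float, bool]


def novel_discovery_rate(candidates: List[Candidate], cutoff: float) -> float:
    """Fraction of candidates scoring >= cutoff that are not annotated."""
    kept = [annotated for score, annotated in candidates if score >= cutoff]
    if not kept:
        return 0.0
    novel = sum(1 for annotated in kept if not annotated)
    return novel / len(kept)


def lowest_cutoff_for_ndr(candidates: List[Candidate],
                          max_ndr: float) -> Optional[float]:
    """Most permissive observed score cutoff whose rate is <= max_ndr.

    The rate is not guaranteed to be monotone in the cutoff, so the
    observed scores are scanned from lowest to highest and the first
    one that satisfies the target is returned.
    """
    for cutoff in sorted({score for score, _ in candidates}):
        if novel_discovery_rate(candidates, cutoff) <= max_ndr:
            return cutoff
    return None


# Toy data: eight candidates, four of them absent from the annotation.
candidates: List[Candidate] = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, False), (0.40, False), (0.30, True), (0.10, False),
]

chosen_cutoff = lowest_cutoff_for_ndr(candidates, max_ndr=0.25)
print(chosen_cutoff, novel_discovery_rate(candidates, chosen_cutoff))
```

In this sketch the user fixes one target rate, and the score cutoff follows from the data, so the same target can be applied to samples whose score distributions differ.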

Fig. 1: Bambu enables simultaneous transcript discovery and quantification from Nanopore RNA-seq data.
Fig. 2: A calibrated machine-learning full-length transcript classifier improves transcript discovery accuracy.
Fig. 3: Transcript quantification on spike-in data shows improvement with varying novel discovery rates.
Fig. 4: Full-length and unique read support provide evidence on expressed transcripts.
Fig. 5: Bambu enables the discovery and quantification of highly repetitive genes.

Data availability

The SG-NEx samples are available through GitHub (https://github.com/GoekeLab/sg-nex-data), ENA (PRJEB44348) and AWS Open Data (https://registry.opendata.aws/sgnex/). Processed data associated with figures and tables are available on Code Ocean (ref. 55). The PacBio data are available through SRA (SRP036136). The Arabidopsis data are available through ENA (PRJEB32782).

Code availability

Bambu is an R package for transcript discovery and quantification across multiple samples that is maintained on Bioconductor at https://www.bioconductor.org/packages/bambu/. The source code and detailed documentation are available on GitHub at https://github.com/GoekeLab/bambu/. We used the BambuManuscriptRevision branch of Bambu for the analysis performed in this manuscript (https://github.com/GoekeLab/bambu/tree/BambuManuscriptRevision). All analysis code is available on Code Ocean (ref. 55).

References

  1. Matlin, A. J., Clark, F. & Smith, C. W. J. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005).

  2. Blencowe, B. J. Alternative splicing: new insights from global analyses. Cell 126, 37–47 (2006).

  3. Pan, Q., Shai, O., Lee, L. J., Frey, B. J. & Blencowe, B. J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415 (2008).

  4. Ben-Dov, C., Hartmann, B., Lundgren, J. & Valcárcel, J. Genome-wide analysis of alternative pre-mRNA splicing. J. Biol. Chem. 283, 1229–1233 (2008).

  5. Graveley, B. R. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17, 100–107 (2001).

  6. Nilsen, T. W. & Graveley, B. R. Expansion of the eukaryotic proteome by alternative splicing. Nature 463, 457–463 (2010).

  7. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinform. 12, 323 (2011).

  8. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013).

  9. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).

  10. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  11. Wang, D. et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 15, e8503 (2019).

  12. Gonzàlez-Porta, M., Frankish, A., Rung, J., Harrow, J. & Brazma, A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 14, R70 (2013).

  13. Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 13 (2016).

  14. Deschamps-Francoeur, G., Simoneau, J. & Scott, M. S. Handling multi-mapped reads in RNA-seq. Comput. Struct. Biotechnol. J. 18, 1569–1576 (2020).

  15. Sarkar, H., Srivastava, A., Bravo, H. C., Love, M. I. & Patro, R. Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data. Bioinformatics 36, i102–i110 (2020).

  16. Pardo-Palacios, F. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-777702/v1 (2021).

  17. Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1438 (2020).

  18. Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Preprint at bioRxiv https://doi.org/10.1101/672931 (2020).

  19. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).

  20. Prjibelski, A. D. et al. Accurate isoform discovery with IsoQuant using long reads. Nat. Biotechnol. https://doi.org/10.1038/s41587-022-01565-y (2023).

  21. Soneson, C., Matthes, K. L., Nowicka, M., Law, C. W. & Robinson, M. D. Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage. Genome Biol. 17, 12 (2016).

  22. Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305 (2019).

  23. Kuo, R. I. et al. Illuminating the dark side of the human transcriptome with long read transcript sequencing. BMC Genomics 21, 751 (2020).

  24. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  25. Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).

  26. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39, 1–22 (1977).

  27. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).

  28. Eddelbuettel, D. et al. Rcpp: Seamless R and C++ integration. J. Stat. Softw. 40, 1–18 (2011).

  29. Eddelbuettel, D. Seamless R and C++ Integration with Rcpp. (Springer, 2013).

  30. R Core Team. R: a language and environment for statistical computing. (R Foundation for Statistical Computing, 2021).

  31. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).

  32. Hardwick, S. A. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods 13, 792–798 (2016).

  33. Chen, Y. et al. A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines. Preprint at bioRxiv https://doi.org/10.1101/2021.04.21.440736 (2021).

  34. Pertea, G. & Pertea, M. GFF Utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).

  35. Aken, B. L. et al. The Ensembl gene annotation system. Database 2016, baw093 (2016).

  36. Parker, M. T. et al. Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification. eLife 9, e49658 (2020).

  37. Berardini, T. Z. et al. The Arabidopsis information resource: Making and mining the ‘gold standard’ annotated reference plant genome. Genesis 53, 474–485 (2015).

  38. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl Acad. Sci. USA 111, 9869–9874 (2014).

  39. Gleeson, J. et al. Accurate expression quantification from nanopore direct RNA sequencing with NanoCount. Nucleic Acids Res. https://doi.org/10.1093/nar/gkab1129 (2021).

  40. Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

  41. Hu, Y. et al. LIQA: long-read isoform quantification and analysis. Genome Biol. 22, 182 (2021).

  42. Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 22, 310 (2021).

  43. Zhang, Y. et al. Transcriptionally active HERV-H retrotransposons demarcate topologically associating domains in human pluripotent stem cells. Nat. Genet. 51, 1380–1388 (2019).

  44. Lu, X. et al. The retrovirus HERVH is a long noncoding RNA required for human embryonic stem cell identity. Nat. Struct. Mol. Biol. 21, 423–425 (2014).

  45. Kelley, D. & Rinn, J. Transposable elements reveal a stem cell-specific class of long noncoding RNAs. Genome Biol. 13, R107 (2012).

  46. Göke, J. & Ng, H. H. CTRL+INSERT: retrotransposons and their contribution to regulation and innovation of the transcriptome. EMBO Rep. 17, 1131–1144 (2016).

  47. Berrens, R. V. et al. Locus-specific expression of transposable elements in single cells with CELLO-seq. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01093-1 (2021).

  48. Semenick, D. Tests and measurements: the t-test. J. Strength Cond. 12, 36 (1990).

  49. Massey, F. J. The Kolmogorov–Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46, 68–78 (1951).

  50. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker. RepeatMasker http://repeatmasker.org (1996).

  51. Soneson, C. et al. A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nat. Commun. 10, 3359 (2019).

  52. Troskie, R.-L. et al. Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome. Genome Biol. 22, 146 (2021).

  53. Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206 (2018).

  54. Mulroney, L. et al. Identification of high confidence human poly(A) RNA isoform scaffolds using nanopore sequencing. RNA https://doi.org/10.1261/rna.078703.121 (2021).

  55. Chen, Y., Sim, A., Lee, J. & Goeke, J. Bambu (source code). Code Ocean https://codeocean.com/capsule/3893005/tree/v2 (2023).

Acknowledgements

This work was supported by funding from the Agency for Science, Technology and Research (A*STAR) and the National Medical Research Council. M.L. was supported by R01 HG009937.

Author information

Contributions

Y.C. and A.S. designed and implemented the computational method. J.G. conceived the project. Y.C., A.S. and J.G. designed the study and experiments and analyzed data. Y.K.W., K.Y., J.J.X.L. and M.H.L. contributed to the implementation of the computational method. M.L. contributed to the design of the computational method. Y.C., A.S. and J.G. organized and wrote the paper with contributions from all authors.

Corresponding author

Correspondence to Jonathan Göke.

Ethics declarations

Competing interests

J.G. received travel and accommodation expenses to speak at the Oxford Nanopore Community Meeting 2018. All other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Chung Chau HON and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling editors: Lei Tang and Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–8, Supplementary Text, Supplementary Text Figs. 1–5, Supplementary Text Tables 1–7 and Supplementary Notes.

Reporting Summary

Peer Review File

Supplementary Table

Description of samples used for data analysis and annotation information on novel retrotransposon-derived transcripts based on Bambu.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Y., Sim, A., Wan, Y.K. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods 20, 1187–1195 (2023). https://doi.org/10.1038/s41592-023-01908-w

  • DOI: https://doi.org/10.1038/s41592-023-01908-w
