Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Investigating open reading frames in known and novel transcripts using ORFanage

A preprint version of the article is available at bioRxiv.

Abstract

ORFanage is a system designed to assign open reading frames (ORFs) to known and novel gene transcripts while maximizing similarity to annotated proteins. The primary intended use of ORFanage is the identification of ORFs in the assembled results of RNA-sequencing experiments, a capability that most transcriptome assembly methods do not have. Our experiments demonstrate how ORFanage can be used to find novel protein variants in RNA-seq datasets, and to improve the annotations of ORFs in tens of thousands of transcript models in the human annotation databases. Through its implementation of a highly accurate and efficient pseudo-alignment algorithm, ORFanage is substantially faster than other ORF annotation methods, enabling its application to very large datasets. When used to analyze transcriptome assemblies, ORFanage can aid in the separation of signal from transcriptional noise and the identification of likely functional transcript variants, ultimately advancing our understanding of biology and medicine.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Overview of irregularities in reference database ORF annotation.
Fig. 2: Diagram illustrating the algorithm implemented in ORFanage.
Fig. 3: Novel ORFs in the GTEx dataset inferred using ORFanage.

Data availability

No new sequencing data were created for this study. The sequencing data used in this study are available through the GTEx project (phs000424.v7.p2). GTEx data were first analyzed as part of the CHESS project and the details can be found in the corresponding resources and publications (http://ccb.jhu.edu/chess/). The datasets analyzed in this study are (1) GENCODE annotation build version 41 (https://www.gencodegenes.org/human/release_41.html); (2) RefSeq annotation build 110 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/110/); (3) MANE joint annotation build version 1.0 (https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/); (4) A. thaliana annotation (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/all_assembly_versions/GCF_000001735.3_TAIR10/); and (5) C. elegans genome annotation (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Caenorhabditis_elegans/all_assembly_versions/GCF_000002985.6_WBcel235/). Source data are provided with this paper.

Code availability

All code required to reproduce the data generated within the study from public sources is provided at https://github.com/alevar/ORFanage_tests. The core method is implemented in C++ and based on the GFFutils45 and KSW264,65 libraries. The code and test data are available for download at https://github.com/alevar/ORFanage/releases/tag/1.0 (https://doi.org/10.5281/zenodo.8102912)66. Jupyter notebooks used to generate all results described in the manuscript are provided separately at https://github.com/alevar/ORFanage_tests (https://doi.org/10.5281/zenodo.8102918)67. All additional software methods used in this study and their versions and appropriate references are listed in Methods.

References

  1. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).

    Article  Google Scholar 

  2. Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942–D949 (2023).

    Article  Google Scholar 

  3. Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018).

    Article  Google Scholar 

  4. Varabyou, A. et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis and protein structure. Preprint at bioRxiv https://doi.org/10.1101/2022.12.21.521274 (2022).

  5. Salzberg, S. L. Open questions: how many genes do we have? BMC Biol. 16, 94 (2018).

    Article  Google Scholar 

  6. Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).

    Article  Google Scholar 

  7. Rodriguez, J. M. et al. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013).

    Article  Google Scholar 

  8. Tress, M. L., Abascal, F. & Valencia, A. Alternative splicing may not be the key to proteome complexity. Trends Biochem. Sci. 42, 98–110 (2017).

    Article  Google Scholar 

  9. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).

    Article  Google Scholar 

  10. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).

    Article  Google Scholar 

  11. Reyes, A. & Huber, W. Alternative start and termination sites of transcription drive most transcript isoform differences across human tissues. Nucleic Acids Res. 46, 582–592 (2018).

    Article  Google Scholar 

  12. Sinitcyn, P. et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01714-x (2023).

  13. Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022).

    Article  Google Scholar 

  14. Park, E., Pan, Z., Zhang, Z., Lin, L. & Xing, Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 102, 11–26 (2018).

    Article  Google Scholar 

  15. Zhang, S. et al. New insights into Arabidopsis transcriptome complexity revealed by direct sequencing of native RNAs. Nucleic Acids Res. 48, 7700–7711 (2020).

    Article  Google Scholar 

  16. Roach, N. P. et al. The full-length transcriptome of C. elegans using direct RNA sequencing. Genome Res. 30, 299–312 (2020).

    Article  MathSciNet  Google Scholar 

  17. Zhao, S. Alternative splicing, RNA-seq and drug discovery. Drug Discov. Today 24, 1258–1267 (2019).

    Article  Google Scholar 

  18. Kiyose, H. et al. Comprehensive analysis of full-length transcripts reveals novel splicing abnormalities and oncogenic transcripts in liver cancer. PLoS Genet. 18, e1010342 (2022).

    Article  Google Scholar 

  19. Leung, S. K. et al. Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing. Cell Rep. 37, 110022 (2021).

    Article  Google Scholar 

  20. Matlin, A. J., Clark, F. & Smith, C. W. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005).

    Article  Google Scholar 

  21. Tazi, J., Bakkour, N. & Stamm, S. Alternative splicing and disease. Biochim. Biophys. Acta 1792, 14–26 (2009).

    Article  Google Scholar 

  22. Garcia-Blanco, M. A., Baraniak, A. P. & Lasda, E. L. Alternative splicing in disease and therapy. Nat. Biotechnol. 22, 535–546 (2004).

    Article  Google Scholar 

  23. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).

    Article  Google Scholar 

  24. Merkin, J., Russell, C., Chen, P. & Burge, C. B. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science 338, 1593–1599 (2012).

    Article  Google Scholar 

  25. Boulet, A. et al. The mammalian phosphate carrier SLC25A3 is a mitochondrial copper transporter required for cytochrome c oxidase biogenesis. J. Biol. Chem. 293, 1887–1896 (2018).

    Article  Google Scholar 

  26. Kim, H. K., Pham, M. H. C., Ko, K. S., Rhee, B. D. & Han, J. Alternative splicing isoforms in health and disease. Pflüg. Arch. Eur. J. Physiol. 470, 995–1016 (2018).

    Article  Google Scholar 

  27. Frampton, G. M. et al. Activation of MET via diverse exon 14 splicing alterations occurs in multiple tumor types and confers clinical sensitivity to MET inhibitorsMET Exon 14 alterations confer response to targeted therapy. Cancer Discov. 5, 850–859 (2015).

    Article  Google Scholar 

  28. Kahles, A. et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell 34, 211–224 (2018).

    Article  Google Scholar 

  29. Brooks, A. N. et al. A pan-cancer analysis of transcriptome changes associated with somatic mutations in U2AF1 reveals commonly altered splicing events. PLoS ONE 9, e87361 (2014).

    Article  Google Scholar 

  30. Allen, A. S. et al. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).

    Article  Google Scholar 

  31. Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).

    Article  Google Scholar 

  32. Varabyou, A., Salzberg, S. L. & Pertea, M. Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments. Genome Res. 31, 301–308 (2021).

    Article  Google Scholar 

  33. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).

    Article  Google Scholar 

  34. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).

    Article  Google Scholar 

  35. Vitting-Seerup, K. & Sandelin, A. The landscape of isoform switches in human cancers. Mol. Cancer Res. 15, 1206–1220 (2017).

    Article  Google Scholar 

  36. Tang, S., Lomsadze, A. & Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 43, e78 (2015).

    Article  Google Scholar 

  37. Vitting-Seerup, K., Porse, B. T., Sandelin, A. & Waage, J. spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data. BMC Bioinformatics 15, 81 (2014).

    Article  Google Scholar 

  38. Kang, Y. et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 45, W12–W16 (2017).

    Article  Google Scholar 

  39. Singh, U. & Wurtele, E. S. orfipy: a fast and flexible tool for extracting ORFs. Bioinformatics 37, 3019–3020 (2021).

    Article  Google Scholar 

  40. Tress, M. L., Abascal, F. & Valencia, A. Most alternative isoforms are not functionally important. Trends Biochem. Sci. 42, 408–410 (2017).

    Article  Google Scholar 

  41. Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50, D988–D995 (2022).

    Article  Google Scholar 

  42. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276–277 (2000).

    Article  Google Scholar 

  43. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177–1184 (2013).

    Article  Google Scholar 

  44. Lonsdale, J. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

    Article  Google Scholar 

  45. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).

    Article  Google Scholar 

  46. Moss, S. E. & Morgan, R. O. The annexins. Genome Biol. 5, 219 (2004).

    Article  Google Scholar 

  47. Gerke, V. & Moss, S. E. Annexins: from structure to function. Physiol. Rev. 82, 331–371 (2002).

    Article  Google Scholar 

  48. McCulloch, K. M. et al. An alternative N-terminal fold of the intestine-specific annexin A13a induces dimerization and regulates membrane-binding. J. Biol. Chem. 294, 3454–3463 (2019).

    Article  Google Scholar 

  49. Lillebostad, P. A. et al. Structure of the ALS mutation target annexin A11 reveals a stabilising N-terminal segment. Biomolecules 10, 660 (2020).

    Article  Google Scholar 

  50. Fernández-Lizarbe, S. et al. Structural and lipid-binding characterization of human annexin A13a reveals strong differences with its long A13b isoform. Biol. Chem. 398, 359–371 (2017).

    Article  Google Scholar 

  51. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).

    Article  Google Scholar 

  52. Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).

    Article  Google Scholar 

  53. Finstermeier, K. et al. A mitogenomic phylogeny of living primates. PLoS ONE 8, e69504 (2013).

    Article  Google Scholar 

  54. Wall, J. D., Robinson, J. A. & Cox, L. A. High-resolution estimates of crossover and noncrossover recombination from a captive baboon colony. Genome Biol. Evol. 14, evac040 (2022).

    Article  Google Scholar 

  55. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2020).

    Article  Google Scholar 

  56. Sommer, M. J. et al. Structure-guided isoform identification for the human transcriptome. eLife 11, e82556 (2022).

    Article  Google Scholar 

  57. Pockrandt, C., Steinegger, M. & Salzberg, S. L. PhyloCSF++: a fast and user-friendly implementation of PhyloCSF with annotation tools. Bioinformatics 38, 1440–1442 (2022).

    Article  Google Scholar 

  58. Lin, M. F., Jungreis, I. & Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27, 275–282 (2011).

    Article  Google Scholar 

  59. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).

    Article  Google Scholar 

  60. Varabyou, A., Pertea, G., Pockrandt, C. & Pertea, M. TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets. Bioinformatics 37, 3650–3651 (2021).

    Article  Google Scholar 

  61. Swarbreck, D. et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36, D1009–D1014 (2007).

    Article  Google Scholar 

  62. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).

    Article  Google Scholar 

  63. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).

    Article  Google Scholar 

  64. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

    Article  Google Scholar 

  65. Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 45 (2018).

    Article  Google Scholar 

  66. Varabyou, A. ORFanage: reference guided ORF annotation 1.0.2. Zenodo https://doi.org/10.5281/zenodo.8102912 (2023).

  67. Varabyou, A. ORFanage evaluation notebooks. Zenodo https://doi.org/10.5281/zenodo.8102918 (2023).

  68. DeLano, W. L. PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 40, 82–92 (2002).

    Google Scholar 

Download references

Acknowledgements

This work was supported in part by the US National Institutes of Health under grants nos. R01 HG006677 (S.L.S.), R01 MH123567 (S.L.S.) and R35 GM130151 (S.L.S.) and by the US National Science Foundation under grant no. DBI-1759518 (M.P.). We would also like to thank C. Pockrandt for helpful discussions on phyloCSF++ implementation and usage.

Author information

Authors and Affiliations

Authors

Contributions

A.V. and B.E. conceived and developed the original idea. A.V. developed and implemented the final method and experiments. A.V., B.E., S.L.S. and M.P. conceptualized the study, methods and wrote the paper.

Corresponding authors

Correspondence to Ales Varabyou or Mihaela Pertea.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Liugo Wang and the other anonymous reviewer for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–4 and Tables 1 and 2.

Reporting Summary

Source data

Source Data Fig. 1

Three files with data used to generate Fig. 1c,d,e,g,h. README.md is included to map individual files to panels within figure and to provide descriptions of the columns.

Source Data Fig. 3

Contains a file with data used to generate Fig. 3a. README.md is included with a detailed description of the contents.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Varabyou, A., Erdogdu, B., Salzberg, S.L. et al. Investigating open reading frames in known and novel transcripts using ORFanage. Nat Comput Sci 3, 700–708 (2023). https://doi.org/10.1038/s43588-023-00496-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s43588-023-00496-1

This article is cited by

Search

Quick links

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics