Investigating open reading frames in known and novel transcripts using ORFanage

Varabyou, Ales; Erdogdu, Beril; Salzberg, Steven L.; Pertea, Mihaela

doi:10.1038/s43588-023-00496-1

Article
Published: 31 July 2023

Nature Computational Science volume 3, pages 700–708 (2023)

A preprint version of the article is available at bioRxiv.

Abstract

ORFanage is a system designed to assign open reading frames (ORFs) to known and novel gene transcripts while maximizing similarity to annotated proteins. The primary intended use of ORFanage is the identification of ORFs in the assembled results of RNA-sequencing experiments, a capability that most transcriptome assembly methods do not have. Our experiments demonstrate how ORFanage can be used to find novel protein variants in RNA-seq datasets, and to improve the annotations of ORFs in tens of thousands of transcript models in the human annotation databases. Through its implementation of a highly accurate and efficient pseudo-alignment algorithm, ORFanage is substantially faster than other ORF annotation methods, enabling its application to very large datasets. When used to analyze transcriptome assemblies, ORFanage can aid in the separation of signal from transcriptional noise and the identification of likely functional transcript variants, ultimately advancing our understanding of biology and medicine.
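The idea of choosing an ORF by its agreement with an annotated protein, rather than by length alone, can be illustrated with a toy sketch. This is not the ORFanage algorithm, whose details are not given in this preview; it is a simplified illustration that assumes plus-strand transcripts, inclusive genomic exon coordinates, and a score equal to the number of bases whose codon phase matches the reference coding sequence at the same genomic position. All function names (`exon_positions`, `find_orfs`, `best_orf`) and the example coordinates are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]
STOPS = {"TAA", "TAG", "TGA"}


def exon_positions(exons: List[Interval]) -> List[int]:
    """Genomic coordinate of every base covered by the exons (inclusive ends)."""
    return [g for start, end in exons for g in range(start, end + 1)]


def find_orfs(seq: str) -> List[Interval]:
    """ATG..stop ORFs as half-open transcript intervals, stop codon included.

    In each frame, the first ATG after the previous stop opens an ORF;
    ORFs that never reach a stop codon are dropped.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                orfs.append((start, i + 3))
                start = None
    return orfs


def best_orf(seq: str, exons: List[Interval],
             ref_cds: List[Interval]) -> Tuple[Optional[Interval], int]:
    """Pick the ORF sharing the most in-phase bases with the reference CDS."""
    coords = exon_positions(exons)  # transcript index -> genomic position
    # Codon phase (0, 1 or 2) of every genomic base in the reference CDS.
    ref_phase: Dict[int, int] = {
        g: k % 3 for k, g in enumerate(exon_positions(ref_cds))
    }
    best, best_score = None, 0
    for start, end in find_orfs(seq):
        score = sum(
            1 for t in range(start, end)
            if ref_phase.get(coords[t]) == (t - start) % 3
        )
        if score > best_score:
            best, best_score = (start, end), score
    return best, best_score


# Toy transcript with a 21-base ORF in frame 0 and a 12-base ORF in frame 1.
seq = "ATGCATGCCGCTGTAACCTGA"
exons = [(1000, 1020)]    # hypothetical genomic coordinates of the transcript
ref_cds = [(1004, 1015)]  # hypothetical annotated CDS, in phase with the short ORF

print(find_orfs(seq))                 # [(0, 21), (4, 16)]
print(best_orf(seq, exons, ref_cds))  # ((4, 16), 12)
```

In this example a longest-ORF heuristic would select the 21-base ORF, whereas the phase-aware overlap score selects the shorter ORF that agrees with the reference annotation.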

**Fig. 1: Overview of irregularities in reference database ORF annotation.**

**Fig. 2: Diagram illustrating the algorithm implemented in ORFanage.**

**Fig. 3: Novel ORFs in the GTEx dataset inferred using ORFanage.**

Data availability

No new sequencing data were created for this study. The sequencing data used in this study are available through the GTEx project (phs000424.v7.p2). GTEx data were first analyzed as part of the CHESS project and the details can be found in the corresponding resources and publications (http://ccb.jhu.edu/chess/). The datasets analyzed in this study are (1) GENCODE annotation build version 41 (https://www.gencodegenes.org/human/release_41.html); (2) RefSeq annotation build 110 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/110/); (3) MANE joint annotation build version 1.0 (https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/); (4) A. thaliana annotation (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/all_assembly_versions/GCF_000001735.3_TAIR10/); and (5) C. elegans genome annotation (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Caenorhabditis_elegans/all_assembly_versions/GCF_000002985.6_WBcel235/). Source data are provided with this paper.

Code availability

All code required to reproduce the data generated within the study from public sources is provided at https://github.com/alevar/ORFanage_tests. The core method is implemented in C++ and based on the GFFutils⁴⁵ and KSW2⁶⁴,⁶⁵ libraries. The code and test data are available for download at https://github.com/alevar/ORFanage/releases/tag/1.0 (https://doi.org/10.5281/zenodo.8102912)⁶⁶. Jupyter notebooks used to generate all results described in the manuscript are provided separately at https://github.com/alevar/ORFanage_tests (https://doi.org/10.5281/zenodo.8102918)⁶⁷. All additional software methods used in this study and their versions and appropriate references are listed in Methods.

References

1. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
2. Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942–D949 (2023).
3. Pertea, M. et al. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208 (2018).
4. Varabyou, A. et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis and protein structure. Preprint at bioRxiv https://doi.org/10.1101/2022.12.21.521274 (2022).
5. Salzberg, S. L. Open questions: how many genes do we have? BMC Biol. 16, 94 (2018).
6. Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).
7. Rodriguez, J. M. et al. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013).
8. Tress, M. L., Abascal, F. & Valencia, A. Alternative splicing may not be the key to proteome complexity. Trends Biochem. Sci. 42, 98–110 (2017).
9. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
10. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
11. Reyes, A. & Huber, W. Alternative start and termination sites of transcription drive most transcript isoform differences across human tissues. Nucleic Acids Res. 46, 582–592 (2018).
12. Sinitcyn, P. et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01714-x (2023).
13. Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608, 353–359 (2022).
14. Park, E., Pan, Z., Zhang, Z., Lin, L. & Xing, Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 102, 11–26 (2018).
15. Zhang, S. et al. New insights into Arabidopsis transcriptome complexity revealed by direct sequencing of native RNAs. Nucleic Acids Res. 48, 7700–7711 (2020).
16. Roach, N. P. et al. The full-length transcriptome of C. elegans using direct RNA sequencing. Genome Res. 30, 299–312 (2020).
17. Zhao, S. Alternative splicing, RNA-seq and drug discovery. Drug Discov. Today 24, 1258–1267 (2019).
18. Kiyose, H. et al. Comprehensive analysis of full-length transcripts reveals novel splicing abnormalities and oncogenic transcripts in liver cancer. PLoS Genet. 18, e1010342 (2022).
19. Leung, S. K. et al. Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing. Cell Rep. 37, 110022 (2021).
20. Matlin, A. J., Clark, F. & Smith, C. W. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005).
21. Tazi, J., Bakkour, N. & Stamm, S. Alternative splicing and disease. Biochim. Biophys. Acta 1792, 14–26 (2009).
22. Garcia-Blanco, M. A., Baraniak, A. P. & Lasda, E. L. Alternative splicing in disease and therapy. Nat. Biotechnol. 22, 535–546 (2004).
23. Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).
24. Merkin, J., Russell, C., Chen, P. & Burge, C. B. Evolutionary dynamics of gene and isoform regulation in mammalian tissues. Science 338, 1593–1599 (2012).
25. Boulet, A. et al. The mammalian phosphate carrier SLC25A3 is a mitochondrial copper transporter required for cytochrome c oxidase biogenesis. J. Biol. Chem. 293, 1887–1896 (2018).
26. Kim, H. K., Pham, M. H. C., Ko, K. S., Rhee, B. D. & Han, J. Alternative splicing isoforms in health and disease. Pflüg. Arch. Eur. J. Physiol. 470, 995–1016 (2018).
27. Frampton, G. M. et al. Activation of MET via diverse exon 14 splicing alterations occurs in multiple tumor types and confers clinical sensitivity to MET inhibitors. Cancer Discov. 5, 850–859 (2015).
28. Kahles, A. et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell 34, 211–224 (2018).
29. Brooks, A. N. et al. A pan-cancer analysis of transcriptome changes associated with somatic mutations in U2AF1 reveals commonly altered splicing events. PLoS ONE 9, e87361 (2014).
30. Allen, A. S. et al. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).
31. Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).
32. Varabyou, A., Salzberg, S. L. & Pertea, M. Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments. Genome Res. 31, 301–308 (2021).
33. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
34. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
35. Vitting-Seerup, K. & Sandelin, A. The landscape of isoform switches in human cancers. Mol. Cancer Res. 15, 1206–1220 (2017).
36. Tang, S., Lomsadze, A. & Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 43, e78 (2015).
37. Vitting-Seerup, K., Porse, B. T., Sandelin, A. & Waage, J. spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data. BMC Bioinformatics 15, 81 (2014).
38. Kang, Y. et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 45, W12–W16 (2017).
39. Singh, U. & Wurtele, E. S. orfipy: a fast and flexible tool for extracting ORFs. Bioinformatics 37, 3019–3020 (2021).
40. Tress, M. L., Abascal, F. & Valencia, A. Most alternative isoforms are not functionally important. Trends Biochem. Sci. 42, 408–410 (2017).
41. Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50, D988–D995 (2022).
42. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276–277 (2000).
43. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177–1184 (2013).
44. Lonsdale, J. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
45. Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Res. 9, 304 (2020).
46. Moss, S. E. & Morgan, R. O. The annexins. Genome Biol. 5, 219 (2004).
47. Gerke, V. & Moss, S. E. Annexins: from structure to function. Physiol. Rev. 82, 331–371 (2002).
48. McCulloch, K. M. et al. An alternative N-terminal fold of the intestine-specific annexin A13a induces dimerization and regulates membrane-binding. J. Biol. Chem. 294, 3454–3463 (2019).
49. Lillebostad, P. A. et al. Structure of the ALS mutation target annexin A11 reveals a stabilising N-terminal segment. Biomolecules 10, 660 (2020).
50. Fernández-Lizarbe, S. et al. Structural and lipid-binding characterization of human annexin A13a reveals strong differences with its long A13b isoform. Biol. Chem. 398, 359–371 (2017).
51. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
52. Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
53. Finstermeier, K. et al. A mitogenomic phylogeny of living primates. PLoS ONE 8, e69504 (2013).
54. Wall, J. D., Robinson, J. A. & Cox, L. A. High-resolution estimates of crossover and noncrossover recombination from a captive baboon colony. Genome Biol. Evol. 14, evac040 (2022).
55. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2020).
56. Sommer, M. J. et al. Structure-guided isoform identification for the human transcriptome. eLife 11, e82556 (2022).
57. Pockrandt, C., Steinegger, M. & Salzberg, S. L. PhyloCSF++: a fast and user-friendly implementation of PhyloCSF with annotation tools. Bioinformatics 38, 1440–1442 (2022).
58. Lin, M. F., Jungreis, I. & Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27, 275–282 (2011).
59. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
60. Varabyou, A., Pertea, G., Pockrandt, C. & Pertea, M. TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets. Bioinformatics 37, 3650–3651 (2021).
61. Swarbreck, D. et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36, D1009–D1014 (2007).
62. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
63. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
64. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
65. Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 45 (2018).
66. Varabyou, A. ORFanage: reference guided ORF annotation 1.0.2. Zenodo https://doi.org/10.5281/zenodo.8102912 (2023).
67. Varabyou, A. ORFanage evaluation notebooks. Zenodo https://doi.org/10.5281/zenodo.8102918 (2023).
68. DeLano, W. L. PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 40, 82–92 (2002).

Acknowledgements

This work was supported in part by the US National Institutes of Health under grant nos. R01 HG006677 (S.L.S.), R01 MH123567 (S.L.S.) and R35 GM130151 (S.L.S.) and by the US National Science Foundation under grant no. DBI-1759518 (M.P.). We would also like to thank C. Pockrandt for helpful discussions on the implementation and usage of PhyloCSF++.

Author information

Authors and Affiliations

Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
Ales Varabyou, Beril Erdogdu, Steven L. Salzberg & Mihaela Pertea
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Ales Varabyou, Steven L. Salzberg & Mihaela Pertea
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Beril Erdogdu, Steven L. Salzberg & Mihaela Pertea
Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
Steven L. Salzberg

Authors

Ales Varabyou
Beril Erdogdu
Steven L. Salzberg
Mihaela Pertea

Contributions

A.V. and B.E. conceived and developed the original idea. A.V. developed and implemented the final method and experiments. A.V., B.E., S.L.S. and M.P. conceptualized the study and methods and wrote the paper.

Corresponding authors

Correspondence to Ales Varabyou or Mihaela Pertea.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Liguo Wang and the other anonymous reviewer for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–4 and Tables 1 and 2.

Reporting Summary

Source data

Source Data Fig. 1

Three files with data used to generate Fig. 1c,d,e,g,h. README.md is included to map individual files to panels within the figure and to provide descriptions of the columns.

Source Data Fig. 3

Contains a file with data used to generate Fig. 3a. README.md is included with a detailed description of the contents.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Varabyou, A., Erdogdu, B., Salzberg, S.L. et al. Investigating open reading frames in known and novel transcripts using ORFanage. Nat Comput Sci 3, 700–708 (2023). https://doi.org/10.1038/s43588-023-00496-1

Received: 23 March 2023
Accepted: 05 July 2023
Published: 31 July 2023
Issue Date: August 2023
DOI: https://doi.org/10.1038/s43588-023-00496-1

This article is cited by

CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure
- Ales Varabyou
- Markus J. Sommer
- Mihaela Pertea
Genome Biology (2023)
Reference-guided search for open reading frames
- Liguo Wang
Nature Computational Science (2023)