Abstract
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes, so that the human gene catalogue can be used in clinical settings.
Acknowledgements
We thank the staff at the Banbury Center at Cold Spring Harbor Laboratory and the Cold Spring Harbor Laboratory Corporate Sponsor Program for supporting a workshop that all authors of this work attended. This work was supported in part by the US National Institutes of Health (NIH) under grants R01-HG006677 (to M.P., S.L.S. and A.V.), R01-MH123567 (to M.P. and S.L.S.), R35-GM130151 (to S.L.S.), U41-HG007234 (to A.F.) and U24-HG007234 (to R.G. and S.C.-S.); the Wellcome Trust under grant WT222155/Z/20/Z (to A.F.); the European Molecular Biology Laboratory (to A.F.); the US National Science Foundation under grant DBI-1759518 (to M.P.); the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under grant T2EDK-00391 (to A.G.H.); Science Foundation Ireland through Future Research Leaders award 18/FRL/6194 and the Irish Research Council through Consolidator Laureate award (IRCLA/2022/2500; to R.J.); the National Center for Biotechnology Information of the National Library of Medicine, NIH (to T.D.M., K.D.P. and S.P.); the National Health and Medical Research Council (NHMRC) APP1186371 (to C.A.W.); the Center for Genomic Medicine at the University of Utah Health, and the H.A. & Edna Benning Foundation (to M.Y.); the Spanish Ministry of Science and Innovation to the EMBL partnership, Centro de Excelencia Severo Ochoa and CERCA Programme/Generalitat de Catalunya (to R.G. and S.C.-S.); the RIKEN Center for Integrative Medical Sciences (to P.C. and H.T.); and Human Technopole (to P.C.).
Author information
Contributions
P.A., S.C.-S., F.M.D.L.V., T.F., A.F., T.G., R.G., J.L.H., A.G.H., R.J., T.D.M., M.P., K.D.P., S.P., H.T., I.U., A.V., C.A.W., M.Y., P.C. and S.L.S. participated in discussions at a Banbury Conference at Cold Spring Harbor Laboratory, providing the source material for this paper. All authors contributed to writing, editing and reviewing the paper.
Ethics declarations
Competing interests
T.F. is chief editor at Nature Genetics, a Nature Portfolio journal. T.F. was not involved in the editorial handling of this Nature paper (journals within the Nature Portfolio are editorially independent). The other authors declare no competing interests.
Peer review
Peer review information
Nature thanks Tuuli Lappalainen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Amaral, P., Carbonell-Sala, S., De La Vega, F.M. et al. The status of the human gene catalogue. Nature 622, 41–47 (2023). https://doi.org/10.1038/s41586-023-06490-x
DOI: https://doi.org/10.1038/s41586-023-06490-x