Abstract
Pervasive transcription of the human genome results in a heterogeneous mix of coding RNAs and long noncoding RNAs (lncRNAs). Only a small fraction of lncRNAs have demonstrated regulatory functions, thus making functional lncRNAs difficult to distinguish from nonfunctional transcriptional byproducts. This difficulty has resulted in numerous competing human lncRNA classifications that are complicated by a steady increase in the number of annotated lncRNAs. To address these challenges, we quantitatively examined transcription, splicing, degradation, localization and translation for coding and noncoding human genes. We observed that annotated lncRNAs had lower synthesis and higher degradation rates than mRNAs and discovered mechanistic differences explaining slower lncRNA splicing. We grouped genes into classes with similar RNA metabolism profiles, containing both mRNAs and lncRNAs to varying extents. These classes exhibited distinct RNA metabolism, different evolutionary patterns and differential sensitivity to cellular RNA-regulatory pathways. Our classification provides an alternative to genomic context-driven annotations of lncRNAs.
Acknowledgements
U.O. acknowledges support from an award from the US National Institutes of Health (R01-GM104962) and the Simons Institute for the Theory of Computing at UC Berkeley, where he was a long-term visitor in the Algorithmic Challenges in Genomics Program in the spring of 2015. N.M. acknowledges support from EU Marie Curie IIF.
Author information
Contributions
N.M. and U.O. conceived the project; N.M. and U.O. developed the methodology; N.M., L.C. and S.d.P. developed software and performed formal analysis; N.M. and A.H. conducted the investigation; N.M. conducted the visualization; N.M. and U.O. wrote the original draft; L.C., S.d.P. and M.P. reviewed and edited the paper; N.M. and U.O. acquired funding; N.M. and U.O. provided resources; N.M. and U.O. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 ERCC fit and table.
The fit between the number of expected ERCC molecules and the observed TPM measurement for total RNA depleted of rRNA with (a) Ribo-Zero or (b) RNase H with oligos targeting rRNA, and (c) RNA that underwent one round of poly(A) selection. (d) Median gene expression (TPM) of 101 tissues/cell lines from strand-specific paired-end RNA-seq generated by ENCODE. CPC distribution for the high-expression population of (e) intronless and (f) multiexonic genes in HEK293 cells. (g) Boxplot of the distribution of primary RNA fraction for different data types for genes with mature RNA TPM > 1 in total RNA (n = 12,033 genes). Coverage depth compared to coverage breadth of 4sU and GRO-seq data for (h) the genome, (i) introns of coding genes, (j) intergenic enhancers, (k) exons of coding genes, (l) exons of lncRNAs, and (m) introns of lncRNAs.
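The kind of spike-in fit described in panels (a) to (c) can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes a straight-line fit in log10 space, and the function names `fit_ercc` and `tpm_to_molecules` as well as the spike-in values are hypothetical.

```python
import numpy as np

def fit_ercc(expected_molecules, observed_tpm):
    """Least-squares line through log10(expected molecules) vs log10(observed TPM).

    Assumption: a linear relationship in log-log space; the source does not
    state which model was used.
    """
    x = np.log10(np.asarray(expected_molecules, dtype=float))
    y = np.log10(np.asarray(observed_tpm, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # coefficients, highest degree first
    return slope, intercept

def tpm_to_molecules(tpm, slope, intercept):
    """Invert the fitted line to estimate a molecule count from a TPM value."""
    return 10 ** ((np.log10(tpm) - intercept) / slope)

# Synthetic spike-in values, made up for illustration only.
expected = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
observed = 0.002 * expected  # a perfectly proportional toy response

slope, intercept = fit_ercc(expected, observed)
estimated = tpm_to_molecules(2.0, slope, intercept)
```

Under these assumptions, a slope near 1 indicates that TPM scales proportionally with molecule number, and the inverted line gives a rough conversion from TPM to molecules for endogenous genes.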
Supplementary Figure 2 Splicing metrics, features and alternative models.
(a) Description of θ. (b) Features utilized in splicing models: physical features (orange), canonical splicing signals (blue) and splicing regulatory element density (blue). For details regarding calculation of splice site strength, polypyrimidine tract score, branchpoint score and exonic splicing enhancers and silencers, see Supplemental Experimental Procedures. (c) Violin plot of the θ calculated for introns of coding genes, lncRNAs, mirtrons and snoRNA host introns. (d) The average r-squared for all regression models generated for each labeling time point and intron category. (e) The Spearman correlation coefficient for each feature with θ for different feature categories and intron types. (f) The variable importance for different feature categories and intron types. The average NET-seq signal ±25 nucleotides from the 5′ splice site for (g) total RNA polymerase II, (h) unphosphorylated RNA polymerase II, and (i) Ser2P RNA polymerase II.
Supplementary Figure 3 Comparison of inferred rates.
Boxplot of the (a) synthesis, (b) processing, and (c) degradation rates. (d) The Pearson correlation between rates derived from all time points. (e) The distribution of the polysomal-to-cytosolic ratio for coding genes, lncRNAs, and pseudogenes. (f) The distribution of synthesis rates for polyribosomal lncRNAs divided into groups based on the presence of a translated ORF.
Supplementary Figure 4 Characterizing class behavior.
(a) Optimal cluster number estimation by gap statistic. (b) Clustering GO enrichment for protein-coding genes and fold-enrichment of unclassified genes (grey). (c) Steady-state HEK293 expression distribution for each class. (d) Tissue specificity score distribution for each class and for genes with insufficient metabolic data in HEK293 cells. (e) Nuc/Cyt localization in mouse liver RNA-seq. (f) Odds ratio for enrichment of the "core" and "missing" proteome.
Supplementary Figure 5 Regulatory and fitness differences.
(a) Log2 fold change of RBP perturbation relative to control for K562 ENCODE data. (b) Boxplot of cytoplasmic vs nuclear localization for genes grouped by origination class. A line connects the means (points) of each class.
Supplementary Figure 6 Characterization of lncRNAs in classes.
(a) The odds ratio of the overlap between lncRNAs that were either found ("yes") or not found ("no") in lncRNAdb for each lncRNA biotype. The numbers represent the gene count in each category. The fraction of nucleotides in a particular class with a fitCons score > S calculated for (b) coding exons and (c) 3′ UTR exons of protein-coding genes and (d) all exons of lncRNA genes defined by GENCODE v19. The signal in the “sense-intronic” category, which comprises genes located in introns of protein-coding genes, may be due to higher-than-background signal within introns of coding genes. (e-g) Coverage of 4sU (blue), total (green), cytoplasmic (red) and nuclear (cyan) RNA profiles for example lncRNAs from c6, c5, and c7, respectively. (h) Comparison of median values for each RNA metabolism feature by each class for coding genes (left) and lncRNAs (right).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–6 and Supplementary Note (PDF 1805 kb)
About this article
Cite this article
Mukherjee, N., Calviello, L., Hirsekorn, A. et al. Integrative classification of human coding and noncoding genes through RNA metabolism profiles. Nat Struct Mol Biol 24, 86–96 (2017). https://doi.org/10.1038/nsmb.3325
DOI: https://doi.org/10.1038/nsmb.3325