Long non-coding RNAs (lncRNAs) are largely heterogeneous and functionally uncharacterized. Here, using FANTOM5 cap analysis of gene expression (CAGE) data, we integrate multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes with high-confidence 5′ ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classification of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters. Incorporating genetic and expression data, we show that lncRNAs overlapping trait-associated single nucleotide polymorphisms are specifically expressed in cell types relevant to the traits, implicating these lncRNAs in multiple diseases. We further demonstrate that lncRNAs overlapping expression quantitative trait loci (eQTL)-associated single nucleotide polymorphisms of messenger RNAs are co-expressed with the corresponding messenger RNAs, suggesting their potential roles in transcriptional regulation. Combining these findings with conservation data, we identify 19,175 potentially functional lncRNAs in the human genome.
Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005)
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012)
Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nature Genet. 47, 199–208 (2015)
Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011)
Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012)
Quek, X. C. et al. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res. 43, D168–D173 (2015)
Schmidt, L. H. et al. The long noncoding MALAT-1 RNA indicates a poor prognosis in non-small cell lung cancer and induces migration and tumor growth. J. Thorac. Oncol. 6, 1984–1992 (2011)
Andersson, R. et al. Nuclear stability and transcriptional directionality separate functionally distinct RNA species. Nature Commun. 5, 5336 (2014)
Preker, P. et al. RNA exosome depletion reveals transcription upstream of active human promoters. Science 322, 1851–1854 (2008)
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014)
Quinn, J. J. & Chang, H. Y. Unique features of long non-coding RNA biogenesis and function. Nature Rev. Genet. 17, 47–62 (2016)
Palazzo, A. F. & Lee, E. S. Non-coding RNA: what is functional and what is junk? Front. Genet. 6, 2 (2015)
Engreitz, J. M. et al. Local regulation of gene expression by lncRNA promoters, transcription and splicing. Nature 539, 452–455 (2016)
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010)
Li, M. J. et al. GWASdb v2: an update database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res. 44 (D1), D869–D876 (2016)
GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015)
Farh, K. K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015)
Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nature Methods 10, 1177–1184 (2013)
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012)
Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl Acad. Sci. USA 100, 15776–15781 (2003)
Forrest, A. R. R. et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014)
Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015)
Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015)
Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 41, e74 (2013)
Batut, P., Dobin, A., Plessy, C., Carninci, P. & Gingeras, T. R. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 23, 169–180 (2013)
Sigova, A. A. et al. Divergent transcription of long noncoding RNA/mRNA gene pairs in embryonic stem cells. Proc. Natl Acad. Sci. USA 110, 2876–2881 (2013)
Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genet. 38, 626–635 (2006)
Core, L. J. et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nature Genet. 46, 1311–1320 (2014)
Xiang, J.-F. et al. Human colorectal cancer-specific CCAT1-L lncRNA regulates long-range chromatin interactions at the MYC locus. Cell Res. 24, 513–531 (2014)
Ulitsky, I. Evolution to the rescue: using comparative genomics to understand long non-coding RNAs. Nature Rev. Genet. 17, 601–614 (2016)
Kapusta, A. et al. Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 9, e1003470 (2013)
Villar, D. et al. Enhancer evolution across 20 mammalian species. Cell 160, 554–566 (2015)
Ng, S.-Y., Johnson, R. & Stanton, L. W. Human long non-coding RNAs promote pluripotency and neuronal differentiation by association with chromatin modifiers and transcription factors. EMBO J. 31, 522–533 (2012)
Holm, H. et al. Several common variants modulate heart rate, PR interval and QRS duration. Nature Genet. 42, 117–122 (2010)
Pfeufer, A. et al. Genome-wide association study of PR interval. Nature Genet. 42, 153–159 (2010)
Smith, J. G. et al. Genome-wide association study of electrocardiographic conduction measures in an isolated founder population: Kosrae. Heart Rhythm 6, 634–641 (2009)
Paralkar, V. R. et al. Unlinking an lncRNA from its associated cis element. Mol. Cell 62, 104–110 (2016)
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005)
1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
Lai, F. et al. Activating RNAs associate with Mediator to enhance chromatin architecture and transcription. Nature 494, 497–501 (2013)
Clark, M. B. et al. The reality of pervasive transcription. PLoS Biol. 9, e1000625 (2011)
Struhl, K. Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nature Struct. Mol. Biol. 14, 103–105 (2007)
Severin, J. et al. Interactive visualization and analysis of large-scale sequencing datasets using ZENBU. Nature Biotechnol. 32, 217–219 (2014)
Hasegawa, A., Daub, C., Carninci, P., Hayashizaki, Y. & Lassmann, T. MOIRAI: a compact workflow system for CAGE analysis. BMC Bioinformatics 15, 144 (2014)
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28, 511–515 (2010)
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnol. 29, 644–652 (2011)
Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002)
Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnol. 32, 462–464 (2014)
Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods 9, 215–216 (2012)
Sloan, C. A. et al. ENCODE data at the ENCODE portal. Nucleic Acids Res. 44 (D1), D726–D732 (2016)
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000)
Lin, M. F., Jungreis, I. & Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27, i275–i282 (2011)
Washietl, S. et al. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578–594 (2011)
Olexiouk, V. et al. sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res. 44 (D1), D324–D329 (2016)
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002)
Wheeler, T. J. & Eddy, S. R. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013)
Wheeler, T. J. et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res. 41, D70–D82 (2013)
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010)
Chao, A. & Shen, T.-J. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ. Ecol. Stat. 10, 429–443 (2003)
Meehan, T. F. et al. Logical development of the cell ontology. BMC Bioinformatics 12, 6 (2011)
Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E. & Haendel, M. A. Uberon, an integrative multi-species anatomy ontology. Genome Biol. 13, R5 (2012)
Johnson, A. D. et al. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24, 2938–2939 (2008)
1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010)
Sakharkar, M. K., Chow, V. T. K. & Kangueane, P. Distributions of exons and introns in the human genome. In Silico Biol. 4, 387–393 (2004)
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010)
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010)
Bostock, M., Ogievetsky, V. & Heer, J. D3: data-driven documents. IEEE Trans. Vis. Comput. Graph. 17, 2301–2309 (2011)
Abugessaisa, I. et al. FANTOM5 transcriptome catalog of cellular states based on Semantic MediaWiki. Database 2016, baw105 (2016)
FANTOM5 was made possible by research grants for the RIKEN Omics Science Center and the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT to Y.H. It was also supported by research grants for the RIKEN Preventive Medicine and Diagnosis Innovation Program (RIKEN PMI) to Y.H. and the RIKEN Centre for Life Science Technologies, Division of Genomic Technologies (RIKEN CLST (DGT)) from the MEXT, Japan. A.R.R.F. is supported by a Senior Cancer Research Fellowship from the Cancer Research Trust, the MACA Ride to Conquer Cancer and the Australian Research Council’s Discovery Projects funding scheme (DP160101960). S.D. is supported by award number U54HG007004 from the National Human Genome Research Institute of the National Institutes of Health, funding from the Ministry of Economy and Competitiveness (MINECO) under grant number BIO2011-26205, and SEV-2012-0208 from the Spanish Ministry of Economy and Competitiveness. Y.A.M. is supported by the Russian Science Foundation, grant 15-14-30002. We thank RIKEN GeNAS for generation of the CAGE and RNA-seq libraries, the Netherlands Brain Bank for brain materials, the RIKEN BioResource Centre for providing cell lines and all members of the FANTOM5 consortium for discussions, in particular H. Ashoor, M. Frith, R. Guigo, A. Tanzer, E. Wood, H. Jia, K. Bailie, J. Harrow, E. Valen, R. Andersson, K. Vitting-Seerup, A. Sandelin, M. Taylor, J. Shin, R. Mori, C. Mungall and T. Meehan.
The authors declare no competing financial interests.
Reviewer Information: Nature thanks M. Gerstein, J. Rinn and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
a, Integration of CAGE and transcript models. CAGE clusters were used to integrate transcript models from various sources and their 5′ completeness was assessed on the basis of TIEScore. b, Identification of lncRNAs. TIEScore identified 59,110 genes and coding potential assessment further identified 27,919 lncRNAs in FANTOM CAT at the robust TIEScore cutoff. c, Categorization of lncRNAs. LncRNAs were annotated according to their gene orientation (that is, genomic context) and DHS type (ref. 23; that is, epigenomic context) and then categorized into divergent p-lncRNAs (purple), intergenic p-lncRNAs (blue), e-lncRNAs (green) and other lncRNAs (grey). d, Overlaps between FANTOM CAT and other lncRNA catalogues. e, LncRNA gene models outside FANTOM CAT are 5′ incomplete. LncRNAs found commonly in both catalogues (grey), or only in FANTOM CAT (red), show stronger evidence of transcription initiation (DHS, H3K4me1, H3K4me3 and PolII ChIP-seq; ref. 23) and conservation (phastCons; ref. 38) than those found only in other lncRNA catalogues (blue, green or yellow).
a, FANTOM CAT lncRNA TSSs are well supported. The 5′ ends of FANTOM CAT lncRNAs (first column) have stronger transcriptomic, epigenomic and genomic evidence of transcription initiation than the 5′ ends of lncRNA models in the Human BodyMap 2.0 (ref. 4), miTranscriptome (ref. 3) and GENCODE release 25 (ref. 19) (second column). In b and c, the box plots show the median, quartiles and Tukey whiskers of the estimates of FDR of complete 5′ ends (b) and number of 5′ complete lncRNA genes (c) on the basis of ten sets of gold standard TSS and non-TSS regions (Methods). b, FDR of complete 5′ ends. c, Estimated number of 5′ complete lncRNA genes (total number of genes × [1 − FDR]). d, Validation rate of gene models using RAMPAGE. RAMPAGE data sets (refs 25, 50; n = 207, Methods) were used to validate the lncRNA transcripts in FANTOM CAT and other catalogues (left). Transcripts containing full consensus CDS (CCDS transcripts) were used as controls (right). The exon of a transcript is detected by RAMPAGE (ref. 31) if it overlaps ≥3 RAMPAGE 3′ ends. Transcript detection rates of all catalogues were plotted (upper). About 95% of lncRNA transcripts in the robust FANTOM CAT can be detected, which is slightly higher than the rate for GENCODE release 25 (~92%). The TSS of a detected transcript is validated by RAMPAGE if it is located within the proximity of a RAMPAGE 5′ end (for example, from 0 to 500 bp, x axis, lower). At 100 bp, ~95% of lncRNA transcripts in the robust FANTOM CAT can be validated, versus ~85% for GENCODE release 25. We note that the percentages of CCDS transcripts in FANTOM CAT and GENCODE release 25 detected or validated by RAMPAGE are similar, with the robust and stringent FANTOM CAT catalogues performing slightly better.
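The estimate in panel c is simple arithmetic: the number of genes in a catalogue multiplied by one minus the FDR of complete 5′ ends, summarized over the gold standard sets. The following is a minimal sketch of that calculation only; the function names are hypothetical and the input numbers in the usage note are illustrative, not values from the paper.

```python
from statistics import median


def estimated_complete_genes(total_genes: int, fdr: float) -> float:
    """Return total_genes * (1 - fdr), the formula given in the caption."""
    if not 0.0 <= fdr <= 1.0:
        raise ValueError("fdr must lie in [0, 1]")
    return total_genes * (1.0 - fdr)


def median_estimate(total_genes: int, fdr_estimates: list[float]) -> float:
    """Median of the per-set estimates (one FDR per gold standard set).

    The caption reports box plots over ten sets; taking the median here is
    an illustrative summary, not necessarily the authors' exact procedure.
    """
    if not fdr_estimates:
        raise ValueError("at least one FDR estimate is required")
    return median(estimated_complete_genes(total_genes, f) for f in fdr_estimates)
```

For example, `median_estimate(1000, [0.1, 0.2, 0.3])` evaluates the formula once per FDR estimate and returns the middle value.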
a, An example of improved TSS annotation of a GENCODE release 25 lncRNA gene. The 5′ ends of GENCODE release 25 annotated lncRNA transcripts of TUG1 (ENSG00000253352) are distant from the region of strong CAGE signal, whereas FANTOM CAT added extra transcripts that start accurately from the proximal CAGE signal summit. b, An example of bridged gene models of GENCODE release 25 lncRNA genes. In GENCODE release 25, the locus was annotated with three short lncRNA genes; FANTOM CAT bridged these short lncRNA transcript models into a long transcript model (RP11-973H7.4, ENSG00000267654) starting from the proximal CAGE signal summit.
a, Epigenomic features surrounding TSS. The y axis refers to the fraction of TIR overlaps with peaks of the corresponding epigenomic signal from the Roadmap Epigenome Consortium (ref. 23). b, Genomic features surrounding TSS. Sequence features conducive to generating longer transcripts are enrichment of 5′ splice site (5′ SS) and depletion of polyadenylation sites (PAS). Sequence features associated with transcription initiation include CpG islands, INR (initiator) motif and TATA box motif. c, Core promoter motifs. Grey dashed lines indicate whole-genome background.
a, Percentages of genes with conserved and unconserved TIR (as defined in Fig. 1c) and their overlap with various classes of transposons. b, Enrichment of retrotransposons at unconserved TIR. The Venn diagrams show the overlap between unconserved TIR, DNA transposons and retrotransposons. Retrotransposons are significantly enriched in unconserved TIR of all gene classes (one-tailed Fisher’s exact test, P < 0.05).
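The enrichment test named in panel b can be illustrated with a one-tailed Fisher's exact test on a 2 × 2 table. This is a minimal sketch using only the Python standard library; the table layout (unconserved versus conserved TIR in rows, overlapping versus not overlapping a retrotransposon in columns) is an assumption made for illustration, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-tailed Fisher's exact P value for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the count
    in the top-left cell with all row and column totals held fixed.
    Assumed layout: row 1 = unconserved TIR, row 2 = conserved TIR;
    column 1 = overlaps a retrotransposon, column 2 = does not.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    # Sum the hypergeometric probabilities of all tables at least as extreme.
    numer = sum(
        comb(row1, k) * comb(total - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numer / denom
```

A result below 0.05 would correspond to the significance threshold quoted in the caption; the counts passed in must come from the actual overlap analysis.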
a, Expression level and specificity. cpm denotes relative log expression (RLE)-normalized counts per million. The maximum expression level (log2 cpm) and expression specificity (Chao–Shen’s corrected Shannon entropy; ref. 59) of genes among 69 primary cell facets (ref. 10) were plotted. Box plots show the median (dashed lines), quartiles and Tukey whiskers. b, Percentage of genes within categories expressed within primary cell facets. The circles represent the mean among samples within a facet and the error bars represent 99.99% confidence intervals. Dashed lines represent the means among all samples. c, Number of lncRNA genes expressed within primary cell facets. Dashed line represents the mean among all samples. The x axis is sorted on the basis of number of lncRNA genes expressed. A gene is considered as ‘expressed’ when cpm ≥ 0.01.
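The two quantities in this caption, the ‘expressed’ threshold and an entropy-based specificity score, can be sketched as follows. Note the swap: this sketch computes the plain (uncorrected) Shannon entropy of a gene's expression across facets, not the Chao–Shen corrected estimator used in the figure. Function names are hypothetical; only the cpm ≥ 0.01 threshold comes from the caption.

```python
from math import log2

EXPRESSED_CPM = 0.01  # threshold stated in the caption


def fraction_expressed(cpm_values: list[float]) -> float:
    """Fraction of genes in one sample with cpm >= 0.01."""
    if not cpm_values:
        raise ValueError("cpm_values must not be empty")
    return sum(1 for x in cpm_values if x >= EXPRESSED_CPM) / len(cpm_values)


def shannon_entropy(facet_cpm: list[float]) -> float:
    """Plain Shannon entropy (bits) of one gene's expression across facets.

    Low entropy means expression concentrated in few facets (specific);
    high entropy means broad expression. This is the uncorrected form,
    used here only to illustrate the idea of the specificity measure.
    """
    total = sum(facet_cpm)
    if total <= 0:
        raise ValueError("gene has no expression in any facet")
    return -sum((x / total) * log2(x / total) for x in facet_cpm if x > 0)
```

A gene expressed equally in four facets gives 2 bits, whereas a gene expressed in a single facet gives 0 bits.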
Extended Data Figure 7 Association of cell-type-enriched genes with trait-associated genes of different biological themes.
A detailed view of blocks from Fig. 2a. The dendrograms were coloured as in Fig. 2a. a, ‘Immune system’ cell types and ‘infection and immunity’ traits. b, ‘Hepato-intestinal system’ cell types and ‘hepatic function’ traits. c, ‘Pigmented cells’ cell types and ‘pigmentation’ traits. d, ‘Non-immune blood cells’ cell types and ‘blood homeostasis’ traits. e, ‘Cardiovascular system’ cell types and ‘cardiovascular function’ traits.
Extended Data Figure 8 LncRNA AP001057.1 is associated with classical monocytes and implicated in immune diseases.
a, Genomic view of AP001057.1 (ENSG00000232124) in the ZENBU genome browser (ref. 43). The strongest TSS of AP001057.1 overlaps with an enhancer DHS. The locus overlaps with fine-mapped SNPs associated with Crohn’s disease and GWAS SNPs associated with coeliac disease and inflammatory bowel disease. b, AP001057.1 is associated with classical monocytes (CL:0000860). c, AP001057.1 is significantly upregulated in monocytes upon stimulation with various immunogenic agents (FDR < 0.05 in edgeR (ref. 58), highlighted in red and indicated with asterisks). Note: we performed differential expression analysis to identify lncRNAs that are dynamically regulated upon stimulation, infection or differentiation on the basis of 25 manually curated series of FANTOM5 samples (Supplementary Table 18 and Methods), and the results are available in Supplementary Table 19. Figures were captured (with slight modifications) from the online resource at http://fantom.gsc.riken.jp/cat/v1/#/genes/ENSG00000232124.1.
Extended Data Figure 9 Selective constraints and enrichment of GWAS trait and eQTL-associated SNPs at lncRNA loci.
a, Selective constraints between species (phastCons; ref. 38) and within human population (derived allele frequency; ref. 39). b, Enrichment of GWAS SNPs. Only lead GWAS SNPs (ref. 15) were used (Methods). c, Enrichment of PICS (ref. 17) fine-mapped SNPs in global (all versus all) or focused (immune versus immune) analysis (Methods). d, Enrichment of GTEx eQTL SNPs (ref. 16) associated with expression of mRNAs. Circles represent means and the error bars represent their 99.99% confidence intervals.
We searched for gene loci that overlap eQTL SNPs associated with expression variation of mRNAs (as identified by GTEx; ref. 16). Gene loci overlapping these SNPs were then paired with the corresponding mRNA and their expression correlation across the FANTOM5 expression atlas was investigated. Rows compare the gene types overlapping the SNPs. a, mRNAs; b, all lncRNAs; c, divergent p-lncRNAs; d, intergenic p-lncRNAs; e, e-lncRNAs. Columns compare the relative orientation of the gene pairs and the position of the SNPs. The term ‘all’ refers to all orientations of the gene pairs and positions of the SNPs pooled. Gene pairs were binned on the basis of the number of SNPs linking the pair (bin = 5 SNPs). The data points represent the mean of absolute Spearman’s rho and the error bars represent its 99.99% confidence intervals. At each bin, the number of pairs plotted is the same for the three pair types as indicated.
This file contains Supplementary Notes 1-6, Supplementary Figures 1-14, descriptions for Supplementary Tables 1-19, online resources and Supplementary references. (PDF 9152 kb)
This zipped file contains Supplementary Tables 1-19 – see Supplementary Information document for descriptions. (ZIP 74370 kb)
This zipped file contains source data for Supplementary Figures 1-6. (ZIP 1386 kb)
About this article
Cite this article
Hon, CC., Ramilowski, J., Harshbarger, J. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017). https://doi.org/10.1038/nature21374