The landscape of long noncoding RNAs in the human transcriptome

Abstract

Long noncoding RNAs (lncRNAs) are emerging as important regulators of tissue physiology and disease processes including cancer. To delineate genome-wide lncRNA expression, we curated 7,256 RNA sequencing (RNA-seq) libraries from tumors, normal tissues and cell lines comprising over 43 Tb of sequence from 25 independent studies. We applied ab initio assembly methodology to this data set, yielding a consensus human transcriptome of 91,013 expressed genes. Over 68% (58,648) of genes were classified as lncRNAs, of which 79% were previously unannotated. About 1% (597) of the lncRNAs harbored ultraconserved elements, and 7% (3,900) overlapped disease-associated SNPs. To prioritize lineage-specific, disease-associated lncRNA expression, we employed non-parametric differential expression testing and nominated 7,942 lineage- or cancer-associated lncRNA genes. The lncRNA landscape characterized here may shed light on normal biology and cancer pathogenesis and may be valuable for future biomarker development.

Figure 1: Ab initio transcriptome assembly shows an expansive landscape of human transcription.
Figure 2: Characterization of the MiTranscriptome assembly.
Figure 3: Analysis of conservation in lncRNAs.
Figure 4: Methodology for discovering cancer-associated lncRNAs.
Figure 5: Discovery of lineage-associated and cancer-associated lncRNAs in the MiTranscriptome compendium.

References

  1. Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–E386 (2015).
  2. Kandoth, C. et al. Mutational landscape and significance across 12 major cancer types. Nature 502, 333–339 (2013).
  3. Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133 (2013).
  4. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
  5. Ulitsky, I. & Bartel, D.P. lincRNAs: genomics, evolution, and mechanisms. Cell 154, 26–46 (2013).
  6. Prensner, J.R. & Chinnaiyan, A.M. The emergence of lncRNAs in cancer biology. Cancer Discov. 1, 391–407 (2011).
  7. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013).
  8. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177–1184 (2013).
  9. Prensner, J.R. et al. Transcriptome sequencing across a prostate cancer cohort identifies PCAT-1, an unannotated lincRNA implicated in disease progression. Nat. Biotechnol. 29, 742–749 (2011).
  10. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
  11. Cabili, M.N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
  12. Pruitt, K.D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–D763 (2014).
  13. Karolchik, D. et al. The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 42, D764–D770 (2014).
  14. Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 41, e74 (2013).
  15. Finn, R.D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
  16. Kim, M.S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
  17. Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 22, 1775–1789 (2012).
  18. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
  19. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  20. Rosenbloom, K.R. et al. ENCODE data in the UCSC genome browser: year 5 update. Nucleic Acids Res. 41, D56–D63 (2013).
  21. Necsulea, A. et al. The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505, 635–640 (2014).
  22. Dimitrieva, S. & Bucher, P. UCNEbase—a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res. 41, D101–D109 (2013).
  23. Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
  24. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).
  25. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
  26. Grasso, C.S. et al. The mutational landscape of lethal castration-resistant prostate cancer. Nature 487, 239–243 (2012).
  27. Yu, Y.P. et al. Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J. Clin. Oncol. 22, 2790–2799 (2004).
  28. Taylor, B.S. et al. Integrative genomic profiling of human prostate cancer. Cancer Cell 18, 11–22 (2010).
  29. Glück, S. et al. TP53 genomics predict higher clinical and pathologic tumor response in operable early-stage breast cancer treated with docetaxel-capecitabine ± trastuzumab. Breast Cancer Res. Treat. 132, 781–791 (2012).
  30. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
  31. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
  32. Rhodes, D.R. et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 9, 166–180 (2007).
  33. Gray, K.A., Yates, B., Seal, R.L., Wright, M.W. & Bruford, E.A. Genenames.org: the HGNC resources in 2015. Nucleic Acids Res. doi:10.1093/nar/gku1071 (31 October 2014).
  34. Chen, D. et al. LIFR is a breast cancer metastasis suppressor upstream of the Hippo-YAP pathway and a prognostic marker. Nat. Med. 18, 1511–1517 (2012).
  35. Gupta, R.A. et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 464, 1071–1076 (2010).
  36. Prensner, J.R. et al. The long noncoding RNA SChLAP1 promotes aggressive prostate cancer and antagonizes the SWI/SNF complex. Nat. Genet. 45, 1392–1398 (2013).
  37. Guttman, M. et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227 (2009).
  38. Thomas, G. et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat. Genet. 41, 579–584 (2009).
  39. Stacey, S.N. et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor–positive breast cancer. Nat. Genet. 39, 865–869 (2007).
  40. Michailidou, K. et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat. Genet. 45, 353–361 (2013).
  41. Turnbull, C. et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 42, 504–507 (2010).
  42. Li, J. et al. A combined analysis of genome-wide association studies in breast cancer. Breast Cancer Res. Treat. 126, 717–727 (2011).
  43. Amaral, P.P., Clark, M.B., Gascoigne, D.K., Dinger, M.E. & Mattick, J.S. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res. 39, D146–D151 (2011).
  44. Volders, P.J. et al. LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Res. 41, D246–D251 (2013).
  45. Park, C., Yu, N., Choi, I., Kim, W. & Lee, S. lncRNAtor: a comprehensive resource for functional investigation of long noncoding RNAs. Bioinformatics 30, 2480–2485 (2014).
  46. Hangauer, M.J., Vaughn, I.W. & McManus, M.T. Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs. PLoS Genet. 9, e1003569 (2013).
  47. Zhou, Y. et al. Activation of p53 by MEG3 non-coding RNA. J. Biol. Chem. 282, 24731–24742 (2007).
  48. Tomlins, S.A. et al. Urine TMPRSS2:ERG fusion transcript stratifies prostate cancer risk in men with elevated serum PSA. Sci. Transl. Med. 3, 94ra72 (2011).
  49. Prensner, J.R. et al. PCAT-1, a long noncoding RNA, regulates BRCA2 and controls homologous recombination in cancer. Cancer Res. 74, 1651–1660 (2014).
  50. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
  51. Fickett, J.W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10, 5303–5318 (1982).
  52. Kim, M.S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
  53. Chambers, M.C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).
  54. Ye, J. et al. Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 13, 134 (2012).
  55. Eisenberg, E. & Levanon, E.Y. Human housekeeping genes, revisited. Trends Genet. 29, 569–574 (2013).
  56. Bernstein, B.E. et al. Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 120, 169–181 (2005).
  57. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  58. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
  59. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
  60. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
  61. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).

Acknowledgements

We thank B. Palen and J. Hallum for technical assistance with the high-performance computing cluster, S. Roychowdhury for reviewing the manuscript, the University of Michigan DNA Sequencing Core for Sanger sequencing and K. Giles for critically reading the manuscript and for the submission of documents. This work was supported in part by US National Institutes of Health Prostate Specialized Program of Research Excellence grant P50 CA69568, Early Detection Research Network grant U01 CA111275, US National Institutes of Health grants R01 CA132874 and R01 CA154365 (D.G.B. and A.M.C.), and US Department of Defense grant PC100171 (A.M.C.). A.M.C. is supported by the Prostate Cancer Foundation and the Howard Hughes Medical Institute. A.M.C. is an American Cancer Society Research Professor and a Taubman Scholar of the University of Michigan. R.M. was supported by a Prostate Cancer Foundation Young Investigator Award and by US Department of Defense Post-Doctoral Fellowship W81XWH-13-1-0284. Y.S.N. is supported by a University of Michigan Cellular and Molecular Biology National Research Service Award Institutional Predoctoral Training Grant.

Author information

Contributions

M.K.I., Y.S.N. and A.M.C. conceived the study and analyses. M.K.I. processed RNA-seq data and performed ab initio assembly. M.K.I. and Y.S.N. performed data processing and data analysis with assistance from T.R.B., R.M., A.S., Y.H., J.R.E., S.Z., J.R.P. and F.Y.F. R.M., U.S., A.S. and Y.H. performed quantitative PCR validations. M.K.I. and Y.S.N. developed SSEA with the help of H.K.I. D.G.B. contributed primary samples. D.R.R., Y.-M.W. and S.M.D. generated RNA-seq libraries, and X.C. performed the sequencing. M.K.I., Y.S.N. and A.S. developed the web resource. T.R.B. provided systems administration, data storage, high-performance computing and networking support. A.P. performed the proteomics analysis. M.K.I., Y.S.N. and A.M.C. wrote the manuscript. All authors discussed results and commented on the manuscript.

Corresponding author

Correspondence to Arul M Chinnaiyan.

Ethics declarations

Competing interests

Oncomine is supported by ThermoFisher, Inc. (previously Life Technologies and Compendia Biosciences). A.M.C. was a co-founder of Compendia Biosciences and served on the scientific advisory board of Life Technologies before it was acquired. The University of Michigan has filed a patent application for the use of a subset of the lncRNAs described in this study as biomarkers of cancer.

Integrated supplementary information

Supplementary Figure 1 Curation and processing of samples in the MiTranscriptome compendia.

(a) Pie chart showing the number of studies curated from TCGA, ENCODE, MCTP and other publicly available datasets. (b) Workflow for bioinformatics processing of individual RNA-seq libraries. Data sets downloaded as BAM files were first converted to FASTQ format. Quality assessment of FASTQ files was performed using FASTQC. Reads mapping to mitochondria, ribosomal RNA, poly-A sequence, poly-C sequence or phiX virus (a spiked-in control) were filtered out. Fragment length distribution and orientation were determined by mapping a subset of the input reads to a set of large human exons (>500 bp). Reads were aligned using TopHat (v2.0.6) with Bowtie2 (v2.1.0). Gene fusion calling was performed using TopHat-Fusion (v2.0.6) with Bowtie1 (v0.12.9). Read alignment metrics were computed using Picard Tools, and genome track information was generated using BEDTools and UCSC binary utilities. Finally, ab initio transcriptome assembly was performed using Cufflinks version 2.0.2. (c) Scatter plot showing the total fragments (x axis) and the fraction of aligned fragments (y axis) for each RNA-seq library. Coarse quality control filters were used to remove libraries with fewer than 20 million total fragments or 20 million alignments (red point). (d) Dot plot showing for each library the fraction of aligned bases corresponding to RefSeq mRNAs (black points), intronic regions (green points) or intergenic regions (blue points) on the y axis. Libraries with fewer than 50% of aligned bases corresponding to RefSeq mRNA were filtered out (dotted line). (e) Pie chart showing the numbers of primary tumors (red), metastatic tumors (yellow), benign adjacent tissues or tissues from healthy individuals (blue), or cell lines (green) for 6,503 RNA-seq libraries that passed coarse quality control filters.
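The coarse quality-control filters described in panels c and d amount to three threshold checks per library. A minimal Python sketch follows, assuming per-library statistics have already been computed; the LibraryStats record, its field names and the function name are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass

# Thresholds as stated in the caption of Supplementary Figure 1.
MIN_TOTAL_FRAGMENTS = 20_000_000
MIN_ALIGNMENTS = 20_000_000
MIN_MRNA_BASE_FRACTION = 0.50


@dataclass(frozen=True)
class LibraryStats:
    """Hypothetical per-library summary; field names are illustrative."""
    name: str
    total_fragments: int
    aligned_fragments: int
    mrna_base_fraction: float  # fraction of aligned bases in RefSeq mRNA


def passes_coarse_qc(lib: LibraryStats) -> bool:
    """Return True if the library survives all three coarse filters."""
    return (
        lib.total_fragments >= MIN_TOTAL_FRAGMENTS
        and lib.aligned_fragments >= MIN_ALIGNMENTS
        and lib.mrna_base_fraction >= MIN_MRNA_BASE_FRACTION
    )
```

A library with 45 million fragments, 40 million alignments and 72% of aligned bases in RefSeq mRNA would pass; one with 15 million fragments, or one with only 35% mRNA bases, would be removed.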

Supplementary Figure 2 Transfrag filtering.

(a) The dot plot shows the numbers of short transfrags (red), short clipped exons (blue) and long transfrags (black) for each library. (b) The dot plot shows the numbers of unannotated intergenic or antisense transfrags (blue), sense intronic transfrags (green) and annotated transfrags (black) for each library. (c) Example transcript models illustrating categories of ab initio transcripts and sources of background noise. Annotated transfrags (black) overlap reference transcripts on the same strand. Unannotated antisense intronic or intergenic transfrags (blue) may be confounded by genomic DNA contamination. Unannotated sense intronic transfrags (green) may be confounded by contamination from both genomic DNA and incompletely processed RNA. (d) Decision tree depicting the transfrag filtering steps for a single library. First, transfrags were labeled ‘annotated’ or ‘unannotated’ on the basis of overlap with a reference transcriptome catalog. Annotated transfrags and unannotated multiexonic transfrags were considered expressed. Unannotated monoexonic transfrags within introns in the sense orientation of an overlapping transcript were discarded as incompletely processed RNA artifacts. Unannotated antisense or intergenic monoexonic transfrags were subjected to a bivariate kernel density classification method to discriminate recurrent, reliable transcription from genomic DNA contamination artifacts. Transfrags predicted as ‘expressed’ were incorporated into meta-assemblies. (e) Scatter plot comparing the sensitivity of the monoexonic transfrag classifier for correctly detecting annotated transcripts (y axis) and the fraction of unannotated transfrags predicted to be expressed (x axis). (f) Histogram demonstrating the sensitivity for correctly detecting annotated test transcripts held out of the classifier training process.
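The decision tree in panel d can be written as a short chain of conditionals. The sketch below is illustrative only: the function name, argument names and the returned labels are assumptions, and the bivariate kernel density classifier is not implemented here. Its verdict is passed in as a boolean.

```python
def classify_transfrag(
    annotated: bool,
    n_exons: int,
    sense_intronic: bool,
    classifier_expressed: bool,
) -> str:
    """Label one transfrag following the steps of the caption's decision tree.

    classifier_expressed stands in for the output of the bivariate kernel
    density classifier, which is outside the scope of this sketch.
    """
    if annotated:
        return "expressed"      # overlaps a reference transcript
    if n_exons > 1:
        return "expressed"      # unannotated but multiexonic
    if sense_intronic:
        return "discarded"      # likely incompletely processed RNA
    # Unannotated monoexonic, antisense or intergenic: defer to classifier.
    return "expressed" if classifier_expressed else "background"
```

Only transfrags labeled "expressed" would be carried forward into meta-assembly.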

Supplementary Figure 3 Meta-assembly.

(a) Schematic of the transcriptome meta-assembly algorithm using a simplified example with three transfrags transcribed from left to right. The input to the meta-assembly is a list of weighted transfrags (in this case, the weights correspond to FPKM expression values). First, a splice graph is constructed using the transfrag exon boundaries. The splice graph is a directed acyclic graph (DAG) with nodes (rounded rectangular boxes) representing contiguously transcribed genomic bases and edges (arrows) corresponding to possible alternative splicing and promoter usage. The splice graph is then trimmed to remove poorly expressed starting/ending nodes, and adjacent nodes with a degree of one are collapsed. (b) The pruned splice graph from a is subjected to meta-assembly. To encapsulate the splicing pattern information present in the original transfrags, the pruned splice graph is converted into a splicing pattern graph. A splicing pattern graph is a de Bruijn graph where each node represents a group of k consecutive connected nodes from the splice graph (in this example, k = 3), and edges connect adjacent node groups. In real cases, k is automatically chosen to optimize the number of nodes in the splicing pattern graph. Finally, the splicing pattern graph is repeatedly traversed using a greedy dynamic programming algorithm to determine the set of most highly abundant isoforms from the graph. In this example, isoforms ACDE and ABCE recapitulate input transfrags with nearly identical FPKM values, and the invalid isoform combinations ACE and ABCDE are discarded. (c) Genome view showing an example of the meta-assembly procedure for breast cohort transfrags in a chromosome 12q13.3 locus containing the lncRNA HOTAIR and the protein-coding gene HOXC11 on opposite strands (chr. 12: 54,349,995–54,377,376, hg19). In total, 883 transfrags were considered background noise and not used for meta-assembly. A dense cluster of 7,471 expressed transfrags from 1,076 breast RNA-seq libraries was used as input. The aggregated transfrag signal on the positive (+) and negative (–) strands is shown below. Meta-assembly produced 17 transcripts from the transfrags, including transcripts that matched GENCODE HOTAIR and HOXC11 splicing patterns as well as HOTAIR transcripts with unannotated splice sites.
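The way a splicing pattern graph rules out isoforms such as ACE and ABCDE can be illustrated with a small Python sketch. It assumes two hypothetical input transfrags, ACDE and ABCE, written as strings of splice-graph node labels (the caption does not list the three transfrags it uses), and it covers only graph construction and a support check. The greedy traversal, FPKM weighting and automatic choice of k are not reproduced, and all function names are illustrative.

```python
def windows(path, k):
    """All runs of k consecutive splice-graph nodes along one path."""
    return [tuple(path[i:i + k]) for i in range(len(path) - k + 1)]


def build_splicing_pattern_graph(transfrags, k):
    """Nodes are k-node windows; edges join windows adjacent in a transfrag."""
    nodes, edges = set(), set()
    for path in transfrags:
        w = windows(path, k)
        nodes.update(w)
        edges.update(zip(w, w[1:]))
    return nodes, edges


def is_supported(path, nodes, edges, k):
    """True if every window and every window-to-window step was observed."""
    w = windows(path, k)
    if not w:
        return False  # path shorter than k cannot be checked
    return all(x in nodes for x in w) and all(e in edges for e in zip(w, w[1:]))


# Hypothetical inputs chosen to match the isoforms named in the caption.
NODES, EDGES = build_splicing_pattern_graph(["ACDE", "ABCE"], k=3)
```

With k = 3, ACE would require the window (A, C, E) and ABCDE would require (B, C, D). Neither occurs in the inputs, so both are rejected, while ACDE and ABCE are supported.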

Supplementary Figure 4 Characterization of unannotated transcripts.

(a) Dot plots depicting the comparison of the MiTranscriptome with reference transcripts from RefSeq, UCSC or GENCODE. Precision (blue), precision for the subset of transcripts overlapping annotated transcripts (light blue) and sensitivity (orange) are plotted for each comparison. (b) Dot plots comparing the base-wise, splice-site and splicing pattern precision and sensitivity of MiTranscriptome and GENCODE using lncRNAs from RefSeq (left) or Cabili et al. (right). (c) Bar plots comparing the numbers of unannotated transcripts versus different classes of annotated transcripts for each of the 18 cohorts. Top, stacked bar plot showing annotated ncRNAs (red), pseudogenes (cyan), read-throughs (purple) and protein-coding genes (blue). Bottom, bar plot showing unannotated transcripts (pink).

Supplementary Figure 5 MiTranscriptome characterization.

(a) Density histogram depicting the confidence scores for annotated and unannotated lncRNAs. (b) Comparison of the relationship of the maximum number of exons per gene to the number of isoforms per gene. LncRNAs tend to have fewer exons than protein-coding genes, but they have complex splicing patterns that yield multiple transcript isoforms. (c) Cumulative distribution plot for the base-wise conservation fraction of proteins (blue), read-throughs (purple), pseudogenes (cyan), TUCPs (green) and lncRNAs (red). Random intergenic (black) and intronic (gray) regions are plotted as controls. The inset plot highlights the top 5th percentile of the distribution. (d) Bar plot showing KS test statistics for classes of transcripts versus random intergenic controls. (e) ROC curve for predicting the conservation of protein-coding genes versus random intergenic controls. The cutoff (pink point) chosen for calling highly conserved transcripts is plotted. (f) Cumulative distribution plot for promoter conservation (legend shared with c). The inset plot highlights the top 5th percentile of the distribution. (g) Bar plot showing KS tests for promoter conservation versus random intergenic regions. (h) ROC curve for predicting ultraconserved noncoding elements versus random intergenic regions. The cutoff (pink point) chosen for nominating ultraconserved lncRNAs is plotted.
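The KS test statistic plotted in panels d and g is the largest vertical gap between two empirical cumulative distributions. A minimal sketch in plain Python follows, assuming each transcript class is represented by a list of conservation fractions; the function name is illustrative and no P value is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The maximum gap is always attained at one of the observed values.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

Identical samples give 0, and completely separated samples give 1.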

Supplementary Figure 6 Validation of lncRNA transcripts.

One hundred lncRNA transcripts were validated by qRT-PCR across the A549, LNCaP and MCF-7 cell lines using reactions with or without reverse transcriptase. Ct values were first normalized to housekeeping genes (CHMP2A, EMC7, GPI, PSMB2, PSMB4, RAB7A, REEP5, SNRPD3) and then to the median value of all samples using the ΔΔCt method. Data are plotted as the logarithm of fold change over the median with s.e.m. Validation was performed on (a) 38 monoexonic transcripts and (b) 62 multiexonic transcripts. The boxed transcripts are two representative examples of lncRNAs with lineage/cancer specificity in breast or prostate according to SSEA analysis (Supplementary Table 10) whose cell line expression profile (by qRT-PCR) reflects what is expected from tissue analysis.
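The two-step normalization described above (housekeeping genes first, then the median across samples) can be sketched as follows. This is an assumption-laden illustration: it takes the arithmetic mean of the housekeeping Ct values, assumes perfect amplification efficiency (fold change = 2^-ΔΔCt) and reports log2 fold change, none of which is specified in the caption. The function name and input layout are hypothetical.

```python
from statistics import mean, median


def log2_fold_over_median(target_ct, housekeeping_ct):
    """Return log2 fold change over the median sample for one transcript.

    target_ct:       {sample: Ct of the transcript of interest}
    housekeeping_ct: {sample: [Ct values of the housekeeping genes]}
    """
    # Step 1: delta Ct, target minus mean housekeeping Ct within each sample.
    delta = {s: target_ct[s] - mean(housekeeping_ct[s]) for s in target_ct}
    # Step 2: delta-delta Ct relative to the median sample.
    reference = median(delta.values())
    # log2(2 ** -ddCt) = -ddCt = reference - delta
    return {s: reference - d for s, d in delta.items()}
```

Because a lower Ct means more template, a sample whose ΔCt lies 3 cycles below the median is reported as +3, that is, roughly eightfold above the median.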

Supplementary Figure 7 Further validation of lncRNA transcripts.

(a) Heat-map representation of the correlation between qPCR (fold change over the median) and RNA-seq (FPKM) for 100 selected transcripts in the A549, LNCaP and MCF-7 cell lines. (b,c) Two representative examples of the 20 previously unannotated lncRNA transcripts that were analyzed by Sanger sequencing to ensure primer specificity, shown with their associated chromatograms. UCSC Genome Browser views are shown for (b) a multiexonic lncRNA (Gene ID: G021137) and (c) a monoexonic lncRNA (Gene ID: G030545).

Supplementary Figure 8 Classification of transcripts of unknown coding potential.

(a) Decision tree showing the categorization of ab initio transcripts. Unannotated transcripts and annotated noncoding RNAs were classified as either lncRNA or TUCP. Transcript categories for protein-coding genes, pseudogenes and read-throughs were imputed from overlapping reference annotations. (b) ROC curve comparing the false positive rate (x axis) with the true positive rate (y axis) for CPAT coding potential predictions of noncoding RNAs versus protein-coding genes. (c) Curve comparing the probability cutoff (x axis) with balanced accuracy (y axis). The dotted line shows the cutoff used to call TUCP transcripts. (d) Scatter plot comparing the frequency of Pfam domain occurrences in non-transcribed intergenic space versus transcribed regions. Points in red were considered valid Pfam domain hits, and points in black were considered artifacts. (e) Three-dimensional scatter plot comparing Fickett score (x axis), ORF size (y axis) and Hexamer score (z axis) for all transcripts. Transcripts represented by red points contain valid Pfam domains, whereas those represented by blue points do not. (f–h) Box plots comparing ORF size (f), Hexamer score (g) and Fickett score (h) for lncRNAs (red), TUCPs predicted by Pfam only (yellow), TUCPs predicted by CPAT (green) and TUCPs predicted by both Pfam and CPAT (blue).
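Balanced accuracy, the quantity on the y axis of panel c, is the mean of the true positive rate and the true negative rate at a given probability cutoff. The sketch below computes it for lists of coding-probability scores and picks the candidate cutoff with the highest value. Choosing the maximum is an assumption made here for illustration (the caption only says a cutoff was chosen), and the function names and score lists are hypothetical rather than CPAT output.

```python
def balanced_accuracy(coding_scores, noncoding_scores, cutoff):
    """Mean of sensitivity and specificity when calling score >= cutoff coding."""
    tpr = sum(s >= cutoff for s in coding_scores) / len(coding_scores)
    tnr = sum(s < cutoff for s in noncoding_scores) / len(noncoding_scores)
    return (tpr + tnr) / 2


def best_cutoff(coding_scores, noncoding_scores, candidates):
    """Candidate cutoff with the highest balanced accuracy (first on ties)."""
    return max(
        candidates,
        key=lambda c: balanced_accuracy(coding_scores, noncoding_scores, c),
    )
```

Unlike plain accuracy, this measure is not dominated by whichever class has more transcripts, which matters when noncoding and protein-coding sets differ greatly in size.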

Supplementary Figure 9 Enrichment of the MiTranscriptome assembly for disease-associated regions.

(a) Venn diagram comparing the coverage of disease- or trait-associated genomic regions (i.e., GWAS SNPs) for the MiTranscriptome assembly (yellow) in comparison to reference catalogs (blue). (b) Pie charts comparing the distributions of intronic and exonic GWAS SNP coverage of the MiTranscriptome assembly (left) and reference catalogs (right). (c) Dot plot displaying the enrichment of GWAS SNPs versus random SNPs for different transcript categories. Enrichment odds ratios (transcript-SNP overlaps versus shuffled transcript-SNP overlaps) are plotted on the y axis. Points indicate the mean of 100 permutations for tests of enrichment with GWAS SNPs (circle) or random SNPs (diamond), and error bars depict ±2 s.d. of the distribution of odds ratios. Both exonic and whole-transcript enrichment is reported. (d) Dot plot showing the enrichment of GWAS SNPs (circle) versus random SNPs (diamond) for novel intergenic lncRNAs and TUCPs. Enrichment odds ratios (transcript-SNP overlaps versus shuffled transcript-SNP overlaps) are plotted on the y axis. Points indicate the mean of 100 shuffles for comparisons with GWAS SNPs (circle) or random SNPs (diamond), and error bars depict ±2 s.d. of the distribution of odds ratios. Both exonic and whole-transcript enrichment is reported.
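The shuffle-based enrichment in panels c and d can be sketched on a toy one-dimensional genome. This is a simplified illustration, not the authors' procedure: intervals are half-open (start, end) pairs on a single contig, shuffling places each interval uniformly at random with its length preserved, and a 0.5 pseudocount is added to each cell of the odds ratio to avoid division by zero. All names, the pseudocount and the seed are assumptions.

```python
import random
from statistics import mean, stdev


def count_hits(snps, intervals):
    """Number of SNP positions falling inside any half-open interval."""
    return sum(any(s <= pos < e for s, e in intervals) for pos in snps)


def odds_ratio(hits_observed, hits_shuffled, n_snps):
    """Odds of overlap for observed versus shuffled intervals (0.5 pseudocount)."""
    a, b = hits_observed + 0.5, n_snps - hits_observed + 0.5
    c, d = hits_shuffled + 0.5, n_snps - hits_shuffled + 0.5
    return (a / b) / (c / d)


def shuffle_intervals(intervals, genome_length, rng):
    """Move each interval to a random position, keeping its length."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def enrichment(snps, intervals, genome_length, n_shuffles=100, seed=0):
    """Mean and s.d. of odds ratios over repeated shuffles."""
    rng = random.Random(seed)
    observed = count_hits(snps, intervals)
    ratios = [
        odds_ratio(
            observed,
            count_hits(snps, shuffle_intervals(intervals, genome_length, rng)),
            len(snps),
        )
        for _ in range(n_shuffles)
    ]
    return mean(ratios), stdev(ratios)
```

Running the same function on a random SNP set of equal size would give the control points (diamonds) that the caption compares against the GWAS SNPs (circles).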

Supplementary Figure 10 Discovery of lineage-associated and cancer-associated transcripts.

(a) Heat map of lineage-specific transcripts nominated by SSEA. Each column represents a sample set from 1 of 25 cancer (dark gray) and 13 normal (light gray) lineages, and each row represents an individual transcript. Colored labels above columns reflect the organ system cohorts used in assembly. Row side colors correspond to lncRNAs (red), TUCPs (green), pseudogenes (cyan), read-throughs (purple) and protein-coding transcripts (blue). All transcripts were statistically significant (FDR < 1 × 10^−7) and ranked in the top 1% of the most positively or negatively enriched transcripts within at least one sample set. The heat-map color spectrum corresponds to percentile ranks, with underexpressed transcripts colored blue and overexpressed transcripts colored red. The column dendrogram shows unsupervised hierarchical clustering of the sample sets. (b) Heat map of cancer-specific transcripts (CATs) nominated by SSEA. Columns represent 12 cancer types, and colored column labels reflect the organ system cohorts used in assembly. All transcripts were statistically significant (FDR < 1 × 10^−3) and ranked in the top 1% of the most positively or negatively enriched transcripts within at least one sample set. The column dendrogram shows unsupervised clustering results. The row side color and heat-map color schemes are identical to those in a.

Supplementary Figure 11 Lineage-specific and cancer-specific transcripts.

(a) Scatter plot grid showing lineage-specific and cancer-specific transcripts nominated by SSEA. A row of scatter plots for each transcript category is plotted across 12 cancer types. Each plot shows the cancer versus normal enrichment score (x axis) and the cancer lineage enrichment score (y axis). Red points indicate cancer and lineage associated transcripts within the respective cancer types, and gray points indicate all other cancer and lineage associated transcripts. (b,c) Box plots comparing the performance of (b) positively enriched cancer and lineage associated transcripts and (c) negatively enriched transcripts for each category across 12 cancer types. The average of the lineage and cancer versus normal ES is plotted on the y axis.

Supplementary Figure 12 Examples of cancer- and/or lineage-associated transcripts.

(a) Genomic view of the chromosome 6q26-q27 locus. The protein-coding genes QKI and PDE10A flank an intergenic region with two annotated lncRNAs, AK093114 and AK090788. MiTranscriptome transcripts are shown in a dense view populating this intergenic space. The most zoomed view (bottom) depicts MEAT6, a melanoma-associated lncRNA. AK090788 overlaps a portion of MEAT6, but the full MEAT6 transcript uses an alternate start site (black arrow). (b) Expression data for MEAT6 (demarcated by an asterisk in a). This isoform variant does not use the alternate start site used by MEAT6 and closely resembles AK090788. (c,d) Expression profiles for cancer- and lineage-associated transcripts across all MiTranscriptome tissue cohorts are shown for (c) lung adenocarcinoma and (d) thyroid cancer.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1-12 and Supplementary Note. (PDF 8387 kb)

Supplementary Tables 1-9, 11, 12, 14 and 15

Supplementary Tables 1-9, 11, 12, 14 and 15. (XLSX 4529 kb)

Supplementary Table 10

Specific details for lncRNA discoveries. (XLSX 22135 kb)

Supplementary Table 13

GSEA results. (XLSX 19085 kb)

About this article

Cite this article

Iyer, M., Niknafs, Y., Malik, R. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat Genet 47, 199–208 (2015). https://doi.org/10.1038/ng.3192
