Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete—many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.
We thank members of the Guigó laboratory for their valuable input and help with sample handling, data analysis and writing of the manuscript, including E. Palumbo, F. Reverter, A. Breschi, D. Pervouchine, C. Arnan and F. Camara. We thank L. Armengol (qGenomics) for advice on RNA capture, D. Garrido (CRG) for help with eQTL analysis, S. Bonnin (CRG) for help with data manipulation in R, and I. Jungreis (MIT) for advice on PhyloCSF. J. Wright and J. Choudhary (Sanger Institute) helped with the search for peptide hits to putative coding regions. S. Djebali (INRA, France) kindly made available the Compmerge utility. This work and its publication were supported by the National Human Genome Research Institute of the US National Institutes of Health (grants U41HG007234, U41HG007000 and U54HG007004) and the Wellcome Trust (grant WT098051 to R.G.). R.J. was supported by the Ramón y Cajal Subprogram of the Spanish Ministry of Economy and Competitiveness (grant RYC-2011-08851). Work in the laboratory of R.G. was supported by the National Human Genome Research Institute (awards U54HG0070, R01MH101814 and U41HG007234). This research was partly supported by NCCR RNA & Disease, funded by the Swiss National Science Foundation (to R.J.). We thank R. Garrido (CRG) for administrative support. We acknowledge support from the Spanish Ministry of Economy and Competitiveness, Centro de Excelencia Severo Ochoa 2013–2017 (SEV-2012-0208), and from the CERCA Programme, Generalitat de Catalunya.
The authors declare no competing financial interests.
Supplementary Figures 1–17, Supplementary Tables 1–13 and Supplementary Note 1 (PDF 22645 kb)
Human protein-coding potential analysis on full-length reads. (ZIP 9063 kb)
This file lists the protein-coding potential analysis for every full-length read, using the CPAT and PhyloCSF programs.
Novel human ORFs from PhyloCSF (XLSX 11 kb)
Oligonucleotide sequences used in this study (FASTA format) (TXT 1 kb)
This file provides the SMART cDNA library construction adapters, capture blockers and TruSeq adapter sequences used in the library construction (see Supplementary Figure 2c), as well as the primer pairs used in the RT-PCR validation step.
RT-PCR sequences of CLS transcript models (FASTA format) (TXT 2 kb)
About this article
Cite this article
Lagarde, J., Uszczynska-Ratajczak, B., Carbonell, S. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat Genet 49, 1731–1740 (2017). https://doi.org/10.1038/ng.3988