High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing


Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete—many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.

Published in Nature Genetics.


Primary accessions

Gene Expression Omnibus



Acknowledgments

We thank members of the Guigó laboratory for their valuable input and help with sample handling, data analysis and writing of the manuscript, including E. Palumbo, F. Reverter, A. Breschi, D. Pervouchine, C. Arnan and F. Camara. We thank L. Armengol (qGenomics) for advice on RNA capture, D. Garrido (CRG) for help with eQTL analysis, S. Bonnin (CRG) for help with data manipulation in R, and I. Jungreis (MIT) for advice on PhyloCSF. J. Wright and J. Choudhary (Sanger Institute) helped with the search for peptide hits to putative coding regions. S. Djebali (INRA, France) kindly made available the Compmerge utility. This work and its publication were supported by the National Human Genome Research Institute of the US National Institutes of Health (grants U41HG007234, U41HG007000 and U54HG007004) and the Wellcome Trust (grant WT098051 to R.G.). R.J. was supported by the Ramón y Cajal Subprogram of the Spanish Ministry of Economy and Competitiveness (grant RYC-2011-08851). Work in the laboratory of R.G. was supported by the National Human Genome Research Institute (awards U54HG0070, R01MH101814 and U41HG007234). This research was partly supported by NCCR RNA & Disease, funded by the Swiss National Science Foundation (to R.J.). We thank R. Garrido (CRG) for administrative support. We acknowledge support from the Spanish Ministry of Economy and Competitiveness, Centro de Excelencia Severo Ochoa 2013–2017 (SEV-2012-0208), and from the CERCA Programme, Generalitat de Catalunya.

Author information

Author notes

    • Barbara Uszczynska-Ratajczak
    • Jennifer Harrow
    • Rory Johnson

    Present addresses: Centre of New Technologies, Warsaw, Poland (B.U.-R.); Illumina, Cambridge, UK (J.H.); Department of Clinical Research, University of Bern, Bern, Switzerland (R.J.).

    • Julien Lagarde
    • Barbara Uszczynska-Ratajczak

    These authors contributed equally to this work.


Affiliations

  1. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.

    • Julien Lagarde
    • Barbara Uszczynska-Ratajczak
    • Sílvia Pérez-Lluch
    • Amaya Abad
    • Roderic Guigo
    • Rory Johnson

  2. Universitat Pompeu Fabra (UPF), Barcelona, Spain.

    • Julien Lagarde
    • Barbara Uszczynska-Ratajczak
    • Sílvia Pérez-Lluch
    • Amaya Abad
    • Roderic Guigo
    • Rory Johnson

  3. R&D Department, Quantitative Genomic Medicine Laboratories (qGenomics), Barcelona, Spain.

    • Silvia Carbonell

  4. Functional Genomics Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    • Carrie Davis
    • Thomas R Gingeras

  5. Wellcome Trust Sanger Institute, Hinxton, UK.

    • Adam Frankish
    • Jennifer Harrow



Contributions

R.J., R.G., J.H., A.F., B.U.-R. and J.L. designed the experiment. S.C. generated cDNA libraries and performed the capture. C.D. and T.R.G. carried out PacBio sequencing of capture libraries. J.L. and B.U.-R. analyzed the data under the supervision of R.G. and R.J. R.J. wrote the manuscript, with contributions from J.L., B.U.-R. and R.G. S.P.-L. and A.A. performed the RT-PCR experiments.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Roderic Guigo or Rory Johnson.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–17, Supplementary Tables 1–13 and Supplementary Note 1

  2. Life Sciences Reporting Summary

Zip files

  1. Supplementary Data Set 1

    Human protein-coding potential analysis on full-length reads. This file lists the protein-coding potential analysis for every full-length read, using the CPAT and PhyloCSF programs.

Excel files

  1. Supplementary Data Set 2

    Novel human ORFs from PhyloCSF

Text files

  1. Supplementary Data Set 3

    Oligonucleotide sequences used in this study (FASTA format). This file provides the SMART cDNA library construction adapters, capture blockers and TruSeq adapter sequences used in the library construction (see Supplementary Figure 2c), as well as the primer pairs used in the RT-PCR validation step.

  2. Supplementary Data Set 4

    RT-PCR sequences of CLS transcript models (FASTA format)