High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing

Abstract

Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete—many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.

Figure 1: Using the CLS approach to extend GENCODE lncRNA annotation.
Figure 2: CLS yields an enriched, long-read transcriptome.
Figure 3: Extending known lncRNA gene structures.
Figure 4: Full-length transcript annotation.
Figure 5: Properties of full-length lncRNA transcripts.
Figure 6: Protein-coding potential of full-length lncRNAs.

Accession codes

Primary accessions

Gene Expression Omnibus

Acknowledgements

We thank members of the Guigó laboratory for their valuable input and help with sample handling, data analysis and writing of the manuscript, including E. Palumbo, F. Reverter, A. Breschi, D. Pervouchine, C. Arnan and F. Camara. We thank L. Armengol (qGenomics) for advice on RNA capture, D. Garrido (CRG) for help with eQTL analysis, S. Bonnin (CRG) for help with data manipulation in R, and I. Jungreis (MIT) for advice on PhyloCSF. J. Wright and J. Choudhary (Sanger Institute) helped with the search for peptide hits to putative coding regions. S. Djebali (INRA, France) kindly made available the Compmerge utility. This work and its publication were supported by the National Human Genome Research Institute of the US National Institutes of Health (grants U41HG007234, U41HG007000 and U54HG007004) and the Wellcome Trust (grant WT098051 to R.G.). R.J. was supported by the Ramón y Cajal Subprogram of the Spanish Ministry of Economy and Competitiveness (grant RYC-2011-08851). Work in the laboratory of R.G. was supported by the National Human Genome Research Institute (awards U54HG0070, R01MH101814 and U41HG007234). This research was partly supported by NCCR RNA & Disease, funded by the Swiss National Science Foundation (to R.J.). We thank R. Garrido (CRG) for administrative support. We acknowledge support from the Spanish Ministry of Economy and Competitiveness, Centro de Excelencia Severo Ochoa 2013–2017 (SEV-2012-0208), and from the CERCA Programme, Generalitat de Catalunya.

Author information

Contributions

R.J., R.G., J.H., A.F., B.U.-R. and J.L. designed the experiment. S.C. generated cDNA libraries and performed the capture. C.D. and T.R.G. carried out PacBio sequencing of capture libraries. J.L. and B.U.-R. analyzed the data under the supervision of R.G. and R.J. R.J. wrote the manuscript, with contributions from J.L., B.U.-R. and R.G. S.P.-L. and A.A. performed the RT-PCR experiments.

Corresponding authors

Correspondence to Roderic Guigó or Rory Johnson.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–17, Supplementary Tables 1–13 and Supplementary Note 1 (PDF 22645 kb)

Life Sciences Reporting Summary (PDF 129 kb)

Supplementary Data Set 1

Human protein-coding potential analysis on full-length reads. (ZIP 9063 kb)

This file lists the protein-coding potential analysis for every full-length read, using the CPAT and PhyloCSF programs.

Supplementary Data Set 2

Novel human ORFs from PhyloCSF (XLSX 11 kb)

Supplementary Data Set 3

Oligonucleotide sequences used in this study (FASTA format) (TXT 1 kb)

This file provides the SMART cDNA library construction adapters, capture blockers and TruSeq adapter sequences used in library construction (see Supplementary Figure 2c), as well as the primer pairs used in the RT-PCR validation step.

Supplementary Data Set 4

RT-PCR sequences of CLS transcript models (FASTA format) (TXT 2 kb)

Cite this article

Lagarde, J., Uszczynska-Ratajczak, B., Carbonell, S. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat Genet 49, 1731–1740 (2017). https://doi.org/10.1038/ng.3988
