Quantitative gene profiling of long noncoding RNAs with targeted RNA sequencing

Abstract

We compared quantitative RT-PCR (qRT-PCR), RNA-seq and capture sequencing (CaptureSeq) in terms of their ability to assemble and quantify long noncoding RNAs and novel coding exons across 20 human tissues. CaptureSeq was superior for the detection and quantification of genes with low expression, showed little technical variation and accurately measured differential expression. This approach expands and refines previous annotations and simultaneously generates an expression atlas.

Figure 1: Quantitative comparison of CaptureSeq, RNA-seq and qRT-PCR.
Figure 2: Profiling of human long noncoding RNAs with CaptureSeq.

Accession codes

Primary accessions

Gene Expression Omnibus

References

1. Clark, M.B. et al. PLoS Biol. 9, e1000625 (2011).

2. Kapranov, P., Willingham, A.T. & Gingeras, T.R. Nat. Rev. Genet. 8, 413–423 (2007).

3. Djebali, S. et al. Nature 489, 101–108 (2012).

4. Jiang, L. et al. Genome Res. 21, 1543–1551 (2011).

5. Mercer, T.R. et al. Nat. Protoc. 9, 989–1009 (2014).

6. Mercer, T.R. et al. Nat. Biotechnol. 30, 99–104 (2012).

7. ERCC Consortium. BMC Genomics 6, 150 (2005).

8. Hansen, K.D., Brenner, S.E. & Dudoit, S. Nucleic Acids Res. 38, e131 (2010).

9. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Genome Biol. 12, R22 (2011).

10. Cabili, M.N. et al. Genes Dev. 25, 1915–1927 (2011).

11. Derrien, T. et al. Genome Res. 22, 1775–1789 (2012).

12. Harrow, J. et al. Genome Res. 22, 1760–1774 (2012).

13. Amaral, P.P., Clark, M.B., Gascoigne, D.K., Dinger, M.E. & Mattick, J.S. Nucleic Acids Res. 39, D146–D151 (2011).

14. Wang, L. et al. Nucleic Acids Res. 41, e74 (2013).

15. Finn, R.D. et al. Nucleic Acids Res. 42, D222–D230 (2014).

16. Keren, H., Lev-Maor, G. & Ast, G. Nat. Rev. Genet. 11, 345–355 (2010).

17. Lindblad-Toh, K. et al. Nature 478, 476–482 (2011).

18. FANTOM Consortium. Nature 507, 462–470 (2014).

19. Andersson, R. et al. Nature 507, 455–461 (2014).

20. Roadmap Epigenomics Consortium. Nature 518, 317–330 (2015).

21. Mercer, T.R. et al. Genome Res. 25, 290–303 (2015).

22. Hsu, F. et al. Bioinformatics 22, 1036–1046 (2006).

23. Pruitt, K.D. et al. Nucleic Acids Res. 42, D756–D763 (2014).

24. Ning, Z., Cox, A.J. & Mullikin, J.C. Genome Res. 11, 1725–1729 (2001).

25. Martin, J.A. & Wang, Z. Nat. Rev. Genet. 12, 671–682 (2011).

26. Kim, D. et al. Genome Biol. 14, R36 (2013).

27. Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).

28. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).

29. Dobin, A. et al. Bioinformatics 29, 15–21 (2013).

30. Haas, B.J. et al. Nat. Protoc. 8, 1494–1512 (2013).

31. Trapnell, C. et al. Nat. Protoc. 7, 562–578 (2012).

32. Anders, S., Pyl, P.T. & Huber, W. Bioinformatics 31, 166–169 (2015).

33. Quinlan, A.R. & Hall, I.M. Bioinformatics 26, 841–842 (2010).

34. Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).

35. Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. Genome Res. 14, 1188–1190 (2004).

36. Love, M.I., Huber, W. & Anders, S. Genome Biol. 15, 550 (2014).

37. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. Bioinformatics 26, 139–140 (2010).

38. Blanchette, M. et al. Genome Res. 14, 708–715 (2004).

39. Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R. & Siepel, A. Genome Res. 20, 110–121 (2010).

40. Sherry, S.T. et al. Nucleic Acids Res. 29, 308–311 (2001).

41. Forbes, S.A. et al. Nucleic Acids Res. 39, D945–D950 (2011).

42. Stenson, P.D. et al. Genome Med. 1, 13 (2009).

Acknowledgements

The authors acknowledge the following funding sources: an Australian National Health and Medical Research Council (NHMRC) Australia Fellowship (631668 to J.S.M. and 631542 to M.E.D.), an NHMRC Early Career Fellowship (APP1072662 to M.B.C.), an EMBO Long Term Fellowship (ALTF 864-2013 to M.B.C.), the Queensland State Government (National and International Research Alliance Program to L.K.N.) and an EMBL Interdisciplinary Postdoc (EIPOD) under Marie Curie Actions (COFUND) (to G.B.). The contents of the published material are solely the responsibility of the administering institution, a participating institution or individual authors and do not reflect the views of NHMRC. The authors thank the ENCODE consortium for the provision of data; data were employed in strict accordance with the associated data-release policy. The authors also thank Prof. M. Brown (University of Queensland) for contributions to manuscript preparation.

Author information

Contributions

T.R.M. and M.B.C. conceived the project and designed experiments, with advice from M.E.D. and J.S.M. M.B.C., J.C., M.E.B. and K.R.H. performed RNA-seq and RNA CaptureSeq. T.R.M., M.B.C., T.L. and G.B. performed the bioinformatic analyses. K.-A.L.C. assisted with statistical analyses. W.Y.C. performed PCR validations. T.R.M., M.B.C., M.E.D., L.K.N. and J.S.M. prepared the manuscript. G.P.T., A.J.E., L.K.N., R.J.T., M.E.D. and J.S.M. provided funding.

Corresponding authors

Correspondence to John S Mattick or Marcel E Dinger.

Ethics declarations

Competing interests

T.R.M. is a recipient of a Roche Discovery Agreement (2014). M.B.C. has received research support from Roche/Nimblegen for an unrelated research project.

Integrated supplementary information

Supplementary Figure 1 Advantages of RNA CaptureSeq for profiling genes with alternative splicing or low expression.

(a) Schematic figure indicating limitations of qRT-PCR for quantifying alternative splicing events. (b,c) Dynamic range of K562 cell transcriptome populations demonstrated by transcript (b) or exon (c) expression. Notably, the top 1% of transcripts comprises 38.4% of the total expressed mRNA population. (d) Calculated maximal fold enrichment achieved by CaptureSeq relative to number of genes (combining all known isoforms) targeted (estimated gene expression based on average gene expression in human K562 cell line). Note that higher enrichments can be maintained by removing highly expressed isoforms and gene loci from CaptureSeq targets.
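
The ceiling described in panel (d) follows from simple proportions: if the targeted genes account for a fraction f of all transcripts, a perfectly specific capture can enrich them by at most 1/f. The following is a minimal Python sketch of that arithmetic, assuming per-gene expression values are available as a list. The function name and the toy numbers are illustrative and are not taken from the study.

```python
def max_fold_enrichment(expression, targeted):
    """Upper bound on enrichment if every captured read came from a target.

    expression: per-gene expression values (any consistent unit).
    targeted:   set of indices into `expression` chosen for capture.
    """
    total = sum(expression)
    on_target = sum(expression[i] for i in targeted)
    if on_target <= 0:
        raise ValueError("targeted genes must have non-zero expression")
    # Targets make up on_target/total of the library before capture and
    # at most 100% after capture, so the ceiling is total/on_target.
    return total / on_target


# Toy example (hypothetical values): four equally expressed genes, one targeted.
print(max_fold_enrichment([1.0, 1.0, 1.0, 1.0], {0}))  # 4.0
```

Adding a highly expressed gene to the target set raises `on_target` and lowers the ceiling, which is consistent with the note that removing highly expressed isoforms and loci maintains higher enrichment.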

Supplementary Figure 2 Comparative analysis of ERCC spike-in quantification using RNA sequencing and CaptureSeq.

(a) Fold enrichment achieved by CaptureSeq for each ERCC standard. High and variable enrichment at low ERCC concentrations results from low and sporadic alignment of RNA-seq reads to ERCC standards. Decreasing enrichment at high ERCC concentrations is due to CaptureSeq saturation. Each technical replicate capture hybridization contained three biological replicate samples. (b) Spearman correlation of measured abundance by CaptureSeq of ERCC probes for three biological replicate samples in technical replicate. (c,d) Average Spearman correlation of measured abundance of ERCC probes for three biological replicates of CaptureSeq (c) and RNA-seq (d). (e) Segmented regression analysis indicates inflection point in the measured abundance of ERCC probes by CaptureSeq at an ERCC concentration of 2.34 attomol/μl (dotted line). n = 3 biological replicates; error bars are s.d. (f) RNA sequencing exhibits a linear profile across the range of ERCC concentrations it detects. n = 3 biological replicates; error bars are s.d.
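
The legend does not say how the segmented regression in panel (e) was implemented. As a rough illustration of the idea only, the Python sketch below tries every split of the sorted data into two groups, fits a straight line to each group, and reports where the two best-fitting lines cross. It assumes the inputs are already on a log scale and that the x values are distinct; the function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def segmented_fit(xs, ys):
    """Best two-segment fit; returns (breakpoint, left slope, right slope).

    xs must be sorted ascending with distinct values. The breakpoint is
    taken as the x value where the two fitted lines cross.
    """
    best = None
    for k in range(2, len(xs) - 1):  # at least two points per segment
        m1, b1, e1 = fit_line(xs[:k], ys[:k])
        m2, b2, e2 = fit_line(xs[k:], ys[k:])
        if m1 == m2:
            continue  # parallel lines never cross
        if best is None or e1 + e2 < best[0]:
            best = (e1 + e2, (b2 - b1) / (m1 - m2), m1, m2)
    if best is None:
        raise ValueError("need at least four points with a change in slope")
    return best[1:]


# Toy data (hypothetical): linear response that plateaus at x = 5.
xs = [float(x) for x in range(10)]
ys = [min(x, 5.0) for x in xs]
breakpoint_x, slope_left, slope_right = segmented_fit(xs, ys)
```

A right-hand slope well below the left-hand slope is what saturation would look like in such a fit.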

Supplementary Figure 3 Enrichment in read coverage of ERCC RNA spike-in by CaptureSeq.

(a,b) Averaged read coverage for each ERCC probe from RNA-seq and CaptureSeq. n = 3 biological replicates; error bars are s.d. Horizontal dotted line shows eightfold coverage. RepA and RepB are technical replicate capture hybridizations containing three biological replicate samples. (a) Number of ERCC transcripts required for eightfold coverage. (b) Concentration of ERCC transcripts required for eightfold coverage. Vertical dotted line marks concentrations above which CaptureSeq is saturated. Lowest three concentrations of probes (<0.00114 attomol/μl) have zero coverage in more than 50% of RNA-seq libraries. (c) Fold difference in variability between RNA-seq measurement of ERCC abundance and CaptureSeq technical replicates (n = 3 biological replicates). Horizontal dotted lines are at 1 and −1 (no difference in variability). Values above 1 show RNA-seq is more variable; values below −1 show CaptureSeq is more variable. Vertical dotted line is the ERCC concentration that allows consistent eightfold coverage by RNA-seq. RNA-seq is more variable at low expression levels. (d) Mean difference between CaptureSeq and RNA-seq accuracy in measuring ERCC abundance. RNA-seq provided less accurate expression measurements at low levels but was more accurate at high levels. n = 3 biological replicates; error bars are s.d.
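
The signed scale in panel (c), where both 1 and −1 mean no difference, can be read as a ratio of the larger to the smaller variability with a sign showing which method is noisier. The Python sketch below illustrates that reading. The legend does not name the variability measure, so the coefficient of variation used here is an assumption, as are the function names.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed measure)."""
    return stdev(values) / mean(values)


def signed_fold_difference(var_rnaseq, var_capture):
    """Ratio of the larger to the smaller variability, signed by which is larger.

    Positive: RNA-seq is more variable. Negative: CaptureSeq is more variable.
    The result never falls strictly between -1 and 1; equal variability
    gives 1.0.
    """
    if var_rnaseq <= 0 or var_capture <= 0:
        raise ValueError("variability must be positive")
    if var_rnaseq >= var_capture:
        return var_rnaseq / var_capture
    return -(var_capture / var_rnaseq)
```

For example, `signed_fold_difference(0.4, 0.2)` gives 2.0 (RNA-seq twice as variable), whereas swapping the arguments gives −2.0.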

Supplementary Figure 4 Comparative analysis of technical bias between RNA-seq and CaptureSeq.

(a,b) Relationships among ERCC length (a), GC% (b) and CaptureSeq performance on moderately expressed probes compared to RNA-seq (enrichment residuals shown). Spearman correlation shown; line is nonlinear regression fit. RepA and RepB are technical replicate capture hybridizations containing three biological replicate samples. (c) Combined sequence read coverage across all ERCC standards merged (left) or two representative ERCC controls (middle and right) by RNA-seq (blue) and CaptureSeq (red). Difference between read coverage indicated by gray shaded area. (d) Relative nucleotide enrichment for ERCC sequences that exhibit differential coverage between RNA-seq and CaptureSeq. No specific nucleotide bias is observed in regions exhibiting differential coverage. (e) Sequenced read coverage profile of SMPD2 by RNA-seq (blue) and CaptureSeq (red). Only minor variation is observed between the two profiles.

Supplementary Figure 5 Analysis of differential gene expression between samples with CaptureSeq.

(a) Pearson correlations of measured abundance of ERCC probes for one representative sample versus all others containing the same ERCC mix. Top, ERCC mix 1; bottom, ERCC mix 2. Two multiplexed capture hybridizations were performed containing a mix of ERCC mix 1 and 2 samples. Slightly higher correlations equate to samples present in the same hybridization. (b) Clustering of ERCC read counts following variance stabilizing transformation. ERCC mixes 1 (n = 5) and 2 (n = 4) clearly separate followed by separation by capture hybridization. Samples present in same hybridization shown in red and black, respectively. (c,d) Relationship between ERCC concentration and detected ERCC abundance. Segmental linear regression to determine the ERCC concentration at which saturation occurs (dotted line). Error bars are s.d. Linear slopes from segmental linear regression and the Pearson correlation for non-saturating concentrations are provided. (c) ERCC mix 1 samples; n = 5 biological replicates. Saturation at 1.30 attomol/μl. (d) ERCC mix 2 samples; n = 4 biological replicates. Saturation at 0.976 attomol/μl. (e) Averaged read coverage for each ERCC probe from ERCC mix 1 (n = 5) and ERCC mix 2 (n = 4) pools. Error bars are s.d. Y-axis dotted line shows eightfold coverage. (f) edgeR MA plot of log fold change for each ERCC control between the two mixes against transcript expression in log CPM (counts per million). Differentially expressed (DE) controls colored red; non-DE colored black. Zero fold change between two samples shown by blue line. edgeR performed using TMM normalization.
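
For readers unfamiliar with the axes of the MA plot in panel (f), the Python sketch below computes counts per million and the two plotted quantities for a pair of samples. It is a simplified stand-in and not edgeR: it scales by raw library size only, omits TMM normalization and dispersion estimation, and the `prior` constant and function names are arbitrary illustrative choices.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts by library size (no TMM adjustment)."""
    library_size = sum(counts)
    return [c / library_size * 1e6 for c in counts]


def ma_values(cpm_a, cpm_b, prior=0.5):
    """Return (A, M) pairs: mean log2 CPM and log2 fold change (b over a).

    `prior` is a small constant added before taking logs so that zero
    counts stay finite; its value here is an arbitrary choice.
    """
    points = []
    for a, b in zip(cpm_a, cpm_b):
        log_a, log_b = math.log2(a + prior), math.log2(b + prior)
        points.append(((log_a + log_b) / 2, log_b - log_a))
    return points
```

Controls present at the same concentration in both mixes should fall near M = 0, the position marked by the blue line in the plot.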

Supplementary Figure 6 Comparison of CaptureSeq and RNA-seq for differential gene expression analysis.

(a) Quantification of fold changes in ERCC standard abundances between two distinct samples (ERCC 1, n = 5 biological replicates; and ERCC 2, n = 4 biological replicates) for CaptureSeq and RNA-seq (with a matched number of reads). CaptureSeq records values for all ERCC standards (92); expression values were not obtained for 11 standards with RNA-seq. Slopes from nonlinear regression with a straight-line fit. (b) Variability in fold-change measurements for each ERCC fold-change category between CaptureSeq and matched RNA-seq. For each category RNA-seq showed greater variation. (c,d) edgeR MA plot of log fold change for each ERCC control against transcript expression in log CPM (counts per million). Differentially expressed (DE) controls colored red; non-DE colored black. Zero fold change between two samples shown by blue line. Matched RNA-seq (c), RNA-seq all reads, no downsampling (d). (e) Relationship between ERCC expression level (log CPM) and ability of CaptureSeq and RNA-seq to detect DE, given various levels of expression differences between two groups. Left, CaptureSeq; middle, matched RNA-seq; right, RNA-seq all reads, no downsampling. FDR, false discovery rate. 1% FDR shown by dashed line. FDR values limited to minimum value of 10⁻³⁷.

Supplementary Figure 7 Targeted sequencing of lncRNAs with CaptureSeq.

Frequency distribution of expression for different gene classes according to biotype (a), gene ontology biological function (b) or annotation in disease database (c) in K562 cells. (d) Frequency distribution of probes relative to fraction of length with overlapping alignments from captured genomic DNA. We found greater than onefold coverage across the entirety of 96.5% of probes, thereby validating the ability to capture gDNA. (e) Plot showing measured relative to known abundance of ERCC standards by CaptureSeq (orange) and RNA-seq (dark blue). We have plotted measured abundance before (orange, light blue) and after (red, dark blue) removing duplicate reads. Although removing duplicate reads may reduce the impact of PCR amplification artifacts, it also causes the abundance of ERCC spike-ins to be underestimated, decreasing the quantitative range of CaptureSeq, and is therefore not recommended. (f) Genome browser view showing read alignment profile and assembled transcripts from RNA-seq (upper) and CaptureSeq (lower) across the Titin-antisense lncRNA locus. CaptureSeq read alignment shows higher specificity for exons, with fewer reads derived from nascent transcription present, resulting in more accurate transcript assembly. By contrast, RNA-seq shows a large amount of nascent transcription, resulting in the misassembly of the transcript locus with ‘retained’ introns.

Supplementary Figure 8 Quantitative accuracy of each tissue within expression atlas.

Measured expression (FPKM) of ERCC standards in each human tissue library analyzed. Pearson’s correlation indicates the quantitative accuracy of libraries following capture. Despite enhanced coverage, some ERCC probes (red) remained undetected, indicating that sequencing had not proceeded to saturation.
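
A correlation of this kind can be sketched in a few lines of Python. The example below assumes the comparison is made on log10-transformed values and that undetected standards (zero FPKM) are excluded before the correlation is computed; neither detail is stated in the legend, and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_accuracy(known, measured):
    """Pearson r between log10 known concentration and log10 measured FPKM.

    Standards with zero measured expression are treated as undetected
    and dropped before taking logs.
    """
    pairs = [
        (math.log10(k), math.log10(m))
        for k, m in zip(known, measured)
        if m > 0
    ]
    if len(pairs) < 3:
        raise ValueError("need at least three detected standards")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Dropping undetected standards keeps the logarithm defined, but it also means the correlation describes only the detected range.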

Supplementary Figure 9 Analysis of novel captured exons and isoforms.

(a) Proportion of introns with canonical splice junctions in previous coding and lncRNA exons is similar to new introns identified using CaptureSeq. (b) Sequence motif at 3’ intron end shows similar enrichment for polypyrimidine tract and splice elements in previously annotated introns and new introns identified by CaptureSeq. (c) Example of multiple previous lncRNA annotations that are merged into a single higher-order contiguous lncRNA locus following more complete and accurate assembly with CaptureSeq. (d,e) Frequency distribution of open-reading-frame length and hexamer score indicates distinction between coding and noncoding transcripts analyzed from CaptureSeq assembled transcripts. (f) Box-whisker plot showing that CaptureSeq assembled gene models contained more exons and were more complete than previous annotations (based on GENCODE v19, Cabili et al. (2011), and lncRNAdb) used to design the capture array.
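
The proportion in panel (a) amounts to counting introns whose first and last two nucleotides match the canonical splice signal. The Python sketch below counts only GT...AG introns as canonical and assumes each sequence is given 5’ to 3’ on the transcribed strand; the legend does not state which junction types were counted, and the function name and toy sequences are hypothetical.

```python
def canonical_fraction(intron_seqs):
    """Fraction of introns that start with GT and end with AG."""
    if not intron_seqs:
        raise ValueError("no introns supplied")
    canonical = sum(
        1
        for seq in intron_seqs
        if seq.upper().startswith("GT") and seq.upper().endswith("AG")
    )
    return canonical / len(intron_seqs)


# Toy sequences (hypothetical): two GT...AG introns out of four.
introns = ["GTAAGTCCAG", "GCAAAAAAAG", "ATCCCCCCAC", "gtttttttag"]
print(canonical_fraction(introns))  # 0.5
```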

Supplementary Figure 10 Comparative analysis of captured and annotated coding and noncoding exons.

(a) Cumulative frequency distribution indicating the conservation (according to 100-way MultiZ alignment) of previously annotated (based on GENCODE v19, Cabili et al. (2011), and lncRNAdb) coding and lncRNA exons relative to novel exons identified using CaptureSeq. (b) Conservation at 3’ exon boundary showing 3-nt periodicity characteristic of previously known coding gene exons (red) relative to new coding gene exons (orange) identified by CaptureSeq and (c) similar conservation of splice elements in previous lncRNA annotations relative to new lncRNA exons identified by CaptureSeq. (d) Comparison of SNP, repeat and predicted RNA secondary structure density between previous gene annotations (based on GENCODE v19, Cabili et al. (2011), and lncRNAdb) and new annotations assembled from CaptureSeq experiments.

Supplementary Figure 11 Examples of transcripts assembled following CaptureSeq.

(a) Targeting previous lncRNA annotations (blue) integrates them into a single complex locus. (b) CaptureSeq incorporates additional novel exons into the initial annotation, thereby extending the previous annotation to annotated TSSs. (c) CaptureSeq revises previous lncRNA annotations to identify a 1,021-amino-acid ORF. An assembly gap in GRCh37 (hg19) means the protein N terminus may not be present. A new contig in GRCh38 places MGC50722 6 kb upstream, suggesting that these two loci may form one gene. (d) A lncRNA can be erroneously annotated when only transcript fragments are available, as demonstrated by an example in which a lncRNA locus contains distal coding exons of the downstream NPAS4 gene. Arrows indicate direction of transcription. FANTOM5 TSSs on forward strand (red) and reverse strand (blue).

Supplementary Figure 12 Tissue-specific expression of captured lncRNAs.

(a) Hierarchical clustering of lncRNA loci according to expression. (b) Example of brain-specific lncRNA Evf2 correctly assembled and quantified using CaptureSeq.

Supplementary Figure 13 Examples of captured novel coding exons.

Examples of novel coding exons within GENCODE genes assembled following CaptureSeq. (a) Identification of a novel transcription start site for the GLIS1 gene, well supported by chromatin marks for transcriptional initiation. The new first exon adds 175 amino acids to the 5' end of the protein. (b) Novel internal coding exons in TNXB. Novel exon(s) are conserved and maintain the TNXB reading frame. (c) Targeting novel exons identified solely by evolutionary conservation enables the identification of novel exons that help assemble multiple HMCN2 annotation fragments into a contiguous gene locus. (d) Putative novel coding locus in bidirectional orientation with ZNF593 contains a 186-amino-acid ORF.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13 and Supplementary Results (PDF 2384 kb)

Supplementary Table 1

PCR primers utilized in this study. (XLSX 49 kb)

Supplementary Table 2

Capture transcripts containing putative novel coding exons plus associated putative novel coding exons and GENCODE genes. (XLSX 413 kb)

Supplementary Table 3

Novel coding exon transcript annotations. (XLSX 49 kb)

Supplementary Data 1

Human genome coordinates (hg19) of tiled regions used for lncRNA CaptureSeq experiment. (ZIP 1095 kb)

Supplementary Data 2

Human genome coordinates (hg19) for all captured and assembled noncoding RNAs. (ZIP 1629 kb)

Supplementary Data 3

Human genome coordinates (hg19) for all captured and assembled coding RNAs. (ZIP 497 kb)

Supplementary Data 4

Human genome coordinates (hg19) for all novel coding RNAs that join LncRNA and coding gene loci. (ZIP 127 kb)

Supplementary Data 5

Human genome coordinates (hg19) for all novel noncoding RNAs. (ZIP 1020 kb)

Supplementary Data 6

Transcript Annotation file (.gtf) for all assembled transcripts (comprehensive). (ZIP 4055 kb)

Supplementary Data 7

FPKM values for all assembled transcripts (comprehensive). (ZIP 9178 kb)

Supplementary Data 8

Human genome coordinates (hg19) for all putative novel coding exons that were expressed. Exon annotation as per Lindblad-Toh et al. (2011). (ZIP 14 kb)

Supplementary Data 9

Human genome coordinates (hg19) for all captured and assembled transcripts containing putative novel coding exons. (ZIP 53 kb)

Cite this article

Clark, M., Mercer, T., Bussotti, G. et al. Quantitative gene profiling of long noncoding RNAs with targeted RNA sequencing. Nat Methods 12, 339–342 (2015). https://doi.org/10.1038/nmeth.3321
