Assessment of transcript reconstruction methods for RNA-seq

Journal name: Nature Methods
Volume: 10
Pages: 1177–1184
DOI: 10.1038/nmeth.2714

Abstract

We evaluated 25 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression-level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression-level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.

Figures

    Figure 1: Summary of nucleotide-level performance for the methods evaluated.

    The plots show performance at detecting exonic nucleotides. Sensitivity (blue) indicates the proportion of known exon sequence in each genome covered by assembled transcripts, and precision (orange) indicates the proportion of reported expressed sequence confined to known exons. Some protocol variants considered all expressed transcripts (all) or excluded those of low abundance (high). Programs run with gene annotation are grouped separately. iReckon was run with complete reference annotation (full) and with transcript boundaries only (ends). Transcript reconstruction methods are described in the Supplementary Note.

    Figure 2: Summary of exon-level performance for the methods evaluated.

    The plots show performance at detecting individual exons as the percentage of reference exons with a matching feature in the submission (sensitivity, blue) and the proportion of reported exons that agree with annotation (precision, orange).

    Figure 3: Influence of read depth and intron length on detection performance.

    (a) Sensitivity for detection of annotated exons stratified by read depth. (b) Annotated introns were binned on length, and sensitivity was calculated separately for each bin.

    Figure 4: Intron classification.

    Reported introns were classified by overlap with splice sites annotated in the reference gene sets.

    Figure 5: Transcript assembly performance.

    (a) Reference transcripts with a matching submission entry (transcript-level sensitivity, blue) and reported transcripts that match the reference (transcript-level precision, orange). (b) Transcripts for which various subsets of constituent exons have been reported.

    Figure 6: Examples of transcript calls and expression-level estimates.

    (a) The upper tracks show RNA-seq read coverage (from STAR alignments; see Online Methods) and annotated genes. Exon predictions from the ten methods that quantified transcripts are illustrated below the annotated gene by colored boxes. Exons predicted to belong to the same transcript isoform are connected. Original and median-scaled RPKM values are presented to the right and left, respectively, of the transcript models. For the gene RPF2, all methods reported different isoforms and expression levels. Where multiple overlapping isoforms were identified, that with the higher RPKM was selected for visualization, and spliced isoforms were prioritized over unspliced ones. The noncoding RNA U6 is not expressed. (b) Heat maps illustrate pairwise agreement between reported transcript isoforms for H. sapiens (left), D. melanogaster (center) and C. elegans (right). (c) Correlation between reported RPKM values and NanoString counts (Pearson r of log-transformed values). NanoString counts were compared to the highest RPKM value reported for transcript isoforms consistent with the probe design (correlation rc) or for any isoform from the locus (correlation ra).
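
The comparison described for Figure 6c can be illustrated with a minimal sketch. It assumes a base-2 logarithm with a pseudocount of 1 to handle zero values; neither choice is stated in the caption, and the function names and input layout are hypothetical. For each locus it takes the highest RPKM among isoforms consistent with the probe design (for rc) or among all isoforms (for ra), then computes the Pearson correlation of the log-transformed values.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def best_rpkm(isoforms, probe_consistent_only):
    """Highest RPKM at one locus.

    isoforms: list of (rpkm, is_probe_consistent) tuples (hypothetical layout).
    Returns 0.0 when no isoform qualifies.
    """
    values = [
        rpkm
        for rpkm, consistent in isoforms
        if consistent or not probe_consistent_only
    ]
    return max(values, default=0.0)


def log_correlation(rpkms, counts, pseudocount=1.0):
    """Pearson r of log2-transformed values (pseudocount is an assumption)."""
    log_rpkm = [math.log2(v + pseudocount) for v in rpkms]
    log_count = [math.log2(c + pseudocount) for c in counts]
    return pearson_r(log_rpkm, log_count)


# Toy data: three loci, each with reported isoforms and one probe count.
loci = [
    [(1.0, True), (9.0, False)],
    [(3.0, True)],
    [(7.0, True), (2.0, True)],
]
probe_counts = [3.0, 15.0, 63.0]

r_c = log_correlation([best_rpkm(l, True) for l in loci], probe_counts)
r_a = log_correlation([best_rpkm(l, False) for l in loci], probe_counts)
```

In this toy input the probe-consistent values track the counts exactly in log space, whereas allowing any isoform at the first locus changes the selected RPKM and lowers the correlation.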

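The sensitivity and precision measures defined in the captions of Figures 1 and 2 can be sketched as follows. This is an illustration only, not the evaluation code used in the study: it assumes half-open (start, end) intervals on a single chromosome and strand, and it assumes that an exon "matches" only when both coordinates are identical. All function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(merged):
    """Number of bases covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in merged)


def overlap_length(a, b):
    """Bases shared by two sorted, non-overlapping interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared += end - start
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def nucleotide_metrics(reference_exons, predicted_exons):
    """Sensitivity: share of reference exon bases covered by predictions.
    Precision: share of predicted bases that fall inside reference exons."""
    ref = merge(reference_exons)
    pred = merge(predicted_exons)
    shared = overlap_length(ref, pred)
    sensitivity = shared / total_length(ref) if ref else 0.0
    precision = shared / total_length(pred) if pred else 0.0
    return sensitivity, precision


def exon_metrics(reference_exons, predicted_exons):
    """Exon-level metrics, assuming exact coordinate identity as the match rule."""
    ref = set(reference_exons)
    pred = set(predicted_exons)
    matched = len(ref & pred)
    sensitivity = matched / len(ref) if ref else 0.0
    precision = matched / len(pred) if pred else 0.0
    return sensitivity, precision


reference = [(100, 200), (300, 400)]
nt_sens, nt_prec = nucleotide_metrics(reference, [(150, 250), (300, 350)])
ex_sens, ex_prec = exon_metrics(reference, [(100, 200), (300, 350), (500, 600)])
```

The second call shows why exon-level scores can fall below nucleotide-level scores: under the exact-match assumption, a prediction that covers part of an exon but misses one boundary still contributes covered bases, yet counts as a missed exon.
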
Accession codes

Referenced accessions

ArrayExpress

Author information

  1. These authors contributed equally to this work.

    • Josep F Abril,
    • Pär G Engström &
    • Felix Kokocinski

Affiliations

  1. European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK.

    • Tamara Steijger,
    • Pär G Engström,
    • Paul Bertone &
    • Daniel Zerbino
  2. Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain.

    • Josep F Abril
  3. Wellcome Trust Sanger Institute, Cambridge, UK.

    • Felix Kokocinski,
    • Jennifer Harrow,
    • Tim J Hubbard,
    • Steven M J Searle &
    • Simon White
  4. Centre for Genomic Regulation, Barcelona, Spain.

    • Thomas Derrien,
    • David Gonzalez,
    • Roderic Guigó,
    • Julien Lagarde &
    • Michael Sammeth
  5. Universitat Pompeu Fabra, Barcelona, Spain.

    • Sarah Djebali &
    • Roderic Guigó
  6. Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.

    • Paul Bertone
  7. Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.

    • Paul Bertone
  8. Wellcome Trust–Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK.

    • Paul Bertone
  9. Present address: Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden.

    • Pär G Engström
  10. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    • Martin Akerman,
    • Thomas R Gingeras,
    • Jie Wu &
    • Michael Q Zhang
  11. Centre Nacional d'Analisi Genomica, Barcelona, Spain.

    • Tyler Alioto &
    • Paolo Ribeca
  12. Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.

    • Giovanna Ambrosini,
    • Philipp Bucher,
    • Gregory Lefebvre &
    • Jacques Rougemont
  13. Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland.

    • Giovanna Ambrosini,
    • Philipp Bucher,
    • Christian Iseli,
    • Gregory Lefebvre,
    • Jacques Rougemont,
    • Naryttza Diaz Solorzano,
    • Brian J Stevenson,
    • Heinz Stockinger &
    • Armand Valsesia
  14. Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland.

    • Stylianos E Antonarakis
  15. Computational Biology Center, Sloan-Kettering Institute, New York, New York, USA.

    • Jonas Behr,
    • André Kahles &
    • Gunnar Rätsch
  16. Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany.

    • Jonas Behr,
    • Regina Bohnert,
    • Géraldine Jean,
    • André Kahles,
    • Peter Niermann,
    • Gunnar Rätsch &
    • Georg Zeller
  17. Queensland Centre for Medical Genomics, The University of Queensland, St. Lucia, Australia.

    • Nicole Cloonan &
    • Sean M Grimmond
  18. Department of Computer Science, Yale University, New Haven, Connecticut, USA.

    • Jiang Du &
    • Mark Gerstein
  19. Division of Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, California, USA.

    • Sandrine Dudoit
  20. Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA.

    • Mark Gerstein,
    • Joel Rozowsky &
    • Andrea Sboner
  21. Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.

    • Mark Gerstein,
    • Lukas Habegger &
    • Jing Leng
  22. Ludwig Institute for Cancer Research, Lausanne, Switzerland.

    • Christian Iseli,
    • Naryttza Diaz Solorzano,
    • Brian J Stevenson &
    • Armand Valsesia
  23. Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA.

    • Suzanna Lewis
  24. Department of Developmental and Cell Biology, University of California, Irvine, Irvine, California, USA.

    • Ali Mortazavi
  25. Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.

    • Alexandre Reymond
  26. Max Planck Institute for Molecular Genetics, Berlin, Germany.

    • Hugues Richard &
    • Marcel H Schulz
  27. Department of Computer Science, Royal Holloway, University of London, London, UK.

    • Victor Solovyev
  28. Institute for Microbiology and Genetics, Göttingen, Germany.

    • Mario Stanke
  29. Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.

    • David Weese
  30. Biology Division, California Institute of Technology, Pasadena, California, USA.

    • Barbara J Wold
  31. Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York, USA.

    • Jie Wu
  32. Department of Bioinformatics and Computational Biology, Genentech, San Francisco, California, USA.

    • Thomas D Wu

Consortia

  1. The RGASP Consortium

    • Josep F Abril,
    • Martin Akerman,
    • Tyler Alioto,
    • Giovanna Ambrosini,
    • Stylianos E Antonarakis,
    • Jonas Behr,
    • Paul Bertone,
    • Regina Bohnert,
    • Philipp Bucher,
    • Nicole Cloonan,
    • Thomas Derrien,
    • Sarah Djebali,
    • Jiang Du,
    • Sandrine Dudoit,
    • Pär G Engström,
    • Mark Gerstein,
    • Thomas R Gingeras,
    • David Gonzalez,
    • Sean M Grimmond,
    • Roderic Guigó,
    • Lukas Habegger,
    • Jennifer Harrow,
    • Tim J Hubbard,
    • Christian Iseli,
    • Géraldine Jean,
    • André Kahles,
    • Felix Kokocinski,
    • Julien Lagarde,
    • Jing Leng,
    • Gregory Lefebvre,
    • Suzanna Lewis,
    • Ali Mortazavi,
    • Peter Niermann,
    • Gunnar Rätsch,
    • Alexandre Reymond,
    • Paolo Ribeca,
    • Hugues Richard,
    • Jacques Rougemont,
    • Joel Rozowsky,
    • Michael Sammeth,
    • Andrea Sboner,
    • Marcel H Schulz,
    • Steven M J Searle,
    • Naryttza Diaz Solorzano,
    • Victor Solovyev,
    • Mario Stanke,
    • Tamara Steijger,
    • Brian J Stevenson,
    • Heinz Stockinger,
    • Armand Valsesia,
    • David Weese,
    • Simon White,
    • Barbara J Wold,
    • Jie Wu,
    • Thomas D Wu,
    • Georg Zeller,
    • Daniel Zerbino &
    • Michael Q Zhang

Contributions

J.H., R.G. and T.J.H. conceived of and organized the study. Consortium members provided transcript models for evaluation. J.H. and P.B. coordinated the analysis, which was carried out by T.S., J.F.A., P.G.E. and F.K. T.S., P.B. and P.G.E. wrote the manuscript with input from the other authors.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (10,004 KB)

    Supplementary Figures 1–31, Supplementary Tables 1–10 and Supplementary Note