Abstract

De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.

Acknowledgements

We are grateful to D. Jaffe and S. Young for access to additional computing resources, to Z. Chen for help in R-scripting, to L. Gaffney for help with figure illustrations, to C. Titus Brown for essential discussions and inspiration related to digital normalization strategies, to G. Marcais and C. Kingsford for supporting the use of their Jellyfish software in Trinity and to B. Walenz for supporting our earlier use of Meryl. We are grateful to our users for their feedback, in particular J. Wortman and P. Bain for comments on earlier drafts of the manuscript. This project has been funded in part (B.J.H.) with Federal funds from the National Institute of Allergy and Infectious Diseases (NIAID), US National Institutes of Health (NIH), Department of Health and Human Services (DHHS), under contract no. HHSN272200900018C. Work was supported by the Howard Hughes Medical Institute (HHMI), an NIH PIONEER award, a Center for Excellence in Genome Science grant no. 5P50HG006193-02 from the National Human Genome Research Institute (NHGRI) and the Klarman Cell Observatory at the Broad Institute (A.R.). A.P. was supported by the CSIRO Office of the Chief Executive (OCE). M.Y. was supported by the Clore Foundation. P.B. was supported by the National Science Foundation (NSF) grant no. OCI-1053575 for the Extreme Science and Engineering Discovery Environment (XSEDE) project. B.L. and C.D. were partially supported by NIH grant no. 1R01HG005232-01A1. In addition, B.L. was partially funded by J. Thomson's MacArthur Professorship and by the Morgridge Institute for Research support for Computation and Informatics in Biology and Medicine. M.L. was supported by the Bundesministerium für Bildung und Forschung via the project 'NGSgoesHPC'. N.P. was funded by the Fund for Scientific Research, Flanders (Fonds Wetenschappelijk Onderzoek (FWO) Vlaanderen), Belgium. R.H. and R.D.L. were funded by the NSF under grant nos. ABI-1062432 and CNS-0521433 to Indiana University, and by the Indiana METACyt Initiative, which is supported in part by Lilly Endowment, Inc. J.B. was supported through a CSIRO eResearch Accelerated Computing Project. Any opinions, findings and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of any of the funding bodies and institutions including the National Science Foundation, the National Center for Genome Analysis Support and Indiana University.

Author information

Author notes

    • Brian J Haas & Alexie Papanicolaou

    These authors contributed equally to this work.

Affiliations

  1. Broad Institute of Massachusetts Institute of Technology (MIT) and Harvard, Cambridge, Massachusetts, USA.

    • Brian J Haas, Moran Yassour, Nathalie Pochet & Aviv Regev

  2. Commonwealth Scientific and Industrial Research Organisation (CSIRO) Ecosystem Sciences, Black Mountain Laboratories, Canberra, Australian Capital Territory, Australia.

    • Alexie Papanicolaou & Michael Ott

  3. The Selim and Rachel Benin School of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel.

    • Moran Yassour & Nir Friedman

  4. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.

    • Manfred Grabherr

  5. Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Philip D Blood

  6. CSIRO Information Management & Technology, St. Lucia, Queensland, Australia.

    • Joshua Bowden

  7. Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, Oklahoma, USA.

    • Matthew Brian Couger

  8. Genomics Research Centre, Griffith University, Gold Coast Campus, Gold Coast, Queensland, Australia.

    • David Eccles

  9. Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, USA.

    • Bo Li & Colin N Dewey

  10. Center for Information Services and High-performance Computing (ZIH), Technische Universität Dresden, Dresden, Germany.

    • Matthias Lieber

  11. California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, California, USA.

    • Matthew D MacManes

  12. Institute for Genome Sciences, Baltimore, Maryland, USA.

    • Joshua Orvis

  13. Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie (VIB), Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium.

    • Nathalie Pochet

  14. Parco Tecnologico Padano, Località Cascina Codazza, Lodi, Italy.

    • Francesco Strozzi

  15. Corn Insects and Crop Genetics Research Unit, United States Department of Agriculture–Agricultural Research Service, Ames, Iowa, USA.

    • Nathan Weeks

  16. Genomics facility, Purdue University, West Lafayette, Indiana, USA.

    • Rick Westerman

  17. GWT-TUD GmbH, Saxony, Germany.

    • Thomas William

  18. Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, USA.

    • Colin N Dewey

  19. University Information Technology Services, Research Technologies Division, Indiana University, Bloomington, Indiana, USA.

    • Robert Henschel & Richard D LeDuc

  20. Department of Biology, Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Aviv Regev

Contributions

B.J.H. is the current lead developer of Trinity and is additionally responsible for the development of the companion in silico normalization and TransDecoder utilities described herein. M.Y. contributed to Butterfly software enhancements, generating figures and to the manuscript text. B.L. and C.N.D. developed RSEM and are responsible for enhancements related to improved Trinity support. B.J.H. and A.P. wrote the initial draft of the manuscript. A.R. is the Principal Investigator. All authors contributed to Trinity development and/or writing of the final manuscript, and all authors approved the final text.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Brian J Haas or Aviv Regev.

Supplementary information

PDF files

  1. Supplementary Note

    Supplementary materials for de novo transcript sequence reconstruction from RNA-seq: reference generation and analysis with Trinity.

  2. Supplementary Figure 1

    Defining minimum edge thresholds during initial Butterfly graph pruning.

  3. Supplementary Figure 2

    Butterfly's minimum support requirement for path extension during transcript reconstruction.

  4. Supplementary Figure 3

    Merging of insufficiently different path sequences.

  5. Supplementary Figure 4

    Enforcing path restrictions via triplet locking.

  6. Supplementary Figure 5

    Restrictions on the number of paths to be extended at each node.

  7. Supplementary Figure 6

    Evaluating assembly completeness for the S. pombe transcriptome.

  8. Supplementary Figure 7

    Evaluating assembly completeness for the mouse dendritic cell transcriptome.

  9. Supplementary Figure 8

    Correlation of expression values between reference transcripts and Trinity transcript components according to percent length agreement in S. pombe.

  10. Supplementary Figure 9

    Agreement between expression profiles calculated based on reference transcripts and Trinity components across different S. pombe samples.

About this article

DOI

https://doi.org/10.1038/nprot.2013.084
