De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

Journal name: Nature Protocols
Volume: 8
Pages: 1494–1512
DOI: doi:10.1038/nprot.2013.084

Abstract

De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.

Figures

  1. Overview of Trinity assembly and analysis pipeline.

    Key sequential stages in Trinity (left) and the associated computational resources (right). Trinity takes as input short reads (top left) and first uses the Inchworm module to construct contigs. This stage requires a single high-memory server (∼1 GB of RAM per 1 million paired reads, but varies according to read complexity; top right). Chrysalis (middle left) clusters related Inchworm contigs, often generating tens to hundreds of thousands of Inchworm contig clusters, each of which is processed to a de Bruijn graph component independently and in parallel on a computing grid (bottom right). Butterfly (bottom left) then extracts all probable sequences from each graph component, which can be parallelized as well.
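The memory rule of thumb in the caption above can be turned into a quick planning calculation. The following is a minimal sketch only; the function name and the adjustable ratio are illustrative assumptions, and real requirements vary with read complexity, as the caption notes.

```python
def estimate_inchworm_ram_gb(paired_reads: int, gb_per_million: float = 1.0) -> float:
    """Rough RAM estimate (in GB) for the Inchworm stage.

    Uses the caption's rule of thumb of about 1 GB of RAM per 1 million
    paired reads. The ratio is exposed as a parameter because the true
    requirement depends on read complexity.
    """
    if paired_reads < 0:
        raise ValueError("paired_reads must be non-negative")
    return paired_reads / 1_000_000 * gb_per_million


if __name__ == "__main__":
    # Data set sizes taken from the Figure 2 caption (10 and 100 million pairs).
    for n in (10_000_000, 100_000_000):
        print(n, "paired reads ->", estimate_inchworm_ram_gb(n), "GB (approx.)")
```

An estimate of this kind is only a starting point for choosing a server; it says nothing about the later Chrysalis and Butterfly stages.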

  2. Effects of in silico fragment normalization of RNA-seq data on Trinity full-length transcript reconstruction.

    (a,b) The y axis shows the number of full-length transcripts reconstructed from a data set of paired-end strand-specific RNA-seq in S. pombe (10 million paired-end reads) (a) and mouse (100 million paired-end reads) (b), using either the full data set (total; 100%) or different samplings (x axis) obtained by either Trinity's in silico normalization procedure (at 5× up to 100× targeted maximum k-mer (k = 25) coverage; blue bars) or random downsampling to the same number of reads (red bars).
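To make the idea of coverage-targeted normalization concrete, here is a minimal sketch. It is not Trinity's implementation: the retention rule (keep a read with probability target/median k-mer coverage) is an assumption made for illustration, read pairing and reverse-complement k-mers are ignored, and the function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median

K = 25  # k-mer length quoted in the caption


def kmers(seq, k=K):
    """All overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_reads(reads, max_cov, k=K, seed=0):
    """Down-sample reads so highly covered regions approach max_cov.

    Assumed rule: a read with median k-mer coverage C is kept with
    probability min(1, max_cov / C). Reads from rare transcripts are
    therefore almost always kept, whereas reads from abundant
    transcripts are thinned.
    """
    rng = random.Random(seed)
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read too short to assess
        coverage = median(counts[km] for km in read_kmers)
        if rng.random() < min(1.0, max_cov / coverage):
            kept.append(read)
    return kept
```

Random downsampling, by contrast, discards reads uniformly regardless of coverage, so it removes reads from weakly expressed transcripts at the same rate as reads from abundant ones.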

  3. Transcriptome and genome representations of alternatively spliced transcripts.

    (a–c) An example of the graphical representation generated by Trinity's Butterfly software (a) along with the corresponding reconstructed transcripts (b) and their exonic structure based on alignment to the mouse genome (c). Each node in a is associated with a sequence, and directed edges connect consecutive sequences from 5′ to 3′ in the same transcript. Bulges and bifurcations indicate sequence differences between alternative reconstructed transcripts, including alternatively spliced cassette exons; only a single bulge is shown in this transcript graph, yielding the red node. Edges are annotated by the number of RNA-seq fragments supporting the transcript from the 5′ sequence to the 3′ one. In this example, there are two supported paths: one from the blue to the green node (supported by 32 fragments) yielding 'isoform A' (b, top), and the other from the blue to the red to the green node, supported by at most five fragments, yielding 'isoform B' (b, bottom). The red node is a result of an alternatively skipped exon, as apparent in the gene structure (c, red bar, shown in 'isoform B'). Navigable transcript graphs are optionally generated by Butterfly, provided in 'dot' format, and can be visualized using graphviz (http://graphviz.org). These details are provided on the Trinity website (http://trinityrnaseq.sourceforge.net/advanced_trinity_guide.html).
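The three-node example in this caption can be reproduced with a small path enumeration. This is a simplified sketch, not Butterfly's algorithm: the node sequences are placeholders, the per-edge weights on the red path (5 and 5) are an assumption consistent with "at most five fragments", and path support is taken here to be the weakest edge on the path.

```python
def enumerate_paths(edges, source, sink, min_support=1):
    """Enumerate source-to-sink paths in an acyclic transcript graph.

    edges maps a node to a list of (next_node, fragment_support) pairs.
    Returns {path_tuple: support}, where support is the minimum edge
    weight along the path (an illustrative simplification).
    """
    paths = {}
    stack = [(source, (source,), float("inf"))]
    while stack:
        node, path, support = stack.pop()
        if node == sink:
            paths[path] = support
            continue
        for nxt, weight in edges.get(node, []):
            if weight >= min_support:
                stack.append((nxt, path + (nxt,), min(support, weight)))
    return paths


# Graph from the caption: blue -> green (32 fragments), blue -> red -> green (5).
EDGES = {"blue": [("green", 32), ("red", 5)], "red": [("green", 5)]}
# Placeholder node sequences, for illustration only.
NODE_SEQ = {"blue": "ACGT", "red": "GG", "green": "TTAA"}

if __name__ == "__main__":
    for path, support in enumerate_paths(EDGES, "blue", "green").items():
        sequence = "".join(NODE_SEQ[n] for n in path)
        print(" -> ".join(path), "support:", support, "sequence:", sequence)
```

Raising `min_support` above 5 in this toy graph leaves only the blue-to-green path, which mirrors how weakly supported branches can be pruned.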

  4. Strand-specific library types.

    The left (/1) and right (/2) sequencing reads are depicted according to their orientations relative to the sense strand of a transcript sequence. The strand-specific library type (F, R, FR or RF) depends on the library construction protocol and is user-specified to Trinity via the '--SS_lib_type' parameter.
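The four library types can be summarized as a rule for which reads must be reverse-complemented to represent the sense strand. The sketch below assumes the usual reading of these codes (F and FR: the single or left read is in the sense orientation; R and RF: it is antisense); the helper names are hypothetical and this is not code from Trinity.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_to_sense(lib_type, left, right=None):
    """Return (left, right) re-oriented to the transcript's sense strand.

    lib_type is one of 'F', 'R' (single reads, passed as `left`) or
    'FR', 'RF' (paired reads). For single-read libraries the second
    element of the result is None.
    """
    if lib_type not in ("F", "R", "FR", "RF"):
        raise ValueError(f"unknown library type: {lib_type}")
    if lib_type in ("F", "R"):
        if right is not None:
            raise ValueError("single-read library types take one read")
        return (left if lib_type == "F" else revcomp(left), None)
    if right is None:
        raise ValueError("paired library types need both reads")
    if lib_type == "FR":
        return (left, revcomp(right))
    return (revcomp(left), right)
```

For example, `orient_to_sense("RF", left, right)` reverse-complements only the left read, because in that layout the right read already matches the sense strand.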

  5. Full-length transcript reconstruction by Trinity in different organisms, sequencing depths and parameters.

    The y axis on the left-hand side shows the number of fully reconstructed transcripts for Trinity assemblies of RNA-seq data derived from fission yeast (S. pombe; refs. 8,41), Drosophila melanogaster (ref. 11) and mouse (ref. 8) with different combinations of parameters: DS, double-stranded mode; SS, strand-specific mode; +J, using the '--jaccard_clip' parameter to split falsely fused transcripts. Both SS and DS results are provided for S. pombe and mouse, but only DS results are provided for Drosophila, as its RNA-seq data were not strand specific. Blue shows full-length transcripts; red shows full-length merged transcripts (i.e., transcripts erroneously fused (multicistronic) with another (typically neighboring) transcript). The black asterisks (values shown in the y axis on the right-hand side of the graph) indicate the run times in each case with a contemporary high-memory (256–512 GB of RAM) server using a maximum of four threads ('--CPU 4', see Step 6 of the PROCEDURE).

  6. Evaluating paired-read support via the Jaccard similarity coefficient.

    Read pair support is computed by first counting the number of RNA-seq fragments (bounds of paired reads) that span each of two outer points of a specified window length (default: 100 bases), and then computing the Jaccard similarity coefficient (intersection/union) comparing the fragments that overlap either point. An example is shown for a neighboring pair of S. pombe transcripts (SPAC23C4.14 and SPAC23C4.15, bottom) with substantial overlapping read coverage (gray track), resulting in a contiguous (fused) transcript assembled by Inchworm. However, the Jaccard similarity coefficient (blue track) calculated from the paired reads (gray dumbbells) clearly identifies the position of reduced pair support. Examples of strong (upper left) and weak (upper right) pair support are depicted at the top. When using the '--jaccard_clip' parameter, the Inchworm contig is dissected into two separate full-length transcripts, which are then further processed by Chrysalis and Butterfly as part of the Trinity pipeline.
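The pair-support calculation described in this caption can be sketched in a few lines. The coordinate convention (inclusive fragment bounds) and the function name are assumptions for illustration; the default window of 100 bases comes from the caption.

```python
def jaccard_pair_support(fragments, left_pos, window=100):
    """Jaccard similarity of the fragments spanning two points of a window.

    fragments is a list of (start, end) fragment bounds, i.e. the outer
    coordinates of each read pair, inclusive. The two points compared
    are left_pos and left_pos + window.
    """
    right_pos = left_pos + window
    spans_left = {i for i, (s, e) in enumerate(fragments) if s <= left_pos <= e}
    spans_right = {i for i, (s, e) in enumerate(fragments) if s <= right_pos <= e}
    union = spans_left | spans_right
    if not union:
        return 0.0  # no fragment covers either point
    return len(spans_left & spans_right) / len(union)


if __name__ == "__main__":
    # Toy fragments: two span both points, one spans only each side.
    toy = [(0, 300), (50, 250), (0, 120), (180, 400)]
    print(jaccard_pair_support(toy, left_pos=100))
```

A value near 1 means most fragments bridge the window (strong pair support); a sharp dip along a contig marks a candidate position for splitting a fused transcript.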

  7. De novo transcriptome assembly and analysis workflow.

    Reads from multiple samples (e.g., different tissues, top) are combined into a single data set. Reads may be normalized to reduce read counts while retaining read diversity and sample complexity. The combined read set is assembled by Trinity to generate a 'reference' de novo transcriptome assembly (right). Protein-coding regions can be extracted from the reference assembly using TransDecoder and further characterized according to likely functions based on sequence homology or domain content. Separately, sample-specific expression analysis is performed by aligning the original sample reads to the reference transcriptome assembly on a per sample basis, followed by abundance estimation using RSEM. Differentially expressed transcripts are identified by applying the Bioconductor software, such as edgeR, to a matrix containing the RSEM abundance estimates (number of RNA-seq fragments mapped to each transcript from each sample). Differentially expressed transcripts can then be further grouped according to their expression patterns.
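The matrix that feeds the differential expression step can be pictured as a simple merge of per-sample estimates. The sketch below assumes the per-sample abundance estimates have already been parsed into dictionaries; the transcript identifiers and counts are made up, and the function is not one of the Trinity utilities.

```python
def build_count_matrix(per_sample):
    """Merge per-sample {transcript: fragment_count} dicts into a matrix.

    Returns (transcripts, samples, matrix) with one row per transcript
    and one column per sample; transcripts absent from a sample get 0.
    """
    samples = sorted(per_sample)
    transcripts = sorted({t for est in per_sample.values() for t in est})
    matrix = [[per_sample[s].get(t, 0.0) for s in samples] for t in transcripts]
    return transcripts, samples, matrix


if __name__ == "__main__":
    # Hypothetical estimates for two samples.
    estimates = {
        "Log": {"tx1": 120.0, "tx2": 3.5},
        "Plat": {"tx1": 40.0, "tx3": 17.0},
    }
    transcripts, samples, matrix = build_count_matrix(estimates)
    print("\t".join([""] + samples))
    for name, row in zip(transcripts, matrix):
        print("\t".join([name] + [str(v) for v in row]))
```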

  8. Abundance estimation via expectation maximization by RSEM.

    An illustrative example of abundance estimation for two transcripts with shared (blue) and unique (red, yellow) sequences. To estimate transcript abundances, RNA-seq reads (short bars) are first aligned to the transcript sequences (long bars, bottom). Unique regions of isoforms will capture uniquely mapping RNA-seq reads (red and yellow short bars), and shared sequences between isoforms will capture multiply-mapping reads (blue short bars). An expectation maximization algorithm, implemented in the RSEM software, estimates the most likely relative abundances of the transcripts and then fractionally assigns reads to the isoforms based on these abundances. The assignments of reads to isoforms resulting from iterations of expectation maximization are illustrated as filled short bars (right), and eliminated assignments are shown as hollow bars. Note that assignments of multiply-mapped reads are in fact performed fractionally according to a maximum likelihood estimate. Thus, in this example, a higher fraction of each read is assigned to the more highly expressed top isoform than to the bottom isoform.
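The fractional assignment described in this caption can be illustrated with a stripped-down expectation maximization loop. This is a teaching sketch rather than RSEM's model: it ignores transcript lengths, alignment qualities and fragment-length distributions, and simply assumes each read comes with the set of transcripts it aligns to.

```python
def em_abundance(read_compat, n_transcripts, n_iter=100):
    """Estimate the fraction of reads originating from each transcript.

    read_compat holds, for every read, a tuple of the transcript indices
    it aligns to. Starting from equal abundances, each iteration
    (E-step) splits every read among its compatible transcripts in
    proportion to the current abundances and then (M-step) re-estimates
    the abundances from those fractional counts.
    """
    if not read_compat:
        raise ValueError("at least one read is required")
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        theta = [c / len(read_compat) for c in counts]
    return theta


if __name__ == "__main__":
    # Two isoforms: 6 reads unique to isoform 0, 2 unique to isoform 1,
    # and 4 reads mapping to the sequence they share.
    reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
    print(em_abundance(reads, n_transcripts=2))
```

In this toy case the shared reads end up split roughly 3:1 in favor of the isoform with more unique reads, which is the behavior the caption describes for the more highly expressed isoform.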

  9. Pairwise comparisons of transcript abundance.

    Two visualizations comparing transcript expression profiles between the logarithmic growth and plateau growth samples from S. pombe to identify differentially expressed transcripts. (a) MA plot for differential expression analysis generated by edgeR: for each gene, the log2(fold change) (log2(plateau_phase/logarithmic_growth)) between the two samples (M, y axis) is plotted against the gene's log2(average expression) in the two samples (A, x axis). (b) Volcano plot reporting the false discovery rate (−log10(FDR), y axis) as a function of the log2(fold change) between the samples (logFC, x axis). Transcripts identified as significantly differentially expressed at an FDR of at most 0.1% are colored in red.
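The quantities plotted in both panels can be computed per transcript as follows. This is a minimal sketch under stated assumptions: the pseudocount, the way the average is taken and the function name are illustrative choices, and edgeR's own fold-change and FDR calculations are more involved than this.

```python
import math


def plot_coordinates(expr_log, expr_plat, fdr, pseudocount=1.0, fdr_cutoff=0.001):
    """Per-transcript values for the two plots described in the caption.

    expr_log and expr_plat are normalized expression values in the
    logarithmic growth and plateau samples; fdr is the transcript's
    false discovery rate from a separate statistical test.
    """
    log2_fold_change = math.log2((expr_plat + pseudocount) / (expr_log + pseudocount))
    log2_average = math.log2((expr_log + expr_plat) / 2 + pseudocount)
    neg_log10_fdr = -math.log10(max(fdr, 1e-300))  # guard against fdr == 0
    return {
        "log2_fold_change": log2_fold_change,
        "log2_average": log2_average,
        "neg_log10_fdr": neg_log10_fdr,
        "significant": fdr <= fdr_cutoff,  # 0.1% FDR, as in the caption
    }


if __name__ == "__main__":
    print(plot_coordinates(expr_log=3.0, expr_plat=15.0, fdr=0.001))
```

The first panel plots the fold change against the average expression, and the second plots the transformed FDR against the fold change; the `significant` flag corresponds to the points colored in red.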

  10. Comparisons of transcriptional profiles across samples.

    (a) Hierarchical clustering of transcripts and samples. The heat map shows the relative expression level of each transcript (rows) in each sample (columns). Rows and columns are hierarchically clustered. Expression values (FPKM) are log2-transformed and then median-centered by transcript. (b) Heat map showing the hierarchically clustered Spearman correlation matrix resulting from comparing the transcript expression values (TMM-normalized FPKM) for each pair of samples. (c) Transcript clusters extracted from the hierarchical clustering with R. x axis: samples; y axis: median-centered log2(FPKM). Gray lines, individual transcripts; blue line, average expression values per cluster. The number of transcripts in each cluster is shown in the left corner of each plot. DS, diauxic shift; HS, heat shock; Log, mid-log growth; Plat, plateau growth.
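The transformation applied before clustering (log2 followed by per-transcript median centering) is simple enough to sketch directly. The pseudocount of 1 is an assumption added to avoid taking the logarithm of zero, and the transcript identifier and FPKM values in the example are made up.

```python
import math
from statistics import median


def median_center_log2(fpkm_rows, pseudocount=1.0):
    """Log2-transform FPKM values and median-center each transcript.

    fpkm_rows maps a transcript name to its FPKM values across samples
    (same sample order for every transcript). After centering, each
    value is the transcript's log2 expression relative to its own
    median across samples.
    """
    centered = {}
    for name, values in fpkm_rows.items():
        logged = [math.log2(v + pseudocount) for v in values]
        row_median = median(logged)
        centered[name] = [x - row_median for x in logged]
    return centered


if __name__ == "__main__":
    # Columns follow the sample abbreviations in the caption: DS, HS, Log, Plat.
    example = {"tx1": [0.0, 1.0, 3.0, 7.0]}
    print(median_center_log2(example))
```

Centering by transcript removes differences in absolute expression level, so that clustering groups transcripts by the shape of their profile across samples.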

References

  1. Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).
  2. Haas, B.J. & Zody, M.C. Advancing RNA-seq analysis. Nat. Biotechnol. 28, 421–423 (2010).
  3. Martin, J.A. & Wang, Z. Next-generation transcriptome assembly. Nat. Rev. Genet. 12, 671–682 (2011).
  4. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
  5. Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
  6. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–912 (2010).
  7. Schulz, M.H., Zerbino, D.R., Vingron, M. & Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28, 1086–1092 (2012).
  8. Grabherr, M.G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
  9. Duan, J., Xia, C., Zhao, G., Jia, J. & Kong, X. Optimizing de novo common wheat transcriptome assembly using short-read RNA-seq data. BMC Genomics 13, 392 (2012).
  10. Xu, D.L. et al. De novo assembly and characterization of the root transcriptome of Aegilops variabilis during an interaction with the cereal cyst nematode. BMC Genomics 13, 133 (2012).
  11. Zhao, Q.Y. et al. Optimizing de novo transcriptome assembly from short-read RNA-seq data: a comparative study. BMC Bioinformatics 12 (suppl. 14), S2 (2011).
  12. Henschel, R. et al. Trinity RNA-seq assembler performance optimization. XSEDE '12 Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: bridging from the eXtreme to the campus and beyond (Chicago, Illinois, USA, July 16–20, 2012) http://dx.doi.org/10.1145/2335755.2335842 (2012).
  13. Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
  14. Li, B. & Dewey, C.N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
  15. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
  16. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
  17. Bullard, J.H., Purdom, E., Hansen, K.D. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 11, 94 (2010).
  18. Fang, Z. & Cui, X. Design and validation issues in RNA-seq experiments. Brief. Bioinform. 12, 280–287 (2011).
  19. Auer, P.L. & Doerge, R.W. Statistical design and analysis of RNA sequencing data. Genetics 185, 405–416 (2010).
  20. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods 5, 621–628 (2008).
  21. Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
  22. Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71–73 (2013).
  23. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
  24. Robinson, M.D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
  25. Dillies, M.A. et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. http://dx.doi.org/10.1093/bib/bbs046 (17 September 2012).
  26. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
  27. Robinson, J.T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
  28. Abeel, T., Van Parys, T., Saeys, Y., Galagan, J. & Van de Peer, Y. GenomeView: a next-generation genome browser. Nucleic Acids Res. 40, e12 (2012).
  29. Liu, L. et al. Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012, 251364 (2012).
  30. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
  31. Rothberg, J.M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352 (2011).
  32. Van Belleghem, S.M., Roelofs, D., Van Houdt, J. & Hendrickx, F. De novo transcriptome assembly and SNP discovery in the wing polymorphic salt marsh beetle Pogonus chalceus (Coleoptera, Carabidae). PLoS ONE 7, e42605 (2012).
  33. Kleinman, C.L. & Majewski, J. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome”. Science 335, 1302 (2012).
  34. Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  35. Pounds, S.B., Gao, C.L. & Zhang, H. Empirical Bayesian selection of hypothesis testing procedures for analysis of sequence count expression data. Stat. Appl. Genet. Mol. Biol. http://dx.doi.org/10.1515/1544-6115.1773 (2012).
  36. Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A. & Conesa, A. Differential expression in RNA-seq: a matter of depth. Genome Res. 21, 2213–2223 (2011).
  37. Cumbie, J.S. et al. GENE-counter: a computational pipeline for the analysis of RNA-seq data for gene expression differences. PLoS ONE 6, e25279 (2011).
  38. Hardcastle, T.J. & Kelly, K.A. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11, 422 (2010).
  39. Leng, N. et al. An empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29, 1035–1043 (2012).
  40. Tuna, M. & Amos, C.I. Genomic sequencing in cancer. Cancer Lett. http://dx.doi.org/doi:10.1016/j.canlet.2012.11.004 (2012).
  41. Rhind, N. et al. Comparative functional genomics of the fission yeasts. Science 332, 930–936 (2011).
  42. Kumar, S. & Blaxter, M.L. Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11, 571 (2010).
  43. Papanicolaou, A., Stierli, R., Ffrench-Constant, R.H. & Heckel, D.G. Next generation transcriptomes for next generation genomes using est2assembly. BMC Bioinformatics 10, 447 (2009).
  44. Lohse, M. et al. RobiNA: a user-friendly, integrated software solution for RNA-seq–based transcriptomics. Nucleic Acids Res. 40, W622–W627 (2012).
  45. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17 http://journal.embnet.org/index.php/embnetjournal/article/view/200/479 (2011).
  46. Haas, B.J., Chin, M., Nusbaum, C., Birren, B.W. & Livny, J. How deep is deep enough for RNA-seq profiling of bacterial transcriptomes? BMC Genomics 13, 734 (2012).
  47. Brown, C.T., Howe, A., Zhang, Q., Pryrkosz, A.B. & Brom, T.H. A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv:1203.4802 [q-bio.GN] (2012).
  48. Borodina, T., Adjaye, J. & Sultan, M. A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 500, 79–98 (2011).
  49. Parkhomchuk, D. et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37, e123 (2009).
  50. Sung, W.K. et al. Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat. Genet. 44, 765–769 (2012).


Author information

  1. These authors contributed equally to this work.

    • Brian J Haas &
    • Alexie Papanicolaou

Affiliations

  1. Broad Institute of Massachusetts Institute of Technology (MIT) and Harvard, Cambridge, Massachusetts, USA.

    • Brian J Haas,
    • Moran Yassour,
    • Nathalie Pochet &
    • Aviv Regev
  2. Commonwealth Scientific and Industrial Research Organisation (CSIRO) Ecosystem Sciences, Black Mountain Laboratories, Canberra, Australian Capital Territory, Australia.

    • Alexie Papanicolaou &
    • Michael Ott
  3. The Selim and Rachel Benin School of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel.

    • Moran Yassour &
    • Nir Friedman
  4. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.

    • Manfred Grabherr
  5. Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

    • Philip D Blood
  6. CSIRO Information Management & Technology, St. Lucia, Queensland, Australia.

    • Joshua Bowden
  7. Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, Oklahoma, USA.

    • Matthew Brian Couger
  8. Genomics Research Centre, Griffith University, Gold Coast Campus, Gold Coast, Queensland, Australia.

    • David Eccles
  9. Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, USA.

    • Bo Li &
    • Colin N Dewey
  10. Center for Information Services and High-performance Computing (ZIH), Technische Universität Dresden, Dresden, Germany.

    • Matthias Lieber
  11. California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, California, USA.

    • Matthew D MacManes
  12. Institute for Genome Sciences, Baltimore, Maryland, USA.

    • Joshua Orvis
  13. Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie (VIB), Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium.

    • Nathalie Pochet
  14. Parco Tecnologico Padano, Località Cascina Codazza, Lodi, Italy.

    • Francesco Strozzi
  15. Corn Insects and Crop Genetics Research Unit, United States Department of Agriculture–Agricultural Research Service, Ames, Iowa, USA.

    • Nathan Weeks
  16. Genomics facility, Purdue University, West Lafayette, Indiana, USA.

    • Rick Westerman
  17. GWT-TUD GmbH, Saxony, Germany.

    • Thomas William
  18. Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, USA.

    • Colin N Dewey
  19. University Information Technology Services, Research Technologies Division, Indiana University, Bloomington, Indiana, USA.

    • Robert Henschel &
    • Richard D LeDuc
  20. Department of Biology, Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Aviv Regev

Contributions

B.J.H. is the current lead developer of Trinity and is additionally responsible for the development of the companion in silico normalization and TransDecoder utilities described herein. M.Y. contributed to Butterfly software enhancements, generating figures and to the manuscript text. B.L. and C.N.D. developed RSEM and are responsible for enhancements related to improved Trinity support. B.J.H. and A.P. wrote the initial draft of the manuscript. A.R. is the Principal Investigator. All authors contributed to Trinity development and/or writing of the final manuscript, and all authors approved the final text.

Competing financial interests

The authors declare no competing financial interests.


Supplementary information

PDF files

  1. Supplementary Note (699 KB)

    Supplementary materials for de novo transcript sequence reconstruction from RNA-seq: reference generation and analysis with Trinity.

  2. Supplementary Figure 1 (554 KB)

    Defining minimum edge thresholds during initial Butterfly graph pruning.

  3. Supplementary Figure 2 (551 KB)

    Butterfly's minimum support requirement for path extension during transcript reconstruction.

  4. Supplementary Figure 3 (530 KB)

    Merging of insufficiently different path sequences.

  5. Supplementary Figure 4 (536 KB)

    Enforcing path restrictions via triplet locking.

  6. Supplementary Figure 5 (540 KB)

    Restrictions on the number of paths to be extended at each node.

  7. Supplementary Figure 6 (636 KB)

    Evaluating assembly completeness for the S. pombe transcriptome.

  8. Supplementary Figure 7 (584 KB)

    Evaluating assembly completeness for the mouse dendritic cell transcriptome.

  9. Supplementary Figure 8 (551 KB)

    Correlation of expression values between reference transcripts and Trinity transcript components according to percent length agreement in S. pombe.

  10. Supplementary Figure 9 (584 KB)

    Agreement between expression profiles calculated based on reference transcripts and Trinity components in different S. pombe samples.

Comments


    Brian Haas said:

    Please acknowledge the following additional funding support:

    JO was supported by National Science Foundation grant OCE-1046371.
