Full-length transcriptome assembly from RNA-Seq data without a reference genome

Journal name: Nature Biotechnology
Volume: 29
Pages: 644–652
DOI: 10.1038/nbt.1883

Abstract

Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
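As a rough illustration of the de Bruijn graphs mentioned above, the following minimal sketch builds one from a handful of reads. It assumes a common formulation in which nodes are k-mers and an edge joins two k-mers that are adjacent in a read (and so overlap by k − 1 bases); the function name `build_de_bruijn` and the toy reads are hypothetical and are not taken from the Trinity software.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return {kmer: {next_kmer: supporting_read_count}} for a set of reads.

    Illustrative only: no error correction, no reverse complements.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Consecutive k-mers in a read overlap by k - 1 bases by construction.
        for i in range(len(read) - k):
            left = read[i:i + k]
            right = read[i + 1:i + k + 1]
            graph[left][right] += 1
    # Convert to plain dicts so lookups do not create empty entries.
    return {node: dict(edges) for node, edges in graph.items()}


toy_reads = ["ACGTAC", "CGTACG"]  # hypothetical reads
graph = build_de_bruijn(toy_reads, k=4)
print(graph["CGTA"])  # {'GTAC': 2}, the edge seen in both reads
```

The edge weights here count how many reads support each transition, which is the kind of per-edge read support an assembler can use when deciding which paths to keep.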

At a glance

Figures

  1. Overview of Trinity.

    (a) Inchworm assembles the read data set (short black lines, top) by greedily searching for paths in a k-mer graph (middle), resulting in a collection of linear contigs (colored lines, bottom), with each k-mer present only once in the contigs. (b) Chrysalis pools contigs (colored lines) if they share at least one (k − 1)-mer and if reads span the junction between contigs, and then it builds individual de Bruijn graphs from each pool. (c) Butterfly takes each de Bruijn graph from Chrysalis (top), and trims spurious edges and compacts linear paths (middle). It then reconciles the graph with reads (dashed colored arrows, bottom) and pairs (not shown), and outputs one linear sequence for each splice form and/or paralogous transcript represented in the graph (bottom, colored sequences).

  2. Trinity correctly reconstructs the majority of full-length transcripts in fission yeast and mouse.

    (a,c) The fraction of genes that are fully reconstructed and in the Oracle Set in different expression quantiles (5% increments) in fission yeast (50 M pairs assembly) (a) and the fraction of genes that have at least one fully reconstructed transcript and are in the Oracle Set in different expression quantiles in mouse (53 M pairs assembly) (c). Each bar represents a 5% quantile of read coverage for genes expressed. Gray bars show the remaining fraction of transcripts that are in the Oracle Set but not fully reconstructed. For example, ~36% of the S. pombe transcripts at the bottom 5% of expression levels are fully reconstructed by Trinity; ~45% of the transcripts in this quantile are in the Oracle Set. (b,d) Curves show the median values for coverage (as fraction of length of reference transcripts) by the longest corresponding Trinity-assembled transcript, according to expression quantiles in yeast (b) and mouse (d), depending on the number of read pairs that went into each assembly.

  3. Trinity improves the yeast annotation.

    Shown are examples of Trinity assemblies (red) along with the corresponding annotated transcripts (blue) and underlying reads (gray) all aligned to the S. pombe genome (read alignment is shown for graphical clarity; no alignments were used to generate the assemblies). (a) Trinity identifies a new multi-exonic transcript (left) and extends the 5′ and 3′ UTRs of the coq9 gene (right). (b) Trinity extends the UTRs of two convergently transcribed and overlapping genes.

  4. Trinity resolves closely paralogous genes.

    (a) The compacted component graph for two paralogous mouse genes, Ddx19a and Ddx19b (93% identity). Red and blue arrows highlight the two paths chosen by Trinity out of the 64 possible paths in this portion of the graph alone. Numbers on the edges indicate the number of supporting reads; numbers in parentheses represent the sequence length at each node. (b) Alignments between the transcripts represented by the red and blue paths in a and the paralogous genes Ddx19a and Ddx19b relative to the mouse reference genome (genome alignment shown for graphical clarity only; no alignments were used to generate the assemblies).

  5. Comparison of Trinity to other mapping-first and assembly-first methods.

    (a,b) Evaluation based on number of full-length annotated transcripts reconstructed by each method in S. pombe (50 M read pair assemblies) (a) and mouse (53 M read pair assemblies) (b). Number of genes reconstructed in full length (blue) or as fusions of two full-length genes (green, yeast only) and the number of full-length reconstructed transcript isoforms (red, mouse only) in each of four assembly-first (de novo) and two mapping-first approaches. (c,d) Evaluation based on the number of introns defined by the transcripts from each method for S. pombe (c) and mouse (d). Shown is the number of distinct introns consistent with the reference annotation (y axis) versus the number of uniquely predicted introns (x axis), based on mapping to the genome of the transcripts reconstructed by the different methods. (e,f) Evaluation based on the number of splicing patterns (complete sets of introns in multi-intronic transcripts) defined by the transcripts from each method for S. pombe (e) and mouse (f). Shown are the numbers of distinct splicing patterns (y axis) consistent with the reference annotation versus the number of unique splicing patterns (x axis), for each method.

  6. Trinity reconstructs polymorphic transcripts in whitefly.

    (a) Allelic variation evident from mapping RNA-Seq reads to a full-length whitefly transcript reconstructed by Trinity. At the top is a schematic of a single transcript orthologous to the Drosophila melanogaster Lamin gene Lam, identified by grouping reconstructed transcripts having allelic variants (colored yellow). Gray coverage plot shows cumulative read coverage along the transcripts. SNPs are marked with colored bars and scaled based on the relative proportions of each variant (blue: C, red: T, orange: G, green: A). Individual reads are shown below coverage plot (forward reads, blue; reverse, red). (b) Comparison of performance for de novo assembly of the whitefly transcriptome. The y axis is a count of the unique top-matching (BLASTX) uniref90 (ref. 20) protein sequences aligned to Trinity transcripts across a minimal percent of their length. (c) Example of two alternatively spliced transcripts resolved even in the absence of a reference genome. Shown are two isoforms of an ELAV-like gene reconstructed by Trinity (gray boxes indicate alternative exons). Exon structure is determined for visualization by the D. melanogaster ortholog. The protein sequence alignment shows the similarity between the two whitefly isoforms and orthologous proteins from other insects, and it confirms the splice variants (gray boxes).
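
The greedy contig construction described for Inchworm in the Figure 1a caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Trinity implementation: it seeds with the most abundant unused k-mer, extends in both directions by the most abundant unused overlapping k-mer, and uses each k-mer only once. Sequencing errors, reverse complements and abundance thresholds are ignored, and the names `count_kmers`, `extend` and `greedy_contigs` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig, counts, used, k, to_right):
    """Greedily grow a contig in one direction until no unused k-mer fits."""
    while True:
        if to_right:
            overlap = contig[-(k - 1):]
            candidates = [overlap + base for base in "ACGT"]
        else:
            overlap = contig[:k - 1]
            candidates = [base + overlap for base in "ACGT"]
        candidates = [c for c in candidates if c in counts and c not in used]
        if not candidates:
            return contig
        # Most abundant candidate wins; ties are broken lexicographically.
        best = max(candidates, key=lambda c: (counts[c], c))
        used.add(best)
        contig = contig + best[-1] if to_right else best[0] + contig


def greedy_contigs(reads, k):
    """Assemble linear contigs in which each k-mer appears at most once (k >= 2)."""
    counts = count_kmers(reads, k)
    used = set()
    contigs = []
    # Visit seeds from most to least abundant, deterministically.
    for seed, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if seed in used:
            continue
        used.add(seed)
        contig = extend(seed, counts, used, k, to_right=True)
        contig = extend(contig, counts, used, k, to_right=False)
        contigs.append(contig)
    return contigs


toy_reads = ["ATGGCGT", "GGCGTGC"]  # hypothetical overlapping reads
print(greedy_contigs(toy_reads, k=4))  # ['ATGGCGTGC']
```

Because every k-mer is consumed by the first contig that reaches it, branches in the underlying graph (for example, alternative splice forms) end up as separate contigs. Per the caption, Chrysalis and Butterfly are the stages that pool such contigs and resolve them into full transcripts.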

Accession codes

Primary accessions

Gene Expression Omnibus

Sequence Read Archive

References

  1. Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009).
  2. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
  3. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).
  4. Haas, B.J. & Zody, M.C. Advancing RNA-Seq analysis. Nat. Biotechnol. 28, 421–423 (2010).
  5. Yassour, M. et al. Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing. Proc. Natl. Acad. Sci. USA 106, 3264–3269 (2009).
  6. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).
  7. De Bruijn, N.G. A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen 46, 758–764 (1946).
  8. Good, I.J. Normal recurring decimals. J. Lond. Math. Soc. 21, 167–169 (1946).
  9. Pevzner, P.A., Tang, H. & Waterman, M.S. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98, 9748–9753 (2001).
  10. Zerbino, D.R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
  11. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820 (2008).
  12. Hertz-Fowler, C. et al. GeneDB: a resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res. 32, D339–D343 (2004).
  13. Levin, J.Z. et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat. Methods 7, 709–715 (2010).
  14. Parkhomchuk, D. et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37, e123 (2009).
  15. Rhind, N. et al. Comparative functional genomics of the fission yeasts. Science published online, doi:10.1126/science.1203357 (21 April 2011).
  16. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
  17. Wilhelm, B.T. et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453, 1239–1243 (2008).
  18. Xu, Z. et al. Bidirectional promoters generate pervasive transcription in yeast. Nature 457, 1033–1037 (2009).
  19. Wu, T.D. & Watanabe, C.K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
  20. Wu, C.H. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34, D187–D191 (2006).
  21. Wapinski, I., Pfeffer, A., Friedman, N. & Regev, A. Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61 (2007).
  22. Molnar, M. et al. Characterization of rec7, an early meiotic recombination gene in Schizosaccharomyces pombe. Genetics 157, 519–532 (2001).
  23. Nakamura, T., Kishida, M. & Shimoda, C. The Schizosaccharomyces pombe spo6+ gene encoding a nuclear protein with sequence similarity to budding yeast Dbf4 is required for meiotic second division and sporulation. Genes Cells 5, 463–479 (2000).
  24. Watanabe, T. et al. Comprehensive isolation of meiosis-specific genes identifies novel proteins and unusual non-coding transcripts in Schizosaccharomyces pombe. Nucleic Acids Res. 29, 2327–2337 (2001).
  25. Yassour, M. et al. Strand-specific RNA sequencing reveals extensive regulated long antisense transcripts that are conserved across yeast species. Genome Biol. 11, R87 (2010).
  26. Matlin, A.J., Clark, F. & Smith, C.W.J. Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005).
  27. Robertson, G. et al. De novo assembly and analysis of RNA-seq data. Nat. Methods 7, 909–912 (2010).
  28. Graveley, B.R. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17, 100–107 (2001).
  29. Wang, X.-W. et al. De novo characterization of a whitefly transcriptome and analysis of its gene expression during development. BMC Genomics 11, 400 (2010).
  30. Salzberg, S.L. & Yorke, J.A. Beware of mis-assembled genomes. Bioinformatics 21, 4320–4321 (2005).
  31. Shannon, C.E. Prediction and entropy of printed English. Bell Syst. Tech. J. 30, 50–64 (1951).
  32. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1, i351–i358 (2005).
  33. Grabherr, M.G. et al. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics 26, 1145–1151 (2010).
  34. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
  35. Kent, W.J. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).

Author information

  1. These authors contributed equally to this work.

    • Manfred G Grabherr,
    • Brian J Haas &
    • Moran Yassour

Affiliations

  1. Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts, USA.

    • Manfred G Grabherr,
    • Brian J Haas,
    • Moran Yassour,
    • Joshua Z Levin,
    • Dawn A Thompson,
    • Ido Amit,
    • Xian Adiconis,
    • Lin Fan,
    • Raktima Raychowdhury,
    • Qiandong Zeng,
    • Zehua Chen,
    • Evan Mauceli,
    • Nir Hacohen,
    • Andreas Gnirke,
    • Federica di Palma,
    • Bruce W Birren,
    • Chad Nusbaum,
    • Kerstin Lindblad-Toh &
    • Aviv Regev
  2. School of Computer Science, Hebrew University, Jerusalem, Israel.

    • Moran Yassour &
    • Nir Friedman
  3. Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Moran Yassour &
    • Aviv Regev
  4. Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts, USA.

    • Nicholas Rhind
  5. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.

    • Kerstin Lindblad-Toh
  6. Alexander Silberman Institute of Life Sciences, Hebrew University, Jerusalem, Israel.

    • Nir Friedman
  7. Howard Hughes Medical Institute, Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Aviv Regev

Contributions

M.G.G., M.Y., B.J.H., K.L.-T., N.F. and A.R. conceived and designed the study. B.J.H., M.G.G. and M.Y. developed the Inchworm, Chrysalis and Butterfly components, respectively. N.R., F.D.P., B.W.B., C.N. and K.L.-T. contributed to the study's conception and execution. J.Z.L., D.A.T., X.A., L.F., R.R., I.A., N.H., A.R. and A.G. designed and performed all experiments. Q.Z., Z.C. and E.M. contributed computational analyses. M.G.G., B.J.H. and M.Y. designed, implemented and evaluated all methods. A.R., N.F., M.G.G., B.J.H. and M.Y. wrote the manuscript, with input from all authors. A.R. and N.F. contributed equally to this paper.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (406K)

    Supplementary Tables 1–3, Supplementary Methods, Supplementary Note and Supplementary Figures 1–9
