StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

Pertea, Mihaela; Pertea, Geo M; Antonescu, Corina M; Chang, Tsung-Cheng; Mendell, Joshua T; Salzberg, Steven L

doi:10.1038/nbt.3122

Letter
Published: 18 February 2015

Mihaela Pertea^1,2,
Geo M Pertea^1,2,
Corina M Antonescu^1,2,
Tsung-Cheng Chang^3,4,
Joshua T Mendell^3,4,5 &
…
Steven L Salzberg^1,2,6,7 (ORCID: orcid.org/0000-0002-8859-7432)

Nature Biotechnology volume 33, pages 290–295 (2015)

Abstract

Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
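The two relative gains quoted in the abstract can be checked directly from the transcript counts it reports. The short sketch below does only that arithmetic; the helper name `percent_increase` is a hypothetical one introduced here for illustration and is not part of StringTie or any other tool named above.

```python
def percent_increase(new: int, baseline: int) -> float:
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts of correctly assembled transcripts, as quoted in the abstract.
# StringTie vs. Cufflinks on 90 million reads from human blood:
blood = percent_increase(10_990, 7_187)
# StringTie vs. Cufflinks on the simulated data set:
simulated = percent_increase(7_559, 6_310)

print(f"human blood: {blood:.1f}% more transcripts")
print(f"simulated:   {simulated:.1f}% more transcripts")
```

Rounded to whole percentages, these values agree with the 53% and 20% figures stated in the abstract.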

**Figure 1: Transcript assembly pipelines for StringTie, Cufflinks and Traph.**

**Figure 2: Transcriptome assemblers' accuracies in detecting expressed transcripts from two simulated RNA-seq data sets.**

**Figure 3: Accuracy of transcript assemblers at assembling known genes, measured on real data sets from four different tissues.**

Accession codes

Primary accessions

Sequence Read Archive

SRP041943

Acknowledgements

These studies were supported in part by US National Institutes of Health grants R01-HG006677 (S.L.S.), R01-HG006102 (S.L.S.), R01-GM105705 (G.M.P.), R01-CA120185 (J.T.M.), P01-CA134292 (J.T.M.), and the Cancer Prevention and Research Institute of Texas (J.T.M.).

Author information

Authors and Affiliations

Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA
Mihaela Pertea, Geo M Pertea, Corina M Antonescu & Steven L Salzberg
McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA
Mihaela Pertea, Geo M Pertea, Corina M Antonescu & Steven L Salzberg
Department of Molecular Biology, The University of Texas Southwestern Medical Center, Dallas, Texas, USA
Tsung-Cheng Chang & Joshua T Mendell
Center for Regenerative Science and Medicine, The University of Texas Southwestern Medical Center, Dallas, Texas, USA
Tsung-Cheng Chang & Joshua T Mendell
Simmons Cancer Center, The University of Texas Southwestern Medical Center, Dallas, Texas, USA
Joshua T Mendell
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
Steven L Salzberg
Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
Steven L Salzberg

Contributions

M.P. designed the StringTie method with input from S.L.S. M.P. and G.M.P. implemented the algorithms. C.M.A. ran all programs on the RNA-seq data and tuned their performance. J.T.M. and T.-C.C. produced the kidney cell line data and gave feedback on StringTie's performance. M.P. and S.L.S. wrote the paper. S.L.S. supervised the entire project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Steven L Salzberg.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Tables 1–11 and Supplementary Discussion (PDF 1024 kb)

Supplementary Software 1

StringTie code (ZIP 351 kb)

Source data

Source data to Fig. 1

Source data to Fig. 2

About this article

Cite this article

Pertea, M., Pertea, G., Antonescu, C. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295 (2015). https://doi.org/10.1038/nbt.3122

Received: 15 April 2014
Accepted: 09 December 2014
Published: 18 February 2015
Issue Date: March 2015
DOI: https://doi.org/10.1038/nbt.3122
