Simulation-based comprehensive benchmarking of RNA-seq aligners

Abstract

Alignment is the first step in most RNA-seq analysis pipelines, and the accuracy of downstream analyses depends heavily on it. Unlike most steps in the pipeline, alignment is particularly amenable to benchmarking with simulated data. We benchmarked 14 common splice-aware aligners for base-, read-, and exon junction-level accuracy and compared default with optimized parameters. We found that performance varied by genome complexity and that accuracy and popularity were poorly correlated. The most widely cited tool underperformed on most metrics, particularly with default settings.
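
To make the three accuracy levels concrete, the following is a minimal sketch of how base-, read-, and junction-level precision and recall could be computed when the true origin of every simulated read is known. It is not the authors' benchmarking code (that is distributed as Supplementary Software). The data layout (each read as a list of 1-based inclusive exon blocks on a single chromosome), the function names, and the rule that a read counts as correct only if every base matches are assumptions made for illustration; indels and multi-mapping reads are ignored.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: a read is a list of exon blocks (start, end),
# 1-based inclusive, in read order, all on one chromosome.
Blocks = List[Tuple[int, int]]


def base_positions(blocks: Blocks) -> List[int]:
    """Genome position of each read base, in read order (no indels)."""
    return [pos for start, end in blocks for pos in range(start, end + 1)]


def junctions(blocks: Blocks) -> Set[Tuple[int, int]]:
    """Splice junctions implied by consecutive blocks: (donor end, acceptor start)."""
    return {(blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1)}


def ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when nothing was counted."""
    return numerator / denominator if denominator else 0.0


def evaluate(truth: Dict[str, Blocks], aligned: Dict[str, Blocks]) -> Dict[str, float]:
    total_bases = aligned_bases = correct_bases = 0
    aligned_reads = correct_reads = 0
    true_junctions: Set[Tuple[int, int]] = set()
    called_junctions: Set[Tuple[int, int]] = set()

    for read_id, true_blocks in truth.items():
        true_pos = base_positions(true_blocks)
        total_bases += len(true_pos)
        true_junctions |= junctions(true_blocks)

        pred_blocks = aligned.get(read_id)
        if pred_blocks is None:
            continue  # unaligned read: lowers recall, not precision

        pred_pos = base_positions(pred_blocks)
        aligned_reads += 1
        aligned_bases += len(pred_pos)
        matches = sum(t == p for t, p in zip(true_pos, pred_pos))
        correct_bases += matches
        # Assumption: a read is correct only if every base is placed correctly.
        if matches == len(true_pos) == len(pred_pos):
            correct_reads += 1
        called_junctions |= junctions(pred_blocks)

    true_positive_junctions = len(true_junctions & called_junctions)
    return {
        "base_precision": ratio(correct_bases, aligned_bases),
        "base_recall": ratio(correct_bases, total_bases),
        "read_precision": ratio(correct_reads, aligned_reads),
        "read_recall": ratio(correct_reads, len(truth)),
        "junction_precision": ratio(true_positive_junctions, len(called_junctions)),
        "junction_recall": ratio(true_positive_junctions, len(true_junctions)),
    }


# Toy example: r1 is aligned perfectly, r2 is shifted by 10 bases, r3 is unaligned.
truth = {"r1": [(100, 149), (300, 349)], "r2": [(500, 599)], "r3": [(1000, 1099)]}
aligned = {"r1": [(100, 149), (300, 349)], "r2": [(510, 609)]}
result = evaluate(truth, aligned)
```

In this toy case half of the aligned bases are correct, while only one third of all simulated bases are recovered, which illustrates why precision and recall are reported separately at each level.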

Figure 1
Figure 2
Figure 3: The effect of tuning parameters on the human-T3-data base-level statistics.
Figure 4: Runtime performance on human and malaria data.

Change history

  • 22 December 2016

    In the version of this analysis initially published online, the first sentence of Supplementary Note 7 was incorrect; it has been corrected to read "Computational performance refers to how long it takes the alignment to run and how much memory it requires." Supplementary Note 7 has also been removed and its text included in the new Supplementary Note 8. The format for the supplementary information titles was incorrect; these have been updated to the standard format. The supplementary figures and notes have been renumbered to reflect callouts in the main text. The supplementary figures have been renumbered: Supplementary Figures 6–14 are now Supplementary Figures 2–10, Supplementary Figure 5 is now Supplementary Figure 11, and Supplementary Figures 2–4 are now Supplementary Figures 12–14. The supplementary notes have also been renumbered: Supplementary Note 5 is now Supplementary Note 1, Supplementary Note 1 is now Supplementary Note 2, Supplementary Note 10 is now Supplementary Note 3, Supplementary Notes 2–6 are now Supplementary Notes 4–7, and Supplementary Note 11 is now Supplementary Note 10. These errors have been corrected in this file as of 22 December 2016.

Acknowledgements

We thank A. Srinivasan for his help administrating the PMACS cluster. We thank N. Lahens, T. Grosser, D. Sarantopoulou, F. Coldren, E. Scarci, and E. Ricciotti for support and helpful discussions. This work was funded in part by the National Heart, Lung, and Blood Institute (U54HL117798, G.A.F.) and the National Center for Advancing Translational Sciences (UL1-TR-001878, G.A.F.).

Author information

Contributions

G.B. contributed research, analysis, and writing. K.E.H. contributed analysis, figures, and benchmarking scripts. E.J.K. contributed analysis. B.D.C. contributed analysis and formulation of ideas. G.A.F. contributed formulation of ideas and direction. G.R.G. contributed the simulated data, direction, ideas, and writing.

Corresponding author

Correspondence to Gregory R Grant.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Notes 1–10 and Supplementary Tables 1–43. (PDF 6431 kb)

Source data to Supplementary Figure 1

Source data to Supplementary Figure 2

Source data to Supplementary Figure 3

Source data to Supplementary Figure 4

Source data to Supplementary Figure 5

Source data to Supplementary Figure 6

Source data to Supplementary Figure 7

Source data to Supplementary Figure 8

Source data to Supplementary Figure 9

Source data to Supplementary Figure 10

Source data to Supplementary Figure 11

Source data to Supplementary Figure 12

Source data to Supplementary Figure 13

Source data to Supplementary Figure 14

Source data to Supplementary Figure 15

Supplementary Data 1

Information about the tools involved in the comparison. (XLSX 54 kb)

Supplementary Data 2

Statistics and accuracy metrics of tweaked alignment on Human. (XLSX 59 kb)

Supplementary Data 3

Statistics and accuracy metrics of tweaked alignment on Malaria. (XLSX 1629 kb)

Supplementary Data 4

Statistics and accuracy metrics of default alignment on Human and Malaria (latest tool versions). (XLSX 65 kb)

Supplementary Data 5

Statistics and accuracy metrics of default alignment on Human. (XLSX 17 kb)

Supplementary Data 6

Statistics and accuracy metrics of default alignment on Malaria. (XLSX 17 kb)

Supplementary Data 7

Statistics and accuracy metrics achieved by the best tweaked alignment on Human. (XLSX 26 kb)

Supplementary Data 8

Statistics and accuracy metrics achieved by the best tweaked alignment on Malaria. (XLSX 72 kb)

Supplementary Data 9

Statistics and accuracy metrics of default alignment on Human including/omitting annotation. (XLSX 50 kb)

Supplementary Data 10

Statistics and accuracy metrics of default alignment on Malaria including/omitting annotation. (XLSX 31 kb)

Supplementary Data 11

Computational performance metrics of default alignment on Human. (XLS 78 kb)

Supplementary Data 12

Computational performance metrics of default alignment on Malaria. (XLS 78 kb)

Supplementary Data 13

Statistics and accuracy metrics of short anchored reads alignment on Human. (XLSX 310 kb)

Supplementary Data 14

Statistics and accuracy metrics of simulated adapters alignment on Human. (XLSX 313 kb)

Supplementary Data 15

Statistics and accuracy metrics of canonical and noncanonical junctions on Human. (XLSX 148 kb)

Supplementary Software

All scripts used in this analysis. (ZIP 3592 kb)

Cite this article

Baruzzo, G., Hayer, K., Kim, E. et al. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods 14, 135–139 (2017). https://doi.org/10.1038/nmeth.4106
