Fast and sensitive protein alignment using DIAMOND

Journal name: Nature Methods
Volume: 12
Pages: 59–60
Year published:
DOI: 10.1038/nmeth.3176

The alignment of sequencing reads against a protein reference database is a major computational bottleneck in metagenomics and data-intensive evolutionary projects. Although recent tools offer improved performance over the gold standard BLASTX, they exhibit only a modest speedup or low sensitivity. We introduce DIAMOND, an open-source algorithm based on double indexing that is 20,000 times faster than BLASTX on short reads and has a similar degree of sensitivity.
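
The abstract names double indexing without describing it. The following minimal sketch assumes the term means building a sorted seed index for both the queries and the references and then walking the two lists together, so that each shared seed is found in a single linear pass. The function names are hypothetical, the seeds are plain contiguous k-mers, and nothing here is taken from the DIAMOND source code.

```python
def seed_index(sequences, k):
    """Return a sorted list of (seed, sequence_id, offset) for every k-mer."""
    entries = []
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - k + 1):
            entries.append((seq[offset:offset + k], seq_id, offset))
    entries.sort()
    return entries


def double_index_matches(queries, references, k):
    """Merge two sorted seed indexes and collect every shared seed position.

    Assumption: "double indexing" is modelled here as a merge join over
    two sorted lists; the real program may differ in every detail.
    """
    q_index = seed_index(queries, k)
    r_index = seed_index(references, k)
    hits = []
    i = j = 0
    while i < len(q_index) and j < len(r_index):
        q_seed, r_seed = q_index[i][0], r_index[j][0]
        if q_seed < r_seed:
            i += 1
        elif q_seed > r_seed:
            j += 1
        else:
            # Find the run of entries carrying this seed in each index.
            i_end = i
            while i_end < len(q_index) and q_index[i_end][0] == q_seed:
                i_end += 1
            j_end = j
            while j_end < len(r_index) and r_index[j_end][0] == q_seed:
                j_end += 1
            # Every query occurrence pairs with every reference occurrence.
            for _, q_id, q_off in q_index[i:i_end]:
                for _, r_id, r_off in r_index[j:j_end]:
                    hits.append((q_id, q_off, r_id, r_off))
            i, j = i_end, j_end
    return hits


# Toy input: one query and two references, seed length 3.
toy_hits = double_index_matches(["MKVLA"], ["AAMKVL", "GGGG"], 3)
```

Each hit is a tuple (query id, query offset, reference id, reference offset). Because both lists are sorted by seed, entries that share a seed sit next to each other in memory, which is the locality benefit that a double index is generally expected to provide.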

Figures

  1. Figure 1: Comparison of DIAMOND and RAPSearch2 against BLASTX for four sequencing technologies and for ORFs predicted from a bacterial assembly.

    (a) Fold speedup of each program over BLASTX. (b) Percentage (relative to BLASTX) of queries for which each program reports at least one alignment. (c) Percentage (relative to BLASTX) of matches recovered by each program. Only alignments with an expected value of ≤0.001 are considered. Programs were set to report alignments for up to 250 target sequences per read. Times are wall-clock times on a server using 48 cores and exclude one-time program startup overhead, which was <1 min for BLASTX and 5 min for DIAMOND-fast. HMP, Human Microbiome Project.

  2. Supplementary Fig. 1: Spaced seeds.

    (a) The four seed shapes of weight 12 that DIAMOND uses by default. Ones and zeros indicate positions to use and ignore, respectively. (b) Illustration of the application of a spaced seed to match letters between a reference and a query sequence.

  3. Supplementary Fig. 2: Ratio of main memory accesses.

    The ratio K/K’ as a function of the total length of the query sequences, for different seed lengths. The variables K and K’ represent the approximate number of main memory accesses required when using a single index or double index, respectively.

  4. Supplementary Fig. 3: PCoA analysis of 12 permafrost samples based on a subset of 6 million reads.

    BLASTX results are shown on the left, (a) and (c). DIAMOND-fast results are shown on the right, (b) and (d). The upper two panels show the first and second principal coordinates, whereas the lower two panels show the first and third principal coordinates.
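
The spaced-seed idea described in the caption of Supplementary Fig. 1 (ones mark positions to compare, zeros mark positions to ignore) can be sketched as follows. The shape used here is a short hypothetical one of weight 5, not one of the four weight-12 shapes that DIAMOND uses by default, and the function names are illustrative only.

```python
def apply_seed(shape, sequence, offset):
    """Return the letters of `sequence` at the positions where `shape` has a '1'.

    Returns None when the shape would run past the end of the sequence.
    """
    window = sequence[offset:offset + len(shape)]
    if len(window) < len(shape):
        return None
    return "".join(letter for mask, letter in zip(shape, window) if mask == "1")


def seed_hit(shape, reference, ref_offset, query, query_offset):
    """True if reference and query agree at every '1' position of the shape."""
    ref_key = apply_seed(shape, reference, ref_offset)
    query_key = apply_seed(shape, query, query_offset)
    return ref_key is not None and ref_key == query_key


# Hypothetical shape: length 7, weight 5 (five positions are compared).
SHAPE = "1101011"
reference = "MKVLAAG"
query = "MKTLGAG"  # differs from the reference at positions 2 and 4

spaced = seed_hit(SHAPE, reference, 0, query, 0)
contiguous = seed_hit("1111111", reference, 0, query, 0)
```

In this toy case the two mismatches fall on the zero positions of the shape, so the spaced seed still registers a hit while a contiguous seed of the same length does not.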

Author information

Affiliations

  1. Department of Computer Science and Center for Bioinformatics, University of Tübingen, Tübingen, Germany.

    • Benjamin Buchfink &
    • Daniel H Huson
  2. Singapore Centre on Environmental Life Sciences Engineering, School of Biological Sciences, Nanyang Technological University, Singapore.

    • Chao Xie &
    • Daniel H Huson
  3. Life Sciences Institute, National University of Singapore, Singapore.

    • Chao Xie

Contributions

B.B. designed and implemented the algorithm. C.X. performed the experimental study. C.X. and D.H.H. initiated and guided the project. D.H.H. and B.B. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Spaced seeds. (72 KB)

  2. Supplementary Figure 2: Ratio of main memory accesses. (99 KB)

  3. Supplementary Figure 3: PCoA analysis of 12 permafrost samples based on a subset of 6 million reads. (165 KB)

PDF files

  1. Supplementary Text and Figures (536 KB)

    Supplementary Figures 1–3 and Supplementary Tables 1–3

Zip files

  1. Supplementary Software (2,803 KB)

    DIAMOND v0.4.7 source code
