MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Figure 1: MMseqs2 searching in a nutshell.
Figure 2: MMseqs2 pushes the boundaries of sensitivity-speed trade-off.

Acknowledgements

We are grateful to C. Notredame and C. Seok for hosting M.S. at the CRG in Barcelona and at Seoul National University for 12 and 18 months, respectively, and to Burkhard Rost at TU Munich for accepting the formal supervision of his PhD thesis. We thank M. Mirdita, L. van den Driesch, and C. Galiez for contributing utilities and workflows, and S. Sunagawa, M. Frith, T. Rattei and our laboratory for feedback on the manuscript. This work was supported by the European Research Council's Horizon 2020 Framework Programme for Research and Innovation (“Virus-X”, project no. 685778) and by the German Federal Ministry for Education and Research (BMBF) (grants e:AtheroSysMed 01ZX1313D, “SysCore” 0316176A).

Author information

Contributions

M.S. developed the software and performed the data analysis. M.S. and J.S. conceived of and designed the algorithms and benchmarks and wrote the manuscript.

Corresponding author

Correspondence to Johannes Söding.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Eliminating random memory access during k-mer match stage in MMseqs2.

Numbers in this figure are given in hexadecimal notation (e.g., 0xFF equals 255 in decimal). After the end of loop 2 (Fig. 1B), the matches array on the left, containing single k-mer matches between the query sequence and various target sequences, is processed in two steps to find double k-mer matches. In the first step, the entries (target_ID, i−j) of matches are sorted into 2^B arrays (bins) according to the lowest B bits of target_ID. Here, for illustration purposes, we set B = 8. In the second step, the 2^B bins are processed one by one. For each k-mer match (target_ID, i−j), we run the code in the magenta frame of Fig. 1B. Now, however, the diagonal_prev array fits into the L1/L2 CPU cache, because it contains only ceil(N/2^B) entries, where N is the number of sequences in the target database.
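The two-step procedure can be sketched in Python as follows. This is an illustrative sketch, not the MMseqs2 source: the function name `double_kmer_matches` is hypothetical, and the double-match test (a match counts when the previous match for the same target lay on the same diagonal) is an assumption about what the magenta frame of Fig. 1B does.

```python
def double_kmer_matches(matches, num_targets, B=8):
    """Return (target_id, diagonal) pairs whose previous match for the same
    target lay on the same diagonal (assumed double-match criterion)."""
    num_bins = 1 << B          # 2^B bins
    mask = num_bins - 1        # selects the lowest B bits of target_id

    # Step 1: scatter the matches into 2^B bins by the lowest B bits of target_id.
    bins = [[] for _ in range(num_bins)]
    for target_id, diagonal in matches:
        bins[target_id & mask].append((target_id, diagonal))

    # Step 2: process the bins one by one with a small diagonal_prev array.
    slots = -(-num_targets // num_bins)   # ceil(N / 2^B) entries
    hits = []
    for bin_matches in bins:
        diagonal_prev = [None] * slots
        for target_id, diagonal in bin_matches:
            slot = target_id >> B         # unique per target within one bin
            if diagonal_prev[slot] == diagonal:
                hits.append((target_id, diagonal))
            diagonal_prev[slot] = diagonal
    return hits
```

Because all targets in one bin share their lowest B bits, the remaining high bits (`target_id >> B`) index `diagonal_prev` without collisions, which is why the array needs only ceil(N/2^B) entries.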

Supplementary Figure 2 Multi-core scaling of MMseqs2.

Runtimes of MMseqs and MMseqs2 searches at fast and default sensitivity using 1, 2, 4, 8 and 16 threads on a 2 × 8-core server with 128 GB main memory. Theoretically optimal scaling is indicated as a dashed black line for each method. We searched with 6370 full-length protein queries against 30 million UniProt sequences. On 16 cores, MMseqs achieves 58% and MMseqs2 85% of their theoretical maximum performance extrapolated from the single-core measurement. The improvement in scaling behaviour from MMseqs to MMseqs2 is due to minimizing random main memory accesses, as explained in Supplementary Fig. 1.
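The percentage of theoretical maximum performance can be read as the usual parallel efficiency: the single-thread runtime divided by the thread count times the multi-thread runtime. A minimal sketch, assuming that definition; the function name is hypothetical and the runtimes in the usage lines are made-up placeholders, not measurements from the paper.

```python
def parallel_efficiency(t_single, t_parallel, threads):
    """Fraction of ideal (linear) speed-up achieved with `threads` threads."""
    return t_single / (threads * t_parallel)

# Placeholder runtimes in seconds, chosen only to illustrate the arithmetic.
efficiency = parallel_efficiency(t_single=1600.0, t_parallel=125.0, threads=16)
print(f"{efficiency:.0%} of ideal scaling")
```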

Supplementary Figure 3 Runtime of MMseqs2 against the UniProt at different sensitivity and database split settings.

We measured the search time with query sets of 10,000 and 100,000 sequences through the UniProt database (Release 2017_03 with 80,204,488 sequences) using four sensitivity settings (faster, fast, default, and sensitive) and splitting the database into 1, 2, and 4 chunks. Runtimes for RefSeq/GenBank (Release March 3, 2017 with 81,027,309 sequences) are very similar. The memory consumption of the index table for the split levels of 1, 2, and 4 was 190 GB, 101 GB, and 57 GB, respectively. All searches ran on a 2 × 14-core server with 768 GB main memory.

Supplementary Figure 4 False discovery rate versus E-value threshold.

False discovery rate versus E-value threshold in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Colors are the same as in Fig. 2a.
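For orientation, one point on such a curve can be computed as below. This is a sketch under the standard definition FDR = FP / (FP + TP) over all matches at or below the threshold; how the benchmark labels a match as true or false positive is not described here, so the `(e_value, is_true_positive)` input format and the function name are assumptions.

```python
def false_discovery_rate(hits, threshold):
    """hits: iterable of (e_value, is_true_positive) pairs pooled over all queries."""
    tp = sum(1 for e_value, is_tp in hits if e_value <= threshold and is_tp)
    fp = sum(1 for e_value, is_tp in hits if e_value <= threshold and not is_tp)
    total = tp + fp
    # With no match at or below the threshold, report 0.0 by convention.
    return fp / total if total else 0.0
```

Evaluating the function over a grid of thresholds yields the full curve.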

Supplementary Figure 5 Sequence searching sensitivity assessment with unshuffled query sequences.

Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity.

Supplementary Figure 6 Sequence profile searching sensitivity assessment with unshuffled query sequence profiles.

Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity. 2 IT: 2 search iterations, etc.

Supplementary Figure 7 False discovery rate versus E-value threshold for profile searches.

False discovery rate versus E-value threshold in version 2 of the sequence profile search sensitivity benchmark using unshuffled query sequences.

Supplementary Figure 8 Accuracy of reported E-values.

The expected number of false positives is the E-value threshold times the number of searches, E × 6324. The observed number of false positives is the total number of false positives below the E-value threshold in all 6324 searches. If E-values were accurate, observed and expected numbers of false positives would coincide (diagonal grey line). LAST and MMseqs2 report the most accurate E-values. The false positives shown were obtained with version 2 of the sequence search sensitivity benchmark. Colors are the same as in Fig. 2a.
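The comparison described above reduces to two numbers per threshold. A minimal sketch, assuming the E-values of all false-positive matches from all searches are pooled into one list; the function name and input format are hypothetical.

```python
def expected_vs_observed_fp(false_positive_evalues, threshold, num_searches=6324):
    """Return (expected, observed) false-positive counts at one E-value threshold."""
    expected = threshold * num_searches   # E x number of searches
    observed = sum(1 for e in false_positive_evalues if e < threshold)
    return expected, observed
```

Points where `observed` exceeds `expected` indicate E-values that are too optimistic; points below the diagonal indicate conservative E-values.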

Supplementary Figure 9 Sequence searching sensitivity assessment with single-domain SCOP sequences.

Cumulative distribution of area under the curve (AUC) sensitivity for all 7616 single-domain SCOP sequences. Higher curves signify higher sensitivity. AUC up to the first false positive is the fraction of true positive matches found with a better E-value than the first false positive match.
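The per-query score defined above can be sketched as follows. The function name and input format are hypothetical, and the denominator is taken to be the total number of true positives available for the query, which is an assumption about the benchmark.

```python
def auc_to_first_fp(hits, num_true_positives):
    """hits: list of (e_value, is_true_positive) for one query.
    num_true_positives: all true positives that could be found for this query."""
    found = 0
    for e_value, is_tp in sorted(hits, key=lambda hit: hit[0]):
        if not is_tp:
            break              # stop at the first false positive
        found += 1
    return found / num_true_positives if num_true_positives else 0.0
```

The sort is stable, so a true and a false positive with identical E-values keep their input order; a real benchmark would need an explicit tie-breaking rule.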

Supplementary Figure 10 False discovery rate versus E-value threshold for the single-domain benchmark.

False discovery rate versus E-value threshold for the single-domain SCOP sequence search sensitivity benchmark.

Supplementary Figure 11 Workflow for fast and deep annotations of the Ocean Microbiome Reference Gene Catalog (OM-RGC) using MMseqs2.

Supplementary Figure 12 Algorithmic changes to perform fast sequence profile searches using MMseqs2.

We precompute all similar k-mers above a similarity threshold for each target profile and store them in the index table. For each query sequence, we iterate over its overlapping, spaced k-mers (loop 2) and look up only the exact same k-mer in the index table (blue frame). At the ungapped alignment stage, we use the target profile consensus sequence. We then transpose the results, i.e., we exchange the roles of query and target, and, as the last step, align the profiles against all query sequences and transpose back.
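The transposition step amounts to re-keying the result lists. A minimal sketch, assuming results are held as a dictionary mapping each key (query or target) to a list of `(partner, score)` pairs; this layout and the function name are hypothetical and do not reflect the MMseqs2 database format.

```python
from collections import defaultdict

def transpose_results(results):
    """Exchange the roles of key and partner in a {key: [(partner, score), ...]} map."""
    transposed = defaultdict(list)
    for key, hits in results.items():
        for partner, score in hits:
            transposed[partner].append((key, score))
    return dict(transposed)

# Results keyed by query sequence become results keyed by target profile.
by_query = {"q1": [("p1", 50), ("p2", 30)], "q2": [("p1", 40)]}
by_profile = transpose_results(by_query)
```

Applying `transpose_results` a second time re-keys the results by query sequence, which corresponds to the "transpose back" step.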

Supplementary information

Supplementary Figures, Tables and Texts (PDF 5454 kb)

Supplementary Source Code (ZIP 7337 kb)

Cite this article

Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988
