MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Steinegger, Martin; Söding, Johannes

doi:10.1038/nbt.3988

Correspondence
Published: 16 October 2017

Nature Biotechnology volume 35, pages 1026–1028 (2017)

To the Editor

The throughput of DNA sequencing has increased much faster than computational speed in the past decade, and sensitive sequence searching has become the main bottleneck in the analysis of large metagenomic data sets. We therefore developed MMseqs2 (https://github.com/soedinglab/mmseqs2), which improves on current search tools across the full range of the speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed.

**Figure 1: MMseqs2 searching in a nutshell.**

**Figure 2: MMseqs2 pushes the boundaries of sensitivity-speed trade-off.**

Acknowledgements

We are grateful to C. Notredame and C. Seok for hosting M.S. at the CRG in Barcelona and at Seoul National University for 12 and 18 months, respectively, and to Burkhard Rost at TU Munich for accepting the formal supervision of his PhD thesis. We thank M. Mirdita, L. van den Driesch, and C. Galiez for contributing utilities and workflows, and S. Sunagawa, M. Frith, T. Rattei and our laboratory for feedback on the manuscript. This work was supported by the European Research Council's Horizon 2020 Framework Programme for Research and Innovation (“Virus-X”, project no. 685778) and by the German Federal Ministry for Education and Research (BMBF) (grants e:AtheroSysMed 01ZX1313D, “SysCore” 0316176A).

Author information

Authors and Affiliations

Martin Steinegger and Johannes Söding are in the Quantitative and Computational Biology group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.
Martin Steinegger is in the Department for Bioinformatics and Computational Biology, Technische Universität München, Garching, Germany.

Contributions

M.S. developed the software and performed the data analysis. M.S. and J.S. conceived of and designed the algorithms and benchmarks and wrote the manuscript.

Corresponding author

Correspondence to Johannes Söding.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Eliminating random memory access during k-mer match stage in MMseqs2.

Numbers in this figure are represented in hexadecimal notation (e.g. 0xFF is equal to 255 in decimal). After the end of loop 2 (Fig. 1B), the matches array on the left, containing single k-mer matches between the query sequence and various target sequences, is processed in two steps to find double k-mer matches. In the first step, the entries (target_ID, i−j) of matches are sorted into 2^B arrays (bins) according to the lowest B bits of target_ID. Here, for illustration purposes, we set B = 8. In the second step, the 2^B bins are processed one by one. For each k-mer match (target_ID, i−j), we run the code in the magenta frame of Fig. 1B. But now, the diagonal_prev array fits into L1/L2 CPU cache, because it only contains ceil(N/2^B) entries, where N is the number of sequences in the target database.
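The two-step procedure in this caption can be sketched in Python. This is a minimal illustration, not the MMseqs2 implementation: the function name, the use of Python lists for bins, and the `None` sentinel are assumptions. The per-match step is also an assumption: it reports a double match when a target's previous k-mer match lies on the same diagonal i−j, which is how the code in the magenta frame of Fig. 1B is understood here.

```python
def double_kmer_matches(matches, n_targets, num_bits=8):
    """Find double k-mer matches via bin-wise processing.

    matches: list of (target_id, diagonal) pairs, diagonal = i - j,
             in the order the single k-mer matches were generated.
    n_targets: number of sequences N in the target database.
    num_bits: B, the number of low target_id bits used to pick a bin.
    """
    n_bins = 1 << num_bits
    mask = n_bins - 1

    # Step 1: scatter matches into 2^B bins by the lowest B bits of target_id.
    bins = [[] for _ in range(n_bins)]
    for target_id, diagonal in matches:
        bins[target_id & mask].append((target_id, diagonal))

    # Step 2: process the bins one by one. Within a bin, the remaining high
    # bits identify the target, so diagonal_prev needs only ceil(N / 2^B) slots.
    slots = (n_targets + n_bins - 1) >> num_bits
    hits = []
    for bin_matches in bins:
        diagonal_prev = [None] * slots  # small enough to stay cache-resident
        for target_id, diagonal in bin_matches:
            slot = target_id >> num_bits
            if diagonal_prev[slot] == diagonal:
                hits.append((target_id, diagonal))  # same diagonal twice in a row
            diagonal_prev[slot] = diagonal
    return hits
```

In pure Python the cache benefit is not observable; the sketch only shows the access pattern, in which each bin touches a small, contiguous `diagonal_prev` array instead of one array of N entries.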

Supplementary Figure 2 Multi-core scaling of MMseqs2.

Runtimes of MMseqs and MMseqs2 searches at fast and default sensitivity using 1, 2, 4, 8 and 16 threads on a 2 × 8 core server with 128 GB main memory. Theoretically optimal scaling is indicated as a dashed black line for each method. We searched with 6,370 full-length protein queries against 30 million UniProt sequences. On 16 cores, MMseqs achieves 58% and MMseqs2 85% of their theoretical maximum performance extrapolated from the single-core measurement. The improvement in scaling behaviour from MMseqs to MMseqs2 is owed to minimizing random main memory accesses, as explained in Supplementary Figure 1.
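The percentages above can be read as parallel efficiency, that is, the measured speedup divided by the number of cores. The short sketch below assumes this reading; the function names and the runtimes in the usage note are illustrative and are not taken from the benchmark.

```python
def parallel_efficiency(t_single, t_parallel, n_threads):
    """Fraction of ideal scaling: (t_single / t_parallel) / n_threads."""
    return t_single / (n_threads * t_parallel)


def speedup_from_efficiency(efficiency, n_threads):
    """Speedup over one thread implied by a given parallel efficiency."""
    return efficiency * n_threads
```

Under this reading, 85% efficiency on 16 cores corresponds to a speedup of about 13.6 over a single core, and 58% to about 9.3. A hypothetical run that takes 1600 s on one thread and 125 s on 16 threads would have `parallel_efficiency(1600.0, 125.0, 16)` equal to 0.8.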

Supplementary Figure 3 Runtime of MMseqs2 against the UniProt at different sensitivity and database split settings.

We measured the search time with query sets of 10,000 and 100,000 sequences through the UniProt database (Release 2017_03 with 80,204,488 sequences) using four sensitivity settings (faster, fast, default, and sensitive) and splitting the database into 1, 2, and 4 chunks. Runtimes for Refseq/Genbank (Release March 3, 2017 with 81,027,309 sequences) are very similar. The memory consumption of the index table for the split levels of 1, 2, and 4 was 190 GB, 101 GB, and 57 GB, respectively. All searches ran on a 2 × 14-core server with 768 GB main memory.

Supplementary Figure 4 False discovery rate versus E-value threshold.

False discovery rate versus E-value threshold in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Colors are the same as in Fig. 2a.
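A curve of this kind can be computed from a labelled hit list. The sketch below assumes the common definition FDR = FP / (FP + TP) over all hits pooled at or below the threshold; the benchmark's exact procedure (for example, any per-query averaging) is not stated in this caption, and the function name and input layout are illustrative.

```python
def false_discovery_rate(hits, threshold):
    """FDR among hits with E-value <= threshold.

    hits: iterable of (e_value, is_true_positive) pairs.
    Returns 0.0 when no hit passes the threshold.
    """
    accepted = [is_tp for e_value, is_tp in hits if e_value <= threshold]
    if not accepted:
        return 0.0
    false_positives = sum(1 for is_tp in accepted if not is_tp)
    return false_positives / len(accepted)
```

Evaluating the function over a grid of thresholds yields the points of an FDR versus E-value curve.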

Supplementary Figure 5 Sequence searching sensitivity assessment with unshuffled query sequences.

Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity.

Supplementary Figure 6 Sequence profile searching sensitivity assessment with unshuffled query sequence profiles.

Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity. 2 IT: two search iterations, and so on for other iteration counts.

Supplementary Figure 7 False discovery rate versus E-value threshold for profile searches.

False discovery rate versus E-value threshold in version 2 of the sequence profile search sensitivity benchmark using unshuffled query sequences.

Supplementary Figure 8 Accuracy of reported E-values.

The expected number of false positives is the E-value threshold times the number of searches, E × 6324. The observed number of false positives is the total number of false positives below the E-value threshold in all 6324 searches. If E-values were accurate, observed and expected numbers of false positives would coincide (diagonal grey line). LAST and MMseqs2 report the most accurate E-values. The false positives shown were obtained with version 2 of the sequence search sensitivity benchmark. Colors are the same as in Fig. 2a.
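The comparison described here reduces to a few lines. The following sketch assumes a flat list holding the E-values of all false-positive matches pooled over the searches; the function name and argument names are illustrative.

```python
def expected_vs_observed(false_positive_evalues, threshold, n_searches=6324):
    """Return (expected, observed) false-positive counts at an E-value threshold.

    expected: threshold * n_searches, the count implied by accurate E-values.
    observed: number of pooled false positives with E-value below the threshold.
    """
    expected = threshold * n_searches
    observed = sum(1 for e in false_positive_evalues if e < threshold)
    return expected, observed
```

A tool with accurate E-values gives `observed` close to `expected` at every threshold, which corresponds to points on the diagonal of the plot.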

Supplementary Figure 9 Sequence searching sensitivity assessment with single-domain SCOP sequences.

Cumulative distribution of area under the curve (AUC) sensitivity for all 7616 single domain SCOP sequences. Higher curves signify higher sensitivity. AUC up to the first false positive is the fraction of true positive matches found with better E-value than the first false positive match.
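The per-query measure defined in the last sentence can be sketched as follows. The denominator is taken to be the total number of true positives that exist for the query in the database, which is an assumption about how the fraction is normalized; the function name and input layout are likewise illustrative.

```python
def auc_first_fp(hits, n_true_total):
    """Fraction of a query's true positives ranked ahead of its first false positive.

    hits: iterable of (e_value, is_true_positive) pairs for one query.
    n_true_total: number of true positives for this query in the database
                  (assumed denominator).
    """
    if n_true_total == 0:
        return 0.0
    found = 0
    for _e_value, is_tp in sorted(hits, key=lambda hit: hit[0]):
        if not is_tp:
            break  # stop at the first false positive
        found += 1
    return found / n_true_total
```

Collecting this value for every query and plotting its cumulative distribution gives curves of the kind shown in the figure.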

Supplementary Figure 10 False discovery rate versus E-value threshold for the single-domain benchmark.

False discovery rate versus E-value threshold for the single-domain SCOP sequence search sensitivity benchmark.

Supplementary Figure 11 Workflow for fast and deep annotations of the Ocean Microbiome Reference Gene Catalog (OM-RGC) using MMseqs2.

Supplementary Figure 12 Algorithmic changes to perform fast sequence profile searches using MMseqs2.

We precompute all similar k-mers above a similarity threshold for each target profile and store them into the index table. For each query sequence we run over its overlapping, spaced k-mers (loop 2) and look up in the index table (blue frame) only the exact same k-mer. At the ungapped alignment stage we use the target profile consensus sequence. We transpose the results, i.e., we exchange the role of query and target in the results and then, as the last step, align the profiles against all query sequences and transpose back.
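A toy version of this scheme is sketched below. It is not the MMseqs2 code and simplifies in several labelled ways: a three-letter alphabet, contiguous rather than spaced k-mers, profiles given as per-position score dictionaries, a k-mer score equal to the sum of the column scores, and a threshold applied as "greater than or equal". The ungapped and gapped alignment stages are omitted, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACD"  # toy alphabet (assumption); real proteins use 20 letters


def build_profile_index(profiles, k, threshold):
    """Index every k-mer scoring >= threshold against any profile window.

    profiles: dict target_id -> list of columns, each a dict letter -> score.
    Returns dict k-mer -> list of (target_id, j), j = window start in the profile.
    """
    index = defaultdict(list)
    for target_id, profile in profiles.items():
        for j in range(len(profile) - k + 1):
            window = profile[j:j + k]
            for kmer in product(ALPHABET, repeat=k):
                score = sum(col[aa] for col, aa in zip(window, kmer))
                if score >= threshold:
                    index["".join(kmer)].append((target_id, j))
    return index


def search(queries, index, k):
    """Look up only the exact k-mers of each query sequence in the index."""
    results = defaultdict(list)
    for query_id, seq in queries.items():
        for i in range(len(seq) - k + 1):
            for target_id, j in index.get(seq[i:i + k], []):
                results[query_id].append((target_id, i - j))
    return results


def transpose(results):
    """Exchange the roles of query and target in the result lists."""
    swapped = defaultdict(list)
    for query_id, hits in results.items():
        for target_id, diagonal in hits:
            swapped[target_id].append((query_id, diagonal))
    return swapped
```

The design point the sketch illustrates is that the similar-k-mer enumeration is paid once per target profile at indexing time, so each query sequence needs only exact lookups; transposing then groups the hits by profile so that each profile can be aligned against its matching query sequences.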

Supplementary information

Supplementary Figures, Tables and Texts (PDF 5454 kb)
Supplementary Source Code (ZIP 7337 kb)

About this article

Cite this article

Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988

Issue Date: November 2017
