We are grateful to C. Notredame and C. Seok for hosting M.S. at the CRG in Barcelona and at Seoul National University for 12 and 18 months, respectively, and to Burkhard Rost at TU Munich for accepting the formal supervision of his PhD thesis. We thank M. Mirdita, L. van den Driesch, and C. Galiez for contributing utilities and workflows, and S. Sunagawa, M. Frith, T. Rattei and our laboratory for feedback on the manuscript. This work was supported by the European Research Council's Horizon 2020 Framework Programme for Research and Innovation (“Virus-X”, project no. 685778) and by the German Federal Ministry for Education and Research (BMBF) (grants e:AtheroSysMed 01ZX1313D, “SysCore” 0316176A).
The authors declare no competing financial interests.
Integrated supplementary information
Numbers in this figure are given in hexadecimal notation (e.g., 0xFF equals 255 in decimal). After the end of loop 2 (Fig. 1B), the matches array on the left, which contains single k-mer matches between the query sequence and various target sequences, is processed in two steps to find double k-mer matches. In the first step, the entries (target_ID, i−j) of matches are sorted into 2^B arrays (bins) according to the lowest B bits of target_ID. Here, for illustration purposes, we set B = 8. In the second step, the 2^B bins are processed one by one. For each k-mer match (target_ID, i−j), we run the code in the magenta frame of Fig. 1B. The diagonal_prev array now fits into the L1/L2 CPU cache because it contains only ceil(N/2^B) entries, where N is the number of sequences in the target database.
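The two-step procedure described in this caption can be sketched in Python as follows. This is a minimal illustration, not the MMseqs2 implementation: the function name find_double_matches is hypothetical, and the per-match step is assumed to compare the current diagonal i−j with the diagonal last stored for the same target, which is our reading of the magenta frame of Fig. 1B.

```python
from math import ceil

def find_double_matches(matches, num_targets, bits=8):
    """Return (target_id, diagonal) pairs seen twice in a row for one target.

    matches: list of (target_id, diagonal) single k-mer matches,
             where diagonal = i - j.
    num_targets: N, the number of sequences in the target database.
    bits: B, the number of low target_id bits used to pick a bin.
    """
    num_bins = 1 << bits  # 2^B bins

    # Step 1: distribute matches into bins by the lowest B bits of target_id.
    bins = [[] for _ in range(num_bins)]
    for target_id, diagonal in matches:
        bins[target_id & (num_bins - 1)].append((target_id, diagonal))

    # Step 2: process the bins one by one. Within a bin all target_ids share
    # their lowest B bits, so the remaining high bits index a small array
    # of only ceil(N / 2^B) entries.
    slots = ceil(num_targets / num_bins)
    doubles = []
    for bin_matches in bins:
        diagonal_prev = [None] * slots
        for target_id, diagonal in bin_matches:
            slot = target_id >> bits
            # Assumed per-match step: same diagonal as the previous match
            # for this target means a double k-mer match.
            if diagonal_prev[slot] == diagonal:
                doubles.append((target_id, diagonal))
            diagonal_prev[slot] = diagonal
    return doubles
```

In Python the cache effect itself is not observable; the sketch only shows why the working array shrinks from N entries to ceil(N/2^B) entries per bin.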
Runtimes of MMseqs and MMseqs2 searches in fast and default sensitivity using 1, 2, 4, 8 and 16 threads on a 2 × 8 core server with 128 GB main memory. Theoretically optimal scaling is indicated as a dashed black line for each method. We searched with 6,370 full-length protein queries against 30 million UniProt sequences. On 16 cores, MMseqs achieves 58% and MMseqs2 85% of their theoretical maximum performance extrapolated from the single-core measurement. The improvement in scaling behaviour from MMseqs to MMseqs2 is due to minimizing random main memory accesses, as explained in Fig. S1.
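The percentages quoted here are parallel efficiencies: the single-thread runtime divided by the thread count times the multi-thread runtime. A minimal sketch of that calculation follows; the function name and the runtimes in the usage line are hypothetical and chosen only to make the arithmetic visible, not taken from the measurements.

```python
def parallel_efficiency(t_single, t_parallel, threads):
    """Fraction of ideal linear speed-up achieved with `threads` threads.

    Ideal runtime on `threads` threads is t_single / threads, so the
    efficiency is (t_single / threads) / t_parallel.
    """
    return t_single / (threads * t_parallel)

# Hypothetical runtimes in seconds, for illustration only.
efficiency = parallel_efficiency(t_single=1360.0, t_parallel=100.0, threads=16)
print(f"{efficiency:.0%}")
```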
Supplementary Figure 3 Runtime of MMseqs2 against the UniProt at different sensitivity and database split settings.
We measured the search time with query sets of 10,000 and 100,000 sequences through the UniProt database (Release 2017_03 with 80,204,488 sequences) using four sensitivity settings (faster, fast, default, and sensitive) and splitting the database into 1, 2, and 4 chunks. Runtimes for RefSeq/GenBank (Release March 3, 2017 with 81,027,309 sequences) are very similar. The memory consumption of the index table for the split levels of 1, 2, and 4 was 190 GB, 101 GB, and 57 GB, respectively. All searches ran on a 2 × 14-core server with 768 GB main memory.
False discovery rate versus E-value threshold in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Colors are the same as in Fig. 2a.
Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity.
Supplementary Figure 6 Sequence profile searching sensitivity assessment with unshuffled query sequence profiles.
Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 unshuffled query sequences in version 2 of the sequence search sensitivity benchmark. Higher curves signify higher sensitivity. "2 IT" denotes two search iterations, and so on.
False discovery rate versus E-value threshold in version 2 of the sequence profile search sensitivity benchmark using unshuffled query sequences.
The expected number of false positives is the E-value threshold times the number of searches, E × 6324. The observed number of false positives is the total number of false positives below the E-value threshold in all 6324 searches. If E-values were accurate, observed and expected numbers of false positives would coincide (diagonal grey line). LAST and MMseqs2 report the most accurate E-values. The false positives shown were obtained with version 2 of the sequence search sensitivity benchmark. Colors are the same as in Fig. 2a.
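The comparison in this caption can be written out as a short calculation. The sketch below assumes the false-positive E-values are available as one list per search; the function names and that data layout are illustrative assumptions, not part of the benchmark code.

```python
def expected_false_positives(e_threshold, num_searches=6324):
    """Expected count: E-value threshold times the number of searches."""
    return e_threshold * num_searches

def observed_false_positives(fp_evalues_per_search, e_threshold):
    """Observed count: false positives below the threshold over all searches.

    fp_evalues_per_search: one list of false-positive E-values per search
    (hypothetical layout chosen for this illustration).
    """
    return sum(
        1
        for evalues in fp_evalues_per_search
        for e in evalues
        if e < e_threshold
    )
```

If reported E-values were accurate, plotting observed against expected counts over a range of thresholds would follow the diagonal.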
Cumulative distribution of area under the curve (AUC) sensitivity for all 7616 single domain SCOP sequences. Higher curves signify higher sensitivity. AUC up to the first false positive is the fraction of true positive matches found with better E-value than the first false positive match.
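The AUC up to the first false positive defined above can be computed per query as sketched below. The function name is hypothetical, and the denominator is assumed to be the total number of true positives that the benchmark defines for the query.

```python
def auc_up_to_first_fp(hits, num_true_positives):
    """Fraction of true positives ranked before the first false positive.

    hits: list of (evalue, is_true_positive) pairs for one query.
    num_true_positives: total true positives possible for this query
    (assumed denominator).
    """
    if num_true_positives == 0:
        return 0.0
    found = 0
    # Walk the hits from best (lowest) to worst E-value.
    for _evalue, is_true_positive in sorted(hits, key=lambda hit: hit[0]):
        if not is_true_positive:
            break  # stop at the first false positive
        found += 1
    return found / num_true_positives
```

The cumulative distribution in the figure would then be obtained by computing this value for every query and plotting the fraction of queries with a value at or above each level.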
Supplementary Figure 10 False discovery rate versus E-value threshold for the single-domain benchmark.
False discovery rate versus E-value threshold for the single-domain SCOP sequence search sensitivity benchmark.
Supplementary Figure 11 Workflow for fast and deep annotations of the Ocean Microbiome Reference Gene Catalog (OM-RGC) using MMseqs2.
Supplementary Figure 12 Algorithmic changes to perform fast sequence profile searches using MMseqs2.
We precompute all similar k-mers above a similarity threshold for each target profile and store them in the index table. For each query sequence, we iterate over its overlapping, spaced k-mers (loop 2) and look up only the exact same k-mer in the index table (blue frame). At the ungapped alignment stage, we use the consensus sequence of the target profile. We then transpose the results, i.e., we exchange the roles of query and target, and, as the last step, align the profiles against all query sequences and transpose back.
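The transposition step can be illustrated on a small in-memory result table. This is only a sketch of the idea of exchanging query and target roles; the dictionary layout and the function name are assumptions made for the example and do not reflect how MMseqs2 stores its results.

```python
from collections import defaultdict

def transpose_results(results):
    """Swap the roles of query and target in a result table.

    results: dict mapping query_id -> list of (target_id, score) hits
    (hypothetical layout). Returns target_id -> list of (query_id, score).
    """
    transposed = defaultdict(list)
    for query_id, hits in results.items():
        for target_id, score in hits:
            transposed[target_id].append((query_id, score))
    return dict(transposed)

# Sequence queries q1, q2 hit target profiles p1, p2.
by_query = {"q1": [("p1", 50), ("p2", 30)], "q2": [("p1", 40)]}
by_profile = transpose_results(by_query)
```

After transposing, each profile lists the sequences it should be aligned against, and applying the same function again restores the per-query view.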
About this article
Cite this article
Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988