Skani enables accurate and efficient genome comparison for modern metagenomic datasets

Modern high-throughput metagenomics is producing hundreds of thousands of metagenome-assembled genomes (MAGs), which is overwhelming traditional sequence-similarity search methods. We present a computational method, skani, that efficiently compares MAGs on a terabyte scale while being robust to the inherent noise in MAGs, enabling larger and more accurate analyses.

Fig. 1: Skani gives improved clustering and speed over competing methods.


  1. Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021). This paper highlights the scale of modern collections of MAGs, which number in the hundreds of thousands.

  2. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016). This paper reports one of the first sketching methods for the rapid analysis of genomes.

  3. Belbasi, M., Blanca, A., Harris, R. S., Koslicki, D. & Medvedev, P. The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics 38, i169–i176 (2022). This paper shows that certain k-mer seeding schemes give theoretically incorrect estimates of ANI.

  4. Hera, M. R., Pierce-Ward, N. T. & Koslicki, D. Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome Res. (2023). Our paper uses the FracMinHash k-mer seeding scheme analyzed in this paper, which has almost no ANI bias.

  5. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017). This paper shows that Mash gives incorrect estimates of ANI in the presence of MAG incompleteness.

This is a summary of: Shaw, J. & Yu, Y. W. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat. Methods (2023).

