Abstract
Sequence comparison tools for metagenomeassembled genomes (MAGs) struggle with highvolume or lowquality data. We present skani (https://github.com/bluenote1577/skani), a method for determining average nucleotide identity (ANI) via sparse approximate alignments. skani outperforms FastANI in accuracy and speed (>20× faster) for fragmented, incomplete MAGs. skani can query genomes against >65,000 prokaryotic genomes in seconds and 6 GB memory. skani unlocks higherresolution insights for extensive, noisy metagenomic datasets.
Main
Consider the fundamental problem of computing sequencetosequence similarity between metagenomeassembled genomes (MAGs). Modern studies generate hundreds of thousands of MAGs^{1,2}, and searching these MAGs against a database or computing all pairwise similarities takes billions of comparisons; this is infeasible with traditional alignmentbased methods. Thus, largescale sequence comparison for metagenomic data is dominated by sketching methods. Sketching methods summarize datasets into small collections of kmers; these sketches can be efficiently compared against one another and return an average nucleotide identity (ANI) estimate.
Unfortunately, sketching methods such as Mash^{3} or sourmash^{4} may underestimate ANI when genome incompleteness is present^{5}. The decrease has nothing to do with the genetic distance between the genomes; it is simply an artefact of the assembly being incomplete. Even ‘mediumquality’ MAGs typically only require >50% completeness^{1}, so this is an issue in practice. On the other hand, alignmentbased methods are able to estimate ANI from only the orthologous regions, so incompleteness is not an issue. Additionally, the fraction of the genomes aligned to one another (the aligned fraction) is a useful statistic that pure sketching methods do not estimate. There is thus a need for algorithms that are fast, like sketching methods, yet robust to noise due to assembly artefacts, like alignment methods.
We developed skani, a fast, robust tool for calculating aligned fraction and ANIs in the >82% range. skani’s ANI method is robust against incomplete and fragmented MAGs, yet it is multiple orders of magnitude faster than alignmentbased methods and over an order of magnitude faster than even the stateoftheart FastANI^{6}. skani uses a very sparse kmer chaining^{7,8,9} procedure to quickly find orthologous regions between two genomes. This allows for sequence identity estimation using kmers on only the shared regions between two genomes (Fig. 1a), avoiding the pitfalls of alignmentignorant sketching methods. Like BLASTbased ANI methods, skani breaks genomes into nonoverlapping fragments, estimates the ANI for each fragment and then averages the ANI to output an ANI estimate. We then use a trained regression model to debias our ANI estimates (Methods).
We first verified that existing ANI methods are indeed sensitive to incompleteness and fragmentation in Extended Data Fig. 1. In a synthetic test, fragmented, incomplete yet identical genomes had ANI estimates that were systematically lower than 100% for all methods but ANIm^{10,11}, a slow but accurate method. We subsequently chose ANIm as a baseline when comparing MAGs. Mash was the most affected, with up to a difference of 4% ANI at 50% completeness, which can cause two very similar genomes of up to 99% ANI to be classified as different species when subject to the standard 95% ANI species threshold^{6}. FastANI was sensitive to fragmentation (low N50), which is why a minimum N50 of 10,000 is used in the original study^{6}, but that N50 requirement is not met in many real experiments^{1,12}. We additionally show simulations with mutations and chimeric genomes in Supplementary Figs. 1–3.
Next, we showed on real MAGs that only skani and ANIm are robust to MAG quality for highresolution ANI calculations. In Fig. 1b, we compared subspecies level MAGs generated by Pasolli et al^{1}, a large collection of mediumquality and highquality shortread assembled MAGs. As incompleteness and contamination increase, Mash and FastANI trend toward lower ANI, but this is not seen for ANIm (Extended Data Fig. 2) or skani. Extended Data Fig. 3 confirms our simulation results and shows that incompleteness and fragmentation are to blame for the bias.
Because ANI underestimations due to MAG quality are systematically biased, such ANI estimates can strongly impact downstream applications. We show in Fig. 1c that the cluster heatmaps obtained by averagelinkage clustering for a specieslevel bin differ greatly between ANI methods. skani’s heatmap qualitatively resembles ANIm’s heatmap (Extended Data Fig. 2) more closely than the other methods, yet it is >500 times faster than ANIm and >50 times faster than FastANI for computing the distance matrix (Supplementary Fig. 5).
To quantify the concordance of the clustering against ANIm, we used cophenetic correlation^{13} (Supplementary Note). In Fig. 1c and Extended Data Fig. 4, we see that skani has a better cophenetic correlation with respect to ANIm than all other compared methods. Notably, the ordering of the methods with respect to cophenetic correlation in Extended Data Fig. 4 is skani > sourmash maxcontain^{14} > FastANI > mash; this exactly agrees with the ordering using R^{2} values from the contaminationincompleteness plots in Fig. 1b and Extended Data Fig. 2, implying that MAG assembly artefacts are indeed to blame for the clustering discordance.
We explore skani on three additional datasets: ocean eukaryotic MAGs^{15,16}, ocean archaea MAGs^{12} and soil prokaryotic MAGs^{17}, which in addition to the Pasolli et al. dataset gives four datasets in total. Extended Data Fig. 5 shows that skani’s results generalize to a diverse set of genomes, including eukaryotic MAGs with a median size of 17.6 Mbp (and 32 MAGs of size >100 Mbp). When comparing each method’s deviation from ANIm and considering the 1 to 99 percentile deviations, skani has the smallest 1 to 99 percentile interval lengths, indicating robustness. Extended Data Figs. 6 and 7 show that skani has a better linear aligned fraction correlation with ANIm than FastANI. The aligned fraction accuracy can be improved further by controlling subsampling rate of the kmers (parameter c in Methods).
An important task is classifying MAGs (or isolate genomes) by searching against a database of reference genomes. Such databases represent a diverse collection of genomes where only a fraction of the genomes are similar to the query. Therefore, sensitively searching against each reference is unnecessary. To enable efficient database search, we augmented skani with a quick sketchingbased ANI filter against distant genomes before performing a more accurate ANI computation.
Figure 2 shows that skani can query an Escherichia coli genome against the GTDB R207 database^{18} (>65,000 genomes) in comparable speed and memory to Mash. skani is much faster than FastANI for querying (>20 times on the E. coli dataset) and is >2.5 times faster than Mash for indexing. Furthermore, skani can do alltoall comparisons on a set of 4,233 bacterial genomes as quickly as Mash due to the fast filtering of unrelated genomes. FastANI takes much longer (Supplementary Fig. 9), but FastANI does more comparisons due to a slightly larger valid ANI range.
In Fig. 2a and Supplementary Fig. 7, we benchmarked skani against OrthoANIu^{19,20} (which we shorten to ANIu), a faster but almostidentical analogue of the BLASTbased ANIb, as a baseline. We did not use ANIm, because ANIm overestimates ANI for pairs of genomes with <90% ANI^{20,21}. skani outputs an ANI estimate only if one of the genomes has predicted AF ≥15% by default, which ends up giving reasonable ANIs down to the 82% range on the three datasets shown. skani’s accuracy is usually better than Mash but slightly worse than FastANI for referencequality genomes, skani is better on the E. coli dataset, which includes many fragmented, possibly incomplete genomes that give rise to many Mash and FastANI outliers. The results for FastANI improved (Pearson R from 0.974 to 0.994) on the E. coli dataset if we removed genomes with N50 <10,000, giving the exact same Pearson R value as the originally reported FastANI results^{6}. Thus, skani gives only slightly less accurate values than FastANI on referencequality genomes with the assurance of robustness for lowquality assemblies.
We have shown that skani improves on the stateoftheart for metagenomic sequence comparison. skani’s key operating regimes are for mediumtohigh ANI (>82%), comparisons against diverse sets of genomes (such as databases), and fast alltoall comparisons for up to tens of thousands of highly similar genomes. skani is limited in extreme regimes, such as low ANI or comparing hundreds of thousands of similar genomes (for example alltoall calculation for the >200,000 E. coli genomes currently available in RefSeq). Future work includes exploring parameter choices and methods for accessing more extreme operating regimes, for example, lineartime clustering heuristics or more sensitive aminoacid computations.
In conclusion, skani is almost as fast as sketchingbased methods for ANI database search, yet it gives a more robust signal when comparing noisy MAGs. Given the overwhelming amount of data generated by modern metagenomic studies, we believe skani’s ability to analyze an order of magnitude more data while simultaneously giving a stronger signal will allow examination of vast metagenomic sequences at a higher resolution, unlocking new types of analysis not possible before.
Methods
Sequence identity estimation
Formally, let G be a string of nucleotides and \({G}^{{\prime} }\) be a mutated version of G where every letter is independently changed to a different letter with probability θ. We will define the true ANI to be equal to 1 − θ under our model. Under the usual assumption of no repetitive kmers^{24}, it is easy to estimate θ from kmer matching statistics^{3,24,25} between G and \({G}^{{\prime} }\). We proved in a previous work that for random, mutating strings, the expected number of kmer matches arising spuriously from repetitiveness for a string of length n is \(\sim \frac{{n}^{2}}{{4}^{k}}\) (Theorem 4 in Shaw and Yu^{26}), so the usual assumption of no repetitive kmers is not a bad one in practice for simpler, noneukaryotic genomes and large enough k.
Sketching methods such as Mash^{3} or sourmash^{4} use the above framework to calculate ANI through the aforementioned kmer statistics, which are built around estimating the normalized overlap between the kmers in two genomes. More precisely, these indices take the form \(\frac{ A\cap B }{f(A,B)}\) where A and B are the sets of kmers in two genomes, and f(A, B) controls the normalization. The Jaccard index used in Mash^{3} corresponds to f(A, B) = ∣A ∪ B∣, and the containment index is f(A, B) = ∣A∣ (ref. ^{14}). The main index we consider is the maxcontainment index, which corresponds to \(f(A,B)=\min ( A , B )\); minimizing the denominator maximizes the containment. These indices can approximate ANI through simple formulas^{25,27}.
However, when dealing with MAGs, we do not have G and \({G}^{{\prime} }\) but instead fragmented, contaminated and incomplete versions of G and \({G}^{{\prime} }\). The models used in these sketching methods give biased estimates for ANI, resulting in underestimated ANIs, because missing kmer matches may be due to mutations, incompleteness, or contamination instead of only due to mutations. Thus, all of the above kmer statistics suffer from the following problem: as two MAGs become more incomplete, the overlap ∣A ∩ B∣ may decrease more than f(A, B) because a homologous kmer is simply not present in the assembly, therefore making the index smaller.
This issue is also present in the context of kmer based alignmentfree genome comparison using reads^{28,29,30}, where a genome could potentially not be fully covered due to the random sampling of reads. In our case, however, we have access to incomplete, fragmented assemblies instead of reads. Thus, to accurately use kmer statistics, we first find orthologous regions by approximate alignment and then use kmer statistics.
Algorithm outline
The main idea behind skani is to find an approximate set of orthologous alignments between two genomes by obtaining a set of minimally overlapping kmer chains^{7} (the chains do not overlap much; the kmers may overlap within the chain). We can then estimate ANI from the statistics of the kmers in the chains, avoiding costly baselevel alignment. The main algorithmic steps are listed below.

1.
We use a very sparse set of marker ℓmers to estimate maxcontainment index and obtain a putative ANI using the FracMinHash method (section Sketching by FracMinHash). We filter out pairs of genomes with putative ANI <80% (section Maxcontainment putative ANI screening with marker ℓmers).

2.
We select the genome with the larger score, defined as total sequence length times mean contig length, to be the reference and the other to be the query. We then fragment the query into 20kb nonoverlapping chunks. In particular, this implies that the ANI computed by skani does not depend on the order of the inputs (that is, it is symmetric).

3.
We extract \(\frac{1}{c}\) fraction of kmers for both genomes for some c (c = 125 by default) as seeds to be used for chaining using FracMinHash (section Obtaining sparse seeds for chaining).

4.
For each chunk on the query, we chain the seeds using a standard banded, heuristic chaining method against the reference (section Chaining sparse kmer seeds).

5.
We greedily extract minimally overlapping chains between the query and the reference and output aligned fraction (section Obtaining orthologous chains from homologous chains).

6.
We estimate the ANI for each chunk, output the mean ANI over all chunks, and perform a learned ANI debiasing step (sections Estimating ANI from chains and Nonparametric regression for ANI debiasing).
Sketching by FracMinHash
Instead of using the set of all kmers in a genome, we use a compressed representation by sketching, by which we mean selecting only a subset of all kmers. To obtain such a set of kmers, we use the FracMinHash method^{31}: given a hash function h that maps kmers to [0, M], we select the kmer x as a seed if h(x) < M/γ, where γ controls the fraction of selected kmers. Assuming a uniform hash function, the expected fraction of selected kmers is \(\frac{1}{\gamma }\).
Although FastANI uses minimizer^{32} kmers to estimate Jaccard index and then ANI, recently, it was shown that Jaccard estimates (and thus ANI estimates) from minimizer kmers are biased^{27} and depend crucially on the window size w. Although this bias is not too bad when the w is small (FastANI uses relatively small w = 24), it scales as w increases. This means that a minimizer ANI estimator cannot use very sparse seeds, since the fraction of selected seeds is \(\frac{2}{w+1}\) (ref. ^{33}).
Maxcontainment putative ANI screening with marker ℓmers
skani is not optimized for comparing distant genomes, so we can filter out comparisons against distant genomes using a very sparse set of FracMinHash ℓmers, which we call markers. We use these markers to estimate the maxcontainment index^{14}; the same method is implemented in sourmash^{25}, although sourmash does not use max containment by default. Let a set of markers obtained from FracMinHash from the genome G_{1} (with γ = c_{m} and ℓmers) be denoted as A, and denote B as the analogous set from the genome G_{2}. Assuming one of the genomes is contained in the other completely, we then calculate an ANI estimate between two genomes G_{1}, G_{2} as
The term on the right inside the exponent is the maxcontainment index. FracMinHash has only negligible bias in calculating the containment index^{25} and can be used to obtain an estimate of ANI no matter the density (the fraction of selected kmers). We let ℓ = 21 for our marker ℓmers and c_{m} = 1,000 by default. We found skani’s ANI algorithm was most accurate when ANI ≥ ~82%, so we only compare genomes with ANI_{FMH} > 80% as a conservative underestimate.
We can store all markers contained in the reference genomes as keys in a hash table with the associated values being the set of genomes containing the key (that is, an inverted index). Given a query genome, we can then obtain the intersection of all markers for the query against the references by checking all of its markers against the inverted index. This allows computation of the maxcontainment index in running time dependent on the number of similar genomes in the references^{34}. From this, we can obtain ANI_{FMH}, making our filtering step fast as long as our reference and query genomes are diverse. When performing alltoall pairwise comparisons, skani uses this strategy by default. However, when querying a small number of genomes against a database, skani iteratively checks all pairs of genomes by default because building the hash table takes a nontrivial amount of time.
Obtaining sparse seeds for chaining
The marker ℓmers described above are only used for filtering, but not for actually estimating the ANI. To actually estimate the ANI, we use a hybrid approach via local mapping (without baselevel alignment) and containment index estimation. The first step in our approach is to obtain a different set of FracMinHash kmers, this time taking γ, the sampling rate of kmers, to be γ = c where c < c_{m}, and k < ℓ. This gives a denser, more sensitive set of kmers to be used as seeds. These new kmers are called seeds instead of markers because we actually use them as seeds for kmer matching and alignment. We note that although we could have used other ‘contextindependent’ kmer seeding methods that are more ‘conserved’ than FracMinHash^{35}, we found that FracMinHash works well enough for relatively sparse seeds when c ≫ k.
By default, k = 15 and c = 125. We note that the small value of k = 15 used by default leads to too many repetitive anchors on larger genomes, so we mask the top seeds that occur more than 2,500/c times by default. See the section Chaining score function and algorithm below for a justification of the choice of 2,500/c.
Choosing the c parameter
The main parameter influencing runtime and accuracy is the c parameter. The default value of c = 125 works well on a variety of tasks such as searching databases and MAG comparison. We found that for very fragmented and distant genomes, lowering c may lead to more accurate ANI and AF estimates (Extended Data Figs. 5–7 and Supplementary Fig. 10). However, the runtime is inversely proportional to c. To guide users for choosing c, we suggest three different preset values of c (in addition to the default) in skani v0.1.2’s help messages based on empirical heuristics: a ‘slow’ preset with c = 30 for the most accurate AF estimates and pairs of genomes with N50 ≈ 3 kb, a ‘medium’ preset with c = 70 for genomes with ANI ≤95 and N50 ≤10 kb, and a ‘fast’ preset c = 200 for similar genomes with >95% ANI with N50 >10 kb.
skani’s index file sizes are inversely proportional to c, and a drawback of skani is that the index file sizes are larger than pure sketching methods (Supplementary Table 3), but the full indices can be stored on disk and only read into RAM by skani as needed (see skani implementation details).
Chaining sparse kmer seeds
After selecting one genome as a reference and the other as the query, we fragment the query into 20kb nonoverlapping chunks. For each chunk, we find a set anchors, which are exact kmer matches between g and \({g}^{\,{\prime} }\). Each anchor can be described by a tuple (x, y), indicating the starting position of the matching kmers on g and \({g}^{{\prime} }\) respectively. We collect the anchors into a strictly increasing subsequence called a chain^{7} based on the ordering (x_{1}, y_{1}) ≺ (x_{2}, y_{2}) if x_{1} < x_{2} and y_{1} < y_{2}.
We sort the anchors (x_{1}, y_{1}), …(x_{N}, y_{N}) in lexicographic order and let S(i, j) be the score of chaining the ith anchor to the jth anchor. We wish to find a strictly increasing subsequence (based on our previously defined ordering ≺ ) of anchors (i_{1}, i_{2}, …) maximizing ∑_{ℓ}S(i_{ℓ}, i_{ℓ−1}). The optimal such subsequence can be calculated in O(N^{2}) time where N is the number of anchors by the following dynamic programming: letting f(i) be the optimal score of the chain up to the ith anchor, let
After calculating f(i) for each anchor i, we can obtain a set of optimal chains. We describe this in detail below.
Chaining score function and algorithm
The chaining problem can be solved optimally in subquadratic time for a variety of chaining costs^{8,9}, but we opt for a simple heuristic method that is fast and good enough for our purposes instead. Letting (x_{i}, y_{i}) and (x_{j}, y_{j}) be the ith and jth anchors, we define our chaining cost to simply be S(i, j) = 20 − ∣(y_{j} − y_{i}) − (x_{j} − x_{i})∣. Notably, we allow and do not penalize overlapping kmers because this would bias the ANI calculation, as we want the chain to include all kmers that arise from sequence homology. We also do not need to use more sophisticated scoring functions because we do not actually need to worry about baselevel alignments or finding the longest chain. We use a banded dynamic programming method where instead of iterating over all j < i, we iterate only up to i − A < j < i or stop if ∣x_{i} − x_{j}∣ > B for some constants A and B. Thus, the worstcase time is O(AN) instead of O(N^{2}). By default, we let A = B/c, where B = 2,500 and c is the seed subsampling rate. This banded procedure is a simpler version of minimap2.23’s chaining procedure, which also employs a heuristic to stop chaining early via the maxchainskip parameter.
Obtaining homologous chains by backtracking
For each chunk, after computing all optimal scores f(i) over all anchors using banded dynamic programming, we obtain optimal chains using the standard method of backtracking. That is, we store an array of pointers corresponding to anchors, where the pointer points to the optimal predecessor determined by the dynamic programming. For any anchor, we could then trace through this array to obtain the best chain corresponding to each anchor.
We wish to obtain a set of chains that do not share any anchors for each chunk. To do this, we partition the anchors into disjoint sets using a unionfind data structure, taking unions of two anchor representatives whenever one anchor is an optimal predecessor of another. We then find the best f(i) within each disjoint set and backtrack to obtain the optimal chain within each disjoint set. We take the set of all such chains over all chunks and call these chains homologous chains.
Obtaining orthologous chains from homologous chains
It is possible that a single region chains to multiple paralogs, so the homologous chains obtained above may not be orthologous. We will use the term ‘orthologous’ loosely in the same sense as other ANI methods^{6} – we denote a set of mappings (corresponding to chains) to be orthologous if they do not overlap too much along one of the genomes (that is, no duplicated onetoall mappings). To obtain orthologous chains, we use a greedy minimally overlapping chainfinding procedure. We first sort every homologous chain (over all chunks) by its score and examine the best chain, only examining chains with three or more anchors. If the overlap between this current best chain and any other already selected chains is less than 50% of the current chain’s length, we select this best homologous chain as an orthologous chain and then examine the next best homologous chain. We repeat this procedure until all remaining homologous chains overlap an orthologous chain by more than 50% and return all orthologous chains.
This procedure is similar in spirit to the reciprocal mapping method used in other ANI methods^{6,19} to capture orthology, but it avoids performing alignment twice, making running time twice as fast. Other more sophisticated methods^{36} exist for finding sets of orthologous alignments but we found this heuristic to be good enough for our method.
Estimating ANI from chains
For a given chunk, let α be the number of anchors in the orthologous chains on that chunk, and M be the number of seeds in the chunk. Then we estimate an ANI for each chunk by
This comes from modeling each kmer as being an exact match and thus an anchor with probability (1 − θ)^{k} under a simple, independent mutation model, so the expected number of anchors is (1 − θ)^{k} times the number of kmers^{25}.
However, the above formula fails when only part of the chunk is homologous to the reference, which may happen due to incompleteness of MAGs or structural variation. This will underestimate the ANI, since not all seeds in the chunk arise from sequence homology. Let M_{LR} be the number of seeds in the chunk contained only between the leftmost and rightmost anchor over all minimally overlapping chains, and consider
If \({\hat{{{\mbox{ANI}}}}}_{LR} > 0.950\) and the number of bases covered by in between these flanking anchors is >4*c, then we use \({\hat{{{\mbox{ANI}}}}}_{LR}\) as the estimate instead. This heuristic comes from the observation that when the ANI is large enough, our sparse chains are relatively accurate, so we can truncate the chunk with good accuracy. However, if ANI is small, it is more possible that this heuristic incorrectly excludes seeds that arise from sequence homology, so we avoid applying the heuristic to distant pairs of genomes.
Estimating final ANI and AF
To obtain our final ANI estimates, we take a weighted mean over all chunks with weights given by M (or M_{LR} if the heuristic is applied instead), the number of seeds in the chunk. We also report the reference and query alignment fraction as the sum of the bases covered by all chains divided by the respective genome size. The number of bases in a chain is the last anchor position minus the first anchor position plus 2*c, where the 2*c term is to account for kmers near the edge of the chain missed by subsampling. By default, we only output a final ANI value if the alignment fraction is greater than 15%.
Because our final ANI estimate is just a weighted mean over the ANIs of the chunks, once we have these ANIs, we can quickly do bootstrapping by resampling the chunks and calculating the weighted mean over the resamples. We use 100 iterations and only proceed if there are >10 chunks, outputting the 5th and 95th percentile ANIs to be our final 90% confidence interval. We found that the bootstrap gave reasonable ANI confidence intervals over skani’s inherent uncertainty (for example kmer seeding variance), but we stress that it does not account for any systematic biases of skani’s ANI estimator relative to other methods.
Nonparametric regression for ANI debiasing
The chaining procedure can overestimate ANI, because the chains may exclude homologous but mutated kmers near the edges of the chain due to the local mapping procedure. To handle this, we introduce an optional postprocessing regression step to debias skani.
We trained a gradient boosted regression tree with absolute deviation loss where the target variable is ANIm’s ANI calculation, and the features consist of the following: skani’s ANI, standard deviation of skani’s putative ANI distribution, and the 90th percentile contig lengths in the reference and query, and the average length of the kmer chains, giving five features in total. We trained skani on a large, diverse set of 52,515 MAGs from Nayfach et al^{2}. We computed alltoall pairwise ANI values with >90% ANI according to an untrained version skani, and ran ANIm on the resulting 1,004,213 pairs of genomes. To tune hyperparameters such as tree depth, number of trees, and learning rate, we chose a set of human gut MAGs^{37} that comes from a separate study than the training set and the datasets used in our results and then optimized our parameters over this new dataset.
To show that skani is not simply memorizing the organisms in the dataset, we partitioned this training set into two sets, A and B, where each part is ‘disjoint’ from one another in the sense that skani does not output an ANI estimate between any MAG in A and B due to low aligned fraction (<15%). We show in Supplementary Figs. 6 and 11 that this debiasing procedure still corrects the ANI to be closer to a MUMmerbased ground truth.
We only debias comparisons with putative ANI > 90% and > 150, 000 aligned bases. We enable the debiasing procedure by default when the parameter c is ≥70. We noticed that for smaller c, bias is less of an issue. We trained two models, one for c = 125 (default) and another for c = 200. If the user selects a specific value of c, the model with corresponding c closer to the selected value is used. In particular, our results with c = 125 have debiasing enabled, but our results with c = 30 do not.
skani implementation details
skani is implemented in the rust, a systemslevel programming language, for speed. skani implements four primary subcommands: sketch, dist, search and triangle. The sketch command stores genomes in sketched representation (kmer seeds and markers) for dropin replacement for the other three commands. We use the same fast invertible kmer hash function as minimap2 (ref. ^{38}) for sketching. Additionally, skani uses AVX2 SIMD instructions to do vectorized seeding of 64bit kmers in 256bit lanes when AVX2 instructions are detected, which we found to speed up sketching by approximately 30%. Chaining was implemented naively with no hardware accelerations.
The three commands for sequence comparison, dist, triangle, and search, all calculate AF and ANI but have different runtime behavior. dist and triangle keeps all sketches in RAM, whereas the search command only keeps the marker kmers in memory and allows for onthefly loading of the full sketch for each genome into RAM if the markerbased ANI threshold for the comparison is sufficiently high (≥ 80% by default), afterward discarding the index from RAM. This makes search much more memory efficient and possibly more timeefficient than dist when querying a large database, but it is IObound and not as efficient as dist when performing alltoall comparisons. The triangle command is similar to dist but is limited to computing distance matrices, but because skani is symmetric, it performs only n(n − 1)/2 comparisons instead of all n^{2}. The inverted index for marker kmer filtering is enabled by default in triangle, and otherwise in search and dist when the number of query genomes is ≥100. See Supplementary Fig. 12 for a benchmark on skani’s runtime scaling with thread usage.
Benchmarking details
All runtimes were benchmarked on a Intel Xeon CPU at 3.10 GHz machine with 64 cores and 240 GB RAM as a Google Cloud instance with a persistent SSD disk. Unless otherwise specified, all programs were run using 50 threads. Exact commands are shown in Supplementary Table 1.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets are also specified in Supplementary Table 2 with descriptions. The Pasolli et al. 2550 dataset^{1} is available at http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html. Ocean archaea MAGs^{12} are available at https://doi.org/10.6084/m9.figshare.c.5564844.v1. Soil MAGs^{17} are available at https://figshare.com/collections/Genomes_for_consistent_metagenomederived_metrics_verify_and_define_bacterial_species_boundaries/4508162/1. Ocean eukaryotic MAGs^{15,16} are available at https://osf.io/gm564/ and https://www.genoscope.cns.fr/tara/. Nayfach et al. MAGs^{2} are available at https://genome.jgi.doe.gov/GEMs. E. coli, B. anthracis, and the D5 dataset genomes are available at http://enveomics.ce.gatech.edu/data/fastani. GTDBR207 (ref. ^{18}) database is available at https://gtdb.ecogenomic.org. B. fragilis genomes and the refseqrc database is available at https://zenodo.org/record/8058221 (ref. ^{39}).
Code availability
skani is available at https://github.com/bluenote1577/skani. Scripts and notebooks for reproducing all figures is available at https://github.com/bluenote1577/skanitest. skani v0.1.4’s source code is deposited at https://zenodo.org/record/8058221 (ref. ^{39}).
References
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).
Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
Ondov, B. D. et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Pierce, N. T., Irber, L., Reiter, T., Brooks, P. & Brown, C. T. Largescale sequence comparisons with sourmash. F1000Research 8, 1006 (2019).
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: A tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through dereplication. The ISME Journal 11, 2864–2868 (2017).
Jain, C., RodriguezR, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Abouelhoda, M. I. & Ohlebusch, E. Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3, 321–341 (2005).
Jain, C., Gibney, D. & Thankachan, S. V. Colinear chaining with overlaps and gap costs. In Pe’er, I. (ed.) Research in Computational Molecular Biology, Lecture Notes in Computer Science, 246–262 (Springer International Publishing, 2022).
Mäkinen, V. & Sahlin, K. Chaining with Overlaps Revisited. In Gørtz, I. L. & Weimann, O. (eds.) 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), vol. 161, Leibniz International Proceedings in Informatics (LIPIcs), 25:1–25:12 (Schloss Dagstuhl–LeibnizZentrum für Informatik, 2020).
Richter, M. & RossellóMóra, R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl Acad. Sci. USA 106, 19126–19131 (2009).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
Nishimura, Y. & Yoshizawa, S. The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments. Sci. Data 9, 305 (2022).
Sokal, R. R. & Rohlf, F. J. The comparison of dendrograms by objective methods. Taxon 11, 33–40 (1962).
Koslicki, D. & Zabeti, H. Improving MinHash via the containment index with applications to metagenomic analysis. Appl. Math. Comput. 354, 206–215 (2019).
Alexander, H. et al. Eukaryotic genomes from a global metagenomic dataset illuminate trophic modes and biogeography of ocean plankton. Preprint at bioRxiv https://doi.org/10.1101/2021.07.25.453713 (2022).
Delmont, T. O. et al. Functional repertoire convergence of distantly related eukaryotic plankton lineages abundant in the sunlit ocean. Cell Genom. 2, 100123 (2022).
Olm, M. R. et al. Consistent metagenomederived metrics verify and delineate bacterial species boundaries. mSystems 5, e00731–19 (2020).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Lee, I., Ouk Kim, Y., Park, S.C. & Chun, J. OrthoANI: An improved algorithm and software for calculating average nucleotide identity. Int. J. Syst. Evol. Microbiol. 66, 1100–1103 (2016).
Yoon, S.H., Ha, S.m, Lim, J., Kwon, S. & Chun, J. A largescale evaluation of algorithms to calculate average nucleotide identity. Antonie van Leeuwenhoek 110, 1281–1286 (2017).
Palmer, M., Steenkamp, E. T., Blom, J., Hedlund, B. P. & Venter, S. N. All ANIs are not created equal: Implications for prokaryotic species boundaries and integration of ANIs into polyphasic taxonomy. Int. J.Syst. Evol. Microbiol. 70, 2937–2948 (2020).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
Blanca, A., Harris, R. S., Koslicki, D. & Medvedev, P. The statistics of kmers from a sequence undergoing a simple mutation process without spurious matches. J. Comput. Biol. 29, 155–168 (2022).
Hera, M. R. et al. Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using fracminhash. Genome Res. 33, 1061–1068 (2023).
Shaw, J. & Yu, Y. W. Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an averagecase analysis of the seedchainextend heuristic. Genome Res. 33, 1175–1187 (2023).
Belbasi, M., Blanca, A., Harris, R. S., Koslicki, D. & Medvedev, P. The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics 38, i169–i176 (2022).
Tang, K., Ren, J. & Sun, F. Afann: Bias adjustment for alignmentfree sequence comparison based on sequencing data using neural network regression. Genome Biol. 20, 266 (2019).
Fan, H., Ives, A. R., SurgetGroba, Y. & Cannon, C. H. An assembly and alignmentfree method of phylogeny reconstruction from nextgeneration sequencing data. BMC Genom. 16, 522 (2015).
Sarmashghi, S., Bohmann, K., P. Gilbert, M. T., Bafna, V. & Mirarab, S. Skmer: Assemblyfree and alignmentfree sample identification using genome skims. Genome Biol. 20, 34 (2019).
Irber, L. et al. Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers. Preprint at bioRxiv https://doi.org/10.1101/2022.01.11.475838 (2022).
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).
Schleimer, S., Wilkerson, D. S. & Aiken, A. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03, 76–85 (Association for Computing Machinery, 2003).
Yu, Y. W., Daniels, N. M., Danko, D. C. & Berger, B. Entropyscaling search of massive biological data. Cell Syst. 1, 130–140 (2015).
Shaw, J. & Yu, Y. W. Theory of local kmer selection with applications to longread alignment. Bioinformatics 38, 4659–4669 (2022).
Frith, M. C. & Kawaguchi, R. Splitalignment of genomes finds orthologies more accurately. Genome Biol. 16, 106 (2015).
Zeng, S. et al. A compendium of 32,277 metagenomeassembled genomes and over 80 million genes from the earlylife human gut microbiome. Nat. Commun. 13, 5139 (2022).
Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Shaw, J. & Yu, Y.W. skani experiment files and source code. Zenodo https://zenodo.org/record/8058221 (2023).
Acknowledgements
J.S. was supported by an NSERC CGSD scholarship. Work supported by Natural Sciences and Engineering Research Council of Canada (NSERC) grant RGPIN202203074 and DND/NSERC Supplement DGDND202203074.
Author information
Authors and Affiliations
Contributions
J.S. conceived the project, designed the algorithms and implemented skani. Y.W.Y. supervised and contributed to the development of the methods. Both authors wrote and edited the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Donovan Parks, Rob Patro, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 ANI benchmark under simulated fragmentation and incompleteness.
We fragmented an E. coli genome to obtain nonoverlapping contigs with lengths distributed according to an exponential distribution (mean length on the xaxis) and then retained each contig with some probability (probability on the yaxis) if the length was ≥ 1000 bp. These simulated MAGs (20 for each pair length and probability parameters) were compared to each other and the average is shown.
Extended Data Fig. 2 ANIm and sourmash experiments corresponding to the experiments shown in Fig. 1.
sourmash maxcontainment estimates ANI using the maxcontainment index, corresponding to the –maxcontainment option in sourmash.
Extended Data Fig. 3 ANI sensitivity to fragmentation and incompleteness.
The same experiment was run as in Fig. 1b on Pasolli et al 2550, except we used N50 as a covariate instead of contamination.
Extended Data Fig. 4 Violin plot of cophenetic correlation deviation for each method with respect to ANIm on the Pasolli 2550 data set.
We took all specieslevel bins in the Pasolli data set with > 25 and < 50 genomes and computed distance matrices for all ANI methods. sourmash maxcontain implies the maxcontainment option in sourmash, otherwise sourmash was run with default parameters. For each bin, we compared the resulting clustering dendrograms for each method against ANIm’s distance matrix by taking the cophenetic correlation. We plot the difference between this cophenetic correlation and the cophenetic correlation of ANIm’s dendrogram to itself (which may be < 1 since ANIm’s averagecluster dendrogram may not be perfectly concordant with itself). The legend shows the mean deviation of each method’s cophenetic correlation from ANIm’s cophenetic correlation.
Extended Data Fig. 5 ANI deviation from ANIm for four data sets of MAGs with > 90% ANI (as predicted by ANIm).
See Supplementary Table 2 for data set descriptions. skani c30 indicates skani with the parameter c set to 30, and skaninl indicates no learned ANI regression debiasing (with default c = 125). sourmc and sour correspond to sourmash with maxcontain enabled and disabled. Blue lines show 5 and 5 deviation, and the legend indicates the 1 and 99 percentile deviations as well as their difference. The best 1% to 99% deviation distance for each data set is either skani or skanic30. Lowering the c parameter generally improves results slightly, but is not guaranteed to do so; the default c = 125 gives results more in line with ANIm on the Pasolli 2550 data set.
Extended Data Fig. 6 Aligned fraction correlation between ANIm and skani with default parameters (c = 125), skani with c = 30, and FastANI.
Only MAGs with > 90% ANI as predicted by ANIm are compared. Lowering the value of c for skani gives a more accurate signal. skani with default parameters still outperforms FastANI in terms of Pearson R for all data sets, and skani with c = 30 has near perfect correspondence with ANIm. The noisier results for archaea MAGs are likely due to its small N50 (median 5863 bp) and genome length (median 1.27 Mbp).
Extended Data Fig. 7 Aligned fraction deviation from ANIm as a function of ANI.
Dashed lines indicate the 1 and 99 percentiles of aligned fraction deviation from ANIm. Aligned fraction gets more accurate for skani as ANI increases. Decreasing c for skani improves the relationship between ANI and aligned fraction and decreases the variance as well.
Supplementary information
Supplementary Information
Supplementary Note, Figs. 1–12 and Tables 1–3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shaw, J., Yu, Y.W. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat Methods 20, 1661–1665 (2023). https://doi.org/10.1038/s41592023020183
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41592023020183