Abstract
Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed Linclust, the first clustering algorithm whose runtime scales as N, independent of K. It can also cluster datasets several times larger than the available main memory. We cluster 1.6 billion metagenomic sequence fragments in 10 h on a single server to 50% sequence identity, >1000 times faster than has been possible before. Linclust will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
Introduction
In metagenomics, DNA is sequenced directly from the environment, allowing us to study the vast majority of microbes that cannot be cultivated in vitro^{1}. During the last decade, the cost per unit of throughput of next-generation sequencing has dropped two-fold each year, twice as fast as computational costs. This enormous progress has resulted in hundreds of thousands of metagenomes and tens of billions of putative gene and protein sequences^{2,3}. Therefore, computing and storage costs are now dominating metagenomics^{4,5,6}. Clustering protein sequences predicted from sequencing reads or pre-assembled contigs can considerably reduce the redundancy of sequence sets and the costs of downstream analysis and storage.
CD-HIT and UCLUST^{7,8} are by far the most widely used tools for clustering and redundancy filtering of protein sequence sets (see ref. ^{9} for a review). Their goal is to find a representative set of sequences such that each of the input set sequences is represented well enough by one of the K representatives, where “well enough” is quantified by some similarity criteria.
Like most other fast sequence clustering tools, they use a fast prefilter to reduce the number of slow pairwise sequence alignments. An alignment is only computed if two sequences share a minimum number of identical k-mers (substrings of length k). If we denote by p_{match} the average probability that this happens by chance between two non-homologous input sequences, then the prefilter speeds up the sequence comparison by a factor of up to 1/p_{match} at the expense of some loss in sensitivity. This is usually unproblematic: if sequence matches are missed (false negatives) we create too many clusters, but we do not lose information. In contrast, false positives are costly as they can cause the loss of unique sequences from the representative set.
CD-HIT and UCLUST employ the following "greedy incremental clustering" approach: each of the N input sequences is compared with the representative sequences of already established clusters. When the sequence is similar enough to the representative sequence of one of the clusters, that is, the similarity criteria such as sequence identity are satisfied, the sequence is added to that cluster. Otherwise, the sequence becomes the representative of a new cluster. Because all sequences are compared with the cluster representatives, the runtimes of CD-HIT and UCLUST scale as O(NK), where K is the final number of clusters. In protein sequence clustering K is typically of similar size to N and therefore the total runtime scales almost quadratically with N. The fast sequence prefilters speed up each pairwise comparison by a large factor 1/p_{match} but cannot improve the time complexity of O(NK). This almost quadratic scaling results in impractical runtimes for a billion or more sequences.
Here we present the sequence clustering algorithm Linclust, whose runtime scales as O(N), independent of the number of clusters found. We demonstrate that it produces clusterings comparable in quality to those of other tools that are orders of magnitude slower and that it can cluster over a billion sequences within hours on a single server.
Results
Overview of Linclust
The Linclust algorithm is explained in Fig. 1 (for details see Methods and Fig. 5). As in previous methods, we reduce the number of pairwise comparisons by requiring the sequences to share at least one identical k-mer substring. A critical insight to achieve linear time complexity is that we need not align every sequence with every other sequence sharing a k-mer (see steps 3, 4). We reach similar sensitivities by selecting only a very small subset of sequences as "center sequences" (colored dots) and only aligning sequences to the center sequences with which they share a k-mer. Linclust thus requires fewer than mN sequence comparisons with a small constant m (default value 20), instead of the \(\sim NKp_{\mathrm{match}}\) comparisons needed by UCLUST, CD-HIT and other tools.
In most clustering tools, the main memory size severely limits the size of the datasets that can be clustered. UCLUST, for example, needs 10 bytes per residue of the representative sequences. Linclust needs m × 16 bytes per sequence, but before running it automatically checks available main memory and if necessary splits the table of mN lines into chunks such that each chunk fits into memory (Supplementary Fig. 1 and Methods). It then processes the chunks sequentially. In this way, Linclust can cluster sequence sets that would occupy many times its main memory size at almost no loss in speed.
Linclust and Linclust/MMseqs2 workflows
We integrated Linclust into our MMseqs2 (Many-versus-Many sequence searching) software package^{10}, and we test two versions of Linclust in our benchmark comparison: the bare Linclust algorithm described in Fig. 1 (simply named “Linclust”), and a combined four-step cascaded clustering workflow (“Linclust/MMseqs2”). In this workflow, a Linclust clustering step is followed by one (above 60% sequence identity) or three (≤60%) clustering steps, each of which clusters the representative sequences from the previous step by performing an increasingly sensitive all-against-all MMseqs2 sequence search followed by the greedy incremental clustering algorithm. We also include in our benchmark our original MMseqs clustering tool^{11}.
Runtime and clustering sensitivity benchmark
We measure clustering runtimes on seven sets: the 61,522,444 sequences of the UniProt database, randomly sampled subsets with relative sizes 1/16, 1/8, 1/4, 1/2, and UniProt plus all reversed sequences (123 million sequences). Each tool clustered these sets using a minimum pairwise sequence identity of 90%, 70% and 50%. Sequence identity was defined similarly for all three tools. The three tools use somewhat different strategies to try to ensure that only proteins with the same domain architecture are clustered together (see Methods: Clustering criteria).
At 50% identity, Linclust clusters the 123 million sequences 10 times faster than Linclust/MMseqs2 and, by extrapolation, 2300 times faster than UCLUST, 720 times faster than MMseqs, 4600 times faster than CD-HIT, 1600 times faster than DIAMOND^{12}, 69000 times faster than MASH^{13}, and 26000 times faster than RAPsearch2^{14} (Fig. 2a, b). At 90% identity, Linclust still clusters these sequences 570 times faster than MMseqs, 100 times faster than UCLUST, 62 times faster than CD-HIT, and 4.5 times faster than Linclust/MMseqs2.
At a sequence identity threshold of 90%, we determined how the runtimes scale with the input set size N by fitting a power law \(\left( {T\sim aN^b} \right)\) to the measured runtimes. Runtimes scale very roughly quadratically for UCLUST (N^{1.62}) and CD-HIT (N^{2.75}) whereas they grow only linearly for Linclust/MMseqs2 (N^{0.94}) and Linclust (N^{1.01}). The speedups due to Linclust’s Hamming distance stage and the ungapped alignment filter are analyzed in Supplementary Fig. 2.
To assess the clustering sensitivity, we compare the average size of clusters: a deeper clustering with more sequences per cluster implies a higher sensitivity to detect similar sequences. All three tools produce similar numbers of clusters at 90% and 70% sequence identity (Fig. 2c). Importantly, despite Linclust’s linear scaling of the runtime with input set size, it manifests no loss of sensitivity for growing dataset sizes. At 50%, Linclust produces 13% more clusters than UCLUST. But we can increase Linclust’s sensitivity simply by selecting more k-mers per sequence. By increasing m from 20 to 80, Linclust takes only 1.5 to 2 times longer but attains a sensitivity similar to UCLUST (pink in Fig. 2a–c, Supplementary Fig. 4).
To estimate the fraction of missed sequence pairs that could have been clustered together, we examined the distribution of sequence identities between representative cluster sequences (Fig. 3a–c). For each clustering run, we searched with BLAST^{15} a random sample of 1000 representative sequences against all representative sequences of the clustering. We show the cumulative distribution of sequence identities for the best matches that satisfy the minimum coverage threshold of 90% used in the clustering runs. This coverage threshold is favorable for UCLUST since its own coverage criterion is less strict (see Methods, “Clustering criteria”). Due to the heuristic prefiltering methods employed by all tools, none produces a perfect clustering. This limitation is seen most clearly at 50% sequence identity (Fig. 3c), for which Linclust/MMseqs2, UCLUST, Linclust with m = 80 and Linclust miss 2%, 10%, 16% and 28% of sequence pairs satisfying the clustering threshold, respectively.
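The scaling exponent b in such a fit can be estimated by ordinary least squares on log-transformed values. The following is a minimal sketch of that procedure, not the fitting code used for the benchmark; the function name `fit_power_law` and the example runtimes are illustrative assumptions and are not taken from the measurements.

```python
import math

def fit_power_law(sizes, runtimes):
    """Fit T ~ a * N**b by least squares on (log N, log T); return (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log-log space = log(a)
    return a, b

# Synthetic data (not benchmark results): the runtime doubles whenever N doubles,
# so the fitted exponent should be close to 1, i.e. linear scaling.
sizes = [1e6, 2e6, 4e6, 8e6]
runtimes = [10.0, 20.0, 40.0, 80.0]
a, b = fit_power_law(sizes, runtimes)
```

An exponent near 1 indicates linear scaling, whereas an exponent near 2 indicates quadratic scaling.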
Cluster consistency analysis
We measure the quality of the clusterings produced by the tools by analyzing the homogeneity of the functional annotation of the sequences in the clusters^{16}. We assess Gene Ontology (GO) annotations^{17} (Fig. 4a, b) and Pfam domain annotations^{18} (Fig. 4c) provided by the UniProt database. For each of these annotations, we averaged two score variants over all clusters, a “mean” and a “worst” score. The “mean” (“worst”) score for a cluster is the mean (minimum) annotation similarity score between the representative sequence and all other cluster members, as described in ref. ^{16}.
Overall, the consistencies of cluster annotations are similar for all tools, which is not surprising since they all use exact Smith-Waterman alignments and similar acceptance criteria (Supplementary Fig. 3, Methods). However, Linclust/MMseqs2 and Linclust clusterings have better consistencies than UCLUST and CD-HIT according to purely experimentally derived GO annotations (Fig. 4a) and according to Pfam domain annotations (Fig. 4c). This might be either due to a stricter minimum coverage criterion in Linclust or due to its slightly different definition of sequence similarity, which translates the sequence identity threshold into an approximately equivalent threshold for the similarity score of the local alignment divided by the maximum length of the two aligned segments (Methods: Clustering criteria). This similarity measure is more appropriate than sequence identity to cluster together sequences with conserved functions, as it also accounts for gaps and for the degree of similarity between aligned residues. The cluster consistencies of all tools are similar when GO annotations based on computational predictions are included (Fig. 4b).
Clustering 1.6 billion metagenomic sequences
As a demonstration of Linclust’s ability to cluster huge sets, we applied it to cluster 1.59 billion protein sequence fragments predicted by Prodigal^{19} in 2200 metagenomic and metatranscriptomic datasets^{3,20,21} downloaded mainly from the Joint Genome Institute. We clustered these sequences with a minimum sequence identity of 50% and a minimum coverage of the shorter sequence of 90% (Methods: Clustering criteria), producing 424 million clusters in 10 h on a 2 × 14-core server.
Our Metaclust database of 424 million representative sequences will improve the sensitivity of profile sequence searches by increasing the diversity of the underlying multiple sequence alignments. It will thereby raise the fraction of annotatable sequences in genomic and metagenomic datasets^{6,21}. It could also increase the number of protein families for which reliable structures can be predicted de novo, as shown by Ovchinnikov et al.^{22}, who used an unpublished dataset of 2 billion metagenomic sequences. Metaclust should also allow us to predict more accurately the effects of mutations on proteins^{23}.
Discussion
Clustering a set of N items is challenging when both N and the number of clusters K are large, due to the time complexity of existing clustering algorithms. Hierarchical agglomerative clustering approaches have a time complexity of O(N^{2}logN)^{24}; others with a predefined number of clusters, such as K-means or expectation maximization clustering, have complexity O(NK). When both N and K are in the tens of millions, traditional approaches are impracticably slow. Driven by the need to cluster huge datasets in the era of big data, most work has focused on reducing the proportionality constant.
One example is the widely used canopy clustering algorithm^{25}. The items are first preclustered into overlapping sets ("canopies") based on a fast approximate similarity measure. Canopies could be biological sequences sharing the same k-mer or documents sharing a keyword. Some traditional clustering algorithm is run on all items, but with the restriction that slow, exact similarities are only computed between items belonging to the same canopy. Similar to the k-mer prefilter used in CD-HIT, UCLUST, kClust, and MMseqs^{8,11,26,27}, the preclustering reduces the number of comparisons using the slow, exact measure by a large factor F, but the time complexity of the exact distance calculation O(N^{2}/F) is still quadratic in N. Linear-time clustering algorithms, using for instance hashing techniques, have been proposed^{28,29}. But like the preclustering step in canopy clustering or Linclust’s prefilter to find k-mer matches, these algorithms are only approximate. If falsely clustered pairs are costly (e.g., for redundancy filtering), pairwise links need to be verified with the exact similarity measure, which still requires quadratic time complexity. In contrast, Linclust’s linear time complexity of O(mN) includes verification of all edges between items using the exact distance measure.
Linclust can be trivially generalized to cluster any items for which a set of m keys per item can be defined such that (1) items belonging to a cluster are likely to share at least one of their keys and (2) items not belonging to a cluster are unlikely to share a key (see Methods, Optimal k-mer length). For clustering documents the keys could be all \(m = \binom{n}{k}\) subsets of size k of the n keywords, for example^{28}. To achieve a high sensitivity, we could select as center of the group of items sharing a key the member with the largest sum of sizes of groups it belongs to. In this way, the center items are able to pull together into the same cluster many items from different groups.
We perform the clustering in step 5 of Fig. 1 with the greedy incremental clustering, because it always chooses the longest sequence as the cluster representative. It ensures that the representative sequences, being the longest sequence in each cluster, are likely to contain all protein domains of all cluster members. Our rule in step 2 to choose the longest protein sequence per k-mer group as its center is well-suited to achieve large clusters, because the longest sequences tend to be selected as centers of most of the k-mer groups they belong to, and these long sequences therefore have edges to most sequences they share k-mers with.
As far as we know, Linclust is the only algorithm that could run on datasets of billions of items resulting in billions of clusters, overcoming the time and memory bottlenecks of existing clustering algorithms. Linclust could therefore be useful for many other applications. We have recently extended Linclust to nucleotide sequences. We are also working on a version to cluster D-dimensional vectors, which could be used, for instance, for metagenomic binning to cluster contigs by their coverage profiles across D metagenomic samples^{30}.
In summary, we hope the Linclust algorithm will prove helpful to exploit the tremendous value in publicly available metagenomic and metatranscriptomic datasets. Linclust should lead to considerable savings in computing resources in current applications. Most importantly, it will enable previously infeasible large-scale analyses.
Methods
The Linclust algorithm consists of the following steps (Figs 1 and 5):
Step 1: Generating the table of k-mers
We transform the sequence set into a reduced alphabet of 13 letters to increase the number of k-mer matches and hence the k-mer sensitivity at a moderate reduction in selectivity (see subsection “Reduced amino acid alphabet”). The k-mer length is chosen as described in subsection “Optimal k-mer length” and is typically between 10 and 14.
For each sequence, we extract m k-mers, as described in “Selection of k-mers”. Increasing m from its default value of 20 (option --kmer-per-seq) increases the sensitivity at the cost of a moderate decrease in speed (Supplementary Fig. 4). We store each extracted k-mer index (8 bytes), the sequence identifier (4 bytes), its length (2 bytes), and its position j in the sequence (2 bytes) in a table with mN lines. Therefore, Linclust has a memory footprint of mN × 16 bytes.
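The 16-byte table line and the resulting memory footprint can be checked with a short sketch. The field order, the little-endian packing and the names `RECORD` and `table_bytes` are assumptions made for illustration; only the field sizes and the mN × 16 bytes total come from the text.

```python
import struct

# One table line: k-mer index (8 bytes), sequence identifier (4 bytes),
# sequence length (2 bytes), k-mer position j in the sequence (2 bytes).
# "<" requests standard sizes without padding; the field order is an assumption.
RECORD = struct.Struct("<QIHH")

def table_bytes(num_sequences, kmers_per_sequence=20):
    """Memory needed for the full k-mer table of m*N lines, in bytes."""
    return num_sequences * kmers_per_sequence * RECORD.size

# Pack one illustrative line (values are made up).
line = RECORD.pack(123456789, 42, 350, 17)

# Footprint for the 1,595,926,152 metagenomic proteins with the default m = 20.
footprint = table_bytes(1_595_926_152)
```

For the metagenomic set this gives about 511 GB, which fits into the 762 GB of RAM of the server used for that run.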
Step 2: Finding exact k-mer matches
We sort this table by the k-mer index using the in-place sort from the OpenMP template library (http://freecode.com/projects/omptl). The sorting has a quasi-linear time complexity of O(mNlog(mN)) and typically takes less than 10% of the total runtime. The sorting groups together sequences into blocks of lines that contain the same k-mer. For each such k-mer group we select the longest sequence as its center sequence. We overwrite the position j with the diagonal i−j of the k-mer match with the center sequence, where i is the position of the group’s k-mer in the center sequence. We further overwrite the k-mer index by the center sequence identifier and re-sort the mN lines of the table by the center sequence identifier. The results file of the k-mer match stage has one entry for each center sequence identifier, containing the list of identifiers of sequences that share a k-mer with the center sequence. If a sequence shares multiple k-mer matches with a center sequence, we keep only the entry with the lowest diagonal i−j.
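The grouping logic of this step can be sketched in a few lines. This is an illustrative, in-memory simplification: it uses a dictionary instead of the second in-place sort, and the tie-breaking rule for equally long sequences (smallest identifier wins) is an assumption not stated in the text. The function name `kmer_matches` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def kmer_matches(table):
    """table: list of (kmer_index, seq_id, seq_len, pos) tuples.

    Returns {center_id: {member_id: diagonal}}, keeping for each
    center/member pair only the lowest diagonal i - j.
    """
    matches = {}
    # Sorting by k-mer index brings all lines with the same k-mer together.
    for _, group in groupby(sorted(table, key=itemgetter(0)), key=itemgetter(0)):
        group = list(group)
        # The longest sequence of the group becomes its center
        # (ties broken by smallest identifier: an assumption).
        _, center_id, _, i = max(group, key=lambda r: (r[2], -r[1]))
        for _, seq_id, _, j in group:
            if seq_id == center_id:
                continue
            diagonal = i - j
            best = matches.setdefault(center_id, {})
            if seq_id not in best or diagonal < best[seq_id]:
                best[seq_id] = diagonal
    return matches

# Toy table: three sequences (ids 0, 1, 2 with lengths 100, 80, 60).
table = [
    (7, 0, 100, 5), (7, 1, 80, 2), (7, 2, 60, 1),
    (9, 1, 80, 10), (9, 2, 60, 4),
    (3, 0, 100, 50),
]
result = kmer_matches(table)
```

In the toy table, sequence 0 is the center of the group of k-mer 7 and sequence 1 is the center of the group of k-mer 9, so sequence 2 is linked to two different centers.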
Step 3a: Hamming distance preclustering
For each k-mer group we compute the Hamming distance (the number of mismatches) in the full amino acid alphabet between the center sequence and each sequence in the group along the stored diagonals i−j. This operation is fast as it needs no random memory or cache access and uses AVX2/SSE4.1 vector instructions. Members that already satisfy the specified sequence identity and coverage thresholds on the entire diagonal are removed from the results passed to step 3b and are added to the cluster of their center sequence after step 5.
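A scalar sketch of the diagonal comparison is given below; the actual implementation is vectorized. The way identity and coverage are derived from the diagonal (matches divided by the overlap length, and overlap length divided by the member length) is an assumption made for illustration, as are the function names.

```python
def hamming_on_diagonal(center, member, diagonal):
    """Count mismatches between center[j + diagonal] and member[j].

    diagonal = i - j, with i a position in the center and j in the member.
    Returns (mismatches, overlap_length).
    """
    start = max(0, -diagonal)
    end = min(len(member), len(center) - diagonal)
    mismatches = sum(
        1 for j in range(start, end) if member[j] != center[j + diagonal]
    )
    return mismatches, max(0, end - start)

def passes_on_diagonal(center, member, diagonal, min_seq_id, min_cov):
    """Check identity and coverage thresholds on the entire diagonal.

    Assumed definitions: identity = matches / overlap length,
    coverage = overlap length / member length.
    """
    mismatches, overlap = hamming_on_diagonal(center, member, diagonal)
    if overlap == 0:
        return False
    identity = 1 - mismatches / overlap
    coverage = overlap / len(member)
    return identity >= min_seq_id and coverage >= min_cov

# Toy example: the member matches the center from position 2 on, with one mismatch.
center = "MKTAYIAKQRQISFVK"
member = "TAYIAKQRQLSF"
distance = hamming_on_diagonal(center, member, 2)
```

With one mismatch over 12 aligned positions the identity is about 92%, so this member would be accepted at a 90% threshold without any further alignment.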
Step 3b: Ungapped alignment filtering
For each k-mer group we compute the optimal ungapped, local alignment between the center sequence and each sequence in the group along the stored diagonals i−j, using one-dimensional dynamic programming with the Blosum62 matrix. We filter out matches between center and member sequences if the ungapped alignment score divided by the length of the diagonal is very low. We set a conservative threshold, such that the false negative rate is 1%, i.e., only 1% of the alignments below this threshold would satisfy the two criteria, sequence identity and coverage. For each combination on a grid {50, 55, 60,…,100}⊗{0, 10, 20,…,100}, we determined these thresholds empirically on 4 million local alignments sampled from an all-against-all comparison of the UniProt database^{31}.
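The one-dimensional dynamic programming along a diagonal amounts to finding the maximum-scoring contiguous segment. The sketch below illustrates this with a toy scoring function (+2 for a match, −1 for a mismatch) standing in for the Blosum62 matrix; the scoring function and all names are assumptions for illustration only.

```python
def toy_score(a, b):
    """Stand-in for a Blosum62 lookup: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1

def best_ungapped_score(center, member, diagonal, score=toy_score):
    """Best local ungapped alignment score along diagonal i - j.

    Returns (best_score, diagonal_length).
    """
    start = max(0, -diagonal)
    end = min(len(member), len(center) - diagonal)
    best = running = 0
    for j in range(start, end):
        # Extend the current segment, or restart it if its score drops below zero.
        running = max(0, running + score(center[j + diagonal], member[j]))
        best = max(best, running)
    return best, max(0, end - start)

# Toy example: only the central WWWW block matches.
best, length = best_ungapped_score("AAAWWWWAAA", "CCCWWWWCCC", 0)
score_per_position = best / length
```

The ratio of the best score to the diagonal length is the quantity that would be compared with the empirically determined threshold.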
Step 4: Local gapped sequence alignment
Sequences that pass the ungapped alignment filter are aligned to their center sequence using the AVX2/SSE4.1-vectorized alignment module with amino acid compositional bias correction from MMseqs2^{10}, which builds on code from the SSW library^{32}. Sequences satisfying the specified sequence identity and coverage thresholds are linked by an edge. These edges (neighbor relationships) are written in the format used by MMseqs2 for clustering results.
Step 5: Greedy incremental clustering
This algorithm was already implemented for MMseqs^{11}. Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.
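The procedure described above can be sketched as follows. This is a minimal illustration rather than the MMseqs code; the function name, the dictionary-based input and the tie-breaking by identifier for equally long sequences are assumptions.

```python
def greedy_incremental_clustering(lengths, edges):
    """lengths: {seq_id: sequence length}; edges: iterable of (center, member).

    Returns {representative: [representative, member, ...]}.
    """
    neighbors = {seq_id: set() for seq_id in lengths}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)  # add the reverse edge
    # Longest sequences first (ties broken by identifier: an assumption).
    order = sorted(lengths, key=lambda s: (-lengths[s], s))
    assigned = set()
    clusters = {}
    for rep in order:
        if rep in assigned:
            continue  # already removed from the list as a member of another cluster
        members = [rep] + sorted(
            n for n in neighbors[rep] if n != rep and n not in assigned
        )
        assigned.update(members)
        clusters[rep] = members
    return clusters

# Toy example: C has edges to both A and B but joins A, the longer sequence.
lengths = {"A": 300, "B": 250, "C": 200, "D": 150, "E": 100}
edges = [("A", "C"), ("B", "C"), ("B", "D")]
clusters = greedy_incremental_clustering(lengths, edges)
```

Every sequence ends up in exactly one cluster, and each representative is the longest sequence of its cluster.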
Reduced amino acid alphabet
We iteratively constructed reduced alphabets starting from the full amino acid alphabet. At each step, we merged the two letters {a, b} \(\longrightarrow\) a′ = (a or b) that conserve the maximum mutual information, \({\mathrm{MI}} = \mathop {\sum}\nolimits_{x,y = 1}^A p\left( {x,y} \right){\mathrm{log}}_2\left( {p\left( {x,y} \right)/p\left( x \right)/p\left( y \right)} \right)\). Here A is the new alphabet size, p(x) is the probability of observing letter x at any given position, and p(x, y) is the probability of observing x and y aligned to each other. These probabilities are extracted from the Blosum62 matrix. When a and b are merged into a′, for example, p(a′) = p(a) + p(b) and p(a′,y) = p(a, y) + p(b, y). The default alphabet with A = 13, which performed well over all tested clustering sequence identities from 50% to 100%, merges (L, M), (I, V), (K, R), (E, Q), (A, S, T), (N, D), and (F, Y).
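Applying the default reduced alphabet to a sequence is a simple letter substitution, sketched below. The choice of the first letter of each group as its representative is arbitrary. The listed merges leave 12 classes for the 20 standard amino acids; this sketch assumes that the thirteenth letter is the unknown residue X, which the text does not state explicitly.

```python
# Merged groups of the default alphabet, as listed in the text.
MERGED_GROUPS = ["LM", "IV", "KR", "EQ", "AST", "ND", "FY"]
STANDARD_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"

def build_reduction(groups=MERGED_GROUPS, residues=STANDARD_RESIDUES):
    """Map every residue to the representative (first letter) of its group."""
    mapping = {r: r for r in residues}
    for group in groups:
        for residue in group:
            mapping[residue] = group[0]
    return mapping

REDUCE = build_reduction()

def reduce_sequence(seq):
    """Translate a sequence into the reduced alphabet.

    Non-standard residues are mapped to 'X' (an assumption of this sketch).
    """
    return "".join(REDUCE.get(residue, "X") for residue in seq)

reduced = reduce_sequence("MKVLAT")
num_classes = len(set(REDUCE.values()))
```

Two sequences that differ only by substitutions within a group, such as L/M or I/V, become identical after reduction and therefore share all their k-mers.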
Optimal k-mer length
For optimal results and efficiency, the majority of the sequences in k-mer groups should be homologous to their center sequence. In other words, the k-mers have to be specific enough for the size of the database, with larger databases requiring larger k. To automatically set a good value of k, a very conservative condition is to limit to 1 the expectation value E_{FP} of the number of sequences per k-mer group that are not homologous to their center sequence. E_{FP} is equal to the number mN of k-mers selected in the entire sequence set times the probability p_{match} for one of those k-mers to match the k-mer of the k-mer group by chance. If the k-mers were not preselected by their hash function values, this probability would be approximately \(1/A_{\mathrm{eff}}^k\), where \(1/A_{\mathrm{eff}} = \sum_{a = 1}^A p_a^2\) is the probability for two letters from the reduced alphabet of size A to match by chance (1/8.7 for A = 13) and p_{a} is the frequency of letter a in the database. Due to the preselection, only a fraction ~m/L of the entire set of k-mers is used, where L is the average sequence length. Therefore, the probability of two selected k-mers to match by chance is \(L/\left( {mA_{\mathrm{eff}}^k} \right)\). The condition for the k-mer specificity is \(1 \ge E_{\mathrm{FP}} = mNL/\left( {mA_{\mathrm{eff}}^k} \right) = NL/A_{\mathrm{eff}}^k\), and hence we demand \(k \ge \left\lfloor {\log\left( {NL} \right)/\log\left( {A_{\mathrm{eff}}} \right)} \right\rfloor =: k_{\mathrm{spec}}\). In Linclust, we set k = max{k_{spec}, k_{seqid}}, with k_{seqid} = 14 for a sequence identity clustering threshold ≥90% and k_{seqid} = 10 otherwise. The larger k_{seqid} ensures slightly higher efficiency for high sequence identities, for which longer k-mers are sufficiently sensitive.
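The rule for choosing k can be written down directly from the formulas above. In this sketch the function name `choose_k` is hypothetical, and the average sequence length of 150 residues used in the example is an illustrative assumption, not a value from the text.

```python
import math

def choose_k(num_sequences, avg_length, min_seq_id, a_eff=8.7):
    """k = max(k_spec, k_seqid), with k_spec = floor(log(N*L) / log(A_eff))."""
    k_spec = math.floor(math.log(num_sequences * avg_length) / math.log(a_eff))
    k_seqid = 14 if min_seq_id >= 0.9 else 10
    return max(k_spec, k_seqid)

# Example: 1.6 billion sequences with an assumed average length of 150 residues.
k_at_50 = choose_k(1_600_000_000, 150, 0.5)
k_at_90 = choose_k(1_600_000_000, 150, 0.9)
```

For a set of this size, k_spec exceeds the lower bound of 10 at low sequence identity thresholds, whereas at 90% identity the fixed value k_seqid = 14 dominates.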
Selection of k-mers
To be able to cluster two sequences together we need to find a k-mer in the reduced alphabet that occurs in both. Because we extract only a small fraction of k-mers from each sequence, we need to avoid picking different k-mers in each sequence. Our first criterion for k-mer selection is therefore to extract k-mers such that the same k-mers tend to be extracted from homologous sequences. Second, we need to avoid positional clustering of selected k-mers in order to be sensitive to detect local homologies in every region of a sequence. Third, we would like to extract k-mers that tend to be conserved between homologous sequences. Note that we cannot simply store a subset of A^{k}m/L k-mers to be selected due to its sheer size.
We can satisfy the first two criteria by computing hash values for all k-mers in a sequence and selecting the m k-mers that obtain the lowest hash values. Since appropriate hash functions can produce values that are not correlated in any simple way with their argument, this method should randomly select k-mers from the sequences such that the same k-mers always tend to get selected in all sequences. We developed a simple 16-bit rolling hash function with good mixing properties, which we can compute very efficiently using the hash value of the previous k-mer (Supplementary Fig. 5).
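The selection of the m k-mers with the lowest hash values can be illustrated with a general-purpose hash. Note that this sketch substitutes BLAKE2b from the Python standard library for the 16-bit rolling hash described above, so it shows the selection principle but not the authors' hash function; the function names are hypothetical.

```python
import hashlib

def kmer_hash(kmer):
    """64-bit hash of a k-mer (BLAKE2b stand-in, not the Linclust rolling hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def select_kmers(seq, k, m):
    """Return up to m (position, k-mer) pairs with the lowest hash values."""
    hashed = [
        (kmer_hash(seq[j:j + k]), j, seq[j:j + k])
        for j in range(len(seq) - k + 1)
    ]
    hashed.sort()
    return [(j, kmer) for _, j, kmer in hashed[:m]]

# Toy example: pick 3 of the 18 overlapping 5-mers of a short sequence.
seq = "MKTAYIAKQRQISFVKSHFSRQ"
selected = select_kmers(seq, 5, 3)
```

Because the hash value depends only on the k-mer itself, a k-mer with a very low hash value is selected in every sequence that contains it, which is what makes shared k-mers likely to be found between homologous sequences.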
In view of the third criterion, we experimented with combining the hash value with a k-mer conservation score \(S_{\mathrm{cons}}\left( {x_{1:k}} \right) = \sum_{i = 1}^k S\left( {x_i,x_i} \right)/k\). This score ranks k-mers x_{1:k} by the conservation of their amino acids, according to the diagonal elements of the Blosum62 substitution matrix S(⋅, ⋅). We scaled the hash function with a rectified version of the conservation score: hash-value(x_{1:k})/max{1, S_{cons}(x_{1:k}) − S_{offset}}. Despite its intuitive appeal, we did not succeed in obtaining significant improvements and reverted to the simple hash function.
Clustering datasets that do not fit into main memory
Linclust needs m × 16 bytes of memory per sequence. If the computer’s main memory is too small, Linclust automatically splits the k-mer array into C equal-sized chunks, each small enough to fit into main memory (Supplementary Fig. 1). For each chunk index c ∈ {0,…,C − 1} we run Linclust steps 1 and 2 (Fig. 5) normally but extract only k-mers whose numerical index modulo C yields the remainder c. This way each of the C runs builds up a k-mer table with only about mN/C lines instead of mN, and hence each run needs C times less memory. Each run writes out a file with all found k-mer groups, and afterwards all C files are merged into a single file such that k-mer groups are sorted by ascending center IDs. Finally, Linclust steps 3 to 5 are performed as usual.
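The splitting scheme can be sketched as follows. The way the number of chunks is derived here (smallest C such that mN × 16 / C bytes fit into the given memory) is a simplifying assumption that ignores any other memory overhead, and the function names are hypothetical.

```python
BYTES_PER_LINE = 16  # size of one k-mer table line, from the text

def num_chunks(num_sequences, kmers_per_sequence, memory_bytes):
    """Smallest C such that a table of about m*N/C lines fits into memory."""
    needed = num_sequences * kmers_per_sequence * BYTES_PER_LINE
    return max(1, -(-needed // memory_bytes))  # integer ceiling division

def kmers_for_chunk(kmer_indices, chunk, total_chunks):
    """Keep only the k-mers whose index modulo C equals the chunk index c."""
    return [x for x in kmer_indices if x % total_chunks == chunk]

# Example: the 1,595,926,152-sequence set on a hypothetical 128 GB machine.
C = num_chunks(1_595_926_152, 20, 128 * 10**9)

# Every k-mer index lands in exactly one chunk, so no k-mer group is split.
indices = [3, 8, 15, 16, 23, 42, 42, 77]
chunks = [kmers_for_chunk(indices, c, C) for c in range(C)]
```

Since all occurrences of a k-mer have the same index, they always fall into the same chunk, which is why each k-mer group can be found within a single run.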
Parallelization and supported platforms
We used OpenMP to parallelize all stages except the fast step 5, and SIMD instructions to parallelize step 3 and step 4. Linclust supports Linux, Windows and Mac OS X on CPUs with AVX2 or SSE4.1 instructions.
Clustering criteria
Linclust/MMseqs2 and Linclust have three main criteria to link two sequences by an edge: (1) a maximum E-value threshold (option -e [0, ∞[) computed according to the gap-corrected Karlin-Altschul statistics using the ALP library^{33}; (2) a minimum coverage (option -c [0, 1]), which is defined as the number of aligned residue pairs divided by either the maximum of the lengths of the query/center and target/non-center sequences (default mode, --cov-mode 0), or by the length of the target/non-center sequence (--cov-mode 1), or by the length of the query/center sequence (--cov-mode 2); (3) a minimum sequence identity (--min-seq-id [0, 1]), which with option --alignment-mode 3 is defined as the number of identical aligned residues divided by the number of aligned columns including internal gap columns, or, by default, by a highly correlated measure, the equivalent similarity score of the local alignment (including gap penalties) divided by the maximum of the lengths of the two locally aligned sequence segments. The score per residue equivalent to a certain sequence identity is obtained by a linear regression using thousands of local alignments as training set (Fig. S2 in ref. ^{27}).
The sequence identity in UCLUST is defined as the number of identical residues in the pairwise global alignment divided by the number of aligned columns including internal gaps. Due to the global alignment, no explicit coverage threshold is needed. CD-HIT defines sequence identity as the number of identical residues in the local alignment divided by the length of the shorter sequence. Therefore, sequence coverage of the shorter sequence must be at least as large as the sequence identity threshold.
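The three coverage definitions and the column-based sequence identity can be written as two small functions. This is an illustrative sketch of the definitions given above, not code from MMseqs2; the function and parameter names are hypothetical.

```python
def coverage(aligned_pairs, query_len, target_len, cov_mode=0):
    """Coverage as defined for the three coverage modes described in the text."""
    if cov_mode == 0:
        denominator = max(query_len, target_len)  # longer of the two sequences
    elif cov_mode == 1:
        denominator = target_len                  # target/non-center sequence
    elif cov_mode == 2:
        denominator = query_len                   # query/center sequence
    else:
        raise ValueError("cov_mode must be 0, 1 or 2")
    return aligned_pairs / denominator

def sequence_identity(identical_residues, aligned_columns):
    """Identical aligned residues divided by aligned columns, gaps included."""
    return identical_residues / aligned_columns

# Example: a 200-residue center and a 100-residue member with 90 aligned pairs.
cov_default = coverage(90, 200, 100, cov_mode=0)
cov_target = coverage(90, 200, 100, cov_mode=1)
cov_query = coverage(90, 200, 100, cov_mode=2)
```

In this example only the target-based mode reaches a 90% coverage threshold, which illustrates why that mode is the appropriate choice when coverage of the shorter sequence is required.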
Tools and options for benchmark comparison
Linclust and Linclust/MMseqs2 (commit 5e21868) used the commands mmseqs linclust --cov-mode 1 -c 0.9 --min-seq-id 0.9 and mmseqs cluster --cov-mode 1 -c 0.9 --min-seq-id 0.9 for 90%, respectively, and --min-seq-id 0.7 or --min-seq-id 0.5 for 70% and 50%. The minimum coverage of 90% of the shorter sequence was chosen to enforce global similarity, similar to UCLUST and CD-HIT. CD-HIT 4.6 was run with the parameters -T 16 -M 0 and -n 5 -c 0.9, -n 4 -c 0.7, and -n 3 -c 0.5 for 90%, 70%, and 50%, respectively. UCLUST (7.0.1090) was run with -id 0.9, 0.7, 0.5; for RAPsearch2 (2.23) we used -z 16, for DIAMOND (v0.8.36.98) option --id 0.5, and for MASH (v2.0) -s 20 -a -i -p 16. Runtimes were measured with the Linux time command.
Functional consistency benchmark
We evaluated the functional cluster consistency based on Gene Ontology (GO) annotations of the UniProt knowledge base. We carried out three tests, based on (1) experimentally validated GO annotations, (2) general functional GO annotations (mostly inferred from homologous proteins) and (3) Pfam annotations. The UniProt 2016_03 release was clustered by each tool at the 90%, 70% and 50% sequence identity levels and then evaluated. For CD-HIT we computed only the clustering at 90% sequence identity because of runtime constraints. For each cluster, we computed the ‘worst’ and ‘mean’ cluster consistency scores, as described earlier^{16}. These cluster consistency scores are defined respectively as the minimum and the mean of all pairwise annotation similarity scores between the cluster’s representative sequence and the other sequences in the cluster.
GO annotations often annotate the whole sequence. We used the Pfam annotations of UniProt to check the local consistency of clusters (Fig. 4c). We compared the Pfam domain annotation of the representative sequence against all cluster members. If the member had the exact same domain annotation as the representative sequence we counted it as correct (value = 1) and otherwise as false (value = 0).
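The ‘worst’ and ‘mean’ scores for the Pfam test can be sketched as below. Representing a domain annotation as a tuple of Pfam accessions is an assumption of this sketch, and the accessions in the example are placeholders chosen for illustration; the function names are hypothetical.

```python
def pfam_scores(representative_domains, member_domains):
    """1 if a member has exactly the representative's annotation, else 0."""
    return [1 if d == representative_domains else 0 for d in member_domains]

def cluster_consistency(scores):
    """Return the ('worst', 'mean') consistency scores of one cluster."""
    return min(scores), sum(scores) / len(scores)

# Toy cluster: the second member carries an additional domain.
representative = ("PF00069",)
members = [("PF00069",), ("PF00069", "PF07714"), ("PF00069",)]
scores = pfam_scores(representative, members)
worst, mean = cluster_consistency(scores)
```

The same `cluster_consistency` function applies to the GO tests if the 0/1 values are replaced by the pairwise annotation similarity scores between the representative and each member.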
Clustering
We downloaded ~1800 metagenomic and ~400 metatranscriptomic datasets with assembled contigs from the Joint Genome Institute’s IMG/M archive^{3} and NCBI’s Sequence Read Archive^{20} (ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux) using the script metadownload.sh from https://bitbucket.org/martin_steinegger/linclustanalysis. We predicted genes and protein sequences using Prodigal^{19}, resulting in 1,595,926,152 proteins.
We clustered the 1.59 billion sequence fragments with Linclust using the following acceptance criteria: (1) the minimum sequence identity is 50%, using the score-per-column similarity measure described in Clustering criteria, (2) the shorter of the two sequences has at least 90% of its residues aligned, and (3) the maximum E-value is 10^{−3} (default) (Linclust options: --min-seq-id 0.5 --cov-mode 1 -c 0.9 --cluster-mode 2). The clustering step found 424 million clusters within 10 h on a server with two 14-core Intel Xeon E5-2680 v4 CPUs (2.4 GHz) and 762 GB RAM.
Metaclust protein sequence sets
The Metaclust database is available as a FASTA-formatted file at https://metaclust.mmseqs.org/.
Code availability
Linclust has been integrated into our free GPLv3-licensed MMseqs2 software suite^{10}. The source code and binaries for Linclust can be downloaded at https://github.com/soedinglab/mmseqs2.
Data availability
The Metaclust dataset generated during the current study is available at https://metaclust.mmseqs.org. The Linclust source code is available at https://mmseqs.org. All scripts and benchmark data including commandline parameters necessary to reproduce the benchmark and analysis results presented here are available at https://bitbucket.org/martin_steinegger/linclustanalysis.
References
Rappe, M. S. & Giovannoni, S. J. The uncultured microbial majority. Ann. Rev. Microbiol. 57, 369–394 (2003).
Wilke, A. et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res. 44, D590–D594 (2016).
Markowitz, V. M. et al. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42, D568–D573 (2014).
Scholz, M. B., Lo, C.-C. & Chain, P. S. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr. Opin. Biotechnol. 23, 9–15 (2012).
Desai, N., Antonopoulos, D., Gilbert, J. A., Glass, E. M. & Meyer, F. From genomics to metagenomics. Curr. Opin. Biotechnol. 23, 72–76 (2012).
Prakash, T. & Taylor, T. D. Functional assignment of metagenomic data: challenges and applications. Brief. Bioinform. 13, 711–727 (2012).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
Li, W., Fu, L., Niu, B., Wu, S. & Wooley, J. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinform. 13, 656–668 (2012).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Hauser, M., Steinegger, M. & Söding, J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics 32, 1323–1330 (2016).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using diamond. Nat. Methods 12, 59–60 (2015).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132 (2016).
Zhao, Y., Tang, H. & Ye, Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 28, 125–126 (2012).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
Gene Ontology Consortium. Gene ontology consortium: going forward. Nucleic Acids Res. 43, D1049–D1056 (2015).
Finn, R. D. et al. The pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119 (2010).
Kodama, Y., Shumway, M. & Leinonen, R. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012).
Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
Ovchinnikov, S. et al. Protein structure determination using metagenome sequence data. Science 355, 294–298 (2017).
Hopf, T. A. et al. Mutation effects predicted from sequence covariation. Nat. Biotechnol. 35, 128–135 (2017).
Day, W. H. & Edelsbrunner, H. Efficient algorithms for agglomerative hierarchical clustering methods. J. Classif. 1, 7–24 (1984).
McCallum, A., Nigam, K. & Ungar, L. H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 169–178 (ACM, 2000).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Hauser, M., Mayer, C. & Söding, J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinforma. 14, 248 (2013).
Marshall, J. A. & Rafsky, L. C. Exact clustering in linear time. Preprint at https://arxiv.org/abs/1702.05425 (2017).
Wang, J., Shen, H. T., Song, J. & Ji, J. Hashing for similarity search: a survey. Preprint at https://arxiv.org/abs/1408.2927 (2014).
Albertsen, M. et al. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).
Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33(Suppl. 1), D154–D159 (2005).
Zhao, M., Lee, W.-P., Garrison, E. P. & Marth, G. T. SSW Library: an SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS ONE 8, e82138 (2013).
Sheetlin, S., Park, Y., Frith, M. C. & Spouge, J. L. ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 32, 304–305 (2015).
Acknowledgements
We are grateful to Cedric Notredame and Chaok Seok for hosting MS at the CRG in Barcelona and at Seoul National University for 12 and 30 months, respectively. Thanks to Milot Mirdita and Clovis Galiez for discussions and to all who contributed metagenomic datasets used to build Metaclust, in particular the US Department of Energy Joint Genome Institute http://www.jgi.doe.gov/ and their user community. This work was supported by the EU’s Horizon 2020 Framework Programme (VirusX, grant 685778).
Author information
Contributions
M.S. performed the research and programming; M.S. and J.S. jointly designed the research and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Steinegger, M., Söding, J. Clustering huge protein sequence sets in linear time. Nat Commun 9, 2542 (2018). https://doi.org/10.1038/s41467-018-04964-5
DOI: https://doi.org/10.1038/s41467-018-04964-5