Effective binning of metagenomic contigs using contrastive multi-view representation learning

Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulty of efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning method based on contrastive multi-view representation learning. COMEBin uses data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning. Experimental results on multiple simulated and real datasets demonstrate that COMEBin outperforms state-of-the-art binning methods, particularly in recovering near-complete genomes from real environmental samples. When integrated into metagenomic analysis pipelines, COMEBin also markedly outperforms other binning methods, including in the recovery of potentially pathogenic antibiotic-resistant bacteria (PARB) and of moderate or higher quality bins containing potential biosynthetic gene clusters (BGCs).

Fig. S3 COMEBin recovers more known and unknown bins with >50% completeness and <5% contamination at the species level. "Known" genomes are bins that can be annotated at the species level using GTDB-Tk; all other bins are "unknown".

Fig. S4 Comparison of the number of bins with F1-score > 0.9 recovered by each binning algorithm. "Unique" denotes the unique strains (genomes with an average nucleotide identity (ANI) of less than 95% to any other genome) introduced in the benchmark paper [1], and "common" otherwise.

Fig. S5 Comparison of the number of bins with F1-score > 0.9 recovered by each binning algorithm on the Strain madness GSA dataset. "Unique" denotes the unique strains (genomes with an average nucleotide identity (ANI) of less than 95% to any other genome) introduced in the benchmark paper [1], and "common" otherwise.

The results annotated with an asterisk (*) represent the total runtimes or memory usage across all ten samples in VAMB's multi-sample mode. "BATS (average)" represents the average running time or memory usage across the ten BATS samples. We ran each tool on each dataset three times and reported the mean scores and the respective standard deviations.

"#hidden layers" denotes the number of hidden layers; "#hidden units" denotes the number of hidden units; "#sequencing samples" denotes the number of sequencing samples.

The term "Q20 (%)" represents the fraction of reads with an average quality > 20.

Algorithm S1 The contrastive learning training process of COMEBin
Input: Batch size N_bs; the number of views V; neural networks f_cov and f_combine; features of contigs X^(com) and X^(cov).
Output: …
1: for … do
2:     for i ∈ {1, 2, …, N_bs} do
3:         for all v ∈ {1, 2, …, V} do
4:             …
5:         end for
6:     end for
7:     Update network parameters of f_cov and f_combine to minimize L, where L is given in Equation 12 in the main text.
8: end for
9: return f_cov and f_combine

2 Supplementary Note

2.1 Estimating completeness and contamination of the bins
Similar to MetaBinner [2], we used CheckM1 [3] to analyze one binning result and identify the contigs containing single-copy genes of the bacterial or archaeal domains. We then reused this single-copy gene information with the scoring strategy provided by CheckM1 [3] to estimate the contamination and completeness of each bin in all the clustering results.
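As a rough illustration of how single-copy genes can yield both estimates, the sketch below scores one bin from per-contig marker hits. It assumes a simplified scheme in which every marker gene counts equally; CheckM1's own scoring groups markers into collocated sets and is not reproduced here. The function name, marker identifiers, and contig identifiers are all hypothetical.

```python
from collections import Counter


def estimate_bin_quality(bin_contigs, contig_markers, marker_set):
    """Return (completeness, contamination) in percent for one bin.

    Simplifying assumption: every marker gene counts equally. CheckM1
    weights markers through collocated marker sets, which is not modelled.
    """
    counts = Counter()
    for contig in bin_contigs:
        for marker in contig_markers.get(contig, []):
            if marker in marker_set:
                counts[marker] += 1

    n_markers = len(marker_set)
    # Completeness: share of expected single-copy markers seen at least once.
    present = sum(1 for m in marker_set if counts[m] >= 1)
    # Contamination: surplus copies of markers that should occur only once.
    extra_copies = sum(counts[m] - 1 for m in marker_set if counts[m] > 1)
    return 100.0 * present / n_markers, 100.0 * extra_copies / n_markers


# Toy data with hypothetical marker and contig names.
markers = {"m1", "m2", "m3", "m4"}
contig_markers = {"c1": ["m1", "m2"], "c2": ["m2", "m3"], "c3": []}
completeness, contamination = estimate_bin_quality(
    ["c1", "c2", "c3"], contig_markers, markers
)
```

Because the marker hits per contig are computed once, the same lookup table can be reused to score every bin of every clustering result, which is the point of analyzing only one binning result with CheckM1.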

2.2 The binning performance of COMEBin on long-read data
We conducted additional tests to evaluate COMEBin's performance on four long-read datasets. We included SemiBin2, SemiBin2 (long-read mode), MetaDecoder, and MetaBAT2 for comparison. Three of these datasets were previously used in SemiBin2's evaluation. Long-read assemblies were generated using Flye (version 2.9.2) with the options "--pacbio-hifi" and "--meta". More details about the long-read datasets can be found in Table S7. These datasets are publicly available in the National Genomics Data Center (NGDC) under the study accession PRJCA007414 (Runs: CRR344871 and CRR344872), in the ENA under the run accession SRR10963010, and in the NCBI under the run accession ERR9769275. Note that long-read sequencing typically produces highly contiguous assemblies, resulting in fewer contigs and smaller bins (measured by the number of contigs) [4]. According to the results shown in Supplementary Fig. S10, SemiBin2 (long-read mode) performs best, followed by COMEBin.

2.3 Comparison of variants of COMEBin using different clustering methods
We conducted experiments with different variants of COMEBin, replacing the Leiden-based clustering method with Infomap, as implemented in SemiBin1. Additionally, we employed k-means and weighted k-means for clustering, using the embeddings as features and determining the number of bins based on single-copy genes. In "weighted k-means", we assigned the weight of each contig based on its length. For Infomap, we used the same graphs converted from the embeddings as inputs and followed the same methodology for automatically selecting the final result as in COMEBin. The parameters used to generate the graphs were σ in Formula 13, with values of 0.05, 0.1, 0.15, 0.2, and 0.3, and the edge ratio (the proportion of edges kept for clustering), with values of 50%, 80%, and 100%. Our comparative analysis revealed that COMEBin outperforms its variants, as illustrated in Supplementary Fig. S13.
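To make the "weighted k-means" variant concrete, here is a minimal pure-Python sketch in which each contig pulls its centroid in proportion to its length. It is an illustration under stated assumptions, not the implementation used in the experiments: the number of clusters k is taken as given (in the text it is derived from single-copy genes), the initialisation with the k heaviest points is a choice made only to keep the example deterministic, and the function name and toy data are hypothetical.

```python
def weighted_kmeans(points, weights, k, n_iter=100):
    """Lloyd-style k-means in which each point contributes to its centroid
    in proportion to its weight (here: contig length)."""
    # Deterministic start for this sketch: the k heaviest points as centroids.
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    centroids = [list(points[i]) for i in order[:k]]
    dim = len(points[0])
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: length-weighted mean of the members of each cluster.
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                centroids[c] = [
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(dim)
                ]
    return labels, centroids


# Toy 2-D embeddings and contig lengths (hypothetical values).
embeddings = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
lengths = [1000, 3000, 2000, 2000]
labels, centroids = weighted_kmeans(embeddings, lengths, k=2)
```

In the toy data the first centroid settles closer to the 3000 bp contig than to the 1000 bp one, which is the intended effect of length weighting: long contigs, whose features are estimated from more sequence, dominate the cluster centres.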

Fig. S1 Comparison of binning methods on four simulated datasets based on the F1-score (bp), Adjusted Rand Index (bp), percentage of binned bp, and accuracy (bp) metrics. a, CAMI Gt dataset; b, CAMI Airways dataset; c, CAMI Skin dataset; and d, CAMI Mouse gut dataset.

Fig. S6 COMEBin outperforms other binners on real datasets in single- and multi-sample binning.

Fig. S7 Comparison of variants of COMEBin. We conducted experiments with different variants of COMEBin by replacing COMEBin embeddings with those from other methods. Subsequently, we applied the same clustering approach used in COMEBin for binning.

Fig. S8 Comparison of COMEBin with different numbers of views. The number of views counts the original contig together with the sequence fragments sampled from it for augmentation. A view count of six means that we randomly sampled five sequence fragments from each original contig, giving six views including the original contig. The default setting for COMEBin is six views.
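The relationship between the view count and the sampled fragments can be sketched as follows. This is only an illustration: the rule for choosing fragment lengths (uniform between an assumed minimum of 1000 bp and the full contig length), the function name, and the toy sequence are hypothetical and are not taken from the COMEBin implementation.

```python
import random


def make_views(contig_seq, n_views=6, min_len=1000, rng=None):
    """Return the original contig plus n_views - 1 random fragments of it.

    Assumption: fragment lengths are drawn uniformly between min_len and
    the contig length; the sampling rule used by COMEBin is not given here.
    """
    rng = rng or random.Random()
    shortest = min(min_len, len(contig_seq))
    views = [contig_seq]  # view 1 is always the original contig
    for _ in range(n_views - 1):
        frag_len = rng.randint(shortest, len(contig_seq))
        start = rng.randint(0, len(contig_seq) - frag_len)
        views.append(contig_seq[start:start + frag_len])
    return views


# Toy contig of 4000 bp; a seeded generator keeps the example reproducible.
contig = "ACGT" * 1000
views = make_views(contig, n_views=6, rng=random.Random(0))
```

With the default of six views, the call returns the original contig followed by five fragments, all of which are substrings of that contig.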

Fig. S9 Comparison of binning methods on two low-complexity datasets. Note that the default settings of VAMB are not applicable to the CAMI mouse gut (10-genome) dataset, as the dataset contains fewer than 4096 contigs.

Fig. S10 Comparison of binning methods on long-read sequencing datasets.

Fig. S12 Sequence length distribution for the real datasets.

Fig. S13 Comparison of variants of COMEBin using different clustering methods.

Table S1 Running time and memory usage for different datasets and binning modes.

Table S2 Sample information of the MetaHIT (10-sample) and Bermuda-Atlantic Time-series Study (BATS) samples.

Table S3 Hyper-parameters used by the network module in the experiments.

Table S4 Simulated datasets used in the experiments.

Table S5 Real datasets used in the experiments.

Table S7 Long-read sequencing datasets used for extended experiments.