Abstract
Shotgun sequencing enables the reconstruction of genomes from complex microbial communities, but because assembly does not reconstruct entire genomes, it is necessary to bin genome fragments. Here we present CONCOCT, a new algorithm that combines sequence composition and coverage across multiple samples to automatically cluster contigs into genomes. We demonstrate high recall and precision on artificial as well as real human gut metagenome data sets.
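The idea of combining composition and coverage can be illustrated with a minimal sketch that builds one joint feature vector per contig. This is not the CONCOCT implementation: the function names, the choice of k, the pseudocounts and the log transform are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Normalised k-mer frequencies in a fixed order, with a pseudocount of 1."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing characters other than A, C, G, T are never looked up.
    raw = [counts[kmer] + 1 for kmer in kmers]
    total = sum(raw)
    return [c / total for c in raw]


def coverage_profile(coverages, pseudo=1e-3):
    """Per-sample coverage normalised to sum to one (pseudocount is an assumption)."""
    shifted = [c + pseudo for c in coverages]
    total = sum(shifted)
    return [c / total for c in shifted]


def contig_features(seq, coverages, k=4):
    """Hypothetical joint vector: log composition followed by log coverage."""
    return [math.log(x) for x in kmer_profile(seq, k) + coverage_profile(coverages)]


# Toy contig observed in three samples; k=2 gives 16 composition features.
features = contig_features("ACGTACGTTTGACCA", [12.0, 0.5, 3.2], k=2)
print(len(features))  # prints 19 (16 composition + 3 coverage)
```

Vectors of this kind could then be passed to any clustering method, so that contigs with similar composition and similar coverage patterns across samples end up in the same bin.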
References
Tyson, G.W. et al. Nature 428, 37–43 (2004).
Herlemann, D.P. et al. MBio 4, e00569–e00512 (2013).
Sharon, I. & Banfield, J.F. Science 342, 1057–1058 (2013).
Kislyuk, A., Bhatnagar, S., Dushoff, J. & Weitz, J.S. BMC Bioinformatics 10, 316 (2009).
Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H. Front. Microbiol. 3, 410 (2012).
Chatterji, S., Yamazaki, I., Bai, Z. & Eisen, J.A. in Res. Comput. Mol. Biol. (eds. Vingron, M. & Wong, L.) 17–28 (Springer, 2008).
Kelley, D.R. & Salzberg, S.L. BMC Bioinformatics 11, 544 (2010).
Sharon, I. et al. Genome Res. 23, 111–120 (2013).
Albertsen, M. et al. Nat. Biotechnol. 31, 533–538 (2013).
Corduneanu, A. & Bishop, C.M. in Artif. Intell. Stat. 2001 (eds. Jaakkola, T. & Richardson, T.) 27–34 (Morgan Kaufmann, 2001).
Human Microbiome Project Consortium. Nature 486, 207–214 (2012).
Sandberg, R. et al. Genome Res. 11, 1404–1409 (2001).
Dick, G.J. et al. Genome Biol. 10, R85 (2009).
Loman, N.J. et al. J. Am. Med. Assoc. 309, 1502–1510 (2013).
Asahara, T. et al. Infect. Immun. 72, 2240–2247 (2004).
Pell, J. et al. Proc. Natl. Acad. Sci. USA 109, 13272–13277 (2012).
Boisvert, S., Raymond, F., Godzaridis, E., Laviolette, F. & Corbeil, J. Genome Biol. 13, R122 (2012).
Tatusov, R.L., Koonin, E.V. & Lipman, D.J. Science 278, 631–637 (1997).
Ciccarelli, F.D. et al. Science 311, 1283–1287 (2006).
Acknowledgements
This research arose out of a workshop funded through the COST project ES1103 and hosted by P. Fernandes at the Instituto Gulbenkian de Ciência. This work was funded by grants (to A.F.A.) from the Swedish Research Councils VR (grant 2011-5689), FORMAS (grant 2009-1174) and EC BONUS project BLUEPRINT. C.Q. is funded by an EPSRC Career Acceleration Fellowship—EP/H003851/1. M.S. is supported by Unilever R&D Port Sunlight, Bebington, UK. L.L. is supported by the Academy of Finland (grant 256950), N.L. by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics and J.Q. by the UK National Institute for Health Research (NIHR) Centre for Surgical Reconstruction and Microbiology. This paper presents independent research funded by the NIHR Surgical Reconstruction and Microbiology Research Centre (partnership between University Hospitals Birmingham National Health Service (NHS) Foundation Trust, the University of Birmingham and the Royal Centre for Defence Medicine). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
Author information
Contributions
C.Q. developed the core algorithm and cluster validation metrics and performed analyses. A.F.A. assisted with the analyses, developed the SCG validation and contributed to algorithm development. J.A. and B.S.B. developed the CONCOCT software pipeline and contributed to algorithm development. I.d.B. performed assemblies and mappings. M.S. generated simulation data. J.Q. performed E. coli mappings. U.Z.I. assisted with SCG validation and production of graphics. L.L. helped with graphics and algorithm design. N.J.L. performed E. coli analysis and contributed to algorithm development. All authors contributed to the writing of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Relative abundances in the synthetic mock communities.
A) Heat map of the relative abundances of the 101 genomes in the synthetic species mock community distributed across the 96 HMP samples (see Online Methods). B) Heat map of the relative abundances of the 20 genomes in the synthetic strain mock community distributed across the 64 HMP samples (see Online Methods). The samples have been positioned according to similarity. The relative abundances have been square root transformed to emphasise rare species and the inset scale should be interpreted accordingly.
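The effect of the square root transform on rare species can be seen in a small sketch. The read counts below are invented for illustration and the function name is hypothetical.

```python
import math


def sqrt_relative_abundance(counts):
    """Convert raw counts to relative abundances, then take the square root."""
    total = sum(counts)
    return [math.sqrt(c / total) for c in counts]


# Invented counts for three genomes in one sample: relative abundances 0.9, 0.09, 0.01.
raw = [900, 90, 10]
transformed = sqrt_relative_abundance(raw)
print([round(v, 3) for v in transformed])  # prints [0.949, 0.3, 0.1]
```

The most abundant genome is 90 times more abundant than the rarest, but after the transform the ratio shrinks to roughly 9.5, so rare genomes remain visible on the colour scale.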
Supplementary Figure 2 Two-dimensional visualisation of the synthetic species mock contigs labeled by species.
The 37,564 labelled synthetic species mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 101 different species discriminated.
Supplementary Figure 3 Two-dimensional visualisation of the synthetic strain mock contigs labeled by genome.
The 9,109 labelled synthetic strain community contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 20 different genomes discriminated.
Supplementary Figure 4 Two-dimensional visualisation of the synthetic species mock contigs labeled by cluster.
The 37,627 synthetic species mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 101 different clusters discriminated.
Supplementary Figure 5 Confusion matrix for the synthetic species mock contigs.
A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings for the optimal 101 cluster solution with the species assignments for the synthetic species mock contig fragments. Each column is a cluster named Dk where k is the cluster index. The rows correspond to the species and the intensities reflect the proportion of each cluster deriving from each species. The intensities are weighted by contig length.
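A length-weighted confusion matrix of the kind described in this caption can be sketched as follows. The input format, a list of (cluster, species, length) tuples, and the function name are assumptions; the toy contigs are invented.

```python
from collections import defaultdict


def weighted_confusion(contigs):
    """Return {cluster: {species: proportion of the cluster's base pairs}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for cluster, species, length in contigs:
        # Weight each contig by its length rather than counting it once.
        totals[cluster][species] += length
    proportions = {}
    for cluster, by_species in totals.items():
        cluster_bp = sum(by_species.values())
        proportions[cluster] = {sp: bp / cluster_bp for sp, bp in by_species.items()}
    return proportions


toy = [
    ("D1", "A", 3000),
    ("D1", "A", 1000),
    ("D1", "B", 1000),
    ("D2", "B", 2000),
]
matrix = weighted_confusion(toy)
print(matrix["D1"])  # prints {'A': 0.8, 'B': 0.2}
```

Each cluster's proportions sum to one, which matches the caption's description of intensities as the proportion of each cluster deriving from each species.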
Supplementary Figure 6 Impact of contig length on error probability.
Predicted error probabilities from a logistic regression of contig misassignment as a function of contig fragment length (p-value < 2.0e-16) for the synthetic species community. Each cluster was assigned to the species from which the majority of contigs, weighted by length, derived. A contig was defined as misassigned if it derived from a different species to its cluster. The solid line is the logistic regression predicted error probability, together with standard errors as grey shaded portions. The points are average error rates calculated over length bins of size 1000bp; their size is proportional to the log of the number of contigs in that bin.
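The majority assignment and the binned error rates behind the plotted points can be sketched as below. This covers only the binning step, not the logistic regression fit; the input format, function names and toy contigs are assumptions.

```python
from collections import defaultdict


def majority_species(contigs):
    """Assign each cluster to the species contributing the most base pairs."""
    bp = defaultdict(lambda: defaultdict(int))
    for cluster, species, length in contigs:
        bp[cluster][species] += length
    return {cluster: max(sp, key=sp.get) for cluster, sp in bp.items()}


def binned_error_rates(contigs, bin_size=1000):
    """Average misassignment rate per length bin, keyed by the bin's lower edge."""
    assigned = majority_species(contigs)
    errors = defaultdict(int)
    counts = defaultdict(int)
    for cluster, species, length in contigs:
        lower_edge = (length // bin_size) * bin_size
        counts[lower_edge] += 1
        # A contig is misassigned if its species differs from its cluster's species.
        errors[lower_edge] += int(species != assigned[cluster])
    return {edge: errors[edge] / counts[edge] for edge in sorted(counts)}


toy = [
    ("D1", "A", 5200), ("D1", "A", 1400), ("D1", "B", 1100),
    ("D2", "B", 4800), ("D2", "A", 1900),
]
rates = binned_error_rates(toy)
print(rates)
```

In this toy input, two of the three contigs in the 1000 to 1999 bp bin are misassigned while the longer contigs are all correct, the same qualitative pattern of higher error for short contigs that a regression on length would test.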
Supplementary Figure 7 Frequency of single-copy core genes in the synthetic species mock clusters.
A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 101 cluster solution generated by CONCOCT applied to the synthetic species mock community.
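A table of single-copy core gene (SCG) counts per cluster can be sketched as follows. The gene identifiers, the input format and the summary statistic are hypothetical and are not taken from the CONCOCT pipeline.

```python
from collections import Counter, defaultdict


def scg_counts(hits):
    """hits: iterable of (cluster, scg_id), one entry per gene hit."""
    table = defaultdict(Counter)
    for cluster, scg in hits:
        table[cluster][scg] += 1
    return table


def single_copy_fraction(counter, scg_ids):
    """Fraction of the SCG set that occurs exactly once in a cluster."""
    return sum(1 for gene in scg_ids if counter[gene] == 1) / len(scg_ids)


# Placeholder SCG names and invented hits for two clusters.
scg_ids = ["scg_a", "scg_b", "scg_c"]
hits = [
    ("D1", "scg_a"), ("D1", "scg_b"), ("D1", "scg_c"),
    ("D2", "scg_a"), ("D2", "scg_a"), ("D2", "scg_b"),
]
table = scg_counts(hits)
for cluster in sorted(table):
    print(cluster, single_copy_fraction(table[cluster], scg_ids))
```

A cluster in which most SCGs occur exactly once is consistent with a single, largely complete genome; duplicated SCGs suggest merged genomes and missing SCGs suggest an incomplete bin.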
Supplementary Figure 8 Two-dimensional visualisation of the synthetic strain mock contigs labeled by cluster.
The 9,411 synthetic strain mock contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 21 different clusters discriminated.
Supplementary Figure 9 Validation of the synthetic strain mock contig clusterings.
Top panel: a heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 21 cluster solution generated by CONCOCT applied to the synthetic strain mock community. Bottom panel: a heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings with the genome assignments for the synthetic strain mock contig fragments. Each row is a cluster and each column a species. Intensities reflect the proportion of each cluster, weighted by contig length, deriving from each species.
Supplementary Figure 10 Two-dimensional visualisation of the Sharon (2013) contigs labeled by cluster.
The 5,571 Sharon (2013) contig fragments of length > 1000bp plotted in the first two PCA dimensions with the 34 different clusters discriminated.
Supplementary Figure 11 Validation of the Sharon (2013) contig clusterings.
Top panel: A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 34 cluster solution generated by CONCOCT applied to the Sharon (2013) data. Only clusters with at least one SCG are shown. Bottom panel: A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings with the species assignments from TAXAassign. Each row is a cluster and the intensities reflect the proportion of each cluster deriving from each species. The intensities are weighted by contig length. Most clusters probably derive from the majority species assignment, except cluster 33, which is probably a novel Peptoniphilus species as determined by direct blasting against the NCBI, and cluster 11, which is an unknown Staphylococcus species.
Supplementary Figure 12 SCG plots for four different composition based clustering algorithms applied to the Sharon (2013) data.
These were A) CompostBin4, B) LikelyBin5, C) MetaWatt6, and D) SCIMM7.
Supplementary Figure 13 Confusion matrix for the Escherichia coli (STEC) O104:H4 outbreak contigs.
A heatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings for the optimal 297 cluster solution of the shiga-toxigenic Escherichia coli (STEC) O104:H4 outbreak. Each column is a cluster named Dk where k is the cluster index. The rows correspond to the species assignments and the intensities reflect the proportion of each cluster deriving from each species. Only clusters and taxa with greater than 50kb of labeled representatives are shown. The intensities are weighted by contig length.
Supplementary Figure 14 Frequency of single-copy core genes in the Escherichia coli (STEC) O104:H4 outbreak clusters.
A heatmap visualisation of the number of single-copy core genes in each cluster for the optimal 297 cluster solution of CONCOCT applied to the shiga-toxigenic Escherichia coli (STEC) O104:H4 outbreak. Only clusters with a total average coverage summed across samples of greater than 50.0 are shown.
Supplementary Figure 15 Mapping of contigs to the Escherichia coli (STEC) O104:H4 outbreak genome.
The mapping of contig fragments to the known Escherichia coli (STEC) O104:H4 outbreak genome with cluster discriminated by colour and total coverage across all samples shown on the y-axis.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–15, Supplementary Tables 1–8 and Supplementary Note (PDF 12624 kb)
Supplementary Software
CONCOCT version 0.3.3 software (ZIP 5301 kb)
About this article
Cite this article
Alneberg, J., Bjarnason, B., de Bruijn, I. et al. Binning metagenomic contigs by coverage and composition. Nat Methods 11, 1144–1146 (2014). https://doi.org/10.1038/nmeth.3103
DOI: https://doi.org/10.1038/nmeth.3103