Improved metagenome binning and assembly using deep variational autoencoders

Nissen, Jakob Nybo; Johansen, Joachim; Allesøe, Rosa Lundbye; Sønderby, Casper Kaae; Armenteros, Jose Juan Almagro; Grønbech, Christopher Heje; Jensen, Lars Juhl; Nielsen, Henrik Bjørn; Petersen, Thomas Nordahl; Winther, Ole; Rasmussen, Simon

doi:10.1038/s41587-020-00777-4

Letter
Published: 04 January 2021

Nature Biotechnology volume 39, pages 555–560 (2021)

Abstract

Despite recent advances in metagenomic binning, reconstruction of microbial species from metagenomics data remains challenging. Here we develop variational autoencoders for metagenomic binning (VAMB), a program that uses deep variational autoencoders to encode sequence coabundance and k-mer distribution information before clustering. We show that a variational autoencoder is able to integrate these two distinct data types without any previous knowledge of the datasets. VAMB outperforms existing state-of-the-art binners, reconstructing 29–98% and 45% more near-complete (NC) genomes on simulated and real data, respectively. Furthermore, VAMB is able to separate closely related strains up to 99.5% average nucleotide identity (ANI), and reconstructed 255 and 91 NC Bacteroides vulgatus and Bacteroides dorei sample-specific genomes as two distinct clusters from a dataset of 1,000 human gut microbiome samples. We use 2,606 NC bins from this dataset to show that species of the human gut microbiome have different geographical distribution patterns. VAMB can be run on standard hardware and is freely available at https://github.com/RasmussenLab/vamb.
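The two data types named in the abstract, sequence coabundance and k-mer distribution, can be illustrated with a minimal sketch of how a single contig might be turned into one numeric vector before encoding. This is not VAMB's actual preprocessing, which the abstract does not describe. The k-mer size of 4, the function names, the normalization and the example contig are all assumptions made for illustration.

```python
from itertools import product

K = 4  # assumed k-mer size (tetranucleotides); not stated in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(sequence):
    """Return the normalized k-mer frequency vector of one contig."""
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip windows containing N or other ambiguity codes
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def relative_abundance(depths):
    """Scale the per-sample read depths of one contig so they sum to 1."""
    total = sum(depths)
    return [d / total for d in depths] if total else [0.0] * len(depths)


def encoder_input(sequence, depths):
    """Concatenate coabundance and k-mer features into one input vector."""
    return relative_abundance(depths) + kmer_frequencies(sequence)


# Hypothetical contig with read depths in three samples
features = encoder_input("ACGTACGTNACGT", [10.0, 30.0, 60.0])
print(len(features))  # 3 abundance values + 256 k-mer frequencies
```

In a pipeline of the kind the abstract outlines, vectors like `features` would be passed to the autoencoder, and clustering would then operate on the learned latent representation rather than on these raw features.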

**Fig. 2: Performance of clustering different inputs.**

**Fig. 3: Phylogeny of bins across 1,000 human gut microbiome samples.**

Data availability

The sequence data used in this study are publicly available from either the respective studies or ENA. The semisynthetic MetaHIT dataset was downloaded from https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/ as the files depth.txt.gz and assembly-filtered.fa.gz. The simulated CAMI High and CAMI2 datasets were downloaded from https://data.cami-challenge.org/participate from ‘Toy Test Dataset High_Complexity’ and ‘2nd CAMI Toy Human Microbiome Project Dataset’, respectively. The de novo assemblies of the Almeida dataset were obtained through personal communication with A. Almeida and R. D. Finn, and the reads downloaded from ENA as specified in their publication. The data and results of binning the MetaHIT, CAMI2 and Almeida datasets, as well as the source data for Figs. 1–3, are available on figshare at https://figshare.com/projects/VAMB/72677. A CodeOcean capsule of VAMB v.3.0.1, including the six training and test datasets for reproducing benchmarking results, is available from https://doi.org/10.24433/CO.2518623.v1. Source data are provided with this paper.

Code availability

All code can be found on GitHub at https://github.com/RasmussenLab/vamb and is freely available under the permissive MIT license. All analyses were performed using VAMB v.3.0.1. Additionally, the code is available as a CodeOcean capsule at https://doi.org/10.24433/CO.2518623.v1.

References

Turaev, D. & Rattei, T. High definition for systems biology of microbial communities: metagenomics gets genome-centric and strain-resolved. Curr. Opin. Biotechnol. 39, 174–181 (2016).
Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J. & Segata, N. Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 35, 833–844 (2017).
Wang, J. & Jia, H. Metagenome-wide association studies: fine-mining the microbiome. Nat. Rev. Microbiol. 14, 508–522 (2016).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2014).
Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. Proc. Mach. Learn. Res. 32, 1278–1286 (2014).
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Teeling, H., Meyerdierks, A., Bauer, M., Amann, R. & Glöckner, F. O. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ. Microbiol. 6, 938–947 (2004).
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
Albertsen, M. et al. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).
Kang, D. D., Froula, J., Egan, R. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).
Wu, Y.-W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607 (2016).
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Plaza Oñate, F. et al. MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun metagenomic data. Bioinformatics 35, 1544–1552 (2019).
Lin, H. H. & Liao, Y. C. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Sci. Rep. 6, 24175 (2016).
Chatterji, S., Yamazaki, I., Bai, Z. & Eisen, J. A. CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads. in Research in Computational Molecular Biology (eds. Vingron, M. & Wong, L.) 17–28 (Springer, 2008).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).
Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499–504 (2019).
Brooks, B. et al. Strain-resolved analysis of hospital rooms and infants reveals overlap between the human and room microbiome. Nat. Commun. 8, 1–7 (2017).
Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation – a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Sieber, C. M. K. et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol. 3, 836–843 (2018).
Cleary, B. et al. Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat. Biotechnol. 33, 1053–1060 (2015).
Huttenhower, C. et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
Saeed, I., Tang, S.-L. & Halgamuge, S. K. Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition. Nucleic Acids Res. 40, e34 (2012).
Pride, D. T., Meinersmann, R. J., Wassenaar, T. M. & Blaser, M. J. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 13, 145–156 (2003).
Chen, L.-X., Anantharaman, K., Shaiber, A., Eren, A. M. & Banfield, J. F. Accurate and complete genomes from metagenomes. Genome Res. 30, 315–333 (2020).
Daubin, V., Lerat, E. & Perrière, G. The source of laterally transferred genes in bacterial genomes. Genome Biol. 4, R57 (2003).
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).
Yatsunenko, T. et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).
Deschasaux, M. et al. Depicting the composition of gut microbiota in a population with varied ethnic origins but shared geography. Nat. Med. 24, 1526–1531 (2018).
He, Y. et al. Regional variation limits applications of healthy gut microbiome reference ranges and disease models. Nat. Med. 24, 1532–1535 (2018).
Asnicar, F. et al. Studying vertical microbiome transmission from mothers to infants by strain-level metagenomic profiling. mSystems 2, e00164–16 (2017).
Ferretti, P. et al. Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome. Cell Host Microbe 24, 133–145 (2018).
Grønbech, C. H. et al. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 36, 4415–4422 (2020).
Dilokthanakul, N. et al. Deep unsupervised clustering with Gaussian mixture variational autoencoders. Preprint at https://arxiv.org/abs/1611.02648 (2017).
Kislyuk, A., Bhatnagar, S., Dushoff, J. & Weitz, J. S. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinform. 10, 316 (2009).
Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Preprint at https://arxiv.org/abs/1502.03167 (2015).
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. Preprint at https://arxiv.org/pdf/1207.0580.pdf (2012).
Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. Preprint at https://arxiv.org/pdf/1207.0580.pdf (2013).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2017).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997?upload=1 (2013).
Li, H. et al. The sequence alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Sculley, D. Web-Scale k-Means Clustering. in Proc. 19th International Conference on World Wide Web 1177–1178 (ACM Press, 2010).
Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. MetaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).
Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 47, D351–D360 (2019).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B 57, 289–300 (1995).
Nayfach, S., Pedro Camargo, A., Eloe-Fadrosh, E. & Roux, S. CheckV: assessing the quality of metagenome-assembled viral genomes. Preprint at bioRxiv https://doi.org/10.1101/2020.05.06.081778 (2020).
Ren, J. et al. Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2019).
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Cosentino, S. & Iwasaki, W. SonicParanoid: fast, accurate and easy orthology inference. Bioinformatics 35, 149–151 (2018).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Zhang, C., Rabiee, M., Sayyari, E. & Mirarab, S. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinform. 19, 153 (2018).
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2018).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).
Oksanen, J. et al. Package ‘vegan’. Community Ecology Package v.2.5-6. R Package version 3.4.0 1–296. https://cran.r-project.org/src/contrib/Archive/vegan/vegan_2.5-6.tar.gz (2019).
Paradis, E. & Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2018).

Acknowledgements

We thank A. Almeida and R. D. Finn for sharing de novo assemblies of the 1,000 gut microbiome samples that we used for benchmarking VAMB. We thank C. Titus Brown for his source code contribution to the VAMB software package. J.N.N., J.J., R.L.A., L.J.J. and S.R. were supported by the Novo Nordisk Foundation (grant NNF14CC0001). S.R. was supported by the Jorck Foundation Research Award.

Author information

Authors and Affiliations

Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
Jakob Nybo Nissen & Jose Juan Almagro Armenteros
Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
Jakob Nybo Nissen, Joachim Johansen, Rosa Lundbye Allesøe, Lars Juhl Jensen & Simon Rasmussen
Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Casper Kaae Sønderby, Christopher Heje Grønbech & Ole Winther
Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark
Christopher Heje Grønbech & Ole Winther
Clinical-Microbiomics A/S, Copenhagen, Denmark
Henrik Bjørn Nielsen
National Food Institute, Technical University of Denmark, Lyngby, Denmark
Thomas Nordahl Petersen
Center for Genomic Medicine, Copenhagen University Hospital, Copenhagen, Denmark
Ole Winther

Contributions

S.R. conceived the study and guided the analysis. J.N.N., S.R., J.J. and R.L.A. performed the analyses. J.N.N. wrote the software. C.K.S., J.J.A.A., C.H.G., T.N.P., L.J.J., H.B.N. and O.W. provided guidance and input for the analysis. J.N.N., L.J.J. and S.R. wrote the manuscript with contributions from all coauthors. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Simon Rasmussen.

Ethics declarations

Competing interests

H.B.N. is employed at Clinical-Microbiomics A/S. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–33 and Tables 1–10.

Reporting Summary

Supplementary Data 1

Overview of Salmonella spike-in with one to three Salmonella strains in a background of 50 HMP samples.

Supplementary Data 2

Results of Salmonella spike-in with one to three genomes in a background of 50 HMP samples.

Supplementary Data 3

Results of comparison between multisplit and single-sample binning.

Supplementary Data 4

Information on the 1,000 human gut microbiome samples (benchmark set from Almeida et al.¹⁸).

Supplementary Data 5

GTDB annotation of VAMB NC bins from the dataset of Almeida et al.¹⁸.

Supplementary Data 6

Number of NC bins generated by VAMB and MetaBAT2 that are annotated by GTDB to a particular species.

Supplementary Data 7

Overview of BLAST hits for alignment of VAMB clusters versus NCBI nonredundant nucleotides.

Supplementary Data 8

PERMANOVA analysis of phylogenetic placements and geography.

Supplementary Data 9

Results of hyperparameter optimizations of the VAE in VAMB.

Supplementary Data 10

CheckM results for all bins produced by VAMB.

Source data

Source Data Fig. 1

Statistical source data Fig. 1a–d,f.

Source Data Fig. 2

Statistical source data Fig. 2a–f.

Source Data Fig. 3

Statistical source data Fig. 3c,d.

About this article

Cite this article

Nissen, J.N., Johansen, J., Allesøe, R.L. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol 39, 555–560 (2021). https://doi.org/10.1038/s41587-020-00777-4

Received: 06 December 2019
Accepted: 17 November 2020
Published: 04 January 2021
Issue Date: May 2021
DOI: https://doi.org/10.1038/s41587-020-00777-4
