Improved metagenome binning and assembly using deep variational autoencoders

Date: 2021-01-04

Despite recent advances in metagenomic binning, reconstruction of microbial species from metagenomics data remains challenging. Here we develop variational autoencoders for metagenomic binning (VAMB), a program that uses deep variational autoencoders to encode sequence coabundance and k-mer distribution information before clustering. We show that a variational autoencoder is able to integrate these two distinct data types without any previous knowledge of the datasets. VAMB outperforms existing state-of-the-art binners, reconstructing 29–98% and 45% more near-complete (NC) genomes on simulated and real data, respectively. Furthermore, VAMB is able to separate closely related strains up to 99.5% average nucleotide identity (ANI), and reconstructed 255 and 91 NC Bacteroides vulgatus and Bacteroides dorei sample-specific genomes as two distinct clusters from a dataset of 1,000 human gut microbiome samples. We use 2,606 NC bins from this dataset to show that species of the human gut microbiome have different geographical distribution patterns. VAMB can be run on standard hardware and is freely available at https://github.com/RasmussenLab/vamb.
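The two data types named above, k-mer distribution and coabundance, can be pictured as one feature vector per contig. The following is a minimal sketch under stated assumptions, not VAMB's actual implementation: the choice of k = 4, the sum-to-one normalization, the simple concatenation, and the function names `kmer_frequencies` and `contig_features` are all hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Assumption: k-mers containing characters other than A, C, G, T are ignored.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def contig_features(sequence, depths, k=4):
    """Concatenate k-mer frequencies with per-sample abundances.

    Assumption: depths holds one mean read depth per sample, and the
    abundances are normalized to sum to 1 across samples.
    """
    depth_sum = sum(depths)
    if depth_sum > 0:
        abundance = [d / depth_sum for d in depths]
    else:
        abundance = [0.0] * len(depths)
    return kmer_frequencies(sequence, k) + abundance


if __name__ == "__main__":
    # Toy contig observed in two samples (made-up values).
    features = contig_features("ACGTACGT", [2.0, 6.0], k=2)
    print(len(features), features[-2:])
```

A vector of this kind, one per contig, is the sort of input an encoder could compress into a latent representation before clustering; the network architecture and the clustering step are outside the scope of this sketch.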

Fig. 1: Performance of VAMB.
Fig. 2: Performance of clustering different inputs.
Fig. 3: Phylogeny of bins across 1,000 human gut microbiome samples.

Data availability

The sequence data used in this study are publicly available from either the respective studies or ENA. The semisynthetic MetaHIT dataset was downloaded from https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/ as the files depth.txt.gz and assembly-filtered.fa.gz. The simulated CAMI High and CAMI2 datasets were downloaded from https://data.cami-challenge.org/participate from ‘Toy Test Dataset High_Complexity’ and ‘2nd CAMI Toy Human Microbiome Project Dataset’, respectively. The de novo assemblies of the Almeida dataset were obtained through personal communication with A. Almeida and R. D. Finn, and the reads were downloaded from ENA as specified in their publication. The data and results of binning the MetaHIT, CAMI2 and Almeida datasets, as well as the source data for Figs. 1–3, are available on figshare at https://figshare.com/projects/VAMB/72677. A CodeOcean capsule of VAMB v.3.0.1, including the six training and test datasets for reproducing benchmarking results, is available from https://doi.org/10.24433/CO.2518623.v1. Source data are provided with this paper.

Code availability

All code can be found on GitHub at https://github.com/RasmussenLab/vamb and is freely available under the permissive MIT license. All analyses were performed using VAMB v.3.0.1. The code is also available as a CodeOcean capsule at https://doi.org/10.24433/CO.2518623.v1.

Acknowledgements

We thank A. Almeida and R. D. Finn for sharing de novo assemblies of the 1,000 gut microbiome samples that we used for benchmarking VAMB. We thank C. Titus Brown for his source code contribution to the VAMB software package. J.N.N., J.J., R.L.A., L.J.J. and S.R. were supported by the Novo Nordisk Foundation (grant NNF14CC0001). S.R. was supported by the Jorck Foundation Research Award.

  Department of Health Technology, Technical University of Denmark, Lyngby, Denmark

    Jakob Nybo Nissen & Jose Juan Almagro Armenteros

  Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

    Jakob Nybo Nissen, Joachim Johansen, Rosa Lundbye Allesøe, Lars Juhl Jensen & Simon Rasmussen

  Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark

    Casper Kaae Sønderby, Christopher Heje Grønbech & Ole Winther

  Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark

    Christopher Heje Grønbech & Ole Winther

  Clinical-Microbiomics A/S, Copenhagen, Denmark

    Henrik Bjørn Nielsen

  National Food Institute, Technical University of Denmark, Lyngby, Denmark

    Thomas Nordahl Petersen

  Center for Genomic Medicine, Copenhagen University Hospital, Copenhagen, Denmark

    Ole Winther

  Jakob Nybo Nissen
  Joachim Johansen
  Rosa Lundbye Allesøe
  Casper Kaae Sønderby
  Jose Juan Almagro Armenteros
  Christopher Heje Grønbech
  Lars Juhl Jensen
  Henrik Bjørn Nielsen
  Thomas Nordahl Petersen
  Ole Winther
  Simon Rasmussen

Author contributions

S.R. conceived the study and guided the analysis. J.N.N., S.R., J.J. and R.L.A. performed the analyses. J.N.N. wrote the software. C.K.S., J.J.A.A., C.H.G., T.N.P., L.J.J., H.B.N. and O.W. provided guidance and input for the analysis. J.N.N., L.J.J. and S.R. wrote the manuscript with contributions from all coauthors. All authors read and approved the final version of the manuscript.

Correspondence to Simon Rasmussen.

Competing interests

H.B.N. is employed at Clinical-Microbiomics A/S. The remaining authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Figs. 1–33 and Tables 1–10.

Reporting Summary

Supplementary Data 1

Overview of Salmonella spike-in with one to three Salmonella strains in a background of 50 HMP samples.

Supplementary Data 2

Results of Salmonella spike-in with one to three genomes in a background of 50 HMP samples.

Supplementary Data 3

Results of comparison between multisplit and single-sample binning.

Supplementary Data 4

Information on the 1,000 human gut microbiome samples (benchmark set from Almeida et al. (ref. 18)).

Supplementary Data 5

GTDB annotation of VAMB NC bins from the dataset of Almeida et al. (ref. 18).

Supplementary Data 6

Number of NC bins generated by VAMB and MetaBAT2 that are annotated by GTDB to a particular species.

Supplementary Data 7

Overview of BLAST hits for alignment of VAMB clusters versus NCBI nonredundant nucleotides.

Supplementary Data 8

PERMANOVA analysis of phylogenetic placements and geography.

Supplementary Data 9

Results of hyperparameter optimizations of the VAE in VAMB.

Supplementary Data 10

CheckM results for all bins produced by VAMB.

Source Data Fig. 1

Statistical source data Fig. 1a–d,f.

Source Data Fig. 2

Statistical source data Fig. 2a–f.

Source Data Fig. 3

Statistical source data Fig. 3c,d.

