Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning

Journal name: Nature Biotechnology
Volume: 33
Pages: 1053–1060
DOI: 10.1038/nbt.3329

Abstract

Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
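
The abstract's central observation, that k-mers from the same genome rise and fall together across samples, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the published LSA implementation: the two toy "genomes", the copy numbers per sample, the choice of k = 4, and the use of a plain Pearson correlation in place of the streaming eigengenome calculation.

```python
from collections import Counter
from math import sqrt


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_abundance(samples, k):
    """Map each k-mer to its list of counts, one entry per sample."""
    counts = [Counter(km for read in reads for km in kmers(read, k))
              for reads in samples]
    vocabulary = set().union(*counts)
    return {km: [c[km] for c in counts] for km in vocabulary}


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: two short "genomes" with no shared 4-mers,
# present at different copy numbers in three samples.
GENOME_A = "ACGTACGGTC"
GENOME_B = "TTGCAATGCC"
copies = [(5, 1), (1, 4), (3, 2)]  # (copies of A, copies of B) per sample
samples = [[GENOME_A] * a + [GENOME_B] * b for a, b in copies]

profiles = kmer_abundance(samples, k=4)
same_genome = pearson(profiles["ACGT"], profiles["GGTC"])
different_genomes = pearson(profiles["ACGT"], profiles["TTGC"])
print(same_genome, different_genomes)
```

In this toy setting, two k-mers from the same genome have identical abundance profiles across samples, while k-mers from different genomes do not. That shared covariance is the signal the abstract describes eigengenomes as capturing.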

At a glance

Figures

  1. Figure 1: Accuracy and completeness of recovered genomes.

    The accuracy of Salmonella-enriched partitions (rows) with respect to each strain (columns) is depicted on a color scale. Saturation of each color indicates the completeness of each assembly with respect to each strain. Bars in the two right-hand panels indicate the fraction of reads in a partition coming from any Salmonella strain (red line = 5%; the background abundance of spiked-in Salmonella reads) and the total assembly length. The tree at the top was constructed using MUMi distance between strains. S. bong., S. bongori; S. ent., S. enterica.

  2. Figure 2: S. enterica multiple genome alignment.

    Multiple sequence alignment (MSA) blocks (gray ring) are ordered by their conservation across 1–7 strains. The inner rings depict portions of each genome that align to each MSA block. Within five S. enterica–enriched partitions, the read depth at each MSA block is shown as a heatmap in the outer rings. Partition numbers, from the center outward, are 1424, 56, 86, 1369 and 1093. RPKM, reads per kilobase per million.

  3. Figure 3: Latent strain analysis pipeline.

    Metagenomic samples containing multiple species (depicted by different colors) are sequenced. Every k-mer in every sequencing read is hashed to one column of a matrix. Values from each sample occupy a different row. Singular value decomposition of this k-mer abundance matrix defines a set of eigengenomes. k-mers are clustered across eigengenomes, and each read is partitioned based on the intersection of its k-mers with each of these clusters. Each partition contains a small fraction of the original data and can be analyzed independently of all others. SVD, singular value decomposition.

  4. Figure 4: Enrichment of bacterial families spanning six orders of magnitude in abundance.

    Each circle represents one family in the FijiCoMP stool collection. The x axis is the background (unpartitioned) abundance of each family, as determined by species-specific 16S ribosomal DNA. The y axis is the maximum relative abundance of each family in any one partition, as measured by MetaPhyler analysis of marker genes. Circle size is determined by the number of AMPHORA genes in the assembly of each partition.

  5. Figure 5: GC content versus contig depth.

    Plotted are the GC content (x axis) and depth (y axis) for contigs in partitions representing the top 15 enriched families from the FijiCoMP collection. The size of each circle represents the length of each contig. For each family, the background abundance is indicated in parentheses.

  6. Supplementary Fig. 1: Accuracy and Completeness with Redundant Partitioning

    The accuracy of Salmonella-enriched partitions (rows) with respect to each strain (columns) is depicted on a color scale. Saturation of each color indicates the completeness of each assembly with respect to each strain. Bars in the right panel indicate the total assembly length.

  7. Supplementary Fig. 2: Sub-partitioning enrichment

    LSA was run on the reads from a single partition containing a mixture of genomes. Blue bars represent the relative abundance of each genus within that partition. After sub-partitioning (red bars), the mixed genomes were further resolved into species-specific bins.

  8. Supplementary Fig. 3: Salmonella bongori alignments

    For two partitions (top and bottom panels) the total number of genes mapping to each Salmonella bongori reference strain is shown in a bar chart, and the sequence identity of these mappings is depicted as a box plot.

  9. Supplementary Fig. 4: Novel genome GC content versus depth

    Plotted are the GC content (x axis) and depth (y axis) for contigs in partitions that may contain novel genomes. Alignments to different families are depicted in different colors (unclassified shown in red), and the size of each circle represents the length of each contig.
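
The pipeline summarized in the Figure 3 caption (hash every k-mer to a matrix column, keep one row per sample, cluster the columns, then assign each read by the clusters its k-mers fall into) can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' code: CRC32 stands in for whatever hash function LSA uses, a greedy cosine-similarity grouping of raw abundance profiles stands in for the SVD and eigengenome clustering steps, and the toy genomes, `n_cols`, and `threshold` values are hypothetical.

```python
import zlib
from collections import Counter, defaultdict
from math import sqrt


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def column(kmer, n_cols):
    """Hash a k-mer to a matrix column (CRC32 is a stand-in hash)."""
    return zlib.crc32(kmer.encode("ascii")) % n_cols


def abundance_matrix(samples, k, n_cols):
    """Sparse k-mer abundance matrix as {column: [count per sample]}."""
    matrix = defaultdict(lambda: [0] * len(samples))
    for row, reads in enumerate(samples):
        for read in reads:
            for km in kmers(read, k):
                matrix[column(km, n_cols)][row] += 1
    return dict(matrix)


def cosine(u, v):
    """Cosine similarity of two abundance profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_columns(matrix, threshold):
    """Greedy stand-in for eigengenome clustering: a column joins the
    first cluster whose seed profile is similar enough, else starts one."""
    seeds, labels = [], {}
    for col, profile in matrix.items():
        for label, seed in enumerate(seeds):
            if cosine(profile, seed) >= threshold:
                labels[col] = label
                break
        else:
            labels[col] = len(seeds)
            seeds.append(profile)
    return labels


def partition_read(read, k, n_cols, labels):
    """Assign a read to the cluster holding most of its k-mers."""
    cols = (column(km, n_cols) for km in kmers(read, k))
    votes = Counter(labels[c] for c in cols if c in labels)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: two "genomes" at different copy numbers per sample.
GENOME_A = "ACGTACGGTC"
GENOME_B = "TTGCAATGCC"
samples = [[GENOME_A] * a + [GENOME_B] * b for a, b in [(5, 1), (1, 4), (3, 2)]]

K, N_COLS = 4, 2 ** 20
labels = cluster_columns(abundance_matrix(samples, K, N_COLS), threshold=0.95)
part_a = partition_read(GENOME_A, K, N_COLS, labels)
part_b = partition_read(GENOME_B, K, N_COLS, labels)
print(part_a, part_b)
```

Because each read is assigned by a vote over its k-mers, an occasional hash collision between k-mers of different genomes shifts only one vote and leaves the assignment unchanged in this toy case. Each resulting partition can then be processed independently, as the caption notes.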

Accession codes

Primary accessions

Sequence Read Archive

Referenced accessions

Sequence Read Archive

Author information

Affiliations

  1. Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Brian Cleary
  2. Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.

    • Brian Cleary,
    • Ilana Lauren Brito,
    • Katherine Huang,
    • Dirk Gevers,
    • Terrance Shea,
    • Sarah Young &
    • Eric J Alm
  3. Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Ilana Lauren Brito &
    • Eric J Alm
  4. Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.

    • Ilana Lauren Brito &
    • Eric J Alm

Contributions

B.C. conceived the algorithm. B.C., I.L.B., K.H., D.G. and E.J.A. designed the experiments. B.C., I.L.B., K.H., T.S. and S.Y. performed the experiments. B.C., I.L.B., K.H., D.G. and E.J.A. wrote the manuscript. All authors reviewed and approved the final manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Accuracy and Completeness with Redundant Partitioning (36 KB)

  2. Supplementary Figure 2: Sub-partitioning enrichment (40 KB)

  3. Supplementary Figure 3: Salmonella bongori alignments (244 KB)

  4. Supplementary Figure 4: Novel genome GC content versus depth (121 KB)

PDF files

  1. Supplementary Text and Figures (830 KB)

    Supplementary Figures 1–4 and Supplementary Tables 1–4, 8, 9, 11

Excel files

  1. Supplementary Tables 5, 6, 7 and 10 (614 KB)

Zip files

  1. Supplementary Code (111.39 KB)
