Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
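The partitioning idea in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the LSA implementation: an in-memory NumPy SVD stands in for the streaming calculation, the k-mer length of 4 is chosen only to keep the data small, and the greedy cosine-similarity clustering and majority-vote read assignment are illustrative choices. All function names (`abundance_matrix`, `eigengenome_loadings`, `cluster_kmers`, `partition_reads`, `shred`) and the two synthetic "genomes" are hypothetical.

```python
from collections import Counter

import numpy as np

K = 4  # toy k-mer length (assumption; chosen only to keep the example small)


def kmers(seq, k=K):
    """Return every overlapping k-mer in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def abundance_matrix(samples, k=K):
    """Build a k-mer x sample count matrix from a list of read lists."""
    counts = [Counter(km for read in reads for km in kmers(read, k))
              for reads in samples]
    vocab = sorted(set().union(*counts))
    mat = np.array([[c[km] for c in counts] for km in vocab], dtype=float)
    return vocab, mat


def eigengenome_loadings(mat, n_components):
    """Project unit-normalised k-mer abundance profiles onto the top singular vectors."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / np.where(norms == 0, 1.0, norms)
    u, s, _ = np.linalg.svd(unit, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def cluster_kmers(loadings, threshold=0.9):
    """Greedy clustering: join the most similar seed if cosine >= threshold, else start a new cluster."""
    norms = np.linalg.norm(loadings, axis=1, keepdims=True)
    unit = loadings / np.where(norms == 0, 1.0, norms)
    seeds, labels = [], []
    for row in unit:
        sims = [float(row @ seed) for seed in seeds]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            seeds.append(row)
            labels.append(len(seeds) - 1)
    return labels


def partition_reads(reads, kmer_label, k=K):
    """Assign each read to the cluster that most of its k-mers belong to."""
    parts = []
    for read in reads:
        votes = Counter(kmer_label[km] for km in kmers(read, k) if km in kmer_label)
        parts.append(votes.most_common(1)[0][0] if votes else None)
    return parts


def shred(genome, length=10, step=2):
    """Cut a genome string into overlapping fixed-length reads."""
    return [genome[i:i + length] for i in range(0, len(genome) - length + 1, step)]


# Two synthetic genomes over disjoint alphabets, so they share no k-mers.
genome_a = "ACCACAACCCAACACCAAAC"
genome_b = "GTTGTGGTTTGGTGTTGGGT"
reads_a, reads_b = shred(genome_a), shred(genome_b)

# Each sample mixes the two genomes at a different depth (made-up values).
depths = [(5, 1), (1, 6), (3, 2)]
samples = [reads_a * da + reads_b * db for da, db in depths]

vocab, mat = abundance_matrix(samples)
labels = cluster_kmers(eigengenome_loadings(mat, n_components=2))
kmer_label = dict(zip(vocab, labels))
parts = partition_reads(reads_a + reads_b, kmer_label)
print(parts)
```

Because every k-mer from one genome rises and falls with that genome's depth across samples, its normalised abundance profile points in the same direction as the profiles of the genome's other k-mers. That shared direction is what lets the reads be separated before any assembly is attempted.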
- Supplementary Figure 1: Accuracy and Completeness with Redundant Partitioning (36 KB)
The accuracy of Salmonella-enriched partitions (rows) with respect to each strain (columns) is depicted on a color scale. Saturation of each color indicates the completeness of each assembly with respect to each strain. Bars in the right panel indicate the total assembly length.
- Supplementary Figure 2: Sub-partitioning enrichment (40 KB)
LSA was run on the reads from a single partition containing a mixture of genomes. Blue bars represent the relative abundance of each genus within that partition. After sub-partitioning (red bars), the mixed genomes were further resolved into species-specific bins.
- Supplementary Figure 3: Salmonella bongori alignments (244 KB)
For two partitions (top and bottom panels) the total number of genes mapping to each Salmonella bongori reference strain is shown in a bar chart, and the sequence identity of these mappings is depicted as a box plot.
- Supplementary Figure 4: Novel genome GC content versus depth (121 KB)
Plotted are the GC content (x-axis) and depth (y-axis) for contigs in partitions that may contain novel genomes. Alignments to different families are depicted in different colors (unclassified shown in red), and the size of each circle represents the length of each contig.
- Supplementary Text and Figures (830 KB)
Supplementary Figures 1–4 and Supplementary Tables 1–4, 8, 9, 11