Differences in the presence of even a few genes between otherwise identical bacterial strains may result in critical phenotypic differences. Here we systematically identify microbial genomic structural variants (SVs) and find them to be prevalent in the human gut microbiome across phyla and to replicate in different cohorts. SVs are enriched for CRISPR-associated and antibiotic-producing functions and depleted from housekeeping genes, suggesting that they have a role in microbial adaptation. We find multiple associations between SVs and host disease risk factors, many of which replicate in an independent cohort. Exploring genes that are clustered in the same SV, we uncover several possible mechanistic links between the microbiome and its host, including a region in Anaerostipes hadrus that encodes a composite inositol catabolism-butyrate biosynthesis pathway, the presence of which is associated with lower host metabolic disease risk. Overall, our results uncover a nascent layer of variability in the microbiome that is associated with microbial adaptation and host health.
The samples of the 7 strains used in Fig. 1c are available through ENA (https://www.ebi.ac.uk/ena), accession number PRJEB25194. The 887 samples are publicly available through ENA, accession numbers PRJEB11532 and PRJEB17643. The raw metagenomic sequencing data for the Lifelines DEEP cohort, together with age and sex information per sample, are available from the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/) at accession number EGAS00001001704. Other phenotypic data can be requested from the Lifelines cohort study (https://lifelines.nl/lifelines-research/access-to-lifelines) following the standard protocol for data access.
McCarroll, S. A. & Altshuler, D. M. Copy-number variation and association studies of human disease. Nat. Genet. 39 (Suppl), S37–S42 (2007).
Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).
Sokurenko, E. V. et al. Pathogenic adaptation of Escherichia coli by natural variation of the FimH adhesin. Proc. Natl Acad. Sci. USA 95, 8922–8926 (1998).
Gill, S. R. et al. Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. J. Bacteriol. 187, 2426–2438 (2005).
Koeth, R. A. et al. Intestinal microbiota metabolism of l-carnitine, a nutrient in red meat, promotes atherosclerosis. Nat. Med. 19, 576–585 (2013).
Han, B. et al. Microbial genetic composition tunes host longevity. Cell 169, 1249–1262 (2017).
Greenblum, S., Carr, R. & Borenstein, E. Extensive strain-level copy-number variation across human gut microbiome species. Cell 160, 583–594 (2015).
Swann, J. R. et al. Systemic gut microbial modulation of bile acid metabolism in host tissue compartments. Proc. Natl Acad. Sci. USA 108 (Suppl 1), 4523–4530 (2011).
LeBlanc, J. G. et al. Bacteria as vitamin suppliers to their host: a gut microbiota perspective. Curr. Opin. Biotechnol. 24, 160–168 (2013).
Levy, M. et al. Microbiota-modulated metabolites shape the intestinal microenvironment by regulating NLRP6 inflammasome signaling. Cell 163, 1428–1443 (2015).
Zeevi, D. et al. Personalized nutrition by prediction of glycemic responses. Cell 163, 1079–1094 (2015).
Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Halfvarson, J. et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat. Microbiol. 2, 17004 (2017).
Pascal, V. et al. A microbial signature for Crohn’s disease. Gut 66, 813–822 (2017).
Rowan, S. et al. Involvement of a gut–retina axis in protection against dietary glycemia-induced age-related macular degeneration. Proc. Natl Acad. Sci. USA 114, E4472–E4481 (2017).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Manor, O. & Borenstein, E. Systematic characterization and analysis of the taxonomic drivers of functional shifts in the human microbiome. Cell Host Microbe 21, 254–267 (2017).
Franzosa, E. A. et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods 15, 962–968 (2018).
Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
Korem, T. et al. Bread affects clinical parameters and induces gut microbiome-associated personal glycemic responses. Cell Metab. 25, 1243–1253 (2017).
Zhernakova, A. et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science 352, 565–569 (2016).
Rothschild, D. et al. Environment dominates over host genetics in shaping human gut microbiota. Nature 555, 210–215 (2018).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46 (D1), D754–D761 (2018).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. (2018). https://doi.org/10.1093/nar/gky995
Korem, T. et al. Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science 349, 1101–1106 (2015).
Hayashi, F. et al. The innate immune response to bacterial flagellin is mediated by Toll-like receptor 5. Nature 410, 1099–1103 (2001).
Shen, Y. et al. Flagellar hooks and hook protein FlgE participate in host microbe interactions at immunological level. Sci. Rep. 7, 1433 (2017).
Weiser, J. N. et al. Phosphorylcholine on the lipopolysaccharide of Haemophilus influenzae contributes to persistence in the respiratory tract and sensitivity to serum killing mediated by C-reactive protein. J. Exp. Med. 187, 631–640 (1998).
Ross, J. I. et al. Inducible erythromycin resistance in staphylococci is encoded by a member of the ATP-binding transport super-gene family. Mol. Microbiol. 4, 1207–1214 (1990).
Zupancic, M. L. et al. Analysis of the gut microbiota in the old order Amish and its relation to the metabolic syndrome. PLoS One 7, e43052 (2012).
Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
Yoshida, K. et al. myo-Inositol catabolism in Bacillus subtilis. J. Biol. Chem. 283, 10415–10424 (2008).
Bergman, E. N. Energy contributions of volatile fatty acids from the gastrointestinal tract in various species. Physiol. Rev. 70, 567–590 (1990).
Harig, J. M., Soergel, K. H., Komorowski, R. A. & Wood, C. M. Treatment of diversion colitis with short-chain-fatty acid irrigation. N. Engl. J. Med. 320, 23–28 (1989).
Gao, Z. et al. Butyrate improves insulin sensitivity and increases energy expenditure in mice. Diabetes 58, 1509–1517 (2009).
Mende, D. R. et al. proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes. Nucleic Acids Res. 45 (D1), D529–D534 (2017).
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
Marco-Sola, S., Sammeth, M., Guigó, R. & Ribeca, P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1185–1188 (2012).
Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Potter, S. C. et al. HMMER web server: 2018 update. Nucleic Acids Res. 46, W200–W204 (2018).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Suez, J. et al. Artificial sweeteners induce glucose intolerance by altering the gut microbiota. Nature 514, 181–186 (2014).
Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation – a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
Liu, B., Gibbons, T., Ghodsi, M. & Pop, M. MetaPhyler: taxonomic profiling for metagenomic sequences. 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 95–100 (IEEE, 2010).
We thank members of the Segal group and the Center for Studies in Physics and Biology for discussions. E.S. is supported by grants from the European Research Council and the Israel Science Foundation. D.Z. is supported by the James S. McDonnell Foundation and the Dan David Prize Scholarship. D.Z. and T.K. were partly supported by the Israeli Ministry of Science and Technology. Lifelines DEEP was funded by: ERC-2012-322698 and NWO-SPI-92-266 to C.W.; ERC-715772 and NWO-178.056 to A.Z.; NWO-864.13.013 and CVON-2012-03 to J.F.
Nature thanks Ami Bhatt, Julie Segre and the other anonymous reviewer(s) for their contribution to the peer review of this work.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Superior assignment of metagenomic reads using the Iterative Coverage-based Read-Assignment (ICRA) algorithm.
a, Boxplot (centre, median; box, IQR; whiskers, 10th and 90th percentiles) of ambiguous read assignment ratios of 887 samples (refs. 11, 20) mapped to a reference database of 3,953 representative microbial genomes (Methods) before (blue) and after (yellow) ICRA correction. b, Illustration of our computational pipeline. c–e, Swarm-plots of the ratio of correct read assignment per taxonomy level with no assignment correction (blue) or following assignment correction with ICRA (yellow), Kraken (ref. 41; red) or MetaPhyler (ref. 48; green) for CAMI (ref. 47) high complexity (c; n = 5), medium complexity (d; n = 2) and low complexity (e; n = 1) datasets. Note that MetaPhyler did not provide sub-species level read assignments. *P < 0.05, **P < 0.01, two-sided Mann–Whitney U-test.
a, Dot-plot of the calculated relative abundances of 7 bacterial species in 100 samples, using either ICRA (yellow), MetaPhlAn2 (ref. 40; blue) or Bracken (ref. 42; red), as compared to the true relative abundances. Inset shows a violin plot (white dot, median; black rectangle, IQR; whiskers, 1.5 × IQR) of Bray–Curtis dissimilarities between the estimates (n = 100) of each method and the true abundances. **P = 1.3 × 10^−4, ****P = 3.0 × 10^−18, two-sided Wilcoxon signed-rank test. b–h, Dot-plot of the calculated relative abundances (y axis) of A. finegoldii (b), B. faecium (c), C. flavigena (d), E. faecalis (e), L. gasseri (f), S. cristatus (g) and A. muciniphila (h) in 100 samples, using either ICRA (yellow), MetaPhlAn2 (blue) or Bracken (red), as compared to the true relative abundances (x axis). R^2 was calculated using Pearson correlation.
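As a reading aid for the Bray–Curtis comparison in the inset above, the following is a minimal sketch of how the dissimilarity between an estimated and a true abundance profile can be computed. The function name `bray_curtis` and the abundance values are hypothetical illustrations, not the authors' code or data.

```python
def bray_curtis(estimated, truth):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length.

    Defined as sum(|e_i - t_i|) / sum(e_i + t_i); 0 means identical profiles,
    1 means the profiles share no taxa.
    """
    if len(estimated) != len(truth):
        raise ValueError("vectors must have the same length")
    total = sum(estimated) + sum(truth)
    if total == 0:
        raise ValueError("at least one vector must contain non-zero abundance")
    return sum(abs(e - t) for e, t in zip(estimated, truth)) / total


# Hypothetical relative abundances of 7 species in one sample (illustrative only).
true_abundance = [0.30, 0.20, 0.15, 0.15, 0.10, 0.05, 0.05]
estimate = [0.28, 0.22, 0.15, 0.13, 0.12, 0.05, 0.05]
print(bray_curtis(estimate, true_abundance))
```

One such value per sample and per method would give the distributions summarized in a violin plot of this kind.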
a, b, Illustration of the online SV explorer available at https://genie.weizmann.ac.il/SV/, spanning the entire R. torques genome (a) and spanning a 26-kbp region of this genome (b).
a, Heatmap showing the number of subjects with SVs (yellow colour scale), the number of SVs (green colour scale), the mean SV size (blue colour scale) and the fraction of the genome that is variable (red colour scale), for each microbe analysed, along with their phylogenetic tree. b, Heatmap showing the genomic length percentage of variable and deletion SVs replicated in the Lifelines cohort for each microbe analysed.
Fold difference (x axis) and statistical significance (Methods; y axis) of the enrichment of functional KEGG modules in SVs present in regions significantly associated with microbial growth dynamics. A total of 56,088 genes were considered, 3,805 of them in growth rates-associated SVs.
a, Boxplot (centre, median; box, IQR; whiskers, IQR × 1.5) of microbial growth rates calculated using PTR (ref. 26) in individuals harbouring a 7-segment deletion in the E. eligens genome (blue, n = 281) and individuals with no deletion (maroon, n = 166). b, Genomic map of E. eligens with the 7 segments marked in yellow. c, As in a for a 9-segment deletion SV in the E. eligens genome (blue, n = 57) and individuals with no deletion (maroon, n = 390). d, As in b with the 9 segments marked in orange. P value determined by two-sided Mann–Whitney U-test.
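The group comparisons in these panels rely on a two-sided Mann–Whitney U-test. The sketch below shows one way to compute it with the Python standard library, assuming the normal approximation with tie correction and no continuity correction; statistical packages may differ slightly (for example by applying a continuity correction or an exact method for small samples). The function name and the PTR values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering the group (0 = x, 1 = y) of every value.
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        t = j - i  # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / sqrt(var_u)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u_x, min(1.0, p)


# Hypothetical growth-rate (PTR) values for carriers and non-carriers of a deletion.
ptr_with_deletion = [1.10, 1.15, 1.22, 1.18, 1.09]
ptr_without_deletion = [1.30, 1.41, 1.28, 1.35, 1.33]
u_stat, p_value = mann_whitney_u(ptr_with_deletion, ptr_without_deletion)
print(u_stat, p_value)
```

With cohort-sized groups such as those in the panels above, the normal approximation is generally considered adequate; for very small groups an exact test would be the safer choice.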
Full heatmap of statistically significant correlations (Methods) between disease risk factors and variable SVs, depicting associations replicated (yellow star), replicated using a different variable (orange star) or reversed (grey star) in the Lifelines cohort.
a, Boxplot of glycated haemoglobin in individuals harbouring an 11-kbp deletion in the E. rectale genome (blue, n = 253) and individuals with no deletion (maroon, n = 377). b, Same as Fig. 4d for this 11-kbp genomic region of E. rectale. c, Boxplot of BMI in individuals harbouring a 4-kbp deletion in the A. hadrus genome (blue, n = 276) and individuals with no deletion (maroon, n = 403). d, Same as Fig. 4d for this 4-kbp genomic region of A. hadrus. e, Depiction of the genes encoded in the region, which encode key enzymes in the folate biosynthesis pathway. Note correspondence of enzyme commission (EC) numbers with d. f, Boxplot of total cholesterol in individuals harbouring an 18-kbp deletion in the R. intestinalis genome (blue, n = 194) and individuals with no deletion (maroon, n = 68). g, Same as Fig. 4d for a 10-kbp stretch of the 18-kbp region in R. intestinalis. h, Boxplot of BMI in individuals harbouring an 8-kbp deletion in the C. comes genome (blue, n = 158) and individuals with no deletion (maroon, n = 294). i, Same as Fig. 4d for this 8-kbp genomic region of C. comes. P values determined by two-sided Mann–Whitney U-test. Boxplots: centre, median; box, IQR; whiskers, IQR × 1.5.
a–c, Boxplot of waist circumference (a), BMI (b) and HDL cholesterol (c) in individuals of the Israeli cohort harbouring the 31-kbp deletion in the A. hadrus genome depicted in Fig. 4 (blue, n = 213) and individuals with no deletion (maroon, n = 468). d, Boxplot of BMI in individuals of the Dutch Lifelines DEEP cohort harbouring the same 31-kbp deletion in the A. hadrus genome (blue, n = 249) and individuals with no deletion (maroon, n = 547). P value determined by two-sided Mann–Whitney U-test. Boxplots: centre, median; box, IQR; whiskers, IQR × 1.5.
This file contains Supplementary Note 1 (Validation of the Iterative Coverage-based Read Assignment (ICRA) Algorithm) and Supplementary Note 2 (Community metabolic potential (CMP) of a 31-kbp deletion-SV in A. hadrus).
Modules enriched and depleted in SVs. KEGG (ref. 23) modules enriched (p < 0.05) in variable-SVs (columns A-E), deletion-SVs (columns G-K) and conserved regions (columns M-Q). Each table records the KEGG module ID (‘KEGG ID’), module name (‘Name’), number of genes belonging to the module that were in each region type (‘Module genes in region’), number of genes in the module (‘Module genes count’), fold change as compared to non-SV regions of the genome (‘Fold change’), whether the module is enriched or depleted in SVs (‘isEnriched’; TRUE if enriched, FALSE if depleted) and two-sided permutation test p-value (‘p’; Methods). 167,389 genes were analyzed in total, of which 14,147, 34,372 and 112,343 were in variable-SVs, deletion-SVs and conserved regions, respectively.
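The fold change and permutation p-value columns described above can be illustrated with a small sketch. The exact procedure is given in the Methods, which are not reproduced here, so everything below is an assumption: fold change is taken as the module-gene fraction inside SVs divided by the fraction outside SVs, and the two-sided permutation test shuffles the SV labels across genes and compares the number of module genes inside SVs with its expectation. All names and the toy data are hypothetical.

```python
import random


def fold_change(in_sv, in_module):
    """Fraction of SV genes in the module divided by the same fraction for non-SV genes."""
    sv_total = sum(in_sv)
    non_sv_total = len(in_sv) - sv_total
    sv_hits = sum(1 for s, m in zip(in_sv, in_module) if s and m)
    non_sv_hits = sum(1 for s, m in zip(in_sv, in_module) if not s and m)
    if sv_total == 0 or non_sv_total == 0 or non_sv_hits == 0:
        raise ValueError("fold change is undefined for this input")
    return (sv_hits / sv_total) / (non_sv_hits / non_sv_total)


def permutation_p_value(in_sv, in_module, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for the count of module genes inside SVs."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    expected = sum(in_sv) * sum(in_module) / len(in_sv)

    def deviation(labels):
        hits = sum(1 for s, m in zip(labels, in_module) if s and m)
        return abs(hits - expected)

    observed = deviation(in_sv)
    shuffled = list(in_sv)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between SV membership and module membership
        if deviation(shuffled) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Toy data: 20 genes, 5 inside SVs; the module has 4 genes inside SVs and 2 outside.
in_sv = [1] * 5 + [0] * 15
in_module = [1, 1, 1, 1, 0] + [1, 1] + [0] * 13
print(fold_change(in_sv, in_module), permutation_p_value(in_sv, in_module))
```

Shuffling labels gene by gene ignores the fact that neighbouring genes tend to fall in the same SV, so a real analysis might permute whole regions instead; this sketch is only meant to make the table columns concrete.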
Deletion-SVs associated with growth rates of the harboring bacteria. Columns record the harboring microbe (‘Microbe’; formatted as <NCBI taxonomy ID>.<NCBI bioproject accession>), the SV (‘Region’), the difference in median PTR between microbes harboring the SV and those that do not (‘EffectSize’), the number of samples where the given region was deleted (‘Samples with deletion’), the number of samples where it was retained (‘Samples with retention’) and the two-sided Mann-Whitney U p-value (‘p’). Only associations with p < 3 × 10^−5 (FWER) are shown.
Genes on two E. eligens growth rate-associated SVs. Genes on E. eligens SVs negatively (columns A-F) and positively (columns H-M) associated with growth of E. eligens.
Genes on a 31-kbp deletion-SV in A. hadrus significantly associated with lower body weight, waist circumference, BMI, and higher HDL cholesterol.
NCBI taxonomy ID and bioproject accession for all microbial genomes in our reference database.
Growth specifications of microbial strains used for validation of ICRA.
Difference in the community metabolic potential (CMP) of compounds in subjects with a 31-kbp deletion-SV in A. hadrus (n = 213) as compared to subjects with no deletion (n = 468). p, two-sided Mann-Whitney U test; q, FDR-corrected p-value.
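For readers unfamiliar with FDR-corrected q-values, the sketch below implements the Benjamini–Hochberg step-up adjustment, which appears in the reference list. Whether this exact procedure produced the q column of the table above is an assumption; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical p-values for four compounds (illustrative only).
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each q-value can be read as the smallest false discovery rate at which the corresponding test would still be called significant.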
About this article
Cite this article
Zeevi, D., Korem, T., Godneva, A. et al. Structural variation in the gut microbiome associates with host health. Nature 568, 43–48 (2019). https://doi.org/10.1038/s41586-019-1065-y