Article | Published:

Structural variation in the gut microbiome associates with host health

Nature volume 568, pages 43–48 (2019)


Differences in the presence of even a few genes between otherwise identical bacterial strains may result in critical phenotypic differences. Here we systematically identify microbial genomic structural variants (SVs) and find them to be prevalent in the human gut microbiome across phyla and to replicate in different cohorts. SVs are enriched for CRISPR-associated and antibiotic-producing functions and depleted from housekeeping genes, suggesting that they have a role in microbial adaptation. We find multiple associations between SVs and host disease risk factors, many of which replicate in an independent cohort. Exploring genes that are clustered in the same SV, we uncover several possible mechanistic links between the microbiome and its host, including a region in Anaerostipes hadrus that encodes a composite inositol catabolism-butyrate biosynthesis pathway, the presence of which is associated with lower host metabolic disease risk. Overall, our results uncover a nascent layer of variability in the microbiome that is associated with microbial adaptation and host health.

Data availability

The 7 strain samples used in Fig. 1c are available through ENA, accession ENA: PRJEB25194. The 887 samples are publicly available through ENA, accession numbers ENA: PRJEB11532 and ENA: PRJEB17643. The raw metagenomic sequencing data for the Lifelines DEEP cohort, and age and sex information per sample, are available from the European Genome-phenome Archive at accession number EGAS00001001704. Other phenotypic data can be requested from the Lifelines cohort study following the standard protocol for data access.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


  1. McCarroll, S. A. & Altshuler, D. M. Copy-number variation and association studies of human disease. Nat. Genet. 39 (Suppl), S37–S42 (2007).
  2. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).
  3. Sokurenko, E. V. et al. Pathogenic adaptation of Escherichia coli by natural variation of the FimH adhesin. Proc. Natl Acad. Sci. USA 95, 8922–8926 (1998).
  4. Gill, S. R. et al. Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. J. Bacteriol. 187, 2426–2438 (2005).
  5. Koeth, R. A. et al. Intestinal microbiota metabolism of l-carnitine, a nutrient in red meat, promotes atherosclerosis. Nat. Med. 19, 576–585 (2013).
  6. Han, B. et al. Microbial genetic composition tunes host longevity. Cell 169, 1249–1262 (2017).
  7. Greenblum, S., Carr, R. & Borenstein, E. Extensive strain-level copy-number variation across human gut microbiome species. Cell 160, 583–594 (2015).
  8. Swann, J. R. et al. Systemic gut microbial modulation of bile acid metabolism in host tissue compartments. Proc. Natl Acad. Sci. USA 108 (Suppl 1), 4523–4530 (2011).
  9. LeBlanc, J. G. et al. Bacteria as vitamin suppliers to their host: a gut microbiota perspective. Curr. Opin. Biotechnol. 24, 160–168 (2013).
  10. Levy, M. et al. Microbiota-modulated metabolites shape the intestinal microenvironment by regulating NLRP6 inflammasome signaling. Cell 163, 1428–1443 (2015).
  11. Zeevi, D. et al. Personalized nutrition by prediction of glycemic responses. Cell 163, 1079–1094 (2015).
  12. Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
  13. Halfvarson, J. et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat. Microbiol. 2, 17004 (2017).
  14. Pascal, V. et al. A microbial signature for Crohn’s disease. Gut 66, 813–822 (2017).
  15. Rowan, S. et al. Involvement of a gut–retina axis in protection against dietary glycemia-induced age-related macular degeneration. Proc. Natl Acad. Sci. USA 114, E4472–E4481 (2017).
  16. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
  17. Manor, O. & Borenstein, E. Systematic characterization and analysis of the taxonomic drivers of functional shifts in the human microbiome. Cell Host Microbe 21, 254–267 (2017).
  18. Franzosa, E. A. et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods 15, 962–968 (2018).
  19. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
  20. Korem, T. et al. Bread affects clinical parameters and induces gut microbiome-associated personal glycemic responses. Cell Metab. 25, 1243–1253 (2017).
  21. Zhernakova, A. et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science 352, 565–569 (2016).
  22. Rothschild, D. et al. Environment dominates over host genetics in shaping human gut microbiota. Nature 555, 210–215 (2018).
  23. Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
  24. Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46 (D1), D754–D761 (2018).
  25. El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. (2018).
  26. Korem, T. et al. Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science 349, 1101–1106 (2015).
  27. Hayashi, F. et al. The innate immune response to bacterial flagellin is mediated by Toll-like receptor 5. Nature 410, 1099–1103 (2001).
  28. Shen, Y. et al. Flagellar hooks and hook protein FlgE participate in host microbe interactions at immunological level. Sci. Rep. 7, 1433 (2017).
  29. Weiser, J. N. et al. Phosphorylcholine on the lipopolysaccharide of Haemophilus influenzae contributes to persistence in the respiratory tract and sensitivity to serum killing mediated by C-reactive protein. J. Exp. Med. 187, 631–640 (1998).
  30. Ross, J. I. et al. Inducible erythromycin resistance in staphylococci is encoded by a member of the ATP-binding transport super-gene family. Mol. Microbiol. 4, 1207–1214 (1990).
  31. Zupancic, M. L. et al. Analysis of the gut microbiota in the old order Amish and its relation to the metabolic syndrome. PLoS One 7, e43052 (2012).
  32. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
  33. Yoshida, K. et al. myo-Inositol catabolism in Bacillus subtilis. J. Biol. Chem. 283, 10415–10424 (2008).
  34. Bergman, E. N. Energy contributions of volatile fatty acids from the gastrointestinal tract in various species. Physiol. Rev. 70, 567–590 (1990).
  35. Harig, J. M., Soergel, K. H., Komorowski, R. A. & Wood, C. M. Treatment of diversion colitis with short-chain-fatty acid irrigation. N. Engl. J. Med. 320, 23–28 (1989).
  36. Gao, Z. et al. Butyrate improves insulin sensitivity and increases energy expenditure in mice. Diabetes 58, 1509–1517 (2009).
  37. Mende, D. R. et al. proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes. Nucleic Acids Res. 45 (D1), D529–D534 (2017).
  38. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
  39. Marco-Sola, S., Sammeth, M., Guigó, R. & Ribeca, P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1185–1188 (2012).
  40. Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).
  41. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
  42. Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
  43. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
  44. Potter, S. C. et al. HMMER web server: 2018 update. Nucleic Acids Res. 46, W200–W204 (2018).
  45. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
  46. Suez, J. et al. Artificial sweeteners induce glucose intolerance by altering the gut microbiota. Nature 514, 181–186 (2014).
  47. Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation – a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
  48. Liu, B., Gibbons, T., Ghodsi, M. & Pop, M. MetaPhyler: taxonomic profiling for metagenomic sequences. 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 95–100 (IEEE, 2010).


We thank members of the Segal group and the Center for Studies in Physics and Biology for discussions. E.S. is supported by grants from the European Research Council and the Israel Science Foundation. D.Z. is supported by the James S. McDonnell Foundation and the Dan David Prize Scholarship. D.Z. and T.K. were partly supported by the Israeli Ministry of Science and Technology. Lifelines DEEP was funded by: ERC-2012-322698 and NWO-SPI-92-266 to C.W.; ERC-715772 and NWO-178.056 to A.Z.; NWO-864.13.013 and CVON-2012-03 to J.F.

Reviewer information

Nature thanks Ami Bhatt, Julie Segre and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

Author notes

  1. These authors contributed equally: David Zeevi, Tal Korem


  1. Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel

    • David Zeevi
    • Tal Korem
    • Anastasia Godneva
    • Noam Bar
    • Maya Lotan-Pompan
    • Adina Weinberger
    • Eran Segal
  2. Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel

    • David Zeevi
    • Tal Korem
    • Anastasia Godneva
    • Noam Bar
    • Maya Lotan-Pompan
    • Adina Weinberger
    • Eran Segal
  3. Center for Studies in Physics and Biology, The Rockefeller University, New York, NY, USA

    • David Zeevi
  4. Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA

    • Tal Korem
  5. Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA

    • Tal Korem
  6. University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, The Netherlands

    • Alexander Kurilshikov
    • Jingyuan Fu
    • Cisca Wijmenga
    • Alexandra Zhernakova
  7. University of Groningen, University Medical Center Groningen, Department of Pediatrics, Groningen, The Netherlands

    • Jingyuan Fu
  8. Department of Immunology, K.G. Jebsen Coeliac Disease Research Centre, University of Oslo, Oslo, Norway

    • Cisca Wijmenga



T.K. and D.Z. conceived and designed the study, designed and conducted all analyses, interpreted the results and wrote the manuscript. T.K. and D.Z. equally contributed to this work and are listed in random order. A.G. and N.B. developed methods. A.K., J.F., C.W. and A.Z. analysed the Dutch Lifelines cohort. M.L.-P. and A.W. did experimental work. A.W. designed the study. E.S. conceived, directed and designed the project and analyses, interpreted the results and wrote the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to David Zeevi or Eran Segal.

Extended data figures and tables

  1. Extended Data Fig. 1 Superior assignment of metagenomic reads using the Iterative Coverage-based Read-Assignment (ICRA) algorithm.

    a, Boxplot (centre, median; box, IQR; whiskers, 10th and 90th percentiles) of ambiguous read assignment ratios of 887 samples (refs 11, 20) mapped to a reference database of 3,953 representative microbial genomes (Methods) before (blue) and after (yellow) ICRA correction. b, Illustration of our computational pipeline. c–e, Swarm-plots of the ratio of correct read assignment per taxonomy level with no assignment correction (blue) or following assignment correction with ICRA (yellow), Kraken (ref. 41; red) or MetaPhyler (ref. 48; green) for CAMI (ref. 47) high complexity (c; n = 5), medium complexity (d; n = 2) and low complexity (e; n = 1) datasets. Note that MetaPhyler did not provide sub-species level read assignments. *P < 0.05, **P < 0.01, two-sided Mann–Whitney U-test.

  2. Extended Data Fig. 2 ICRA estimates relative abundances with accuracy comparable to other tools.

    a, Dot-plot of the calculated relative abundances of 7 bacterial species in 100 samples, using either ICRA (yellow), MetaPhlAn2 (ref. 40; blue) or Bracken (ref. 42; red), as compared to the true relative abundances. Inset shows a violin plot (white dot, median; black rectangle, IQR; whiskers, 1.5 × IQR) of Bray–Curtis dissimilarities between the estimates (n = 100) of each method and the true abundances. **, two-sided Wilcoxon signed-rank P = 1.3 × 10⁻⁴; ****P = 3.0 × 10⁻¹⁸. b–h, Dot-plot of the calculated relative abundances (y axis) of A. finegoldii (b), B. faecium (c), C. flavigena (d), E. faecalis (e), L. gasseri (f), S. cristatus (g) and A. muciniphila (h) in 100 samples, using either ICRA (yellow), MetaPhlAn2 (blue) or Bracken (red), as compared to the true relative abundances (x axis). R² was calculated using Pearson correlation.

  3. Extended Data Fig. 3 SV Explorer enables investigation of co-varying genes.

    a, b, Illustration of the online SV explorer, spanning the entire R. torques genome (a) and a 26-kbp region of this genome (b).

  4. Extended Data Fig. 4 SVs are prevalent in the human microbiome across two cohorts.

    a, Heatmap showing the number of subjects with SVs (yellow colour scale), the number of SVs (green colour scale), the mean SV size (blue colour scale) and the fraction of the genome that is variable (red colour scale), for each microbe analysed, along with their phylogenetic tree. b, Heatmap showing the genomic length percentage of variable and deletion SVs replicated in the Lifelines cohort for each microbe analysed.

  5. Extended Data Fig. 5 Growth rates-associated SVs harbour specific functions.

    Fold difference (x axis) and statistical significance (Methods; y axis) of the enrichment of functional KEGG modules in SVs present in regions significantly associated with microbial growth dynamics. A total of 56,088 genes were considered, 3,805 of them in growth rates-associated SVs.

  6. Extended Data Fig. 6 SVs are associated with microbial growth rates.

    a, Boxplot (centre, median; box, IQR; whiskers, IQR × 1.5) of microbial growth rates calculated using PTR (ref. 26) in individuals harbouring a 7-segment deletion in the E. eligens genome (blue, n = 281) and individuals with no deletion (maroon, n = 166). b, Genomic map of E. eligens with the 7 segments marked in yellow. c, As in a for a 9-segment deletion SV in the E. eligens genome (blue, n = 57) and individuals with no deletion (maroon, n = 390). d, As in b with the 9 segments marked in orange. P value determined by two-sided Mann–Whitney U-test.

  7. Extended Data Fig. 7 SVs are associated with disease risk, replicated in a second cohort.

    Full heatmap of statistically significant correlations (Methods) between disease risk factors and variable SVs, depicting associations replicated (yellow star), replicated using a different variable (orange star) or reversed (grey star) in the Lifelines cohort.

  8. Extended Data Fig. 8 Gene content of SVs associated with host risk factors.

    a, Boxplot of glycated haemoglobin in individuals harbouring an 11-kbp deletion in the E. rectale genome (blue, n = 253) and individuals with no deletion (maroon, n = 377). b, Same as Fig. 4d for this 11-kbp genomic region of E. rectale. c, Boxplot of BMI in individuals harbouring a 4-kbp deletion in the A. hadrus genome (blue, n = 276) and individuals with no deletion (maroon, n = 403). d, Same as Fig. 4d for this 4-kbp genomic region of A. hadrus. e, Depiction of the genes encoded in the region, which encode key enzymes in the folate biosynthesis pathway. Note correspondence of enzyme commission (EC) numbers with d. f, Boxplot of total cholesterol in individuals harbouring an 18-kbp deletion in the R. intestinalis genome (blue, n = 194) and individuals with no deletion (maroon, n = 68). g, Same as Fig. 4d for a 10-kbp stretch of the 18-kbp region in R. intestinalis. h, Boxplot of BMI in individuals harbouring an 8-kbp deletion in the C. comes genome (blue, n = 158) and individuals with no deletion (maroon, n = 294). i, Same as Fig. 4d for this 8-kbp genomic region of C. comes. P values determined by two-sided Mann–Whitney U-test. Boxplots: centre, median; box, IQR; whiskers, IQR × 1.5.

  9. Extended Data Fig. 9 Detailed examples of SV replication.

    Replication of deletion and variable regions depicted in Fig. 4 and Extended Data Fig. 8 between the Israeli (yellow) and Dutch Lifelines DEEP (blue) cohorts.

  10. Extended Data Fig. 10 SV of A. hadrus associated with host risk factors.

    a–c, Boxplot of waist circumference (a), BMI (b) and HDL cholesterol (c) in individuals of the Israeli cohort harbouring the 31-kbp deletion in the A. hadrus genome depicted in Fig. 4 (blue, n = 213) and individuals with no deletion (maroon, n = 468). d, Boxplot of BMI in individuals of the Dutch Lifelines DEEP cohort harbouring the same 31-kbp deletion in the A. hadrus genome (blue, n = 249) and individuals with no deletion (maroon, n = 547). P value determined by two-sided Mann–Whitney U-test. Boxplots: centre, median; box, IQR; whiskers, IQR × 1.5.
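
The Bray–Curtis dissimilarity used in Extended Data Fig. 2 to compare estimated and true relative abundances can be illustrated with a minimal Python sketch. The function name and the two seven-species profiles below are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
def bray_curtis(estimated, truth):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles that
    share no species. Both inputs list abundances in the same
    species order.
    """
    if len(estimated) != len(truth):
        raise ValueError("profiles must cover the same species")
    # Sum of absolute per-species differences.
    numerator = sum(abs(e - t) for e, t in zip(estimated, truth))
    # Total abundance across both profiles.
    denominator = sum(e + t for e, t in zip(estimated, truth))
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


# Hypothetical relative abundances of 7 species (made-up values).
true_profile = [0.30, 0.20, 0.15, 0.15, 0.10, 0.05, 0.05]
estimated_profile = [0.28, 0.22, 0.15, 0.14, 0.11, 0.05, 0.05]

dissimilarity = bray_curtis(estimated_profile, true_profile)
print(f"Bray-Curtis dissimilarity: {dissimilarity:.3f}")
```

A lower value means the estimated profile is closer to the true one, which is the sense in which the inset of Extended Data Fig. 2a compares methods.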

Supplementary information

  1. Supplementary Information

    This file contains Supplementary Note 1 (Validation of the Iterative Coverage-based Read Assignment (ICRA) Algorithm) and Supplementary Note 2 (Community metabolic potential (CMP) of a 31-kbp deletion-SV in A. hadrus).

  2. Reporting Summary

  3. Supplementary Table 1

    Modules enriched and depleted in SVs. KEGG (ref. 23) modules enriched (p < 0.05) in variable-SVs (columns A-E), deletion-SVs (columns G-K) and conserved regions (columns M-Q). Each table records the KEGG module ID (‘KEGG ID’), module name (‘Name’), number of genes belonging to the module that were in each region type (‘Module genes in region’), number of genes in the module (‘Module genes count’), fold change as compared to non-SV regions of the genome (‘Fold change’), whether the module is enriched or depleted in SVs (‘isEnriched’; TRUE if enriched, FALSE if depleted) and two-sided permutation test p-value (‘p’; Methods). 167,389 genes were analyzed in total, of which 14,147, 34,372 and 112,343 were in variable-SVs, deletion-SVs and conserved regions, respectively.

  4. Supplementary Table 2

    Deletion-SVs associated with growth rates of the harboring bacteria. Columns record the harboring microbe (‘Microbe’; formatted as <NCBI taxonomy ID>.<NCBI bioproject accession>), the SV (‘Region’), the difference in median PTR between microbes harboring the SV and those that do not (‘EffectSize’), the number of samples in which the given region was deleted (‘Samples with deletion’), the number of samples in which it was retained (‘Samples with retention’) and the two-sided Mann–Whitney U p-value (‘p’). Only associations with p < 3 × 10⁻⁵ (FWER) are shown.

  5. Supplementary Table 3

    Genes on two E. eligens growth rate-associated SVs. Genes on E. eligens SVs negatively (columns A-F) and positively (columns H-M) associated with growth of E. eligens.

  6. Supplementary Table 4

    Genes on a 31-kbp deletion-SV in A. hadrus significantly associated with lower body weight, waist circumference, BMI, and higher HDL cholesterol.

  7. Supplementary Table 5

    NCBI taxonomy ID and bioproject accession for all microbial genomes in our reference database.

  8. Supplementary Table 6

    Growth specifications of microbial strains used for validation of ICRA.

  9. Supplementary Table 7

    Difference in the community metabolic potential (CMP) of compounds in subjects with a 31-kbp deletion-SV in A. hadrus (n = 213) as compared to subjects with no deletion (n = 468). p, two-sided Mann–Whitney U-test; q, FDR-corrected p-value.
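
The FDR-corrected q-values reported in Supplementary Table 7 can be illustrated with a short Python sketch. It assumes the Benjamini–Hochberg procedure (ref. 45 in the reference list); the table description does not state which FDR procedure was applied, so this is an assumption. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values).

    Output order matches the input order. Assumes each input
    is a valid p-value in [0, 1].
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        adjusted = p_values[idx] * m / rank
        running_min = min(running_min, adjusted)
        q_values[idx] = running_min
    return q_values


# Hypothetical p-values for four compounds (made-up values).
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
print(example_q)
```

Each q-value is the smallest false discovery rate at which the corresponding test would be called significant, so a compound would be reported at a 5% FDR when its q-value is at most 0.05.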
