The emergence of high-throughput DNA sequencing methods provides unprecedented opportunities to further unravel bacterial biodiversity and its worldwide role, from human health to ecosystem functioning. However, despite the abundance of sequencing studies, combining data from multiple individual studies to address macroecological questions of bacterial diversity remains methodologically challenging and plagued by biases. Here, using a machine-learning approach that accounts for differences among studies and complex interactions among taxa, we merge 30 independent bacterial data sets comprising 1,998 soil samples from 21 countries. Whereas previous meta-analysis efforts have focused on bacterial diversity measures or abundances of major taxa, we show that disparate amplicon sequence data can be combined at the taxonomy-based level to assess bacterial community structure. We find that rarer taxa are more important for structuring soil communities than abundant taxa, and that these rarer taxa are better predictors of community structure than environmental factors, which are often confounded across studies. We conclude that combining data from independent studies makes it possible to explore bacterial community dynamics, identify potential ‘indicator’ taxa with an important role in structuring communities, and propose hypotheses on the factors that shape bacterial biogeography that have been overlooked in the past.





We thank all the people who contributed data and input to this study. This study was conducted at a workshop (May 2015, Manchester, UK) funded by the British Ecological Society’s special interest group Plants-Soils-Ecosystems and organized by F.T.d.V. and K.S.R. This study and participants were funded in part by ERC Advanced Grant 26055290 (K.S.R. and W.H.v.d.P.); BBSRC David Phillips Fellowship (BB/L02456X/1) (F.T.d.V.); ERC Grant Agreements 242658 (BIOCOM) and 647038 (BIODESERT) (F.T.M.); the European Regional Development Fund (Centre of Excellence EcolChange) (J.D.); Yorkshire Agricultural Society, Nafferton Ecological Farming Group, and the Northumbria University Research Development Fund (C.H.O.); BBSRC Training Grant (BB/K501943/1) (C.H.); Wallenberg Academy Fellowship (KAW 2012.0152), Formas (214-2011-788) and Vetenskapsrådet (612-2011-5444) (E.D.); and the Glastir Monitoring & Evaluation Programme (contract reference: C147/2010/11) and the full support of the GMEP team on the Glastir project (D.L.J., S.C., and D.A.R.). Computing was facilitated by the University of Manchester Condor pool and the CLIMB infrastructure (http://www.climb.ac.uk).

Author information

Author notes

  1. Kelly S. Ramirez and Christopher G. Knight contributed equally to this work.


  1. Netherlands Institute of Ecology, Wageningen, The Netherlands

    • Kelly S. Ramirez
    • Mattias de Hollander
    • Thomas W. Crowther
    • Wim H. van der Putten
  2. Faculty of Science and Engineering, University of Manchester, Manchester, UK

    • Christopher G. Knight
    • Angela L. Straathof
    • Franciska T. de Vries
  3. School of Science and the Environment, Manchester Metropolitan University, Manchester, UK

    • Francis Q. Brearley
    • David R. Elliott
    • Graeme Fox
    • Jennifer Rowntree
  4. Evolution and Genomic Sciences, School of Biological Sciences, University of Manchester, Manchester, UK

    • Bede Constantinides
  5. Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK

    • Anne Cotton
  6. Environment Centre Wales, College of Natural Sciences, Bangor University, Bangor, UK

    • Si Creer
    • David L. Jones
  7. Institute of Integrative Biology, ETH Zürich, Zürich, Switzerland

    • Thomas W. Crowther
  8. Department of Botany, Institute of Ecology and Earth Sciences, University of Tartu, Tartu, Estonia

    • John Davison
  9. Cooperative Institute for Research in Environmental Sciences, University of Colorado, Boulder, CO, USA

    • Manuel Delgado-Baquerizo
  10. Climate Impacts Research Centre, Department of Ecology and Environmental Science, Umeå University, Abisko, Sweden

    • Ellen Dorrepaal
    • Eveline J. Krab
    • Sylvain Monteux
  11. Environmental Sustainability Research Centre, University of Derby, Derby, UK

    • David R. Elliott
  12. Centre for Ecology and Hydrology, Wallingford, UK

    • Robert I. Griffiths
  13. School of Life Sciences, University of Warwick, Coventry, UK

    • Chris Hale
  14. Division of Agroecology and Environment, Agroscope, Zürich, Switzerland

    • Kyle Hartman
    • Klaus Schlaeppi
    • Marcel G. A. van der Heijden
  15. Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK

    • Ashley Houlden
    • Ian S. Roberts
  16. Departamento de Biología y Geología, Física y Química Inorgánica, Escuela Superior de Ciencias Experimentales y Tecnología, Universidad Rey Juan Carlos, Móstoles, Spain

    • Fernando T. Maestre
  17. Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR, USA

    • Krista L. McGuire
  18. School of Science and Engineering, Teesside University, Middlesbrough, UK

    • Caroline H. Orr
  19. Laboratory of Nematology, Wageningen University, Wageningen, The Netherlands

    • Wim H. van der Putten
  20. Centre for Ecology and Hydrology, Bangor, UK

    • David A. Robinson
  21. Department of Biology, Duke University, Durham, NC, USA

    • Jennifer D. Rocca
  22. Natural England, Exeter, UK

    • Matthew Shepherd
  23. Hawkesbury Institute for the Environment, Western Sydney University, Penrith, New South Wales, Australia

    • Brajesh K. Singh
  24. Department of Biology, Boston University, Boston, MA, USA

    • Jennifer M. Bhatnagar
  25. Institute of Biological and Environmental Sciences, University of Aberdeen, Aberdeen, UK

    • Cécile Thion
  26. Institute for Evolutionary Biology and Environmental Studies, University of Zürich, Zürich, Switzerland

    • Marcel G. A. van der Heijden
  27. Plant–Microbe Interactions, Institute of Environmental Biology, Faculty of Science, Utrecht University, Utrecht, The Netherlands

    • Marcel G. A. van der Heijden




The idea for this study was conceived by F.T.d.V. and K.S.R. The data sets were compiled by C.G.K., R.G., J.D., A.H., B.C., G.F., A.L.S., and J.R. Metadata were compiled by J.D. and J.R. Raw sequence analysis was conducted by M.d.H. Primer bias analysis was conducted by A.C. Random Forest analyses were conducted and figures were produced by C.G.K. The manuscript was written by K.S.R., C.G.K., and F.T.d.V., with contributions from all co-authors.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Kelly S. Ramirez.

Electronic supplementary material

  1. Supplementary Information

    Supplementary Tables 2 and 3 and Supplementary Figures 1–10.

  2. Life Sciences Reporting Summary

  3. Figure Generation Data

    Supplementary Table 4: Data used to generate figures.

  4. Figure Generation Code

    R code used to generate figures.

  5. Supplementary Table 1

    Summary of all datasets used.

  6. Supplementary Table 5

    Name-matched data.

  7. Supplementary Table 6

    Sequence-matched data.