LSU LSU A communal catalogue reveals Earth's multiscale microbial A communal catalogue reveals Earth's multiscale microbial diversity diversity

taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth’s microbial diversity. strongly encourage code deposition in a community repository (e.g. GitHub). Nature Methods guidance for providing algorithms and software for publication provides further information on this topic.

A primary aim of microbial ecology is to determine patterns and drivers of community distribution, interaction, and assembly amidst complexity and uncertainty. Microbial community composition has been shown to change across gradients of environment, geographic distance, salinity, temperature, oxygen, nutrients, pH, day length, and biotic factors 1-6 . These patterns have been identified mostly by focusing on one sample type and region at a time, with insights extra polated across environments and geography to produce generalized principles. To assess how microbes are distributed across environments globally-or whether microbial community dynamics follow funda mental ecological 'laws' at a planetary scale-requires either a massive monolithic crossenvironment survey or a practical methodology for coordinating many independent surveys. New studies of microbial environments are rapidly accumulating; however, our ability to extract meaningful information from across datasets is outstripped by the rate of data generation. Previous metaanalyses have suggested robust gen eral trends in community composition, including the importance of salinity 1 and animal association 2 . These findings, although derived from relatively small and uncontrolled sample sets, support the util ity of metaanalysis to reveal basic patterns of microbial diversity and suggest that a scalable and accessible analytical framework is needed.
The Earth Microbiome Project (EMP, http://www.earthmicrobiome. org) was founded in 2010 to sample the Earth's microbial communities at an unprecedented scale in order to advance our understanding of the organizing biogeographic principles that govern microbial commu nity structure 7,8 . We recognized that open and collaborative science, including scientific crowdsourcing and standardized methods 8 , would help to reduce technical variation among individual studies, which can overwhelm biological variation and make general trends difficult to detect 9 . Comprising around 100 studies, over half of which have yielded peerreviewed publications (Supplementary Table 1), the EMP has now dwarfed by 100fold the sampling and sequencing depth of earlier metaanalysis efforts 1,2 ; concurrently, powerful analysis tools have been developed, opening a new and larger window into the distri bution of microbial diversity on Earth. In establishing a scalable frame work to catalogue microbiota globally, we provide both a resource for the exploration of myriad questions and a starting point for the guided acquisition of new data to answer them. As an example of using this

A standardized and scalable approach
The EMP solicited the global scientific community for environmen tal samples and associated metadata spanning diverse environments and capturing spatial, temporal, and/or physicochemical covariation. The first 27,751 samples from 97 independent studies (Supplementary Table 1) represent diverse environment types (Fig. 1a), geographies (Fig. 1b), and chemistries (Extended Data Fig. 1). The EMP encom passes studies of bacterial, archaeal, and eukaryotic microbial diversity. The analysis here focuses exclusively on the bacterial and archaeal components of the overall database (for concision, use of 'microbial' will hereafter refer to bacteria and archaea only). Associated meta data included environment type, location information, host taxonomy (if relevant), and physico chemical measurements (Supplementary Table 2). Physicochemical measurements were made in situ at the time of sampling. Investigators were encouraged to measure temperature and pH at minimum. Salinity, oxygen, and inorganic nutrients were measured when possible, and investigators collected additional meta data pertinent to their particular investigations.
Metadata were required to conform to the Genomic Standards Consortium's MIxS and Environment Ontology (ENVO) standards 10,11 . We also used a lightweight application ontology built on top of ENVO: the EMP Ontology (EMPO) of microbial environments. EMPO was tailored to capture two major environmental axes along which micro bial betadiversity has been shown to orient: host association and salinity 1,2 . We indexed the classes in this application ontology (Fig. 1a) as levels of a structured categorical variable to classify EMP samples as hostassociated or freeliving (level 1). Samples were categorized within those classes as animalassociated versus plantassociated or saline versus nonsaline, respectively (level 2). A finer level (level 3) was then assigned that satisfied the degree of environment granularity sought for this metaanalysis (for example, sediment (saline), plant rhizos phere, or animal distal gut). We expect EMPO to evolve as new studies and sample types are added to the EMP and as additional patterns of betadiversity are revealed.
We surveyed bacterial and archaeal diversity using amplicon sequencing of the 16S rRNA gene, a common taxonomic marker for bacteria and archaea 12 that remains a valuable tool for microbial ecology despite the introduction of wholegenome methods (for example, shotgun metagenomics) that capture genelevel functional diversity 13 . DNA was extracted from samples using the MO BIO PowerSoil DNA extraction kit, PCRamplified, and sequenced on the Illumina platform. Standardized DNA extraction was chosen to minimize the potential bias introduced by different extraction methodologies; however, extrac tion efficiency may also be subject to interactions between sample type and cell type, and thus extraction effects should be considered as a possible confounding factor in interpreting results. We amplified the 16S rRNA gene (V4 region) using primers 14 shown to recover sequences from most bacterial taxa and many archaea 15 . We note that these primers may miss newly discovered phyla with alternative riboso mal gene structures 16 , and subsequent modifications not used here have shown improved efficiency with certain clades of Alphaproteobacteria and Archaea 17-19 . We generated sequence reads of 90-151 base pairs (bp) (Extended Data Fig. 2a, Supplementary Table 1), totaling 2.2 billion sequences, an average of 80,000 sequences per sample.
Sequence analysis and taxonomic profiling were done initially using the common approach of assigning sequences to operational taxonomic units (OTUs) clustered by sequence similarity to existing rRNA data bases 14,20 . While this approach was useful for certain analyses, for many sample types, especially plantassociated and freeliving communities, onethird of reads or more could not be mapped to existing rRNA databases (Extended Data Fig. 2b). We therefore used a referencefree method, Deblur 21 , to remove suspected error sequences and provide singlenucleotide resolution 'subOTUs' , also known as 'amplicon sequence variants' 22 , here called 'tag sequences' or simply 'sequences' . Because Deblur tag sequences for a given metaanalysis must be the same length in each sample, and some of the EMP studies have read lengths of 90 bp, we trimmed all sequences to 90 bp for this meta analysis. We verified that the patterns presented here were not adversely affected by trimming the sequences (Extended Data Fig. 3 community structure. Because exact sequences are stable identifiers, unlike OTUs, they can be compared to any 16S rRNA or genomic data base now and in the future, thereby promoting reusability 22 .

Microbial ecology without OTU clustering
While earlier largescale 16S rRNA amplicon studies adopted OTU clustering approaches in part out of concern that erroneous reads would dominate diversity assessments 23 , patterns of prevalence (presenceabsence) in our results suggest that Deblur error removal produced ecologically informative sequences without clustering. After rarefying to 5,000 sequences per sample, a total of 307,572 unique sequences were contained in the 96 studies and 23,828 samples of the 'QCfiltered' Deblur 90bp observation table. Among studies, more than half (57%) of all obtained sequences were observed in two or more studies, but only 5% were observed in more than ten studies; the most prevalent sequence was found in 88 of 96 studies (Extended Data Fig. 4a). Among samples, although most sequences (86%) were observed in two or more samples, only 7% were observed in more than 100 samples (Extended Data Fig. 4b). As expected, the most prevalent sequences were also the most abundant (Extended Data Fig. 4c).
Our analyses were carried out using a modest sequencing depth of 5,000 observations per sample after Deblur and rarefaction. To inves tigate how prevalence estimates were affected by sequencing depth, we focused on four major environment types for which we had the greatest number of samples with more than 50,000 observations (soil, saltwater, freshwater, and animal distal gut). The relationship between average tag sequence prevalence and sequencing depth differed among these environments (Extended Data Fig. 4d) but was generally positive, suggesting that our global analysis underestimated true prevalence. Animalassociated microbiomes were a notable exception, with an upper bound on prevalence apparently imposed by host specificity when all host species were considered (Extended Data Fig. 4e); this bound disappeared when considering only humanderived samples (Extended Data Fig. 4f). Although contamination remains an issue in microbiome studies 24 , most of the very highly abundant and prevalent sequences here had higher mean relative abundances among samples than among notemplate controls (Supplementary Table 3), suggesting that they did not originate from reagents.
Matches between our sequences and existing 16S rRNA gene reference databases highlight the novelty captured by the EMP. Exact matches to 46% of Greengenes 25 and 45% of SILVA 26 rRNA gene databases were found in our dataset, indicating that we 'recaptured' nearly half of the reference sequence diversity with just under 100 environmental surveys. These matches accounted for 10% and 13%, respectively, of the tag sequences in our dataset, indicating that large swathes of microbial community diversity are not yet captured in full length sequence databases. The failure of many sequences to be mapped in referencebased alignments to Greengenes and SILVA 97% identity OTUs (Extended Data Fig. 2b) supports this observation.

Patterns of diversity reflect environment
We used a structured categorical variable of microbial environments, EMPO, to analyse diversity in the EMP catalogue in the context of lessons from previous investigations 1,2 . We observed environment dependent patterns in the number of observed tag sequences (alpha diversity), turnover and nestedness of taxa (betadiversity), and predicted genome properties (ecological strategy). Derived from a more standardized methodology, our dataset confirms the previous finding 2 that host association is a fundamental environmental factor that differentiates microbial communities (Fig. 2c, Extended Data Fig. 2d). We build on this pattern by showing that there is less rich ness in hostassociated communities than in freeliving communi ties (Fig. 2a), with the noted exception of plant rhizosphere samples, which resemble freeliving soil communities in both richness (Fig. 2a) and composition (Fig. 2c). Our findings also confirm the major com positional distinction between saline and nonsaline communities 1 (Fig. 2c). The effect sizes of environmental factors on alpha and betadiversity generally showed large contributions of environment type and (for hostassociated samples) host species to both types of diversity (Extended Data Fig. 5a, b).
The ability to identify sample provenance using only a microbial community profile has applications ranging from criminal forensics to mistaken sample identification; these applications will require large curated datasets, such as the EMP. Supervised machine learning demonstrated that samples could be distinguished as being animal associated, plantassociated, saline freeliving, or nonsaline freeliving with 91% accuracy based solely on community composition, and to finescale environment with 84% accuracy (Extended Data Fig. 5c-e). The most commonly misclassified samples were soil, nonsaline surface and aerosol, and animal secretion. In many of these cases, misclassi fication can be attributed to the limitations of EMPO. As additional samples are classified, classification can be improved by iteratively and empirically redefining categories using machine learning. Conversely, with continuous factors, such as salinity, categorical definitions cannot perfectly capture intermediate values. High classification success to environment type was supported by sourcetracking analyses (Extended Data Fig. 5f, g), with the exception of plant rhizosphere samples, owing to their similarity to soil samples.
Predicted average community copy number (ACN) of the 16S rRNA gene was another metric found to differentiate microbial communities in both hostassociated and freeliving communities (Fig. 2d). ACN can be predicted from 16S rRNA amplicon data 27 ; this method has been used, for example, to link the taxonomic groups associated with copiotrophic and oligotrophic behaviours in soils to high and low rRNA gene copy numbers, respectively 28 . Approximately half the dataset centred on an ACN of 2.2 (freeliving and plantassociated samples) and the other half on an ACN of 3.4 (animalassociated samples) (Fig. 2d). Greater pergenome rRNA operon copy number has been found to be associated with rapid maximum growth rates 29 , which may provide a selective advantage when resources are abundant, such as in animal hosts. While ACN is an estimate rather than a measurement of average rRNA copy number and is subject to potential biases in the underlying reference database, the distributions we observed are consistent with 16S rRNA copy number reflecting differences in ecological strategies among environments.

A resource for theoretical ecology
The coordinated accumulation of data across studies allows investiga tions of patterns within (alphadiversity) and among (beta diversity) microbial communities at scales that vastly exceed what could be measured in any individual study. Patterns of alphadiversity in meta analyses have revealed global trends that have been key to the development of major ideas in macroecological theory, but fundamental patterns have been more difficult to discern in microbial ecology. For example, a nearly ubiquitous tendency towards greater diversity in the tropics is evident in macroecology 30 , but there is substantial variation among studies examining latitudinal trends of microbial diversity 31-33 . The large EMP dataset analysed here reveals a weak but significant trend towards increasing diversity at lower latitudes in nonhost associated environments (Extended Data Fig. 5h). An effect of latitude was apparent both within and across studies, consistent with global trends in latitudinal microbial diversity being an emergent function of locally selective environmental heterogeneity 34 . However, substantial studytostudy variation in richness highlights the caveats inherent in metaanalysis; more coordination of sample collections from similar environments across larger gradients is necessary to better address this question.
The EMP has the potential to link global patterns of microbial diversity with physicochemical parameters-if appropriate metadata are provided by researchers. Microbial community richness has been found to correlate with environmental factors, including pH and temperature 3,33,35,36 . For example, richness has been shown to increase Article reSeArcH up to neutral pH 36 and often to decrease above neutral pH 3,35 in soil communities. Richness has been shown to increase with tempera ture up to a limit and then to decrease beyond that limit in seawater (maximum at about 19 °C) 33 and to increase with temperature in soil (up to at least around 26 °C) 36 . However, general relationships of rich ness to temperature and pH remain unresolved 37 . Here, across samples from nonhostassociated environments where pH or temperature were measured (mostly freshwater and soil environments), richness was greatest near neutral pH (around 7) and relatively cool temperatures (about 10 °C) (Fig. 2b). We observed apparent upper bounds on richness with both temperature and pH that were best fit by twosided exponen tial (Laplace) curves. Thus, the present dataset suggests that maximum microbial richness occurs within a relatively narrow range of interme diate pH and temperature values. These patterns, while robust in the context of the EMP dataset, necessarily reflect only the subset of sam ple types for which variables were measured (Supplementary Table 2); they should therefore be interpreted with caution. Understanding universal relationships between richness and environmental factors will require information from more studies with detailed and carefully collected physicochemical metadata.
Beyond measured physical covariates, the breadth of environments in the EMP catalogue allows a detailed exploration of how microbial diversity is distributed across environments. Diversity among commu nities (betadiversity) is driven by turnover (replacement of taxa) and nestedness (gain or loss of taxa resulting in differences in richness) 38 . If turnover dominates, then disparate communities will harbour unique taxa. If nestedness dominates, then communities with fewer taxa will be subsets of communities with more taxa. We tested for nestedness using a 2,000sample subset with even representation across environ ments and studies. Given the contrasting environments and geographic separation among the many studies in the EMP, we expected different environments to contain unique sets of taxa and to show little nest edness. However, we found that communities across environments were significantly nested (Fig. 3a, b; P < 0.05) in comparison to null models (Fig. 3c), accounting for the observed patterns of richness. At coarse taxonomic levels, an average of 84% of phyla, 73% of classes, and 58% of orders that occurred in less diverse samples also occurred in more diverse samples. Nestedness was observed even when the most prevalent taxa were removed and was robust across randomly chosen subsets of samples (Extended Data Fig. 6). These patterns could have resulted from several mechanisms, including ordered extinctions 39 and the filtering of complex communities over time 40 , differential dispersal abilities 41 and cascading source-sink colonization processes that assemble nested subsets from more complex communities, or by  Greatest maximal richness occurred at values of pH and temperature that corresponded to modes of the Laplace curves. Maximum richness exponentially decreased away from these apparent optima. c, Between community (beta) diversity among in n = 23,828 biologically independent samples: principal coordinates analysis (PCoA) of unweighted UniFrac distance, PC1 versus PC2 and PC1 versus PC3, coloured by EMPO levels 2 and 3. Clustering of samples could be explained largely by environment. d, 16S rRNA gene average copy number (ACN, abundanceweighted) of EMP communities in n = 23,228 biologically independent samples, coloured by environment. EMPO level 2 (left): animalassociated communities had a higher ACN distribution than plantassociated and freeliving (both saline and nonsaline) communities. Right: soil communities had the lowest ACN distribution, while animal gut and saliva communities had the highest ACN distribution.
the tendency of larger habitat patches to support more rare taxa with lower prevalence 42 . Notably, finer taxonomic groupings showed less nestedness (Fig. 3c), indicating that the processes that underlie nested patterns of turnover are likely to reflect conserved aspects of micro bial biology, and not to result from the interplay of diversification and dispersal on short timescales. These global ecological patterns offer a glimpse of what is possible with coordinated and cumulative sampling-in addition to the specific questions addressed by individual studies, context is built and easily queried across studies. They also necessarily highlight the inherent limitations to decentralized studies, especially regarding the collection of comparable environmental data. Future studies will be able to use the current EMP data as a starting point for more explicit tests of broad ecological principles, both to identify gaps in current knowledge and to more confidently plan large directed studies with sufficient power to fill them.

A more precise and scalable catalogue
An advantage of using exact sequences is that they enable us to observe and analyse microbial distribution patterns at finer resolution than is possible with traditional OTUs. As an example, we applied a Shannon entropy analysis to tag sequences and higher taxonomic groups to measure biases in the distribution of taxa. Taxa that are equally likely to be found in any environment will have high entropy and low specificity, whereas taxa found only in a single environment will have low entropy and high specificity (note that we use 'specificity' solely to denote distributional patterns, not to imply adaptation or causality). Tag sequences exhibited high specificity for environment, with distributions skewed towards one or a few environments (low Shannon entropy); by contrast, higher taxonomic levels tended to be more evenly distributed across environments (high Shannon entropy, low speci ficity) (Fig. 4a). Entropy distributions across all tag sequences at each taxonomic level show that this pattern is general (Fig. 4b). Seeking a more precise measure of the divergence at which a taxon is specific for environments, we next investigated how entropy changes as a function of phylogenetic distance. We calculated entropy for each node of the phylogeny and visualized it as a function of maximum tiptotip branch length (Fig. 4c). While entropy decreased gradually at finer phyloge netic resolution, it dropped sharply at the tips of the tree. We conclude that environment specificity is best captured by individual 16S rRNA sequences, below the typical threshold defining microbial species (97% identity of the 16S rRNA gene).
The EMP dataset provides the ability to track individual sequences across the Earth's microbial communities. Using a representative subset of the EMP (Extended Data Fig. 7a), we produced a table of sequence counts and distributions, including among environments (EMPO) and along environmental gradients (pH, temperature, salinity, and oxygen). From this we generated 'EMP Trading Cards' , which promote explora tion of the dataset and here highlight the distribution patterns of three prevalent or environmentcorrelated tag sequences (Extended Data  Table 3). The entire EMP catalogue can be que ried using the Redbiom software, with commandline (https://github. com/biocore/redbiom) and webbased (http://qiita.microbio.me) interfaces to find samples based on sequences, taxa, or sample meta data, and to export selected sample data and metadata (instructions at https://github.com/biocore/emp). User data generated from the EMP protocols can be readily incorporated into this framework: because Deblur operates independently on each sample 21 , additional tag sequences can be added to this dataset from new studies without repro cessing existing samples. Future combinations of datasets targeting the same genomic region but sequenced using different methods may be admissible but would require considerations to account for methodo logical biases.
The growing EMP catalogue is expected to have applications for research and industry, with tag sequences used as environmental indicators and representing targets for cultivation, genome sequencing, and laboratory study. In addition, these tools and approaches, although developed for bacteria and archaea, could be expanded to all domains of life 43 . To achieve greater utility for the EMP and similar projects, we must continually improve metadata collection and curation, ontologies, support for multiomics data, and access to computational resources. Shown is a subset of the EMP consisting of n = 2,000 biologically independent samples with even representation across environments and studies. With increasing sample richness (left to right), phyla tended to be gained but not lost (P < 0.0001 versus null model; NODF (nestedness measure based on overlap and decreasing fills) statistic and 95% confidence interval = 0.841 ± 0.018). b, As in a but separated into nonsaline, saline, animal, and plant environments (P < 0.0001, respective NODF = 0.811 ± 0.013, 0.787 ± 0.015, 0.788 ± 0.018 and 0.860 ± 0.021). c, Nestedness as a function of taxonomic level, from phylum to tag sequence, across all samples and within environment types. Also shown are median null model NODF scores (± s.d.). NODF measures the average fraction of taxa from less diverse communities that occur in more diverse communities. All environments at all taxonomic levels were more nested than expected randomly, with nestedness higher at higher taxonomic levels (for example, phyla).
Article reSeArcH 4 6 2 | n A T U r e | v o L 5 5 1 | 2 3 n o v e m b e r 2 0 1 7

Conclusions and future directions
Here we have used crowdsourced sample collection and standardized microbiome sequencing and metadata curation to perform a global metaanalysis of bacterial and archaeal communities. Using exact sequences in place of OTUs and a learned structure of microbial envi ronments, we have shown that agglomerative sampling can reveal basic biogeographic patterns of microbial ecology, with resolution and scope rivaling data compilations currently available for 'macrobial' ecology 44,45 . Our results point to key organizing principles of micro bial communities, with lessrich communities nested within richer communities at higher taxonomic levels, and environment specificity becoming much more evident at the level of individual 16S rRNA sequences.
The EMP framework and global synthesis presented here represent value added to the scientific community beyond the substantial contri butions of the constituent studies (Supplementary Table 1). However, as with any metaanalysis in which data are gathered primarily in service of separate questions rather than a single theme 46 , conclusions should be viewed with caution and form starting points for future hypothesisdirected investigations. There is a need to span gradients of geography (for example, latitude and elevation) and chemistry (for example, temperature, pH, and salinity) more evenly-assisted by tools for more comprehensive collection and curation of metadata-and to track environments over time. In addition, biotic factors (for example, animals, fungi, plants, viruses, and eukaryotic microbes) not meas ured in this study have important roles in determining community structure 4-6 . The scalable framework introduced here can be expanded to address these needs: new studies to fill gaps in physicochemical space, amplicon data for microbial eukaryotes and viruses, and wholegenome and wholemetabolome profiling. At a time when both academic and governmental agencies increasingly recognize the value of communal biodiversity monitoring efforts 47,48 , the EMP provides one example of a logistically feasible standardization framework to maximize inter operability across diverse and independent studies, in particular using stable identifiers (exact sequences) to enable enduring utility of environmental sequence data. Given current global sequencing efforts, the use of coordinated protocols and submission to this and other public databases should allow rapid accumulation of new data, providing an ever more diverse reference catalogue of microbes and microbiomes on Earth.
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper. Supplementary Information is available in the online version of the paper.

MEthOdS
Study design. This effort was possible because of a unified standard workflow that leveraged existing sample and data reporting standards to allow biomass and metadata collection across diverse environments on Earth. After sample collection, all samples were processed following the same protocols. A standard DNA extraction protocol (http://www.earthmicrobiome.org/protocolsand standards/ dna extractionprotocol) was implemented to ensure that trends observed were due either to the biological system or to biases in extraction potential for organisms from different environmental matrices, and not due to inherent biases in the extraction protocol. To avoid known issues that arise when multiple amplicon strategies are combined 49 , we also standardized PCR primers, amplification strategy, and sequencing 50 . More recent studies not included in this metaanalysis adopted additional primer modifications to allow recovery of key taxa in marine and soil samples [17][18][19] . Data reporting standards, including the MIxS (minimal information about any sequence) metadata standard developed by the Genomic Standards Consortium 10 and the Environment Ontology (ENVO) 11,51 , enabled interoperability, data analysis, and interpretation between samples from disparate environments, collected using many different techniques through unconnected programs of investigation.
To transfer our knowledge of microbial environments to the broader community, we engaged with the developers of ENVO to ensure that the basic, salient features of microbial environments (hostassociated or freeliving, and respectively within those, animal or plantassociated, and saline or nonsaline materials) were rep resented either in this ontology or in those with which it interoperates. For ease of application, we gathered these contributions into an application ontology, the EMP Ontology (EMPO) (Fig. 1a). The EMP community will continue to work with ontology engineers to shape ENVO and other ontologies around the EMPO application ontology. EMPO will be maintained as a logical subset of ENVO and integrated into the ENVO release cycle to maximize interoperation.
Metadata curation was automated using Pandas (http://pandas.pydata.org). The size of the dataset also required extensive software development to support analysis at this scale, leading to tools including the data and analysis portal Qiita (https:// qiita.microbio.me), the BIOM format 52 , new 'OTU picking' methods Deblur 21 and a subsampled openreference procedure 53 , a scalability improvement of Fast UniFrac phylogenetic inference software 54 , speed improvements to sequence insertion tree method SEPP 55 , and speed and feature improvements to Emperor ordination visualization software 56 (http://biocore.github.io/emperor). Sample collection. The global community of microbial ecologists was invited to submit samples for microbiome analysis, and samples were accepted for DNA extraction and sequencing provided that scientific justification and highquality sample metadata were provided before sample submission. Standardized sampling procedures for each sample type were used by contributing investigators. Samples were collected fresh and, where possible, immediately frozen in liquid nitrogen and stored at -80 °C. Detailed sampling protocols are described in publications of the individual studies (Supplementary Table 1). Bulk samples (for example, soil, sediment, faeces) and fractionated bulk samples (for example, sponge coral surface tissue, centrifuged turbid water) were taken using microcentrifuge tubes. Swabs (BD SWUBE dual cotton swabs or similar) were used for biofilm or surface samples. Filters (Sterivex cartridges, 0.2 μ m, Millipore) were used for water samples. Samples were sent to laboratories in the United States for DNA extraction and sequencing: water samples to Argonne National Laboratory, soil samples to Lawrence Berkeley National Laboratory (pre2014) or Pacific Northwest National Laboratory (2014 onward), and faecal and other samples to the University of Colorado Boulder (pre2015) or the University of California San Diego (2015 onward). Metadata curation and EMP ontology. Metadata were collected in compliance with MIMARKS 10 , EBI (https://www.ebi.ac.uk/ena), and Qiita (https://qiita. microbio.me) standards, as described in the EMP Metadata Guide (http://www. earthmicrobiome.org/ protocolsandstandards/metadataguide). QIIME mapping files (metadata) were downloaded from Qiita, merged, and refined using Python with Pandas, generating qualitycontrolled mapping files. Mapping file columns are described in Supplementary Table 2. Mapping files for the full EMP dataset and subsets (see below) are available at ftp://ftp.microbio.me/emp/release1/ mapping_ files/. The EMP Ontology (EMPO) for microbial environments was devised to facilitate the present analysis while preserving interoperability. Coordinated by the ENVO team, annotations from ENVO 11,51 , UBERON (metazoan anatomy) 57 , PO (plant ontology) 58 , FAO (fungal anatomy ontology, http://purl.obolibrary.org/ obo/fao.owl), and OMP (ontology of microbial phenotypes) 59 were mapped to our EMPO levels 2 and 3 (empo_2 and empo_3). Additionally, the free living or hostassociated lifestyles were captured in level 1 categories (empo_1). Descriptions of empo_3 categories are provided at http://www.earthmicrobiome.org/ protocols andstandards/empo. The W3C Web Ontology Language (OWL) document is available at http://purl.obolibrary.org/obo/envo/subsets/envoEmpo.owl. Map data were derived from the opensource project MatPlotLib package Basemap, which distributes map data from Generic Mapping Tools data (http://gmt.soest.hawaii. edu) released under the GNU Lesser General Public License v3. DNA extraction, amplicon PCR, sequencing, and sequence pre-processing. DNA extraction and 16S rRNA amplicon sequencing was done using EMP standard protocols (http://www.earthmicrobiome.org/protocolsand standards/16s) 14 . In brief, DNA was extracted using the MO BIO PowerSoil DNA extraction kit (Carlsbad, CA), chosen because of its versatility with diverse sample types (rather than high yields with any given sample type). Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f-806r 50 with Golay errorcorrecting barcodes on the reverse primer. Although any primerbased method necessarily undersamples diversity, a recent analysis of 16S rRNA genes captured in shotgun metagenomic sequences indicates that this primer pair is among the best available for sampling both bacteria and noneukaryotic archaea 15 . Amplicons were barcoded and pooled in equal concentrations for sequencing. The amplicon pool was purified with the MoBio UltraClean PCR Cleanup kit and sequenced on the Illumina HiSeq or MiSeq sequencing platform; the same sequencing primers were used with both platforms, and previous work has shown that conclusions drawn from 16S rRNA amplicon data are not dependent on which of these sequencing platforms is used 50 . Sequence data were demultiplexed and minimally quality filtered using the QIIME 1.9.1 script split_libraries_fastq.py 60 with Phred quality threshold of 3 and default parameters to generate perstudy FASTA sequence files. Tag sequence and OTU picking and subsets. Sequence data were errorfiltered and trimmed to the length of the shortest sequencing run (90 bp) using the Deblur software 21 ; the resulting 90bp Deblur BIOM table was used for all analyses unless otherwise noted. Deblur tables trimmed to 100 bp and 150 bp were also generated and provided, which contain greater sequence resolution but fewer samples. Deblur observation tables were filtered to keep only tag sequences with at least 25 reads total over all samples. For comparison to existing OTU tables, traditional closedreference OTU picking was done against 16S rRNA databases Greengenes 13.8 25 and SILVA 123 26 using SortMeRNA 61 , and subsampled openreference OTU picking 53 was done against Greengenes 13.8. These unfiltered tables and the filtered and subset tables described below are available at ftp://ftp.microbio.me/ emp/release1/otu_tables.
A total of 97 studies and 27,742 samples are included in the present study and in the unfiltered BIOM tables. The QCfiltered subset used in core diversity analyses (Fig. 2) contains 96 studies and 23,828 samples, and it was subset further for some analyses. In the provided BIOM tables (ftp://ftp.microbio.me/emp/release1/otu_ tables/ and https://zenodo.org/record/890000), the 'release1' set contains all sam ples in the 97 studies that have at least one sequence per sample; this set includes controls (blanks and mock communities). The 'qc_filtered' set, from which the subsets are drawn, has samples with ≥ 1,000 observations in each of four observa tion tables: closedreference Greengenes, closedreference SILVA, openreference Greengenes, and Deblur 90bp; controls (empo_1 = = 'Control') are excluded. Subsets were then generated which give equal (as possible) representation across environments (EMPO level 3) and across studies within those environments. The subsets contain 10,000, 5,000, and 2,000 samples (nested subsets). In each subset the samples must have ≥ 5,000 observations in the Deblur 90bp observation table and ≥ 10,000 observations in each of the closedreference Greengenes, closed reference SILVA, and openreference Greengenes observation tables. Note that Deblur removes approximately onethird to onehalf of sequences owing to sus pected errors, which is consistent with a sequence length of ~ 90-150 bp and an average error rate of 0.006 per position 62 . Comparison against reference databases. To compare the unique sequence diversity in this study to that in existing databases, sequences from the complete Deblur 90bp observation table were compared to the set of unique fulllength sequences from Greengenes 13.8 and the noneukaryotic fraction of Silva 128 data bases using the opensource sequence search tool VSEARCH 63 in global alignment search mode, requiring 100% similarity across the query sequence and allowing multiple 100% reference matches. Prevalence as a function of sequencing depth. The QCfiltered Deblur 90bp observation table was additionally filtered to samples that had at least 50,000 sequences (observations). We chose to focus on four environment types (EMPO level 3) where there were many hundreds of samples with more than 50,000 sequences: soil (n = 2,279), saltwater (n = 478), freshwater (n = 1,508), and animal distal gut (n = 695) environments. For each environment, the observation tables were randomly subsampled to 50, 500, 5,000, and 50,000 sequences per sample. The prevalence of each tag sequence was determined as the number of nonzero occur rences across samples divided by the total number of samples. We then plotted a histogram of tag sequence prevalence at each sampling depth. In order to control for potential study bias, we ran the same analysis on a subset of the observation tables where 30 samples were randomly sampled from each study (studies with fewer than 30 samples with > 50,000 sequences were discarded). To investigate how mean tag sequence prevalence changes with increasing sequencing depth across environments, we calculated the average mean tag sequence prevalence across three replicate rarefactions. We plotted the average and standard deviation in mean prevalence across replicate subsamples over a subsampling gradient (that is, 50, 100, 500, 1,000, 5,000, 10,000 and 50,000 sequences per sample). Greengenes insertion tree. Deblur tag sequences were inserted into the Greengenes reference tree using SEPP 55 , which uses a divideandconquer technique to enable phylogenetic placement on very large reference trees. The SEPP method uses HMMER 64 internally for aligning each Deblurred sequence to a reference Greengenes alignment (gg_13_5_ssu_align_99_pfiltered.fasta) with 99% threshold for clustering (resulting in 203,452 tag sequences) and dividing the reference alignment to subsets with a thousand sequences each. It then uses pplacer 65 to insert the sequences into the reference Greengenes tree (99_otus.tree), dividing it into subsets of size 5,000. The branch lengths on the Greengenes tree were recomputed using RAxML 66 under the GTRCAT model before the placement. The pipeline used, including the reference trees and alignments can be found at ftp://ftp.microbio.me/emp/release1/otu_info/greengenes_sepp_pipeline, and the bash script is available at https://github.com/biocore/emp/blob/master/code/03 otupickingtrees/deblur/run_sepp.sh. Fast UniFrac. Unweighted and weighted UniFrac were computed using the Cythonized 67 implementation of Fast UniFrac 54 in scikitbio 68 . Fast UniFrac by itself was not scalable for the EMP dataset owing to an intermediary data structure required by the algorithm, which scales in space by O(NM), where N is the number of nodes in the phylogeny and M is the number of samples. A workaround was designed and implemented in scikitbio (skbio.diversity.block_beta_diversity) which computes partial distance matrices as opposed to all samples pairwise, enabling large reductions within the intermediary data structure by shrinking M and, in tandem, shrinking N to only the relevant nodes of the phylogeny. This decomposition also allows a classic mapreduce parallel approach with low perprocess space requirements. Further space and time reductions were obtained through the implementation and use of a balancedparentheses tree representation 69 (https://pypi.python.org/pypi/iow). Core diversity analyses: alpha-and beta-diversity. Alphararefaction was computed using single_rarefaction.py in QIIME 1.9.1 60 using as input the Deblur 90bp BIOM table and rarefaction depths of 1,000, 5,000, 10,000, 30,000, and 100,000. Alphadiversity was computed using scikitbio 0.5.0 with the input Deblur 90bp BIOM table rarefied to 5,000 observations per sample, and alphadiversity indices were observed_otus (number of unique tag sequences), shannon (Shannon diversity index 70 ), chao1 (Chao 1 index 71 ), and faith_pd (Faith's phylogenetic diversity 72 , using the Greengenes insertion tree). Fast UniFrac 54 was run on the Deblur 90bp table using the aforementioned approach and the corresponding insertion tree. Principal coordinates were computed using QIIME 1.9.1. Effect size calculations of alpha-and beta-diversity. A version of the mapping file (metadata) was compiled containing the predictors to be tested: study_id, host_scientific_name (a proxy for host taxonomy), latitude_deg, longitude_deg, envo_biome_3 (a proxy for biome or environment), empo_3 (a proxy for sample type or environment generally), temperature_deg_c, ph, salinity_psu, and nitrate_ umol_per_l (a proxy for nutrient levels generally). Predictors chosen were those expected to be less redundant with other predictors not chosen, with the excep tion that there was substantial overlap between study ID and many of the other predictors-because independent studies typically focused on limited sample types from constrained geographic ranges, it is expected that study ID serves as a proxy for a wide range of other measured and unmeasured environmental variables (see Extended Data Fig. 5b). Categories for each predictor were chosen as follows: numerical data were first converted to categories using quartiles; then each category was required to be found in at least 0.3% of all samples (corre sponding to 75 samples); categories that were less common than this were ignored. Note that some predictors in our data have complex nonlinear relationships that multivariate statistical analyses using quartiles may miss, such as the unimodal upper constraintbased richness relationships of temperature and pH. We then tested the effect size of each predictor versus the number of observed tag sequences (alpha diversity) and weighted and unweighted UniFrac distances (betadiversity). Effect size was calculated using a Python implementation of the mixeddirectional false discovery rate (mdFDR) 73,74 . mdFDR reduces the false discovery rates by penalizing the multiple pairwise comparisons within each metadata category and the multiple metadata category comparisons. mdFDR has four steps. First, it performs a pairwise comparison (Mann-Whitney U for alphadiversity, and PERMANOVA for betadiversity) of each group within each category. Second, for each category we calculate a pooled P value based on the P values of all pairwise comparisons for any given category. Third, we apply the Benjamini-Hochberg procedure to the pooled P values and remove nonsignificant metadata categories. Finally, we estimate the effect size of those categories found to be significant in step 3 and that have a pairwise comparison P value greater than (R/m × q i ) × α, where R is the number of categories that were found significant, m is the number of categories that are being compared (the original number of categories in the input mapping file), q i is the number of pairwise comparisons in each given category, and α is the control level for FDR. The effect size for a given metadata column is calculated as the difference of means of each pairwise comparison divided by pooled standard deviation. To further assess the combined effect size of predictors with nonredundant explanatory power on alpha and betadiversity, the non redundant predictors were selected by forward stepwise redundancy analysis with the R package vegan 75 ordiR2step function. This analysis provides an estimate of the relative contribution of each nonredundant predictor to the combined effect size and their independent fraction to the community variation. Average community 16S rRNA gene copy number. The closedreference obser vation table (Greengenes 13.8) was run through the PICRUSt 1.1.0 command normalize_by_copy_number.py script 76 , which divides the abundance of each OTU by its inferred 16S rRNA gene copy number (that is, copy number is inferred from the closest genome representative for a Greengenes 16S rRNA gene reference sequence). Samples with more than 10,000 sequence reads were summed (that is, OTU abundances were summed within each sample) in both the copynumber normalized and original observation tables. The weighted average community 16S rRNA gene copy number (ACN) for each sample was calculated as the raw sample sum divided by the normalized sample sum. Covariation of richness with latitude, pH, and temperature. Measurements of alphadiversity were compared to absolute latitude using a linear mixedeffects model incorporating study ID as a random variable and the interaction of envi ronment and latitude as fixed effects; this was performed on a dataset filtered to include only studies comprising samples that spanned at least 10° of absolute latitude. Correlation of richness with pH and temperature were fitted with a Laplace distribution. The Laplace distribution is a continuous probability distribution that simultaneously captures exponential increase and exponential decrease around a modal value (μ). This distribution is also referred to as the double exponential or twosided exponential because it represents two symmetrical exponential distributions backtoback. The Laplace is particularly useful for testing the biological hypothesis that a system is under strong selection to take a particular value (μ) and that small deviations from μ produce an exponential decrease, for example, in diversity. We tested this hypothesis with regards to how tag sequence richness (S) relates to pH and temperature. We used the upper 99th percentile of tag sequence richness across narrow ranges of pH (100 bins) and temperature (120 bins), meaning that our question pertained to the relationship of maximum tag sequence richness (S max ) to pH and temperature. We compared our expecta tions of exponential decrease in maximum S against the fit to a Gaussian curve, which can also predict a steep symmetrical decrease with small deviations from μ. Random forest classification of samples. Random forest classification models were trained on the 2,000sample subset of the Deblur 90bp observation table to test classification success of samples into the environmental categories from which they came. The R packages caret 77 and randomForest 78 were used. Five repeats of tenfold crossvalidation were used to evaluate the classification accuracy. Confusion matrices were computed to measure the agreement between prediction and true observation. The models were then used to classify the other remaining samples in the full QCfiltered subset. SourceTracker analyses. SourceTracker 79 uses a Bayesian classification model together with Gibbs sampling to predict the proportion of tag sequences from a given set of source environments that contribute to sink environments. We applied SourceTracker 2.0.1 (http://github.com/biota/sourcetracker2) to define the degree to which tag sequences are shared among environmental samples, using the 2,000sample subset of the Deblur 90bp observation table (~ 20% of each sample type) as source samples to train the model, and the remainder as sink samples to test the model. Additionally, we used leaveoneout crossvalidation to predict the sample type of each source sample when that sample type is excluded from the model, in order to evaluate the homogeneity of source samples and independence of each source type. Source and sink samples were rarefied to 1,000 sequences per sample before feature selection and testing. Nestedness of taxonomic composition. Nestedness captures the degree to which elements of a large set are contained within progressively smaller sets. We used the NODF statistic 80 to quantify nestedness of the samplebytaxa matrix. The rows of this matrix correspond to specific taxa grouped at particular taxonomic levels (for example, phylum, class, etc.), while the columns correspond to particular samples. After sorting the matrix from greatesttoleast according to row and column sums, we quantified two aspects of the NODF statistic. The first is a 'row' version of NODF that quantifies the degree to which ranges of less prevalent taxa are subsets of the ranges of more prevalent taxa. The second is a 'column' version of NODF that quantifies the degree to which less diverse communities are subsets of more diverse communities. We employed two null models to better interpret the Article reSeArcH observed values of the NODF statistic. The first is based on a random shuffling of occurrences within each row, holding row sums constant (fixed rows, equiprobable columns), while the second is based on a random shuffling of occurrences within each column, holding column sums constant (equiprobable rows, fixed columns) 81 . The results from both of these null models were qualitatively consistent, so we only report findings using the equiprobable rows, fixed columns model, as it is more consistent with rarefaction of the observation tables. We considered null models at each taxonomic level (phylum, class, order, family, genus), and for all of the samples and each subset of the samples at EMPO level 2. To compute standardized effect scores (SES), we used analytical results based on the hypergeometric distribution to find the expectation and variance of the NODF statistic under both models. SES values were generally very large (> 2); we used Wald tests to compute approximate P values. Environment distribution of taxa and Shannon entropy. For each Deblur tag sequence B, sample s in the set of all EMP samples S, and sample type (EMPO level 3) category E, define : (1) E as the fraction of total appearances of tag sequence B in sample type category E (with N possible values). For a given cluster of tag sequences T (phylogenetic subtree or taxonomic group, for example, Firmicutes), we then calculate cluster distribution vector as where W E combined for all tag sequences in the sequence cluster is given by Clusters of tag sequences were defined in two ways: first, by partitioning using the taxonomic lineage information for the tag sequences; second, by maximum tiptotip branch length for nodes on the phylogenetic tree. To calculate entropy of environment distribution as a function of taxonomic level (for example, phylum), the mean of Shannon entropies for all taxonomic groups belonging to that taxonomic level was calculated (weighted by the number of tag sequences in each taxonomic group). To calculate the entropy as a function of the phylogenetic subtree group width, cluster Shannon entropy was calculated for all subtrees, as well as the maximum tiptotip distance for each subtree. To ascertain whether changes in entropy between taxonomic and phylogenetic levels were expected given the observed distribution of environment entropy among tag sequences, a null model was calculated by randomly permuting the Deblur tag sequence taxonomy associa tions (for the entropy versus taxonomy analysis) or the phylogenetic tip placement (for the entropy versus phylogeny analysis). To reduce the effect of discretization on the entropy calculation in both analyses, clusters of tag sequences were included in the analysis only if they had a minimum of 20 tag sequences. For unique tag sequences (that is, a branch length threshold of 0.0), sequences were required to be found in a minimum of 10 samples. To calculate the approximate branch length corresponding to each taxonomic level, we found the lowest common ancestor for each group and calculated the maximum tiptotip distance in that subtree. EMP trading cards. We started with a BIOM table of 90bp Deblur tag sequences (16S rRNA gene, V4 region), rarefied to 5,000 observations per sample, containing 2,000 samples evenly distributed across environments and studies (Extended Data Fig. 7a). From this we calculated the following: the number, fraction, and rank of samples in which a tag sequence is found; the abundance, fraction, and rank of observations represented by that tag sequence; the taxonomy of the tag sequence from Greengenes; and the list of all the samples in which the tag sequence is found. This summary is located at ftp://ftp.microbio.me/emp/release1/otu_ distributions/. Additionally, for each tag sequence with a trading card in Extended Data Fig. 7b

Article reSeArcH
Extended Data Figure 1 | Physicochemical properties of the EMP samples. Pairwise scatter plots of available physicochemical metadata are shown for temperature, salinity, oxygen, and pH, and for phosphate, nitrate, and ammonium. Histograms for each factor are also shown; the number (n) of samples having data for each factor is provided at the top of each histogram. Samples are coloured by environment, and only QCfiltered samples are included. In sample metadata files, environmental factors are named in our recommended format, with analyte name and units combined in the metadata field name.  Figure 5 | Environmental effect sizes, sample classification, and correlation patterns. a, Effect sizes of predictors on alpha and betadiversity. Maximum pairwise effect size (difference between means divided by standard deviation) between categories of each predictor plotted for observed tag sequences (alphadiversity) and unweighted and weighted UniFrac distance (betadiversity). Response variables (alpha and betadiversity) were derived from the QCfiltered subset of the 90bp Deblur table containing n = 23,828 biologically independent samples. Numeric predictor variables were converted to quartiles (categorical predictors). Categories within each predictor had a minimum of 75 samples per category. b, Cumulative variation explained by the optimal model of stepwise redundancy analysis (RDA) of predictors: study ID, EMPO level 3, ENVO biome level 3, latitude, and longitude (predictors with values for less than half of samples, including host scientific name, were excluded). Environment (EMPO level 3) and biome (ENVO biome level 3) explained as much variation as study ID when study ID was excluded from the RDA. c, Confusion matrix for random forest classifier of samples to environment (EMPO level 3). The classifier was trained on the 2,000sample subset, which was then tested on the remaining samples (QCfiltered samples minus 2,000sample subset). Squares are coloured relative to 100 classification attempts for each true label. Overall success rate was 84%, with the most commonly misclassified sample environments being Surface (nonsaline), Animal secretion, Soil (nonsaline), and Aerosol (nonsaline). d, Receiver operating characteristic (ROC) curve for classification of samples to environment (EMPO level 3). The AUC (area under curve) indicates the probability that the classifier will rank a randomly chosen sample of the given class higher than a randomly chosen sample of other classes. e, Classification success, using a random forest classifier, to EMPO levels 1-3, ENVO material, ENVO feature, and ENVO biome levels 1-3. f, Microbial source tracking: mean predicted proportion of tag sequences from each source environment (EMPO level 3) that occurs in each sink environment. The model was trained on a subset of samples (~ 20% of each environment), and tested to predict tag sequence source composition in all remaining samples. Aerosol (nonsaline), Surface (saline), and Hypersaline samples were not included in this analysis because there were insufficient sample numbers. g, Microbial source tracking: which other environments a sample type most resembles. The model was trained on all source environments except one using a leaveoneout crossvalidated model, and then used to classify each sample included in that group. Hence, the predicted classification proportion of environment X to environment X is zero. h, Correlation of microbial richness with latitude. Richness of 16S rRNA tag sequences per sample across EMPO level 2 environmental categories as a function of absolute latitude. Samples from studies that span at least 10° latitude are highlighted in colour, with linear fits displayed perstudy as matching coloured lines. Samples from studies with narrower latitudinal origins are shown in grey. The global fit for all samples per category is indicated by a dashed black line. n/a Confirmed The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement (animals, litters, cultures, etc.)

Extended Data
A description of how samples were collected, noting whether measurements were taken from distinct samples or whether the same sample was measured repeatedly A statement indicating how many times each experiment was replicated The statistical test(s) used and whether they are one-or two-sided (note: only common tests should be described solely by name; more complex techniques should be described in the Methods section) A description of any assumptions or corrections, such as an adjustment for multiple comparisons The test results (e.g. P values) given as exact values whenever possible and with confidence intervals noted A clear description of statistics including central tendency (e.g. median, mean) and variation (e.g. standard deviation, interquartile range)

Clearly defined error bars
See the web collection on statistics for biologists for further resources and guidance.

Software
Policy information about availability of computer code 7. Software Describe the software used to analyze the data in this study.
Code for reproducing sequence processing, data analysis, and figure generation is provided at github.com/biocore/emp and is archived at zenodo.org with DOI 10.5281/zenodo.XXXXXX. Redbiom code is available at github.com/biocore/ redbiom and is archived at zenodo.org with DOI 10.5281/zenodo.XXXXXX. (Zenodo DOIs will be provided in proof stage, as discussed with the editor.) For manuscripts utilizing custom algorithms or software that are central to the paper but not yet described in the published literature, software must be made available to editors and reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). Nature Methods guidance for providing algorithms and software for publication provides further information on this topic.

Materials and reagents
Policy information about availability of materials 8. Materials availability Indicate whether there are restrictions on availability of unique materials or if these materials are only available for distribution by a for-profit company.
No unique materials were used.

Antibodies
Describe the antibodies used and how they were validated for use in the system under study (i.e. assay and species). c. Report whether the cell lines were tested for mycoplasma contamination.
No eukaryotic cell lines were used.
d. If any of the cell lines used are listed in the database of commonly misidentified cell lines maintained by ICLAC, provide a scientific rationale for their use.
No commonly misidentified cell lines were used.