Mediterranean grassland soil C–N compound turnover is dependent on rainfall and depth, and is mediated by genomically divergent microorganisms

Soil microbial activity drives the carbon and nitrogen cycles and is an important determinant of atmospheric trace gas turnover, yet most soils are dominated by microorganisms with unknown metabolic capacities. Even Acidobacteria, among the most abundant bacteria in soil, remain poorly characterized, and functions across groups such as Verrucomicrobia, Gemmatimonadetes, Chloroflexi and Rokubacteria are understudied. Here, we have resolved 60 metagenomic and 20 proteomic data sets from a Mediterranean grassland soil ecosystem and recovered 793 near-complete microbial genomes from 18 phyla, representing around one-third of all microorganisms detected. Importantly, this enabled extensive genomics-based metabolic predictions for these communities. Acidobacteria from multiple previously unstudied classes have genomes that encode large enzyme complements for complex carbohydrate degradation. Alternatively, most microorganisms encode carbohydrate esterases that strip readily accessible methyl and acetyl groups from polymers like pectin and xylan, forming methanol and acetate, the availability of which could explain the high prevalence of C1 metabolism and acetate utilization in genomes. Microorganism abundances among samples collected at three soil depths and under natural and amended rainfall regimes indicate statistically higher associations of inorganic nitrogen metabolism and carbon degradation in deep and shallow soils, respectively. This partitioning decreased in samples under extended spring rainfall, indicating that long-term climate alteration can affect both carbon and nitrogen cycling. Overall, by leveraging natural and experimental gradients with genome-resolved metabolic profiles, we link microorganisms lacking prior genomic characterization to specific roles in complex carbon, C1, nitrate and ammonia transformations, and constrain factors that impact their distributions in soil.

Data
Genomic data including assemblies and raw reads will be made available under the NCBI BioProject accession number PRJNA449266.
Proteomic data are available through the ProteomeXchange Consortium via the PRIDE partner repository with identifier PXD013110.
Code involved in analysis will be made available at the following GitHub link: https://github.com/SDmetagenomics/Angelo2019_Paper.
A compressed archive of all genomes reconstructed in this study (See Supplementary

Life sciences study design

Sample size
The metagenomics experiment used 60 samples in total, with a structured design of 3 sampling depths, 3 replicate plot locations, and 2 treatment conditions across 5 time points. The proteomics experiments used 20 samples in total. Because of the lower throughput of proteomics instrumentation and downstream analysis, proteomics samples were collected for only 2 sampling depths, 2 replicate plot locations, and 2 treatment conditions across 3 time points. A formal analysis of statistical power was not performed; these sample sizes were chosen based on an evaluation of sample sizes for microbial genome-resolved metagenomics and proteomics experiments in the existing literature, and were made significantly larger to allow comparison of signals across different sample groupings.
Data exclusions
Two types of data were excluded from our analysis:
1) Sequencing reads with low quality scores, as is commonly done prior to assembly of short-read data.
2) Genomes were excluded from our bulk analysis of metabolism and statistical analysis of metabolic traits if they did not meet established criteria for completeness (>70 %) and contamination (<10 %) as measured by the CheckM software package. This was done to limit false negatives when assigning functional information to genomes, and to ensure that the genomes being analyzed are of similar and high quality.

3) Treatment 30-40cm: Treatment 30-40 cm (n = 10) v Control 30-40 cm (n = 10)
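The genome quality criteria stated above (completeness >70 %, contamination <10 %) can be illustrated with a minimal Python sketch. This is not the authors' code: the `GenomeQC` record, the `passes_quality` helper, and the bin names and values are all hypothetical, and the sketch assumes completeness and contamination estimates have already been exported from a tool such as CheckM.

```python
from dataclasses import dataclass


@dataclass
class GenomeQC:
    """Hypothetical record holding quality estimates for one genome bin."""
    name: str
    completeness: float   # percent, e.g. as estimated by CheckM
    contamination: float  # percent, e.g. as estimated by CheckM


def passes_quality(genome, min_completeness=70.0, max_contamination=10.0):
    """Return True if the genome meets both thresholds (strict inequalities)."""
    return (genome.completeness > min_completeness
            and genome.contamination < max_contamination)


# Made-up example values, for illustration only.
genomes = [
    GenomeQC("bin_A", 95.2, 1.3),
    GenomeQC("bin_B", 68.0, 2.0),   # fails: completeness too low
    GenomeQC("bin_C", 88.0, 12.5),  # fails: contamination too high
    GenomeQC("bin_D", 71.0, 9.9),
]

kept = [g.name for g in genomes if passes_quality(g)]
print(kept)
```

Applying both thresholds together means a genome is retained only when it is both sufficiently complete and sufficiently clean, which is the stated rationale for limiting false negatives in functional assignment.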

Replication
We did not repeat the sampling, assembly, and analysis with a different set of soil samples, nor did we split samples and run two separate analyses. These choices reflect, respectively, the cost of performing the initial experiment with large numbers of replicates and the desire to maintain a high number of replicate samples for our statistical analysis.

> Replication in Sampling Location:
The plots used for sampling consisted of 3 biological replicate plot pairs (control and treated with extended rainfall). We feel this level of replication was successful in showing both differences between physical plot locations and fine differences between control and treated plots. Specifically, rainfall-treatment effects were observed reproducibly in the context of plot location (which has a much stronger effect on organism distribution than treatment overall).

> Replication of Analyses Where Permutation was Used:
In some of our statistical analyses we applied permutation-based methods (i.e. MRPP and enrichment permutation tests). Before reporting a final value we repeated these analyses up to 5 times using different random seeds, and no results changed during these tests. However, we report only a single result because we wanted to provide the same starting seed for all permutation-based analyses. The seeds used in the test analyses were chosen at random internally by the computer, to avoid any bias from manual seed entry, and were therefore not recoverable. Thus, short of biological replication of the entire experiment, we feel that testing the permutation-based analyses before reporting a final result was successful in confirming that results were not obtained simply due to outliers generated by the randomization procedures.

Randomization

> Soil Plot Definition:
Soil plots of 70 m^2 circular sampling locations were laid out in a grid across the north meadow of the Angelo Coast Range Reserve, CA, and every other plot in the field was selected to receive the extended rainfall treatment. The plot layout and the treatment layout were evenly distributed across the field and not randomized. Randomization was not performed in defining plots because there was a desire to have balanced numbers of plots from representative locations across the entire field site.

> Physical Soil Plot Sampling:
In our study soil plots were sampled at three depths, from paired plots, in triplicate. The exact sampling location within each plot was chosen randomly for each set of cores, which could include up to 3 depth strata. Because the sampling was destructive, any previously sampled locations were excluded on return visits on later dates. The longitudinal sampling dates were not randomized, as we wanted these dates to fall at specific times before and after natural rainfall events.

> Defining Differentially Abundant Species Groups:
Species Groups (SGs; rpS3 markers clustered at 99% amino acid identity) were determined to be differentially abundant across depths, plots, and treatments using DESeq to assess differences in the counts of reads mapping to these sequences from each of our samples (see Replication for sample numbers). Randomization was not applied to the analysis of these groups as this is not a typical procedure for the analysis of grouped read count data. However, when analyzing the effect of a single variable such as depth or treatment response, we did control for covariates using the linear modeling structure of the DESeq experimental design (i.e. Response = plot_replicate + treatment + date + depth; in this case, to assess the effect of depth, the date of sampling, treatment status of the plot, and plot pair replicate would be controlled for).

> Determining Influence of Metadata Variables:
The statistical significance and strength of influence of plot location, treatment, depth, and time of sampling on the distribution of SGs were assessed using the multi-response permutation procedure (MRPP). In this procedure samples were randomly associated with different metadata variables to determine significance and strength of influence (10,000 permutations). MRPP was performed with the vegan package in R, which uses R's internal random number generation for sample permutation. A seed was set in the code so that the results are reproducible.
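To make the MRPP idea concrete, the following is a minimal, self-contained Python sketch of the general technique. It is not the vegan implementation and not the authors' code (which is in R). It assumes Euclidean distances, group weights proportional to group size, and a seeded generator for label permutation; the function names, the toy `points` and `labels`, and the permutation count are hypothetical choices made only for illustration.

```python
import random
from itertools import combinations
from math import dist


def mrpp_delta(points, labels):
    """Weighted mean of within-group mean pairwise distances (weights n_i / N)."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    n_total = len(points)
    delta = 0.0
    for members in groups.values():
        pairs = list(combinations(members, 2))
        if not pairs:  # a single-member group contributes no distances
            continue
        mean_distance = sum(dist(a, b) for a, b in pairs) / len(pairs)
        delta += (len(members) / n_total) * mean_distance
    return delta


def mrpp(points, labels, permutations=999, seed=42):
    """Return (observed delta, agreement statistic A, permutation p-value)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mrpp_delta(points, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(permutations):
        rng.shuffle(shuffled)  # randomly reassign samples to groups
        permuted.append(mrpp_delta(points, shuffled))
    expected = sum(permuted) / len(permuted)
    a_stat = 1.0 - observed / expected
    p_value = (sum(d <= observed for d in permuted) + 1) / (permutations + 1)
    return observed, a_stat, p_value


# Toy data: two well-separated groups of four samples each (made-up values).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.3, 5.0)]
labels = ["shallow"] * 4 + ["deep"] * 4

observed, a_stat, p_value = mrpp(points, labels, permutations=999, seed=1)
print(f"delta={observed:.3f}  A={a_stat:.3f}  p={p_value:.3f}")
```

A small observed delta relative to the permuted deltas indicates that samples within a group are more similar to each other than expected by chance, and A summarizes the strength of that grouping. Fixing the seed, as the text describes, makes repeated runs return identical values.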

> Determining Phylum and Functional Enrichment Between Sample Groups:
The statistical significance of an observed distribution of a phylum or metabolic function was determined using a custom permutation function written in R, defined in the text, and available in the GitHub code (see Methods). For the group of genomes that made up a distribution during testing (e.g. all genomes that show differential abundance with depth), the observed distribution of these genomes with respect to a variable (e.g. the distribution of Acidobacteria genomes that increased or decreased with depth) was compared to randomly re-sampled sets drawn from the total set of genomes analyzed in our study (n = 793 genomes; 10,000 permutations per phylum or function test). Each permuted set contained the same number of genomes as the observed set. Permutations were performed in R and use R's internal random number generation for sample permutation. A seed was set in the code so that the results are reproducible.
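The re-sampling logic described above can be sketched as follows. This is an illustrative Python approximation, not the authors' R function: it assumes a one-sided test for enrichment, sampling without replacement, and a +1 correction in the p-value, none of which is specified in the text. The function name `enrichment_p_value` and the toy genome labels are hypothetical.

```python
import random


def enrichment_p_value(all_labels, observed_indices, target,
                       permutations=10000, seed=42):
    """One-sided permutation test for enrichment of `target` in a genome subset.

    all_labels       -- label (e.g. phylum) of every genome in the full set
    observed_indices -- indices of genomes in the observed group
    target           -- label whose enrichment is being tested
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    set_size = len(observed_indices)
    observed = sum(all_labels[i] == target for i in observed_indices)
    at_least_as_extreme = 0
    for _ in range(permutations):
        # Draw a random set of the same size from the full genome set.
        sample = rng.sample(all_labels, set_size)
        if sum(label == target for label in sample) >= observed:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (permutations + 1)
    return observed, p_value


# Toy data: 100 genomes, 20 of which carry the target label (made-up values).
all_labels = ["Acidobacteria"] * 20 + ["Other"] * 80
# Observed group of 10 genomes: 8 with the target label and 2 without.
observed_indices = list(range(8)) + [50, 51]

observed, p_value = enrichment_p_value(
    all_labels, observed_indices, "Acidobacteria", permutations=2000, seed=7)
print(observed, p_value)
```

Because each permuted set has the same size as the observed set, the comparison isolates how unusual the observed composition is, independent of group size.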

> Randomization in Other Analyses:
In addition to the analyses explicitly listed, in other instances where permutation is mentioned in the text the randomization of samples was performed in R, using R's internal random number generator. In all cases a seed was set to allow reproduction of results.

Blinding
Investigators were not blinded to group allocation during data analysis in this study. Initial investigatory analysis of the data required the investigators to know the true groupings of the data in order to understand the results of the data clustering and dimension reduction performed at the onset of the analysis.

Reporting for specific materials, systems and methods