Bacterial associations in the healthy human gut microbiome across populations

In a microbial community, associations between constituent members play an important role in determining the overall structure and function of the community. The human gut microbiome is believed to play an integral role in host health and disease. To understand the nature of bacterial associations at the species level in healthy human gut microbiomes, we analyzed previously published collections of whole-genome shotgun sequence data, totaling over 1.6 Tbp, generated from 606 fecal samples obtained from four different healthy human populations. Using a Random Forest Classifier, we identified 202 signature bacterial species that were prevalent in these populations and whose relative abundances could be used to accurately distinguish between the populations. Bacterial association networks were constructed with these signature species using an approach based on the graphical lasso. Network analysis revealed conserved bacterial associations across populations and a dominance of positive associations over negative associations, with this dominance being driven by associations between species that are closely related either taxonomically or functionally. Bacterial species that form network modules, and species that constitute hubs and bottlenecks, were also identified. Functional analysis using protein families suggests that much of the taxonomic variation across human populations does not foment substantial functional or structural differences.

Data Pre-processing Trimmomatic 3 was used with the following settings: phred33 MINLEN:60 SLIDINGWIN-DOW:4:15. BowTie2 4 was used for human read filtering was run with the following settings: --very-sensitive. Of the initial 640 samples, one was removed due to contamination (Milkweed yellows phytoplasma) and four samples were removed due to being below three years of age. Read depth analysis was performed by sub-sampling files with 5+ million reads. The sequencing files were randomly sub-sampled at varying depths ranging from 100,000 to 1 million reads. Once the sub-samples were created for each read-depth, the files were then aligned using BowTie2 with the following settings: --very-sensitive --reorder --mp 1,1 --rfg 1,1 -k 1000 -score-min L,0,-0.1. The CLR-transformed relative abundance profiles form the aligned files at each read depth were then compared to the original files using ordinary least squares linear regression. Our results demonstrated that files with greater than 250,000 mapped reads had high agreement.

Read Mapping and species-level Taxonomic Profiling
BowTie2 was run using the following settings: --very-sensitive --reorder --mp 1,1 --rfg 1,1 -k 1000 -score-min L,0,-0.1. Bacterial species abundances were produced by rolling back strain assignments to species level (using accession and tax ids with NCBIs taxonomic assignments), and then summing the relative genome abundance of strains.
WGSim (https://github.com/lh3/wgsim) was used to create simulated WGS reads from two synthetic communities: (i) a mixture of 8 Escherichia coli strains and (ii) a mixture of 9 unique species.

Random Forest Classifier
A random forest classifier (RFC) was used to determine the effects various prevalence thresholds had on classification accuracy. The RFC was trained on the bacterial species present at different prevalence thresholds. The model utilized a 70%-30% train-test split. The model was then randomly re-run 50 times on the same features and the mean F1-scores were reported for the model trained at the species present at each tested prevalence threshold (0%, 20%, 40%, 50%, 60%, 80%, 90%, 100%).

Bacterial Network Construction
To account for the compositional nature of the sequencing data, relative abundances were transformed using the Centered Log-Ratio transformation 5 . To examine the accuracy of our implementation, we generated synthetic data using the HUGE 6 package in R (Version 3.6.3) 7 . The means of the synthetic data were modeled on the CLR-transformed real data to replicate the real data as accurately as possible and five different graph-types were generated (band, cluster, hub, random, and scale-free) at four different sample-to-taxa ratios (0.6, 0.75, 0.86, 1) that closely resembled our real data. One additional sample-to-taxa ratio was also utilized to demonstrate the effect of having larger datasets.

Network Property, Clique, Module, and Node Centrality Analysis
For statistical analysis, Monte Carlo simulations were performed where 1,000 Erdos-Renyi (Gn,p) random networks 8 were generated for comparison to each cohort network where n was the number of nodes within the cohort network and p the density of edges within the cohort network.

Network Modeling
• Species not shown in network models as they had zero edges (associations)