Meta analysis of microbiome studies identifies shared and disease-specific patterns

Hundreds of clinical studies have been published that demonstrate associations between the human microbiome and a variety of diseases. Yet, fundamental questions remain on how we can generalize this knowledge. For example, if diseases are mainly characterized by a small number of pathogenic species, then new targeted antimicrobial therapies may be called for. Alternatively, if diseases are characterized by a lack of healthy commensal bacteria, then new probiotic therapies might be a better option. Results from individual studies, however, can be inconsistent or in conflict, and comparing published data is further complicated by the lack of standard processing and analysis methods. Here, we introduce the MicrobiomeHD database, which includes 29 published case-control gut microbiome studies spanning ten different diseases. Using standardized data processing and analyses, we perform a comprehensive crossdisease meta-analysis of these studies. We find consistent and specific patterns of disease-associated microbiome changes. A few diseases are associated with many individual bacterial associations, while most show only around 20 genus-level changes. Some diseases are marked by the presence of pathogenic microbes whereas others are characterized by a depletion of health-associated bacteria. Furthermore, over 60% of microbes associated with individual diseases fall into a set of “core” health and disease-associated microbes, which are associated with multiple disease states. This suggests a universal microbial response to disease.

We also see an increase in prevalence of organisms often associated with lower pH 208 and higher oxygen levels of the upper-gut, like Lactobacillaceae and Enterobac-209 teriaceae, in patients with diarrhea ( Figure 3A) [42]. Additionally, both CDI 210 and non-CDI diarrhea patients had lower Shannon alpha diversity, a measure 211 of overall community structure, than healthy controls in all studies (Supple-212 mentary Figure 4). Consistent with the CDI and non-CDI diarrheal studies, 213 we also found that organisms associated with the upper gut, like Lactobacillus 214 and Enterobacteriaceae, appear to be enriched in IBD patients, who can present 215 with diarrheal symptoms ( Figure 3A) [42,43]. IBD patients also tended to 216 have lower alpha diversities than controls (Crohn's disease vs. controls in three   [11,12]. 230 We next compared genera across all diseases in order to determine 231 whether some microbes respond to multiple disease states, forming a 232 core response to health and illness. We considered a genus to be part of 233 the "core" microbial response if it was significantly enriched or depleted (q < 234 0.05) in at least one dataset from at least two different diseases. We identified 35 235 health-associated genera and 24 disease-associated genera out of the 139 genera 236 that were significant in at least one dataset (Figure 3). We also found five genera that were both health-and disease-associated (i.e. they were enriched 238 in controls across at least two diseases, but were also depleted in controls in 239 different datasets across at least two diseases) ( Figure 3A, black). Perhaps 240 these genera represent bacteria disproportionately affected by confounders or 241 technical artifacts. Alternatively, these organisms may play different roles across 242 different diseases or community contexts. 243 Here, we identify distinct sub-groups of health-and disease-associated 244 organisms within the Bacteroidetes and Firmicutes phyla, which dom-245 inate the guts of healthy people. The order Clostridiales is associated with 246 health while the orders Lactobacillales, Enterobacterales, and Clostridiales In-247 certae Sedis XI are associated with disease. All but two of the "core" genera 248 in the order Clostridiales were associated with health (24 genera out of 26), 249 comprising the majority of all of the health-associated core microbes. All of 250 the "core" genera in the orders Lactobacillales and Enterobacterales (five and 251 two genera respectively) and four out of five in the order Clostridiales Incertae  [42]. Therefore, these disease-associated taxa 256 may be indicators of shorter stool transit times and disruptions in the redox 257 state and/or pH of the lower intestine, rather than specific pathogens. These 258 "core" genera are consistent with the results from a recent meta-analysis of 259 six metagenomics datasets, which also found Lactobacillales and Clostridiales that had at least one significant (q < 0.05) association, we calculated the per-276 cent of associated genera which were also part of the "core" response in the 277 same direction ( Figure 3B). Strikingly, the majority of responses were not spe-278 cific to individual diseases; on average, 67% of a dataset's genus-level associa-279 tions were genera in the "core" response. In light of this finding, it is crucial 280 that researchers consider these "core" bacteria when interpreting results from 281 their case-control studies. To ensure that an identified microbial association 282 8 is disease-specific, researchers should make sure that it is not part of the universal response by cross-checking their results with an updated list of "core" 284 microbes. Researchers can access an updated list of "core" microbes from this 285 analysis at the MicrobiomeHD database [46], or they can curate their own lists 286 by performing similar cross-disease meta-analyses.

287
The core healthy microbiome is made up of bacteria that are both 288 ubiquitous and abundant across people, whereas bacteria within the 289 core disease microbiome are abundant when present but are not ubiq-290 uitous. We calculated the average abundance (i.e. the total abundance across 291 all patients divided by the number of patients with non-zero abundance) and 292 ubiquity (i.e. the number of patients with the genus present divided by the total 293 number of patients) for each "core" genus. We found that the "core" health-294 associated genera were more ubiquitous than the disease-associated ones, but 295 not necessarily more abundant ( Figure 3C). Thus, presence/absence of core gen-296 era appears to be a better indicator of disease-associated microbial shifts than 297 changes in the overall abundance of these genera. However, a small subset of 298 the core disease-associated genera were relatively ubiquitous across patients.

299
Among the most ubiquitous were Escherichia/Shigella and Streptococcus. Es-300 cherichia includes common commensal strains, as well as pathogenic strains [47], 301 and is frequently present in healthy people's guts as well as over-represented in 302 sick patients. Genera within Enterobacteriaceae, Lactobacillaceae, and Strepto-303 coccaceae families are dominant in the upper gastrointestinal tract [42,48] and 304 are present in many people's stool at low frequency. These taxa likely become 305 enriched with faster stool transit time (i.e. signatures of diarrhea) [42,49].     other conditions show fewer associations. We also find a "core" set of microbes 359 associated with more than one disease and that these "core" microbes comprise 360 the majority of disease-associated microbes within any given study. Therefore, 361 disease case-control studies should be interpreted with extra caution, as the 362 majority of identifed microbial associations are likely to not be indicative of a 363 disease-specific biological difference, but rather a general response to health or 364 disease.

365
The identification of a "core" microbial response is an important concept 366 that should be considered in all future case-control microbiome studies. For 367 example, microbes that are associated with a "core" disease-independent re-368 sponse to illness would not be useful as disease-specific diagnostics or to address diseases. To re-analyze these studies, we applied standard methods commonly 381 used in the field and assumed that the original study designs and patient selec-382 tion methods were adequate. We were reassured to find that a straightforward 383 and standardized approach was able to recover very similar results to those pre-  Table 4). We did not include any studies which required addi-   100 reads and OTUs with fewer than 10 reads, as well as OTUs which were 454 present in fewer than 1% of samples within a study. We calculated the relative 455 2 https://github.com/thomasgurry/amplicon_sequencing_pipeline 12 abundance of each OTU by dividing its value by the total reads per sample.
We then collapsed OTUs to genus level by summing their respective relative 457 abundances, discarding any OTUs which were unannotated at the genus level.

458
All statistical analyses were performed on this genus-level relative abundance 459 data. . We performed all analyses on genus-level relative abundances for each 474 dataset individually, and then compared these results across all studies. 475 We considered a genus to be consistently associated with a disease (Figure   476 3A, bottom) if it was significantly associated (q < 0.05) with the disease in 477 the same direction in at least two studies of that disease. We considered a 478 genus to be part of the "core" microbial associations ( Figure 3A [46]. The code to reproduce all of the analyses in this paper is available at 497 https://github.com/cduvallet/microbiomeHD.

498
Supplementary files, including the q-values for all genus-level comparisons in 499 every dataset, disease-associated genera for the diseases with more than three 500 datasets, and a list of "core" genera are also available at https://github.com/ 501 cduvallet/microbiomeHD.    Total sample size for each study included in these analyses. Additional information about each dataset can be found in Table 1. Studies on the y-axis are grouped by disease and ordered by decreasing sample size (top to bottom). Right: Area under the ROC curve for genus-level random forest classifiers. X-axis starts at 0.5, the expected value for a classifier which assigns labels randomly, and AUCs less than 0.5 are not shown. ROC curves for all datasets are in Supplementary Figure 5. (B) Left: Number of genera with q < 0.05 (FDR KW test) for each dataset. If a study has no significant associations, no point is shown. Right: Direction of the microbiome shift, i.e. the percent of total associated genera which were enriched in diseased patients. In datasets on the leftmost blue line, 100% of associated (q < 0.05) genera are health-associated (i.e. depleted in patients relative to controls). In datasets on the rightmost red line, 100% of associated (q < 0.05) genera are disease-associated (i.e. enriched in patients relative to controls). Supplementary Figures 8 and 9 show q-values and effects for each genus in each study.   Figure 3: The majority of disease-associated microbiome alterations overlap with a "core" microbial response to disease. (A) Core and disease-associated genera. Genera are in columns, arranged phylogenetically according to a PhyloT tree built from genus-level NCBI IDs (http://phylot. biobyte.de). Core genera are associated with health (or disease) in at least two different diseases (q < 0.05, FDR KW test). Disease-specific genera are significant in the same direction in at least two studies of the same disease (q < 0.05, FDR KW test). As in Figure 2, blue indicates higher mean abundance in controls and red indicates higher mean abundance in patients. Black bars indicate mixed genera which were associated with health in two diseases and also associated with disease in two diseases. Core genera are calculated using results from all datasets. Disease-specific genera are shown for diseases with at least 3 studies. Phyla, left to right: Euryarchaeota (brown), Verrucomicrobia Subdivision 5 (gray), Candidatus Saccharibacteria (gray), Bacteroidetes (blue), Proteobacteria (red), Synergistetes (pink), Actinobacteria (green), Firmicutes (purple), Verrucomicrobia (gray), Lentisphaerae (pink), Fusobacteria (orange). See Supplementary Figure 7 for genus labels. (B) The percent of each study's genus-level associations which overlap with the core response (q < 0.05). Only datasets with at least one significant association are shown. (C) Overall abundance and ubiquity of core genera across all patients in all datasets. "Core" genera on the x-axis are as defined above. is that each data set was processed and analyzed in the same way, which allowed 517 us to more directly compare results across studies and diseases.   processing the data, we found very similar results to those originally reported.
We found that alpha diversity was significantly lower in patients with enteric 593 infections (q <= 0.05, KW test). We saw significant enrichment in Proteobactelus, Tetragenococcus, Enterococcus, and Collinsella in diseased patients. In   Bacteroides and Alstipes were more abundant in controls. We found all the associations reported above in our re-analysis. Additionally, we saw higher rel- chloroplast sequences, was associated with ASD children with functional con-857 stipation, but this signal appeared to be due to dietary intake of chia seeds.
Similar to the authors findings, we did not detect any significant differences 859 in genera abundances between ASD children and neurotypical children in the 860 reprocessed data (q > 0.05, Kruskal-Wallis).

861
Taken together, we find no evidence for changes in the composition or diver-862 sity of the gut microbiome in response to ASD. However, we cannot discount 863 subtle dysbiosis (i.e. small effect size) in response to ASD due to the small 864 number of patients in each study. in the controls. We saw no significant differences in our re-analysis of these data.

878
Overall, the original authors report a consistent increase in Bacteroides and 879 depletion in Prevotella genera associated with T1D. However, our re-analysis 880 found that these differences did not pass our significance threshold. Thus, we 881 cannot yet conclude that there is a consistent dysbiosis associated with T1D.   Table 4: Locations of raw data and associated metadata for each dataset used in these analyses.  Figure 4: Reduction in alpha diversity is not a reliable indicator of "dysbiosis." Shannon diversity index across all patient groups in all studies, calculated on OTUs (i.e. not collapsed to genus level, and including unannotated OTUs). Diarrheal patients consistently have lower alpha diversity than nondiarrheal controls (green box). Crohn's disease (CD) patients also show a slight reduction of alpha diversity relative to controls in three out of four IBD studies and ulcerative colitis (UC) patients in two studies (purple box). Obese patients have inconsistent and small reductions in alpha diversity, consistent with a previous meta-analysis [12]. * : 0.01 < p < 0.05, * * : 10 −4 < p < 0.01, * * * : p < 10 −4 . P values are calculated from a two-sided T-test (using scipy.stats.ttest ind) and are not corrected for multiple tests. Note that ob zhu and nash zhu are the same study; the full cohort results are presented only once in this plot (ob zhu).   ) for all genera which were significant (q < 0.05) in at least one dataset, across all studies. Rows are genera, ordered phylogenetically (as in Figure 3A). Columns are datasets, grouped by disease and ordered according to total sample size (decreasing from left to right). The first and second heatmap panels from the left are the same as in Figure 3A. q-values are colored according to directionality of the effect, where red indicates higher mean abundance in patients relative to controls and blue indicates higher mean abundance in controls. Opacity indicates significance and ranges from 0.05 to 1, where q values less than 0.05 are the darkest colors and q values close to 1 are gray. White indicates that the genus was not present in that dataset.

36
No change

More abundant in controls
More abundant in cases Figure 9: Heatmap of log-fold change between cases and controls (i.e. log 2 ( mean abundance in cases mean abundance in controls ) for all genera which were significant (q < 0.05) in at least one dataset, across all studies. Rows are genera, ordered phylogenetically (as in Figure 3A). Columns are datasets, grouped by disease and ordered according to total sample size (decreasing from left to right). The first and second heatmap panels from the left are the same as in Figure 3A. Values are colored according to directionality of the effect, where red indicates higher mean abundance in patients relative to controls and blue indicates higher mean abundance in controls. Opacity indicates fold change and ranges from 1300 to 0, where fold changes greater than 1300 are the darkest colors and fold changes close to 0 are gray. White indicates that the genus was not present in that dataset. Figure 10: Varying Random Forest parameters does not significantly affect AUC of classification of cases from controls (Gini criteria). Random Forest classifiers built by using the Gini impurity ("gini") split criteria. Upward-pointing triangles are classifiers built with 10000 estimators; downward-pointing triangles are built with 1000 estimators. Colors indicate the value of min samples leaf (the minimum number of samples required to be at a leaf node): red = 1, blue = 2, green = 3. X-axes are the value of min samples split (the minimum number of samples required to split an internal node) [56]. All Random Forests were built using the random state seed 12345. Figure 11: Varying Random Forest parameters does not significantly affect AUC of classification of cases from controls (entropy criteria). Random Forest classifiers built by using the information gain ("entropy") split criteria. Upward-pointing triangles are classifiers built with 10000 estimators; downwardpointing triangles are built with 1000 estimators. Colors indicate the value of min samples leaf (the minimum number of samples required to be at a leaf node): red = 1, blue = 2, green = 3. X-axes are the value of min samples split (the minimum number of samples required to split an internal node) [56]. All Random Forests were built using random state seed 12345. the intestinal microbiome in inflammatory bowel disease and treatment.