# Large-scale microbiome data integration enables robust biomarker identification

## Abstract

The close association between gut microbiota dysbiosis and human diseases is being increasingly recognized. However, contradictory results are frequently reported, as confounding effects exist. The lack of unbiased data integration methods is also impeding the discovery of disease-associated microbial biomarkers from different cohorts. Here we propose an algorithm, NetMoss, for assessing shifts of microbial network modules to identify robust biomarkers associated with various diseases. Compared to previous approaches, the NetMoss method shows better performance in removing batch effects. Through comprehensive evaluations on both simulated and real datasets, we demonstrate that NetMoss has great advantages in the identification of disease-related biomarkers. Based on analysis of pandisease microbiota studies, there is a high prevalence of multidisease-related bacteria in global populations. We believe that large-scale data integration will help in understanding the role of the microbiome from a more comprehensive perspective and that accurate biomarker identification will greatly promote microbiome-based medical diagnosis.

## Main

Microbiomes in the human body have profound impacts on many aspects of human health, especially those in the gut, which are closely associated with the occurrence and development of many diseases1,2. Comprehensive evaluations of the relationship between microbiota and disease are significant to improving health. Biomarkers correspond to biological indicators to measure and evaluate the biological states of individuals, such as differentially expressed genes or differentially abundant bacteria3. Accurate identification of biomarkers helps to facilitate clinical diagnosis and improves clinical prognosis prediction4,5. Most previous studies have identified key bacteria as biomarkers based on variation in abundance between healthy and diseased groups6,7,8. However, confounding factors between studies often mask the real features of microbial communities and thus may lead to unreliable conclusions. Although several studies have sought to address the challenge by correcting statistical parameters or microbial profiles9,10, the dependence on additional clinical information limits their applications. Statistical tools such as combat11 and limma12, which were developed to remove batch effects in the analysis of microarray expression data, also exhibit poor performance due to the sparsity feature of the microbial datasets. Consequently, computational methods for the integration of microbiome data from different cohorts are urgently needed.

In the human gut, the interaction of microbial species, rather than microbes alone, maintains community structure and provides a stable environment for commensals, in which co-occurrence networks contribute to an understanding of the relationship between different taxa13,14,15. A number of studies have demonstrated that the application of co-occurrence networks can simplify the identification of disease-related biomarkers and thus improve clinical prediction models16,17,18. Nevertheless, great challenges in network-based microbiome analysis remain, especially when integrating networks of multiple cohorts. For example, different cutoff selections substantially alter the topological structures of networks and thus correspond to different microbial interactions. In addition, the sample size in microbiome studies also influences the network structure. The most common tactic in data integration is to combine networks directly based on microbial interaction pairs. However, this kind of method fails to consider divergence among different datasets. An integrated network ranking approach that takes perturbation into account has been proposed to predict regulatory genes involved in the host response19, but its application in microbial analysis is still limited.

In this study we have developed an algorithm called Network Module Structure Shift (NetMoss), which focuses on the shift of network modules to evaluate the importance of bacteria between different states. By applying NetMoss to both simulated and real datasets, we demonstrate that it can efficiently reduce batch effects and identify more robust biomarkers that were neglected by traditional abundance-based methods. Furthermore, from a network perspective, we have found that, in pandisease microbiome studies, many gut bacteria are multidisease-related rather than disease-specific. The application of our network-based method greatly improves the efficiency of integrating multiple datasets and promotes the identification of microbial biomarkers for clinical diagnosis.

## Results

### Batch effect confounds integration of large-scale cohorts

Multipopulation cohorts are currently widely used in the analysis of case–control studies; however, one of the most striking problems when integrating different datasets is the batch effect. On the one hand, different studies usually employ various experimental and computational methods during sample collection, processing and data generation, causing extensive biases in microbial profiles. On the other hand, taxon abundance varies substantially in different studies due to divergence in community composition and structure, which may lead to false interactions and network structures (Fig. 1a). For these reasons, direct integration of different datasets may obscure the authentic characteristics of microbial communities and introduce strong bias. Overall, more reliable approaches are urgently needed to integrate large microbiota datasets.

To evaluate potential bias in different microbiota datasets, we collected 2,742 gut microbiota datasets from seven independent colorectal cancer (CRC) studies representing three different countries (China, Germany and the United States; Supplementary Table 1). First, we explored the heterogeneity among different batches. Principal coordinates analysis (PCoA) indicated that the difference among studies was much greater than that between case–control groups (Kruskal–Wallis rank sum test, P < 0.001; Fig. 1b). Similar patterns were also observed in the results of the independent Wilcoxon rank sum test (Fig. 1c) and the blocked Wilcoxon test (Fig. 1d). Among all 665 genera from the seven CRC studies, although 142 were significantly different (false discovery rate (FDR) < 0.01), very few were shared by multiple studies. Specifically, only eight differential bacteria were shared by at least three studies, five by four studies, and none by more than five studies (Fig. 1c). Even for those shared bacteria, the alteration in abundance varied greatly in different studies (Fig. 1d). For example, the genus Fusobacterium was significantly enriched in diseased individuals in most CRC studies but exhibited higher abundance in the healthy group in CRC2 (Fig. 1d; FDR < 0.01). In contrast, the genus Lachnospira was more abundant in the disease group in CRC2, but had higher abundance in the healthy group in other studies. Such discrepancies indicate that the conclusion is less convincing when ignoring batch effects during the integration of different cohorts.

We then attempted to explore the difference of network structure among different batches. Network topology in each CRC study was examined with different tools and different thresholds (Supplementary Fig. 1). Although the topological parameters of all seven studies tended to decrease as the threshold of correlation coefficients increased, the situation varied among studies (Fig. 1e). For example, when the threshold increased from 0 to 0.1, the number of edges in the CRC1 and CRC2 network decreased sharply while those in the other networks dropped more smoothly. The same tendency was observed in the variation of average degree (Fig. 1e), suggesting that the study size affects the topological structure of networks. In addition, small networks appear to be more sensitive to the choice of threshold. Consequently, the selection of different thresholds in network analysis may lead to different conclusions.

For clarity, we constructed co-occurrence networks of the seven studies. The results indicated that the microbial interactions were much weaker in large studies than in small studies (Supplementary Figs. 1 and 2). We speculate that microbial profiles in large cohorts are distributed more evenly; thus, the network structure is much looser than that in small communities. Accordingly, great differences were observed in the comparison of networks of different sizes when the threshold changed rapidly (Fig. 1e). Owing to the lack of appropriate normalization, neither the classic differential abundance method nor the previous integration network method can achieve satisfactory performance for the integration of various microbiota datasets.

### Integration of networks using a univariate weighting method

Considering that the networks constructed from large studies exhibit weak microbial interactions, directly integrating datasets with different sizes into one network might mask the real microbial features of large datasets. To address this, a univariate weighting method was introduced in our analysis, whereby a greater weight was assigned to the larger dataset to increase its contribution in the final integrated network (Fig. 2a). We first verified the method in a pairwise permutation test, in which any two networks were integrated into one network. As shown in Fig. 2b, among all combinations of two study groups, the large study had a greater contribution to the integrated network, with the contribution increasing with the sample size of the included study. Similar results were observed in the integration of all seven networks: the larger the community size, the greater its contribution to the final integrated network (Fig. 2c), suggesting that this univariate weighting method can efficiently highlight the strength of large studies in the final network and reduce bias in the process of integration.

To further verify whether the univariate weighting method can remove batch effects in the process of integration, we permutated the seven studies to generate 127 different integrated networks (the number of studies included in the integrated networks ranged from one to seven). Compared to the situation with the traditional unweighted method, the distribution of network distance with the univariate weighting method not only exhibited a more even pattern (Fig. 2d), but also decreased more sharply with an increasing number of studies included in the integrated networks (Fig. 2e), demonstrating that integration of different datasets based on the univariate weighting method can reduce heterogeneity among studies. Notably, the univariate weighting method also showed a significantly higher correlation between network distance and sample dissimilarity (Fig. 2f; P < 0.001), suggesting its better performance in describing variation among studies. To explore the differences among different methods, we constructed networks using four different strategies: (1) integrating datasets simply based on abundance without removing batch effects (unprocessed); (2) integrating datasets based on the univariate weighting method; (3–4) integrating datasets based on abundance and removing batch effects using combat (3) or limma (4). Consistent with previous studies, the traditional methods showed inferior performance in batch effects removal on the microbial datasets (Supplementary Fig. 3). By contrast, the univariate weighting method showed a lower distance between the final integrated network and seven original networks, indicating its good performance in capturing original biological features (Fig. 2g).
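The count of 127 integrated networks corresponds to every non-empty combination of the seven studies (2^7 − 1). A minimal sketch of that enumeration follows; the labels `CRC1` to `CRC7` are assumed here for illustration, and the network construction itself is not shown.

```python
from itertools import combinations

# Hypothetical study labels; only CRC1 and CRC2 are named explicitly in the text.
studies = [f"CRC{i}" for i in range(1, 8)]

# Every non-empty subset of studies, grouped by the number of studies included.
subsets = [
    combo
    for size in range(1, len(studies) + 1)
    for combo in combinations(studies, size)
]

# Number of subsets per size, e.g. 7 single-study networks and 1 seven-study network.
per_size = {size: sum(1 for s in subsets if len(s) == size) for size in range(1, 8)}
```

Each subset would then be passed to the chosen integration method (weighted or unweighted) to obtain one integrated network.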

### Prediction of transition using a network-based algorithm

To delineate the transition from health to disease and to identify key bacteria during this process, we propose a NetMoss algorithm to perform network-based differential analysis (Fig. 3a; more details are provided in the NetMoss algorithm section in the Methods). We first generated two simulated networks to confirm that the NetMoss algorithm is able to measure variation in network structure between different states (Fig. 3b,c). After perturbation, 30 out of 40 submodules transitioned from module 1 to module 2, implying an alteration from health to disease (Fig. 3c). We then calculated the NetMoss scores of these 40 taxa in the integrated network to confirm whether our method can distinguish transited submodules from others. The results showed that a majority of transited submodules (86.7%) could be predicted by the NetMoss score (Fig. 3d), indicating its great performance in identifying driver bacteria associated with state transition.

To further evaluate the performance of the NetMoss method, the Neighbor Shift (NESH) score20 and the Jaccard Edge Index (JEI), which are both used to measure the variation of nodes in networks, were introduced to compare with our method. We re-perturbed the simulated network and added different noises to benchmark whether the three methods could identify transited submodules correctly. As shown in Fig. 3e, when random noises were added to taxon 81 to taxon 120, the NetMoss method outperformed the other two methods on distinguishing transited submodules from others. We then altered the noise level on the simulated networks and found that the area under the curve (AUC) of NetMoss remained high and stable (average AUC = 0.95; Fig. 3e, Supplementary Fig. 4 and Supplementary Table 2), further demonstrating its good performance and consistency on different community types. When perturbation occurs, the connection of bacteria changes as the structure of the network changes. Unlike NESH and JEI, the NetMoss algorithm not only takes node connection into consideration, but it also quantifies the node distance between different modules. Even a slight change in the network structure can be detected based on this module shift strategy, so the NetMoss method shows great advantages in the identification of biomarkers compared to other network-based methods.

### Identification of biomarkers in integrated CRC networks

To identify disease-related bacteria, we integrated seven CRC studies into two integrated networks (case and control; Fig. 4a) and found that great differences existed between the case and control groups (Fig. 4b). For example, compared to the control group, Actinobacteria in the case group was greatly decreased, but Firmicutes was more abundant (Fig. 4b). In particular, in the small modules of the case group, the microbial composition was very simple, and among the four most common bacterial phyla, only Firmicutes was detected (Fig. 4b). Such distinctions indicated that the lack of certain bacteria in the microbial network may be associated with the transition from health to disease. We further retrieved 66 CRC-relevant bacteria from the gutMDisorder database21 and found that the connection strength of marker bacteria was significantly higher than that of nonmarker bacteria in both the case and control network modules (Fig. 4c; P < 0.001, Wilcoxon test), suggesting the crucial role of these marker bacteria in integrated networks. Consequently, it would be an efficient strategy to determine disease-related bacteria from case–control network comparisons using the NetMoss method.

We then evaluated the accuracy of the NetMoss method using 66 known CRC-relevant bacteria, with 55 of them being present in the combined CRC datasets. A classic statistical test was used to identify differentially abundant bacteria between case and control groups, and the NetMoss score was used to assess the importance of bacteria in the integrated networks. Among the bacteria identified by the two methods, only 32% were marker bacteria using the statistical test method (FDR < 0.05); in contrast, 68% were successfully identified when NetMoss was implemented (NetMoss score > 0.12), suggesting that our network-based method substantially improves the efficiency of the identification of disease-related bacteria (Fig. 4d,e). In particular, some genera showed a relatively high NetMoss score (> 0.99), for example Faecalibacterium, Acinetobacter and Parvimonas, which have been demonstrated to be associated with human health or disease (Fig. 4e). It should be noted that only 10 out of 43 taxa were identified by both methods, indicating that the abundance-based method and the network-based method should complement each other in differential microbiota analysis.

To further explore the differences among the various methods, we examined the prediction power using six different strategies. The NetMoss group integrated datasets and identified markers using our network-based workflow (Fig. 3a), and the other five groups integrated datasets and identified markers based on abundance, two of which were further processed using combat or limma to remove batch effects. We observed that the efficiency of the traditional abundance-based method was very low, ranging from 16% to 25%, and most CRC markers could not be identified, no matter whether batch effects were removed or not (Supplementary Figs. 5 and 6a). By contrast, the NetMoss method exhibited a much higher AUC among these groups in both the combined and uncombined datasets (Fig. 4f–i and Supplementary Fig. 6b,c), demonstrating its robustness to different batches and its advantages in large-scale microbiome data integration. As well as at the genus level, the efficiency of NetMoss on both amplicon sequence variants (ASVs) and species levels was also robust (Supplementary Fig. 7). In the CRC integrated networks, only 116 submodules (17.4%) changed between healthy and diseased groups in all modules, and such slight variations could not be recognized by other methods. The NetMoss algorithm, however, focuses on module shift and is more sensitive to perturbation between different networks.

### Application in pandisease microbiota studies

Considering the complementary role of abundance-based and network-based methods in identifying disease-related bacteria, we further applied them to other diseases to determine the common characteristics of microbiota changes in these diseases. We analyzed 11,377 microbiota samples from public studies of different diseases (Supplementary Table 1) and found that, compared with the abundance-based method, the NetMoss method identified many more bacteria associated with disease (Fig. 5a). Intriguingly, these key bacteria exhibited two different patterns: some only correlated with a specific disease (disease-specific bacteria), whereas others exhibited wide associations with multiple diseases (multidisease-related bacteria; Fig. 5a). Unexpectedly, the latter accounted for the majority in all differential bacteria (Supplementary Fig. 8a). For example, many genera of Enterobacteriaceae and Lachnospiraceae, known as opportunistic pathogens, were found to be associated with infection and multiple diseases, such as CRC, diarrhea and type 2 diabetes6,22,23. We identified some of them as biomarkers in over five diseases (Fig. 5a and Supplementary Fig. 8b). In addition, several studies have reported strong associations between hepatitis B virus infection and Streptococcus or Bacteroides, which are also key bacteria in the occurrence of gestational diabetes mellitus24,25,26. Although a certain degree of abundance difference was observed between the case and control groups, most associations between diseases and these bacteria were only identified by using NetMoss (Fig. 5a).

We then focused on the five most prevalent diseases in the public datasets (Supplementary Table 1). We examined the prevalence of disease-specific bacteria and multidisease-related bacteria in each study and found that most biomarkers are multidisease-related bacteria (Fig. 5b). For example, four bacteria exhibited substantial differences between healthy and diseased groups in five diseases, but the number rose to 63 when only two diseases were included (Fig. 5c). Moreover, compared to disease-specific bacteria, multidisease-related bacteria were found to be much more abundant in both healthy and disease populations, confirming the importance of these biomarkers in the human gut.

To explore the role of multidisease-related bacteria in the development of disease, we compared the differences of network structure of five diseases. We found that multidisease-related bacteria showed a closer network connection and a higher NetMoss score compared to specific bacteria (Fig. 5d,e and Supplementary Fig. 8c; P < 0.05, Wilcoxon test). Such vital roles in the microbial networks suggested they may act as drivers in the development of multiple diseases. To further examine the association among multiple diseases, we integrated five disease networks into one combined network. Interestingly, unlike healthy controls, taxa from different diseases were largely separated from each other, with multidisease-related bacteria locating in the hub regions of the combined network (Fig. 5f and Supplementary Fig. 8e). Such opposite network structures between healthy and diseased groups further demonstrated the importance of microbial interaction networks in exploring the contribution of gut microbiota to various diseases.

## Discussion

Although some algorithms and tools have been developed to tackle the problem of batch effects in meta-analysis27,28, most examine differential bacteria based on abundance; however, in human gut communities, microbes frequently interact with one another, forming a closely connected network14. Perturbation from outside may alter the structure of the network and change cooperative or competitive relationships among the bacteria. For this reason, the abundance of specific bacteria cannot describe the whole picture of the ecosystem, let alone the transition from health to disease. Network analysis has been widely used in various biological systems16,29,30. However, as the structure of the network is often associated with various factors, such as the size of datasets, selection of cutoffs and construction methods, it is difficult to compare networks from different studies directly. Consequently, integration of different networks is necessary to understand microbial interactions. Considering the robust characteristic of gut communities, a slight perturbation often imposes little effect on the whole structure of the microbial networks, and the distinction may only manifest as small variations in network submodules. Therefore, focusing on such variations of submodules may be a reasonable strategy to discriminate disease-associated bacteria from others.

CRC is one of the most common cancers across the world, and colorectal tumorigenesis is highly associated with gut microbial dysbiosis31. However, in contrast to genetic signatures, the gut microbiota associated with CRC is highly dependent on environmental factors such as diet and life style, which differ greatly in different countries, especially between western and non-western populations32. Such divergence poses a significant challenge to the early diagnosis of CRC based on microbial biomarkers and usually leads to contradictory results in different microbiome studies. By utilizing the NetMoss algorithm, we revealed the importance of Lactobacillus in the occurrence of CRC, which was demonstrated to have a protective effect for CRC33,34, although it did not show a significant difference in abundance between case and control groups. This finding highlights the advantage of network-based methods for the identification of abundance-insensitive biomarkers. However, the NetMoss method still has some limitations. In practice, the clinical progress of CRC can be divided into several distinct stages characterized by different symptoms and divergent gut microbial compositions35,36. The state of patients, such as medicated or not, may also affect their gut microbial structures, resulting in inter-individual variation. However, such metadata are generally not available from public datasets, which makes it difficult to utilize such information in the NetMoss method. Taking detailed clinical factors into consideration will undoubtedly improve the accuracy of biomarker identification and may represent a new direction towards deep mining of clinical microbiome data.

The NetMoss method greatly facilitates the identification of significant biomarkers in the transition from health to disease and helps contribute to our understanding of the roles of the human microbiota in networks of ecosystems. With the integration of multiple cohorts based on this network-based algorithm, we believe that divergence among different studies can be largely reduced and that neglected details can be elucidated from a more comprehensive perspective.

## Methods

### Datasets

We collected human gut microbiota datasets of different diseases from the National Center for Biotechnology Information (NCBI) to construct a multipopulation cohort. The keywords we searched in PubMed included ‘gut microbiota’, ‘16S’, ‘human’, ‘stool’ and ‘microbiome’. Only samples from adult stool were retained for downstream analyses. In total, 5,608 fecal samples of diseased individuals and 5,769 samples of healthy control individuals from 78 studies were collected in our research, covering 13 countries (Canada, China, Denmark, Finland, France, Germany, India, Italy, Japan, Mexico, Spain, Sweden and the United States). Among the 37 kinds of disease included in our research, CRC was the most prevalent, with 1,455 disease samples and 1,287 healthy samples. Accordingly, the processes of method development, validation and differential analysis were mainly based on CRC cohorts.

### Analysis of 16S rRNA sequences

The raw data of 16S rRNA gene sequencing were analyzed using the QIIME 2 platform (ref. 37; v2020.2). In brief, the DADA2 plugin was used to filter the sequencing reads and construct an ASV feature table. The taxonomy information for the ASVs was assigned against the Silva Database (https://www.arb-silva.de) (v138.1) using the classify-sklearn algorithm in the feature-classifier plugin. Low-abundance ASVs, whose relative abundance did not reach 0.1% in at least 10% of the samples, were excluded.
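The low-abundance filter can be illustrated with a short sketch. It assumes a samples-by-ASVs count matrix and reads the rule as "keep an ASV if its relative abundance is at least 0.1% in at least 10% of samples"; the function name and arguments are hypothetical and are not part of QIIME 2.

```python
import numpy as np

def filter_low_abundance(counts, min_rel_abundance=0.001, min_prevalence=0.10):
    """Keep ASVs (columns) that reach min_rel_abundance in at least
    min_prevalence of the samples (rows). Returns (filtered counts, mask)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Relative abundance per sample; empty samples stay at zero.
    rel = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Fraction of samples in which each ASV reaches the abundance cutoff.
    prevalence = (rel >= min_rel_abundance).mean(axis=0)
    keep = prevalence >= min_prevalence
    return counts[:, keep], keep
```

The returned mask can be reused to subset the taxonomy table so that both stay aligned.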

### Network analysis

The co-occurrence network of microbes was constructed with SPIEC-EASI38. The topological coefficients of the network were calculated using the R package ‘igraph’39. For each case or control group, studies of different sizes were integrated into one network using the univariate weighting method to remove batch effects. Considering that the networks constructed from large studies exhibit weak co-abundance patterns, greater weights were assigned to the large networks to increase their contribution to the final integrated network and thus reduce bias in the integration. The procedure was implemented as follows.

First, a co-occurrence network was constructed based on the abundance matrix of each study. Next, every pair of co-abundance patterns between any two taxa was aligned across studies, and missing co-abundance patterns were filled with the value 0. A univariate weighting method was then used to assign different weights to the pairs of co-abundance patterns based on the size of each study. During this process, the Hedges and Olkin method40 was used to evaluate the conditional deviation of the correlation coefficient in each study. For study i with ni samples, the conditional deviation vi of the correlation coefficient ri was calculated as

$${v_i} = {\frac{{1 - {r_i^2}}}{{{n_i} - 1}}}$$

The weighted correlation ρ of each pair of co-abundance patterns was defined as

$${\rho} = {\frac{{\mathop {\sum }\nolimits_{i = 1}^{k} {w_i}{r_i}}}{{\mathop {\sum }\nolimits_{i = 1}^{k} {w_i}}}}$$

where wi is the reciprocal of vi, and k represents the number of studies.
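The two formulas above amount to an inverse-variance weighted mean of the per-study correlations. The following is a minimal sketch for a single taxon pair; the function name is hypothetical, and it assumes |ri| < 1 and ni > 1 so that every vi is positive.

```python
import numpy as np

def pooled_correlation(r, n):
    """Inverse-variance weighted mean of per-study correlations for one taxon pair.

    r: correlation coefficient r_i in each of the k studies (0 where the pair is missing)
    n: sample size n_i of each study
    Assumes |r_i| < 1 and n_i > 1, so that v_i > 0.
    """
    r = np.asarray(r, dtype=float)
    n = np.asarray(n, dtype=float)
    v = (1.0 - r**2) / (n - 1.0)  # v_i = (1 - r_i^2) / (n_i - 1)
    w = 1.0 / v                   # w_i is the reciprocal of v_i
    return float(np.sum(w * r) / np.sum(w))
```

Because wi grows with ni, a correlation observed in a large study pulls ρ towards its own value more strongly than the same correlation observed in a small study.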

To reflect real ecological processes, module division was conducted using WGCNA41, such that microbes interact cooperatively with one another within a single module and competitively between any two modules. In the weighted networks, the connection strength of node i was defined as the sum of the connections between this node and all other nodes in the network, as

$${k_i} = {\mathop {\sum }\limits_{j = 1}^{n}} {a_{ij}}$$

where aij represents the correlation coefficient between node i and node j.

To highlight the importance of nodes in the network module structure, we redefined the connection strength of node i as

$${k_i} = {\mathop {\sum }\limits_{j = 1}^{n_1}} {a_{ij}} - {\mathop {\sum }\limits_{j = {n_1} + 1}^{n}} {a_{ij}}$$

where n represents the number of all nodes in the network and n1 represents the number of nodes inside a specific module.
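The redefined connection strength can be sketched as below. The function name is hypothetical, module membership is supplied as a label per node rather than by reordering nodes, and self-connections are excluded, which is an assumption the text does not state.

```python
import numpy as np

def module_connection_strength(a, modules):
    """Within-module minus between-module connection strength of every node.

    a: n-by-n symmetric matrix of correlation coefficients a_ij
    modules: length-n array of module labels, one per node
    """
    a = np.asarray(a, dtype=float).copy()
    modules = np.asarray(modules)
    np.fill_diagonal(a, 0.0)  # assumption: a node's link to itself is ignored
    same = modules[:, None] == modules[None, :]
    within = np.where(same, a, 0.0).sum(axis=1)    # sum over nodes in the same module
    between = np.where(~same, a, 0.0).sum(axis=1)  # sum over nodes in other modules
    return within - between
```

A node that is tightly linked inside its module and weakly linked outside receives a high value, whereas a node with mostly cross-module links receives a low or negative one.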

### NetMoss algorithm

The NetMoss score was used to measure the driving force of every node in the transition of the network structure. This was defined as follows. First, the correlation matrix of the health state was A = [cij] and the correlation matrix of the disease state was $${A^\prime} = {\left[ {c_{ij}^\prime } \right]}$$, where cij is the correlation coefficient of node i and node j:

$$c_{ij} = {\rm{cor}}(i, j)$$

To obtain the optimized module structure, linear transformation was implemented to convert cij to sij

$${s_{ij}} = {\frac{{1 + {c_{ij}}}}{2}}$$

Thus, the correlation matrices of the health and disease states after transformation were B = [sij] and $${B^\prime} = {\left[ {s_{ij}^\prime } \right]}$$, respectively.

The soft threshold β was calculated based on the WGCNA algorithm, and the weighted network aij was

$${a_{ij}} = {\left| {s_{ij}} \right|^\beta}$$

Thus, the weighted correlation matrices of health and disease states were C = [aij] and $${C^\prime} = {\left[ {a_{ij}^\prime } \right]}$$, respectively.

The topological overlap matrix ωij of node i and node j was calculated as

$${\omega _{ij}} = {\frac{{{l_{ij}} + {a_{ij}}}}{{\min \left\{ {{k_i},\,{k_j}} \right\} + {1} - {a_{ij}}}}}$$

where $${l_{ij}} = {\mathop {\sum }\nolimits_u}{a_{iu}}{a_{uj}}$$ and $${k_i} = {\mathop {\sum }\nolimits_u}{a_{iu}}$$; u represents other nodes besides node i and node j in the network.

The distance between node i and node j was defined as

$${d_{ij}} = {1} - {\omega _{ij}}$$

Then, the distance matrices of the health and disease states were D = [dij] and $${D^\prime} = {\left[ {d_{ij}^\prime } \right]}$$, respectively.
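The chain from correlation to distance (cij to sij to aij to ωij to dij) can be sketched for one state as follows. The function name is hypothetical, β is passed in rather than estimated, and setting the diagonal of a to 0 and the diagonal of ω to 1 is an assumption.

```python
import numpy as np

def tom_distance(c, beta):
    """Topological-overlap distance matrix from a correlation matrix.

    c: n-by-n symmetric correlation matrix [c_ij]
    beta: soft threshold (assumed to be chosen beforehand)
    """
    c = np.asarray(c, dtype=float)
    s = (1.0 + c) / 2.0                # s_ij = (1 + c_ij) / 2
    a = np.abs(s) ** beta              # a_ij = |s_ij|^beta
    np.fill_diagonal(a, 0.0)           # assumption: no self-connections
    l = a @ a                          # l_ij = sum_u a_iu a_uj (u = i, j add nothing)
    k = a.sum(axis=1)                  # k_i = sum_u a_iu
    omega = (l + a) / (np.minimum.outer(k, k) + 1.0 - a)
    np.fill_diagonal(omega, 1.0)       # assumption: a node fully overlaps with itself
    return 1.0 - omega                 # d_ij = 1 - omega_ij
```

Applying the same function to the health and disease correlation matrices gives D and D′, which can then be clustered into modules.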

Module division was conducted in matrices D and D′, with matrix D containing n modules and matrix D′ containing m modules. The number of intersection modules of D and D′ was K (K ≤ mn). The average distance of every node to each intersection module was calculated in the health distance matrix D and in the disease distance matrix D′, yielding the N-by-K matrices $${D_{\rm{mod}}}$$ and $${D_{\rm{mod}}^\prime}$$, respectively, where N is the number of nodes. The differential module distance matrix was defined as

$${\Delta D} = {D_{{\rm{mod}}}^\prime} - {D_{\rm{mod}}}$$

The NetMoss score of node i in any intersection module is

$${\rm{NMSS}}{{\left( i \right)}_{A \to B}} = {\mathop {\sum }\limits_j^{{\rm{Neighbors}}{A}}} {\Delta {D_{ij}}} - {\mathop {\sum }\limits_l^{{{\rm{Neighbors}}{B}}}} {\Delta {D_{il}}}$$

where A and B represent the health and disease networks, respectively; NeighborsA represents all neighboring modules in the health network, and NeighborsB represents all neighboring modules in the disease network.

The intersection modules represent the stable elements during the transition from health to disease, where the transited modules resulted in alteration of the network structure. The NetMoss algorithm measures the driving force in the transition of the network structure to evaluate the importance of every node.
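One possible reading of the score is sketched below. Everything beyond the formulas is an assumption: the function names are hypothetical, each node carries one intersection-module label, the node-to-module distance is taken as the mean distance to the module members, and the neighboring modules of each node in the two networks are supplied by the caller as lists of column indices.

```python
import numpy as np

def module_mean_distance(d, labels, module_ids):
    """N-by-K matrix of mean distances from every node to each intersection module."""
    d = np.asarray(d, dtype=float)
    labels = np.asarray(labels)
    return np.column_stack([d[:, labels == m].mean(axis=1) for m in module_ids])

def netmoss_like_score(d_health, d_disease, labels, module_ids,
                       neighbors_health, neighbors_disease):
    """Sum of Delta-D over health-network neighbor modules minus the sum over
    disease-network neighbor modules, for every node.

    neighbors_health / neighbors_disease: for each node, a list of column
    indices (into module_ids) of its neighboring modules in that network.
    """
    d_mod = module_mean_distance(d_health, labels, module_ids)
    d_mod_prime = module_mean_distance(d_disease, labels, module_ids)
    delta = d_mod_prime - d_mod  # differential module distance matrix
    scores = np.empty(delta.shape[0])
    for i in range(delta.shape[0]):
        scores[i] = (delta[i, neighbors_health[i]].sum()
                     - delta[i, neighbors_disease[i]].sum())
    return scores
```

Under this reading, a node whose distances to its health-state neighbor modules grow while its distances to its disease-state neighbor modules shrink receives a high score.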

### Module shift simulation and random noise production

To verify the effect of the NetMoss algorithm in the identification of network structure, simulated networks were generated. Considering the sparsity of microbiota networks, we developed an algorithm to generate a simulated correlation matrix with a certain module structure and added random noise to the matrix to simulate natural disturbance.

A total of gk unit vectors were selected from vector space $${R^{M_k}}$$ to construct an Mk-by-gk matrix Uk. The kth module was represented by a gk-by-gk matrix Σk:

$${U_k} = {\left( {{u_1}|{u_2}| {\ldots} |{u_{g_k}}} \right)}$$
$${\Sigma_k} = {{\rho _k}{\left( {{U_k^T}{U_k}} \right)} + {\left( {1 - {\rho _k}} \right)}{I}}$$

where k represents the number of modules in the simulation matrix, gk represents the size of the kth module, Mk represents the variation range of the correlation coefficient inside the kth module, ρk represents the maximum correlation coefficient inside the kth module, and I represents the gk-by-gk identity matrix.

Accordingly, an N-by-N matrix was constructed, with modules Σ1, Σ2, ..., Σk arranged on the diagonal in order and the area outside the modules filled with 0.
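
The construction can be illustrated with the following sketch. How the unit vectors were drawn is not specified, so normalized Gaussian vectors are used here as an assumption. The function names, argument names and seed are hypothetical.

```python
import numpy as np

def module_block(g, M, rho, rng):
    """g-by-g block: rho * (U^T U) + (1 - rho) * I, with U an M-by-g matrix."""
    u = rng.standard_normal((M, g))
    u /= np.linalg.norm(u, axis=0)      # assumption: random directions scaled to unit length
    return rho * (u.T @ u) + (1.0 - rho) * np.eye(g)

def simulated_matrix(sizes, dims, rhos, seed=0):
    """Place one block per module on the diagonal and fill the rest with 0."""
    rng = np.random.default_rng(seed)
    n = sum(sizes)
    out = np.zeros((n, n))
    start = 0
    for g, M, rho in zip(sizes, dims, rhos):
        out[start:start + g, start:start + g] = module_block(g, M, rho, rng)
        start += g
    return out

# Two modules of sizes 3 and 2 (illustrative parameters only).
sim = simulated_matrix(sizes=[3, 2], dims=[5, 4], rhos=[0.8, 0.5])
```

Since the columns of Uk have unit length, every diagonal entry of a block equals ρk + (1 − ρk) = 1 and every off-diagonal entry is bounded in absolute value by ρk, which matches the description of ρk as the maximum correlation inside the module.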

To produce random noise, c unit vectors were selected from vector space $${R^m}$$ to construct an m-by-c matrix Uk. The random noise matrix S was selected from the c-by-c matrices Sk:

$${U_k} = {\left( {{u_1}|{u_2}| {\ldots} |{u_c}} \right)}$$
$${S_k} = {\varepsilon _k}{\left( {{U_k^T}{U_k}} \right)}$$
$${\varepsilon _k} = {\varepsilon _1} + {\frac{{k - 1}}{{r - 1}}}{\left( {{\varepsilon _r} - {\varepsilon _1}} \right)}$$

where r represents the row of the noise matrix, c represents the column of the noise matrix, m represents the variation range of noise, εk represents k-order noise, in which k = 1 represents the minimum value of noise, and k = r represents the maximum value of noise. Finally, random noise matrix S was added to the corresponding module matrix to simulate natural disturbance.
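
The noise schedule and a single noise matrix can be sketched as follows. This only illustrates the two formulas for εk and Sk. The way the unit vectors are drawn (normalized Gaussian vectors) and all function and argument names are assumptions, and the step of adding S to a module matrix is not shown because its details are not given here.

```python
import numpy as np

def noise_levels(eps_min, eps_max, r):
    """eps_k = eps_1 + (k - 1) / (r - 1) * (eps_r - eps_1) for k = 1..r (r > 1)."""
    k = np.arange(1, r + 1)
    return eps_min + (k - 1) / (r - 1) * (eps_max - eps_min)

def noise_matrix(c, m, eps, rng):
    """c-by-c noise matrix S_k = eps_k * (U^T U), with U an m-by-c matrix."""
    u = rng.standard_normal((m, c))
    u /= np.linalg.norm(u, axis=0)      # assumption: random directions scaled to unit length
    return eps * (u.T @ u)

# Five noise levels between illustrative bounds, then one noise matrix.
eps = noise_levels(0.1, 0.5, 5)
S = noise_matrix(c=4, m=6, eps=eps[2], rng=np.random.default_rng(0))
```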

### Statistical analysis

All statistical analyses were conducted in R and visualized using the package ‘ggplot2’. The blocked Wilcoxon test was applied using the R package ‘coin’42. Principal coordinates analysis was implemented by the function ‘pco’ in the R package ‘labdsv’. Co-occurrence networks were visualized using Cytoscape43. For all analyses regarding multiple comparisons, we used the FDR method to correct for multiple testing. Classification was applied using the combined markers, which were identified based on network (NetMoss) or abundance (the other five methods) in the combined CRC datasets. For the separate classification in each of the seven studies, the combined markers were screened using tenfold cross-validation before the prediction.

## Data availability

All sequence data supporting the findings of this study were obtained from Sequence Read Archive (SRA) of NCBI with the accession numbers listed in Supplementary Table 1. Source data are provided with this paper.

## Code availability

The source code for the NetMoss algorithm and data analysis scripts can be accessed at Zenodo44. NetMoss has been implemented as an R package, which can be accessed from GitHub (https://github.com/xiaolw95/NetMoss).

## References

1. Schupack, D. A., Mars, R. A. T., Voelker, D. H., Abeykoon, J. P. & Kashyap, P. C. The promise of the gut microbiome as part of individualized treatment strategies. Nat. Rev. Gastroenterol. Hepatol. 19, 17–25 (2022).

2. Fan, Y. & Pedersen, O. Gut microbiota in human metabolic health and disease. Nat. Rev. Microbiol. 19, 55–71 (2021).

3. Strimbu, K. & Tavel, J. A. What are biomarkers? Curr. Opin. HIV AIDS 5, 463–466 (2010).

4. Sawyers, C. L. The cancer biomarker problem. Nature 452, 548–552 (2008).

5. Califf, R. M. Biomarker definitions and their applications. Exp. Biol. Med. (Maywood) 243, 213–221 (2018).

6. Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A. & Alm, E. J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 8, 1784 (2017).

7. Dai, Z. et al. Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6, 70 (2018).

8. Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).

9. Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).

10. Gibbons, S. M., Duvallet, C. & Alm, E. J. Correcting for batch effects in case-control microbiome studies. PLoS Comput. Biol. 14, e1006102 (2018).

11. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

12. Smyth, G. K. limma: Linear Models for Microarray Data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (eds Gentleman, R. et al.) 397–420 (Springer, 2004).

13. Banerjee, S. et al. Agricultural intensification reduces microbial network complexity and the abundance of keystone taxa in roots. ISME J. 13, 1722–1736 (2019).

14. Rao, C. et al. Multi-kingdom ecological drivers of microbiota assembly in preterm infants. Nature 591, 633–638 (2021).

15. Xiao, L., Wang, J., Zheng, J., Li, X. & Zhao, F. Deterministic transition of enterotypes shapes the infant gut microbiome at an early age. Genome Biol. 22, 243 (2021).

16. Naqvi, A., Rangwala, H., Keshavarzian, A. & Gillevet, P. Network-based modeling of the human gut microbiome. Chem. Biodivers. 7, 1041–1050 (2010).

17. Yilmaz, B. et al. Microbial network disturbances in relapsing refractory Crohn’s disease. Nat. Med. 25, 323–336 (2019).

18. Mac Aogain, M. et al. Integrative microbiomics in bronchiectasis exacerbations. Nat. Med. 27, 688–699 (2021).

19. Mitchell, H. D. et al. A network integration approach to predict conserved regulators related to pathogenicity of influenza and SARS-CoV respiratory viruses. PLoS ONE 8, e69374 (2013).

20. Kuntal, B. K., Chandrakar, P., Sadhu, S. & Mande, S. S. ‘NetShift’: a methodology for understanding ‘driver microbes’ from healthy and disease microbiome datasets. ISME J. 13, 442–454 (2019).

21. Cheng, L., Qi, C., Zhuang, H., Fu, T. & Zhang, X. gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 48, D554–D560 (2020).

22. Donaldson, G. P., Lee, S. M. & Mazmanian, S. K. Gut biogeography of the bacterial microbiota. Nat. Rev. Microbiol. 14, 20–32 (2016).

23. Gurung, M. et al. Role of gut microbiota in type 2 diabetes pathophysiology. EBioMedicine 51, 102590 (2020).

24. Wang, J. et al. Dysbiosis of maternal and neonatal microbiota associated with gestational diabetes mellitus. Gut 67, 1614–1625 (2018).

25. Mavenyengwa, R. T., Moyo, S. R. & Nordbø, S. A. Streptococcus agalactiae colonization and correlation with HIV-1 and HBV seroprevalence in pregnant women from Zimbabwe. Eur. J. Obstet. Gynecol. Reprod. Biol. 150, 34–38 (2010).

26. Liu, Q. et al. Alteration in gut microbiota associated with hepatitis B and non-hepatitis virus related hepatocellular carcinoma. Gut Pathog. 11, 1 (2019).

27. Wang, Y. & Lê Cao, K. A. Managing batch effects in microbiome data. Brief. Bioinform. 21, 1954–1970 (2020).

28. Dai, Z., Wong, S. H., Yu, J. & Wei, Y. Batch effects correction for microbiome data with Dirichlet-multinomial regression. Bioinformatics 35, 807–814 (2019).

29. Barberan, A., Bates, S. T., Casamayor, E. O. & Fierer, N. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 6, 343–351 (2012).

30. Wang, J., Gao, Y. & Zhao, F. Phage-bacteria interaction network in human oral microbiome. Environ. Microbiol. 18, 2143–2158 (2016).

31. Yang, J. et al. High-fat diet promotes colorectal tumorigenesis through modulating gut microbiota and metabolites. Gastroenterology 162, 135–149.e2 (2021).

32. Veettil, S. K. et al. Role of diet in colorectal cancer incidence: umbrella review of meta-analyses of prospective observational studies. JAMA Netw. Open 4, e2037341 (2021).

33. Jacouton, E., Chain, F., Sokol, H., Langella, P. & Bermudez-Humaran, L. G. Probiotic strain Lactobacillus casei BL23 prevents colitis-associated colorectal cancer. Front Immunol. 8, 1553 (2017).

34. Lenoir, M. et al. Lactobacillus casei BL23 regulates Treg and Th17 T-cell populations and reduces DMH-associated colorectal cancer. J. Gastroenterol. 51, 862–873 (2016).

35. Nakatsu, G. et al. Gut mucosal microbiome across stages of colorectal carcinogenesis. Nat. Commun. 6, 8727 (2015).

36. Zhang, M. et al. Differential mucosal microbiome profiles across stages of human colorectal cancer. Life (Basel) 11, 831 (2021).

37. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).

38. Kurtz, Z. D. et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput. Biol. 11, e1004226 (2015).

39. Csárdi, G. & Nepusz, T. The igraph software package for complex network research. Int. J. Complex Syst. 1695, https://igraph.org (2006).

40. Hedges, L. V. & Olkin, I. Parametric estimation of effect size from a series of experiments. In Statistical Methods for Meta-Analysis (eds Hedges, L.V. et al.) 107–145 (Academic Press, 1985).

41. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).

42. Hothorn, T., Hornik, K., van de Wiel, M. A. & Zeileis, A. A lego system for conditional inference. Am. Statistician 60, 257–263 (2006).

43. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

44. Xiao, L., Zhang, F. & Zhao, F. Large-scale Microbiome Data Integration Enables Robust Biomarker Identification (Zenodo, 2022); https://doi.org/10.5281/zenodo.5913041

## Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (no. 32025009), the National Key R&D Program of China (nos. 2021YFA1301000 and 2021YFC2301300) and the Strategic Priority Research Program of the Chinese Academy of Sciences (no. XDB38020300).

## Author information

### Contributions

F. Zhao conceived the project. L.X. and F. Zhang designed the algorithm and performed the data analysis. L.X., F. Zhang and F. Zhao wrote the manuscript.

### Corresponding author

Correspondence to Fangqing Zhao.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Peer review

### Peer review information

Nature Computational Science thanks Yong Fan, Leo Lahti and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Jie Pan, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary information

### Supplementary Information

Supplementary Figs. 1–8 and Tables 1 and 2.

## Source data

### Source Data Fig. 1

Statistical source data.

### Source Data Fig. 2

Statistical source data.

### Source Data Fig. 3

Statistical source data.

### Source Data Fig. 4

Statistical source data.

### Source Data Fig. 5

Statistical source data.

## Rights and permissions

Reprints and Permissions

Xiao, L., Zhang, F. & Zhao, F. Large-scale microbiome data integration enables robust biomarker identification. Nat Comput Sci 2, 307–316 (2022). https://doi.org/10.1038/s43588-022-00247-8
