Trios—promising in silico biomarkers for differentiating the effect of disease on the human microbiome network

Recent advances in the HMP (human microbiome project) research have revealed profound implications of the human microbiome to our health and diseases. We postulated that there should be distinctive features associated with healthy and/or diseased microbiome networks. Following Occam’s razor principle, we further hypothesized that triangle motifs or trios, arguably the simplest motif in a complex network of the human microbiome, should be sufficient to detect changes that occurred in the diseased microbiome. Here we test our hypothesis with six HMP datasets that cover five major human microbiome sites (gut, lung, oral, skin, and vaginal). The tests confirm our hypothesis and demonstrate that the trios involving the special nodes (e.g., most abundant OTU or MAO, and most dominant OTU or MDO, etc.) and interactions types (positive vs. negative) can be a powerful tool to differentiate between healthy and diseased microbiome samples. Our findings suggest that 12 kinds of trios (especially, dominantly inhibitive trio with mixed strategy, dominantly inhibitive trio with pure strategy, and fully facilitative strategy) may be utilized as in silico biomarkers for detecting disease-associated changes in the human microbiome, and may play an important role in personalized precision diagnosis of the human microbiome associated diseases.

With the rapid advances in the human microbiome research, it becomes increasingly important to detect and quantify the changes occurring in the human microbiome, especially the changes in the microbiome associated with disease. Differentiating between the healthy microbiome (sampled from healthy individuals) and diseased microbiome (sampled from individuals with microbiome associated disease such as bacterial vaginosis) is essentially a problem of measuring the dissimilarity between two microbial communities. Naturally, diversity analysis with traditional biodiversity measures such as species richness, Shannon information entropy (Shannon index), Simpson's index, have been playing a dominant role in the field of comparing the human microbiome across space and time, as well as between healthy and diseased samples (HMP Consortium 2012a, 2012b, Lozupone et al. 2012) [1][2][3] . While diversity measures such as Shannon index are certainly useful for measuring the dissimilarity between communities, and indeed they have been applied to characterize microbial community in nearly every 16s-rRNA sequencing based microbiome study (e.g., Abusleme [4][5][6][7][8] , the diversity analysis is not without shortcomings. One inherent issue associated with all diversity measures is that they ignore the interactions between species, and they are simply some incarnations of species abundance distributions (SAD) in the community. Consequently, diversity analysis cannot fully account for the contributions of species interactions in the microbiome.
Although still not widely applied to the study of human microbiome, network analysis has been widely applied to other fields of computational biology and bioinformatics, such as gene regulatory and signal transduction networks, protein interaction networks, metabolic networks, phylogenetic networks, and ecological networks (Junker & Schreiber 2008) 9 . Indeed, network analysis, which considers both species abundance and their interactions (links), has been anticipated to remedy or even solve the issues associated with traditional diversity measures. In the field of human microbiome research, Faust and his collaborators (Faust & Raes 2012, 2015 [10][11][12] performed extensive pioneering works. Since their works (Faust & Raes 2012, Faust et al. (i.e., MAO is not part of the trio). The second class refers to the triangle motif that is connected with an external MAO and is further distinguished as three categories: single-link MAO handle, double-link MAO handle, and triple-link MAO handle, with one, two and three links to the node of MAO handle, respectively.
At the tertiary level classification, each of the five secondary level categories (trios without MAO handle, trios with MAO handle, single-link MAO handle, double-link MAO handle, and triple-link MAO handle) is further classified based on the signs (+ or −) of the interactions (correlations) within the trio or between the trio and handle. Detailed classification of the 19 trio types at the tertiary level, generated from the above-described three-level classification scheme is presented in the following Table 1. Among 19 trio types, four types in the category of trios without MAO are nothing particular and are detected in existing network analysis software packages such as Cytoscape (Shannon et al. 2003) 23 and iGraph (Csardi & Nepusz 2006) 24 . To the best of our knowledge, the other 15 trios have not been investigated in the existing literature. Our focus will be centered on those 15 special trios.
As it is demonstrated below, even the 15 special trios are not created equal, and some of the theoretically possible triangle motifs are not detected in our datasets and may even be 'prohibited. ' Some of the trios, especially the six types of trios in the category of trios with MAO were named as Type-1A, Type-1B, Type-2A, Type-2B, Type-3, and Type-4, respectively. These six types are identified at the tertiary level classification; the sign (+ or −) of interaction in trios in the microbiome network is considered. For example, the difference between Type-1A and Type-1B lies in the signs of two links connected with the MAO, i.e., (+ −) in Type-1A and (− −) in Type-1B.
Computational procedures. Our computational procedures for detecting the trios (triangle motifs) consist of the following four major steps: (i) computing OTU correlation coefficients (using either Spearman's or Pearson's definitions), (ii) filtering out false correlations with FDR (false discovery rate) adjustment, (iii) constructing the OTU (or species) correlation (interaction) networks with standard network analysis software packages such as Cytoscape (Shannon et al. 2003) 23 or iGraph (Csardi & Nepusz 2006) 24 ; and (iv) detecting the trios with home-made trio finder program (TFP) program, supplied in the online Supplementary document. The first two steps were actually implemented in a R-script CCFDR.r (Correlation Computing with False Discovery Rate) we provided in the online Supplementary materials. The R-script (CCFDR.r) calls the function "rcorr" from existing R-package Hmisc (https://cran.r-project.org/web/packages/Hmisc/) and the function "multiple.correction" from existing R-package EMA (https://cran.r-project.org/web/packages/EMA/index.html) to compute the correlation coefficients and filter out false correlations, respectively. The output from the CCFDR.r, i.e., correlation computing adjusted with FDR control, is feed into our Perl program TFP.pl, which completes the task of seeking and counting the various trio types. The following flowchart shows the computational procedures, and we further elaborate the possible issues involved in the procedures below.
A flowchart showing the four steps for implementing the trio-finding process: Step (i): Compute the OTU (species) correlation coefficients with Spearman's or Pearson's definitions.
Step (ii): Filter out false correlation with FDR (false discovery rate) control with our CCFDR R-script.
Step (iii): Construct the OTU (or species) correlation (interaction) networks with standard network analysis software packages such as Cytoscape (Shannon et al. 2003) 23 or iGraph (Csardi & Nepusz 2006) 24 . This step can be omitted if no network graphs are output.
Step (iv): Detecting the trios with our trio finder program (TFP.pl) (see the Supplement).
To construct SIN, we recommend using Spearman's rank coefficient or occasionally Pearson's correlation coefficient as demonstrated in Junker & Schreiber (2008) 9  The category of "Trios with MAO" The category of "Trios without MAO" The class of "Trios with MAO Handle" The category of "Single-Link MAO" (SLM Trio) The category of "Double-Link MAO" (DLM Trio) The category of "Triple-Link MAO" (TLM Trio) − SLM- However, to utilize the correlation coefficients for constructing the species or OTU correlation networks, there are two issues that should be addressed first: one is the choice of either the relative abundance or actual OTU reads and another is to filter out the false correlations in the raw correlation coefficient values in consideration of the rising risk of false correlations from multiple testing (i.e., simultaneously testing multiple null hypotheses or the significance of multiple correlation coefficients) with sequence data [30][31][32] . Both steps are usually necessary to ensure proper construction of the underlying OTU (species) correlation networks (SCN), also known as species interaction networks (SIN) as often termed in macro-ecology.
Regarding the utilization of OTU reads for computing the correlation coefficients, our recommendation is that, when the numbers of sequence reads from different samples are approximately equal, the OTU reads can be utilized directly to compute the correlation coefficients; when the numbers of sequences reads from different samples are significantly different, the relative abundances should be utilized instead. The usage of OTU reads directly has an advantage over the relative abundance since the former can avoid the potential error from decimal conversion in calculating the relative abundance (i.e., the OTU reads for a particular OTU or species divided by the total number of reads of all OTUs in the sample). Our pre-experiment tests found that, although both relative abundance and absolute abundance (raw OTU reads) may produce different results when the numbers of sequencing reads across samples are different, the trend of trios is rather robust. In this study, we use the relative abundance (i.e., the reads of a particular OTU in a sample divided by the total reads in the sample) to be fail-safe. Alternatively, if one does not wish to use relative abundance, sub-sampling (i.e., randomly choose the same number of reads from each sample, e.g., 5000 reads from each sample) may be utilized to deal with the issues associated with unbalanced sample sizes.
To deal with the rising chances that some tests will tend to pass falsely when simultaneously testing multiple null hypotheses (i.e., the significance of many correlation relationships) in 16s-rRNA sequence data, we suggest correcting the p-values associated with the correlation coefficients (from either Spearman's or Pearson's methods) with the FDR-BH algorithm (Benjamini-Hochberg standard false discovery rate correction) (Benjamini and Hochberg's 1995) 30 . The procedures with the FDR control have been implemented in several R packages, and we choose to use the R-package EMA, which implemented FDR-BH algorithm by Servant et al. (https:// cran.r-project.org/web/packages/EMA/index.html) 31. Specifically, we called the "multiple.correction" function from the EMA package in our own R-script "CCFDR.r", which also called another function rcorr from the R-package Hmisc (https://cran.r-project.org/web/packages/Hmisc/) to compute the Spearman's or Pearson's correlation coefficients. Our R-script "CCFDR.r" essentially implemented the first two steps in previous flowchart and its output is feed into our Perl program TFP.pl (Trios-Finder Program), which completes the trios-finding function outlined in step 4 in the previous flowchart. Both CCFDR.r and TFP.pl are supplied in the online Supplementary materials.
In summary, after dealing with the above-described two potential issues with our "CCFDR.r" R-script (i.e., eliminating the side effect of unbalanced of sample sizes and filtering out false correlations) we use the remaining correlation relationships (i.e., Spearman's rank correlation coefficient (ρ) in this study) with a threshold of p ≤ 0.05 to build OTU correlation networks for the healthy and diseased microbiome samples, respectively. From the OTU correlation networks, the trios defined in Table 1 are sought out and counted with our homemade Perl-program TFP.pl.
In this article, we present our methodology and hypothesis based on the trios that are associated with the MAO (most abundant OTU) in the microbial species interaction network to simplify the presentation. It is noted that our methodology presented here regarding the special node can be readily extended to other nodes in SIN with some special biomedical or computational implications. We have also applied the MDO (most dominant OTU) and hub associated trios elsewhere (Ma & Ellison 2017a) 25 with the same computational procedures presented here, but the detailed approaches are only reported in this article.
To compare the distribution of the above-defined trios in the healthy and diseased microbiome samples, we define RDHT (the ratio of disease to healthy trios), the number of a particular trio type in the disease treatment divided by the number of the same trio type in the healthy counterpart. Nevertheless, caution should be taken to use RDHT for diagnostic purpose since the magnitude of different trio types may be different. In addition, the identity of trio members may also be of critical biomedical significance.

Test Results and Discussion
Test dataset description. Largely following the sampling scheme of NIH human microbiome project, we selected six datasets that represent the microbiomes sampled from five major body sites (gut, lung, oral, skin, and vagina). Except for gut that is represented by two datasets (HIV and IBD), each microbiome site is represented by one dataset, with six datasets in total. A brief description on the six datasets is summarized in Table 2, and detailed information on individual dataset is referred to the original publication noted in Table 2.
For each of the six case studies representing the five major microbiome sites, we constructed separate species interaction networks (SIN) with the 16s-rRNA sequence dataset from each treatment in the six case studies. For example, with BV dataset (Srinivasan et al.) 6 , we built two SINs: one with the 16s-rRNA microbiome samples from BV group and another with the samples from healthy group. We followed the 4-step computational procedure described in the previous section and built 15 networks in total for the 15 treatments of the healthy and diseased microbiome groups, covering five major microbiome sites (gut, lung, oral, skin, and vaginal) and representing several diseases including HIV-infection, inflammatory bowel disease (IBD), periodontitis, cystic fibrosis (CF), Atopic Dermatitis (AD), and bacterial vaginosis (BV), as detailed in previous Table 2.
After getting respective SIN for each of the 15 treatments of the six case studies, we utilized our homemade trio finder program (TFP.pl) program, supplied in the Supplementary document, to compute the 19 triangle motifs or trio types defined in previous Table 1. The results from TFP computing are listed in the following Table 3 for the class of "trios without MAO handle" and Table 4 for the class of "trios with MAO handle, " respectively.
The class of "Trios without MAO handle". In consideration of the existence or absence of MAO handle connected with trios, the 19 triangle motifs are classified into two classes (explained in Table 1) trios without MAO handle (upper section in Table 1 and discussed in this sub-section further) and trios with MAO handle (bottom section in Table 1 and discussed in the next sub-section).
As displayed in Table 3, the class of trios without MAO handle is further distinguished as two categories: trios with MAO and trios without MAO. The category of trios without MAO, including four types that differ from each  Table 2. Datasets utilized to develop and test the TFP (trio finder program) .*The numbers parenthesized are the sample sizes, i.e., the number of individuals sampled in the original study or that used for constructing species correlation network. When there are two numbers (sample sizes) in the parentheses, the first number is the sample size of the original study, and the second number is the actual sample size we used to reconstruct the species correlation network in this article. The reason to use a different sample size from the original sample size is to avoid the influence of unbalanced treatments-the difference between treatments in the sample size is large enough to influence the results of network analysis. For example, in the case of IBD, the number of UC samples is 38, which is significantly more than the numbers of samples in the two other treatments (both are 18). To avoid the side effect of unbalanced treatments, 18 samples were drawn out of 38 samples and the sampling process was repeated for 50 times. From each sampling process, a network was reconstructed and the network properties including the numbers of trios were computed. The averages from the 50 sampling operations were taken as the final result for the sampled treatment. Two treatments, i.e., UC in IBD case, AD in Skin case were sampled to deal with the issue of unbalanced treatments.  other by the signs of trio links [(---), (+ --), (+ + -), (+ + +)], although listed for comparative purpose, may be less important than the category of trios with MAO for the following reason: because MAO is not involved apparently, the number of trios in this category may be too many to focus on for further etiological studies in practice.
Given the particular significance of the category of trios with MAO, we further distinguish four types (Type 1-4) at the tertiary level classification, depending on the signs (+ or -) of the trio links. Due to the special role of MAO, we further distinguish two sub-types for Type-1 (Type-1A & Type-1B) and Type-2 (Type-2A & Type 2B), respectively, at the tertiary level by noting the 'position' of MAO in the trio (see Table 1). This classification results in the six types in the category of trios with MAO, i.e., Type-1A, Type-1B, Type-2A, Type-2B, Type-3, and Type-4 (also see previous Table 1). Table 3 shows that theoretically possible Type-2 and Type-3 were not detected in our case studies. The counterpart types in the category of trios without MAO were not detected either. Actually, the apparent prohibition of both Type-2 and Type-3 is not difficult to explain by their internal interactions. Type-2, which has three links with (+ + -) interaction relationships, may be hard to sustain because a third link of negative (-) interaction would be 'coerced' to follow the 'mainstream norm' of two other collaborative relationships. Similarly, a trio consisting of three totally opposing nodes is unlikely to sustain because they would most likely 'destroy' each other. Obviously, regardless whether or not MAO is involved, the arguments regarding (+ + -) and (---) hold; hence in both trios with and without MAO, these two patterns may not be sustainable.
We name Type-1 (+ --) triangle motif as dominantly inhibitive trio given that negative interactions form majority in the system. In Type-1A sub-type, MAO takes a mixed strategy, collaborating with one and competing with another node in the trio. We term Type-1A as dominantly inhibitive trio with mixed strategy. In the Type-1B sub-type, MAO competes with both nodes in the trio simultaneously, and we term this type dominantly inhibitive trio with pure strategy. Among the six tested cases displayed in Table 3, in the cases of skin and oral, no Type-1A trio was detected; in the cases of IBD and lung, no Type-1B was detected. In the other cases, the RDHT of Type-1 ranged from 0 to infinity. That is, disease may raise or lower the number of Type-1 trios depending on the type of microbiome and its associated disease, possibly on other factors, and the difference can be exploited to detect the impact of diseases.
We name Type-4 (+ + +) triangle motif as fully facilitative trio given that positive interactions are the sole interaction in this type of trio system. It is also the most abundant triangle motif among the four types in the category of trios with MAO.
In summary, the results in Table 3 suggest that dominantly inhibitive trio (i.e., Type-1, including both Type-1A with mixed strategy and Type-1B with pure strategy) and fully facilitative trio (Type-4) possess the potential to act as in silico biomarker for differentiating the healthy and diseased microbiomes. As to the criteria for differentiating disease from healthy microbiome, we previously defined the ratio of disease to healthy trio (RDHT) as indicator of the changes, but actual application of the indicator is individual case specific, depending on the types of microbiome, disease, and possibly other factors. In fact, the taxonomic identities and biological characteristics (such as anaerobes or opportunistic pathogens) of trio nodes should play a rather important role in deciphering the mechanism of specific trio formation as we demonstrate elsewhere.

Microbiomes & Associated Disease Treatments
Single-Link MAO Double-Link MAO Triple-Link MAO Negative  1029  311  1340  207  123  286  616  8  37  76  239  360   HIV-ART  434  21  455  171  17  1  189  24  13  3  1   The class of "Trios with MAO handle". In the class of trios with MAO handle, MAO is connected with the trio as a 'handle' rather than as a constituent node of the trio. In contrast with the previously discussed class of trios without MAO handle, there is no 'prohibited' trio in the trios with MAO handle. Therefore, all three categories (SLM, DLM, and TLM) including nine types (classified at the tertiary level by considering the link signs) are practically possible. Note that the trios in this class actually do not contain MAO because usually MAO is unique in microbiome network. The results in Tables 5 and 6 suggest that the range of RDHT spans from zero to infinity. In a half of the cases (48%, or 13 out of 27 cases in the last three columns of Table 5), RDHT exceeded one, that is, diseases tend to raise the number of trios in the class of trios with MAO handle. Furthermore in 44% of the cases (12 out of 27), the RDHT exceeded 10, i.e., diseases caused more than 10 times increase in the number of trios; in 1/3 of the cases (9 out of 27), the RDHT reaches infinity, i.e., the trios occurred only in diseased microbiome networks. Given the striking differences in RDHT among different microbiome-disease treatments, we consider this class of trios also possesses the potential to act as in silico biomarker for assessing the effects of diseases on the human microbiome. Since it seems that the numbers of trios in this class are far greater than those in the class of trios without MAO handle, we believe that the previously identified fully facilitative trios and dominantly inhibitive trios may have an advantage over this class in exploring the mechanisms of disease effects. Another argument in support of our opinion is that the fully facilitative trios and dominantly inhibitive trios are simpler with three nodes only.
General patterns of the trio differences between the healthy and diseased microbiomes. In the following, we further look into general patterns by cataloging the 19 trios into five categories and summing up the total trios of each category in Table 5. In Table 5, besides listing the microbiome sites and healthy/disease treatments in the first two columns, the remaining five columns display the respective ratios of the trios in diseased network to those in healthy network for each of the five categories, i.e., RDHT for: trios with MAO, trios without MAO, single-link with MAO (SLM), double-link with MAO (DLM) and triple-link with MAO (TLM), respectively. That is, for each category, we define and compute the ratio of diseased to healthy trios (RDHT). In the ideal scenario when disease has no impact on the microbiome, the RDHT should be 1. If the ratio is larger than 1, then it indicates that the disease may raise the number of trios in the specific category; vice versa, it indicates that the disease may lower the number of trios if the RDHT is smaller than 1. Table 5 shows that in approximately 47% cases (21 out of 45) disease caused a decrease in the number of trios. Specifically, the decline of trios occurred mostly in two diseases: HIV and periodontitis. The RDHT values range from zero to infinity; the occurrence of zero or near zero (≤0.1) counts to 9, and that of infinity reaches 29% (13    Table 3. out of 45). The number of RDHT exceeding 10 (i.e., disease caused more than 10 times of increase in the trios) approaches to 38% (17 out of 45). In these cases, disease led to a significant increase in the number of trios. More interesting insights can be found by looking into the third level classification-considering the sign (+ , -) of interactions (links) in the trios, as well as the 'position' of signs (see Table 1). Table 6 summarized the  RDHT of Type-1, 2, 3, 4 trios from the information presented in Tables 3 & 4 to further reveal patterns and trends embedded in the six trios that are associated with MAO but without a MAO handle (see Table 1). Of course, the undetected Type-2 and Type-3 appear to be "prohibited" in our case studies as explained previously.
Among the six cases we analyzed, except for the CF-lung case, which we cannot draw a definite conclusion due to data insufficiency, there were three cases (IBD-gut, AD-skin, and BV-vaginal) that displayed disease-up-regulated trios trend, and two cases (HIV-gut and periodontitis-oral) displayed disease-down-regulated trend. Although further accumulating test cases is certainly meaningful, this splitting trend of the up or down of trios does not affect the testing of our primary hypothesis-whether or not the trios we defined can differentiate between healthy and diseased microbiomes. This is because the validity of our hypothesis hinges on the level of difference or gap in the trio numbers (i.e., RDHT) rather than on the sign or direction of the difference (rise or decline). Indeed, we believe that the variable sign of the difference among microbiomes may simply be a biomedical reality.
Finally, we suggest that, among 19 types of trios we initially propose to test, 12 are indeed promising as in silico biomarkers. The six trio types we excluded are the four types in the category of "trios without MAO", and three types (Type-2A, Type-2B, & Type-3) in the category of "trios with MAO". The reason they are excluded is either because they are either too abundant (to be indicative) or too rare (not detected) in both healthy and diseased samples, to be indicative. Indeed, the entire category of "trios without MAO" is excluded, not only because they are too abundant to be indicative, but also because they lack special node (in this study, the MAO). We demonstrated that the following 12 types or categories are the most promising: Type 1A (dominantly inhibitive trio with mixed strategy), Type 1B (dominantly inhibitive trio with pure strategy), Type-4 (fully facilitative), SLM (including 2 types), DLM (3 types), and TLM (4 types). We particularly favor the first three types, and give them the special names in particular in consideration that they are simpler and less abundant (in general) than the four-nodes SLM, DLM, and TLM. This may give them an advantage in further studying their etiological implications experimentally. As mentioned previously, two further improvements can be made to reveal potentially more meaningful biomedical insights. One is to look up the taxonomic identities or biomedical characteristics such as the trios of anaerobes, and another is to replace the MAO with other special network nodes such as MDO (most dominant OTU) or hub nodes. We will demonstrate these additional improvements elsewhere. Data accessibility. The datasets utilized in this study are available in the original studies cited in Table 2. The study does not involve any experiments involving humans and/or animals.