New statistical method identifies cytokines that distinguish stool microbiomes

Regressing an outcome or dependent variable onto a set of input or independent variables allows the analyst to measure associations between the two so that changes in the outcome can be described by and predicted by changes in the inputs. While there are many ways of doing this in classical statistics, where the dependent variable has certain properties (e.g., a scalar, survival time, count), little progress on regression where the dependent variable are microbiome taxa counts has been made that do not impose extremely strict conditions on the data. In this paper, we propose and apply a new regression model combining the Dirichlet-multinomial distribution with recursive partitioning providing a fully non-parametric regression model. This model, called DM-RPart, is applied to cytokine data and microbiome taxa count data and is applicable to any microbiome taxa count/metadata, is automatically fit, and intuitively interpretable. This is a model which can be applied to any microbiome or other compositional data and software (R package HMP) available through the R CRAN website.

Researchers may want to know how microbiome composition (outcome or dependent variable) changes in relation to patient characteristics (inputs or independent variables) such as age, disease severity, or cytokine level. The value of this type of analysis is to understand how patient characteristics may directly impact microbial compositional and variability. For example, if we know the microbiome changes in a systematic way as patients age, those changes due to age can be accounted for and removed when studying disease processes and the differences between healthy and sick patients [1][2][3] . If the composition of the taxa population is different for changing levels of cytokines, these can be examined as a possible mechanism of disease or for treatment targets 4 .
Associating patient outcomes (dependent variable) with input covariates (independent variables) is often a regression analysis problem to statisticians. Regression methods from classical statistics have been developed such as linear regression for numerical outcomes like cholesterol levels, logistic regression for binary outcomes like responder versus non-responder, or proportional hazards regression for survival analyses like time to death 5 . As computing power increased over the years, new regression algorithms have emerged that apply to more complex models and outcomes. These include nonlinear, mixed, and nonparametric regression (e.g., splines) models 6 , and nonparametric recursive partitioning to finds regions of the covariate space where the outcome is more homogeneous 7,8 . Within the recursive partitioning framework even more complex outcomes have been modeled such as survival models 9 or Haseman-Elston models from genetic epidemiology to associate genetic markers in sibpairs with disease 10 .
However, these models won't work off-the-shelf when the outcome (dependent variable) are the taxa counts and the inputs (independent variables) are the metadata since the mathematics and computing behind them don't apply to microbiome taxa count outcomes. To get around this, researchers have developed ad hoc regression methods. For example, we regressed individual taxa proportions (Bacilli, Clostridia, and Gammaproteobacteria) onto day of life, days on antibiotics, breast milk, and C-section in a study of necrotizing enterocolitis in pre-term births to see how these variables impacted the levels of these taxa 11 . Since each taxon was regressed separately, this approach was suboptimal by not using the covariation information among the taxa and metadata in the analysis, and it was not possible to model how all three taxa changed simultaneously. A second approach uses random forests, however these models use the individual taxa as inputs or independent variables to predict an outcome

Regression With Microbiome Composition as The Outcome
The method presented here has been developed for microbiome compositional data outcomes using recursive partitioning that separates samples into non-overlapping subgroups based on patient covariate values -samples within a subgroup are more similar to each other in their microbial composition than to samples in other subgroups. Similarity of a separation into two subgroups is measured by the Dirichlet-multinomial (DM) model loglikelihood ratio 15,17 . The Dirichlet-multinomial recursive partitioning algorithm (DM-RPart) allows the regression of the compositional data onto the covariates without having to resort to one-taxa-at-a-time, complex assumptions, or data reduction.
Dm-rpart application to ihmp data. To demonstrate the utility of DM-RPart for investigating host covariate-microbiome interactions we curated a longitudinal dataset generated as part of the iHMP to test the hypothesis that changes in cytokines representing one or more physiological processes produce changes in the composition of the gut microbiome. These data consist of host-serum cytokines and gut microbiome samples collected from individuals during periods of self-reported viral upper respiratory infection. This work was supported by the NIH Common Fund Human Microbiome Project (HMP) 164(1U54DE02378901) 18,19 . Twelve cytokines were selected that may represent one or more physiological process associated with antimicrobial activity (IL-17F,  IL-17A, IL-21, IL-22, IL-23, IL-12p40) 20 , autoimmunity (Eotaxin) 21 , allergy (IL-4, IL-13) 22 , and viral infection (IFNG) 23 , as well as cytokines related to obesity (Leptin) 24 and regulation of inflammation (TGFB) 25 . Partitioning all 128 microbiome samples produced an optimal tree that first split samples based on leptin, then partitioned high-leptin microbiomes based on IFNG, and low-leptin microbiomes based on IL4/IL23 (Fig. 1A). Specifically, the DM-RP model partitions the 128 samples into 5 non-overlapping groups: 37 samples with Leptin < 1476 and IL4 < 60 (Terminal Node 3); 10 samples with Leptin < 1476, IL4 > 60, and IL23 < 62 (Terminal Node 5); 10 samples with Leptin < 1476, IL4 > 60, IL-23 > 62 (Terminal Node 6), 50 samples with Leptin > = 1476 AND IFNG < 66 (Terminal Node 8); and 21 samples with Leptin > = 1476 AND IFNG > = 66 (Terminal Node 9). The percentages in each node represent the percent of all 128 samples. The terminal nodes at the bottom of the tree represent the subgroups. Selection of the variables and cut points is an automatic process based on optimizing a measure of homogeneity such that the samples in two child nodes, say the 57 samples with Leptin < 1476 and the 71 samples with Leptin >1476 are more homogeneous compared to when these samples are combined in the parent node above them. Figure 1B shows the composition of different genera in the terminal nodes, or subgroups, in Fig. 1. Note that DM-RP is data-driven and fully automatic with selection and order of the splits being optimal.
This model makes biological sense. Leptin levels are associated with BMI 24 , and obesity has frequently been associated with changes in the composition of the gut microbiome. Interestingly, changes in leptin levels have also been associated with changes in the composition of CD4 + T cells. Increased leptin correlates with an increased Th1: Th2 ratio 26,27 , which may explain why we observe IFNG-related variation in the microbiome 4 on a high-leptin background. By contrast, low leptin correlates with a decreased Th1:Th2 ratio, which may in turn explain why we observe IL4-related variation in the microbiome 28 on a low-leptin background. The presence of IL23 partitioning in our model potentially indicates Th17 related variation in the gut microbiome 29 may also be observed on a low leptin background. Our results therefore support an established relationship between obesity and the microbiome and, furthermore, allow us to generate novel hypotheses about leptin-dependent detection of CD4 +T cell interaction with the microbiome. www.nature.com/scientificreports www.nature.com/scientificreports/ Fitting DM-RPart trees. Splitting the 128 microbiome samples into five subsets based on Leptin, IFNG, IL4, and IL23 is a two-step process. In the first step, microbiome samples are recursively split into two groups based on a covariate and cut point such that samples above or below that cut point are more homogeneous. In this example, the parent node (i.e., the set of all 128 samples) is split into two children nodes based on the subjects Leptin value. Finding the cut point is described in the technical paper, but in brief, all cut points over all covariates (i.e., IL-17F, IL-17A, IL-21, IL-22, IL-23, IL-12p40, Eotaxin, IL-4, IL-13, IFNG, Leptin, and TGFB) are tested. The covariate/ cut point that splits the samples into two groups is selected and entered into the tree (Fig. 1A) that results in the microbiome samples in the two children node being more homogeneous compared to when the samples are combined in the parent node. This splitting step is repeated on the microbiome samples in each child node (i.e., recursively applied) until a small number of samples, say 5, remain in a node, or the taxa counts are identical in all the samples. www.nature.com/scientificreports www.nature.com/scientificreports/ The second step of the DM-RPart algorithm is to find the right-sized tree after the first step of fitting a full tree partitioned down to a small number of samples within each terminal node. To avoid overfitting the data, the full tree is pruned back to a smaller right-sized tree by cross-validation and cost-complexity pruning. Cross-validation error is defined as the total squared Euclidean distance between validation data to the mean taxa of each terminal node for a given subtree which is known as the PRESS statistic (see details in Methods section) which becomes smaller as the partition finds homogenous subgroups (see Breiman et al., Chpt. 11, Section 5) 7 . The cross-validation results for the Leptin/IFNG/IL4/IL23 tree is shown in Table 1. Column one is the number of terminal nodes or complexity of the subtrees, column two is the cost complexity parameter α for pruning, column three the number of splits in the tree, the fourth column the difference in error between nested trees (rows), and the last column is the mean of 100 times 10-fold cross-validation error for selecting the best size tree. The right-sized tree is selected as smallest mean cross-validation level which is five on the full iHMP data.

DM-RPart application to Insulin sensitivity data.
We subsequently used DM-RPart to examine whether insulin sensitivity may influence host-microbiome interactions. We combined samples for 27 insulin resistant (IR) and 64 insulin sensitive (IS) based on steady state plasma glucose (SSPG) measurements. The same variables as above plus a categorical variable indicating if the sample was from an IR or IS subject were included. Notably, DM-RPart returned no significant tree when partitioning microbiomes from both IR and IS individuals which are shown in Table 2 where the smallest mean cross-validation error happened in the root node (Tree Complexity 1) which means no split on the insulin resistant/sensitivity data.
Application of DM-RPart to simulation data. In this simulation we assess the performance of DM-RPart under a range of sample size (number of microbiome samples), difference between DM parameters (π and θ), and normally distributed covariates (Table 3). For each simulation three microbiome datasets were generated using the Dirichlet-multinomial parameters θ π( , ) from the HMP saliva, throat, and tongue body sites. Each body site was associated with a randomly generated normal covariate with mean −1, 0, and 1 for saliva, throat, and tongue, respectively. Three standard deviations were used to allow differing levels of overlap between the groups. The number of reads for each sample is 50,000. The datasets were generated under Dirichlet-Multinomial distribution using the function Dirichlet.multinomial in the R package HMP. www.nature.com/scientificreports www.nature.com/scientificreports/ The statistical assessment of DM-RPart is defined as the average of mean squared error in each terminal node for the right-sized pruned trees. The mean squared error is based on the distance of the samples in the node from the node mean (π) with the smaller distance indicates better fit of the data. Note that in the simulation, the group membership was never used for fitting the tree. Table 4 shows the mean results for 1000 iterations. As expected, the distance goes up as the amount of variability among microbiome samples increase (θ increases from 0.08 to 0.6), as the sample size decreases (N from 360 to 120), and as the covariate standard deviation increases (SD from 0.2 to 0.5).

Discussion
The Dirichlet-multinomial distribution was introduced previously for formal hypothesis testing in microbiome research 15 and is extended here for microbiome regression. This paper introduces a novel approach of applying recursive partitioning to find covariates that separate and explain differences in microbiome taxa count data. The utility of this approach lies in discovering covariates associated with the microbiome taxa counts, and for forming hypotheses for future testing and patient stratification in clinical trials.
DM-RPart uses recursive partitioning to regress microbiome compositional data onto covariates. Microbiome taxa count data is compositional since a fixed number of sequence reads are sampled and an increase in one taxa must be accounted for by a decrease in one or more other taxa to ensure that the sum of the proportions always adds to 1 30 . The Dirichlet-multinomial distribution analyzes microbiome compositional data using taxa counts avoiding the need to rarefy which is known to cause loss of information 31 , the reduction of multivariate taxa counts into a single number diversity measure which is difficult to interpret 32 , and the fallacy that systems biology can be learned by univariate statistics through one-taxa-at-a-time analyses 33 .
Any covariates can be used for fitting trees such as treatment or disease group, age, gender, or metabolites. In the first example we focused on cytokines in order to develop hypotheses about microbiome/cytokine associations. In the second example we were interested in whether insulin resistant versus insulin sensitive patients had different microbiomes.
The selection of the covariates is not limited, though we encourage covariate sets be selected that can be interpreted biologically versus adding hundreds or thousands of covariates that will be difficult to understand and result in spurious results due to the high dimensionality of the data 34,35 .
Two known limitations of recursive partitioning are the tendency to overfit the data, and binarizing of continuous variables at each split of the tree into two levels. We address the overfitting problem using a cross-validation cost-complexity pruning algorithm. The issue of splitting continuous variables (binarizing) has been argued as a reason to avoid recursive partitioning algorithms. Nevertheless, RP is a commonly used exploratory analysis tool in the statistics community, and DM-RPart clearly provides insights into the relationship between covariates and the microbiome that cannot be quantified by other approaches. As an exploratory tool it also needs to be kept in mind that the results should be used to advise on the hypothesis testing and design of future studies. We believe the problem of binarizing covariates is outweighed by the insight this model provides.
Other methods are used in microbiome research to associate microbiome data with covariates such as PERMANOVA 36 or multiple co-inertia analysis (MCIA) 37 . However, these methods differ from DM-RPart in terms of the type of outcome (e.g., PERMANOVA regresses pairwise differences of diversity onto covariates) or how the results are presented (e.g., MCIA presents ordination plots based on taxa counts and covariates). Because  www.nature.com/scientificreports www.nature.com/scientificreports/ these methods and DM-RPart are using different data and approaches a benchmarking comparison was not felt appropriate.
Models fit using the DM-RPart function in R can be used to predict the terminal node assignment and microbiome for new data with the same covariates. The ability to predict in R can be used to validate the models as well and provide independent test data estimates of accuracy.

Methods introduction to Dirichlet multinomial distribution. The Dirichlet Multinomial (DM) probability dis-
tribution on the set of microbiome taxa count samples is defined as: is a microbiome sample with x ij the number of reads of taxa j found in sample i, = … j J 1, , unique taxa and = … i N 1, , microbiome samples. The parameter π π π π =  { , , , }, is a vector of the expected taxa proportions, and the parameter θ ≥ 0 is the overdispersion which is a measure of the between sample variability. The interpretation of π θ P x ( ; , ) i is the probability of observing a sample X i in a group given the parameters π θ ( , ). In practice, π θ ( , ) are unknown so can be estimated using either the maximum likelihood estimation (MLE) or the method of moments (MoM). In this paper, MoM is chosen since MLE is time consuming. Mathematical details on the Dirichlet-multinomial and MoM calculations are presented in La Rosa et al. 15 .

Log-likelihood ratio (LLR) statistic.
In DM-RPart, we use the likelihood ratio statistic from the DM model to partition the covariate space into non-overlapping homogeneous subregions. Let θ π .L x x x ( , ; , , , ) given the method of moments (MoM) parameter estimates θ π( , ). The larger the likelihood the more confidence that the data were generated from the model with parameters θ π( , ). The likelihood function is calculated as the product of the probabilities of each observed sample, In classical statistics a likelihood ratio test is routinely used for testing alternative models, and was used to test if two sets of microbiome samples come from the same DM model or from different DM models 15  has well-known statistical properties and is the basis that used to calculate a P value for hypothesis testing.
Instead of hypothesis testing to calculate a P value, in the DM-RPart algorithm LR is used as a measure of how similar and dissimilar subsets of microbiome samples are to each other. In practice for computational accuracy, the log-likelihood π π θ θ … =∑ LL x x x Px ( , ; , , , ) log( ( ; , )) N i 1 2 is used in place of the likelihood function but behaves mathematically the same way. The LR is therefore replaced with the log-likelihood ratio (LLR), with ≈ LLR 0 indicating the two groups are similar. The further LLR is from 0 the stronger the evidence that the samples come from two groups. In DM-RPart the loglikelihood functions for the microbiome samples combined in a single group LL 0 and split into two subgroups LL LL , 1 2 are used to decide how strong the evidence that the samples should be partitioned into two subgroups (see below for details).
introduction to recursive partitioning. The general framework of recursive partitioning for a binary (Yes/No) outcome is introduced to illustrate the method. A dataset with N = 100 random observations was generated with two predictor (independent) variables X and Y simulated from a Uniform distribution between 0 and 1, and a binary outcome (dependent) variable Z = 1 with probability = 0.95 if Y ≤ 0.5, probability = 0.05 if Y > 0.5, X ≤ 0.5, and probability = 0.70 if Y > 0.5, X > 0.5. Otherwise Z = 0. The simulated data is shown Fig. 2 along with the boundaries partitioning the covariate space into non-overlapping subregions with different probabilities of Z = 1.
The goal of recursive partitioning is to automatically find boundaries such as those in Fig. 2 over many covariates. The algorithm does this by calculating how well every possible split along X and Y, defined as the midpoints between ordered values improves the homogeneity of the data in the two subsets relative to the unsplit set. Each possible split is scored according to a defined measure of homogeneity, which for binary outcomes is often the Gini diversity index which is minimized if all the outcomes in a subset are the same. In practice the first and last few cut points would not be tested to ensure a minimum, say 5, samples are in each subgroup. After each cut point along X is scored, the same procedure is applied to the other covariates. After all possible splits over all www.nature.com/scientificreports www.nature.com/scientificreports/ covariates are scored, the split with the best score is selected and the data in a (parent) node are partitioned into two non-overlapping (child) nodes.
The algorithm proceeds recursively by repeating this step on each child node until all the Z (outcome) variables within a node have the same value, or there are less than some minimal number of observations need for splitting, say 5. The order of the splits defines a tree structure with the top node containing all the data, and the terminal nodes containing subregions of the covariate space defined by splits along the covariates.
For the simulated data in Fig. 2, 65 samples belong to class 1 and 35 to class 0. Using a majority rule, the data in the parent node would be classified as 1 with a misclassification rate of 35% (i.e., the 35 class 0 samples incorrectly classified as class 1). The first split occurs at ≥ .
Y 0 505 sending 61 of the 100 samples to the left child node, and < .
Y 0 505 sending the remaining 39 samples to the right child node. The samples in the right child node were not further split with high homogeneity of 38 class 1 samples and 1 class 0 sample for a misclassification rate of 1 out of 39, or 2.5%. The samples in the left child node were classified by majority rule as class 0 since 34 out of 61 were class 0, incorrectly classifying the 27 class 1 samples as class 0 samples, a 44.3% misclassification rate. This node was further split along < .
X 0 488 to the left child node, and ≥ . X 0 488 to the right child node, with misclassification rates of 2 out of 26 (7.7%) and 10 out of 25 (40.0%), respectively (Fig. 3). While the misclassification rate of this last node is high, the overall misclassification rate is 2 class 1 samples misclassified as 0, and 10 + 1 = 13 class 0 samples classified as 1, for a total misclassification rate of 13%, a large improvement over the 35% misclassification from not partitioning the data.
As an exploratory data analysis method 38 , additional attractions of recursive partitioning models are the intuitiveness of the decisions rules defined by the order of the splits down any path, and the ability to fit many variables in the model which otherwise could not be easily displayed visually by other graphical methods.  www.nature.com/scientificreports www.nature.com/scientificreports/ cost complexity and pruning. To determine the right-sized tree such as shown in Fig. 3, the recursive partitioning algorithm fits a full tree, T Full until a user-defined stopping criterion such as a minimum number of samples is reached, or a node is found with all outcomes being identical. In the example the full tree shown in Fig. 4 has 9 terminal nodes. To avoid overfitting the data this tree is pruned back using a cost-complexity parameter to eliminate branches that balances predictive accuracy with the size (complexity) of the tree 16 . Let T Full denote the full tree with 9 terminal nodes from 8 splits (for this example), and T denote a pruned subtree of T Full obtained by cutting off branches and collapsing samples in children nodes into a higher parent node. The complexity of any tree, T , is the number of nodes in T , where ≤ ≤ T 1 9 with = T 9 Full and the root node (no splits) = T 1 Root . Other trees obtained from pruning T Full will have complexity ranging between these two extremes.
The cost complexity for any subtree T is where α ≥ 0 is the cost complexity parameter (CP), T is the complexity of the tree, and ∑ = S T ( ) n T n 1 is the sum over all terminal nodes in T of a measure of homogeneity, which in the case of binary outcomes is usually Gini diversity. The α for each subtree is calculated as the differences of the relative errors of two nested trees, ⊂ T T i j , where T i is obtained by pruning T j . α is a tuning parameter that balances the tree size T and goodness of fit to the data ∑ = S T ( ) n T n 1 . For a categorical response, the relative error is calculated based on the misclassification rate for each split, with α calculated as the difference of relative error divided by the reduction in complexity. For example, for the tree . The optimal α has the smallest cross-validation estimate of error rate. In the example α = .
0 043 has the smallest estimated error rate of 0.429 giving the size of the optimal tree as 2 splits and 3 terminal nodes (Table 5 and Fig. 2). Because pruning uses a nested algorithm there exists only one pruned tree with 2 splits and 3 terminal nodes so a unique tree is obtained.
An alternative method of selecting the right size tree is the 1 SE rule. Sometimes the cross-validation errors rapidly decrease for the first splits, followed by a long flat tail with small random fluctuations. When the position of the minimum in the flat tail is unstable, the 1 SE rule provides the simplest tree whose accuracy is comparable   www.nature.com/scientificreports www.nature.com/scientificreports/ to the tree with the smallest cross-validation estimate of error rate. The 1 SE rule is implemented as follows: First define the tree T k0 with the smallest cross-validation error R T ( ) k0 . Second, the tree T k1 is selected, where T k1 is the simplest tree such that

Tree Complexity Cost Complexity (α) Number of Splits Relative Error Cross-validation Error
Researches could choose either of the smallest cross-validation estimate of error rate or the 1 SE rule to select the right size tree.
Microbiome recursive partitioning. We presented recursive partitioning using a simple example with a binary outcome to illustrate the method and introduce the mathematics. This method can be used with categorical or continuous outcome variables, and has been extended to more complex models such as linear regression 39 and genetic epidemiology 10 . Most standard statistical packages now contain versions of recursive partitioning.
In this section, microbiome recursive partitioning is presented that extends classical recursive partitioning to microbiome taxa counts as the outcome variable with splitting based on the Dirichlet-multinomial log-likelihood ratio (LLR) and cross-validation cost-complexity pruning to avoid overfitting. The goal is to find partitions of the covariate space by recursive partitioning so microbiome samples within terminal nodes are more similar to each other than they are to microbiome samples in other terminal nodes.
Consider go into the left child node, and all others into the right child node. As in standard recursive partitioning, the process is repeated for all cut points and covariates, with each split of a parent node scored by the loglikelihood ratio For each parent node, the split with the largest absolute value of LLR is selected as the split for that node. This process is repeated recursively being applied to each child node to fit T Full . This parameter is specified by the user.
The cost complexity in DM-RP for any subtree T is is the sum of the LL over all terminal nodes in the subtree T and α adjusts for the size of the tree. When α = 0, the largest tree is selected, because α T part is dropped. As α → ∞, the root tree will be selected since larger trees result in larger cost-complexity values. For a given α, a subtree α T( ) which minimizes α = ∑ can be found because there are finite number of subtrees of T FULL . A 10-fold cross-validation method for selecting the right sized tree is developed. In any node we can calculate the MOM estimate of the Dirichlet-multinomial parameter π ( ) as the average taxa proportions of taxa samples contained in that node. For a new taxa sample, it will fall into a node based on its covariates and the splitting rules defined by the tree. The Euclidean distance between the sample's taxa frequency and π is calculated. The mean of the squared Euclidean distance of all the new test samples to the π of the terminal nodes they fall into is calculated as a measure of fit and used for cross-validation pruning. The procedure is implemented as follows: 1. Fit T Full , and calculate all cost complexity parameters α 2. Randomly partition the data into 10 equal sized subsets. 3

Data availability
DM-RPart is available in the R HMP package as open source code through CRAN. In addition, the data and code to generate Fig. 1 are included as a vignette and can be used to reproduce our results and as a template for running other datasets.