Abstract
Differential abundance (DA) analysis of microbiome data continues to be a challenging problem due to the complexity of the data. In this article we define the notion of “sampling fraction” and demonstrate a major hurdle in performing DA analysis of microbiome data is the bias introduced by differences in the sampling fractions across samples. We introduce a methodology called Analysis of Compositions of Microbiomes with Bias Correction (ANCOMBC), which estimates the unknown sampling fractions and corrects the bias induced by their differences among samples. The absolute abundance data are modeled using a linear regression framework. This formulation makes a fundamental advancement in the field because, unlike the existing methods, it (a) provides statistically valid test with appropriate pvalues, (b) provides confidence intervals for differential abundance of each taxon, (c) controls the False Discovery Rate (FDR), (d) maintains adequate power, and (e) is computationally simple to implement.
Introduction
A number of procedures have been proposed and used in the literature for identifying deferentially abundant taxa between two or more ecosystems. A detailed survey of some of the existing methods and their performance has been discussed in Weiss et al.^{1}. As noted in a list of studies^{2,3,4,5,6}, the observed microbiome data are relative abundances which sum to a constant, hence they are compositional. Standard statistical methods are not appropriate for analyzing compositional data^{7}. Methods such as ANOVA, Kruskal–Wallis test do not appropriately take into consideration the compositional feature of microbiome data when performing differential abundance (DA) analysis. As demonstrated in literatures^{1,2}, these methods are subject to inflated false discovery rates (FDR). Although metagenomeSeq^{8} was specifically developed for microbiome data, it too is subject to inflated FDR under the Gaussian mixture model^{1,2}.
ANCOM^{2}, which is based on Aitchison’s methodology, uses relative abundances to infer about absolute abundances. According to an extensive simulation study^{1}, among the available methods for DA analysis, only ANCOM performs well in controlling FDR while maintaining high power, as long as the sample size is not too small. One of the deficiencies of ANCOM is that it does not provide p value for individual taxon, nor can it provide standard errors or confidence intervals of DA for each taxon, and it can be computationally intensive.
The Differential Ranking (DR) methodology^{6} reformulates the DA analysis as a multinomial regression problem. By imposing the Additive LogRatio transformation to relative abundances, the DR methodology accounts for compositionality of microbiome data. As demonstrated in^{6}, the ranks of relative differentials perfectly correlate with ranks of absolute differentials. However, similar to ANCOM, the DR procedure does not provide p values or confidence intervals to declare statistical significance.
It is important to distinguish between absolute and relative abundances of taxa in a unit volume of an ecosystem. Change in the absolute abundance of a single taxon can alter the relative abundances of all taxa (Fig. 1). The choice of parameter for statistical analysis is important and needs to be clearly stated. Often researchers are interested in identifying taxa that are different in mean absolute abundance between two or more ecosystems^{6}. Testing hypotheses regarding mean relative abundance is not equivalent to testing hypotheses regarding mean absolute abundance^{2,6}. In addition, note that not all samples have the same sampling fraction, which is defined as the ratio of the expected absolute abundance of a taxon in a random sample (e.g., a stool sample) to its absolute abundance in a unit volume of the ecosystem (e.g., a unit volume of gut) where the sample was derived from. Consequently, the observed counts are not comparable between samples. Thus, all DA methodologies require the counts to be properly normalized to account for differences in sampling fractions across samples. Sampling fraction is affected by two components, namely, the microbial load in a unit volume of the ecosystem and the library size of the corresponding sample (e.g., total species abundances sequenced from a subject’s stool sample). Therefore, it is not sufficient to normalize the library size across samples as one needs to take into consideration the differences in the microbial loads. Consider the toy example in Fig. 2. Suppose the gut of subject A as well as B consist of only two taxa, the red and green varieties. Clearly, the true absolute abundance of each taxon is 50% more in subject B’s ecosystem as compared with subject A’s. However, they each have the same library size (six each) in their respective samples. Furthermore, sample relative abundance as well as sample absolute abundances are identical in the two samples. If a normalization method is based only on the library size and ignores the sampling fraction, then the two samples would be considered as normalized. Consequently, an investigator would falsely conclude that none of the taxa are differentially abundant in the two ecosystems. This erroneous conclusion would be avoided if one recognizes that we have a larger sampling fraction in the sample obtained from A’s ecosystem than from B’s (\(\frac{3}{9}\) vs \(\frac{2}{9}\)), Thus, normalizing data on the basis of sampling fractions gives a better description of the truth than normalization methods that rely purely on the library sizes.
Ideally, under the null hypothesis, the test statistic for DA analysis should be (at least approximately) centered at zero (i.e., unbiased). However, for many DA methods, this is not always true for at least one of the following reasons: (1) The test statistic may not be designed for testing hypothesis regarding the actual parameter of interest; (2) Data are not properly normalized; (3) Underlying structure, such as compositionality, is ignored. Motivated by the limitations of existing DA methods, in this paper we propose a methodology called Analysis of Compositions of Microbiomes with Bias Correction (ANCOMBC) that is aimed to address the problems mentioned above. As in ANCOM and DR, the proposed ANCOMBC methodology assumes that the observed sample is an unknown fraction of a unit volume of the ecosystem, and the sampling fraction varies from sample to sample. ANCOMBC accounts for sampling fraction by introducing a samplespecific offset term in a linear regression framework, that is estimated from the observed data. The offset term serves as the bias correction, and the linear regression framework in log scale is analogous to logratio transformation to deal with the compositionality of microbiome data. The case of zero counts is also discussed in “Methods” section. This methodology has some conceptual similarities with DR, but is fundamentally different. With ANCOMBC, one can perform standard statistical tests and construct confidence intervals for DA. Moreover, as demonstrated in benchmark simulation studies, ANCOMBC (a) controls the FDR very well while maintaining adequate power compared with other popular methods, and (b) it is substantially faster than ANCOM. The CPU time (RStudio, x86_64appledarwin15.6.0, and macOS) is 0.28 min vs. 63 min when the number of taxa is 500. The CPU time for ANCOM increases dramatically as the number of taxa increases to 1000. In this case, the CPU times for ANCOMBC and ANCOM are 0.51 and 211 min, respectively. In addition to results based on synthetic data, we also illustrate ANCOMBC using the wellknown global gut microbiota dataset^{9}.
Results
Normalization
Using simulated data, we illustrate how the existing normalization methods fail to eliminate the bias introduced by differences in sampling fractions across samples, whereas the normalization method introduced in ANCOMBC performs well. Specifically, we compare our proposed method with CumulativeSum Scaling (CSS) implemented in metagenomeSeq^{8}, Median (MED) in DESeq2^{10}, Upper Quartile (UQ) and Trimmed Mean of M values (TMM), and TotalSum Scaling (TSS). In addition, we also considered modified versions of UQ and TMM implemented in edgeR^{11}. These are obtained by multiplying the normalization factors with the corresponding library size to account for “effective library size”^{12}, and are denoted as ELibUQ and ELibTMM (see Supplementary Table 7 for formulas and Supplementary Fig. 11 for workflow).
We considered a variety of simulation scenarios as follows. The details of the simulation study are presented in the Supplementary Notes.
 (1)
Unbalanced microbial load in two experimental groups and balanced library size for each sample. This results in a large variability in sampling fractions (Fig. 3).
 (2)
Unbalanced microbial load in two experimental groups and unbalanced library size for each sample. This results in a moderate variability in sampling fractions (Supplementary Fig. 1).
 (3)
Balanced microbial load in two experimental groups and balanced library size for each sample. This results in a small variability in sampling fractions (Supplementary Fig. 2).
Thus, we simulated data where sampling fraction in Group 1 is systematically different from sampling fraction in Group 2. Consequently, observed absolute abundances in the samples in the two groups were systematically different even though the actual absolute abundances in the ecosystems are same. To evaluate the performance of each normalization method, we introduced a residual measure that estimates the deviation between the estimated sampling fraction and the true sampling fraction (see Supplementary Discussion). For simplicity of exposition, we plotted the centered residuals, by subtracting the group average of the residuals. If a normalization method is effective then it should eliminate the bias due to the differences in the sampling fractions so that samples from the two groups (circles and triangles) in Fig. 3 should intermix and not cluster by the group labels.
From Fig. 3 (and Supplementary Figs. 1, 2) we notice that the samples normalized by ANCOMBC are nicely intermixed and do not cluster by the group labels. This is not the case with most of the remaining methods where residuals cluster by group labels, thus indicating that they are unable to eliminate the underlying differences in sampling fractions between the two groups. Thus, under the null hypothesis of no difference in the absolute abundance of a taxon in two groups, their test statistics are not centered at zero. This results in inflated FDR (see Supplementary Discussion). We also note from Fig. 3 and Supplementary Figs. 1, 2, that not only ANCOMBC does well in estimating the bias due to differences in sampling fraction, the variability in the estimates of the sampling fractions is very small as seen from the height of the box plot for ANCOMBC. This is an important observation because it suggests that the variability in the estimator of bias due to sampling fraction is potentially negligible in the test statistic described in “Methods” section.
Clearly, as seen in Fig. 4a, b and Supplementary Figs. 3a, b, 4a,b, the normalization of data has a major effect on the FDR and power of various methods.
DA analyses
Simulating data from PoissonGamma distributions (see Supplementary Notes for simulation settings and Supplementary Fig. 12 for workflow), we evaluated the performance of various methods in terms of FDR and power. Since there is no hard threshold available for DR to declare whether a taxon is differentially abundant or not, it was not included in this simulation study.
Not surprisingly, standard Wilcoxon ranksum test applied to relative abundance data leads to highly inflated FDR (Fig. 4a and Supplementary Figs. 3a, 4a) in all simulation scenarios. This is primarily because such standard tests ignore the compositional structure of the data, and seen from Fig. 3, TSS does not successfully normalize the data. Simply applying nonparametric tests without any normalization can also be problematic when the sampling fractions are different across experimental groups (Fig. 4a). The two widely used countbased methods in RNASeq literature, edgeR (implemented using ELibTMM^{12} by default) and DESeq2, generally exceed the 5% nominal FDR level when there are differences in sampling fractions (Fig. 4a and Supplementary Fig. 3a). For instance, edgeR has FDR as large as 40% (Fig. 4a), meaning that 40% of findings could be potentially false discoveries. The zeroinflated Gaussian mixture model used in metagenomeSeq (ZIG) consistently has the largest FDR when sampling fractions are not constant (Fig. 4a and Supplementary Fig. 3a). In some cases, the FDR could be as much as 70%, which perhaps is partly due to the Gaussian distribution assumption for log abundance data. Although metagenomeSeq using zeroinflated LogGaussian mixture model successfully controls the FDR under 5% in all simulations, it suffers a severe loss of power (Fig. 4b and Supplementary Figs. 3b, 4b). The power of detecting differentially abundant taxa could be lower than 10%.
Similar to ANCOM, ANCOMBC not only controls the FDR at the nominal level (5%) but also maintains adequate power in all simulation settings considered here. An important observation to be made regarding all methods, other than ANCOM and ANCOMBC, is that as the sample size within each group increases, so does the FDR. This is perhaps a consequence of the fact that the test statistics are not centered at the true null parameter but are shifted due to differences in the sampling fraction. Hence asymptotically, these tests fail to control the false positive as well as FDR (see Supplementary Discussion).
In addition to the above PoissonGamma model, we performed simulations using the real global patterns data^{13}, to get a broader perspective on the performance of the various methods (see Supplementary Notes for simulation details). In this case again, ANCOM and ANCOMBC controlled the FDR and competed well in terms of power with all other methods. The estimated FDR of DESeq2 and edgeR increased further in this simulation setup (Supplementary Fig. 5a, b) compared with the simulation using PoissonGamma distribution. Note that DESeq2 and edgeR were designed for PoissonGamma distribution, and hence it is not surprising that these methods performed poorly in this new setup.
Illustration using gut microbiota data
We illustrate ANCOMBC by analyzing the US, Malawi and Venezuela gut microbiota data^{9}. This dataset consists of 11,905 OTUs obtained from subjects in the USA (n = 317), Malawi (n = 114), and Venezuela (n = 99). We first assessed the performance of different normalization methods mentioned above. One heuristic approach to gain insights on the impact of normalization is to examine how well the normalized samples separate from each other according to their phenotypes in a nonmetric multidimensional scaling (NMDS) plot. We provide the results for Malawi and Venezuela populations in Fig. 5.
As seen from this figure, ANCOMBC appears to perform very well visually in separating samples from the two populations and has the largest betweengroup sum of squares (BSS). BSS measures how well clusters are separated. Larger the BSS value the better a method is in clustering objects according to group labels. ELibTMM, CSS, and MED also performed well. Consistent with the bias correction and FDR/Power simulations reported in Figs. 3 and 4, where ELibUQ, UQ, TMM, and TSS perform poorly in correcting biases and have poor FDR control, they also have poor performances in distinguishing samples based on their nationalities.
We also report results of pairwise DA analyses at phylum level among the above three countries using ANCOMBC. It is wellknown that the infant gut microbiota evolve with their age^{14} due to changes in the feeding patterns, diet, and other exposures. Hence, for illustration purposes, we performed a stratified analysis by considering two age groups, infants below 2 years (labeled as “infants”) and adults between 18 and 40 (labeled as “adults”). Results of all pairwise comparisons are provided in Fig. 6a and Supplementary Table 1. Note that ANCOMBC is the first method in the literature that can not only identify differentially abundant taxa while controlling the FDR for multiple testing, it also provides 95% simultaneous confidence intervals for the mean DA of each taxon in the two experimental groups. These confidence intervals are adjusted for multiplicity using Bonferroni method. Thus, a researcher can evaluate the effect size associated with each taxon when comparing two experimental groups. This is particularly important in the present climate when researchers are increasingly skeptical about making decisions based on p values (alone)^{15}.
Interestingly, phyla such as Cyanobacteria, Elusimicrobia, Euryarchaeota, and Spirochetes, which are known to be associated with rural environment and hygiene^{16,17,18,19}, are significantly more abundant among Malawi than the US infants and adults. We discover an interesting trend in the absolute abundance of phylum Verrucomicrobia, whose absolute abundance is known to increase with antibiotics usage to protect against pathogens and other opportunistic bacteria^{20}. Consistent with the high usage of antibiotics in the western world among infants as well as adults, we discover a significant increase in the absolute abundance of Verrucomicrobia in US relative to Malawi adults and infants, and relative to Venezuelan adults (Fig. 6a). Similarly, there is a significant increase in its absolute abundance among Venezuelan infants compared with Malawi (Fig. 6a).
It is welldocumented in the literature that BMI is linked to the ratio of Bacteroidetes to Firmicutes^{21}. In our sample, the US infants, as well as adults, had higher BMI than their counterparts in Malawi; The US infants also had higher BMI than Venezuela infants (Supplementary Table 2). Interestingly the ratio of Bacteroidetes to Firmicutes was larger among Malawi infants than the US infants (Fig. 6b and Supplementary Table 3). Similarly, the ratio was significantly larger among Venezuela infants than the US infants (Fig. 6b and Supplementary Table 3). Although the differences of the ratio of Bacteroidetes to Firmicutes between US and nonUS adults were not significant, the effect sizes showed a similar trend as infants indicating that US adults had smaller ratio of Bacteroidetes to Firmicutes. We did not find any significant differences between Malawi and Venezuelan infants as well as adults. These results are in line with our findings that there were no differences in the mean absolute abundances of Firmicutes as well as Bacteroidetes among Malawi and Venezuelan infants as well as adults (Fig. 6a).
Discussion
The DA analysis of microbiome data is a challenging problem^{5,6}, in part due to inaccessibility of data necessary for drawing inferences on DA in two or more ecosystems. An important unobservable parameter that impacts DA analysis is the sampling fraction of a sample drawn from a unit volume of ecosystem. As noted in previous studies^{5,6}, the bias correction due to sampling fraction is a major hurdle. While, ANCOM as well as DR procedures find ways to get around the problem from different perspectives, there is room for improvement. Secondly, differential relative abundance analysis of microbiome data is not equivalent to differential absolute abundance analysis of microbiome data. Often simplex or compositional data analysisbased methods transform the simplex coordinate system to Euclidean space by performing log ratio transformation. However, such methods require the researcher to prespecify the reference taxon and the results may be highly dependent on the choice of the reference taxon^{6}. It is important to reiterate that ANCOM computes logratios with respect to all taxa and thus the choice of reference taxon is not an issue for ANCOM. As demonstrated mathematically in ANCOM methodology^{2}, as long as two taxa are not differentially abundant between two ecosystems, one can draw inferences about DA using differential relative abundance.
ANCOMBC enjoys several important unique characteristics. First, it is the only method available in the literature that estimates the sampling fraction and performs DA analysis by correcting bias due to differential sampling fractions across samples. It is the only procedure that provides valid p values and confidence intervals for each taxon. Second, unlike ANCOM, it simplifies DA analysis by recasting the problem as a linear regression problem with an offset. The offset is due to the sampling fraction. By virtue of linear regression formulation, ANCOMBC can be applied to a broad collection of study designs, including longitudinal data, repeated measurements design, covariance adjusted analysis, and so on. Using a broad range of simulations studies, we demonstrate that ANCOMBC, like ANCOM, controls the FDR very well, while almost all other methods investigated in this paper fail.
The ANCOMBC methodology may not perform well when the sample sizes are very small, such as n = 5 per group. The FDR is not controlled by ANCOMBC in such cases (Supplementary Fig. 6a, b). However, when the sample size increases to 10, our simulation results indicate that ANCOMBC controls FDR with adequate power (Supplementary Fig 6a, b). We also evaluated the performance of ANCOMBC when the number of taxa is small, as when researchers perform DA analysis at the phylum or class levels. Even in such instances, ANCOMBC controls the FDR very well while maintaining high power (Supplementary Table 4). ANCOMBC performs best in terms of FDR control when the proportion of differentially abundant taxa is not too large (e.g., <75%). Otherwise, it may have slightly elevated FDR (Supplementary Fig. 7a, b). However, none of the other methods control the FDR either, in fact, they have larger FDRs than ANCOMBC.
In many applications, researchers are interested in drawing inferences regarding DA of taxa in more than two ecosystems. We extended ANCOMBC to deal with such multigroup situations. Extensive simulations suggest that ANCOMBC controls FDR while maintaining high power (Supplementary Table 5).
In summary, the proposed ANCOMBC methodology (1) explicitly tests hypothesis regarding differential absolute abundance of individual taxon and provides valid confidence intervals; (2) provides an approach to correct the bias induced by (unobservable) differential sampling fractions across samples; (3) takes into account the compositionality of the microbiome data, and (4) does not rely on strong parametric assumptions. With the linear regression framework adopted in ANCOMBC, it allows researchers to derive p value associated with each taxon as well as confidence interval estimation for differential absolute abundance. These are unique to ANCOMBC, to the best of our knowledge. Last but not the least, because of the regression framework adopted in ANCOMBC, it can be extended to more general settings involving multigroup comparisons, adjusting covariates as well as applying to longitudinal/repeated measurements data.
Methods
Notation
The notations described in ANCOMBC methodology are summarized in Table 1.
Data preprocessing
We adopted the methodology of ANCOMII^{22} as the preprocessing step to deal with different types of zeros before performing DA analysis.
There are instances where some taxa are systematically absent in an ecosystem. For example, there may be taxa present in a soil sample from a desert that might absent in a soil sample from a rain forest. In such cases, the observed zeros are called structural zeros. Let p_{ij} denote proportion nonzero samples of the ith taxon in the jth group, and let \({\hat{p}}_{ij}=\frac{1}{{n}_{j}}\mathop{\sum }\nolimits_{k = 1}^{{n}_{j}}I({O}_{ijk}\ne 0)\) denote the estimate of p_{ij}. In practice, we declare the ith taxon to have structural zeros in the jth group if either of the following is true:
 (a)
\({\hat{p}}_{ij}=0.\)
 (b)
\({\hat{p}}_{ij}1.96\scriptstyle{\sqrt{\frac{{\hat{p}}_{ij}(1{\hat{p}}_{ij})}{{n}_{j}}}}\le 0\).
If a taxon is considered to be a structural zero in an experimental group then, for that specific ecosystem, the taxon is not used in further analysis. Thus, suppose there are three ecosystems A, B, and C and suppose taxon X is a structural zero in ecosystems A and B but not in C, then taxon X is declared to be differentially abundant in C relative to A and B and not analyzed further. If taxon Y is a structurally zero in ecosystem A but not in B and C, in that case we declare that taxon Y is differentially abundant in B relative to A as well as differentially abundant in C relative to A. We then compare the absolute abundance of taxon Y between B and C using the methodology described in this section. Taxa identified to be structural zeros among all experimental groups are ignored from the following analyses.
In a similar fashion, we address the outlier zeros as well as sampling zeros using the methodology developed in ANCOMII^{22}.
Model assumptions
Assumption 0.1.
where \({\sigma }_{w,ijk}^{2}\) = variability between specimens within the kth sample from the jth group. Therefore, \({\sigma }_{w,ijk}^{2}\) characterizes the withinsample variability. Typically, researchers do not obtain more than one specimen at a given time in most microbiome studies. Consequently, variability between specimens within sample is usually not estimated.
According to Assumption 0.1, in expectation the absolute abundance of a taxon in a random sample is in constant proportion to the absolute abundance in the ecosystem of the sample. In other words, the expected relative abundance of each taxon in a random sample is equal to the relative abundance of the taxon in the ecosystem of the sample.
Assumption 0.2. For each taxon i, A_{ijk}, j = 1, …, g, k = 1, …, n_{j}, are independently distributed with
where \({\sigma }_{b,ij}^{2}\) = betweensample variation within group j for the ith taxon.
The Assumption 0.2 states that for a given taxon, all subjects within and between groups are independent, where θ_{ij} is a fixed parameter rather than a random variable.
Regression framework
From Assumptions 0.1 and 0.2, we have:
Motivated by the above setup, we introduce the following linear model framework for logtransformed OTU counts data:
with
Note that with a slight abuse of notation for simplicity of exposition, the above logtransformation of data is inspired by the Box–Cox family of transformations^{23} which are routinely used in data analysis. Note that d in the above equation is not exactly log(c) due to Jensenʼs inequality, it simply reflects the effect of c
An important distinction from standard ANOVA: Before we describe the details of the proposed methodology, we like to draw attention to a fundamental difference between the current formulation of the problem and the standard oneway ANOVA model. For simplicity, let us suppose we have two groups, say a treatment and a control group. Let us also suppose that there is only one taxon in our microbiome study and n subjects are assigned to the treatment group and n are assigned to the control group. Suppose the researcher is interested in comparing the mean absolute abundance of the taxon in the ecosystems of the two groups. Then under the above assumptions, the model describing the study is given by:
Then trivially the sample mean absolute abundance of jth group is given by \({\bar{y}}_{j.}=\frac{1}{n}\mathop{\sum }\nolimits_{k = 1}^{n}{y}_{jk}\) and \(E({\bar{y}}_{j.})=\frac{1}{n}\mathop{\sum }\nolimits_{k = 1}^{n}{d}_{jk}+{\mu }_{j}={\bar{d}}_{j.}+{\mu }_{j}\). The difference in the sample means between the two groups is \({\bar{y}}_{1.}{\bar{y}}_{2.}\) and its expectation is \(E({\bar{y}}_{1.}{\bar{y}}_{2.})=({\bar{d}}_{1.}{\bar{d}}_{2.})+({\mu }_{1}{\mu }_{2})\). Under the null hypothesis μ_{1} = μ_{2}, \(E({\bar{y}}_{1.}{\bar{y}}_{2.})={\bar{d}}_{1.}{\bar{d}}_{2.}\ne \,0\), unless \({\bar{d}}_{1.}={\bar{d}}_{2.}\). Thus because of the differential sampling fractions, which are sample specific, the numerator of the standard ttest under the null hypothesis for these microbiome data is nonzero. This introduces bias and hence inflates the Type I error. On the other hand, the standard oneway ANOVA model for two groups, which is not applicable for the microbiome data described in this paper, is of the form:
Hence under the null hypothesis μ_{1} = μ_{2}, \(E({\bar{y}}_{1.}{\bar{y}}_{2.})=0\). Thus, in this case the standard ttest is appropriate. Hence in this paper we develop methodology to eliminate the bias introduced by the differential sampling fraction by each sample. To do so, we exploit the fact that we have a large number of taxa on each subject and we borrow information across taxa to estimate this bias, which is the essence of the following methodology.
Bias and variance of bias estimation under the null hypothesis: From the above model (equation (4)), for each j, note that \(E({\bar{y}}_{ij\cdot })={\bar{d}}_{j\cdot }+{\mu }_{ij}\) and \(E({\bar{y}}_{\cdot jk})={d}_{jk}+{\bar{\mu }}_{\cdot j}\), where \(\bar{w}\) represents the arithmetic mean over the suitable index. Using the least squares framework, we therefore estimate μ_{ij} and d_{jk} as follows:
Note that \(E({\hat{\mu }}_{ij})=E({\overline{y}}_{ij\cdot })={\mu }_{ij}+{\overline{d}}_{j\cdot }\). Thus, for each j = 1, 2, …g, \({\hat{\mu }}_{ij}\) is a biased estimator and \(E({\hat{\mu }}_{i1}{\hat{\mu }}_{i2})=({\mu }_{i1}{\mu }_{i2})+{\overline{d}}_{1\cdot }{\overline{d}}_{2\cdot }\). Denote \(\delta ={\overline{d}}_{1\cdot }{\overline{d}}_{2\cdot }.\) To begin with, in the following we shall assume there are two experimental groups with balanced design, i.e., g = 2 and n_{1} = n_{2} = n. Later the methodology is easily extended to unbalanced design and multigroup settings. Suppose we have two ecosystems and for each taxon i, i = 1, 2, …m, we wish to test the hypothesis
Under the null hypothesis, \(E({\hat{\mu }}_{i1}{\hat{\mu }}_{i2})=\delta \, \ne \, 0\), and hence biased. The goal of ANCOMBC is to estimate this bias and accordingly modify the estimator \({\hat{\mu }}_{i1}{\hat{\mu }}_{i2}\) so that the resulting estimator is asymptotically centered at zero under the null hypothesis and hence the test statistic is asymptotically centered at zero. First, we make the following observations. Since \(E({\overline{y}}_{ij\cdot })={\overline{d}}_{j\cdot }+{\mu }_{ij}\) and \({\hat{\mu }}_{ij}={\overline{y}}_{ij\cdot }\), therefore \({\hat{\mu }}_{ij}\) is an unbiased estimator of \({\overline{d}}_{j\cdot }+{\mu }_{ij}\). From (5) and Lyapunov central limit theorem, we have:
Let Σ_{jk} denote an m × m covariance matrix of \({\epsilon }_{jk}={({\epsilon }_{1jk},{\epsilon }_{2jk},\ldots ,{\epsilon }_{mjk})}^{T}\), where \({\sigma }_{ii^{\prime} jk}\) is the \({(i,i^{\prime} )}^{th}\) element of Σ_{jk} and \({\sigma }_{ijk}^{2}\) is the ith diagonal element of Σ_{jk}. Furthermore, suppose
Assumption 0.3.
Denote 1 = (1, 1, …, 1)^{T}, then we have
Hence
Thus, for each k = 1, 2, …, n, and for each taxon i = 1, 2, …, m, according to Assumption 0.3, we have:
Thus,
denote the mean residual sum of squares. Then under some mild regularity conditions^{24}, we have the following consistency result
Therefore, using \({\hat{\sigma }}_{ij}\) for σ_{ij} in (8) and appealing to Slutsky’s theorem, we have:
Furthermore, based on Assumption 0.3, from (8) and (15) we obtain:
Consequently,
The above observations regarding the convergence of various statistics play a critical role in the following. Since the sampling fraction is constant for all taxa within the subject, we attempt to pool information across taxa when estimating δ. We model the taxa using the following Gaussian mixtures model. For the ith taxon, i = 1, 2, …, m, let \({\Delta }_{i}={\hat{\mu }}_{i1}{\hat{\mu }}_{i2}\). Let C_{0} denote the set of taxa that are not differentially abundant between the two groups, i.e., C_{0} = {i ∈ (1, 2, …, m): μ_{i1} = μ_{i2}}, C_{1} denote the set of taxa whose mean abundance in group 1 is less than that of group 2, i.e., C_{1} = {i ∈ (1, 2, …, m): μ_{i1} < μ_{i2}}, and let C_{2} denote the set of taxa whose mean abundance in group 1 is greater than that of group 2, i.e., C_{2} = {i ∈ (1, 2, …, m): μ_{i1} > μ_{i2}}, Let π_{r} denote the probability that a taxon belongs to set C_{r}, r = 0, 1, 2. For simplicity of estimation of parameters, similar to GEE, we shall assume that Δ_{i}, i = 1, 2, …, m are independently distributed. Thus, we ignore the underlying correlation structure when estimating δ. This is similar to what is often done in other omics studies. Thus, we model the distribution of Δ_{i} by Gaussian mixture as follows:
where
 (1)
ϕ is the normal density function,
 (2)
δ + l_{1} and δ + l_{2} are means for Δ_{i}∣C_{1}, and Δ_{i}∣C_{2}, respectively. l_{1} < 0, l_{2} > 0,
 (3)
ν_{i0}, ν_{i1}, and ν_{i2} are variances of Δ_{i}∣C_{0}, Δ_{i}∣C_{1}, and Δ_{i}∣C_{2}, respectively.
For computational simplicity, we assume that ν_{i1} > ν_{i0}, ν_{i2} > ν_{i0}. Thus, without loss of generality for κ_{1}, κ_{2} > 0, let ν_{i1} = ν_{i0} + κ_{1} and ν_{i2} = ν_{i0} + κ_{2}. While this assumption is not a requirement for our method, it is reasonable to assume that variability among differentially abundant taxa is larger than that among the null taxa. By making this assumption, we speedup the computation time.
Assuming samples are independent between experimental groups, we begin by first estimating \({\nu }_{i0}^{2}={\rm{V}}{\rm{ar}}({\hat{\mu }}_{i1}{\hat{\mu }}_{i2})={\rm{V}}{\rm{ar}}({\hat{\mu }}_{i1})+{\rm{V}}{\rm{ar}}({\hat{\mu }}_{i2})\). Using the estimator stated in (14), we estimate \({\nu }_{i0}^{2}\) consistently as follows:
In all future calculations, we plug in \({\hat{\nu }}_{i0}^{2}\) for \({\nu }_{i0}^{2}\). This is similar in spirit to many statistical procedures involving nuisance parameters. The following lemma is useful in the sequel.
Lemma 0.1.
\(\frac{\partial }{\partial \theta }{\mathrm{log}}\,f(x)={E}_{f(z x)}[\frac{\partial }{\partial \theta }{\mathrm{log}}\,f(z)+\frac{\partial }{\partial \theta }{\mathrm{log}}\,f(x z)]\).^{25}
Let \(\Theta ={(\delta ,{\pi }_{1},{\pi }_{2},{\pi }_{3},{l}_{1},{l}_{2},{\kappa }_{1},{\kappa }_{2})}^{T}\) denote the set of unknown parameters, then for each taxon the loglikelihood can be reformulated using Lemma 0.1, as follows:
Then the E–M algorithm is described as follows:
Estep: Compute conditional probabilities of the latent variable. Define \({p}_{r,i}={\rm{P}}{\rm{r}}(i\in {C}_{r} {\Delta }_{i})=\frac{{\pi }_{r}\phi (\frac{{\Delta }_{i}\, \, (\delta \, +\, {l}_{r})}{{\nu }_{ir}})}{{\sum }_{r}{\pi }_{r}\phi (\frac{{\Delta }_{i}\, \, (\delta \, + \, {l}_{r})}{{\nu }_{ir}})},r=0,1,2;i=1,\ldots ,m\), which are conditional probabilities representing the probability that an observed value follows each distribution. Note that l_{0} = 0.
Mstep: Maximize the likelihood function with respect to the parameters, given the conditional probabilities.
We shall denote the resulting estimator of δ by \({\hat{\delta }}_{{\rm{EM}}}.\)
Next we estimate \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{EM}}})\). Since the likelihood function is not a regular likelihood and hence it is not feasible to derive the Fisher information. Consequently, we take a simpler and a pragmatic approach to derive an approximate estimator of \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{EM}}})\) using \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{WLS}}})\), which is defined below. Extensive simulation studies suggest that \({\hat{\delta }}_{{\rm{EM}}}\) and \({\hat{\delta }}_{{\rm{WLS}}}\) are highly correlated (Supplementary Fig. 9) and it appears to be reasonable to approximate \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{EM}}})\) by \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{WLS}}})\).
Let {C_{r}} = m_{r}, r = 0, 1, 2, then
The above expression is of the form
where
 (1)
1 = (1, …, 1)^{T},
 (2)
\({a}_{r}={({a}_{r1},{a}_{r2},\ldots ,{a}_{r{m}_{r}})}^{T}:= {(\frac{1}{{\nu }_{ir}^{2}})}^{T},\,i\in {C}_{r},\,r=0,1,2\),
 (3)
\({x}_{r}={({x}_{r1},{x}_{r2},\ldots ,{x}_{r{m}_{r}})}^{T}:= {({\Delta }_{i}{l}_{i})}^{T},\,i\in {C}_{r},\,r=0,1,2\). Note that l_{0} = 0,
 (4)
\(\alpha ={({\alpha }_{1},{\alpha }_{2},\ldots ,{\alpha }_{m})}^{T}\equiv {({a}_{1}^{T},{a}_{2}^{T},{a}_{3}^{T})}^{T}\),
 (5)
\(u={({u}_{1},{u}_{2},\ldots ,{u}_{m})}^{T}\equiv {({x}_{1}^{T},{x}_{2}^{T},{x}_{3}^{T})}^{T}\).
For the simplicity of notation, we relabel a and x by α and u, respectively. Denote Cov(x) = Cov(u) by Ω, and let \({\omega }_{ii^{\prime} }\) denotes the \((i,i^{\prime} )\) element of Ω. As in Assumption 0.3, we make the following assumption
Assumption 0.4.
Using the above expressions, we compute the variance as follows:
Recall that (a) for i ∈ C_{0}, \({\omega }_{ii}={\rm{V}}{\rm{ar}}({\Delta }_{i})={\nu }_{i0}^{2}=O({n}^{1})\), (b) for i ∈ C_{1}, \({\omega }_{ii}={\rm{V}}{\rm{ar}}({\Delta }_{i})={\nu }_{i1}^{2}={\nu }_{i0}^{2}+{\kappa }_{1}=O(1)\), and (c) for i ∈ C_{2}, \({\omega }_{ii}={\rm{V}}{\rm{ar}}({\Delta }_{i})={\nu }_{i2}^{2}={\nu }_{i0}^{2}+{\kappa }_{2}=O(1)\). Note that \({\alpha }_{i}=\frac{1}{{\rm{V}}{\rm{ar}}({\Delta }_{i})}=\frac{1}{{\omega }_{ii}}\), thus we have:
Since \({\nu }_{i0}^{2}=O({n}^{1})\), \({\nu }_{i1}^{2}=O(1)\), and \({\nu }_{i2}^{2}=O(1)\), consequently, a_{1i} = O(n), a_{2i} = a_{3i} = O(1), and
Using these facts and Assumption 0.4 in (26), we get
Thus, under Assumption 0.4 regarding \({\omega }_{ii^{\prime} }\), the contribution of the covariance terms in the above variance expression is negligible as long as m is very large compared with n, which is usually the case. Hence
Furthermore, appealing to Cauchy–Schwartz inequality we get
Hence, as long as m_{0} is large, the contribution made by \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{WLS}}})\) and \({\rm{Cov}}({\hat{\mu }}_{i1}{\hat{\mu }}_{i2},{\hat{\delta }}_{{\rm{WLS}}})\) relative to \({\rm{V}}{\rm{ar}}({\hat{\mu }}_{i1}{\hat{\mu }}_{i2})\) is negligible.
Neglect the covariance term in (26), let \({\hat{C}}_{r}\) denote the estimator of C_{r}, r = 0, 1, 2 from the E–M algorithm, define
an estimator of \({\rm{V}}{\rm{ar}}({\hat{\delta }}_{{\rm{WLS}}})\) under the Assumption 0.4. Then
We performed extensive simulations to evaluate the bias and variance of \({\hat{\delta }}_{{\rm{EM}}}\) and that of \({\hat{\delta }}_{{\rm{WLS}}}\). The scatter plot (Supplementary Fig. 9) of \({\hat{\delta }}_{{\rm{EM}}}\) and \({\hat{\delta }}_{{\rm{WLS}}}\) are almost perfectly linear with correlation coefficient nearly 1 in all cases. This suggests that \({\hat{\delta }}_{{\rm{WLS}}}\) is a very good approximation for \({\hat{\delta }}_{{\rm{EM}}}\). The variance of \({\hat{\delta }}_{{\rm{EM}}}\) as well as that of \({\hat{\delta }}_{{\rm{WLS}}}\) are roughly of the order \({n}^{1}{m}_{0}^{1}\), as we expected. In addition, they are approximately unbiased (Supplementary Table 6).
Hypothesis testing for twogroup comparison
For taxon i, we test the following hypothesis
using the following test statistic which is approximately centered at zero under the null hypothesis:
From Slutsky’s theorem, we have:
If the sample size is not very large and/or the number of nonnull taxa are very large, then we modify the above test statistic as follows:
To control the FDR due to multiple comparisons, we recommend applying the Holm–Bonferroni method^{26} or Bonferroni^{27,28} correction rather than the Benjamini–Hochberg (BH) procedure^{29} to adjust the raw p values as research has showed that it is more appropriate to control the FDR when p values were not accurate^{30}, and the BH procedure controls the FDR provided you have either independence or some special correlation structures such as perhaps positive regression dependence among taxa^{29,31}. In our simulation studies, since the absolute abundances for each taxon are generated independently, we compared the ANCOMBC results adjusted either by Bonferroni correction (Fig. 4) or BH procedure (Supplementary Fig. 10), it is clearly that the FDR control by Bonferroni correction is more conservative while implementing BH procedure results in FDR around the nominal level (5%). Obviously, ANCOMBC has larger power when using BH procedure.
Hypothesis testing for multigroup comparison
In some applications, for a given taxon, researchers are interested in drawing inferences regarding DA in more than two ecosystems. For example, for a given taxon, researchers may want to test whether there exists at least one experimental group that is significantly different from others, i.e., to test
Similar to the twogroup comparison, after getting the initial estimates of \({\hat{\mu }}_{ij}\) and \({\hat{d}}_{jk}\), setting the reference group r (e.g., r = 1), and obtaining the estimator of the bias term \({\hat{\delta }}_{rj}\) through E–M algorithm, the final estimator of mean absolute abundance of the ecosystem (in log scale) are obtained by transforming \({\hat{\mu }}_{ij}\) of (6) into:
Thus, based on (18) and the E–M estimator of δ_{rj}, as \(m,\min ({n}_{j},{n}_{j^{\prime} })\to \infty\)
Similarly, the estimator of the sampling fraction is obtained by transforming \({\hat{d}}_{jk}\) of (6) into
As by (13) and the E–M estimator of δ_{rj}
which indicates that we are only able to estimate sampling fractions up to an additive constant (\({\overline{d}}_{r\cdot }\)).
Define the test statistic for pairwise comparison as:
For computational simplicity, inspired by the William’s type of test^{32,33,34,35}, we reformulate the global test with the following test statistic:
Under null, \({W}_{i,jj^{\prime} }{\to }_{d}N(0,1)\), thus we can construct the null distribution of W_{i} by simulations, i.e., for each specific taxon i,
 (a)
Generate \({W}_{i,jj^{\prime} }^{(b)} \sim N(0,1),j\, \ne \, j^{\prime} \in \{1,\,\ldots ,\,g\},b=1,\ldots ,B.\)
 (b)
Compute \({W}_{i}^{(b)}=\mathop{\max }\limits_{j\ne j^{\prime} \in \{1,\ldots ,g\}}{W}_{i,jj^{\prime} }^{(b)}.\)
 (c)
Repeat above steps B times (e.g., B = 1000), we then get the null distribution of W_{i} by \({({W}_{i}^{(1)},\ldots ,{W}_{i}^{(B)})}^{T}.\)
Finally, p value is calculated as
and the Bonferroni correction is applied to control the FDR.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
DNA sequences from the global gut microbiota study^{9} can be found in MGRAST https://www.mgrast.org/index.html server under search string “mgp401” for Illumina V416S rRNA.
Code availability
All analyses can be found under https://github.com/FrederickHuangLin/ANCOMBC.
References
Weiss, S. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5, 27 (2017).
Mandal, S. et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663 (2015).
Gloor, G. B. & Reid, G. Compositional analysis: a valid approach to analyze microbiome highthroughput sequencing data. Can. J. Microbiol. 62, 692–703 (2016).
Gloor, G. B., Wu, J. R., PawlowskyGlahn, V. & Egozcue, J. J. It’s all relative: analyzing microbiome data as compositions. Ann. Epidemiol. 26, 322–329 (2016).
Gloor, G. B., Macklaim, J. M., PawlowskyGlahn, V. & Egozcue, J. J. Microbiome datasets are compositional: and this is not optional. Front. Microbiol. 8, 2224 (2017).
Morton, J. T. et al. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 10, 2719 (2019).
Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B 44, 139–160 (1982).
Paulson, J. N., Stine, O. C., Bravo, H. C. & Pop, M. Differential abundance analysis for microbial markergene surveys. Nat. methods 10, 1200 (2013).
Tanya, Y. et al. Human gut microbiome viewed across age and geography. Nature 486, 222 (2012).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 15, 550 (2014).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Chen, Y., McCarthy, D., Robinson, M. & Smyth, G. K. EdgeR: Differential Expression Analysis of Digital Gene Expression Data User’s Guide. https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf (accessed 17 September 2008) (2014).
Caporaso, J. G. et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl Acad. Sci. 108, 4516–4522 (2011).
Lozupone, C. A. et al. Metaanalyses of studies of the human microbiota. Genome Res. 23, 1704–1714 (2013).
Amrhein, V., Greenland, S. & McShane, B. Scientists rise up against statistical significance. https://www.nature.com/articles/d41586019008579?fbclid=IwAR3K6PysQ9FY4togs39BSciW3YsKPf6EE0Il9R8zxkW4GvrGBHFuz8yF5c (2019).
Codd, G. Cyanobacterial toxins: occurrence, properties and biological significance. Water Sci. Technol. 32, 149–156 (1995).
Herlemann, D. P., Geissinger, O. & Brune, A. The termite group I phylum is highly diverse and widespread in the environment. Appl. Environ. Microbiol. 73, 6682–6685 (2007).
ObregonTito, A. J. et al. Subsistence strategies in traditional societies distinguish gut microbiomes. Nat. Commun. 6, 6505 (2015).
Halperin, J. J. A tale of two spirochetes: lyme disease and syphilis. Neurologic Clin. 28, 277–291 (2010).
Dubourg, G. et al. Highlevel colonisation of the human gut by Verrucomicrobia following broadspectrum antibiotic treatment. Int. J. antimicrobial agents 41, 149–155 (2013).
Castaner, O. et al. The Gut Microbiome Profile in Obesity: A Systematic Review. Int. J. Endocrinol. 1–9 (2018). https://doi.org/10.1155/2018/4095789.
Kaul, A., Mandal, S., Davidov, O. & Peddada, S. D. Analysis of microbiome data in the presence of excess zeros. Front. Microbiol. 8, 2114 (2017).
Box, G. E. & Cox, D. R. An analysis of transformations. J. R. Stat. Soc. Ser. B 26, 211–243 (1964) (Methodological).
Peddada, S. D. & Smith, T. Consistency of a class of variance estimators in linear models under heteroscedasticity. Sankhyā Indian J. Stat. Ser. B 59, 1–10 (1997).
McLachlan, G. & Krishnan, T. The EM Algorithm and Extensions, Vol. 382 (John Wiley & Sons, 2007).
Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979).
Dunn, O. J. Estimation of the means of dependent variables. Ann. Math. Stat. 29, 1095–1111 (1958).
Dunn, O. J. Multiple comparisons among means. J. Am. Stat. Assoc. 56, 52–64 (1961).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
Lim, C., Sen, P. K. & Peddada, S. D. Robust analysis of high throughput screening assay data. Technometrics 55, 150–160 (2013).
Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).
Williams, D. A test for differences between treatment means when several dose levels are compared with a zerodose control. Biometrics 27, 103–117 (1971).
Williams, D. A. Some inference procedures for monotonically ordered normal means. Biometrika 64, 9–14 (1977).
Peddada, S. D., Prescott, K. E. & Conaway, M. Tests for order restrictions in binary data. Biometrics 57, 1219–1227 (2001).
Farnan, L., Ivanova, A. & Peddada, S. D. Linear mixed effects models under inequality constraints with applications. PloS ONE 9, e84778 (2014).
Acknowledgements
This research was funded by the Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Author information
Authors and Affiliations
Contributions
Both authors contributed equally to the theory and methodology described in this paper. All numerical works and computations were conducted by H.L. who developed ANCOMBC pipeline in R that is freely and publicly available. Please contact H.L. for software requests.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Nathan Olson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lin, H., Peddada, S.D. Analysis of compositions of microbiomes with bias correction. Nat Commun 11, 3514 (2020). https://doi.org/10.1038/s41467020170417
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467020170417
This article is cited by

Supertaxon in human microbiome are identified to be associated with colorectal cancer
BMC Bioinformatics (2022)

LinDA: linear models for differential abundance analysis of microbiome compositional data
Genome Biology (2022)

Impact of Helicobacter pylori infection on fluid duodenal microbial community structure and microbial metabolic pathways
BMC Microbiology (2022)

Elevated atmospheric CO2 alters the microbial community composition and metabolic potential to mineralize organic phosphorus in the rhizosphere of wheat
Microbiome (2022)

Spatiotemporal variation in the rootassociated microbiota of orchardgrown apple trees
Environmental Microbiome (2022)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.