Analysis of compositions of microbiomes with bias correction

Differential abundance (DA) analysis of microbiome data continues to be a challenging problem due to the complexity of the data. In this article we define the notion of “sampling fraction” and demonstrate a major hurdle in performing DA analysis of microbiome data is the bias introduced by differences in the sampling fractions across samples. We introduce a methodology called Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC), which estimates the unknown sampling fractions and corrects the bias induced by their differences among samples. The absolute abundance data are modeled using a linear regression framework. This formulation makes a fundamental advancement in the field because, unlike the existing methods, it (a) provides statistically valid test with appropriate p-values, (b) provides confidence intervals for differential abundance of each taxon, (c) controls the False Discovery Rate (FDR), (d) maintains adequate power, and (e) is computationally simple to implement.

A number of procedures have been proposed and used in the literature for identifying deferentially abundant taxa between two or more ecosystems. A detailed survey of some of the existing methods and their performance has been discussed in Weiss et al. 1 . As noted in a list of studies [2][3][4][5][6] , the observed microbiome data are relative abundances which sum to a constant, hence they are compositional. Standard statistical methods are not appropriate for analyzing compositional data 7 . Methods such as ANOVA, Kruskal-Wallis test do not appropriately take into consideration the compositional feature of microbiome data when performing differential abundance (DA) analysis. As demonstrated in literatures 1,2 , these methods are subject to inflated false discovery rates (FDR). Although meta-genomeSeq 8 was specifically developed for microbiome data, it too is subject to inflated FDR under the Gaussian mixture model 1,2 . ANCOM 2 , which is based on Aitchison's methodology, uses relative abundances to infer about absolute abundances. According to an extensive simulation study 1 , among the available methods for DA analysis, only ANCOM performs well in controlling FDR while maintaining high power, as long as the sample size is not too small. One of the deficiencies of ANCOM is that it does not provide p value for individual taxon, nor can it provide standard errors or confidence intervals of DA for each taxon, and it can be computationally intensive.
The Differential Ranking (DR) methodology 6 reformulates the DA analysis as a multinomial regression problem. By imposing the Additive Log-Ratio transformation to relative abundances, the DR methodology accounts for compositionality of microbiome data. As demonstrated in 6 , the ranks of relative differentials perfectly correlate with ranks of absolute differentials. However, similar to ANCOM, the DR procedure does not provide p values or confidence intervals to declare statistical significance.
It is important to distinguish between absolute and relative abundances of taxa in a unit volume of an ecosystem. Change in the absolute abundance of a single taxon can alter the relative abundances of all taxa (Fig. 1). The choice of parameter for statistical analysis is important and needs to be clearly stated. Often researchers are interested in identifying taxa that are different in mean absolute abundance between two or more ecosystems 6 . Testing hypotheses regarding mean relative abundance is not equivalent to testing hypotheses regarding mean absolute abundance 2,6 . In addition, note that not all samples have the same sampling fraction, which is defined as the ratio of the expected absolute abundance of a taxon in a random sample (e.g., a stool sample) to its absolute abundance in a unit volume of the ecosystem (e.g., a unit volume of gut) where the sample was derived from. Consequently, the observed counts are not comparable between samples. Thus, all DA methodologies require the counts to be properly normalized to account for differences in sampling fractions across samples. Sampling fraction is affected by two components, namely, the microbial load in a unit volume of the ecosystem and the library size of the corresponding sample (e.g., total species abundances sequenced from a subject's stool sample). Therefore, it is not sufficient to normalize the library size across samples as one needs to take into consideration the differences in the microbial loads. Consider the toy example in Fig. 2. Suppose the gut of subject A as well as B consist of only two taxa, the red and green varieties. Clearly, the true absolute abundance of each taxon is 50% more in subject B's ecosystem as compared with subject A's. However, they each have the same library size (six each) in their respective samples. Furthermore, sample relative abundance as well as sample absolute abundances are identical in the two samples. If a normalization method is based only on the library size and ignores the sampling fraction, then the two samples would be considered as normalized. Consequently, an investigator would falsely conclude that none of the taxa are differentially abundant in the two ecosystems. This erroneous conclusion would be avoided if one recognizes that we have a larger sampling fraction in the sample obtained from A's  Fig. 1 The distinction between absolute abundances and relative abundances. As shown in this figure, all taxa (in different colors and shapes) may be identically abundant in a unit volume of two ecosystems (e.g., a unit volume of gut), except for one differentially abundant taxon (the green variety). Due to this one differentially abundant taxon, the two ecosystems may differ in the relative abundance of all taxa. A researcher may not only be interested in knowing if the mean relative abundance of a taxon is different between two ecosystems but may also want to know if the absolute abundance of a taxon is different in a unit volume of two ecosystems.
ecosystem than from B's ( 3 9 vs 2 9 ), Thus, normalizing data on the basis of sampling fractions gives a better description of the truth than normalization methods that rely purely on the library sizes.
Ideally, under the null hypothesis, the test statistic for DA analysis should be (at least approximately) centered at zero (i.e., unbiased). However, for many DA methods, this is not always true for at least one of the following reasons: (1) The test statistic may not be designed for testing hypothesis regarding the actual parameter of interest; (2) Data are not properly normalized; (3) Underlying structure, such as compositionality, is ignored. Motivated by the limitations of existing DA methods, in this paper we propose a methodology called Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) that is aimed to address the problems mentioned above. As in ANCOM and DR, the proposed ANCOM-BC methodology assumes that the observed sample is an unknown fraction of a unit volume of the ecosystem, and the sampling fraction varies from sample to sample. ANCOM-BC accounts for sampling fraction by introducing a sample-specific offset term in a linear regression framework, that is estimated from the observed data. The offset term serves as the bias correction, and the linear regression framework in log scale is analogous to log-ratio transformation to deal with the compositionality of microbiome data. The case of zero counts is also discussed in "Methods" section. This methodology has some conceptual similarities with DR, but is fundamentally different. With ANCOM-BC, one can perform standard statistical tests and construct confidence intervals for DA. Moreover, as demonstrated in benchmark simulation studies, ANCOM-BC (a) controls the FDR very well while maintaining adequate power compared with other popular methods, and (b) it is substantially faster than ANCOM. The CPU time (RStudio, x86_64-apple-darwin15.6.0, and macOS) is 0.28 min vs. 63 min when the number of taxa is 500. The CPU time for ANCOM increases dramatically as the number of taxa increases to 1000. In this case, the CPU times for ANCOM-BC and ANCOM are 0.51 and 211 min, respectively. In addition to results based on synthetic data, we also illustrate ANCOM-BC using the well-known global gut microbiota dataset 9 .

Results
Normalization. Using simulated data, we illustrate how the existing normalization methods fail to eliminate the bias introduced by differences in sampling fractions across samples, whereas the normalization method introduced in ANCOM-BC performs well. Specifically, we compare our proposed method  2 The bias introduced by cross-sample variations in sampling fractions. Sampling fraction is defined as the ratio of expected absolute abundance in a sample to the corresponding absolute abundance in the ecosystem, which could be empirically estimated by the ratio of library size to the microbial load. Differences in sampling fractions may introduce bias and increase in false positive as well as false negative rates in differential abundance analysis. In this toy example, the microbial load for subject A in a unit volume of ecosystem (e.g., a unit volume of gut) is 18 (12 red + 6 green), while for subject B is 27 (18 red + 9 green). However, the samples taken from subject A and B have the same library size 6 (4 red + 2 green), the same observed absolute abundance as well as the same relative abundance of red and green taxa. Thus, one may mistakenly conclude that the red and green taxa are not differentially abundant, which is not the case in the two ecosystems. This false negative conclusion is caused by differences in the sampling fractions in the two samples. The sampling fraction in sample A is 3/9 and for B it is 2/9. One can similarly construct examples where a false positive conclusion is arrived at. Thus, a normalization method must account for differences in sampling fractions to avoid such erroneous conclusions.
with Cumulative-Sum Scaling (CSS) implemented in metagen-omeSeq 8 , Median (MED) in DESeq2 10 , Upper Quartile (UQ) and Trimmed Mean of M values (TMM), and Total-Sum Scaling (TSS). In addition, we also considered modified versions of UQ and TMM implemented in edgeR 11 . These are obtained by multiplying the normalization factors with the corresponding library size to account for "effective library size" 12 , and are denoted as ELib-UQ and ELib-TMM (see Supplementary Table 7 for formulas and Supplementary Fig. 11 for workflow).
We considered a variety of simulation scenarios as follows. The details of the simulation study are presented in the Supplementary Notes. Thus, we simulated data where sampling fraction in Group 1 is systematically different from sampling fraction in Group 2. Consequently, observed absolute abundances in the samples in the two groups were systematically different even though the actual absolute abundances in the ecosystems are same. To evaluate the performance of each normalization method, we introduced a residual measure that estimates the deviation between the estimated sampling fraction and the true sampling fraction (see Supplementary Discussion). For simplicity of exposition, we plotted the centered residuals, by subtracting the group average of the residuals. If a normalization method is effective then it should eliminate the bias due to the differences in the sampling fractions so that samples from the two groups (circles and triangles) in Fig. 3 should intermix and not cluster by the group labels.
From Fig. 3 (and Supplementary Figs. 1, 2) we notice that the samples normalized by ANCOM-BC are nicely intermixed and do not cluster by the group labels. This is not the case with most of the remaining methods where residuals cluster by group labels, thus indicating that they are unable to eliminate the underlying differences in sampling fractions between the two groups. Thus, under the null hypothesis of no difference in the absolute abundance of a taxon in two groups, their test statistics are not centered at zero. This results in inflated FDR (see Supplementary Discussion). We also note from Fig. 3 and Supplementary Figs. 1, 2, that not only ANCOM-BC does well in estimating the bias due to differences in sampling fraction, the variability in the estimates of the sampling fractions is very small as seen from the height of the box plot for ANCOM-BC. This is an important observation because it suggests that the variability in the estimator of bias due to sampling fraction is potentially negligible in the test statistic described in "Methods" section.
Clearly, as seen in Fig. 4a, b and Supplementary Figs. 3a, b, 4a, b, the normalization of data has a major effect on the FDR and power of various methods.
DA analyses. Simulating data from Poisson-Gamma distributions (see Supplementary Notes for simulation settings and Supplementary Fig. 12 for workflow), we evaluated the performance of various methods in terms of FDR and power. Since there is no hard threshold available for DR to declare whether a taxon is differentially abundant or not, it was not included in this simulation study.
Not surprisingly, standard Wilcoxon rank-sum test applied to relative abundance data leads to highly inflated FDR ( Fig. 4a and Supplementary Figs. 3a, 4a) in all simulation scenarios. This is primarily because such standard tests ignore the compositional structure of the data, and seen from Fig. 3, TSS does not successfully normalize the data. Simply applying nonparametric Text on the upper left corner indicates the color for each method and variances are provided within parenthesis for each method. The variability in sampling fractions is set to be large. An ideal box plot should display a narrow height (i.e., smaller variability) and samples from the two groups should be intermixed and not display any systematic separation. We note that all existing methods have larger variances compared with ANCOM-BC, and TSS has the largest variance. Except ANCOM-BC, UQ, and TMM, we see from the plot that circles and triangles are systematically separated, which indicates that ELib-UQ, ELib-TMM, CSS, MED, and TSS do not account for systematic bias due to differences in sampling fractions across groups.
tests without any normalization can also be problematic when the sampling fractions are different across experimental groups (Fig. 4a). The two widely used count-based methods in RNA-Seq literature, edgeR (implemented using ELib-TMM 12 by default) and DESeq2, generally exceed the 5% nominal FDR level when there are differences in sampling fractions ( Fig. 4a and Supplementary Fig. 3a). For instance, edgeR has FDR as large as 40% (Fig. 4a), meaning that 40% of findings could be potentially false discoveries. The zero-inflated Gaussian mixture model used in metagenomeSeq (ZIG) consistently has the largest FDR when sampling fractions are not constant ( Fig. 4a and Supplementary  Fig. 3a). In some cases, the FDR could be as much as 70%, which perhaps is partly due to the Gaussian distribution assumption for log abundance data. Although metagenomeSeq using zeroinflated Log-Gaussian mixture model successfully controls the FDR under 5% in all simulations, it suffers a severe loss of power ( Fig. 4b and Supplementary Figs. 3b, 4b). The power of detecting differentially abundant taxa could be lower than 10%. Similar to ANCOM, ANCOM-BC not only controls the FDR at the nominal level (5%) but also maintains adequate power in all simulation settings considered here. An important observation to be made regarding all methods, other than ANCOM and ANCOM-BC, is that as the sample size within each group increases, so does the FDR. This is perhaps a consequence of the fact that the test statistics are not centered at the true null parameter but are shifted due to differences in the sampling fraction. Hence asymptotically, these tests fail to control the false positive as well as FDR (see Supplementary Discussion).
In addition to the above Poisson-Gamma model, we performed simulations using the real global patterns data 13 , to get a broader perspective on the performance of the various methods (see Supplementary Notes for simulation details). In this case again, ANCOM and ANCOM-BC controlled the FDR and competed well in terms of power with all other methods. The estimated FDR of DESeq2 and edgeR increased further in this simulation set-up ( Supplementary Fig. 5a, b) compared with the simulation using Poisson-Gamma distribution. Note that DESeq2 and edgeR were designed for Poisson-Gamma distribution, and hence it is not surprising that these methods performed poorly in this new set-up.
Illustration using gut microbiota data. We illustrate ANCOM-BC by analyzing the US, Malawi and Venezuela gut microbiota data 9 . This dataset consists of 11,905 OTUs obtained from subjects in the USA (n = 317), Malawi (n = 114), and Venezuela (n = 99). We first assessed the performance of different normalization methods mentioned above. One heuristic approach to gain insights on the impact of normalization is to examine how well the normalized samples separate from each other according to their phenotypes in a nonmetric multidimensional scaling (NMDS) plot. We provide the results for Malawi and Venezuela populations in Fig. 5.
As seen from this figure, ANCOM-BC appears to perform very well visually in separating samples from the two populations and has the largest between-group sum of squares (BSS). BSS measures how well clusters are separated. Larger the BSS value the better a method is in clustering objects according to group labels. ELib-TMM, CSS, and MED also performed well. Consistent with the bias correction and FDR/Power simulations reported in Figs. 3 and 4, where ELib-UQ, UQ, TMM, and TSS perform poorly in correcting biases and have poor FDR control, they also have poor performances in distinguishing samples based on their nationalities.
We also report results of pairwise DA analyses at phylum level among the above three countries using ANCOM-BC. It is wellknown that the infant gut microbiota evolve with their age 14 due to changes in the feeding patterns, diet, and other exposures. Hence, for illustration purposes, we performed a stratified analysis by considering two age groups, infants below 2 years (labeled as "infants") and adults between 18 and 40 (labeled as  Table 1. Note that ANCOM-BC is the first method in the literature that can not only identify differentially abundant taxa while controlling the FDR for multiple testing, it also provides 95% simultaneous confidence intervals for the mean DA of each taxon in the two experimental groups. These confidence intervals are adjusted for multiplicity using Bonferroni method. Thus, a researcher can evaluate the effect size associated with each taxon when comparing two experimental groups. This is particularly important in the present climate when researchers are increasingly skeptical about making decisions based on p values (alone) 15 .
Interestingly, phyla such as Cyanobacteria, Elusimicrobia, Euryarchaeota, and Spirochetes, which are known to be associated with rural environment and hygiene [16][17][18][19] , are significantly more abundant among Malawi than the US infants and adults. We discover an interesting trend in the absolute abundance of phylum Verrucomicrobia, whose absolute abundance is known to increase with antibiotics usage to protect against pathogens and other opportunistic bacteria 20 . Consistent with the high usage of antibiotics in the western world among infants as well as adults, we discover a significant increase in the absolute abundance of Verrucomicrobia in US relative to Malawi adults and infants, and relative to Venezuelan adults (Fig. 6a). Similarly, there is a significant increase in its absolute abundance among Venezuelan infants compared with Malawi (Fig. 6a).
It is well-documented in the literature that BMI is linked to the ratio of Bacteroidetes to Firmicutes 21 . In our sample, the US infants, as well as adults, had higher BMI than their counterparts in Malawi; The US infants also had higher BMI than Venezuela infants (Supplementary Table 2). Interestingly the ratio of Bacteroidetes to Firmicutes was larger among Malawi infants than the US infants ( Fig. 6b and Supplementary Table 3). Similarly, the ratio was significantly larger among Venezuela infants than the US infants ( Fig. 6b and Supplementary Table 3). Although the differences of the ratio of Bacteroidetes to Firmicutes between US and non-US adults were not significant, the effect sizes showed a similar trend as infants indicating that US adults had smaller ratio of Bacteroidetes to Firmicutes. We did not find any significant differences between Malawi and Venezuelan infants as well as adults. These results are in line with our findings that there were no differences in the mean absolute abundances of Firmicutes as well as Bacteroidetes among Malawi and Venezuelan infants as well as adults (Fig. 6a).

Discussion
The DA analysis of microbiome data is a challenging problem 5,6 , in part due to inaccessibility of data necessary for drawing inferences on DA in two or more ecosystems. An important unobservable parameter that impacts DA analysis is the sampling fraction of a sample drawn from a unit volume of ecosystem. As noted in previous studies 5, 6 , the bias correction due to sampling fraction is a major hurdle. While, ANCOM as well as DR procedures find ways to get around the problem from different perspectives, there is room for improvement. Secondly, differential relative abundance analysis of microbiome data is not equivalent to differential absolute abundance analysis of microbiome data. Often simplex or compositional data analysis-based methods transform the simplex coordinate system to Euclidean space by performing log ratio transformation. However, such methods require the researcher to prespecify the reference taxon and the results may be highly dependent on the choice of the reference taxon 6 . It is important to reiterate that ANCOM computes log-ratios with respect to all taxa and thus the choice of reference taxon is not an issue for ANCOM. As demonstrated mathematically in ANCOM methodology 2 , as long as two taxa are not differentially abundant between two ecosystems, one can draw inferences about DA using differential relative abundance. ANCOM-BC enjoys several important unique characteristics. First, it is the only method available in the literature that estimates the sampling fraction and performs DA analysis by correcting bias due to differential sampling fractions across samples. It is the only procedure that provides valid p values and confidence intervals for each taxon. Second, unlike ANCOM, it simplifies DA analysis by recasting the problem as a linear regression problem with an offset. The offset is due to the sampling fraction. By virtue of linear regression formulation, ANCOM-BC can be applied to a broad collection of study designs, including longitudinal data, repeated measurements design, covariance adjusted analysis, and so on. Using a broad range of simulations studies, we demonstrate that ANCOM-BC, like ANCOM, controls the FDR very well, while almost all other methods investigated in this paper fail.
The ANCOM-BC methodology may not perform well when the sample sizes are very small, such as n = 5 per group. The FDR is not controlled by ANCOM-BC in such cases (Supplementary Fig. 6a, b). However, when the sample size increases to 10, our simulation results indicate that ANCOM-BC controls FDR with adequate power (Supplementary Fig 6a, b). We also evaluated the performance of ANCOM-BC when the number of taxa is small, as when researchers perform DA analysis at the phylum or class levels. Even in such instances, ANCOM-BC controls the FDR very well while maintaining high power (Supplementary Table 4). ANCOM-BC performs best in terms of FDR control when the proportion of differentially abundant taxa is not too large (e.g., <75%). Otherwise, it may have slightly elevated FDR (Supplementary Fig. 7a, b). However, none of the other methods control the FDR either, in fact, they have larger FDRs than ANCOM-BC.
In many applications, researchers are interested in drawing inferences regarding DA of taxa in more than two ecosystems. We extended ANCOM-BC to deal with such multigroup situations. Extensive simulations suggest that ANCOM-BC controls FDR while maintaining high power (Supplementary Table 5).
In summary, the proposed ANCOM-BC methodology (1) explicitly tests hypothesis regarding differential absolute abundance of individual taxon and provides valid confidence intervals; (2) provides an approach to correct the bias induced by (unobservable) differential sampling fractions across samples; (3) takes into account the compositionality of the microbiome data, and (4) does not rely on strong parametric assumptions. With the linear regression framework adopted in ANCOM-BC, it allows researchers to derive p value associated with each taxon as well as confidence interval estimation for differential absolute abundance. These are unique to ANCOM-BC, to the best of our knowledge. Last but not the least, because of the regression framework adopted in ANCOM-BC, it can be extended to more general settings involving multigroup comparisons, adjusting covariates as well as applying to longitudinal/repeated measurements data.

Methods
Notation. The notations described in ANCOM-BC methodology are summarized in Table 1.
Data preprocessing. We adopted the methodology of ANCOM-II 22 as the preprocessing step to deal with different types of zeros before performing DA analysis.
There are instances where some taxa are systematically absent in an ecosystem. For example, there may be taxa present in a soil sample from a desert that might absent in a soil sample from a rain forest. In such cases, the observed zeros are called structural zeros. Let p ij denote proportion non-zero samples of the ith taxon in the jth group, and letp ij ¼ 1 n j Pn j k¼1 IðO ijk ≠0Þ denote the estimate of p ij . In practice, we declare the ith taxon to have structural zeros in the jth group if either of the following is true: If a taxon is considered to be a structural zero in an experimental group then, for that specific ecosystem, the taxon is not used in further analysis. Thus, suppose there are three ecosystems A, B, and C and suppose taxon X is a structural zero in ecosystems A and B but not in C, then taxon X is declared to be differentially abundant in C relative to A and B and not analyzed further. If taxon Y is a structurally zero in ecosystem A but not in B and C, in that case we declare that taxon Y is differentially abundant in B relative to A as well as differentially abundant in C relative to A. We then compare the absolute abundance of taxon Y between B and C using the methodology described in this section. Taxa identified to be structural zeros among all experimental groups are ignored from the following analyses.
In a similar fashion, we address the outlier zeros as well as sampling zeros using the methodology developed in ANCOM-II 22 .
where σ 2 w;ijk = variability between specimens within the kth sample from the jth group. Therefore, σ 2 w;ijk characterizes the within-sample variability. Typically, researchers do not obtain more than one specimen at a given time in most microbiome studies. Consequently, variability between specimens within sample is usually not estimated.
According to Assumption 0.1, in expectation the absolute abundance of a taxon in a random sample is in constant proportion to the absolute abundance in the ecosystem of the sample. In other words, the expected relative abundance of each taxon in a random sample is equal to the relative abundance of the taxon in the ecosystem of the sample.
Assumption 0.2. For each taxon i, A ijk , j = 1, …, g, k = 1, …, n j , are independently distributed with where σ 2 b;ij = between-sample variation within group j for the ith taxon. The Assumption 0.2 states that for a given taxon, all subjects within and between groups are independent, where θ ij is a fixed parameter rather than a random variable.
Regression framework. From Assumptions 0.1 and 0.2, we have: Motivated by the above set-up, we introduce the following linear model framework for log-transformed OTU counts data: with Eðϵ ijk Þ ¼ 0; Note that with a slight abuse of notation for simplicity of exposition, the above log-transformation of data is inspired by the Box-Cox family of transformations 23 which are routinely used in data analysis. Note that d in the above equation is not exactly log(c) due to Jensenʼs inequality, it simply reflects the effect of c An important distinction from standard ANOVA: Before we describe the details of the proposed methodology, we like to draw attention to a fundamental difference between the current formulation of the problem and the standard oneway ANOVA model. For simplicity, let us suppose we have two groups, say a treatment and a control group. Let us also suppose that there is only one taxon in our microbiome study and n subjects are assigned to the treatment group and n are assigned to the control group. Suppose the researcher is interested in comparing the mean absolute abundance of the taxon in the ecosystems of the two groups. Then under the above assumptions, the model describing the study is given by: ; n: Then trivially the sample mean absolute abundance of jth group is given by y j: ¼ 1 n P n k¼1 y jk and Eð y j: Þ ¼ 1 The difference in the sample means between the two groups is y 1: À y 2: and its expectation is Eð y 1: À y 2: Þ ¼ ð d 1: À d 2: Þ þ ðμ 1 À μ 2 Þ. Under the null hypothesis μ 1 = μ 2 , Eð y 1: À y 2: Þ ¼ d 1: À d 2: ≠ 0, unless . Thus because of the differential sampling fractions, which are sample specific, the numerator of the standard t-test under the null hypothesis for these microbiome data is non-zero. This introduces bias and hence inflates the Type I error. On the other hand, the standard one-way ANOVA model for two groups, which is not applicable for the microbiome data described in this paper, is of the form: ; n: Hence under the null hypothesis μ 1 = μ 2 , Eð y 1: À y 2: Þ ¼ 0. Thus, in this case the standard t-test is appropriate. Hence in this paper we develop methodology to eliminate the bias introduced by the differential sampling fraction by each sample.
To do so, we exploit the fact that we have a large number of taxa on each subject and we borrow information across taxa to estimate this bias, which is the essence of the following methodology. Bias and variance of bias estimation under the null hypothesis: From the above model (equation (4)), for each j, note that Eð y ijÁ Þ ¼ d jÁ þ μ ij and Eð y Ájk Þ ¼ d jk þ μ Áj , where w represents the arithmetic mean over the suitable index. Using the least squares framework, we therefore estimate μ ij and d jk as follows: ; n j ; j ¼ 1; 2; g; Note that Eðμ ij Þ ¼ Eðy ijÁ Þ ¼ μ ij þ d jÁ . Thus, for each j = 1, 2, …g,μ ij is a biased estimator and Eðμ i1 Àμ i2 To begin with, in the following we shall assume there are two experimental groups Group index, j = 1, 2, …, g. k Sample index, k = 1, 2, …, n j . θ ij a Expected absolute abundance of ith taxon in a unit volume of ecosystem in the jth group.
Unobserved absolute abundance of ith taxon in a unit volume of ecosystem of kth sample in the jth group.
Microbial load in a unit volume of ecosystem of kth sample in the jth group.
Unobserved relative abundance of ith taxon in a unit volume of ecosystem of kth sample in the jth group.
Observed absolute abundance of ith taxon in a random specimen taken from a unit volume of ecosystem of kth sample in the jth group. O. jk b Library size of a random specimen taken from a unit volume of ecosystem of kth sample in the jth group.
Observed relative abundance of ith taxon in a random specimen taken from a unit volume of ecosystem of kth sample in the jth group.
c jk a For kth sample from the jth group, c jk represents the proportion of its ecosystem (unobserved absolute abundance) in a random sample (observed absolute abundance), thus c jk ¼ . We shall refer to this constant as "sampling fraction". with balanced design, i.e., g = 2 and n 1 = n 2 = n. Later the methodology is easily extended to unbalanced design and multigroup settings. Suppose we have two ecosystems and for each taxon i, i = 1, 2, …m, we wish to test the hypothesis Under the null hypothesis, Eðμ i1 Àμ i2 Þ ¼ δ ≠ 0, and hence biased. The goal of ANCOM-BC is to estimate this bias and accordingly modify the estimatorμ i1 Àμ i2 so that the resulting estimator is asymptotically centered at zero under the null hypothesis and hence the test statistic is asymptotically centered at zero. First, we make the following observations. Since Eðy ijÁ Þ ¼ d jÁ þ μ ij andμ ij ¼ y ijÁ , thereforeμ ij is an unbiased estimator of d jÁ þ μ ij . From (5) and Lyapunov central limit theorem, we have:μ Let Σ jk denote an m × m covariance matrix of ϵ jk ¼ ðϵ 1jk ; ϵ 2jk ; ; ϵ mjk Þ T , where σ ii 0 jk is the ði; i 0 Þ th element of Σ jk and σ 2 ijk is the ith diagonal element of Σ jk . Furthermore, suppose Assumption 0.3.
Assuming samples are independent between experimental groups, we begin by first estimating Using the estimator stated in (14), we estimate ν 2 i0 consistently as follows: In all future calculations, we plug inν 2 i0 for ν 2 i0 . This is similar in spirit to many statistical procedures involving nuisance parameters. The following lemma is useful in the sequel.
Let Θ ¼ ðδ; π 1 ; π 2 ; π 3 ; l 1 ; l 2 ; κ 1 ; κ 2 Þ T denote the set of unknown parameters, then for each taxon the log-likelihood can be reformulated using Lemma 0.1, as follows: Then the E-M algorithm is described as follows: • E-step: Compute conditional probabilities of the latent variable. Define ; m, which are conditional probabilities representing the probability that an observed value follows each distribution. Note that l 0 = 0.
• M-step: Maximize the likelihood function with respect to the parameters, given the conditional probabilities.
We shall denote the resulting estimator of δ byδ EM : Next we estimate Varðδ EM Þ. Since the likelihood function is not a regular likelihood and hence it is not feasible to derive the Fisher information. Consequently, we take a simpler and a pragmatic approach to derive an approximate estimator of Varðδ EM Þ using Varðδ WLS Þ, which is defined below. Extensive simulation studies suggest thatδ EM andδ WLS are highly correlated ( Supplementary Fig. 9) and it appears to be reasonable to approximate Varðδ EM Þ by Varðδ WLS Þ.
Let {C r } = m r , r = 0, 1, 2, then The above expression is of the form where (1) 1 = (1, …, 1) T , (2) a r ¼ ða r1 ; a r2 ; ; a rm r Þ T :¼ ð 1 For the simplicity of notation, we relabel a and x by α and u, respectively. Denote Cov(x) = Cov(u) by Ω, and let ω ii 0 denotes the ði; i 0 Þ element of Ω. As in Assumption 0.3, we make the following assumption Assumption 0.4. Using the above expressions, we compute the variance as follows: Recall that (a) for i ∈ C 0 , , thus we have: Since Using these facts and Assumption 0.4 in (26), we get Thus, under Assumption 0.4 regarding ω ii 0 , the contribution of the covariance terms in the above variance expression is negligible as long as m is very large compared with n, which is usually the case. Hence Furthermore, appealing to Cauchy-Schwartz inequality we get Hence, as long as m 0 is large, the contribution made by Varðδ WLS Þ and Covðμ i1 À μ i2 ;δ WLS Þ relative to Varðμ i1 Àμ i2 Þ is negligible. Neglect the covariance term in (26), letĈ r denote the estimator of C r , r = 0, 1, 2 from the E-M algorithm, define an estimator of Varðδ WLS Þ under the Assumption 0.4. Then ; as m; n ! 1: ð32Þ We performed extensive simulations to evaluate the bias and variance ofδ EM and that ofδ WLS . The scatter plot ( Supplementary Fig. 9) ofδ EM andδ WLS are almost perfectly linear with correlation coefficient nearly 1 in all cases. This suggests thatδ WLS is a very good approximation forδ EM . The variance ofδ EM as well as that ofδ WLS are roughly of the order n À1 m À1 0 , as we expected. In addition, they are approximately unbiased (Supplementary Table 6).
Hypothesis testing for two-group comparison. For taxon i, we test the following hypothesis using the following test statistic which is approximately centered at zero under the null hypothesis: From Slutsky's theorem, we have: ; 1Þ; as m; n ! 1: If the sample size is not very large and/or the number of non-null taxa are very large, then we modify the above test statistic as follows: To control the FDR due to multiple comparisons, we recommend applying the Holm-Bonferroni method 26 or Bonferroni 27,28 correction rather than the Benjamini-Hochberg (BH) procedure 29 to adjust the raw p values as research has showed that it is more appropriate to control the FDR when p values were not accurate 30 , and the BH procedure controls the FDR provided you have either independence or some special correlation structures such as perhaps positive regression dependence among taxa 29,31 . In our simulation studies, since the absolute abundances for each taxon are generated independently, we compared the ANCOM-BC results adjusted either by Bonferroni correction (Fig. 4) or BH procedure ( Supplementary Fig. 10), it is clearly that the FDR control by Bonferroni correction is more conservative while implementing BH procedure results in FDR around the nominal level (5%). Obviously, ANCOM-BC has larger power when using BH procedure.
Hypothesis testing for multigroup comparison. In some applications, for a given taxon, researchers are interested in drawing inferences regarding DA in more than two ecosystems. For example, for a given taxon, researchers may want to test whether there exists at least one experimental group that is significantly different from others, i.e., to test H 0;i : \ j≠j 0 2f1; ;gg μ ij ¼ μ ij 0 H 1;i : ∪ j≠j 0 2f1; ;gg μ ij ≠ μ ij 0 : Similar to the two-group comparison, after getting the initial estimates ofμ ij andd jk , setting the reference group r (e.g., r = 1), and obtaining the estimator of the bias termδ rj through E-M algorithm, the final estimator of mean absolute abundance of the ecosystem (in log scale) are obtained by transformingμ ij of (6) into:μ Ã ij :¼μ Thus, based on (18) and the E-M estimator of δ rj , as m; minðn j ; n j 0 Þ ! 1 if taxon i is not differentially abundant between group j and j 0 ; μ ij À μ ij 0 otherwise: ( ð37Þ Similarly, the estimator of the sampling fraction is obtained by transformingd jk of (6) intod As by (13) and the E-M estimator of δ rĵ d Ã jk ! p d jk À d rÁ as m; minðn j ; n j 0 Þ ! 1; ð39Þ which indicates that we are only able to estimate sampling fractions up to an additive constant (d rÁ ). Define the test statistic for pairwise comparison as: For computational simplicity, inspired by the William's type of test 32-35 , we reformulate the global test with the following test statistic: Under null, W i;jj 0 ! d Nð0; 1Þ, thus we can construct the null distribution of W i by simulations, i.e., for each specific taxon i, Finally, p value is calculated as IðW ðbÞ i >W i Þ; i ¼ 1; ; m ð42Þ and the Bonferroni correction is applied to control the FDR.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
DNA sequences from the global gut microbiota study 9 can be found in MG-RAST https://www.mg-rast.org/index.html server under search string "mgp401" for Illumina V4-16S rRNA.