Clustering microbiome data using mixtures of logistic normal multinomial models

Discrete data such as counts of microbiome taxa resulting from next-generation sequencing are routinely encountered in bioinformatics. Taxa count data in microbiome studies are typically high-dimensional and over-dispersed, and can only reveal relative abundance; they are therefore treated as compositional. Analyzing compositional data presents many challenges because such data are restricted to a simplex. In a logistic normal multinomial model, the relative abundance is mapped from a simplex to a latent variable that exists on the real Euclidean space using the additive log-ratio transformation. While a logistic normal multinomial approach brings flexibility for modeling the data, it comes with a heavy computational cost, as parameter estimation typically relies on Bayesian techniques. In this paper, we develop a novel mixture of logistic normal multinomial models for clustering microbiome data. Additionally, we utilize an efficient framework for parameter estimation using variational Gaussian approximations (VGA). Adopting a variational Gaussian approximation for the posterior of the latent variable reduces the computational overhead substantially. The proposed method is illustrated on simulated and real datasets.


Introduction
The human microbiome comprises complex communities of microorganisms, including but not limited to bacteria, fungi, and viruses, that inhabit the human body Morgan and Huttenhower (2012); Li (2015). It is estimated that there are approximately 10^14 microbial cells associated with the human body, around 10 times the number of human cells Ley et al. (2006); Fraher et al. (2012). The human microbiome plays a significant role in human health and disease. There is evidence indicating that microbial dysbiosis may lead to diseases such as cardiovascular disease Koeth et al. (2013), diabetes Qin et al. (2012), inflammatory bowel disease Greenblum et al. (2012), obesity Turnbaugh et al. (2009), and many others. Next-generation sequencing techniques, such as 16S ribosomal RNA (rRNA) amplicon sequencing or shotgun metagenomic sequencing, provide an effective way to quantify and compare the bacterial composition, including the types and abundance of different bacteria within biological samples Streit and Schmitz (2004); Kuczynski et al. (2012); Yatsunenko et al. (2012); Äijö et al. (2018). In 16S rRNA sequencing, the 16S rRNA gene, which is ubiquitous in bacteria yet contains distinct variable regions that can be used to discriminate between different bacteria, is first PCR-amplified and then sequenced Kuczynski et al. (2012). Shotgun sequencing, on the other hand, is an untargeted sequencing of all microbial genomes in a sample Quince et al. (2017). In either case, short reads are preprocessed through quality control and filtering steps. The processed reads are then clustered into operational taxonomic units (OTUs) at a certain similarity level Eckburg et al. (2005), where each OTU is characterized by a representative DNA sequence that can be assigned to a taxonomic lineage by comparison with a known database.
The resulting read counts at different taxonomic levels for n samples over K + 1 taxa are stored as an n × (K + 1) matrix W, with the entry W[i, k] representing the count recorded for the k-th taxon in the i-th sample Li (2015).
Statistical analysis of microbiome data is complicated. Microbiome count data can only reveal relative abundance, i.e., the abundance of each taxon is constrained by the total number of microbes in that particular sample, and this total can vary among samples depending on the sequencing depth. Different individuals can harbor different communities of microorganisms, with only a few major ones in common, and even for one person, the microbial composition can differ substantially across body sites.
The heterogeneity of microbiome samples also leads to over-dispersion. See Hamady and Knight (2009) for a detailed review of challenges in analyzing microbiome data. Standard multivariate analyses usually fail to capture these properties of microbiome data. Different models have been proposed in the literature for microbiome counts that capture one or more of the above intrinsic characteristics, such as the negative binomial model, the zero-inflated negative binomial model Zhang and Yi (2020), the zero-inflated Poisson model Joseph et al. (2013); Xu et al. (2020), the Dirichlet-multinomial model (Holmes et al., 2012; Chen and Li, 2013; Subedi et al., 2020), and the logistic normal multinomial model Xia et al. (2013). When modelling such count data, a negative binomial (NB) model allows the variance to be larger than the mean through a dispersion parameter, thus handling over-dispersion better than a simple Poisson model. The zero-inflated negative binomial (ZINB) and zero-inflated Poisson (ZIP) models have been proposed to account for an excessive number of zeros Joseph et al. (2013). Xu et al. (2015) provide a comparison among the zero-inflated models. However, the NB and ZINB models ignore the compositional nature of these microbial counts. Chen and Li (2013), Holmes et al. (2012), Wadsworth et al. (2017), and Subedi et al. (2020) utilized the Dirichlet-multinomial model for microbial counts, which takes into account the compositional nature of these data. Alternately, Xia et al. (2013) employed the logistic normal multinomial model, mapping the relative abundance from a simplex to a latent variable that exists on the real Euclidean space using the additive log-ratio transformation, a model also adopted by Cao et al. (2017). Clustering microbiome samples into groups that share similar microbial compositional patterns is of great interest Holmes et al. (2012).
Clustering algorithms are usually categorized into hierarchical clustering and distance-based clustering. Hierarchical clustering has been applied to microbiome data, yet it requires the choice of a cut-off threshold according to which samples are divided into groups Holmes et al. (2012). On the other hand, k-means clustering, a distance-based method, may not be appropriate for microbiome compositions because it is typically used for continuous data and recovers spherical clusters. Hence, model-based clustering approaches that utilize a finite mixture model have been widely used in the last decade to cluster microbiome data (Holmes et al., 2012; Subedi et al., 2020). A finite mixture model assumes that the population consists of a finite mixture of subpopulations (or clusters), each represented by a known distribution McLachlan and Peel (2000); Zhong and Ghosh (2003); Frühwirth-Schnatter (2006); McNicholas (2016). Owing to the flexibility in choosing component distributions to model different types of data, several mixture models based on discrete distributions have been developed for count data, especially gene expression data; for example, Rau et al. (2011) used mixtures of Poisson distributions for such data. For microbiome counts, Holmes et al. (2012) assigned a Dirichlet prior to a multinomial distribution that describes the taxa counts, and proposed a mixture of Dirichlet-multinomial (DM) models to cluster samples.
In this paper, we develop a model-based clustering approach using the logistic normal multinomial model proposed by Xia et al. (2013) to cluster microbiome data. In the logistic normal multinomial model, the observed counts are modeled using a multinomial distribution, and the relative abundance is regarded as a random vector on a simplex, which is further mapped to a latent variable that exists on the real Euclidean space through an additive log-ratio transformation. While this approach captures additional variability compared to a multinomial model, it does not possess closed-form expressions for the log-likelihood function or for the posterior distributions of the latent variables. Therefore, the expected complete-data log-likelihood needed in the E-step of a traditional EM algorithm is intractable. In such a scenario, one commonly used approach is a variant of the EM algorithm that relies on Bayesian techniques using Markov chain Monte Carlo (MCMC); however, this typically brings a high computational cost. Here, we develop a variant of the EM algorithm, hereafter referred to as a variational EM algorithm, for parameter estimation that utilizes variational Gaussian approximations (VGA). In a variational Gaussian approximation Barber and Bishop (1998), a complex posterior distribution is approximated by computationally convenient Gaussian densities obtained by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities Bishop (2006); Arridge et al. (2018). Adopting a variational Gaussian approximation delivers accurate approximations of the complex posterior while reducing the computational overhead substantially. Hence, this approach has become extremely popular in many different fields of machine learning Barber and Bishop (1998); Bishop (2006); Archambeau et al. (2007); Khan et al. (2012); Challis and Barber (2013); Blei et al. (2017).
The contribution of this paper is twofold: first, we develop a computationally efficient framework for parameter estimation in the logistic normal multinomial model through the use of variational Gaussian approximations, and second, we utilize this framework to develop a model-based clustering framework for microbiome data. Through simulations and applications to microbiome data, the utility of the proposed approach is illustrated. The paper is structured as follows: Section 2 describes the logistic normal multinomial model for microbiome count data and details the variational Gaussian approximations. Section 2.3 provides a mixture model framework based on the model described in Section 2, together with a variational EM algorithm for parameter estimation. In Section 3, clustering results are illustrated by applying the proposed algorithm to both simulated and real data. Finally, a discussion of the advantages and limitations, along with some future directions, is provided in Section 4.

Methodology

The logistic normal multinomial model for microbiome compositional data

Suppose we have K + 1 bacterial taxa for a sample, denoted as a random vector W = (W_1, …, W_{K+1})'. Here, the taxa could represent any level of the bacterial phylogeny such as OTU, species, genus, phylum, etc. Because taxa counts from 16S sequencing can only reveal relative abundance, suppose there is a vector Θ = (Θ_1, …, Θ_{K+1})' with Σ_{k=1}^{K+1} Θ_k = 1 that represents the underlying composition of the bacterial taxa. Then the microbial taxa count W can be modeled as a multinomial random variable with conditional density (up to the multinomial coefficient)

p(w | Θ) ∝ ∏_{k=1}^{K+1} Θ_k^{w_k}.

Several models have been proposed in the literature that capture the relative abundance nature of microbiome data and analyze the compositional data Holmes et al. (2012); Xia et al. (2013). Here we use the model by Xia et al. (2013), which utilizes the additive log-ratio transformation Y = φ(Θ) proposed by Aitchison (1982):

Y_k = log(Θ_k / Θ_{K+1}), k = 1, …, K.

This transformation φ maps the vector Θ from the K-dimensional simplex to the K-dimensional real space R^K, and Y is assumed to follow a multivariate normal distribution with mean µ and covariance Σ. As this additive log-ratio transformation is a one-to-one map, the inverse operator of φ exists and is given by

Θ_k = exp(Y_k) / (1 + Σ_{j=1}^{K} exp(Y_j)) for k = 1, …, K, and Θ_{K+1} = 1 / (1 + Σ_{j=1}^{K} exp(Y_j)).

Hence, the joint density of W and Y, up to a constant, is

p(w, y) ∝ [ ∏_{k=1}^{K+1} Θ_k(y)^{w_k} ] × |Σ|^{-1/2} exp( -½ (y - µ)' Σ^{-1} (y - µ) ).
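As a concrete illustration, the additive log-ratio transformation and its inverse can be sketched as follows (a minimal sketch; the function names `alr` and `alr_inv` are ours):

```python
import numpy as np

def alr(theta):
    """Additive log-ratio transform: maps a composition with K+1 parts
    (positive, summing to one) to R^K, using the last part as reference."""
    theta = np.asarray(theta, dtype=float)
    return np.log(theta[:-1] / theta[-1])

def alr_inv(y):
    """Inverse additive log-ratio transform: maps y in R^K back to a
    composition with K+1 parts summing to one."""
    e = np.exp(np.append(y, 0.0))  # the reference part corresponds to 0
    return e / e.sum()

theta = np.array([0.2, 0.3, 0.5])
y = alr(theta)  # [log(0.2/0.5), log(0.3/0.5)]
```

Because the map is one-to-one, `alr_inv(alr(theta))` recovers the original composition.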

A variational Gaussian lower bound
For microbiome data, only the count vector W is observed, while the latent variable Y is unobserved. The marginal density of W can be written as

p(w) = ∫ p(w | y) p(y | µ, Σ) dy.

Note that this marginal distribution of W involves multiple integrals and cannot be further simplified. In the presence of missing data, an expectation-maximization (EM) algorithm Dempster et al. (1977), or some variant of it, is typically utilized for parameter estimation. An EM algorithm comprises two steps: an E-step, in which the expected value of the complete-data (i.e., observed and missing data) log-likelihood is computed given the observed data and the current parameter estimates, and an M-step, in which this expected complete-data log-likelihood is maximized. These steps are repeated until convergence to obtain the maximum likelihood estimates of the parameters. To compute the expected value of the complete-data log-likelihood, E(Y | w) and E(YY' | w) need to be computed, for which we need the posterior

p(y | w) = p(w | y) p(y | µ, Σ) / ∫ p(w | y) p(y | µ, Σ) dy.

However, the denominator involves multiple integrals and cannot be further simplified. One could employ a Markov chain Monte Carlo (MCMC) approach to explore the posterior state space; however, such methods are typically computationally expensive, especially for high-dimensional problems. Here, we propose the use of a variational Gaussian approximation (VGA) Barber and Bishop (1998) for parameter estimation. A VGA aims to find an optimal and tractable approximation with a Gaussian parametric form to the true, complex posterior by minimizing the Kullback-Leibler divergence between the true and the approximating densities. It has been used successfully in many practical applications to overcome this challenge Bishop (2006); Archambeau et al. (2007); Wainwright et al. (2008); Khan et al. (2012); Challis and Barber (2013); Arridge et al. (2018).
In order to utilize the VGA, we define a new latent variable η by transforming Y as η = BY, where B is the (K + 1) × K matrix formed by appending a row of K zeros to the K × K identity matrix. Given that Y ∼ N(µ, Σ), the new latent variable η ∼ N(µ̃, Σ̃), where µ̃ = Bµ and Σ̃ = BΣB'. The underlying composition variable Θ can then be written as a function of η:

Θ_k = exp(η_k) / Σ_{j=1}^{K+1} exp(η_j), k = 1, …, K + 1.

Suppose we have an approximating density q(η); then the marginal log density of W can be written as

log p(w) = F(q(η), w) + D_KL(q ‖ p),

where the first part, F(q(η), w) = ∫ q(η) log [ p(w, η) / q(η) ] dη, is called the evidence lower bound (ELBO) Barber and Bishop (1998), and the second part, D_KL(q ‖ p) = ∫ q(η) log [ q(η) / p(η | w) ] dη, is the Kullback-Leibler divergence from p(η | w) to q(η). Hence, minimizing the Kullback-Leibler divergence is equivalent to maximizing the evidence lower bound (ELBO).
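A small numerical sketch of this construction (dimensions and parameter values chosen arbitrarily for illustration):

```python
import numpy as np

K = 3
# B is the K x K identity matrix with a row of K zeros appended: (K+1) x K.
B = np.vstack([np.eye(K), np.zeros((1, K))])

mu = np.array([0.5, -1.0, 0.2])
Sigma = np.diag([1.0, 0.5, 0.8])

mu_tilde = B @ mu              # mean of eta = B y; last entry is 0
Sigma_tilde = B @ Sigma @ B.T  # (K+1) x (K+1), singular last row/column

y = np.array([1.0, 0.0, -0.5])
eta = B @ y                    # eta fixes the reference coordinate at 0
theta = np.exp(eta) / np.exp(eta).sum()  # composition as a softmax of eta
```

Note that Σ̃ is singular by construction, which is why a generalized inverse appears in the lower bound below.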
In the VGA, we assume q(η) is a Gaussian distribution, q(η) = N(η; m, V). Given that q(η) is fully characterized by its mean vector and covariance matrix, the lower bound is a function of the variational parameters m and V, and we aim to find the optimal pair (m, V) that maximizes F(q(η), w). F(q(η), w) can be separated into three parts; up to a constant, the last integral in this decomposition, denoted γ, involves the expectation of a log-sum-exponential term. Similar to Blei and Lafferty (2006), we use an upper bound for the expectation of the log-sum-exponential term based on a Taylor expansion, where ξ ∈ R is introduced as a new variational parameter.
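For reference, this bound follows from linearizing the logarithm at ξ > 0 (log x ≤ ξ⁻¹x − 1 + log ξ for all x > 0) and applying the Gaussian moment-generating function; a sketch of the resulting inequality, consistent with Blei and Lafferty (2006), is:

```latex
% With q(\eta) = N(m, V) diagonal and E_q[e^{\eta_k}] = \exp(m_k + v_k^2/2),
\mathbb{E}_q\!\left[\log \sum_{k=1}^{K+1} e^{\eta_k}\right]
  \;\le\; \xi^{-1} \sum_{k=1}^{K+1} \mathbb{E}_q\!\left[e^{\eta_k}\right] - 1 + \log \xi
  \;=\; \xi^{-1} \sum_{k=1}^{K+1} \exp\!\left(m_k + \tfrac{v_k^2}{2}\right) - 1 + \log \xi .
```

The bound is tight when ξ equals the bracketed sum, which is what yields a closed-form update for ξ.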
Here, we further assume that V is a diagonal matrix whose first K diagonal elements are v_k², k = 1, …, K, and whose (K + 1)-th diagonal element is set to 0. We denote the k-th element of m by m_k, with m_{K+1} = 0. The required expectation then has the closed form E_q[exp(η_k)] = exp(m_k + v_k²/2). Based on this upper bound, we obtain a concave lower bound to γ and hence to the ELBO. The resulting concave variational Gaussian lower bound to the model evidence log p(w) is denoted F̃(m, V, µ̃, Σ̃, ξ), where Σ̃⁻ is the generalized inverse of Σ̃. Details of the derivation of this lower bound can be found in Appendix A. Given fixed w, µ̃, and Σ̃, this lower bound depends only on the variational parameter set (m, V, ξ).
Maximization of the lower bound F̃(m, V, µ̃, Σ̃, ξ) with respect to ξ has a closed-form solution, given by ξ̂ = Σ_{k=1}^{K+1} exp(m_k + v_k²/2). However, maximization with respect to m and v_k, k = 1, …, K, does not possess an analytical solution. We use Newton's method to search for the roots of the corresponding first derivatives, with v² = (v_1², …, v_K², 0) denoting the diagonal elements of V as a vector. Details can be found in Appendix A.
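The updates for m and v_k² follow the generic Newton scheme x ← x − f′(x)/f″(x) applied to the derivatives given in Appendix A (not reproduced here); the mechanics of a one-step update can be sketched on a toy concave objective:

```python
def newton_step(grad, hess, x):
    """One step of Newton's method on a scalar objective:
    x_new = x - f'(x) / f''(x)."""
    return x - grad(x) / hess(x)

# Toy concave objective f(x) = -(x - 3)^2: a single Newton step from any
# starting point lands exactly at the maximizer x = 3.
grad = lambda x: -2.0 * (x - 3.0)
hess = lambda x: -2.0
x1 = newton_step(grad, hess, 10.0)  # -> 3.0
```

For a non-quadratic objective such as F̃, a single step only approximates the root, which is why the algorithm below interleaves one Newton step per iteration rather than iterating to convergence.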

Mixture of logistic normal multinomial models
Assume there are G subgroups in the population, with π_g denoting the mixing weight of the g-th component such that Σ_{g=1}^{G} π_g = 1. Then, a G-component finite mixture of logistic normal multinomial models can be written as

f(w | ϑ) = Σ_{g=1}^{G} π_g f_g(w | ϑ_g),

where f_g(w | ϑ_g) represents the density function of the observation W = w, given that W comes from the g-th component with parameters ϑ_g.
Given n observed count vectors w_1, …, w_n with transformed underlying compositions Y_i, i = 1, …, n, the likelihood of the G-component finite mixture is L(ϑ) = ∏_{i=1}^{n} Σ_{g=1}^{G} π_g f_g(w_i | ϑ_g). In clustering, the unobserved component membership is denoted by an indicator variable z_ig, i = 1, …, n, g = 1, …, G, where z_ig = 1 if observation i belongs to component g and z_ig = 0 otherwise. In order to utilize the variational approach for parameter estimation, we again define a new latent variable η_i = BY_i for each observation. The complete-data (i.e., observed counts W and unobserved class-label indicators) log-likelihood, using the marginal density of W, is

ℓ_c(ϑ) = Σ_{i=1}^{n} Σ_{g=1}^{G} z_ig [ log π_g + log ∫ p(w_i | η_i) p(η_i | µ̃_g, Σ̃_g) dη_i ].

To perform variational inference in the mixture model, we substitute log ∫ p(w_i | η_i) p(η_i | µ̃_g, Σ̃_g) dη_i by the variational Gaussian lower bound F̃(m_ig, V_ig, µ̃_g, Σ̃_g, ξ_i) derived in Section 2.2. Hence, the variational Gaussian lower bound on the complete-data log-likelihood can be written as

L̃ = Σ_{i=1}^{n} Σ_{g=1}^{G} z_ig [ log π_g + F̃(m_ig, V_ig, µ̃_g, Σ̃_g, ξ_i) ].

We therefore need to find optimal values of the variational parameters (m_ig, V_ig, ξ_i) associated with each observation w_i, i = 1, …, n, as well as the group-specific Gaussian parameters (µ̃_g, Σ̃_g), g = 1, …, G, such that the complete-data variational Gaussian lower bound L̃ is maximized. The use of the VGA provides a great reduction in computational time.

The variational EM algorithm
Parameter estimation can be done in an iterative EM-type approach, from here on referred to as variational EM such that the following steps are iterated until convergence. For the parameters that do not have a closed form solution to the optimization, we perform one step of Newton's method to approximate the root to their first derivatives.
Step 1: Conditional on the variational parameters (m_ig, V_ig, ξ_i) and the group-specific Gaussian parameters (µ̃_g, Σ̃_g), update the expected value of the component indicator, ẑ_ig. This involves the marginal distribution of W, and hence we use its variational approximation F̃(m_ig, V_ig, µ̃_g, Σ̃_g, ξ_i) in place of the log marginal density.

Step 2: Update ξ̂_i, m̂_ig, V̂_ig:
• update ξ̂_i according to Equation 6;
• update m̂_ig by performing one step of Newton's method to approximate the root of the derivative in Equation 7, then set m̂_ig(K+1) = 0;
• for k = 1, …, K, update v̂²_igk by performing one step of Newton's method to approximate the root of the corresponding derivative.

Step 3: Update π̂_g, µ̃_g, and Σ̃_g as

π̂_g = Σ_{i=1}^{n} ẑ_ig / n,  µ̃_g = Σ_{i=1}^{n} ẑ_ig m̂_ig / Σ_{i=1}^{n} ẑ_ig,  Σ̃_g = Σ_{i=1}^{n} ẑ_ig [ V̂_ig + (m̂_ig − µ̃_g)(m̂_ig − µ̃_g)' ] / Σ_{i=1}^{n} ẑ_ig.

Note that the original parameters µ_g and Σ_g can be obtained by inverting the transformation in Equation 2. An Aitken acceleration criterion Aitken (1926) is employed to stop the iterations. More specifically, at the t-th iteration, for t > 2, we calculate a^(t) = [l^(t+1) − l^(t)] / [l^(t) − l^(t−1)] and the asymptotic estimate l_∞^(t+1) = l^(t) + [l^(t+1) − l^(t)] / [1 − a^(t)], where l^(t) is the variational Gaussian lower bound that approximates the log-likelihood at the t-th iteration. The algorithm is stopped when |l_∞^(t+1) − l^(t+1)| < ε for a given ε Böhning et al. (1994). In our analysis, ε is set to 1 × 10⁻³.
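The Step 3 updates are weighted-moment estimates and can be sketched in a few lines (a sketch only; the array shapes and the name `m_step` are ours):

```python
import numpy as np

def m_step(z, m, V):
    """M-step updates given responsibilities z (n x G), variational means
    m (n x G x K1), and variational covariances V (n x G x K1 x K1).
    Returns (pi_hat, mu_hat, Sigma_hat): mixing weights, weighted means,
    and weighted second moments V + outer-product deviations."""
    n, G = z.shape
    pi_hat = z.sum(axis=0) / n
    mu_hat = np.einsum('ig,igk->gk', z, m) / z.sum(axis=0)[:, None]
    K1 = m.shape[-1]
    Sigma_hat = np.zeros((G, K1, K1))
    for g in range(G):
        d = m[:, g, :] - mu_hat[g]                     # deviations from group mean
        S = np.einsum('i,ijk->jk', z[:, g], V[:, g]) \
            + np.einsum('i,ij,ik->jk', z[:, g], d, d)  # weighted V + dd'
        Sigma_hat[g] = S / z[:, g].sum()
    return pi_hat, mu_hat, Sigma_hat
```

With hard responsibilities (z filled with 0/1), these reduce to within-group sample proportions, means, and covariances of the variational means (plus the average variational covariance).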

Hybrid Approach
While the VGA-based approach only approximates the posterior distribution and does not guarantee recovery of the exact posterior (Ghahramani and Beal, 1999), it is computationally efficient. On the other hand, while a fully Bayesian MCMC-based approach can generate exact results, fitting such models can take substantial computational time. For example, fitting one iteration of a fully Bayesian MCMC model for a five-dimensional dataset (from Simulation study 1) with n = 1000 takes on average 45 minutes. In a clustering context, the number of iterations required is typically in the hundreds. Thus, we provide a computationally efficient hybrid approach in which:
-Step 1: Fit the model using the VGA-based approach.
-Step 2: Estimate the component indicator variable Z ig conditional on the parameter estimates from the VGA based approach.
-Step 3: Using the parameter estimates from Step 1 as initial values and the classification from Step 2, compute the MCMC-based expectation of the latent variable η_ig as Ê(η_ig | w_i) = (1/R) Σ_{r=1}^{R} η_ig^(r), where η_ig^(r) is a random sample from the posterior distribution of η_ig simulated via the RStan package for iterations r = 1, …, R (after discarding the burn-in).
-Step 4: Obtain the final estimates of the model parameters using these MCMC-based expectations.
The hybrid approach comes with a substantial reduction in computational overhead compared to a traditional MCMC-based approach, yet it generates samples from the exact posterior distribution. Fitting such a model using the hybrid approach on a five-dimensional dataset from Simulation 1 with N = 1000 takes on average about 246 minutes.
Recall that one iteration of a fully Bayesian MCMC-based approach on the same dataset takes on average 45 minutes, and the number of iterations required is typically in the hundreds. When the primary goal is to detect the underlying clusters (as is the case for our real data analysis), the VGA-based approach is sufficient. However, when the primary goal is posterior inference, we recommend the hybrid approach, as it yields an exact posterior similar to the MCMC-EM approach while remaining computationally efficient.
For simulation studies 1 and 2, in which we assess parameter recovery, we report parameter estimates using both the VGA and the hybrid approach.

Initialization
For initialization of ẑ_ig, we used k-means clustering MacQueen (1967); Hartigan and Wong (1979) on an estimate of the underlying latent variable η_i, obtained by first calculating the underlying composition w_i / Σ_{k=1}^{K+1} w_ik for each observation, mapping this composition to the latent variable y_i using the additive log-ratio transformation in Equation 1, and transforming that variable to η_i through Equation 2. For initializing the variational parameters for each observation w_i, we first obtain η_i in the same way as in the ẑ_ig initialization step.
We use this calculated latent variable η_i as the initialization of m_ig. The V_ig for each i are initialized as (K + 1) × (K + 1) diagonal matrices with V_kk = 1 for k = 1, …, K and V_kk = 0 for k = K + 1. The ξ_i are initialized at 1. According to the initialization of the group labels ẑ_ig, µ̃_g and Σ̃_g are initialized as the group-specific mean and covariance of the η_i, respectively.
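The initialization pipeline described above can be sketched as follows (a sketch; the name `init_variational` is ours, and V is returned as a single shared matrix for brevity):

```python
import numpy as np

def init_variational(W):
    """Initialization sketch: from counts W (n x (K+1)), compute the
    composition, apply the additive log-ratio transform to get y_i,
    append a zero coordinate to get eta_i, and use that for m; V starts
    as a diagonal matrix of ones (last entry 0) and xi starts at 1."""
    W = np.asarray(W, dtype=float)
    W = np.where(W == 0, 1.0, W)              # replace zeros before taking logs
    theta = W / W.sum(axis=1, keepdims=True)  # observed composition
    y = np.log(theta[:, :-1] / theta[:, -1:]) # additive log-ratio transform
    eta = np.hstack([y, np.zeros((len(y), 1))])
    m0 = eta                                  # initial variational means
    K1 = W.shape[1]
    V0 = np.diag(np.r_[np.ones(K1 - 1), 0.0]) # initial variational covariance
    xi0 = np.ones(len(W))                     # initial xi, one per observation
    return m0, V0, xi0
```

The zero-replacement step mirrors the substitution used in the simulation studies, since the log-ratio is undefined for zero counts.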

Model Selection and Performance Assessment
In the clustering context, the number of components G is unknown. Hence, one typically fits models for a range of possible G, and the number of clusters is then chosen a posteriori using a model selection criterion; here we use the Bayesian information criterion (BIC) Schwarz (1978). When the true class labels are known (e.g., in simulation studies), we assess the performance of our proposed model using the adjusted Rand index (ARI) Hubert and Arabie (1985). The ARI measures the pairwise agreement between the predicted and true classifications: an ARI of 1 indicates perfect classification, and a value of 0 indicates that the classification obtained is no better than random assignment.
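The ARI is computed from the contingency table of the two partitions; a self-contained sketch using only the standard library (with a guard for the degenerate single-cluster case) is:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two partitions: 1 for identical
    partitions (up to label permutation), around 0 for random agreement."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))   # contingency table cells
    a = Counter(true_labels)                         # row sums
    b = Counter(pred_labels)                         # column sums
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)            # chance agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Note that the index is invariant to relabeling of the clusters, so a clustering that swaps the group labels still scores 1.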

Simulation Studies and Real Data Analysis
To illustrate the performance of our proposed clustering framework, we conducted two sets of simulation studies. For both studies, the i-th observed count vector W_i is generated as follows: 1. First, we generate the total count Σ_{k=1}^{K+1} W_ik as a random number from a uniform distribution U[5000, 10000].
2. Given pre-specified group-specific parameters µ_g and Σ_g, we transform them using Equation 3 to get µ̃_g and Σ̃_g, and generate η_i from N(µ̃_g, Σ̃_g).
3. Based on η_i, we calculate Θ_i using the inverse additive log-ratio transformation φ⁻¹ in Equation 4.
4. Using Θ_i as the underlying composition, together with the total count generated in the first step, we generate the discrete random vector W_i from a multinomial distribution. 5. To initialize the variational parameters, we need the additive log-ratio transformation, which involves taking logarithms of the observed counts. If there are any zeros in the generated count data, we substitute them with 1 for initialization.
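Steps 1 through 4 can be sketched as follows (illustrative parameter values are ours; the singular last row and column of Σ̃ reflect the fixed reference coordinate):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_counts(n, mu_tilde, Sigma_tilde):
    """Generate n count vectors: draw a total count from U[5000, 10000],
    draw eta from N(mu_tilde, Sigma_tilde) with the last coordinate fixed
    at 0, map eta to a composition via the inverse additive log-ratio
    (softmax) transform, and draw multinomial counts."""
    K1 = len(mu_tilde)
    W = np.zeros((n, K1), dtype=int)
    for i in range(n):
        total = rng.integers(5000, 10001)        # step 1: total count
        eta = rng.multivariate_normal(mu_tilde, Sigma_tilde)
        eta[-1] = 0.0                            # reference coordinate
        theta = np.exp(eta) / np.exp(eta).sum()  # steps 2-3: composition
        W[i] = rng.multinomial(total, theta)     # step 4: observed counts
    return W

mu_tilde = np.array([1.0, -0.5, 0.0])
Sigma_tilde = np.diag([0.5, 0.5, 0.0])
W = simulate_counts(5, mu_tilde, Sigma_tilde)
```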
We also compared the performance of our proposed model to a Dirichlet mixture model (Holmes et al., 2012) which is widely used to cluster microbiome data. Implementation of the Dirichlet mixture model is available in the R package DirichletMultinomial (Morgan, 2020).

Simulation Study 1
In this simulation study, we generated 100 datasets in which the underlying latent variable Y was drawn from a finite mixture of multivariate Gaussian distributions. The parameters used to generate the datasets are summarized in the corresponding table.

Simulation Study 2
In this simulation study, we generated 100 datasets with the underlying latent variable Y drawn from a three-component mixture of five-dimensional multivariate Gaussian distributions (see Figure 3 for the pairwise scatterplot of the underlying latent variable Y_i).
There are n_1 = 300 observations in Group 1, n_2 = 400 observations in Group 2, and n_3 = 200 observations in Group 3. The true parameters are summarized in Table 3. The proposed algorithm was applied to all 100 datasets, where for each dataset we fitted models with G = 1, …, 4. In 94 of the 100 datasets, a G = 3 model was selected using the BIC; a G = 2 model was selected for 3 of the 100 datasets, and a G = 4 model was selected for the remaining 3 datasets. The overall mean ARI across all 100 datasets was 0.93 (sd of 0.06), and the mean ARI for the 94 datasets in which a three-component model was selected was 0.94 (sd of 0.01). The averages and standard deviations of the estimated parameters across all 100 datasets using the VGA approach are summarized in Table 3, and those using the hybrid approach are summarized in Table 4. Note that the parameter estimates from both approaches are very close to the true values of the parameters. The average computation time for Simulation Study 1 using the proposed VGA approach was 214.91 (sd of 35.91) minutes on a single-core processor. One iteration of the fully Bayesian approach took an additional 40.84 (sd of 16.64) minutes on average; the mean computation time for the hybrid approach was correspondingly higher.

Additional Simulation Studies
To test the performance of the proposed algorithm on higher-dimensional datasets, as well as on datasets generated from mixtures of Dirichlet-multinomial models, we performed a series of 10 additional simulation studies, each containing 100 datasets, as described below:
• Generate 100 datasets from a two-component mixture of logistic normal multinomial models for each combination of:
- K, the dimension of the latent variable, being 5, 10, and 20;
- n, the sample size, being 100, 200, and 500.
The true parameters are the same for different n with the same K.
• Generate 100 datasets from a two-component mixture of Dirichlet-multinomial models with dimension 6 and a sample size of 200.
We ran the proposed algorithm for G = 1, …, 5 on all datasets and used the BIC for model selection. We also applied the Dirichlet-multinomial mixture (DMM) model to these datasets with the BIC for model selection. Table 5 shows the number of times the correct model (G = 2) was selected out of 100 datasets for each simulation study, as well as the average ARI with its standard deviation, for the proposed algorithm and the Dirichlet-multinomial mixture models.
When data were generated from a mixture of logistic normal multinomial models, the proposed algorithm identified the correct number of components more than 80 times out of 100 with an average ARI ≥ 0.96 in all simulation scenarios except the case with K = 20 and n = 100. Note that in the latter case, the number of parameters to be estimated is far larger than the sample size. It is also observed that, in general, as the sample size increases, the performance of the proposed algorithm, in terms of both the number of times it selects the correct model and the average ARI, increases as well. However, the Dirichlet-multinomial mixture model did not perform as well on data simulated from the logistic normal multinomial mixture models. Even in the case of K = 20 and n = 200, where it correctly selected the two-component model 99 out of 100 times, the average ARI was only 0.6286 with a standard deviation of 0.1284.
When the data were generated from the Dirichlet-multinomial mixture models, our proposed model was able to recover the underlying cluster structure in 85 of the 100 datasets, with an average ARI of 0.8957 and a standard deviation of 0.0916, whereas the Dirichlet-multinomial mixture model was able to recover the underlying cluster structure in all 100 datasets, with an average ARI of 0.9589 (sd = 0.0456). When performing each simulation study, the computational job was distributed on a computer cluster, where the proposed algorithm was run on a one-core slot for each of the 100 datasets. Table 5 summarizes the average elapsed time (in minutes) for running the proposed algorithm, with its standard deviation.
In most cases, the proposed algorithm takes less than 60 minutes. As the number of observations and the dimensionality of the data increase, the time to convergence increases as well. Table 6 summarizes how many times each of G = 1, …, 5 was selected by the proposed algorithm. In nine of the ten studies, our approach was able to identify the correct number of components in at least 85 of 100 datasets. In the scenario where K = 20 and n = 100 (small sample size and high dimension), both the proposed algorithm and the Dirichlet-multinomial mixture model approach favoured a one-component model. We also summarize the average L1 norm between the true parameters and the estimated values, along with the standard errors, for the simulations with data generated from mixtures of logistic normal multinomial models in Table 7. It shows that, when the dimensionality is low, the proposed algorithm can not only identify the correct underlying group structure but also recover the true parameters well. As the dimensionality increases, the proposed algorithm can still capture the true number of components in the data with high classification accuracy, and the estimated central location parameter (µ) remains close to the true value. However, the estimation of the spread parameter (Σ) becomes less precise as the dimensionality grows, although the distance between the true and the estimated parameters decreases as the sample size becomes larger.

Real Data Analysis
We applied our proposed algorithms to four microbiome datasets from three studies:
• The Ferretti 2018 Data: In this study, microbiome samples from mothers and their newborns, including breast milk, were obtained. As DNA extraction from breast milk was not feasible in most cases, the milk samples were not analyzed further. All infants were exclusively breastfed at 3 days, 96% at 1 month, and 56% at 4 months, and for each newborn, oral cavity and gut samples were taken from birth up to 4 months. The samples were sequenced using high-resolution shotgun metagenomics (Quince et al., 2017) with improved strain-level computational profiling of known and poorly characterized microbiome members (Segata, 2018).
• The Shi 2015 Data: We also applied our algorithm to the Shi 2015 dataset (Shi et al., 2015), available through the R package curatedMetagenomicData (Pasolli et al., 2017). Periodontitis is a common oral disease that affects about 50% of American adults and is associated with alterations in the subgingival microbiome of individual tooth sites. The study aimed to identify and predict disease progression using the composition of the subgingival microbiome. Samples were collected from 12 individuals with chronic periodontitis at multiple tooth sites per individual, before and after nonsurgical therapy consisting of scaling and root planing. Only the samples from tooth sites that were clinically resolved after therapy were selected for the study, resulting in an average of two sites per subject. Although samples were obtained from multiple sites per individual, Shi and colleagues (Shi et al., 2015) state that individual tooth sites are likely to have independent clinical states and unique microbial communities in their subgingival pockets; therefore, we treated them as independent samples in our analysis. This resulted in a total of 48 samples (24 periodontitis samples and 24 recovered samples).
• The atlas1006 Data: We also applied our algorithm to the atlas1006 dataset available through the R package microbiome (Lahti et al., 2014). The dataset comprises microbiome compositions of 1045 western adults with no reported health complications. Covariate information such as age, BMI category, sex, and nationality is also available. The BMI categories were underweight (n = 21), lean (n = 484), overweight (n = 197), obese (n = 222), severely obese (n = 99), and morbidly obese (n = 22). For our analysis, we combined underweight and lean into one category, "lean", and obese, severely obese, and morbidly obese into one category, "obese", resulting in three BMI categories.
As our approach is currently not designed for high-dimensional data, we first utilized the R package ALDEx2 (Fernandes et al., 2013; Gloor, 2015) for differential abundance analysis on the observed genus counts to identify the genera that differ among the groups in each dataset. This step is analogous to conducting differential expression analysis in RNA-seq studies before performing cluster analysis, in order to identify group-differentiating variables. As the sample sizes of the Ferretti 2018 and Shi 2015 datasets were less than 50, we focused on a small set (the top four) of differentially abundant genera and aggregated the remaining genera into a category "Others" to preserve relative abundance. For the atlas1006 dataset, which has a sample size of n = 1045, we used the top 20 genera that were differentially abundant between the "obese" and "lean" categories. The aggregated "Others" category was then used as the reference level for computing the underlying composition and conducting the additive log-ratio transformation. We ran our algorithm for G = 1 to 4 on all datasets and used the BIC for model selection. We also ran the Dirichlet-multinomial mixture model with the same set of genera for G = 1 to 4 on all datasets and utilized the BIC for model selection. A summary of the clustering performances is provided in Table 8. In two out of the four datasets (i.e., the Ferretti gut microbiome dataset and the Shi dataset), our approach outperforms the Dirichlet-multinomial mixture models, while both approaches provide a perfect classification on the Ferretti oral microbiome dataset.
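The aggregation of non-selected genera into an "Others" reference category and the subsequent additive log-ratio transformation can be sketched as follows. The function name, the pseudo-count, and the column conventions are illustrative assumptions, not the authors' implementation (the selection step itself was done with the R package ALDEx2):

```python
import numpy as np

def aggregate_and_alr(counts, keep_idx, pseudo=0.5):
    # counts: (n_samples, n_genera) matrix of observed genus counts.
    # keep_idx: indices of the selected genera (e.g. differentially abundant
    # or most abundant); all remaining genera are summed into one "Others"
    # column to preserve the compositional (relative abundance) structure.
    counts = np.asarray(counts, dtype=float)
    kept = counts[:, keep_idx]
    others = np.delete(counts, keep_idx, axis=1).sum(axis=1, keepdims=True)
    comp = np.hstack([kept, others]) + pseudo        # pseudo-count avoids log(0)
    comp = comp / comp.sum(axis=1, keepdims=True)    # relative abundances
    # Additive log-ratio transform with "Others" (last column) as reference.
    return np.log(comp[:, :-1] / comp[:, -1:])
```

The transformed matrix Y lives in unrestricted Euclidean space, which is what allows a latent Gaussian mixture to be fitted downstream.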
It is important to note that for the atlas1006 dataset, both approaches fail to recover the underlying group structure.
To visualize the true and recovered cluster structure, we conducted principal component analysis using the transformed variable Y.
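As a sketch, the principal component scores of the transformed variable Y can be obtained directly from the singular value decomposition; this is a generic illustration, not the authors' code:

```python
import numpy as np

def pca_scores(Y, n_components=2):
    # Principal component scores of the log-ratio transformed data Y.
    Yc = Y - Y.mean(axis=0)                    # center each variable
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:n_components].T            # project onto the leading PCs
```

Plotting the first two score columns, colored by true and by recovered labels, gives the kind of comparison shown for the real datasets.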
Two observations were misclassified in the Ferretti gut dataset using our proposed algorithm; as can be seen in Figure 6, the misclassified infant had a microbiome composition similar to the adults and the misclassified adult had a microbiome composition similar to the infants. Similarly, as seen in Figure 6, most of the misclassified observations in the Shi dataset also had microbiome compositions similar to the group to which they were assigned. Shi and colleagues (Shi et al., 2015) also utilized the microbiome profiles to classify the clinical state. They performed supervised classification using the weighted gene-voting algorithm and leave-one-out cross-validation, yielding 33 out of 48 samples correctly classified into their respective clinical states (a correct classification rate of 68.75%), whereas 6 samples were misclassified into the incorrect clinical state and 9 samples were not assigned to any clinical state due to low prediction strength. Note that in supervised classification, the group labels are used to build a predictive model, which is then used to make predictions on new or "future" observations. Here, we achieved a correct classification rate of 85.42% (i.e., 41/48 correctly classified). On the atlas1006 dataset, recall that both the proposed algorithm and the Dirichlet-multinomial approach did not perform well in recovering the underlying groups based on BMI. Visualization of the dataset using the principal components shows that there is, in fact, no clear separation in the microbiome compositions between the various BMI categories. Although the recovered cluster structure does not correspond to the known BMI categories, Figure 6 shows that the proposed approach is able to recover homogeneous clusters.
We do not have further information to investigate what insights the recovered cluster structure may yield. Based on a suggestion from a reviewer, an alternative approach using the most abundant taxa instead of the differentially abundant taxa was also investigated. To keep the dimensionality the same, for the Ferretti and Shi datasets we used the top 4 most abundant taxa, and for the atlas1006 dataset we used the top 20 most abundant taxa. We ran our proposed approach for G = 1, . . . , 4 and used the BIC for model selection.

Conclusion
A model-based clustering framework for microbiome compositional data is developed using a mixture of logistic normal multinomial models. The novelty of this work is multi-fold.
Previous work on logistic normal multinomial models (Xia et al., 2013) relied on computationally intensive Bayesian techniques for parameter estimation, whereas here parameter estimation is carried out efficiently using variational Gaussian approximations. Through simulation studies, we have shown that the proposed algorithm delivers accurate parameter recovery and good clustering performance. The proposed method is also illustrated on four real datasets in Section 3.4, where we demonstrate that the proposed models can recover the underlying cluster (group) structure in real data. While in the datasets with small sample sizes we focused on low-dimensional data by aggregating to the most differentially abundant genera, for larger datasets more taxa can be used. Because an underlying Gaussian distribution is adopted, the number of parameters in the covariance matrix alone grows quadratically with K. Thus, in high-dimensional datasets with small sample sizes, estimating Σ −1 becomes more challenging, as it can lead to degenerate solutions and a host of other issues related to model convergence and fitting when using a traditional maximum likelihood based expectation-maximization approach.
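The quadratic growth of the covariance parameters can be made concrete with a small count of the free parameters in a G-component Gaussian mixture with full, unconstrained covariance matrices (a generic count, not specific to this paper's implementation):

```python
def gmm_param_count(G, K):
    # Free parameters of a G-component Gaussian mixture on a K-dimensional
    # latent variable: mixing weights, mean vectors, and full covariances.
    weights = G - 1                       # mixing proportions sum to one
    means = G * K                         # one mean vector per component
    covariances = G * K * (K + 1) // 2    # each Sigma_g is symmetric K x K
    return weights + means + covariances
```

For K = 20, a single component's covariance already contributes K(K + 1)/2 = 210 parameters, which illustrates why small-sample, high-dimensional settings are problematic for unconstrained estimation.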
This is a well-known issue with Gaussian mixture models and is typically dealt with through either variable/feature selection or dimension reduction. Feature selection typically eliminates redundant or irrelevant variables, reduces computational cost, provides a better understanding of the data, and improves predictions (Haq et al., 2019). ALDEx2, utilized here, is a widely used variable/feature selection technique specifically designed for microbiome data that identifies taxa that are differentially abundant across conditions. Through a comparative study of ALDEx2 with other approaches commonly used for differential abundance analysis, Quinn et al. (2018) showed that ALDEx2 has high precision (i.e., few false positives) across different scenarios. However, information on the group structure or conditions may not be available a priori. In such cases, one may conduct feature selection by selecting the top few most abundant taxa and collapsing the low-abundance taxa into one category, "Others", to preserve the compositional nature of the data. Alternatively, mixtures of logistic normal multinomial models can be extended to high-dimensional data by introducing subspace clustering techniques through the latent variable McNicholas and Murphy (2008); McNicholas and Murphy (2010); Bouveyron and Brunet-Saumard (2014). This will be the topic of future work. Additionally, it is well established that different environmental or biological covariates can affect microbiome compositions. Some future work will also focus on developing a mixture of logistic normal multinomial regression models to investigate the relationship of biological/environmental covariates with the microbiome compositions within each cluster.
Data Availability Statement: The datasets used in this manuscript are publicly available from the R packages curatedMetagenomicData and microbiome.
The third integral is intractable because of the log-sum-exponential term. Following previous literature (Blei and Lafferty, 2006), we upper bound this term using the first-order Taylor expansion of the logarithm, $\log x \le \xi^{-1} x - 1 + \log \xi$ for any $\xi > 0$, which, together with $E_q[e^{Y_k}] = e^{m_k + V_{kk}/2}$ under $q(Y) = N(m, V)$, yields
$$E_q\Big[\log\Big(1 + \sum_{k=1}^{K} e^{Y_k}\Big)\Big] \le \xi^{-1}\Big(1 + \sum_{k=1}^{K} e^{m_k + V_{kk}/2}\Big) - 1 + \log \xi.$$
Since this term enters the objective with a negative sign, the third integral is lower bounded by the negative of the right-hand side. Combining all three integrals, we obtain a concave variational Gaussian lower bound $\tilde{F}$ on the model evidence. We maximize this lower bound with respect to the variational parameters $\xi$, $m$, and $V$.
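The log-sum-exponential bound used here (Blei and Lafferty, 2006), namely $\log x \le x/\xi - 1 + \log \xi$ combined with the lognormal moment $E_q[e^{Y_k}] = e^{m_k + V_{kk}/2}$, can be checked numerically by Monte Carlo; the specific values of m and v below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([0.3, -0.5, 1.0])   # variational means (arbitrary illustration)
v = np.array([0.2, 0.1, 0.4])    # diagonal of the variational covariance V

# Monte Carlo estimate of E_q[log(1 + sum_k exp(Y_k))] under q = N(m, diag(v))
Y = rng.normal(m, np.sqrt(v), size=(100_000, 3))
mc = np.log1p(np.exp(Y).sum(axis=1)).mean()

# Upper bound: log x <= x/xi - 1 + log xi, with E_q[exp(Y_k)] = exp(m_k + v_k/2)
c = 1.0 + np.exp(m + v / 2).sum()
xi = c                              # any xi > 0 is valid; xi = c tightens the bound
bound = c / xi - 1.0 + np.log(xi)   # equals log(c) at xi = c

assert mc <= bound                  # the expectation never exceeds the bound
```

The gap between `mc` and `bound` is the looseness of the variational approximation for this choice of m and V.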
First, we maximize the lower bound (5) with respect to ξ. Setting the derivative with respect to ξ equal to zero gives the closed-form update $\hat{\xi} = 1 + \sum_{k=1}^{K} e^{m_k + V_{kk}/2}$. For m and V, we take one step of the update based on the optimization discussed above.
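That the stationary point of the ξ-dependent part of the bound, $\xi = 1 + \sum_k e^{m_k + V_{kk}/2}$, is indeed its minimizer (a standard property of this bound) can be verified numerically; the values of m and v are illustrative:

```python
import numpy as np

m = np.array([0.3, -0.5, 1.0])   # variational means (arbitrary illustration)
v = np.array([0.2, 0.1, 0.4])    # diagonal of the variational covariance V
c = 1.0 + np.exp(m + v / 2).sum()

def bound_term(xi):
    # The xi-dependent part of the upper bound on the log-sum-exp term.
    return c / xi - 1.0 + np.log(xi)

# Stationary point: d/dxi (c/xi - 1 + log xi) = -c/xi^2 + 1/xi = 0  =>  xi = c
xi_star = c
grid = np.linspace(0.5 * xi_star, 2.0 * xi_star, 1001)
assert bound_term(xi_star) <= bound_term(grid).min() + 1e-12
```

Because this update is available in closed form, the ξ step adds negligible cost to each iteration, while m and V are updated by one step of the numerical optimization.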