Introduction

Finding subgroups in biomedical data is a key task in biomedical research and precision medicine. In complex multifactorial data, this is achieved by various methods such as cluster analysis, principal component analysis or several implementations of unsupervised machine learning. However, simpler one-dimensional data, such as many different readouts from cell experiments, preclinical or human laboratory experiments or clinical signs also often reveal a distribution that is more complex than a single mode. Indeed, various tests for multimodal distributions of one-dimensional data have been proposed (for a review, see1), emphasizing the relevance of the problem.

Given the frequency of normally distributed biomedical data, Gaussian mixtures play a particularly important role in the multimodal distribution of one-dimensional data that are composed of different subgroups generated by the influence of various biological factors on the research readout. A Gaussian mixture model (GMM), expressed as $$p(x|\Theta )$$, is defined as the weighted sum of M >1 components $$p(x|{\theta }_{m})$$ in $${\mathbb{R}}$$,

$$p(x|\Theta )={\sum }_{m=1}^{M}{\alpha }_{m}p(x|{\theta }_{m}),$$
(1)

where $$x\in X=\{{x}_{1},\ldots ,{x}_{N}\}$$ is a data sample and $$\Theta =({\alpha }_{1},\ldots ,{\alpha }_{M};{\theta }_{1},\ldots ,{\theta }_{M})$$ the parameterization of the GMM with am corresponding to the weight of each component $$m=1,\ldots ,M$$. Each component $$p(x|{\theta }_{m})$$ is represented as a normal distribution with parameters $${\theta }_{m}=({\mu }_{m},{\sigma }_{m})$$, i.e., mean value und the standard deviation.

However, although GMM fitting is often aimed at obtaining the separate modes composing the mixture, current technical implementations are not optimized for this task. The common approach to model GMM is maximizing the likelihood for a given dataset, using some variant of Expectation Maximization (EM)2,3 or Markov-chain Monte-Carlo algorithm4. However, one problem with maximizing only the model’s likelihood is that there may be multiple models with high likelihood that pass statistical tests but are structurally different and may be more suited for modelling classes. A possible solution has been proposed as restricted EM5, which can lead to different solutions, but tends to be complicated and restrains are limited to relations within the parameters of a GMM, instead of general properties of the GMM as whole. Thus, maximizing the likelihood, or an equivalent measure, of the GMM is a common approach to GMM fitting that has several limitations that have been incompletely addressed. Moreover, it does not directly address the separation of the modes of which the GMM is composed.

Therefore, we propose an evolutionary algorithm for GMM fitting that uses an adjustable error function that is based on χ2 statistics and the probability density function. The algorithm can be directly targeted at the separation of the modes of the mixture by employing additional criterion for the degree by which single modes overlap. Using Bayesian decision borders6, a simple and effective classifier can be created that allows a valid association of data set members to distinct groups defined by the mode of the Gaussian mixture.

Methods

Distribution optimization algorithm

“Distribution Optimization” minimizes a given distribution error by utilizing an evolutionary algorithm. Hence, a “population” of GMMs is processed through many iterations with multiple phases, which mutate, filter (select) and recombine the GMMs (Fig. 1). In general, after randomly initializing a first “generation” of GMM components, in iteration GMM parameters are randomly changed and recombined and individuals with better fitness are chosen, to finally form a new “generation” of GMMs.

The first phase, the initialization step, creates the population of GMMs and draws a value for every parameter of every GMM. It therefore depends on the minima and maxima of every structural model parameter. The individuals are drawn from a uniform distribution within those limits. The mean and standard deviations are drawn within the minimum and maximum of the original data. The weights of each component are randomly drawn from [0,1] and divided by their sum so that they add up to a value of 1.

The core of the “Distribution Optimization” algorithm is its error function, which in its negated form also serves as the fitness function. It is defined as the χ2 value, which can be used to compare a theoretical and an empiric distribution. The data space is split into B adjacent intervals (B being derived from standard deviation7,8 for our datasets) with $${k}_{j}$$ being the fraction of empirical data points falling into the jth interval and Pj being the density of the GMM on the same interval. The error is then defined as:

$${{\rm{\chi }}}^{2}={\sum }_{j=1}^{B}\frac{{k}_{j}-{p}_{j}N}{{p}_{j}N}$$
(2)

In addition to the χ2 value, a second error term is included as the so-called “overlap error”. It approximates the fraction of the area overlapped between modes, relative to the total area under the total density curve. The overall overlap error of a GMM is the highest overlap error among all components, given as

$$OverlapErro{r}_{m}={\sum }_{i=1}^{N}\min (\frac{\mathop{max}\limits_{j\in \{1..M\}\backslash m}({x}_{i},{\theta }_{j})\ast {\alpha }_{j}}{p({x}_{i}|{\theta }_{m})\ast {a}_{m}},1)$$
(3)
$$OverlapError=\mathop{{ma}{{x}}_{m=1...M}}\limits_{\,}(OverlapErro{r}_{m})$$
(4)

During the selection step, the individuals with high fitness values are chosen. The selection is not limited to just the best individuals but uses the tournament selection algorithm9 that allows a broader choice of individuals. A random number of individuals form a tournament to that one of them will win at random but weighted with the individual’s rank of fitness. Therefore, some less-fit individuals will be drawn into the next generation to enforce broader diversity of potential solutions. The selection will be repeated until the size of a generation is reached. The same individual can be selected repeatedly.

The mutation and recombination steps modify the individuals in each generation. During the mutation step, random individuals are selected and their parameters changed. Here, a random uniform mutation is employed. Therefore, a random number of parameters of an individual are chosen and then randomly drawn again within their respective minimum and maximum values. This process introduces the variety of solutions. The recombination step modifies individuals by choosing two random individuals and replacing them with their “children”. A pair of children $${c}_{1},{c}_{2}$$ is defined by the weighted average of their parents $${p}_{1},{p}_{2}$$ with a uniformly drawn weight $$\alpha$$ out of [0,1].

$${c}_{1}=\alpha {p}_{1}+(1-\alpha ){p}_{2}$$
(5)
$${c}_{2}=(1-\alpha ){p}_{1}+\alpha {p}_{2}$$
(6)

The recombination operation mainly fulfills the role of local optimization, which is why we have chosen the weighted average operation, while the mutation operation is used for introducing the variance.

Finally, the algorithm finishes after a fixed number of iterations by returning the GMM with the highest fitness.

Implementation

Implementation of the “Distribution Optimization” evolutionary algorithm into the similarly named R library (https://cran.r-project.org/package=DistributionOptimization) makes use of the genetic algorithms provided by the R package “GA” (https://cran.r-project.org/package=GA10). The functionality can be accessed by the function of the same name: “DistributionOptimization()”. As input a data vector and the desired number of Gaussians are expected. As a balance between”OverlapError” and χ2 value is desired, a ratio between both in favor of the χ2 value may be set through the hyperparameter “OverlapTolerance”, which when set at its maximum value of 1 results in pure optimization of fitness. The general parameters defining the evolution can be set as provided by the “GA” package. Parameters for population size and the number of iterations may be chosen, depending on the difficulty of the task, and increased manually when necessary. The output consists of all parameters necessary to reproduce the final GMM, and of the seed needed to reproduce the evolutionary algorithm. The user can monitor the process through the “Monitor” parameter, which either silences the call or outputs stepwise fitness improvements. Further utilities needed for clustering, such as likelihood ratio test or the Akaike information criterion (AIC11), require the R Package “AdaptGauss” (https://CRAN.R-project.org/package=AdaptGauss12).

Data sets and analyses

To assess the suitability of the “Distribution Optimization” for automated separation of the components of a GMM, the algorithm was applied on four different data sets available from published reports as specified below.

Data set 1 is used as the main example data set to demonstrate the function of the algorithm and the usage of the associated R library. It comprises cold pain thresholds acquired from in n = 148 healthy volunteers13. Noxious cold stimuli had been applied using a 3 × 3 cm2 thermode placed on the volar forearm. The thermode was cooled down by −8 °C/s, starting from 32 °C. Following establishment of the individual cold pain threshold to stimulus with slower cooling (1 °C/s), where the subject had to indicate when the sensation changes from cool to painful, 11 stimuli with fast decreasing temperature were applied, ranging from −5 to +5 °C, in steps of 1 °C, from that threshold. Stimuli were applied at randomized order and the subjects rated each stimulus with respect to its painfulness. The “yes”/”no” responses were submitted to binary logistic regression to obtain the phasic cold pain threshold. The obtained pain thresholds to fast-cooling (“phasic”) thermal stimuli had been analyzed with respect to modal distribution. Specifically, the parameters of the GMM were optimized using the EM algorithm as implemented in our interactive R package “AdaptGauss”. The analysis had resulted in a trimodal distribution of phasic cold pain thresholds, with Gaussian modes located at mean temperatures of 24.5, 18.1 and 7.5 °C in decreasing order of cold pain sensitivity.

To determine the optimum number of components, model optimization had been done for M = 2 to 4 components, i.e., one mode less or more than in the original analysis. The final model was reestablished based on the Akaike information criterion. To test the robustness of the results, the “Distribution Optimization” algorithm was run 100 times using different values of “seed”. The 95% confidence intervals of the GMM parameter estimates were obtained as the 2.5th and 97.5th percentiles of the estimates from the 100 runs. This was compared to GMM fits in which the EM algorithm was used for model adaptation.

Data set 2 consists of heat pain thresholds that had been used for the original publication of the interactive “AdaptGauss” R library12. The data had been obtained by locating a thermode at the skin of the forearm of n = 127 healthy volunteers and raising the temperature at 1 °C/s until the subject indicated a painful sensation. Interactive GMM analysis optimizing, based on the EM algorithm but with visually guided correction of the fit versus the observed PDE of the data, had identified a distribution pattern with M = 4 Gaussian modes located at temperatures of 32.3, 37.2, 41.4, and 45.4 °C.

Data set 3 has been acquired in a cohort of n = 31 healthy subjects in whom the cortical excitability had been modulated by applying transcranial magnetic stimulation14. Following inhibitory stimulation of the primary motor cortex, the changes in the amplitudes of motor evoked potentials displayed a trimodal distribution with modes located at 69.7, 115.1 and 158.4%.

Finally, data set 4 contains a sample of n = 10,000 microarrays derived gene expression values, observed in subjects with and without leukemia15. A total of 7,747 different Genes where assessed on 554 subjects, of whom 109 subjects where healthy while 445 patients were diagnosed with some kind of leukemia. Every sample represents the logarithmic intensity of a gene expression.

The automated “Distribution Optimization” GMM algorithm was run on these data sets and the obtained fits were compared with the original fits and with fits using the non-interactive standard EM algorithm implemented in the R library “mclust” (https://cran.r-project.org/package=mclust16). In addition, the quality of the fit was assessed by visual inspection of the fit and of a derived quantile-quantile (QQ) plot of the observed and predicted data distributions.

Results

Comparison of GMM fits using DistributionOptimization or the EM algorithm

The GMM fit of data set 1 using M = 3 modes as in the original publication13 resulted in mean values of m1,…,4 = 24.8, 14.7, and 8 °C. The “Distribution Optimization” algorithm well separated the three Gaussian modes, however, the Bayesian decision limits slightly differed from those obtained by either interactive or non-interactive GMM fit based on the EM algorithm (Fig. 2). In this data set, the interactive EM based fit had provided the best model solution according to an Akaike information criterion of AIC = 988.81. “Distribution Optimization” and non-interactive EM based fits were associated with values of the AIC of 994.96 and 992.36, respectively.

Fitting various GMMs to data set 1 showed that the “Distribution Optimization” algorithm always aims at least overlap among Gaussian modes (Fig. 3). The present experiment also indicted that reassessment of the originally trimodal distribution of cold pain thresholds verified the original M = 3 modes since this provided the lowest value of the AIC (Fig. 3 lower right panel). The focus of the “Distribution Optimization” algorithm on the separation of the Gaussian modes, disfavoring overlaps, became more evident in data set 2, where the non-interactive algorithm produced a solution with a counterintuitive separation of the subjects into clusters of the heat pain thresholds (Fig. 4). This is indicated by the location of the Bayesian decision limits that indicated a narrow second cluster not corresponding to the displayed data distribution. By contrast, the “Distribution Optimization” algorithm produced a GMM with means close to the original result obtained with the interactive EM based GMM fit, located at temperatures of 35, 37.2, 41 and 44.2 °C. Of note, while the solution provided by the non-interactive EM algorithm was unusable for topical interpretation, it provided nevertheless the lowest AIC criterion of AIC = 651.01, while the better grouping obtained with the interactive EM or “Distribution Optimization” based fits were associated with slightly higher values of AIC of 654.64 and 659.2, respectively.

Fits of data set 3 again showed that the “Distribution Optimization” algorithm produced results comparable to those obtained with the alternatives tested in this analysis (Fig. 5). For data set 4, using M = 3 modes and aiming at their separation, “Distribution Optimization” terminated with Gaussian modes located at −44.1, −3.2 and 29% relative gene expression, whereas the raw EM based fit found almost superimposed modes with means at −1.1, −0.2 and 3% relative expression, and non-meaningful Bayesian decision limits (Fig. 6).

Robustness of the distribution optimization results

When running 100 fits of data set 1 with M = 4 modes and setting the seed parameter at i in the ith run, the “Distribution Optimization” algorithm produced parameter values of the GMM that varied between 3.3 and 12.9% (Table 1). By contrast, the EM algorithm produced the same results in every run, i.e., the between runs variance of the GMM parameter values was 0%.

Discussion

The proposed method of fitting GMMs with a focus on maximum separation of the single mixes while allowing least overlap, was successful in providing satisfactory fits of the probability density distributions of the data. The results were comparable with those obtained with classical EM based fits, except for data sets where the EM algorithm produced unsatisfactory results. The observation of such results as shown with data set 2 where the GMM provided no meaningful basis for group separation in a biomedical context had been the motive of the development of the present alternative GMM analysis method. Hence, differences to the automated EM based fit are expected. Of note, when using visually guided EM based GMM fitting, results were similar to those obtained with “Distribution Optimization”. Main advantages of the latter algorithm, however, are that it excludes the subjective component of visually guided fitting, is better suited for automated GMM fitting of many data sets, and makes the results better reproducible.

The present algorithm has been developed to replace algorithms that aim at maximizing the GMM’s likelihood, such as EM2,3, or an equivalent parameter of goodness-of-fit, such as used in Markov-chain Monte-Carlo algorithms4. EM is divided into an expectation and a maximization step, which are repeated to incrementally improve the likelihood of the model. The algorithm stops when the improvement falls under a certain threshold. During the expectation step, the priori likelihood is calculated based on the current model parameters. A likelihood function can therefore be constructed, based on a classification of every data point to one of the GMM components with the highest Bayesian probability. The maximization step then maximizes the parameters, given this information. For the general problem of local maxima, many different modifications of the algorithm have been proposed. For example in CEM2,17 and ECM18, only a single component is maximized in each iteration. SMEM19, SSMEM19 and Competitive EM20 are variants of EM that include split and merge operations on GMM components. Randomly mutating EM algorithms are the Random Swap EM21 or the genetic based EM. However, these algorithms do not aim at replacing the EM algorithm but rather at reducing one of its weaknesses, the risk of ending in local and not global optima. An alternative to EM are the Markov-chain Monte-Carlo algorithms4. They nevertheless follow the same idea as EM by alternately determining weightings, using Bayesian classification, and deriving expectation values and standard deviations from those weightings. Instead of maximizing those values in each iteration, Monte Carlo sampling is used, and the final GMM is given by the stationary distribution of the Markov chain. This, however, equals in a maximization of likelihood as in the EM algorithm. All covered methods ultimately only differ in in their way of searching the maximal likelihood, which we argue is not necessarily the best or most appropriate measure.

The reproducibility of the results, however, was lower than that obtained when using classical EM based fitting, as shown in the robustness experiment. This owes to the genetic algorithm being naturally dependent on a degree of randomness. This however allows an overcoming of local maxima and has the advantage of approximating different solutions. This allows for an automated generation of multiple significant models, which can offer different interpretations and has the effect that the fits differ from those obtained with EM. Specifically, the effect demonstrated in the sample data sets is mainly caused by the choice of quality measures during the fitting process. We have shown that statistically sound GMM can be reached by optimization of a different quality measure than likelihood and that an additional quality can be introduced, producing a model that is significantly different for classification purposes than what is reached by EM. The problem search space for complex problems as GMM is not transparent and there is always explicit or implicit bias introduced through such optimizations, one can never ultimately conclude that there are no models that reach higher performance under the quality measures with a less desirable structure.

The preference of “Distribution Optimization” to EM based alternative methods for GMM fitting not only depends on the reproducibility of the results, the goodness of fit, and the robustness of the results, but has a clear contextual component. This is shown with data set 4. The EM fit of a GMM consisted in a mixture of almost superimposed Gaussian modes with nearly identical means but different standard deviations (Fig. 6). By contrast, “Distribution Optimization” produced three distinct modes with a main mode in the center and two modes of low weights at its margins. In biomedical research such represented in the leukemia derived data set 4, the usual criterion to define an effect is a difference in mean, for example as defined in the effect size measure of Cohen’s d22. Research aims often at the identification of subgroups in the data, and these subgroups are characterized by different central values of the selection parameter, such as different means. With the EM based fit, a useful group separation was impossible whereas the “Distribution Optimization” provided this readily. However, when group separation is not a topical focus and by contrast, the data are considered to represent groups with similar means but different variances, “Distribution Optimization” would be unsuitable while EM based GMM fitting may provide the desired result. Thus, the choice of the method should consider both, the statistical soundness and the topical context.

By its optimization toward mode separation, the proposed “Distribution Optimization” evolutionary algorithm for GMM fitting provides a suitable basis for group separation in multimodally distributed data. It introduces choice between multiple significant models, which would not have occurred by pure means of likelihood optimization such as in alternative approaches to GMM fitting. When group separation is intended, the “Distribution Optimization” algorithm may outperform alternative EM based methods in some data sets.