Consensus Clustering of temporal profiles for the identification of metabolic markers of pre-diabetes in childhood (EarlyBird 73)

In longitudinal clinical studies, methodologies available for the analysis of multivariate data with multivariate methods are relatively limited. Here, we present Consensus Clustering (CClust) a new computational method based on clustering of time profiles and posterior identification of correlation between clusters and predictors. Subjects are first clustered in groups according to a response variable temporal profile, using a robust consensus-based strategy. To discover which of the remaining variables are associated with the resulting groups, a non-parametric hypothesis test is performed between groups at every time point, and then the results are aggregated according to the Fisher method. Our approach is tested through its application to the EarlyBird cohort database, which contains temporal variations of clinical, metabolic, and anthropometric profiles in a population of 150 children followed-up annually from age 5 to age 16. Our results show that our consensus-based method is able to overcome the problem of the approach-dependent results produced by current clustering algorithms, producing groups defined according to Insulin Resistance (IR) and biological age (Tanner Score). Moreover, it provides meaningful biological results confirmed by hypothesis testing with most of the main clinical variables. These results position CClust as a valid alternative for the analysis of multivariate longitudinal data.


Results
Overview and description of the datasets. The EarlyBird study involved annual measurement of a range of clinical, anthropometric and metabolic variables in a cohort of children from the age of 5 to 16 years.
The Metabolic dataset includes repeated measurements of a panel of serum metabolites for 129 subjects. Of the original 82 species, for this work we used a subset of 46 for which an unambiguous annotation was available.
The Anthropometric and clinical dataset includes repeated measurements of a panel of anthropometric and clinical variables for 149 subjects, namely body weight, body mass index, body composition data generated by dual-energy x-ray absorptiometry (DEXA), skinfold thickness, actigraphy, resting energy expenditure, and pubertal Tanner scores, fasting glucose and insulin. To deal with the gaps related to missing data when overlapping Metabolic and Anthropometric measurements, we decided to study them separately, giving us the opportunity to test our clustering method on different datasets. Figure 2 illustrates the degree of overlap between datasets: the x-axis represents measured variables, and the y-axis represents subjects; the third dimension of time is not shown for simplicity. Each dataset has a number of missing measurements: rather than imputing the missing values, we decided to restrict our analysis to subjects with complete time series. In the case of the metabolic dataset, we carried out the analysis separately for early time points and late time points, with age 11 as the dividing point. This choice was motivated by the discovery of a convergence of the metabolic parameters around age 11 in all subjects, but it also enabled the inclusion of a greater number of subjects in each analysis (since it was easier to identify subjects having complete time series over a shorter time span).
Definition and assessment of the clustering method. We break down the complexity of the multidimensional analysis by first clustering subjects in groups based on their temporal profiles of a single clinical feature (phenotype), and then contrasting these groups with respect to each of the remaining clinical/anthropometric/ metabolic profiles. Any statistically significant difference in this latter step points to clinical/anthropometric/ metabolic patterns linked to the clinically determined subgroups identified in the first step.
The clustering of temporal profiles represented a challenge in itself. Several conceptual approaches have been proposed to partition a collection of temporal profiles in groups based on reciprocal similarity, which in general produce different outcomes. The approaches differ in several crucial respects: the way similarity between profiles is measured, the algorithm that performs the partition based on the computed similarities, and the criterion used to decide what should be the number of groups into which to partition the profiles. The strong dependence of the outcome on the method employed creates the problem of what clustering method and resulting partition of subjects to adopt as the starting point of the analysis. We decided to follow a strategy in which we compared the output of different methods and selected the ones that produced the most robust agreement across datasets in terms of group composition. The rationale for it is that if different conceptual approaches produce comparable outputs, then the resulting grouping is more likely to be reflecting the intrinsic properties of the profiles than those of the algorithms used. We call this strategy Consensus Clustering (CClust); we note that occasionally the same term has been used to indicate some technique to reconcile partitions resulting from different runs of the same algorithm, which is a problem unrelated to the one we are trying to solve in this work. We adopted the availability of an implementation in the R language as a criterion for the selection of algorithms to be included in our study. This choice simplified the comparative analysis of different approaches, and enhanced the reproducibility of our workflow. More importantly, this type of selection restricts the choice of algorithms to those that have been judged worth the effort of re-implementation in a widely used language by members of the scientific community. Together, the selected methods are likely to provide a good coverage of the current state of the art in the field. The algorithms and their respective implementations we used for the comparison are fuzzy classification (package Mfuzz), K-means (R package Mfuzz), K-means (R package NbClust), Hierarchical clustering (R package NbClust), Smoothing Spline Clustering (SSC) (R package SSCLUST), Time Series Clustering Utilities (R package TSclust) and Model-Based Clustering and Classification (R package longclust).
In our consensus-based method a crucial role is played by the measure of overlap between groups produced by different algorithms. The measure we used for quantifying the degree of overlap (consistency) between each pair of algorithms was the Adjusted Rand Index, the corrected-for-chance version of the Rand index (RI). Given a set of n elements -S, and two partitions -X and Y, RI is a fraction in which the numerator is the number of agreements between partitions X and Y and the denominator is the sum of the number of agreements between partitions X and Y and the number of disagreements between partitions X and Y. While RI may only assume a value between 0 and 1, the Adjusted Rand index can assume negative values if the index is less than the expected index. Additionally, unlike the less sensitive RI, the Adjusted Rand index for two random partitionings has as an expected value, which is the constant value zero.

Assessment of Consensus Clustering performance.
Given the focus of the EarlyBird study on processes associated with IR during childhood and adolescence, HOAM IR and Tanner score were identified as variables of high interest. Moreover complete temporal series of HOMA IR and pubertal Tanner score were available for all subjects in both datasets, and so the temporal profiles of these variables were selected as primary variables.. We defined a clustering as a partitioning of subjects in n groups based on the similarity of temporal profiles; we investigated values of n equal to 2 and 3, because larger values did not provide satisfactory results in preliminary tests (not shown). We performed an assessment of the quality of the CClust by partitioning the HOMA IR and Tanner score temporal profiles and studying the separation of the average curves of the resulting groups. Since the unassisted visual inspection of the average curves was inconclusive, we devised a quantitative method in which the obtained clustering was compared with a large sample of random clustering of subjects.
Clustering of HOMA IR profiles: preliminary analysis. A preliminary analysis of the data revealed that the normalized time profiles appear to converge toward a restricted set of values around time point six (age 11), as seen in Fig. 3. While the graphs are relative to HOMA IR, the value convergence seems to represent a separation point between earlier and later metabolic phases. We took advantage of this observation, and decided to split the temporal profiles at age 11 into an early and a late segment, and to cluster the two subsets separately. The benefits of the separate clustering are a subdivision of subjects into more homogeneous groups, and the availability of larger numbers of subjects with complete time profiles over the reduced time spans.
Clustering of HOMA IR profiles: comparison between methods. The marked difference among the results of the different clustering algorithms can be appreciated in Fig. 4. The different approaches produced inconsistent grouping of temporal profiles (i.e. having very little overlap with each other), which was not surprising given the challenging nature of the dataset (eg biological variability, age of the participants, etc…).
For a systematic assessment of the level of agreement between different clustering methods we computed the Adjusted Rand Index between every pair of methods averaged across 100 runs. Figure 5 illustrates the results for the HOMA IR profiles; in this figure each panel represents a matrix in which the index is shown as a circle of size proportional to its value for every possible pair of methods (one method per row and per column). Determining an average solution over several runs was necessary because the k-means algorithm is not deterministic, and thus the partitioning of the subject can change from run to run depending on the random initial assignment of the cluster centroids. The results of the comparison suggest that Mfuzz and SSC provide a high degree of consistency across all the experiments; this result was valid for the other variables as well and not just for HOMA IR (not shown). The fact that these two methods produce highly overlapping groupings of the profiles in the first place, despite being based on very different conceptual approaches lends additional confidence in the result. Among the other methods, the pair Mfuzz/K-means showed an agreement comparable with the outcome generated by Mfuzz/SSClust, although the latter performed better on the other tested clinical variables (not shown). We therefore selected Mfuzz and SSC as the building blocks of our CClust method. As result of CClust we take the set of subjects on whose group membership the two methods agree, using a criterium of maximum overlap between the two sets of groups; the subjects on which they disagree are declared unclassifiable and discarded from the remainder of the analysis. The size of the clusters produced by CClust and used for this case study are reported in Table 1 (the groups are called A and B, where the naming is arbitrary; the size of the groups produced by the each of the two methods separately are reported in Supplementary Table 1). While we expect that the selection of the pair of methods is data-set dependent, the procedure described here is quite general and can be applied to any chosen pair.
Clustering of HOMA IR profiles: permutation test. A first assessment of the quality of the clustering obtained as the consensus between Mfuzz and SSClust can be performed by plotting the average curves of the resulting groups. These curves, shown in Fig. 6, are the average curves of four clinical variables; group A and B averages are shown in red and blue respectively, where the two groups of subjects are those resulting from the clustering of HOMA IR curves. While the plots generally show a clear separation, the distance between curves is generally smaller than the standard deviations at each time point; therefore the result of the visual comparison is inconclusive. In order to rigorously quantify the quality of the clustering, we estimated the likelihood of observing such comparison between all possible pairs of clustering schemes. The following abbreviations have been used: Mfuzz and kmInMfuzz respectively for the fuzzy logic and the k-means implementations in the R package Mfuzz; nbclust1 and nbclust3 respectively for the clustering schemes described in the text and implemented in the R package NbClust; SSClust for the Smoothing Spline Clustering method (SSC) implemented in the R package SSCLUST; tscluCORThc and tscluCORTpam respectively for the clustering schemes described in the text and implemented in the R package TSclust. levels of separation by chance. We first summarized the distance between curves by adding the squares of the distance at each time point (equivalent to the square of the Euclidean distance), and we then derived an empirical distribution of values of this aggregated sum by repeatedly partitioning the subjects at random and computing the resulting sum of squares (n = 10000 permutations). The results are reported in Table 2, showing a separation significantly different from a random effect in all cases except one.

Discovery of associations between primary and secondary variables: HOMA IR-based clustering.
The ultimate purpose of the clustering was to identify significant correlations between risk groups and secondary variables -metabolites, clinical and anthropometric variables. Specifically, a first question we sought to answer was whether any of the secondary variables were significantly different between groups A and B at any age. A second question of interest was whether the whole time profile of any secondary variable is significantly different between groups A and B.   Table 2. Likelihood of observing a distance between average curves as large (or larger) as the one obtained by CClust over n = 10000 random partitions of the subjects in the 2-way clustering of HOMA IR temporal profiles (late time points). An average curve for each of the n = 2 clusters was obtained for each one of the random partitions, and for each one of the four clinical variables shown, and the Euclidean distance between curves was computed. The values in the table indicate the percentage of random partitions producing a distance at least as large as the one measured for the CClust clusters. in the last column of the heatmap was computed using Fisher's method. For example, the left heatmap of Fig. 7 shows that two groups resulting from the clustering of HOMA IR temporal profiles of male subjects (late time points) had average Creatine values that were significantly different only at age 16 (p < 0.05), but the difference remained close to significant at other ages, with p < 0.1 for each of the two-group comparisons at ages 7, 8, 9, 12, 14, 15 and 16. Overall, the cumulative evidence across the whole age range favoring a difference between the two groups reached statistical significance as shown by the combined Fisher p-value. The values in the rightmost heatmap were computed as follows: values for a certain variable were transformed into z-scores separately for each year (i.e. standardized to 0 mean and standard deviation of 1 considering all subjects per year), then the difference between the average z-score of group A minus the average of group B for each year was computed and represented as a color shade. A difference of 1 or −1 indicates that there was a difference of one standard deviation between the two clusters means. Thus for example the right side of Fig. 7 shows that average Creatine was higher in group A, and that the difference reached one standard deviation at ages 9 and 14.
We note that the p-values reported for this comparison were not corrected for multiple testing. None of the variables reached statistical significance when FDR correction was applied ( Figure S7); this is possibly due to the limited number of subjects in each group. This, together with the exploratory nature of this work, makes us consider these p-values only as indications of significance.

Time profile comparison.
In order to answer the second question, for each secondary variable we first summarized the time profiles of the variable separately for each of groups A and B, and then performed a A-vs-B comparison of the resulting summary profiles. We opted for a simple summarization based on the average profile, defined as the sequence of group averages at each time point. The main advantage of summarization by simple average over more sophisticated modelling approaches, such as mixed effect models, is that this makes no difficult to verify assumptions on the data (for example, in terms of presumed linearity of the time evolution), thus resulting in a broader applicability of this newly proposed method. For the subsequent comparison step, after computing an average profile of each group for each secondary variable we then measured the Euclidean distance between the average curves. To obtain a statistical significance assessment of the measured distances, for each variable we estimated how likely it was to obtain the observed distance value purely by chance, by performing a random permutation based analysis. The resulting p-values are reported in Table 3; in the table we also report the FDR-corrected p-values.

Discovery of associations between primary and secondary variables:
Tanner score-based clustering. We repeated the above procedure on the dataset of 129 subjects for which we had metabolite measurements. This time we used the Tanner score temporal profiles as primary variables, and we also compared the results of the 2-way and 3-way clusterings, obtained forcing the number of desired groups to 2 and 3 respectively. Both 2-way and 3-way clustering produced well defined subject groups (Fig. 9), that could be interpreted as "early" and "late" stages in the former, and "early", "intermediate" and "late" stage in the latter.
To select the best solutions generated by 2-way and 3-way CClust clusterings we compared the significance of the separation of average clinical variables values between the two cases. Using random permutation-based analysis, we obtained the empirical distribution of the Euclidean distance between the average curves, with one average curve computed for each cluster; we repeated the analysis for the 2-way and the 3-way clusterings, and for each one of the seven representative clinical variables. The resulting p-values unambiguously pointed to the 2-way solution as the one most distant from the random distribution of average curve separation values (Tables 4 and  5). We therefore performed the remainder of the analysis on the 2-way clustering, producing the table of random permutation p-values (Table 6), and the heatmaps of Mann-Whitney p-values (not corrected for multiple testing) and average z-score differences (Figs 10 and 11). None of the variables reached statistical significance when FDR correction was applied ( Figure S7).

Figure 8.
Left: heatmap of significance of difference between metabolic/clinic variable means between the two groups of samples (clustering according to HOMA2.IR, late time points, females only), Right: heatmap of difference between the normalized averages (z-scores) of the two clusters. Legend: TennSc: Tanner score, RQ; respiratory quotient; ch_wtsds, child body weight z score; ch_wcsd, child waist circumference z score; ch_glucose, child glucose; ch_gest: child gestational age; ch_bwt_sds, child birth weight z score, TBLH, total body less head.

Discussion
In this work we tackle the problem of finding relevant associations between the temporal profiles of a set of quantitative variables (clinical, anthropometric, metabolic) measured at yearly intervals between the ages of 5 and 16 for a cohort of subjects. We introduce a notion of variable prioritization by requiring the user to select a variable of particular interest (called primary variables), typically one directly associated to a phenotype under study. If  the user is interested in more than on phenotype, the workflow can be repeated for each one of the associated variables. As an example, in the EarlyBird study we selected HOMA IR and Tanner score as proxies, respectively, for insulin resistance and developmental phases in childhood and puberty. The prioritization of a variable enables the partition of the subjects into groups homogeneous with respect to the temporal evolution of the related  Table 4. Likelihood of observing a distance between average curves at least as large as the one obtained by CClust over n = 10000 random partitions of the subjects in the 2-way clustering of Tanner score temporal profiles. An average curve for each of the n = 2 clusters was obtained for each one of the random partitions, and for each one of the seven clinical variables shown, and the Euclidean distance between curves was computed. The values in the table indicate the percentage of random partitions producing a distance at least as large as the one measured for the CClust clusters. Legend: wcsd: waist circumference z score. phenotype, and focuses the subsequent analysis on the discovery of associations between such phenotype and the remaining variables.
Our proposed approach solves a number of challenges that a user can expect to encounter in the implementation of this workflow. We describe a method to remedy the lack of agreement between existing clustering strategies while making use of existing results in the field. We term this method "consensus clustering", because it is based on the identification of the two algorithms providing the largest overlap between partitions; the initial pool of algorithms comprises conceptually different approaches, in order to make the consensus non-trivial and to ensure it reflects the intrinsic structure of data.
A challenge that is relatively common in analyses encompassing a large numbers of subjects and measurements carried over several years was the non-uniform data coverage of subjects between different types of data -in other words, not all clinical, metabolic and anthropometric variables were available for every subject. The solution we used was to personalize the analysis to the range of ages and individuals available for each data type. Similarly, we decided to restrict our analysis to subjects with complete time series for the primary variables, which sacrificed some of the available data values but removed any concern about possible artifacts introduced by imputation techniques.
Another challenge was that none of the metabolic or clinical variables reached statistical significance when the subgroups were contrasted in a timepoint-wise manner, possibly due to the limited number of subjects. Interestingly, however, when we contrasted the whole time profiles of the different subgroups we obtained significant differences for several metabolic variables between HOMA IR subgroups (male subjects only), and for anthropometric variables between Tanner score subgroups (female subjects only). The uncorrected point-wise p-values reported in the heatmaps are still a valuable result, useful for prioritizing clinical variables and metabolic species in view of a possible follow-up study.
For instance, the application of our methodology to study HOMA IR trajectories in boys has revealed a limited contribution of anthropometric parameters to IR clinical behaviour. The approach also highlighted some serum metabolic patterns related to amino acid metabolism (histidine, glutamine, lysine, valine), central energy metabolism (creatine) and ketogenesis (acetate, 3D-hydroxybutyrate). Some of the metabolite patterns are consistent with previous findings, including the positive association of branch chained amino acids with IR 10 and decreased ketogenesis in obese prepubertal children 11 . Since susceptibility to pre-diabetes and obesity later in life is influenced by various factors during childhood growth and puberty, our approach provides us with tools to explore the interactions between pubertal staging, metabolic functions and IR. One key factor currently being studied is excess of body weight during childhood which can also influence pubertal development and IR, through influences on timing of pubertal onset and pubertal hormonal levels 12 . This is exemplified here as well, through the very strong and gender-specific patterns of anthropometric and metabolites associated with pubertal staging. Such data will provide important opportunities to examine the molecular processes associated with adiposity-IR interactions during the complex period of puberty and adolescence.
In conclusion, we show that our consensus-based method is able to cluster the study subjects into groups possessing desirable properties. First, the groups are robust with respect to the method, in other words their grouping reflects the consensus among different conceptual approaches to clustering. Second, we show that the groups we obtained produced statistically significant separation between most of the main clinical variables, giving confidence that the risk groups we identified may have real biological correlates. We then proceeded to identify the metabolic, anthropometric and remaining clinical variables that correlated with the risk groups, and were able to discuss their biological relevance. Statistics. We performed all statistical calculations in the R language ver. 3.3.2 14 . Heatmaps were created in R using the RColorBrewer ver 1.1-2 15 , Lattice ver. 0.20-34 16 , GridExtra ver. 2.2.1 17 , and Grid ver. 0.7-4 14 packages. The values in the leftmost heatmap were computed using a Matt-Whitney U test for each of the secondary variables (one per row); the test was performed between the values for subjects in the A and B groups, where A and B were the group of subjects resulting from the clustering (performed using CClust) of temporal profiles of the primary variable to which the figure refers to. The summary p-value in the last column of the heatmap was computed using the CombinePValue package 18 . The values in the rightmost heatmap were computed as follows:  Table 5. Likelihood of observing a distance between average curves at least as large as the one obtained by CClust over n = 10000 random partitions of the subjects in the 3-way clustering of Tanner score temporal profiles. An average curve for each of the n = 3 clusters was obtained for each one of the random partitions, and for each one of the seven clinical variables shown, and the Euclidean distance between curves was computed. The values in the table indicate the percentage of random partitions producing a distance at least as large as the one measured for the CClust clusters.  Table 6. Empirical p-values obtained by random permutations using Tanner score as primary variable and metabolites as secondary ones. Numbers shown indicate the likelihood of observing a distance between the reported metabolite average curves over n = 10000 random partitions which is as large (or larger) as the one obtained by 2-way clustering of Tanner score profiles using our algorithm. Asterisks mark the statistically significant values (p < 0.05). (a) Metabolites. (b) Anthropometric variables. Anthropometric variables were analyzed separately because they were available for a range of time points different than the other variables.

Materials and Methods
values for a certain variable were transformed into z-scores separately for each year (i.e. standardized to 0 mean and standard deviation of 1 considering all subjects per year), then the difference between the average z-score of group A minus the average of group B for each year was computed and represented as a color shade.
Clustering. We used the following R packages of clustering algorithms for our analysis: MFuzz ver. 2.34.00, NBClust ver. 3.0, ssClust ver. 3.0, TSclust ver 1.2.3, Longclust ver 1.2. A comparison of the results of the application of these algorithms to the HOMA IR time profiles is reported in Fig. 4 (graphs of clustered time profiles) and in Supplementary Table ST1 (size of clusters). Additional details on the clustering methods employed in this study are reported in Supplementary Materials and Methods.
Measure of overlap. The measure we used for quantify the degree of overlap (consistency) between every pair of methods was the Adjusted Rand Index, the corrected for chance version of the Rand index. Though the Rand index may only assume a value between 0 and 1, the Adjusted Rand index can assume negative values if the index is less than the expected index. Given a set of n elements S and two partitions X and Y the Rand index is a fraction in which at the numerator there is the number of agreements between partitions X and Y and at the denominator the sum of number of agreements between partitions X and Y and the number of disagreements between partitions X and Y.
Accordance. We conducted the study in accordance with the ethics guidelines of the Declaration of Helsinki II.
Approval. Ethics approval was granted by the Plymouth Local Research Ethics Committee (1999). heatmap of difference between the normalized averages (z-scores) of the two clusters. White boxes are fill-ins for the anthropometric variables missing time points. Legend: TennSc: Tanner score, RQ; respiratory quotient; ch_wtsds, child body weight z score; ch_wcsd, child waist circumference z score; ch_glucose, child glucose; ch_gest: child gestational age; ch_bwt_sds, child birth weight z score, TBLH, total body less head. Figure 11. Left: heatmap of significance of difference between metabolic variable means between the two groups of samples (clustering by CClust according to Tanner score, all time points, males only), Right: heatmap of difference between the normalized averages (z-scores) of the two clusters. White boxes are fill-ins for the anthropometric variables missing time points. Legend: TennSc: Tanner score, RQ; respiratory quotient; ch_ wtsds, child body weight z score; ch_wcsd, child waist circumference z score; ch_glucose, child glucose; ch_gest: child gestational age; ch_bwt_sds, child birth weight z score, TBLH, total body less head.